Document Extraction is the process by which Catalio reads an uploaded file or meeting transcript and automatically identifies structured requirements, capabilities, processes, and integrations — producing Change Proposals that your team reviews before anything enters the catalog.
Instead of manually reading a forty-page spec and transcribing requirements by hand, you upload the document and let the AI do the first pass. You then spend your time reviewing, refining, and accepting the results — rather than creating from scratch.
The Pipeline
When you upload an Artifact with supported content, the following steps happen automatically:
Upload → DocumentExtractionWorker → GapAnalysisWorker → Change Proposals ready
Step 1: Upload and Trigger
Uploading an artifact of kind :attachment triggers the extraction pipeline immediately after the database transaction commits. This prevents the background job from starting before the file is fully saved.
The artifact’s status moves to :processing while extraction is running. You can watch this status in the UI — a spinner indicates the job is in progress.
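Under the hood this is the standard Rails after-commit pattern. A minimal sketch, assuming a Rails `Artifact` model with the `kind` attribute described above; the callback wiring itself is an assumption, while the worker name comes from the pipeline diagram:

```ruby
# Illustrative sketch only; Catalio's actual model code may differ.
class Artifact < ApplicationRecord
  # after_commit fires only once the transaction is durable, so the
  # background job can never observe a half-saved file.
  after_commit :enqueue_extraction, on: :create, if: :attachment?

  def attachment?
    kind.to_sym == :attachment
  end

  private

  def enqueue_extraction
    DocumentExtractionWorker.perform_async(id)
  end
end
```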
Step 2: Entity Extraction (LLM)
Catalio reads the file content and uses the AI to extract structured information. The AI returns:
- A summary — a one-sentence description of the document (stored in the artifact’s `file_metadata` as `ai_summary`)
- An entities array — each entity has a name, description, entity type, confidence score, and source reference (the section or page where it was found)
Valid entity types extracted from documents: requirement, capability, process, integration, application, technology.
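As a concrete illustration, a single extraction result might look like the following. The field names mirror the list above; the payload shape and values are invented:

```ruby
# Hypothetical extraction result. Field names follow the list above;
# the values are invented for illustration.
extraction = {
  summary: "Spec for the payment gateway integration, covering settlement and refunds.",
  entities: [
    {
      name:             "Nightly settlement batch",
      description:      "Captured payments must be settled in a nightly batch job.",
      entity_type:      "requirement",   # one of the six valid types above
      confidence_score: 0.85,            # explicit "must" language scores high
      source_reference: "Section 4.2"    # where in the document it was found
    }
  ]
}
```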
For each extracted entity, a Change Proposal is created with:
- `action_type: :create`
- `source: :artifact_extraction`
- `confidence_score` from the LLM
- `evidence` linking back to the artifact by ID
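In code, creating one proposal from one extracted entity might look roughly like this. It assumes a `ChangeProposal` ActiveRecord model whose attribute names mirror the list above; the `evidence` payload shape and the helper itself are assumptions:

```ruby
# Illustrative only; the evidence shape is a guess.
def propose_from_entity(artifact, entity)
  ChangeProposal.create!(
    action_type:      :create,
    source:           :artifact_extraction,
    name:             entity[:name],
    description:      entity[:description],
    confidence_score: entity[:confidence_score],
    evidence:         { artifact_id: artifact.id,
                        source_reference: entity[:source_reference] }
  )
end
```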
Step 3: Gap Analysis (LLM)
After proposals are created, the GapAnalysisWorker runs a two-part enrichment:
Quality Assessment — scores each proposal for clarity, testability, and completeness.
Gap Classification — compares each proposal against existing requirements in your catalog using cosine similarity. A gap_classification is assigned:
- New requirement (no match found)
- Duplicate of an existing requirement
- Enhancement to an existing requirement
The similarity_score and matched_requirement_id fields on the proposal show which existing requirement the AI compared against.
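The comparison itself is ordinary cosine similarity over embedding vectors. A minimal sketch in plain Ruby; the classification thresholds here are illustrative, not Catalio's actual cutoffs:

```ruby
# Cosine similarity between two embedding vectors (plain Ruby arrays).
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Hypothetical thresholds; the real cutoffs are not documented here.
def gap_classification(similarity_score)
  if similarity_score >= 0.90
    :duplicate        # near-identical to an existing requirement
  elsif similarity_score >= 0.70
    :enhancement      # overlaps an existing requirement but adds detail
  else
    :new_requirement  # no meaningful match in the catalog
  end
end
```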
After the gap analysis completes, the artifact returns to Active status and the Change Proposals view updates to show the new proposals ready for review.
Supported Content Types
| Content Type | Extension | Notes |
|---|---|---|
| `text/markdown` | `.md` | Full extraction |
| `text/plain` | `.txt` | Full extraction |
| Other (PDF, DOCX) | — | Skipped — specialized parsing not yet available |
Documents are truncated at 50,000 characters before being sent to the LLM. Content beyond this limit is not analyzed.
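A sketch of the gating and truncation just described; the constant and method names are assumptions, while the supported types and the 50,000-character limit come from this page:

```ruby
EXTRACTABLE_TYPES = %w[text/markdown text/plain].freeze
MAX_CHARS = 50_000

# Returns the text to send to the LLM, or nil for unsupported types.
def extractable_content(artifact)
  return nil unless EXTRACTABLE_TYPES.include?(artifact.content_type) # PDF/DOCX skipped
  artifact.file_content[0, MAX_CHARS] # anything past 50,000 chars is never analyzed
end
```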
Confidence Scoring
Document-sourced proposals use a 0.0–0.90 confidence scale. The maximum is capped at 0.90 to leave room for human validation — the AI never claims certainty.
| Confidence Range | Signal |
|---|---|
| 0.70–0.90 | Explicit requirement language: “must”, “shall”, numbered specs, compliance mandates |
| 0.50–0.70 | Clear needs: “should”, described behaviors, capability descriptions |
| 0.30–0.50 | Implied: background context, aspirational goals, general descriptions |
Meeting Transcript Extraction
Catalio also extracts requirements from meeting transcripts using the same pipeline but a different prompt tuned for conversational language.
Meeting transcripts are inherently more ambiguous than formal documents. The extraction prompt distinguishes:
- Firm requirements — modal verbs (“must”, “will”, “shall”), explicit acceptance criteria, stakeholder agreements with action items
- Stakeholder agreements — decisions with clear action implications
- Aspirational discussion — “it would be nice if…”, “what if we could…” — these are skipped
Meeting-sourced proposals use a 0.0–0.55 confidence scale. The lower cap reflects conversational ambiguity — a conversation can express a need but rarely with the precision of a formal specification.
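The two scales differ only in their ceiling. A sketch of how a source-dependent cap might be applied; the cap values come from this page, while the constant and the `:meeting_extraction` source name are assumptions:

```ruby
# Per-source confidence ceilings. Values are from this page; the hash
# and the :meeting_extraction key are assumptions.
CONFIDENCE_CAPS = {
  artifact_extraction: 0.90, # formal documents
  meeting_extraction:  0.55  # conversational ambiguity
}.freeze

def capped_confidence(source, raw_llm_score)
  raw_llm_score.clamp(0.0, CONFIDENCE_CAPS.fetch(source))
end
```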
Each meeting proposal includes:
- `requirement_text` — a clean, standalone requirement statement (not a raw quote)
- `excerpt` — the relevant passage from the transcript (max 120 characters)
- `speaker` — the speaker’s name if identifiable
- `category` — functional, non_functional, constraint, or integration
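For illustration, one meeting-sourced proposal might carry fields like these. The field names are from the list above; the values are invented:

```ruby
# Hypothetical meeting proposal; values invented for illustration.
meeting_proposal = {
  requirement_text: "Exports must complete within five minutes for files up to 1 GB.",
  excerpt:          "…we all agreed exports have to finish inside five minutes…",
  speaker:          "Priya",
  category:         "non_functional",
  confidence_score: 0.48 # meeting-sourced scores never exceed 0.55
}
```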
How to Review Results
After extraction completes, navigate to the artifact’s detail page to see the generated proposals. From there:
- Accept a proposal to create the entity in your catalog
- Modify a proposal before accepting to correct names or descriptions
- Dismiss proposals that are duplicates, irrelevant, or incorrect
See Change Proposals for a full walkthrough of the review workflow.
Differences from Manual Authoring
| Aspect | Document Extraction | Manual Authoring |
|---|---|---|
| Speed | Seconds per document | Minutes per requirement |
| Coverage | Catches items easily missed | Author-dependent |
| Accuracy | Requires human review | Direct ownership |
| Confidence | LLM-assigned, 0.0–0.90 | Always high by default |
| Source traceability | `source_reference` links to document section | Manually set |
Best Practices
Use well-structured source documents. Documents with numbered lists, headings, and explicit “must”/“shall” language produce higher-confidence proposals. Narrative prose generates more low-confidence proposals that require more review effort.
Upload one document at a time. Extraction is sequential and runs in the background. Large batches create a queue — monitor the artifact status to know when each document is ready.
Review gap analysis classifications. The “duplicate” and “enhancement” classifications are the most actionable. Duplicates reveal redundancy in your catalog. Enhancements show where existing requirements might need expansion.
Dismiss liberally. Dismissing noise is faster than agonizing over what you might miss, and you can always re-read the source document if you dismiss too aggressively.
Next Steps
- Learn about Artifacts and the different kinds Catalio supports
- Understand Change Proposals and the full review workflow
- Explore AI Chat for interactive, conversational requirement capture
Support
If extraction does not start after upload, verify that the file content type is text/markdown or text/plain. PDF and DOCX files are silently skipped in the current release. Check Settings > LLM Providers if the job starts but fails — LLM errors during extraction are retried up to three times before the artifact returns to :active status.
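If Catalio's workers are Sidekiq-based (an assumption), the retry behavior described above would look something like this. Only the retry count and the return to :active come from this page; everything else is illustrative:

```ruby
# Sketch only; assumes Sidekiq.
class DocumentExtractionWorker
  include Sidekiq::Job
  sidekiq_options retry: 3 # LLM errors retried up to three times

  # Runs after the final retry fails; how Catalio actually resets the
  # artifact is an assumption.
  sidekiq_retries_exhausted do |job, _exception|
    Artifact.find(job["args"].first).update!(status: :active)
  end

  def perform(artifact_id)
    # ...extraction steps described in The Pipeline above...
  end
end
```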