Document Extraction

Upload documents and meeting transcripts and let Catalio's AI extract structured requirements, capabilities, and processes automatically

Document Extraction is the process by which Catalio reads an uploaded file or meeting transcript and automatically identifies structured requirements, capabilities, processes, and integrations — producing Change Proposals that your team reviews before anything enters the catalog.

Instead of manually reading a forty-page spec and transcribing requirements by hand, you upload the document and let the AI do the first pass. You then spend your time reviewing, refining, and accepting the results — rather than creating from scratch.

The Pipeline

When you upload an Artifact with supported content, the following steps happen automatically:

Upload → DocumentExtractionWorker → GapAnalysisWorker → Change Proposals ready

Step 1: Upload and Trigger

Uploading an artifact of kind :attachment triggers the extraction pipeline immediately after the database transaction commits. This prevents the background job from starting before the file is fully saved.

The artifact’s status moves to :processing while extraction is running. You can watch this status in the UI — a spinner indicates the job is in progress.
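A minimal sketch of this trigger, assuming Rails-style commit callbacks (the class shapes and method names here are illustrative stand-ins, not Catalio's actual code):

```ruby
# Illustrative stand-in for the real background worker.
class DocumentExtractionWorker
  def self.queue
    @queue ||= []
  end

  def self.enqueue(artifact)
    queue << artifact
  end
end

class Artifact
  attr_reader :kind, :status

  def initialize(kind)
    @kind = kind
    @status = :active
  end

  # Modeled after a Rails after_commit callback: the job is enqueued only
  # once the transaction has committed, so the worker never sees a
  # half-saved file.
  def after_commit
    return unless kind == :attachment
    @status = :processing
    DocumentExtractionWorker.enqueue(self)
  end
end
```

Only `:attachment` artifacts enter the pipeline; other kinds keep their status and no job is enqueued.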

Step 2: Entity Extraction (LLM)

Catalio reads the file content and uses the AI to extract structured information. The AI returns:

  • A summary — a one-sentence description of the document (stored in the artifact’s file_metadata as ai_summary)
  • An entities array — each entity has a name, description, entity type, confidence score, and source reference (the section or page where it was found)

Valid entity types extracted from documents: requirement, capability, process, integration, application, technology.

For each extracted entity, a Change Proposal is created with:

  • action_type: :create
  • source: :artifact_extraction
  • confidence_score from the LLM
  • evidence linking back to the artifact by ID
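The mapping from extracted entities to proposals might look like this sketch (the field names follow the lists above; the helper itself is hypothetical):

```ruby
VALID_ENTITY_TYPES = %w[requirement capability process integration application technology].freeze

# Hypothetical helper: turn the LLM's entities array into Change Proposal hashes.
def proposals_from(entities, artifact_id)
  entities
    .select { |e| VALID_ENTITY_TYPES.include?(e[:entity_type]) }
    .map do |e|
      {
        action_type: :create,
        source: :artifact_extraction,
        name: e[:name],
        description: e[:description],
        confidence_score: e[:confidence],
        evidence: { artifact_id: artifact_id, source_reference: e[:source_reference] }
      }
    end
end
```

Entities with an unrecognized type are dropped before any proposal is built.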

Step 3: Gap Analysis (LLM)

After proposals are created, the GapAnalysisWorker runs a two-part enrichment:

Quality Assessment — scores each proposal for clarity, testability, and completeness.

Gap Classification — compares each proposal against existing requirements in your catalog using cosine similarity. A gap_classification is assigned:

  • New requirement (no match found)
  • Duplicate of an existing requirement
  • Enhancement to an existing requirement

The similarity_score and matched_requirement_id fields on the proposal show which existing requirement the AI compared against.
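Cosine similarity over embedding vectors reduces to a dot product over magnitudes; the classification thresholds below are illustrative assumptions, not Catalio's actual cutoffs:

```ruby
# Cosine similarity between two embedding vectors.
def cosine(a, b)
  dot   = a.zip(b).sum { |x, y| x * y }
  mag_a = Math.sqrt(a.sum { |x| x * x })
  mag_b = Math.sqrt(b.sum { |x| x * x })
  dot / (mag_a * mag_b)
end

# Illustrative thresholds only; the real cutoffs are an implementation detail.
def gap_classification(similarity)
  return :duplicate   if similarity >= 0.90
  return :enhancement if similarity >= 0.70
  :new_requirement
end
```

An identical vector scores 1.0 (duplicate); an orthogonal one scores 0.0 (new requirement).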

After the gap analysis completes, the artifact returns to Active status and the Change Proposals view updates to show the new proposals ready for review.

Supported Content Types

Content Type        Extension   Notes
text/markdown       .md         Full extraction
text/plain          .txt        Full extraction
Other (PDF, DOCX)               Skipped — specialized parsing not yet available

Documents are truncated at 50,000 characters before being sent to the LLM. Content beyond this limit is not analyzed.
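This gatekeeping reduces to a membership check and a slice (the constant and method names are assumptions):

```ruby
SUPPORTED_TYPES = ["text/markdown", "text/plain"].freeze
MAX_LLM_CHARS   = 50_000

# Only plain-text content types enter the extraction pipeline.
def extractable?(content_type)
  SUPPORTED_TYPES.include?(content_type)
end

# Truncate before sending to the LLM; anything past the cap is never analyzed.
def llm_input(content)
  content[0, MAX_LLM_CHARS]
end
```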

Confidence Scoring

Document-sourced proposals use a 0.0–0.90 confidence scale. The maximum is capped at 0.90 to leave room for human validation — the AI never claims certainty.

Confidence Range   Signal
0.70–0.90          Explicit requirement language: “must”, “shall”, numbered specs, compliance mandates
0.50–0.70          Clear needs: “should”, described behaviors, capability descriptions
0.30–0.50          Implied: background context, aspirational goals, general descriptions
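As a rough illustration of those bands (the real score is assigned by the LLM, not by keyword matching):

```ruby
# Keyword heuristic purely for illustration — Catalio's scores are LLM-assigned.
def confidence_band(text)
  case text
  when /\b(must|shall)\b/i then 0.70..0.90
  when /\bshould\b/i       then 0.50..0.70
  else                          0.30..0.50
  end
end
```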

Meeting Transcript Extraction

Catalio also extracts requirements from meeting transcripts using the same pipeline but a different prompt tuned for conversational language.

Meeting transcripts are inherently more ambiguous than formal documents. The extraction prompt distinguishes:

  • Firm requirements — modal verbs (“must”, “will”, “shall”) and explicit acceptance criteria
  • Stakeholder agreements — decisions with clear action implications
  • Aspirational discussion — “it would be nice if…”, “what if we could…” — these are skipped

Meeting-sourced proposals use a 0.0–0.55 confidence scale. The lower cap reflects conversational ambiguity — a conversation can express a need but rarely with the precision of a formal specification.

Each meeting proposal includes:

  • requirement_text — a clean, standalone requirement statement (not a raw quote)
  • excerpt — the relevant passage from the transcript (max 120 characters)
  • speaker — the speaker’s name if identifiable
  • category — functional, non_functional, constraint, or integration
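A sketch of one meeting-sourced proposal, applying the 120-character excerpt cap and the 0.55 confidence ceiling (the builder itself is hypothetical):

```ruby
MEETING_CONFIDENCE_CAP = 0.55
EXCERPT_MAX = 120

# Hypothetical builder for a meeting-sourced proposal hash.
def meeting_proposal(requirement_text:, excerpt:, speaker:, category:, confidence:)
  {
    requirement_text: requirement_text,  # clean, standalone statement, not a raw quote
    excerpt: excerpt[0, EXCERPT_MAX],    # transcript passage, capped at 120 characters
    speaker: speaker,
    category: category,                  # :functional, :non_functional, :constraint, :integration
    confidence_score: [confidence, MEETING_CONFIDENCE_CAP].min
  }
end
```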

How to Review Results

After extraction completes, navigate to the artifact’s detail page to see the generated proposals. From there:

  • Accept a proposal to create the entity in your catalog
  • Modify a proposal before accepting to correct names or descriptions
  • Dismiss proposals that are duplicates, irrelevant, or incorrect

See Change Proposals for a full walkthrough of the review workflow.

Differences from Manual Authoring

Aspect               Document Extraction                          Manual Authoring
Speed                Seconds per document                         Minutes per requirement
Coverage             Catches items easily missed                  Author-dependent
Accuracy             Requires human review                        Direct ownership
Confidence           LLM-assigned, 0.0–0.90                       Always high by default
Source traceability  source_reference links to document section   Manually set

Best Practices

Use well-structured source documents. Documents with numbered lists, headings, and explicit “must”/“shall” language produce higher-confidence proposals. Narrative prose generates more low-confidence proposals that require more review effort.

Upload one document at a time. Extraction is sequential and runs in the background. Large batches create a queue — monitor the artifact status to know when each document is ready.

Review gap analysis classifications. The “duplicate” and “enhancement” classifications are the most actionable. Duplicates reveal redundancy in your catalog. Enhancements show where existing requirements might need expansion.

Dismiss liberally. Dismissing noise is faster than deliberating over whether each proposal might matter. You can always re-read the source document if you dismiss too aggressively.

Next Steps

  • Learn about Artifacts and the different kinds Catalio supports
  • Understand Change Proposals and the full review workflow
  • Explore AI Chat for interactive, conversational requirement capture

Support

If extraction does not start after upload, verify that the file content type is text/markdown or text/plain. PDF and DOCX files are silently skipped in the current release. If the job starts but then fails, check Settings > LLM Providers: LLM errors during extraction are retried up to three times before the artifact returns to :active status.
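That retry behavior can be sketched as a simple loop — three attempts, then the artifact falls back to :active (the helper is illustrative, not Catalio's code):

```ruby
MAX_ATTEMPTS = 3

# Run the extraction block, retrying on errors; after three failed
# attempts the artifact's status still returns to :active.
def extract_with_retries
  attempts = 0
  begin
    attempts += 1
    yield
    { status: :active, attempts: attempts }
  rescue StandardError
    retry if attempts < MAX_ATTEMPTS
    { status: :active, attempts: attempts, failed: true }
  end
end
```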