Document Extraction

Upload documents and meeting transcripts and let Catalio's AI extract structured requirements, capabilities, and processes automatically

Document Extraction is the process by which Catalio reads an uploaded file or meeting transcript and automatically identifies structured requirements, capabilities, processes, and integrations — producing Change Proposals that your team reviews before anything enters the catalog.

Instead of manually reading a forty-page spec and transcribing requirements by hand, you upload the document and let the AI do the first pass. You then spend your time reviewing, refining, and accepting the results — rather than creating from scratch.

The Pipeline

When you upload an Artifact with supported content, the following steps happen automatically:

Upload → DocumentExtractionWorker → GapAnalysisWorker → Change Proposals ready

Step 1: Upload and Trigger

Uploading an artifact of kind :attachment triggers the extraction pipeline immediately after the database transaction commits. This prevents the background job from starting before the file is fully saved.

The artifact’s status moves to :processing while extraction is running. You can watch this status in the UI — a spinner indicates the job is in progress.
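A minimal sketch of this trigger, assuming Rails-style commit callbacks (the class shapes and method names here are illustrative stand-ins, not Catalio's actual code):

```ruby
# Illustrative stand-in for the real background worker.
class DocumentExtractionWorker
  def self.queue
    @queue ||= []
  end

  def self.enqueue(artifact)
    queue << artifact
  end
end

class Artifact
  attr_reader :kind, :status

  def initialize(kind)
    @kind = kind
    @status = :active
  end

  # Modeled after a Rails after_commit callback: the job is enqueued only
  # once the transaction has committed, so the worker never sees a
  # half-saved file.
  def after_commit
    return unless kind == :attachment
    @status = :processing
    DocumentExtractionWorker.enqueue(self)
  end
end
```

Only `:attachment` artifacts enter the pipeline; other kinds keep their status and no job is enqueued.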

Step 2: Entity Extraction (LLM)

Catalio reads the file content and uses the AI to extract structured information. The AI returns:

  • A summary — a one-sentence description of the document (stored in the artifact’s file_metadata as ai_summary)
  • An entities array — each entity has a name, description, entity type, confidence score, and source reference (the section or page where it was found)

Valid entity types extracted from documents: requirement, capability, process, integration, application, technology.

For each extracted entity, a Change Proposal is created with:

  • action_type: :create
  • source: :artifact_extraction
  • confidence_score from the LLM
  • evidence linking back to the artifact by ID
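The mapping from extracted entities to proposals might look like this sketch (the field names follow the lists above; the helper itself is hypothetical):

```ruby
VALID_ENTITY_TYPES = %w[requirement capability process integration application technology].freeze

# Hypothetical helper: turn the LLM's entities array into Change Proposal hashes.
def proposals_from(entities, artifact_id)
  entities
    .select { |e| VALID_ENTITY_TYPES.include?(e[:entity_type]) }
    .map do |e|
      {
        action_type: :create,
        source: :artifact_extraction,
        name: e[:name],
        description: e[:description],
        confidence_score: e[:confidence],
        evidence: { artifact_id: artifact_id, source_reference: e[:source_reference] }
      }
    end
end
```

Entities with an unrecognized type are dropped before any proposal is built.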

Step 3: Gap Analysis (LLM)

After proposals are created, the GapAnalysisWorker runs a two-part enrichment:

Quality Assessment — scores each proposal for clarity, testability, and completeness.

Gap Classification — compares each proposal against existing requirements in your catalog using cosine similarity. A gap_classification is assigned:

  • New requirement (no match found)
  • Duplicate of an existing requirement
  • Enhancement to an existing requirement

The similarity_score and matched_requirement_id fields on the proposal show which existing requirement the AI compared against.
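Cosine similarity over embedding vectors reduces to a dot product over magnitudes; the classification thresholds below are illustrative assumptions, not Catalio's actual cutoffs:

```ruby
# Cosine similarity between two embedding vectors.
def cosine(a, b)
  dot   = a.zip(b).sum { |x, y| x * y }
  mag_a = Math.sqrt(a.sum { |x| x * x })
  mag_b = Math.sqrt(b.sum { |x| x * x })
  dot / (mag_a * mag_b)
end

# Illustrative thresholds only; the real cutoffs are an implementation detail.
def gap_classification(similarity)
  return :duplicate   if similarity >= 0.90
  return :enhancement if similarity >= 0.70
  :new_requirement
end
```

An identical vector scores 1.0 (duplicate); an orthogonal one scores 0.0 (new requirement).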

After the gap analysis completes, the artifact returns to Active status and the Change Proposals view updates to show the new proposals ready for review.

Supported Content Types

Content Type        Extension   Notes
text/markdown       .md         Full extraction
text/plain          .txt        Full extraction
Other (PDF, DOCX)               Skipped — specialized parsing not yet available

Documents are truncated at 50,000 characters before being sent to the LLM. Content beyond this limit is not analyzed.
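This gatekeeping reduces to a membership check and a slice (the constant and method names are assumptions):

```ruby
SUPPORTED_TYPES = ["text/markdown", "text/plain"].freeze
MAX_LLM_CHARS   = 50_000

# Only plain-text content types enter the extraction pipeline.
def extractable?(content_type)
  SUPPORTED_TYPES.include?(content_type)
end

# Truncate before sending to the LLM; anything past the cap is never analyzed.
def llm_input(content)
  content[0, MAX_LLM_CHARS]
end
```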

Confidence Scoring

Document-sourced proposals use a 0.0–0.90 confidence scale. The maximum is capped at 0.90 to leave room for human validation — the AI never claims certainty.

Confidence Range   Signal
0.70–0.90          Explicit requirement language: “must”, “shall”, numbered specs, compliance mandates
0.50–0.70          Clear needs: “should”, described behaviors, capability descriptions
0.30–0.50          Implied: background context, aspirational goals, general descriptions
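As a rough illustration of those bands (the real score is assigned by the LLM, not by keyword matching):

```ruby
# Keyword heuristic purely for illustration — Catalio's scores are LLM-assigned.
def confidence_band(text)
  case text
  when /\b(must|shall)\b/i then 0.70..0.90
  when /\bshould\b/i       then 0.50..0.70
  else                          0.30..0.50
  end
end
```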

Meeting Transcript Extraction

Catalio also extracts requirements from meeting transcripts using the same pipeline but a different prompt tuned for conversational language.

Meeting transcripts are inherently more ambiguous than formal documents. The extraction prompt distinguishes:

  • Firm requirements — modal verbs (“must”, “will”, “shall”) and explicit acceptance criteria
  • Stakeholder agreements — decisions with clear action implications
  • Aspirational discussion — “it would be nice if…”, “what if we could…” — these are skipped

Meeting-sourced proposals use a 0.0–0.55 confidence scale. The lower cap reflects conversational ambiguity — a conversation can express a need but rarely with the precision of a formal specification.

Each meeting proposal includes:

  • requirement_text — a clean, standalone requirement statement (not a raw quote)
  • excerpt — the relevant passage from the transcript (max 120 characters)
  • speaker — the speaker’s name if identifiable
  • category — functional, non_functional, constraint, or integration
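A sketch of one meeting-sourced proposal, applying the 120-character excerpt cap and the 0.55 confidence ceiling (the builder itself is hypothetical):

```ruby
MEETING_CONFIDENCE_CAP = 0.55
EXCERPT_MAX = 120

# Hypothetical builder for a meeting-sourced proposal hash.
def meeting_proposal(requirement_text:, excerpt:, speaker:, category:, confidence:)
  {
    requirement_text: requirement_text,  # clean, standalone statement, not a raw quote
    excerpt: excerpt[0, EXCERPT_MAX],    # transcript passage, capped at 120 characters
    speaker: speaker,
    category: category,                  # :functional, :non_functional, :constraint, :integration
    confidence_score: [confidence, MEETING_CONFIDENCE_CAP].min
  }
end
```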

How to Review Results

After extraction completes, navigate to the artifact’s detail page to see the generated proposals. From there:

  • Accept a proposal to create the entity in your catalog
  • Modify a proposal before accepting to correct names or descriptions
  • Dismiss proposals that are duplicates, irrelevant, or incorrect

See Change Proposals for a full walkthrough of the review workflow.

Differences from Manual Authoring

Aspect               Document Extraction                          Manual Authoring
Speed                Seconds per document                         Minutes per requirement
Coverage             Catches items easily missed                  Author-dependent
Accuracy             Requires human review                        Direct ownership
Confidence           LLM-assigned, 0.0–0.90                       Always high by default
Source traceability  source_reference links to document section   Manually set

Best Practices

Use well-structured source documents. Documents with numbered lists, headings, and explicit “must”/“shall” language produce higher-confidence proposals. Narrative prose generates more low-confidence proposals that require more review effort.

Upload one document at a time. Extraction is sequential and runs in the background. Large batches create a queue — monitor the artifact status to know when each document is ready.

Review gap analysis classifications. The “duplicate” and “enhancement” classifications are the most actionable. Duplicates reveal redundancy in your catalog. Enhancements show where existing requirements might need expansion.

Dismiss liberally. Dismissing noise is faster than deliberating over whether each proposal might matter. You can always re-read the source document if you dismiss too aggressively.

Next Steps

  • Learn about Artifacts and the different kinds Catalio supports
  • Understand Change Proposals and the full review workflow
  • Explore AI Chat for interactive, conversational requirement capture

Support

If extraction does not start after upload, verify that the file content type is text/markdown or text/plain. PDF and DOCX files are silently skipped in the current release. If the job starts but then fails, check Settings > LLM Providers: LLM errors during extraction are retried up to three times before the artifact returns to :active status.
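That retry behavior can be sketched as a simple loop — three attempts, then the artifact falls back to :active (the helper is illustrative, not Catalio's code):

```ruby
MAX_ATTEMPTS = 3

# Run the extraction block, retrying on errors; after three failed
# attempts the artifact's status still returns to :active.
def extract_with_retries
  attempts = 0
  begin
    attempts += 1
    yield
    { status: :active, attempts: attempts }
  rescue StandardError
    retry if attempts < MAX_ATTEMPTS
    { status: :active, attempts: attempts, failed: true }
  end
end
```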