Document Processing and Media

How Documents Become Searchable

Before questions can be answered, documents are converted into searchable records:

  1. PDF files are parsed into structured outputs.
  2. Text, metadata, images, and figures are extracted.
  3. OCR is used when pages are scanned.
  4. Text and captions are split into chunks.
  5. Chunks are converted to embeddings.
  6. Embeddings are stored in PostgreSQL with pgvector for retrieval.

flowchart LR
    A["PDF uploaded"] --> B["Parse PDF into structured JSON"]
    B --> C["Extract text blocks and metadata (page, section, source)"]
    B --> D["Extract images and figures"]
    B --> E{"Scanned PDF?"}
    E -->|Yes| F["Run OCR to recover text"]
    E -->|No| C
    F --> C
    D --> G["Generate captions for figures and images"]
    C --> H["Split text into chunks"]
    G --> I["Create figure caption records"]
    H --> J["Convert text chunks to embeddings"]
    I --> K["Convert caption records to embeddings"]
    J --> L[(PostgreSQL plus pgvector index)]
    K --> L
    L --> N["Admin reviews records"]
    N --> O["Edit Title and Reference fields and set hide or show"]
    O --> L
    L --> M["Ready for retrieval during question answering"]
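
The chunking step in the flow above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the `ChunkRecord` fields, the function names, and the word-based window with overlap are all assumptions, since the source does not say how chunk boundaries are chosen or what the stored schema looks like.

```python
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    """One searchable record. Field names are illustrative, not the real schema."""
    text: str
    page: int
    section: str
    source: str
    hidden: bool = False  # hypothetical admin hide/show flag


def split_into_chunks(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def build_records(text: str, page: int, section: str, source: str,
                  chunk_size: int = 50, overlap: int = 10) -> list[ChunkRecord]:
    """Attach page, section, and source metadata to every chunk of a text block."""
    return [
        ChunkRecord(text=chunk, page=page, section=section, source=source)
        for chunk in split_into_chunks(text, chunk_size, overlap)
    ]
```

Overlapping windows are one common way to avoid cutting a sentence's context at a chunk boundary. Each resulting record would then be passed to whatever embedding model the deployment uses before being stored.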

Figure and Chart Retrieval

The system can also answer figure-specific questions by indexing image captions and related metadata.

flowchart LR
    A["PDF figure or image detected"] --> B["Extract image region and metadata (page, figure ID)"]
    B --> C["Generate image caption"]
    C --> D["Create caption embedding"]
    D --> E[(PostgreSQL plus pgvector)]
    F["User asks about a chart or figure"] --> G["Question embedding"]
    G --> E
    E --> H["Retrieve matching figure captions"]
    H --> I["LLM answers with figure-grounded context"]
    E --> J["Admin reviews figure captions"]
    J --> K["Edit captions and hide nonsensical figures"]
    K --> E
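
The retrieval half of this flow can be illustrated with a small in-memory stand-in. In the real system the ranking would presumably run inside PostgreSQL (pgvector provides a cosine distance operator, `<=>`); the sketch below only mimics that behaviour in plain Python. `CaptionRecord`, `retrieve_captions`, and the `hidden` flag are hypothetical names chosen for the example, and the two-dimensional vectors in the usage note are toy values, not real embeddings.

```python
import math
from dataclasses import dataclass


@dataclass
class CaptionRecord:
    """A figure caption with its embedding. Illustrative fields only."""
    figure_id: str
    page: int
    caption: str
    embedding: list[float]
    hidden: bool = False  # set by an admin to exclude the figure from results


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_captions(question_embedding: list[float],
                      records: list[CaptionRecord],
                      top_k: int = 3) -> list[CaptionRecord]:
    """Return the top_k visible captions most similar to the question embedding."""
    visible = [r for r in records if not r.hidden]
    ranked = sorted(
        visible,
        key=lambda r: cosine_similarity(question_embedding, r.embedding),
        reverse=True,
    )
    return ranked[:top_k]
```

For example, given records with embeddings `[1.0, 0.0]`, `[0.0, 1.0]`, and a hidden record with `[0.9, 0.1]`, a question embedded as `[1.0, 0.0]` retrieves the first record and never the hidden one. The retrieved captions, along with their page and figure identifiers, are what the LLM would receive as grounding context.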

Why Admin Curation Matters

  • Title and reference edits make results easier to read and verify.
  • Hide/show controls reduce noise from low-value or incorrect records.
  • Caption review improves accuracy for chart- and figure-based questions.
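
As a rough illustration of the curation controls listed above, an admin edit can be modelled as a partial update that touches only the fields the admin changed. The `Record` type and `apply_admin_edit` function are hypothetical; the source does not describe how edits are actually stored.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Record:
    """A stored record as an admin sees it. Illustrative fields only."""
    record_id: int
    title: str
    reference: str
    hidden: bool = False


def apply_admin_edit(record: Record,
                     title: Optional[str] = None,
                     reference: Optional[str] = None,
                     hidden: Optional[bool] = None) -> Record:
    """Return a copy of the record with only the supplied fields changed."""
    changes = {}
    if title is not None:
        changes["title"] = title.strip()
    if reference is not None:
        changes["reference"] = reference.strip()
    if hidden is not None:
        changes["hidden"] = hidden
    return replace(record, **changes)
```

Leaving unspecified fields untouched means that hiding a noisy record does not discard any earlier title or reference corrections.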