Document Processing and Media
How Documents Become Searchable
Before questions can be answered, documents are converted into searchable records:
- PDF files are parsed into structured outputs.
- Text, metadata, images, and figures are extracted.
- OCR recovers text from pages that are scanned images.
- Text and captions are split into chunks.
- Chunks are converted to embeddings.
- Embeddings are stored in PostgreSQL with pgvector for retrieval.
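The first few steps above can be sketched in plain Python. This is a minimal illustration, not the system's actual code: the `Chunk` class, the `needs_ocr` heuristic, the function names, and the chunk size and overlap values are all assumptions made for the example. It splits on characters for simplicity; a real pipeline would more likely split on sentence or token boundaries.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One searchable unit of text plus the metadata needed to cite it."""
    text: str
    page: int
    source: str


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: a page that yields almost no text is probably a scan."""
    return len(extracted_text.strip()) < min_chars


def split_into_chunks(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of at most max_chars that overlap by `overlap` characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    text = " ".join(text.split())  # collapse runs of whitespace and newlines
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += max_chars - overlap
    return chunks


def page_to_chunks(page_text: str, page: int, source: str) -> list[Chunk]:
    """Attach page and source metadata to every chunk from one page."""
    return [Chunk(t, page, source) for t in split_into_chunks(page_text)]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.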
```mermaid
flowchart LR
A["PDF uploaded"] --> B["Parse PDF into structured JSON"]
B --> C["Extract text blocks and metadata (page, section, source)"]
B --> D["Extract images and figures"]
B --> E{"Scanned PDF?"}
E -->|Yes| F["Run OCR to recover text"]
E -->|No| C
F --> C
D --> G["Generate captions for figures and images"]
C --> H["Split text into chunks"]
G --> I["Create figure caption records"]
H --> J["Convert text chunks to embeddings"]
I --> K["Convert caption records to embeddings"]
J --> L[(PostgreSQL plus pgvector index)]
K --> L
L --> N["Admin reviews records"]
N --> O["Edit Title and Reference fields and set hide or show"]
O --> L
L --> M["Ready for retrieval during question answering"]
```
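As a rough sketch of the storage and admin-review steps, the statements below show one way the records could be laid out in PostgreSQL with pgvector. The table name `records`, its columns, the 384-dimension embedding size, and the helper `to_vector_literal` are assumptions for illustration; only the `vector` type, the `hnsw` index method with `vector_cosine_ops`, and the `<=>` cosine-distance operator come from pgvector itself. The SQL is held in Python strings with psycopg-style named placeholders and is not executed here.

```python
EMBEDDING_DIM = 384  # assumed; must match the embedding model actually used

# Schema statements, meant to be run one at a time by a database client.
SCHEMA_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector;",
    f"""
    CREATE TABLE IF NOT EXISTS records (
        id         bigserial PRIMARY KEY,
        kind       text NOT NULL,              -- 'text' or 'caption'
        content    text NOT NULL,
        title      text,                       -- editable by an admin
        reference  text,                       -- editable by an admin
        page       integer,
        hidden     boolean NOT NULL DEFAULT false,
        embedding  vector({EMBEDDING_DIM}) NOT NULL
    );
    """,
    """
    CREATE INDEX IF NOT EXISTS records_embedding_idx
        ON records USING hnsw (embedding vector_cosine_ops);
    """,
]

# Nearest-neighbour search by cosine distance, skipping hidden records.
SEARCH_SQL = """
SELECT id, kind, content, title, reference, page
FROM records
WHERE hidden = false
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""

# Admin actions: edit display fields, or hide/show a record.
EDIT_SQL = "UPDATE records SET title = %(title)s, reference = %(reference)s WHERE id = %(id)s;"
SET_HIDDEN_SQL = "UPDATE records SET hidden = %(hidden)s WHERE id = %(id)s;"


def to_vector_literal(embedding: list[float]) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(embedding)}")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"
```

Keeping `hidden` as a column rather than deleting rows means an admin can restore a record later, and the search query only needs one extra `WHERE` condition.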
Figure and Chart Retrieval
The system can also answer figure-specific questions by indexing image captions and related metadata.
```mermaid
flowchart LR
A["PDF figure or image detected"] --> B["Extract image region and metadata (page and figure ID)"]
B --> C["Generate image caption"]
C --> D["Create caption embedding"]
D --> E[(PostgreSQL plus pgvector)]
F["User asks about a chart or figure"] --> G["Question embedding"]
G --> E
E --> H["Retrieve matching figure captions"]
H --> I["LLM answers with figure-grounded context"]
E --> J["Admin reviews figure captions"]
J --> K["Edit captions and hide nonsensical figures"]
K --> E
```
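The retrieval path in this flow can be illustrated with a small in-memory stand-in for the database. This is a sketch under stated assumptions: the `CaptionRecord` class, the function names, and the prompt wording are hypothetical, the embeddings are toy two-dimensional vectors rather than real model output, and the ranking is done in Python instead of by pgvector.

```python
import math
from dataclasses import dataclass


@dataclass
class CaptionRecord:
    figure_id: str
    page: int
    caption: str
    embedding: list[float]
    hidden: bool = False  # set by an admin to suppress a bad caption


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_captions(query_embedding, records, k=3):
    """Return the k visible caption records most similar to the question."""
    visible = [r for r in records if not r.hidden]
    ranked = sorted(
        visible,
        key=lambda r: cosine_similarity(query_embedding, r.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, hits: list[CaptionRecord]) -> str:
    """Assemble figure-grounded context for the LLM, citing figure ID and page."""
    context = "\n".join(f"[{r.figure_id}, page {r.page}] {r.caption}" for r in hits)
    return (
        "Answer using only the figure captions below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


records = [
    CaptionRecord("fig-1", 4, "Bar chart of revenue by quarter", [1.0, 0.0]),
    CaptionRecord("fig-2", 9, "Diagram of the network layout", [0.0, 1.0]),
    CaptionRecord("fig-3", 5, "Unreadable scan artifact", [0.9, 0.1], hidden=True),
]
hits = retrieve_captions([1.0, 0.0], records, k=2)
prompt = build_prompt("What does the revenue chart show?", hits)
```

Here `fig-3` would otherwise rank second for the question, but because it is hidden it never reaches the prompt, which is the effect the admin review loop is meant to have.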
Why Admin Curation Matters
- Title and reference edits make results easier to read and verify.
- Hide/show controls reduce noise from low-value or incorrect records.
- Caption review improves accuracy for chart- and figure-based questions.