Backend Ingestion Flow (Full Size)
Use browser zoom if needed for presentation or screenshots.
flowchart LR
A["PDF uploaded"] --> B["Parse PDF into structured JSON"]
B --> C["Extract text blocks and metadata (page, section, source)"]
B --> D["Extract images and figures"]
B --> E{"Scanned PDF?"}
E -->|Yes| F["Run OCR to recover text"]
E -->|No| C
F --> C
D --> G["Generate captions for figures and images"]
C --> H["Split text into chunks"]
G --> I["Create figure caption records"]
H --> J["Convert text chunks to embeddings"]
I --> K["Convert caption records to embeddings"]
J --> L[(PostgreSQL plus pgvector index)]
K --> L
L --> N["Admin reviews records"]
N --> O["Edit Title and Reference fields and set visibility (hide or show)"]
O --> L
L --> M["Ready for retrieval during question answering"]
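
The diagram does not say how the "Split text into chunks" step works. The following is a minimal sketch of one common approach, assuming fixed-size word windows with a small overlap. The names `Chunk`, `split_into_chunks`, `max_words`, and `overlap` are hypothetical and do not come from the flow above; the only parts taken from the diagram are the page, section, and source metadata that each text block carries.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One chunk of text plus the metadata named in the diagram (hypothetical type)."""
    text: str
    page: int
    section: str
    source: str


def split_into_chunks(text, page, section, source, max_words=200, overlap=40):
    """Split one text block into overlapping word windows, copying its metadata.

    Assumption: chunk size is measured in whitespace-separated words. A real
    pipeline might count tokens or split on sentence boundaries instead.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(Chunk(" ".join(window), page, section, source))
        if start + max_words >= len(words):
            break  # this window already reached the end of the block
    return chunks
```

Each returned `Chunk` would then be passed to the embedding step and stored with its metadata, so that the admin review and retrieval stages can trace a record back to its page, section, and source.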