Backend Ingestion Flow (Full Size)

Use browser zoom if needed for presentation or screenshots.

flowchart LR
    A["PDF uploaded"] --> B["Parse PDF into structured JSON"]
    B --> C["Extract text blocks and metadata (page, section, source)"]
    B --> D["Extract images and figures"]
    B --> E{"Scanned PDF?"}
    E -->|Yes| F["Run OCR to recover text"]
    E -->|No| C
    F --> C
    D --> G["Generate captions for figures and images"]
    C --> H["Split text into chunks"]
    G --> I["Create figure caption records"]
    H --> J["Convert text chunks to embeddings"]
    I --> K["Convert caption records to embeddings"]
    J --> L[(PostgreSQL plus pgvector index)]
    K --> L
    L --> N["Admin reviews records"]
    N --> O["Edit Title and Reference fields and toggle visibility (hide or show)"]
    O --> L
    L --> M["Ready for retrieval during question answering"]
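
The text side of the flow (OCR routing, chunking, record creation, and the admin hide or show flag) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the backend's actual code: the `Record` fields, the `needs_ocr` heuristic and its `min_chars` threshold, and the chunk `size` and `overlap` values are all hypothetical. Parsing, OCR, captioning, embedding, and the PostgreSQL plus pgvector storage steps are left out.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One row destined for the vector store (field names are hypothetical)."""
    kind: str              # "text" for text chunks, "caption" for figure captions
    content: str
    page: int
    title: str = ""        # editable by an admin during review
    reference: str = ""    # editable by an admin during review
    hidden: bool = False   # the admin "hide or show" flag


def needs_ocr(page_texts, min_chars=20):
    """Assumed heuristic: treat the PDF as scanned when no page has a usable text layer."""
    return all(len(text.strip()) < min_chars for text in page_texts)


def split_into_chunks(text, size=200, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


def build_records(page_texts, captions, size=200, overlap=50):
    """Create text-chunk records and figure-caption records, tagged with page numbers."""
    records = []
    for page_number, text in enumerate(page_texts, start=1):
        for chunk in split_into_chunks(text, size, overlap):
            records.append(Record("text", chunk, page_number))
    for page_number, caption in captions:
        records.append(Record("caption", caption, page_number))
    return records


def visible_records(records):
    """Keep only records that an admin has not hidden, as retrieval would."""
    return [record for record in records if not record.hidden]
```

In this sketch, hiding a record only filters it out at retrieval time and does not delete it, which mirrors the loop in the diagram where an admin edit flows back into the same store.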