| OCR for Scanned PDFs | ✅ via `force_ocr` | ✅ via `force_ocr` |
| Java Required | Yes (11+) | Yes (11+) |
| GPU Required | No | No (but the hybrid server needs an LLM) |
## Components
@@ -86,10 +87,10 @@ Converts TDoc to markdown using full pipeline.
1. Fetch TDoc files via `fetch_tdoc_files()`
1. Convert to PDF if needed (via `convert-lo` / LibreOffice)
1. Extract text using docling (Standard or VLM pipeline)
1. Extract text using OpenDataLoader (Standard or Hybrid mode)
1. Cache markdown to `.ai/<id>.md`
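
The steps above can be sketched as a cache-first wrapper. This is a minimal illustration, not the project's actual implementation: the function name `convert_tdoc_to_markdown`, the `force` flag, and the three injected callables (`fetch`, `to_pdf`, `extract`) are hypothetical stand-ins for `fetch_tdoc_files()`, the LibreOffice conversion, and the extraction backend, whose real signatures are not shown here.

```python
from pathlib import Path
from typing import Callable


def convert_tdoc_to_markdown(
    tdoc_id: str,
    fetch: Callable[[str], Path],    # hypothetical: returns the downloaded source file
    to_pdf: Callable[[Path], Path],  # hypothetical: DOCX/DOC -> PDF conversion
    extract: Callable[[Path], str],  # hypothetical: PDF -> markdown text
    cache_dir: Path = Path(".ai"),
    force: bool = False,
) -> str:
    """Fetch, convert, extract, and cache markdown for one TDoc."""
    cache_file = cache_dir / f"{tdoc_id}.md"

    # Reuse the cached markdown unless re-conversion is forced.
    if cache_file.exists() and not force:
        return cache_file.read_text(encoding="utf-8")

    source = fetch(tdoc_id)

    # Only convert when the source is not already a PDF.
    pdf = source if source.suffix.lower() == ".pdf" else to_pdf(source)

    markdown = extract(pdf)

    # Write the result to <cache_dir>/<id>.md for later calls.
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

Passing the stages in as callables keeps the sketch independent of any particular extraction library, so either backend could be plugged in without changing the caching logic.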
**Table Structure Detection:** Available in both Standard and VLM modes. The Standard pipeline uses `do_table_structure=True` with `TableStructureOptions(do_cell_matching=True)`. The VLM pipeline provides enhanced detection.
**Table Structure Detection:** Available in both Standard and Hybrid modes. The Standard mode uses local extraction. The Hybrid mode provides AI-enhanced detection for complex pages.
**Caching:**
@@ -148,10 +149,15 @@ To force re-conversion:
| Library | Purpose |
|---------|---------|
| `convert-lo` | DOCX/DOC to PDF conversion via LibreOffice |
| `docling` | PDF/DOCX to markdown text extraction |
| `opendataloader-pdf` | PDF/DOCX to markdown text extraction (#1 in benchmarks, 0.907 accuracy) |
| `opendataloader-pdf[hybrid]` | AI-enhanced extraction for complex pages (optional) |