Commit f6be862d authored by Jan Reimes

📝 docs(ai): update installation instructions for opendataloader_pdf and hybrid mode

parent da31f588
+6 −2
@@ -426,7 +426,7 @@ ______________________________________________________________________

### Installation Issues

**Problem:** `ModuleNotFoundError: No module named 'docling'`
**Problem:** `ModuleNotFoundError: No module named 'opendataloader_pdf'`

**Solution:** Install the AI optional dependencies:

@@ -434,6 +434,10 @@ ______________________________________________________________________
uv add 3gpp-crawler[ai]
```
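
To confirm that the reinstall worked, a quick check along these lines may help. It is only a sketch: the helper name `missing_modules` is made up for illustration, and the module names are the ones quoted in the error messages in this section.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on sys.path."""
    # find_spec() returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(["opendataloader_pdf", "lancedb"])` should return an empty list once the `[ai]` extra is installed.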

**Problem:** `Java not found` or `opendataloader_pdf requires Java 11+`

**Solution:** Install Java 11 or later and ensure it's on your system PATH. Download from https://adoptium.net/ or use your system's package manager.
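
One possible way to verify the Java requirement from Python is sketched below. The helper names `parse_java_major` and `java_major_version` are hypothetical and not part of `opendataloader_pdf`. The sketch assumes the usual `java -version` output format, such as `openjdk version "17.0.9"`.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 and older report themselves as "1.8.0_xxx".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version():
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None  # Java is not on PATH.
    # The version banner is usually written to stderr, so read both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)
```

If `java_major_version()` returns `None` or a value below 11, the installation does not meet the requirement described above.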

**Problem:** `lancedb not available`

**Solution:** LanceDB is included in the `[ai]` extra. Reinstall:
@@ -578,5 +582,5 @@ ______________________________________________________________________
## Additional Resources

- [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers) - Complete list of 100+ supported LLM providers
- [Docling Documentation](https://github.com/docling-project/docling) - Document extraction library
- [OpenDataLoader PDF Documentation](https://github.com/opendataloader-project/opendataloader-pdf) - PDF extraction library (#1 in benchmarks)
- [LanceDB Documentation](https://lancedb.github.io/lancedb/) - Vector database
+25 −19
@@ -16,9 +16,9 @@ fetch_tdoc_files() ──► resolve_via_whatthespec()

convert_tdoc_to_markdown()

        ├──► Already PDF? ──► docling (DocumentConverter)
        ├──► Already PDF? ──► opendataloader_pdf

        └──► DOCX/DOC? ──► convert-lo (LibreOffice) ──► PDF ──► docling
        └──► DOCX/DOC? ──► convert-lo (LibreOffice) ──► PDF ──► opendataloader_pdf


                     Cache to .ai/<id>.md
@@ -29,35 +29,36 @@ LLM Summarization

### VLM Pipeline (Optional)

When `--vlm` flag is used, the pipeline uses the VLM-powered extraction:
When the `--vlm` flag is used, the pipeline enables hybrid AI mode for complex PDF pages:

```
workspace process --vlm


        ▼ (for each document)


convert_tdoc_to_markdown(vlm_options=VlmOptions(...))
convert_tdoc_to_markdown(vlm_options=VlmOptions(enable_hybrid=True, ...))


VlmPipeline (Granite Docling)
        ├──► Enhanced picture descriptions (VLM-generated)
        └──► Enhanced formula enrichment (VLM-enhanced)
OpenDataLoader with Hybrid Mode
        ├──► Local extraction for simple pages (fast, deterministic)
        └──► AI backend (SmolVLM 256M) for complex pages (tables, formulas, pictures)


Standard Extraction Outputs + VLM Artifacts
Standard Extraction Outputs + AI-Enhanced Artifacts
```

**Key Differences:**

| Feature | Standard Pipeline | VLM Pipeline |
|---------|-------------------|---------------|
| Pipeline | `StandardPdfPipeline` | `VlmPipeline` |
| Table Structure | ✅ `do_table_structure=True` | ✅ Enabled |
| Formula Enrichment | ✅ `CodeFormulaVlmOptions` | ✅ Enhanced |
| Picture Description | ❌ Not available | ✅ VLM-generated |
| GPU Required | No | Yes |
| Feature | Standard Pipeline | Hybrid Mode |
|---------|-------------------|-------------|
| Backend | `opendataloader_pdf` (local) | `opendataloader_pdf[hybrid]` |
| Table Structure | ✅ Enabled | ✅ Enabled |
| Formula Enrichment | ✅ Enabled | ✅ Enhanced |
| Picture Description | ✅ Enabled | ✅ AI-generated (SmolVLM) |
| OCR for Scanned PDFs | ✅ via `force_ocr` | ✅ via `force_ocr` |
| Java Required | Yes (11+) | Yes (11+) |
| GPU Required | No | No (but hybrid server needs LLM) |

## Components

@@ -86,10 +87,10 @@ Converts TDoc to markdown using full pipeline.

1. Fetch TDoc files via `fetch_tdoc_files()`
1. Convert to PDF if needed (via convert-lo / LibreOffice)
1. Extract text using docling (Standard or VLM pipeline)
1. Extract text using OpenDataLoader (Standard or Hybrid mode)
1. Cache markdown to `.ai/<id>.md`
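
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `convert_with_cache`, `to_pdf`, `extract_markdown`, and the `force` parameter are hypothetical names. The two callables stand in for the LibreOffice conversion and the OpenDataLoader extraction.

```python
from pathlib import Path


def convert_with_cache(doc_id, source, cache_dir, to_pdf, extract_markdown, force=False):
    """Convert a document to markdown, reusing a cached copy when one exists.

    `to_pdf` and `extract_markdown` are injected stand-ins (assumptions) for
    the real conversion and extraction steps.
    """
    cache_dir = Path(cache_dir)
    cached = cache_dir / f"{doc_id}.md"
    # Reuse the cached markdown unless re-conversion is forced.
    if cached.exists() and not force:
        return cached.read_text(encoding="utf-8")

    source = Path(source)
    # Only non-PDF inputs (DOCX/DOC) need the intermediate PDF conversion.
    pdf = source if source.suffix.lower() == ".pdf" else to_pdf(source)
    markdown = extract_markdown(pdf)

    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(markdown, encoding="utf-8")
    return markdown
```

Passing the conversion and extraction steps as callables keeps the caching logic testable without Java or LibreOffice installed.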

**Table Structure Detection:** Available in both Standard and VLM modes. The Standard pipeline uses `do_table_structure=True` with `TableStructureOptions(do_cell_matching=True)`. The VLM pipeline provides enhanced detection.
**Table Structure Detection:** Available in both Standard and Hybrid modes. Standard mode uses local extraction only; Hybrid mode adds AI-enhanced detection for complex pages.

**Caching:**

@@ -148,10 +149,15 @@ To force re-conversion:
| Library | Purpose |
|---------|---------|
| `convert-lo` | DOCX/DOC to PDF conversion via LibreOffice |
| `docling` | PDF/DOCX to markdown text extraction |
| `opendataloader-pdf` | PDF/DOCX to markdown text extraction (#1 in benchmarks, 0.907 accuracy) |
| `opendataloader-pdf[hybrid]` | AI-enhanced extraction for complex pages (optional) |
| `litellm` | LLM summarization |
| `whatthespec` | TDoc metadata lookup |

**Requirements:**

- Java 11+ (required by OpenDataLoader)

## CLI Commands

```bash