📝 docs(ai): update installation instructions for opendataloader_pdf and hybrid mode (f6be862d) · Commits · Jan Reimes / 3gpp-crawler

docs/ai.md

+6 −2

		@@ -426,7 +426,7 @@ ______________________________________________________________________

		### Installation Issues

		Problem: `ModuleNotFoundError: No module named 'docling'`
		Problem: `ModuleNotFoundError: No module named 'opendataloader_pdf'`

		Solution: Install the AI optional dependencies:

		@@ -434,6 +434,10 @@ ______________________________________________________________________
		uv add 3gpp-crawler[ai]
		```
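A quick way to confirm which optional modules are still missing after installing the extra is to ask the import system directly. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module names are the ones that appear in the error messages in this section.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the error messages in this section.
    missing = missing_modules(["opendataloader_pdf", "lancedb"])
    if missing:
        print("Missing optional modules:", ", ".join(missing))
    else:
        print("All checked modules are importable.")
```

If the list is non-empty, the interpreter running the check probably differs from the environment where the `[ai]` extra was installed.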

		Problem: `Java not found` or `opendataloader_pdf requires Java 11+`

		Solution: Install Java 11 or later and ensure it's on your system PATH. Download from https://adoptium.net/ or use your system's package manager.
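To verify the Java requirement before running the pipeline, something like the following can help. It is a sketch, not part of the project: the helper names `parse_java_major` and `java_major_version` are hypothetical, and the parsing assumes the usual `version "..."` banner printed by `java -version`.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_392").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version():
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    # The version banner is normally written to stderr, so check both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    version = java_major_version()
    if version is None or version < 11:
        print("Java 11+ not found on PATH (detected: %s)" % version)
    else:
        print("Java %d found" % version)
```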

		Problem: `lancedb not available`

		Solution: LanceDB is included in the `[ai]` extra. Reinstall:
		@@ -578,5 +582,5 @@ ______________________________________________________________________
		## Additional Resources

		- [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers) - Complete list of 100+ supported LLM providers
		- [Docling Documentation](https://github.com/docling-project/docling) - Document extraction library
		- [OpenDataLoader PDF Documentation](https://github.com/opendataloader-project/opendataloader-pdf) - PDF extraction library (#1 in benchmarks)
		- [LanceDB Documentation](https://lancedb.github.io/lancedb/) - Vector database

packages/3gpp-ai/docs/PIPELINE.md

+25 −19

		@@ -16,9 +16,9 @@ fetch_tdoc_files() ──► resolve_via_whatthespec()
		▼
		convert_tdoc_to_markdown()
		│
		├──► Already PDF? ──► docling (DocumentConverter)
		├──► Already PDF? ──► opendataloader_pdf
		│
		└──► DOCX/DOC? ──► convert-lo (LibreOffice) ──► PDF ──► docling
		└──► DOCX/DOC? ──► convert-lo (LibreOffice) ──► PDF ──► opendataloader_pdf
		│
		▼
		Cache to .ai/<id>.md
		@@ -29,35 +29,36 @@ LLM Summarization

		### VLM Pipeline (Optional)

		When `--vlm` flag is used, the pipeline uses the VLM-powered extraction:
		When the `--vlm` flag is used, the pipeline enables hybrid AI mode for complex PDF pages:

		```
		workspace process --vlm
		│
		▼
		▼ (for each document)
		│
		▼
		convert_tdoc_to_markdown(vlm_options=VlmOptions(...))
		convert_tdoc_to_markdown(vlm_options=VlmOptions(enable_hybrid=True, ...))
		│
		▼
		VlmPipeline (Granite Docling)
		├──► Enhanced picture descriptions (VLM-generated)
		└──► Enhanced formula enrichment (VLM-enhanced)
		OpenDataLoader with Hybrid Mode
		├──► Local extraction for simple pages (fast, deterministic)
		└──► AI backend (SmolVLM 256M) for complex pages (tables, formulas, pictures)
		│
		▼
		Standard Extraction Outputs + VLM Artifacts
		Standard Extraction Outputs + AI-Enhanced Artifacts
		```

		Key Differences:

		\| Feature \| Standard Pipeline \| VLM Pipeline \|
		\|---------\|-------------------\|---------------\|
		\| Pipeline \| `StandardPdfPipeline` \| `VlmPipeline` \|
		\| Table Structure \| ✅ `do_table_structure=True` \| ✅ Enabled \|
		\| Formula Enrichment \| ✅ `CodeFormulaVlmOptions` \| ✅ Enhanced \|
		\| Picture Description \| ❌ Not available \| ✅ VLM-generated \|
		\| GPU Required \| No \| Yes \|
		\| Feature \| Standard Pipeline \| Hybrid Mode \|
		\|---------\|-------------------\|-------------\|
		\| Backend \| `opendataloader_pdf` (local) \| `opendataloader_pdf[hybrid]` \|
		\| Table Structure \| ✅ Enabled \| ✅ Enabled \|
		\| Formula Enrichment \| ✅ Enabled \| ✅ Enhanced \|
		\| Picture Description \| ✅ Enabled \| ✅ AI-generated (SmolVLM) \|
		\| OCR for Scanned PDFs \| ✅ via `force_ocr` \| ✅ via `force_ocr` \|
		\| Java Required \| Yes (11+) \| Yes (11+) \|
		\| GPU Required \| No \| No (but the hybrid server needs an LLM) \|
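The split between local extraction and the AI backend can be pictured as a per-page routing decision. The sketch below only illustrates that idea and does not show how `opendataloader_pdf` decides internally: `PageInfo`, `choose_backend`, and `plan_extraction` are hypothetical names, and treating any page with tables, formulas, or pictures as "complex" is an assumption drawn from the diagram's wording.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical per-page summary; not part of opendataloader_pdf."""
    number: int
    has_tables: bool = False
    has_formulas: bool = False
    has_pictures: bool = False


def choose_backend(page):
    """Route complex pages to the AI backend and simple pages to local extraction."""
    is_complex = page.has_tables or page.has_formulas or page.has_pictures
    return "ai" if is_complex else "local"


def plan_extraction(pages):
    """Map each page number to the backend that would handle it."""
    return {page.number: choose_backend(page) for page in pages}


if __name__ == "__main__":
    pages = [PageInfo(1), PageInfo(2, has_tables=True), PageInfo(3, has_formulas=True)]
    print(plan_extraction(pages))
```

Under this assumption, a text-only document never reaches the AI backend, which matches the "fast, deterministic" description of the local path.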

		## Components

		@@ -86,10 +87,10 @@ Converts TDoc to markdown using full pipeline.

		1. Fetch TDoc files via `fetch_tdoc_files()`
		1. Convert to PDF if needed (via convert-lo / LibreOffice)
		1. Extract text using docling (Standard or VLM pipeline)
		1. Extract text using OpenDataLoader (Standard or Hybrid mode)
		1. Cache markdown to `.ai/<id>.md`
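The steps above can be sketched as a single function with the conversion and extraction stages passed in as callables. This is an illustrative outline under stated assumptions, not the project's implementation: `convert_with_cache`, `to_pdf`, `extract_markdown`, and the `force` parameter are hypothetical names, and the real `convert_tdoc_to_markdown()` signature is not shown in this document.

```python
from pathlib import Path


def convert_with_cache(doc_id, source, cache_dir, to_pdf, extract_markdown, force=False):
    """Convert `source` to markdown, reusing `<cache_dir>/<doc_id>.md` when present.

    `to_pdf` and `extract_markdown` are caller-supplied callables standing in
    for the LibreOffice conversion and the PDF extraction backend.
    """
    cache_file = Path(cache_dir) / f"{doc_id}.md"
    if cache_file.exists() and not force:
        return cache_file.read_text(encoding="utf-8")

    source = Path(source)
    # PDFs go straight to extraction; other formats are converted first.
    pdf_path = source if source.suffix.lower() == ".pdf" else to_pdf(source)
    markdown = extract_markdown(pdf_path)

    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

Passing the stages as arguments keeps the caching logic independent of the extraction backend, so swapping one extractor for another would not touch the cache handling.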

		Table Structure Detection: Available in both Standard and VLM modes. The Standard pipeline uses `do_table_structure=True` with `TableStructureOptions(do_cell_matching=True)`. The VLM pipeline provides enhanced detection.
		Table Structure Detection: Available in both Standard and Hybrid modes. Standard mode uses local extraction; Hybrid mode adds AI-enhanced detection for complex pages.

		Caching:

		@@ -148,10 +149,15 @@ To force re-conversion:
		\| Library \| Purpose \|
		\|---------\|---------\|
		\| `convert-lo` \| DOCX/DOC to PDF conversion via LibreOffice \|
		\| `docling` \| PDF/DOCX to markdown text extraction \|
		\| `opendataloader-pdf` \| PDF/DOCX to markdown text extraction (#1 in benchmarks, 0.907 accuracy) \|
		\| `opendataloader-pdf[hybrid]` \| AI-enhanced extraction for complex pages (optional) \|
		\| `litellm` \| LLM summarization \|
		\| `whatthespec` \| TDoc metadata lookup \|

		Requirements:

		- Java 11+ (required by OpenDataLoader)

		## CLI Commands

		```bash