feat(ai): add VLM pipeline integration with optional picture description and formula enrichment (e6fe1e33) · Commits · Jan Reimes / 3gpp-crawler

docs/ai.md

+48 −6

Original line number	Diff line number	Diff line
		@@ -5,7 +5,7 @@ The AI module provides intelligent document processing capabilities for 3GPP doc
		Key Features:

		- Classification - Identify main documents in multi-file TDoc folders
		- Extraction - Convert DOCX/PDF to Markdown with keyword extraction and language detection (via Kreuzberg)
		- Extraction - Convert DOCX/PDF to Markdown with keyword extraction and language detection (via Docling)
		- Structured Elements - Preserve tables, figures, and equations with stable markers and metadata
		- Embeddings - Generate semantic vector representations for similarity search
		- Summarization - Create AI-powered abstracts
		@@ -40,7 +40,7 @@ cd 3gpp-crawler
		uv sync --extra ai
		```

		All required dependencies (Kreuzberg, LiteLLM, sentence-transformers, LanceDB) are installed automatically.
		All required dependencies (Docling, LiteLLM, sentence-transformers, LanceDB) are installed automatically.

		Internally, AI capabilities are provided by the optional `3gpp-ai` package, which is pulled in by `3gpp-crawler[ai]`.

		@@ -179,6 +179,45 @@ When structured extraction is enabled, conversion and workspace processing may g
		- `*_figures.json`
		- `*_equations.json`

		### VLM Features (Optional)

		The AI module supports optional Vision-Language Model (VLM) features for enhanced document processing. These features are disabled by default and must be explicitly enabled.

		#### What VLM Provides

		| Feature | Description | Model |
		|---------|-------------|-------|
		| Picture Description | Generates detailed natural language descriptions of figures and diagrams | Granite Docling VLM |
		| Formula Enrichment | Provides enhanced LaTeX/MathML representation of mathematical formulas | Granite Docling VLM |

		#### GPU Requirements

		VLM features require a GPU with sufficient VRAM. Without a GPU, processing may fail or run very slowly. The standard pipeline (without VLM) works on CPU.

		#### Enabling VLM

		Use the `--vlm` flag with the workspace process command:

		```bash
		# Process with VLM features enabled
		tdoc-crawler ai workspace process -w my-project --vlm

		# Force reprocess with VLM
		tdoc-crawler ai workspace process -w my-project --vlm --force
		```

		When `--vlm` is specified, both `enable_picture_description` and `enable_formula_enrichment` are activated.
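
A minimal sketch of how a single `--vlm` switch could fan out to the two options named above. The `VlmOptions` dataclass shape and the `vlm_options_from_flag` helper are assumptions made for illustration; only the two field names are taken from this document, and the real implementation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VlmOptions:
    """Hypothetical container; only the two field names appear in the docs."""

    enable_picture_description: bool = False
    enable_formula_enrichment: bool = False


def vlm_options_from_flag(vlm: bool) -> VlmOptions:
    """Map the CLI --vlm switch onto both VLM features at once (assumed helper)."""
    return VlmOptions(
        enable_picture_description=vlm,
        enable_formula_enrichment=vlm,
    )
```

With this shape, `vlm_options_from_flag(False)` equals the default `VlmOptions()`, which matches the statement that VLM features are off unless explicitly enabled.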

		#### Standard vs VLM Pipeline

		| Aspect | Standard Pipeline | VLM Pipeline |
		|--------|-------------------|--------------|
		| Table Detection | ✅ Enabled (Docling) | ✅ Enabled |
		| Formula Enrichment | ✅ Basic (CodeFormula) | ✅ Enhanced (VLM) |
		| Picture Description | ❌ Not available | ✅ VLM-generated descriptions |
		| GPU Required | No | Yes |
		| Processing Speed | Faster | Slower |

		______________________________________________________________________

		## CLI Commands
		@@ -277,6 +316,9 @@ tdoc-crawler ai workspace process
		# Process with options
		tdoc-crawler ai workspace process -w my-project --force

		# Process with VLM features (requires GPU)
		tdoc-crawler ai workspace process -w my-project --vlm

		# Get workspace information with member counts
		tdoc-crawler ai workspace info my-project

		@@ -359,8 +401,8 @@ Legacy batch-processing helpers are removed. Use the LightRAG interfaces exposed

		## Supported File Types

		- DOCX - Primary format for extraction (via Kreuzberg)
		- PDF - Supported via Kreuzberg
		- DOCX - Primary format for extraction (via Docling)
		- PDF - Supported via Docling
		- XLSX - Handled as secondary files
		- PPTX - Handled as secondary files

		@@ -384,7 +426,7 @@ ______________________________________________________________________

		### Installation Issues

		Problem: `ModuleNotFoundError: No module named 'kreuzberg'`
		Problem: `ModuleNotFoundError: No module named 'docling'`

		Solution: Install the AI optional dependencies:

		@@ -536,5 +578,5 @@ ______________________________________________________________________
		## Additional Resources

		- [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers) - Complete list of 100+ supported LLM providers
		- [Kreuzberg Documentation](https://docs.kreuzberg.dev/) - Document extraction library
		- [Docling Documentation](https://github.com/docling-project/docling) - Document extraction library
		- [LanceDB Documentation](https://lancedb.github.io/lancedb/) - Vector database

packages/3gpp-ai/AGENTS.md

+1 −1

		@@ -153,7 +153,7 @@ from tdoc_ai import LightRAGConfig, TDocRAG, TDocProcessor

		## Extraction

		LightRAG uses `kreuzberg` for text extraction before chunking and ingestion.
		LightRAG uses `docling` for text, table, and figure extraction before chunking and ingestion.

		## Deprecated/Removed

packages/3gpp-ai/docs/PIPELINE.md

+52 −18

		@@ -16,9 +16,9 @@ fetch_tdoc_files() ──► resolve_via_whatthespec()
		▼
		convert_tdoc_to_markdown()
		│
		├──► Already PDF? ──► kreuzberg.extract_file_sync()
		├──► Already PDF? ──► docling (DocumentConverter)
		│
		└──► DOCX/DOC? ──► convert-lo (LibreOffice) ──► PDF ──► kreuzberg
		└──► DOCX/DOC? ──► convert-lo (LibreOffice) ──► PDF ──► docling
		│
		▼
		Cache to .ai/<id>.md
		@@ -27,6 +27,38 @@ convert_tdoc_to_markdown()
		LLM Summarization
		```
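
The routing in the diagram above can be sketched as a small dispatch on file suffix. This is an illustration only: `plan_conversion` and the stage labels are hypothetical names, and raising `ValueError` for other suffixes is an assumption, not the behaviour of `convert_tdoc_to_markdown()`.

```python
from pathlib import Path

PDF_SUFFIXES = {".pdf"}
OFFICE_SUFFIXES = {".docx", ".doc"}


def plan_conversion(source: Path) -> list[str]:
    """Return the ordered stages a file would pass through (illustrative only)."""
    suffix = source.suffix.lower()
    if suffix in PDF_SUFFIXES:
        # Already a PDF: hand it straight to the extractor.
        return ["docling"]
    if suffix in OFFICE_SUFFIXES:
        # Office formats are first rendered to PDF via LibreOffice (convert-lo).
        return ["convert-lo", "docling"]
    # Assumption: anything else is rejected in this sketch.
    raise ValueError(f"unsupported file type: {source.suffix!r}")
```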

		### VLM Pipeline (Optional)

		When the `--vlm` flag is used, the pipeline uses VLM-powered extraction:

		```
		workspace process --vlm
		│
		▼
		▼ (for each document)
		│
		▼
		convert_tdoc_to_markdown(vlm_options=VlmOptions(...))
		│
		▼
		VlmPipeline (Granite Docling)
		├──► Enhanced picture descriptions (VLM-generated)
		└──► Enhanced formula enrichment (VLM-enhanced)
		│
		▼
		Standard Extraction Outputs + VLM Artifacts
		```

		Key Differences:

		| Feature | Standard Pipeline | VLM Pipeline |
		|---------|-------------------|---------------|
		| Pipeline | `StandardPdfPipeline` | `VlmPipeline` |
		| Table Structure | ✅ `do_table_structure=True` | ✅ Enabled |
		| Formula Enrichment | ✅ `CodeFormulaVlmOptions` | ✅ Enhanced |
		| Picture Description | ❌ Not available | ✅ VLM-generated |
		| GPU Required | No | Yes |

		## Components

		### 1. fetch_tdoc_files()
		@@ -54,9 +86,11 @@ Converts TDoc to markdown using full pipeline.

		1. Fetch TDoc files via `fetch_tdoc_files()`
		1. Convert to PDF if needed (via convert-lo / LibreOffice)
		1. Extract text using kreuzberg
		1. Extract text using docling (Standard or VLM pipeline)
		1. Cache markdown to `.ai/<id>.md`

		Table Structure Detection: Available in both Standard and VLM modes. The Standard pipeline uses `do_table_structure=True` with `TableStructureOptions(do_cell_matching=True)`. The VLM pipeline provides enhanced detection.

		Caching:

		- Checks for existing `.md` file in `.ai` subdirectory
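
A minimal sketch of the cache check described above, assuming the `.ai` directory sits next to the document files. The helper names `cached_markdown_path` and `load_or_convert`, the `convert` callback, and the `force` parameter are hypothetical and stand in for whatever the real pipeline uses.

```python
from pathlib import Path
from typing import Callable


def cached_markdown_path(doc_dir: Path, doc_id: str) -> Path:
    """Location of the cached markdown: <doc_dir>/.ai/<id>.md (assumed layout)."""
    return doc_dir / ".ai" / f"{doc_id}.md"


def load_or_convert(
    doc_dir: Path,
    doc_id: str,
    convert: Callable[[], str],
    force: bool = False,
) -> str:
    """Return cached markdown if present, otherwise convert and cache it."""
    target = cached_markdown_path(doc_dir, doc_id)
    if target.exists() and not force:
        # Cache hit: skip the expensive conversion entirely.
        return target.read_text(encoding="utf-8")
    markdown = convert()
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(markdown, encoding="utf-8")
    return markdown
```

Passing `force=True` bypasses the cache and overwrites the stored file, which is one way a forced re-conversion could be modelled.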
		@@ -107,7 +141,7 @@ To force re-conversion:
		| Library | Purpose |
		|---------|---------|
		| `convert-lo` | DOCX/DOC to PDF conversion via LibreOffice |
		| `kreuzberg` | PDF/DOCX to markdown text extraction |
		| `docling` | PDF/DOCX to markdown text extraction |
		| `litellm` | LLM summarization |
		| `whatthespec` | TDoc metadata lookup |

packages/3gpp-ai/pyproject.toml

+4 −1

		@@ -17,11 +17,14 @@ dependencies = [
		"convert-lo",
		"doc2txt>=1.0.8",
		#"doc2txt>=1.0.8 @ git+https://github.com/Quantatirsk/doc2txt-pypi.git"
		"kreuzberg[all]>=4.0.0",
		"litellm>=1.81.15",
		"lightrag-hku[offline]>=1.4.9.3",
		"pg0-embedded>=0.12.0",
		"pydantic-settings>=2.13.1",
		"liteparse>=1.2.0",
		"docling[vlm]>=2.82.0",
		"transformers>=4.57.6",
		"docling-core[chunking]>=2.70.2",
		]

		[project.urls]

packages/3gpp-ai/ruff.toml

+1 −0

		@@ -66,6 +66,7 @@ max-locals = 20
		[lint.per-file-ignores]
		"tests/*.py" = ["S101", "S106", "PLR6301", "S603", "PLW1510"]
		"tests/**/*.py" = ["S101", "S106", "PLR6301", "S603", "PLW1510"]
		"threegpp_ai/operations/chunking.py" = ["PLC0415"]

		[lint.pydocstyle]
		convention = "google"