Commit e6fe1e33 authored by Jan Reimes

feat(ai): add VLM pipeline integration with optional picture description and formula enrichment

- Add VlmOptions dataclass with enable_picture_description and enable_formula_enrichment
- Integrate VlmPipeline with Granite Docling for picture description
- Add StandardPdfPipeline with formula enrichment via CodeFormulaVlmOptions
- Add --vlm flag to workspace process CLI command
- Remove TYPE_CHECKING guard, fix ConvertResult -> ConversionResult
- Update docs/ai.md and PIPELINE.md with VLM feature documentation
- Add HybridChunker wrapper for semantic chunking
parent 115e2321
+48 −6
@@ -5,7 +5,7 @@ The AI module provides intelligent document processing capabilities for 3GPP doc
**Key Features:**

- **Classification** - Identify main documents in multi-file TDoc folders
- **Extraction** - Convert DOCX/PDF to Markdown with keyword extraction and language detection (via Kreuzberg)
- **Extraction** - Convert DOCX/PDF to Markdown with keyword extraction and language detection (via Docling)
- **Structured Elements** - Preserve tables, figures, and equations with stable markers and metadata
- **Embeddings** - Generate semantic vector representations for similarity search
- **Summarization** - Create AI-powered abstracts
@@ -40,7 +40,7 @@ cd 3gpp-crawler
uv sync --extra ai
```

All required dependencies (Kreuzberg, LiteLLM, sentence-transformers, LanceDB) are installed automatically.
All required dependencies (Docling, LiteLLM, sentence-transformers, LanceDB) are installed automatically.

Internally, AI capabilities are provided by the optional `3gpp-ai` package, which is pulled in by `3gpp-crawler[ai]`.

@@ -179,6 +179,45 @@ When structured extraction is enabled, conversion and workspace processing may g
- `*_figures.json`
- `*_equations.json`

### VLM Features (Optional)

The AI module supports optional Vision-Language Model (VLM) features for enhanced document processing. These features are disabled by default and must be explicitly enabled.

#### What VLM Provides

| Feature | Description | Model |
|---------|-------------|-------|
| **Picture Description** | Generates detailed natural language descriptions of figures and diagrams | Granite Docling VLM |
| **Formula Enrichment** | Provides enhanced LaTeX/MathML representation of mathematical formulas | Granite Docling VLM |

#### GPU Requirements

VLM features are intended to run on a GPU with sufficient VRAM. Without a GPU, processing may fail or run very slowly. The standard pipeline (without VLM) works on CPU.

#### Enabling VLM

Use the `--vlm` flag with the workspace process command:

```bash
# Process with VLM features enabled
tdoc-crawler ai workspace process -w my-project --vlm

# Force reprocess with VLM
tdoc-crawler ai workspace process -w my-project --vlm --force
```

When `--vlm` is specified, both `enable_picture_description` and `enable_formula_enrichment` are activated.
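
As a rough illustration of how the flag could map onto the two options, here is a minimal sketch. The field names come from the commit message; the dataclass layout, the defaults, and the helper `vlm_options_from_flag` are assumptions made for illustration and may not match the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VlmOptions:
    """Illustrative stand-in; the real dataclass may carry more fields."""

    enable_picture_description: bool = False  # disabled by default, per the docs
    enable_formula_enrichment: bool = False  # disabled by default, per the docs


def vlm_options_from_flag(vlm: bool) -> VlmOptions:
    """Hypothetical helper: the --vlm flag toggles both features together."""
    return VlmOptions(
        enable_picture_description=vlm,
        enable_formula_enrichment=vlm,
    )
```

With this shape, `vlm_options_from_flag(False)` equals the default `VlmOptions()`, which mirrors the "disabled unless explicitly enabled" behavior described above.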

#### Standard vs VLM Pipeline

| Aspect | Standard Pipeline | VLM Pipeline |
|--------|-------------------|--------------|
| Table Detection | ✅ Enabled (Docling) | ✅ Enabled |
| Formula Enrichment | ✅ Basic (CodeFormula) | ✅ Enhanced (VLM) |
| Picture Description | ❌ Not available | ✅ VLM-generated descriptions |
| GPU Required | No | Yes |
| Processing Speed | Faster | Slower |

______________________________________________________________________

## CLI Commands
@@ -277,6 +316,9 @@ tdoc-crawler ai workspace process
# Process with options
tdoc-crawler ai workspace process -w my-project --force

# Process with VLM features (requires GPU)
tdoc-crawler ai workspace process -w my-project --vlm

# Get workspace information with member counts
tdoc-crawler ai workspace info my-project

@@ -359,8 +401,8 @@ Legacy batch-processing helpers are removed. Use the LightRAG interfaces exposed

## Supported File Types

- **DOCX** - Primary format for extraction (via Kreuzberg)
- **PDF** - Supported via Kreuzberg
- **DOCX** - Primary format for extraction (via Docling)
- **PDF** - Supported via Docling
- **XLSX** - Handled as secondary files
- **PPTX** - Handled as secondary files

@@ -384,7 +426,7 @@ ______________________________________________________________________

### Installation Issues

**Problem:** `ModuleNotFoundError: No module named 'kreuzberg'`
**Problem:** `ModuleNotFoundError: No module named 'docling'`

**Solution:** Install the AI optional dependencies:

@@ -536,5 +578,5 @@ ______________________________________________________________________
## Additional Resources

- [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers) - Complete list of 100+ supported LLM providers
- [Kreuzberg Documentation](https://docs.kreuzberg.dev/) - Document extraction library
- [Docling Documentation](https://github.com/docling-project/docling) - Document extraction library
- [LanceDB Documentation](https://lancedb.github.io/lancedb/) - Vector database
+1 −1
@@ -153,7 +153,7 @@ from tdoc_ai import LightRAGConfig, TDocRAG, TDocProcessor

## Extraction

LightRAG uses `kreuzberg` for text extraction before chunking and ingestion.
LightRAG uses `docling` for text, table, and figure extraction before chunking and ingestion.
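
The extract-then-chunk order described here can be sketched with docling's public converter and `HybridChunker`. This is only an assumption about how the pieces might fit together: the function name `chunk_pdf` is hypothetical, the chunker uses its default tokenizer settings, and the project's own wrapper may look different.

```python
def chunk_pdf(path: str) -> list[str]:
    """Convert one PDF with docling and return the text of each chunk (sketch)."""
    # Lazy imports keep docling optional until the function is actually called.
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(path)  # a ConversionResult
    chunker = HybridChunker()  # default tokenizer and size limits
    return [chunk.text for chunk in chunker.chunk(dl_doc=result.document)]
```

The returned strings are what would then be handed to ingestion; running the function requires the `docling` and `transformers` dependencies listed in the package metadata.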

## Deprecated/Removed

+52 −18
@@ -16,9 +16,9 @@ fetch_tdoc_files() ──► resolve_via_whatthespec()

convert_tdoc_to_markdown()

       ├──► Already PDF? ──► kreuzberg.extract_file_sync()
        ├──► Already PDF? ──► docling (DocumentConverter)

       └──► DOCX/DOC? ──► convert-lo (LibreOffice) ──► PDF ──► kreuzberg
        └──► DOCX/DOC? ──► convert-lo (LibreOffice) ──► PDF ──► docling


                     Cache to .ai/<id>.md
@@ -27,6 +27,38 @@ convert_tdoc_to_markdown()
LLM Summarization
```
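
The branching in the diagram above can be sketched as a small routing function. The helper `plan_conversion` and its return format are hypothetical; only the routing rule (PDF goes straight to docling, DOCX/DOC goes through convert-lo first) is taken from the diagram.

```python
from pathlib import Path

PDF_SUFFIXES = {".pdf"}
OFFICE_SUFFIXES = {".docx", ".doc"}


def plan_conversion(source: Path) -> list[str]:
    """Return the ordered conversion steps for one file (hypothetical helper)."""
    suffix = source.suffix.lower()
    if suffix in PDF_SUFFIXES:
        # Already PDF: hand the file straight to docling.
        return ["docling"]
    if suffix in OFFICE_SUFFIXES:
        # DOCX/DOC: convert-lo (LibreOffice) produces a PDF for docling.
        return ["convert-lo", "docling"]
    raise ValueError(f"unsupported input type: {source.name}")
```

For example, `plan_conversion(Path("report.docx"))` yields the two-step route, while a `.pdf` input skips the LibreOffice step.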

### VLM Pipeline (Optional)

When the `--vlm` flag is used, the pipeline switches to VLM-powered extraction:

```
workspace process --vlm


        ▼ (for each document)


convert_tdoc_to_markdown(vlm_options=VlmOptions(...))


VlmPipeline (Granite Docling)
        ├──► Enhanced picture descriptions (VLM-generated)
        └──► Enhanced formula enrichment (VLM-enhanced)


Standard Extraction Outputs + VLM Artifacts
```

**Key Differences:**

| Feature | Standard Pipeline | VLM Pipeline |
|---------|-------------------|---------------|
| Pipeline | `StandardPdfPipeline` | `VlmPipeline` |
| Table Structure | ✅ `do_table_structure=True` | ✅ Enabled |
| Formula Enrichment | ✅ `CodeFormulaVlmOptions` | ✅ Enhanced |
| Picture Description | ❌ Not available | ✅ VLM-generated |
| GPU Required | No | Yes |

## Components

### 1. fetch_tdoc_files()
@@ -54,9 +86,11 @@ Converts TDoc to markdown using full pipeline.

1. Fetch TDoc files via `fetch_tdoc_files()`
1. Convert to PDF if needed (via convert-lo / LibreOffice)
1. Extract text using kreuzberg
1. Extract text using docling (Standard or VLM pipeline)
1. Cache markdown to `.ai/<id>.md`

**Table Structure Detection:** Available in both Standard and VLM modes. The Standard pipeline uses `do_table_structure=True` with `TableStructureOptions(do_cell_matching=True)`. The VLM pipeline provides enhanced detection.
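
A minimal sketch of the standard-pipeline settings named above, using docling's `PdfPipelineOptions`. The function name `build_standard_converter` and its parameter are assumptions, and the sketch uses the plain `do_formula_enrichment` switch rather than showing how `CodeFormulaVlmOptions` is wired in the project.

```python
def build_standard_converter(formula_enrichment: bool = True):
    """Build a docling converter for the standard PDF pipeline (sketch)."""
    # Lazy imports: docling is only loaded when a converter is requested.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TableStructureOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions(
        do_table_structure=True,
        table_structure_options=TableStructureOptions(do_cell_matching=True),
        do_formula_enrichment=formula_enrichment,
    )
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

A caller would then run something like `build_standard_converter().convert(pdf_path).document.export_to_markdown()` to obtain the markdown that gets cached.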

**Caching:**

- Checks for existing `.md` file in `.ai` subdirectory
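
The cache check can be illustrated roughly as follows. The function `convert_with_cache`, its `convert` callback, and the `force` parameter are hypothetical names chosen for this sketch; only the `.ai/<id>.md` location comes from the steps above.

```python
from pathlib import Path
from typing import Callable


def convert_with_cache(
    tdoc_dir: Path,
    tdoc_id: str,
    convert: Callable[[], str],
    force: bool = False,
) -> str:
    """Return cached markdown if present, otherwise convert and cache it."""
    cache_file = tdoc_dir / ".ai" / f"{tdoc_id}.md"
    if cache_file.exists() and not force:
        # Cache hit: skip the expensive conversion entirely.
        return cache_file.read_text(encoding="utf-8")
    markdown = convert()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

Passing the conversion as a callback keeps the cache logic independent of whether the Standard or VLM pipeline produced the markdown.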
@@ -107,7 +141,7 @@ To force re-conversion:
| Library | Purpose |
|---------|---------|
| `convert-lo` | DOCX/DOC to PDF conversion via LibreOffice |
| `kreuzberg` | PDF/DOCX to markdown text extraction |
| `docling` | PDF/DOCX to markdown text extraction |
| `litellm` | LLM summarization |
| `whatthespec` | TDoc metadata lookup |

+4 −1
@@ -17,11 +17,14 @@ dependencies = [
    "convert-lo",
    "doc2txt>=1.0.8",
    #"doc2txt>=1.0.8 @ git+https://github.com/Quantatirsk/doc2txt-pypi.git"
    "kreuzberg[all]>=4.0.0",
    "litellm>=1.81.15",
    "lightrag-hku[offline]>=1.4.9.3",
    "pg0-embedded>=0.12.0",
    "pydantic-settings>=2.13.1",
    "liteparse>=1.2.0",
    "docling[vlm]>=2.82.0",
    "transformers>=4.57.6",
    "docling-core[chunking]>=2.70.2",
]

[project.urls]
+1 −0
@@ -66,6 +66,7 @@ max-locals = 20
[lint.per-file-ignores]
"tests/*.py" = ["S101", "S106", "PLR6301", "S603", "PLW1510"]
"tests/**/*.py" = ["S101", "S106", "PLR6301", "S603", "PLW1510"]
"threegpp_ai/operations/chunking.py" = ["PLC0415"]

[lint.pydocstyle]
convention = "google"