Commit 964cf302 authored by Jan Reimes's avatar Jan Reimes

🔥 chore: remove 3gpp-ai package and AI test suite

parent 40d7411e

packages/3gpp-ai/AGENTS.md

deleted 100644 → 0
+0 −63
# 3gpp-ai

AI package for wiki-first processing of 3GPP TDocs and specs.

## Architecture Rule

Use wiki-first as the only supported compile/query contract.

- Do not introduce additional query modes.
- Do not add fallback flags or alternate retrieval-mode toggles.
- Keep query metadata deterministic: `query_mode = "wiki-first"`.
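
To illustrate the fixed contract, here is a minimal sketch (the model and field names are hypothetical, not the package's actual API) that uses a `Literal` type so any other mode fails at construction time:

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical sketch: "wiki-first" is the only accepted query mode.
# A Literal annotation plus a runtime check keeps the metadata deterministic.
@dataclass(frozen=True)
class QueryMetadata:
    query_mode: Literal["wiki-first"] = "wiki-first"

    def __post_init__(self) -> None:
        if self.query_mode != "wiki-first":
            raise ValueError(f"unsupported query_mode: {self.query_mode!r}")

meta = QueryMetadata()
print(meta.query_mode)  # → wiki-first
```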

## Project Structure

Generate structure on demand from repository root:

```shell
rg --files | tree-cli --fromfile
```

## Commands

| Task | Command |
|------|---------|
| Lint | `ruff check packages/3gpp-ai tests/ai` |
| Test (package) | `uv run pytest tests/ai -v` |
| Test (single) | `uv run pytest tests/ai/test_wiki_contracts.py -v` |

## Configuration

Read settings from `TDC_AI_*` environment variables and use `CacheManager` for path resolution.

Required pattern:

```python
from tdoc_crawler.config import resolve_cache_manager
from threegpp_ai.config import AiConfig

manager = resolve_cache_manager()
config = AiConfig.from_env()
```

Never hardcode paths such as `~/.3gpp-crawler`.

## Code Guidelines

- Use type hints on all public functions.
- Keep imports at module top level.
- Use `logging` for diagnostics.
- Avoid introducing new dependencies unless required.

## Testing Expectations

When changing contracts, update tests in `tests/ai/` in the same change set.

- Contract model updates: `tests/ai/test_wiki_contracts.py`
- CLI surface updates: `tests/ai/test_extraction_profiles.py` and relevant CLI tests

## Never Do

- Add query contract values other than `wiki-first`.
- Add config or CLI switches that change the fixed query contract mode.
- Reintroduce retrieval-mode configuration that changes wiki-first behavior.

packages/3gpp-ai/README.md

deleted 100644 → 0
+0 −16
# 3gpp-ai

Optional AI extension package for `3gpp-crawler`.

This package contains AI-focused capabilities including:

- Document extraction and conversion
- Deterministic wiki compilation from extraction artifacts
- Citation-grounded wiki querying and summarization
- AI workspace management

Install via `3gpp-crawler` extras:

```bash
uv add "3gpp-crawler[ai]"
```

packages/3gpp-ai/docs/PIPELINE.md

deleted 100644 → 0
+0 −180
# Document Processing Pipeline

This document describes the document processing pipeline for converting and summarizing 3GPP TDocs.

## Pipeline Flow

```
summarize S4-250001 --force
        │
        ▼
fetch_tdoc_files() ──► resolve_via_whatthespec()
        │                      │
        │                      ▼
        │               checkout_tdoc() ──► Downloads if missing
        ▼
convert_tdoc_to_markdown()
        │
        ├──► Already PDF? ──► opendataloader_pdf
        │
        └──► DOCX/DOC? ──► convert-lo (LibreOffice) ──► PDF ──► opendataloader_pdf
        │
        ▼
Cache to .ai/<id>.md
        │
        ▼
LLM Summarization
```

### VLM Pipeline (Optional)

When the `--vlm` flag is used, the pipeline enables a hybrid AI mode for complex PDF pages:

```
workspace process --vlm
        │
        ▼ (for each document)
convert_tdoc_to_markdown(vlm_options=VlmOptions(enable_hybrid=True, ...))
        │
        ▼
OpenDataLoader with Hybrid Mode
        ├──► Local extraction for simple pages (fast, deterministic)
        └──► AI backend (SmolVLM 256M) for complex pages (tables, formulas, pictures)
        │
        ▼
Standard Extraction Outputs + AI-Enhanced Artifacts
```

## Extraction

Extraction always enables all artifact types: tables, figures, and equations.

| Feature | Standard Mode | Hybrid Mode |
|---------|---------------|-------------|
| Backend | `opendataloader_pdf` (local) | `opendataloader_pdf[hybrid]` |
| Table Structure | ✅ Enabled | ✅ Enabled |
| Formula Enrichment | ✅ Enabled | ✅ Enhanced |
| Picture Description | ✅ Enabled | ✅ AI-generated (SmolVLM) |
| OCR for Scanned PDFs | ✅ Automatic | ✅ Automatic |
| Java Required | Yes (11+) | Yes (11+) |
| GPU Required | No | No (but hybrid server needs LLM) |

## Components

### 1. fetch_tdoc_files()

**Location:** `threegpp_ai/operations/fetch_tdoc.py`

Fetches TDoc files from the checkout directory, or downloads them from the 3GPP FTP server.

**Pipeline:**

1. Resolve TDoc ID to metadata via WhatTheSpec
1. Calculate checkout path
1. Download via `checkout_tdoc()` if not in checkout
1. Find available file types (PDF, DOCX, DOC)

**Returns:** `TDocFiles` dataclass with paths to available documents

### 2. convert_tdoc_to_markdown()

**Location:** `threegpp_ai/operations/convert.py`

Converts a TDoc to markdown using the full pipeline.

**Pipeline:**

1. Fetch TDoc files via `fetch_tdoc_files()`
1. Convert to PDF if needed (via convert-lo / LibreOffice)
1. Extract text using OpenDataLoader (Standard or Hybrid mode)
1. Cache markdown to `.ai/<id>.md`

**Table Structure Detection:** Available in both Standard and Hybrid modes. The Standard mode uses local extraction. The Hybrid mode provides AI-enhanced detection for complex pages.

**Caching:**

- Checks for existing `.md` file in `.ai` subdirectory
- Only re-converts if `force=True` or cache miss

### 3. summarize_tdoc()

**Location:** `threegpp_ai/operations/summarize.py`

Generates an LLM-powered summary of the TDoc content.

**Pipeline:**

1. Get structured extraction via `extract_tdoc_structured()`
1. Prefer structured context (equations/tables/figures with provenance) and fall back to markdown-only when structured artifacts are unavailable
1. Truncate to `SUMMARY_INPUT_LIMIT` (8000 chars)
1. Generate summary via LiteLLM
1. Extract keywords via LiteLLM
1. Return `SummarizeResult`

**Output mode selector:**

- `--output-mode standard` (default): current summarize output shape
- `--output-mode wiki`: wiki-ready rendering with section headers and citation-friendly layout
- Mode only changes CLI rendering shape; summarize operation contract remains unchanged
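
Since the mode selector only changes rendering, a dispatch of this kind could look like the following sketch (the function name and exact layouts are hypothetical):

```python
def render_summary(summary: str, keywords: list[str], output_mode: str = "standard") -> str:
    """Render the fixed summarize result in either CLI output shape."""
    if output_mode == "wiki":
        # wiki-ready rendering: section headers and a citation-friendly layout
        return ("## Summary\n\n" + summary + "\n\n## Keywords\n\n"
                + "\n".join(f"- {k}" for k in keywords))
    # standard rendering: plain summary followed by a keyword line
    return f"{summary}\n\nKeywords: {', '.join(keywords)}"
```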

## File Type Priority

The pipeline prefers formats in this order:

| Priority | Format | Source |
|----------|--------|--------|
| 1 | PDF | Already PDF, extract directly |
| 2 | DOCX | Convert to PDF via LibreOffice |
| 3 | DOC | Convert to PDF via LibreOffice |

## Caching

Converted markdown files are cached in the TDoc checkout directory:

```
~/.3gpp-crawler/checkout/<wg>/<meeting>/Docs/<tdoc_id>/.ai/
└── <tdoc_id>.md
```

To force re-conversion:

```bash
3gpp-ai convert S4-250001 --force
3gpp-ai summarize S4-250001 --force
```

## Dependencies

| Library | Purpose |
|---------|---------|
| `convert-lo` | DOCX/DOC to PDF conversion via LibreOffice |
| `opendataloader-pdf` | PDF/DOCX to markdown text extraction (#1 in benchmarks, 0.907 accuracy) |
| `opendataloader-pdf[hybrid]` | AI-enhanced extraction for complex pages (optional) |
| `litellm` | LLM summarization |
| `whatthespec` | TDoc metadata lookup |

**Requirements:**

- Java 11+ (required by OpenDataLoader)

## CLI Commands

```bash
# Convert TDoc to markdown
3gpp-ai convert <tdoc_id> [--output FILE] [--force]

# Summarize TDoc
3gpp-ai summarize <tdoc_id> [--words N] [--force]
3gpp-ai summarize <tdoc_id> [--output-mode standard|wiki]
```

## Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| `TDocNotFoundError` | TDoc not found via WhatTheSpec | Check TDoc ID spelling |
| `ExtractionError` | No document files found | Run `tdoc-crawler crawl-tdocs` |
| `LlmConfigError` | LLM endpoint unreachable | Check `TDC_AI_LLM_API_KEY` |

packages/3gpp-ai/docs/config.md

deleted 100644 → 0
+0 −135
# 3GPP-AI Configuration

This document describes configuration for the 3gpp-ai package, which provides AI-powered document processing including embeddings, knowledge graphs, and LLM-based analysis.

## Shared Configuration

The 3gpp-ai package shares cache paths with the main 3gpp-crawler:

| Path | Description |
|------|-------------|
| `<cache_dir>/lightrag/` | AI cache directory |
| `<cache_dir>/lightrag/<model>/` | Embedding model-specific storage |

These paths are managed by `CacheManager` (from `tdoc_crawler.config`) and are the **single source of truth** for all file paths.

**Cache directory:** Determined by `TDC_CACHE_DIR` or `path.cache_dir` in `3gpp-crawler.toml`

## Configuration Methods

The 3gpp-ai package supports two configuration approaches:

### 1. Environment Variables (Default)

3gpp-ai reads `TDC_AI_*` environment variables directly:

| Variable | Description | Default |
|----------|-------------|---------|
| `TDC_AI_LLM_MODEL` | LLM model in `<provider>/<model>` format | `openrouter/openrouter/free` |
| `TDC_AI_LLM_API_BASE` | Custom LLM API base URL | (none) |
| `TDC_AI_LLM_API_KEY` | LLM API key (overrides provider-specific env vars) | (none) |
| `TDC_AI_EMBEDDING_MODEL` | Embedding model ID | `sentence-transformers/all-MiniLM-L6-v2` |
| `TDC_AI_MAX_CHUNK_SIZE` | Max tokens per chunk | `1000` |
| `TDC_AI_CHUNK_OVERLAP` | Token overlap between chunks | `100` |
| `TDC_AI_ABSTRACT_MIN_WORDS` | Minimum abstract word count | `150` |
| `TDC_AI_ABSTRACT_MAX_WORDS` | Maximum abstract word count | `250` |
| `TDC_AI_PARALLELISM` | Parallel workers for processing | `4` |
| `TDC_AI_CONVERT_PDF` | Convert Office docs to PDF | `false` |
| `TDC_AI_CONVERT_MD` | Extract markdown from PDFs | `false` |
| `TDC_AI_VLM` | Enable vision for figure descriptions | `false` |
| `TDC_GRAPH_QUERY_LEVEL` | Graph query level | `simple` |
| `TDC_LIGHTRAG_SHARED_STORAGE` | Shared embedding storage | `true` |
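
As a stdlib-only sketch of reading a few of these variables (the package itself depends on pydantic-settings, and the real `AiConfig` fields may differ):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class AiEnvConfig:
    """Hypothetical sketch; the real AiConfig is built differently."""
    llm_model: str
    max_chunk_size: int
    chunk_overlap: int

    @classmethod
    def from_env(cls) -> "AiEnvConfig":
        # Fall back to the documented defaults when a variable is unset.
        return cls(
            llm_model=os.environ.get("TDC_AI_LLM_MODEL", "openrouter/openrouter/free"),
            max_chunk_size=int(os.environ.get("TDC_AI_MAX_CHUNK_SIZE", "1000")),
            chunk_overlap=int(os.environ.get("TDC_AI_CHUNK_OVERLAP", "100")),
        )
```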

### 2. Config File Approach

You can use `3gpp-crawler.toml` as base config and `3gpp-ai.toml` for AI-specific overrides:

**3gpp-crawler.toml (base):**

```toml
[path]
cache_dir = "~/.3gpp-crawler"

[http]
timeout = 30
```

**3gpp-ai.toml (override):**

```toml
[ai]
llm_model = "openrouter/anthropic/claude-3-sonnet"
embedding_model = "ollama/nomic-embed-text"
```

## Path Configuration

All paths use `CacheManager` from `tdoc_crawler.config`:

```python
from tdoc_crawler.config import resolve_cache_manager

manager = resolve_cache_manager()
manager.ai_cache_dir       # ~/.3gpp-crawler/lightrag/
manager.ai_embed_dir("qwen3-embedding:0.6b")  # ~/.3gpp-crawler/lightrag/qwen3-embedding:0.6b/
```

**NEVER hardcode paths** like `~/.3gpp-crawler` - always use `CacheManager`.

## Model Formats

### LLM Models

Format: `<provider>/<model_name>`

Examples:

- `openrouter/openrouter/free` - Free tier
- `openrouter/anthropic/claude-3-sonnet` - Anthropic via OpenRouter
- `ollama/llama3` - Local Ollama

### Embedding Models

Format: `<provider>/<model_name>`

Examples:

- `sentence-transformers/all-MiniLM-L6-v2` - Default
- `ollama/nomic-embed-text` - Local Ollama
- `ollama/qwen3-embedding:0.6b` - Qwen embedding
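
Since the provider is always the first path segment, parsing splits on the first `/` only, which keeps nested model names like `openrouter/anthropic/claude-3-sonnet` intact. A sketch (the helper name is hypothetical):

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split '<provider>/<model_name>' on the first slash only."""
    provider, _, model_name = model_id.partition("/")
    if not model_name:
        raise ValueError(f"expected '<provider>/<model>' format, got {model_id!r}")
    return provider, model_name

print(split_model_id("openrouter/anthropic/claude-3-sonnet"))
# → ('openrouter', 'anthropic/claude-3-sonnet')
```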

## Processing Options

### Document Conversion

| Option | Description |
|--------|-------------|
| `TDC_AI_CONVERT_PDF` | Convert Office docs (Word, Excel, PowerPoint) to PDF |
| `TDC_AI_CONVERT_MD` | Extract markdown from PDFs using Docling |
| `TDC_AI_VLM` | Use vision model for figure descriptions |

### Chunking

| Option | Description | Default |
|--------|-------------|---------|
| `TDC_AI_MAX_CHUNK_SIZE` | Maximum tokens per chunk | 1000 |
| `TDC_AI_CHUNK_OVERLAP` | Overlap between chunks | 100 |
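
A minimal sketch of fixed-size chunking with overlap, operating on an already-tokenized list for simplicity (the real implementation presumably counts model tokens):

```python
def chunk_tokens(
    tokens: list[str], max_chunk_size: int = 1000, overlap: int = 100
) -> list[list[str]]:
    """Split tokens into chunks of at most max_chunk_size tokens,
    where consecutive chunks share `overlap` tokens of context."""
    if overlap >= max_chunk_size:
        raise ValueError("overlap must be smaller than max_chunk_size")
    chunks: list[list[str]] = []
    i = 0
    while i < len(tokens):
        chunks.append(tokens[i : i + max_chunk_size])
        if i + max_chunk_size >= len(tokens):
            break  # final chunk already reached the end of the input
        i += max_chunk_size - overlap
    return chunks
```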

### Graph Query Levels

| Level | Behavior |
|-------|----------|
| `simple` | Return count and list without synthesis |
| `medium` | Parse query keywords, filter nodes, generate simple text summary |
| `advanced` | Use LLM to synthesize answer from graph + embeddings (GraphRAG) |

## Decoupled Design

The 3gpp-ai package is designed to be **independently installable**:

- It reads `TDC_AI_*` env vars directly (not `TDocCrawlerConfig`)
- It uses `CacheManager` from tdoc_crawler for paths only
- This keeps packages decoupled while sharing infrastructure

For shared settings, use the main `3gpp-crawler.toml` file.
For AI-specific settings, use `TDC_AI_*` env vars or `3gpp-ai.toml`.

packages/3gpp-ai/pyproject.toml

deleted 100644 → 0
+0 −48
[project]
name = "3gpp-ai"
version = "0.1.0"
description = "Optional AI/RAG extension package for 3gpp-crawler"
authors = [{ name = "Jan Reimes", email = "jan.reimes@head-acoustics.com" }]
readme = "README.md"
keywords = ["python", "3gpp", "rag", "ai"]
requires-python = ">=3.14,<4.0"
classifiers = [
    "Intended Audience :: Developers",
    "Programming Language :: Python",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.14",
    "Topic :: Software Development :: Libraries :: Python Modules",
]
dependencies = [
    "convert-lo",
    "doc2txt>=1.0.8",
    "litellm>=1.81.15",
    "pydantic-settings>=2.13.1",
    "liteparse>=1.2.0",
    "opendataloader-pdf[hybrid]>=2.2.0",
]

[project.urls]
Repository = "https://forge.3gpp.org/rep/reimes/3gpp-crawler"

[project.scripts]
3gpp-ai = "threegpp_ai.cli:app"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.metadata]
allow-direct-references = true

[tool.hatch.build.targets.wheel]
packages = ["threegpp_ai"]

[tool.uv.sources]
# The doc2txt repository contains pyproject.toml AND setup.py/setup.cfg,
# which causes installation of unnecessary additional dependencies.
# If compiler issues arise because of this, consider switching to ...
# - the git+https installation method (commented out above).
# - or a dedicated local workspace package (copied and improved from doc2txt) with a simplified pyproject.toml that only includes the dependencies 3gpp-ai actually needs.
doc2txt = { git = "https://github.com/Quantatirsk/doc2txt-pypi.git" }
convert-lo = { workspace = true }