Commit 964cf302 authored by Jan Reimes's avatar Jan Reimes

🔥 chore: remove 3gpp-ai package and AI test suite

parent 40d7411e

packages/3gpp-ai/AGENTS.md

deleted 100644 → 0
+0 −63
# 3gpp-ai

AI package for wiki-first processing of 3GPP TDocs and specs.

## Architecture Rule

Use wiki-first as the only supported compile/query contract.

- Do not introduce additional query modes.
- Do not add fallback flags or alternate retrieval-mode toggles.
- Keep query metadata deterministic: `query_mode = "wiki-first"`.
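
To illustrate the fixed contract, here is a minimal sketch (the model and field names are hypothetical, not the package's actual API) that uses a `Literal` type so any other mode fails at construction time:

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical sketch: "wiki-first" is the only accepted query mode.
# A Literal annotation plus a runtime check keeps the metadata deterministic.
@dataclass(frozen=True)
class QueryMetadata:
    query_mode: Literal["wiki-first"] = "wiki-first"

    def __post_init__(self) -> None:
        if self.query_mode != "wiki-first":
            raise ValueError(f"unsupported query_mode: {self.query_mode!r}")

meta = QueryMetadata()
print(meta.query_mode)  # → wiki-first
```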

## Project Structure

Generate structure on demand from repository root:

```shell
rg --files | tree-cli --fromfile
```

## Commands

| Task | Command |
|------|---------|
| Lint | `ruff check packages/3gpp-ai tests/ai` |
| Test (package) | `uv run pytest tests/ai -v` |
| Test (single) | `uv run pytest tests/ai/test_wiki_contracts.py -v` |

## Configuration

Read settings from `TDC_AI_*` environment variables and use `CacheManager` for path resolution.

Required pattern:

```python
from tdoc_crawler.config import resolve_cache_manager
from threegpp_ai.config import AiConfig

manager = resolve_cache_manager()
config = AiConfig.from_env()
```

Never hardcode paths such as `~/.3gpp-crawler`.

## Code Guidelines

- Use type hints on all public functions.
- Keep imports at module top level.
- Use `logging` for diagnostics.
- Avoid introducing new dependencies unless required.

## Testing Expectations

When changing contracts, update tests in `tests/ai/` in the same change set.

- Contract model updates: `tests/ai/test_wiki_contracts.py`
- CLI surface updates: `tests/ai/test_extraction_profiles.py` and relevant CLI tests

## Never Do

- Add query contract values other than `wiki-first`.
- Add config or CLI switches that change the fixed query contract mode.
- Reintroduce retrieval-mode configuration that changes wiki-first behavior.

packages/3gpp-ai/README.md

deleted 100644 → 0
+0 −16
# 3gpp-ai

Optional AI extension package for `3gpp-crawler`.

This package contains AI-focused capabilities including:

- Document extraction and conversion
- Deterministic wiki compilation from extraction artifacts
- Citation-grounded wiki querying and summarization
- AI workspace management

Install via `3gpp-crawler` extras:

```bash
uv add "3gpp-crawler[ai]"
```

packages/3gpp-ai/docs/PIPELINE.md

deleted 100644 → 0
+0 −180
# Document Processing Pipeline

This document describes the document processing pipeline for converting and summarizing 3GPP TDocs.

## Pipeline Flow

```
summarize S4-250001 --force
        │
        ▼
fetch_tdoc_files() ──► resolve_via_whatthespec()
        │                      │
        │                      ▼
        │               checkout_tdoc() ──► Downloads if missing
        ▼
convert_tdoc_to_markdown()
        │
        ├──► Already PDF? ──► opendataloader_pdf
        │
        └──► DOCX/DOC? ──► convert-lo (LibreOffice) ──► PDF ──► opendataloader_pdf
        │
        ▼
Cache to .ai/<id>.md
        │
        ▼
LLM Summarization
```

### VLM Pipeline (Optional)

When the `--vlm` flag is used, the pipeline enables a hybrid AI mode for complex PDF pages:

```
workspace process --vlm
        │
        ▼ (for each document)
convert_tdoc_to_markdown(vlm_options=VlmOptions(enable_hybrid=True, ...))
        │
        ▼
OpenDataLoader with Hybrid Mode
        ├──► Local extraction for simple pages (fast, deterministic)
        └──► AI backend (SmolVLM 256M) for complex pages (tables, formulas, pictures)
        │
        ▼
Standard Extraction Outputs + AI-Enhanced Artifacts
```

## Extraction

Extraction always enables all artifact types: tables, figures, and equations.

| Feature | Standard Mode | Hybrid Mode |
|---------|---------------|-------------|
| Backend | `opendataloader_pdf` (local) | `opendataloader_pdf[hybrid]` |
| Table Structure | ✅ Enabled | ✅ Enabled |
| Formula Enrichment | ✅ Enabled | ✅ Enhanced |
| Picture Description | ✅ Enabled | ✅ AI-generated (SmolVLM) |
| OCR for Scanned PDFs | ✅ Automatic | ✅ Automatic |
| Java Required | Yes (11+) | Yes (11+) |
| GPU Required | No | No (but hybrid server needs LLM) |

## Components

### 1. fetch_tdoc_files()

**Location:** `threegpp_ai/operations/fetch_tdoc.py`

Fetches TDoc files from the checkout directory, or downloads them from the 3GPP FTP server.

**Pipeline:**

1. Resolve TDoc ID to metadata via WhatTheSpec
1. Calculate checkout path
1. Download via `checkout_tdoc()` if not in checkout
1. Find available file types (PDF, DOCX, DOC)

**Returns:** `TDocFiles` dataclass with paths to available documents

### 2. convert_tdoc_to_markdown()

**Location:** `threegpp_ai/operations/convert.py`

Converts a TDoc to markdown using the full pipeline.

**Pipeline:**

1. Fetch TDoc files via `fetch_tdoc_files()`
1. Convert to PDF if needed (via convert-lo / LibreOffice)
1. Extract text using OpenDataLoader (Standard or Hybrid mode)
1. Cache markdown to `.ai/<id>.md`

**Table Structure Detection:** Available in both Standard and Hybrid modes. The Standard mode uses local extraction. The Hybrid mode provides AI-enhanced detection for complex pages.

**Caching:**

- Checks for existing `.md` file in `.ai` subdirectory
- Only re-converts if `force=True` or cache miss

### 3. summarize_tdoc()

**Location:** `threegpp_ai/operations/summarize.py`

Generates an LLM-powered summary of the TDoc content.

**Pipeline:**

1. Get structured extraction via `extract_tdoc_structured()`
1. Prefer structured context (equations/tables/figures with provenance) and fall back to markdown-only when structured artifacts are unavailable
1. Truncate to `SUMMARY_INPUT_LIMIT` (8000 chars)
1. Generate summary via LiteLLM
1. Extract keywords via LiteLLM
1. Return `SummarizeResult`

**Output mode selector:**

- `--output-mode standard` (default): current summarize output shape
- `--output-mode wiki`: wiki-ready rendering with section headers and citation-friendly layout
- Mode only changes CLI rendering shape; summarize operation contract remains unchanged
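
Since the mode selector only changes rendering, a dispatch of this kind could look like the following sketch (the function name and exact layouts are hypothetical):

```python
def render_summary(summary: str, keywords: list[str], output_mode: str = "standard") -> str:
    """Render the fixed summarize result in either CLI output shape."""
    if output_mode == "wiki":
        # wiki-ready rendering: section headers and a citation-friendly layout
        return ("## Summary\n\n" + summary + "\n\n## Keywords\n\n"
                + "\n".join(f"- {k}" for k in keywords))
    # standard rendering: plain summary followed by a keyword line
    return f"{summary}\n\nKeywords: {', '.join(keywords)}"
```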

## File Type Priority

The pipeline prefers formats in this order:

| Priority | Format | Source |
|----------|--------|--------|
| 1 | PDF | Already PDF, extract directly |
| 2 | DOCX | Convert to PDF via LibreOffice |
| 3 | DOC | Convert to PDF via LibreOffice |

## Caching

Converted markdown files are cached in the TDoc checkout directory:

```
~/.3gpp-crawler/checkout/<wg>/<meeting>/Docs/<tdoc_id>/.ai/
└── <tdoc_id>.md
```

To force re-conversion:

```bash
3gpp-ai convert S4-250001 --force
3gpp-ai summarize S4-250001 --force
```

## Dependencies

| Library | Purpose |
|---------|---------|
| `convert-lo` | DOCX/DOC to PDF conversion via LibreOffice |
| `opendataloader-pdf` | PDF/DOCX to markdown text extraction (#1 in benchmarks, 0.907 accuracy) |
| `opendataloader-pdf[hybrid]` | AI-enhanced extraction for complex pages (optional) |
| `litellm` | LLM summarization |
| `whatthespec` | TDoc metadata lookup |

**Requirements:**

- Java 11+ (required by OpenDataLoader)

## CLI Commands

```bash
# Convert TDoc to markdown
3gpp-ai convert <tdoc_id> [--output FILE] [--force]

# Summarize TDoc
3gpp-ai summarize <tdoc_id> [--words N] [--force]
3gpp-ai summarize <tdoc_id> [--output-mode standard|wiki]
```

## Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| `TDocNotFoundError` | TDoc not found via WhatTheSpec | Check TDoc ID spelling |
| `ExtractionError` | No document files found | Run `tdoc-crawler crawl-tdocs` |
| `LlmConfigError` | LLM endpoint unreachable | Check `TDC_AI_LLM_API_KEY` |

packages/3gpp-ai/docs/config.md

deleted 100644 → 0
+0 −135
# 3GPP-AI Configuration

This document describes configuration for the 3gpp-ai package, which provides AI-powered document processing including embeddings, knowledge graphs, and LLM-based analysis.

## Shared Configuration

The 3gpp-ai package shares cache paths with the main 3gpp-crawler:

| Path | Description |
|------|-------------|
| `<cache_dir>/lightrag/` | AI cache directory |
| `<cache_dir>/lightrag/<model>/` | Embedding model-specific storage |

These paths are managed by `CacheManager` (from `tdoc_crawler.config`) and are the **single source of truth** for all file paths.

**Cache directory:** Determined by `TDC_CACHE_DIR` or `path.cache_dir` in `3gpp-crawler.toml`

## Configuration Methods

The 3gpp-ai package supports two configuration approaches:

### 1. Environment Variables (Default)

3gpp-ai reads `TDC_AI_*` environment variables directly:

| Variable | Description | Default |
|----------|-------------|---------|
| `TDC_AI_LLM_MODEL` | LLM model in `<provider>/<model>` format | `openrouter/openrouter/free` |
| `TDC_AI_LLM_API_BASE` | Custom LLM API base URL | (none) |
| `TDC_AI_LLM_API_KEY` | LLM API key (overrides provider-specific env vars) | (none) |
| `TDC_AI_EMBEDDING_MODEL` | Embedding model ID | `sentence-transformers/all-MiniLM-L6-v2` |
| `TDC_AI_MAX_CHUNK_SIZE` | Max tokens per chunk | `1000` |
| `TDC_AI_CHUNK_OVERLAP` | Token overlap between chunks | `100` |
| `TDC_AI_ABSTRACT_MIN_WORDS` | Minimum abstract word count | `150` |
| `TDC_AI_ABSTRACT_MAX_WORDS` | Maximum abstract word count | `250` |
| `TDC_AI_PARALLELISM` | Parallel workers for processing | `4` |
| `TDC_AI_CONVERT_PDF` | Convert Office docs to PDF | `false` |
| `TDC_AI_CONVERT_MD` | Extract markdown from PDFs | `false` |
| `TDC_AI_VLM` | Enable vision for figure descriptions | `false` |
| `TDC_GRAPH_QUERY_LEVEL` | Graph query level | `simple` |
| `TDC_LIGHTRAG_SHARED_STORAGE` | Shared embedding storage | `true` |
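
As a stdlib-only sketch of reading a few of these variables (the package itself depends on pydantic-settings, and the real `AiConfig` fields may differ):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class AiEnvConfig:
    """Hypothetical sketch; the real AiConfig is built differently."""
    llm_model: str
    max_chunk_size: int
    chunk_overlap: int

    @classmethod
    def from_env(cls) -> "AiEnvConfig":
        # Fall back to the documented defaults when a variable is unset.
        return cls(
            llm_model=os.environ.get("TDC_AI_LLM_MODEL", "openrouter/openrouter/free"),
            max_chunk_size=int(os.environ.get("TDC_AI_MAX_CHUNK_SIZE", "1000")),
            chunk_overlap=int(os.environ.get("TDC_AI_CHUNK_OVERLAP", "100")),
        )
```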

### 2. Config File Approach

You can use `3gpp-crawler.toml` as base config and `3gpp-ai.toml` for AI-specific overrides:

**3gpp-crawler.toml (base):**

```toml
[path]
cache_dir = "~/.3gpp-crawler"

[http]
timeout = 30
```

**3gpp-ai.toml (override):**

```toml
[ai]
llm_model = "openrouter/anthropic/claude-3-sonnet"
embedding_model = "ollama/nomic-embed-text"
```

## Path Configuration

All paths use `CacheManager` from `tdoc_crawler.config`:

```python
from tdoc_crawler.config import resolve_cache_manager

manager = resolve_cache_manager()
manager.ai_cache_dir       # ~/.3gpp-crawler/lightrag/
manager.ai_embed_dir("qwen3-embedding:0.6b")  # ~/.3gpp-crawler/lightrag/qwen3-embedding:0.6b/
```

**NEVER hardcode paths** like `~/.3gpp-crawler` - always use `CacheManager`.

## Model Formats

### LLM Models

Format: `<provider>/<model_name>`

Examples:

- `openrouter/openrouter/free` - Free tier
- `openrouter/anthropic/claude-3-sonnet` - Anthropic via OpenRouter
- `ollama/llama3` - Local Ollama

### Embedding Models

Format: `<provider>/<model_name>`

Examples:

- `sentence-transformers/all-MiniLM-L6-v2` - Default
- `ollama/nomic-embed-text` - Local Ollama
- `ollama/qwen3-embedding:0.6b` - Qwen embedding
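
Since the provider is always the first path segment, parsing splits on the first `/` only, which keeps nested model names like `openrouter/anthropic/claude-3-sonnet` intact. A sketch (the helper name is hypothetical):

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split '<provider>/<model_name>' on the first slash only."""
    provider, _, model_name = model_id.partition("/")
    if not model_name:
        raise ValueError(f"expected '<provider>/<model>' format, got {model_id!r}")
    return provider, model_name

print(split_model_id("openrouter/anthropic/claude-3-sonnet"))
# → ('openrouter', 'anthropic/claude-3-sonnet')
```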

## Processing Options

### Document Conversion

| Option | Description |
|--------|-------------|
| `TDC_AI_CONVERT_PDF` | Convert Office docs (Word, Excel, PowerPoint) to PDF |
| `TDC_AI_CONVERT_MD` | Extract markdown from PDFs using Docling |
| `TDC_AI_VLM` | Use vision model for figure descriptions |

### Chunking

| Option | Description | Default |
|--------|-------------|---------|
| `TDC_AI_MAX_CHUNK_SIZE` | Maximum tokens per chunk | 1000 |
| `TDC_AI_CHUNK_OVERLAP` | Overlap between chunks | 100 |
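
A minimal sketch of fixed-size chunking with overlap, operating on an already-tokenized list for simplicity (the real implementation presumably counts model tokens):

```python
def chunk_tokens(
    tokens: list[str], max_chunk_size: int = 1000, overlap: int = 100
) -> list[list[str]]:
    """Split tokens into chunks of at most max_chunk_size tokens,
    where consecutive chunks share `overlap` tokens of context."""
    if overlap >= max_chunk_size:
        raise ValueError("overlap must be smaller than max_chunk_size")
    chunks: list[list[str]] = []
    i = 0
    while i < len(tokens):
        chunks.append(tokens[i : i + max_chunk_size])
        if i + max_chunk_size >= len(tokens):
            break  # final chunk already reached the end of the input
        i += max_chunk_size - overlap
    return chunks
```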

### Graph Query Levels

| Level | Behavior |
|-------|----------|
| `simple` | Return count and list without synthesis |
| `medium` | Parse query keywords, filter nodes, generate simple text summary |
| `advanced` | Use LLM to synthesize answer from graph + embeddings (GraphRAG) |

## Decoupled Design

The 3gpp-ai package is designed to be **independently installable**:

- It reads `TDC_AI_*` env vars directly (not `TDocCrawlerConfig`)
- It uses `CacheManager` from tdoc_crawler for paths only
- This keeps packages decoupled while sharing infrastructure

For shared settings, use the main `3gpp-crawler.toml` file.
For AI-specific settings, use `TDC_AI_*` env vars or `3gpp-ai.toml`.

packages/3gpp-ai/pyproject.toml

deleted 100644 → 0
+0 −48
[project]
name = "3gpp-ai"
version = "0.1.0"
description = "Optional AI/RAG extension package for 3gpp-crawler"
authors = [{ name = "Jan Reimes", email = "jan.reimes@head-acoustics.com" }]
readme = "README.md"
keywords = ["python", "3gpp", "rag", "ai"]
requires-python = ">=3.14,<4.0"
classifiers = [
    "Intended Audience :: Developers",
    "Programming Language :: Python",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.14",
    "Topic :: Software Development :: Libraries :: Python Modules",
]
dependencies = [
    "convert-lo",
    "doc2txt>=1.0.8",
    "litellm>=1.81.15",
    "pydantic-settings>=2.13.1",
    "liteparse>=1.2.0",
    "opendataloader-pdf[hybrid]>=2.2.0",
]

[project.urls]
Repository = "https://forge.3gpp.org/rep/reimes/3gpp-crawler"

[project.scripts]
3gpp-ai = "threegpp_ai.cli:app"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.metadata]
allow-direct-references = true

[tool.hatch.build.targets.wheel]
packages = ["threegpp_ai"]

[tool.uv.sources]
# The doc2txt repository contains pyproject.toml AND setup.py/setup.cfg,
# which causes installation of unnecessary additional dependencies.
# If compiler issues arise because of this, consider switching to ...
# - the git+https installation method (commented out above).
# - or a dedicated local workspace package (copied and improved from doc2txt) with a simplified pyproject.toml that only includes the dependencies 3gpp-ai actually needs.
doc2txt = { git = "https://github.com/Quantatirsk/doc2txt-pypi.git" }
convert-lo = { workspace = true }