Commit b273ecf2 authored by Jan Reimes's avatar Jan Reimes

🔥 docs: remove deprecated ai documentation

parent beb4144c

docs/ai.md

deleted 100644 → 0
+0 −598
# AI Document Processing

> **⚠️ Deprecated:** The `3gpp-ai` package has been removed from this repository. This documentation is kept for historical reference only. AI features (semantic search, knowledge graphs, summarization) are no longer available in the current codebase.

## Current Architecture

The project now uses a **wiki-first architecture**. Extraction artifacts are written directly to
`~/.3gpp-crawler/wiki/<workspace>/` during workspace processing. These artifacts can be consumed
by external wiki compiler tools such as `atomicmemory/llm-wiki-compiler` or `lucasastorian/llmwiki`.
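For concreteness, the artifact location described above can be resolved in a couple of lines. This is only an illustration; `workspace_artifact_dir` is a hypothetical helper, not part of the package:

```python
from pathlib import Path

def workspace_artifact_dir(workspace: str) -> Path:
    # Hypothetical helper (not part of 3gpp-crawler): resolves the
    # wiki-first artifact directory ~/.3gpp-crawler/wiki/<workspace>/
    return Path.home() / ".3gpp-crawler" / "wiki" / workspace

print(workspace_artifact_dir("my-project"))
```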

The AI module previously provided intelligent document processing capabilities for 3GPP document data, including semantic search, knowledge graph construction, and AI-powered summarization.

**Key Features:**

- **Classification** - Identify main documents in multi-file TDoc folders
- **Extraction** - Convert DOCX/PDF to Markdown with keyword extraction and language detection (via Docling)
- **Structured Elements** - Preserve tables, figures, and equations with stable markers and metadata
- **Embeddings** - Generate semantic vector representations for similarity search
- **Summarization** - Create AI-powered abstracts
- **Knowledge Graph** - Build relationships between TDocs
- **Workspaces** - Organize TDocs into logical groups for focused analysis

______________________________________________________________________

## Table of Contents

- [Installation](#installation)
- [Configuration](#configuration)
- [Workflow Guide](#workflow-guide)
- [CLI Commands](#cli-commands)
- [Model Providers](#model-providers)
- [Python API](#python-api)
- [Troubleshooting](#troubleshooting)

______________________________________________________________________

## Installation

The AI module is available as an optional dependency. Install it with:

```bash
# Install 3gpp-crawler with AI support
uv add 3gpp-crawler[ai]

# Or install from source
git clone https://forge.3gpp.org/rep/reimes/3gpp-crawler.git
cd 3gpp-crawler
uv sync --extra ai
```

All required dependencies (Docling, LiteLLM, sentence-transformers, LanceDB) are installed automatically.

Internally, AI capabilities are provided by the optional `3gpp-ai` package, which is pulled in by `3gpp-crawler[ai]`.

______________________________________________________________________

## Configuration

### Environment Variables

Configure AI processing via environment variables (see `.env.example`):

```bash
# LLM Configuration
TDC_AI_LLM_MODEL=openrouter/openrouter/free    # Default: free tier via OpenRouter
TDC_AI_LLM_API_KEY=your-api-key                # Optional: takes precedence over provider-specific keys
TDC_AI_LLM_API_BASE=                           # Optional: custom endpoint

# Embedding Model (HuggingFace sentence-transformers)
TDC_AI_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2   # Default: popular 384-dim model
TDC_AI_EMBEDDING_BACKEND=torch                                  # torch | onnx | openvino (default: torch)

# Storage
TDC_AI_STORE_PATH=                             # Defaults to <cache_dir>/.ai/lancedb
TDC_AI_MAX_CHUNK_SIZE=1000                     # Chunk size for embeddings
TDC_AI_CHUNK_OVERLAP=100                       # Overlap between chunks

# Summary Constraints
TDC_AI_ABSTRACT_MIN_WORDS=150
TDC_AI_ABSTRACT_MAX_WORDS=250

# Processing
TDC_AI_PARALLELISM=4                           # Parallel workers
```
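The two chunking settings interact as a sliding window: each chunk holds at most `TDC_AI_MAX_CHUNK_SIZE` characters and repeats the last `TDC_AI_CHUNK_OVERLAP` characters of its predecessor. A minimal sketch of that idea (illustrative only, not the package's actual chunker):

```python
def chunk_text(text: str, max_chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    # Sliding-window chunker mirroring TDC_AI_MAX_CHUNK_SIZE /
    # TDC_AI_CHUNK_OVERLAP; illustrative, not the package's implementation.
    if not text:
        return []
    step = max_chunk_size - chunk_overlap
    return [text[i : i + max_chunk_size] for i in range(0, len(text), step)]

# 2500 chars with a 900-char step -> chunks start at offsets 0, 900, 1800
chunks = chunk_text("x" * 2500)
```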

### Model Identifier Format

**LLMs** use the `<provider>/<model_name>` format (handled by LiteLLM):

```bash
TDC_AI_LLM_MODEL=openai/gpt-4o-mini
TDC_AI_LLM_MODEL=openrouter/anthropic/claude-3-sonnet
```

**Embedding models** use **HuggingFace model IDs** (handled by sentence-transformers):

```bash
TDC_AI_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
TDC_AI_EMBEDDING_MODEL=sentence-transformers/bert-base-nli-mean-tokens
TDC_AI_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
```

Embedding models are **local-only** (downloaded from HuggingFace) and require the model ID to be a valid HuggingFace model that works with sentence-transformers.
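A sketch of how such identifiers can be validated, reproducing the errors shown under Troubleshooting. The allowlist here is abridged and illustrative; the real check lives inside the package:

```python
SUPPORTED_PROVIDERS = {"openai", "anthropic", "openrouter", "ollama"}  # abridged for illustration

def parse_llm_model(model_id: str) -> tuple[str, str]:
    # Split on the FIRST slash only: model names may themselves contain
    # slashes (e.g. openrouter/anthropic/claude-3-sonnet).
    provider, sep, model_name = model_id.partition("/")
    if not sep or not model_name:
        raise ValueError("TDC_AI_LLM_MODEL must be in '<provider>/<model_name>' format")
    if provider.lower() not in SUPPORTED_PROVIDERS:
        raise ValueError(f"provider {provider!r} is not in supported provider allowlist")
    return provider.lower(), model_name
```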

______________________________________________________________________

## Workflow Guide

The AI module follows a workspace-based workflow for organizing and querying your document collection.

All examples below use the `3gpp-ai` CLI entrypoint. AI commands are provided by the standalone `3gpp-ai` package (installed via `3gpp-crawler[ai]`).

### 1. Create and Activate Workspace

```bash
# Create a new workspace for your project
3gpp-ai workspace create my-project

# Activate it so you don't need --workspace for other commands
3gpp-ai workspace activate my-project
```

Once activated, workspace commands use the active workspace by default, so you do not need to pass `-w` every time.

### 2. Add TDocs and Process

After adding TDocs to your workspace, process them to generate RAG/GraphRAG embeddings:

```bash
# Add TDocs to the active workspace
3gpp-ai workspace add-members --kind tdoc S4-251971 S4-251972

# Process all TDocs in workspace (only new ones)
3gpp-ai workspace process -w my-project

# Force reprocess all TDocs
3gpp-ai workspace process -w my-project --force
```

Note: If you created the workspace with `--auto-build`, documents are processed automatically when added.

### 3. Query Your Knowledge Base

Once you have a workspace with documents, query using the single RAG command that searches enriched text plus preserved table/figure/equation context:

```bash
# Query a workspace
3gpp-ai workspace query --workspace my-project "What are the bit rates in Table 3?"

# Same command for figure/equation questions
3gpp-ai workspace query --workspace my-project "Describe the architecture figure"
3gpp-ai workspace query --workspace my-project "What is the throughput equation?"
```

Note: `workspace query` is the only query entrypoint. Do not use separate table/figure/equation query commands.
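Under the hood, answering such a query starts with embedding the question and retrieving the nearest chunks before the LLM sees them. A toy sketch of that retrieval step in plain Python (the real pipeline uses sentence-transformers embeddings stored in LanceDB):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    # chunks: (text, embedding) pairs; return the k most similar texts.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```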

### 4. Workspace Maintenance

Keep your workspace clean and manage artifacts:

```bash
# Get detailed workspace information (member counts by type)
3gpp-ai workspace info my-project

# Remove invalid/inactive members
3gpp-ai workspace clear-invalid -w my-project

# Clear all AI artifacts (embeddings, summaries) while preserving members
3gpp-ai workspace clear -w my-project

# After clearing, re-process to regenerate artifacts
3gpp-ai workspace process -w my-project --force
```

### 5. Single TDoc Operations

Process a single TDoc through the pipeline (classification, extraction, embeddings, graph). Use `--accelerate` to choose the sentence-transformers backend.

```bash
3gpp-ai convert SP-240001 --output ./SP-240001.md
3gpp-ai summarize SP-240001 --words 200
```

When structured extraction is enabled, conversion and workspace processing may generate sidecars next to markdown artifacts:

- `*_tables.json`
- `*_figures.json`
- `*_equations.json`
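Since these sidecars are plain JSON, they are easy to consume downstream. The exact schema is not documented here, so the field names in this sketch (`tables`, `marker`, `caption`) are assumptions made purely for illustration:

```python
import json

# Hypothetical sidecar content -- the real *_tables.json schema may differ.
raw = '{"tables": [{"marker": "TABLE-1", "caption": "Bit rates"}]}'
sidecar = json.loads(raw)
for table in sidecar["tables"]:
    print(f'{table["marker"]}: {table["caption"]}')
```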

### VLM Features (Optional)

The AI module supports optional Vision-Language Model (VLM) features for enhanced document processing. These features are disabled by default and must be explicitly enabled.

#### What VLM Provides

| Feature | Description | Model |
|---------|-------------|-------|
| **Picture Description** | Generates detailed natural language descriptions of figures and diagrams | Granite Docling VLM |
| **Formula Enrichment** | Provides enhanced LaTeX/MathML representation of mathematical formulas | Granite Docling VLM |

#### GPU Requirements

VLM features require a GPU with sufficient VRAM; without one, processing will fail or run very slowly. The standard pipeline (without VLM) works on CPU.

#### Enabling VLM

Use the `--vlm` flag with the workspace process command:

```bash
# Process with VLM features enabled
3gpp-ai workspace process -w my-project --vlm

# Force reprocess with VLM
3gpp-ai workspace process -w my-project --vlm --force
```

When `--vlm` is specified, both `enable_picture_description` and `enable_formula_enrichment` are activated.
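In other words, the single flag fans out to the two options named above. A trivial sketch (option names as stated in this doc; how the package wires them into Docling is an assumption):

```python
def docling_vlm_options(vlm: bool) -> dict[str, bool]:
    # --vlm toggles both enrichment options together, per the doc above.
    return {
        "enable_picture_description": vlm,
        "enable_formula_enrichment": vlm,
    }
```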

#### Standard vs VLM Pipeline

| Aspect | Standard Pipeline | VLM Pipeline |
|--------|-------------------|--------------|
| Table Detection | ✅ Enabled (Docling) | ✅ Enabled |
| Formula Enrichment | ✅ Basic (CodeFormula) | ✅ Enhanced (VLM) |
| Picture Description | ❌ Not available | ✅ VLM-generated descriptions |
| GPU Required | No | Yes |
| Processing Speed | Faster | Slower |

______________________________________________________________________

## CLI Commands

### Workspace Management

```bash
# Create a new workspace
3gpp-ai workspace create <name> [--auto-build]

# List all workspaces
# Shows (*) next to the active workspace
3gpp-ai workspace list

# Activate a workspace (sets it as the default for workspace commands)
3gpp-ai workspace activate <name>

# Deactivate the active workspace
3gpp-ai workspace deactivate

# Get workspace details (name, status, member counts)
3gpp-ai workspace info <name>

# Remove invalid/inactive members from a workspace
3gpp-ai workspace clear-invalid [-w <name>]

# Clear all AI artifacts while preserving members
3gpp-ai workspace clear [-w <name>]

# Delete a workspace
3gpp-ai workspace delete <name>
```

Options for `workspace create`:

- `name`: Workspace name
- `--auto-build`: Automatically process documents when they are added to the workspace

### Querying

Query the knowledge base using semantic embeddings and the knowledge graph (RAG + GraphRAG).

```bash
# Query a specific workspace (single query command)
3gpp-ai workspace query --workspace <workspace_name> "your query here"
```

Note: `workspace query` is the single query interface; do not use separate table/figure/equation query commands. The query is a positional argument (no `--query` flag).

#### Summarize a TDoc

Summarize a single TDoc with specified word count.

```bash
3gpp-ai summarize <tdoc_id> [--words N] [--format markdown|json|yaml] [--json-output]
```

Options:

- `tdoc_id`: TDoc identifier (e.g., "RP-240001")
- `--words N`: Target word count for summary (default: 200)
- `--format`: Output format - markdown (default), json, or yaml
- `--json-output`: Output raw JSON

#### Convert a TDoc

Convert a single TDoc to markdown format.

```bash
3gpp-ai convert <tdoc_id> [--output FILE.md] [--json-output]
```

Options:

- `tdoc_id`: TDoc identifier
- `--output FILE.md`: Write output to file (prints to stdout if not specified)
- `--json-output`: Output raw JSON

### Workspace Members and Processing

Add TDocs to workspaces and process them to generate embeddings and knowledge graph.

```bash
# Add members to the active workspace
3gpp-ai workspace add-members --kind tdoc S4-251971 S4-251972

# Add members to a specific workspace
3gpp-ai workspace add-members -w my-project --kind tdoc S4-251971 S4-251972

# List members in the active workspace
3gpp-ai workspace list-members

# List members including inactive ones
3gpp-ai workspace list-members --include-inactive

# Process all TDocs in the active workspace
3gpp-ai workspace process

# Process with options
3gpp-ai workspace process -w my-project --force

# Process with VLM features (requires GPU)
3gpp-ai workspace process -w my-project --vlm

# Get workspace information with member counts
3gpp-ai workspace info my-project

# Remove invalid members (failed checkouts, etc.)
3gpp-ai workspace clear-invalid -w my-project

# Clear AI artifacts (keep members, remove embeddings/summaries)
3gpp-ai workspace clear -w my-project
```

______________________________________________________________________

## Model Providers

### Supported LLM Providers

| Provider | Example Model | API Key Env Var | Notes |
|----------|---------------|-----------------|-------|
| `openai` | `gpt-4o-mini`, `gpt-4o` | `OPENAI_API_KEY` | Industry standard |
| `anthropic` | `claude-3-haiku`, `claude-3-sonnet` | `ANTHROPIC_API_KEY` | High-quality reasoning |
| `openrouter` | `openrouter/free` | `OPENROUTER_API_KEY` | **Recommended** - Free tier available |
| `github_copilot` | `gpt-4o` | `GITHUB_COPILOT_API_KEY` | GitHub Copilot endpoint |
| `nvidia` | `meta/llama3-70b` | `NVIDIA_API_KEY` | NVIDIA NIM platform |
| `google` | `gemini-1.5-flash` | `GOOGLE_API_KEY` | Google AI Studio |
| `azure` | `gpt-4o` | `AZURE_API_KEY` | Azure OpenAI Service |
| `vertex_ai` | `gemini-pro` | `VERTEX_AI_API_KEY` | Google Cloud Vertex |
| `groq` | `llama-3.1-70b` | `GROQ_API_KEY` | Fast inference |
| `mistral` | `mistral-large` | `MISTRAL_API_KEY` | Mistral AI |
| `together_ai` | `meta-llama/Llama-3-70b` | `TOGETHER_API_KEY` | Together AI platform |
| `huggingface` | `mistralai/Mistral-7B` | `HF_API_KEY` | Hugging Face Inference |
| `ollama` | `llama3.2`, `mistral` | *(none)* | **Local** - No API key needed |
| `sambanova` | `Meta-Llama-3.1-70B` | `SAMBANOVA_API_KEY` | SambaNova Cloud |
| `fireworks_ai` | `accounts/fireworks/models/llama-v3-70b` | `FIREWORKS_API_KEY` | Fireworks AI |
| `anyscale` | `meta-llama/Llama-3-70b` | `ANYSCALE_API_KEY` | Anyscale Endpoints |
| `perplexity` | `pplx-7b-chat` | `PERPLEXITY_API_KEY` | Perplexity API |
| `deepinfra` | `meta-llama/Llama-3-70b` | `DEEPINFRA_API_KEY` | DeepInfra |

### Supported Embedding Providers

The embedding feature uses **sentence-transformers** (local HuggingFace models). This is the only supported approach for embeddings.

| HuggingFace Model | Dimension | Description |
|-------------------|-----------|-------------|
| `sentence-transformers/all-MiniLM-L6-v2` | 384 | **Default** - Fast, good quality |
| `sentence-transformers/bert-base-nli-mean-tokens` | 768 | High quality NLI model |
| `BAAI/bge-small-en-v1.5` | 384 | BGE small - strong retriever |
| `BAAI/bge-base-en-v1.5` | 768 | BGE base - higher quality |
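The dimension column matters because the vector store column must match the model's output size. A small lookup mirroring the table above (illustrative; `embedding_dim` is a hypothetical helper, and the package may instead infer the dimension at runtime):

```python
# Dimensions copied from the table above.
EMBEDDING_DIMS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "sentence-transformers/bert-base-nli-mean-tokens": 768,
    "BAAI/bge-small-en-v1.5": 384,
    "BAAI/bge-base-en-v1.5": 768,
}

def embedding_dim(model_id: str) -> int:
    # Look up the vector size needed when creating the store schema.
    try:
        return EMBEDDING_DIMS[model_id]
    except KeyError:
        raise ValueError(f"unknown embedding model: {model_id}") from None
```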

**Recommended Models:**

- Original models: <https://huggingface.co/models?num_parameters=min:0,max:3B&library=sentence-transformers,onnx&sort=trending&author=sentence-transformers>
- Community models: <https://huggingface.co/models?num_parameters=min:0,max:3B&library=sentence-transformers,onnx&sort=trending>

### Recommended Configuration

**Free/Local Setup (No API Cost):**

```bash
TDC_AI_LLM_MODEL=openrouter/openrouter/free
TDC_AI_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
OPENROUTER_API_KEY=your-free-api-key
```

**Production Setup (Higher Quality):**

```bash
TDC_AI_LLM_MODEL=anthropic/claude-3-sonnet
TDC_AI_EMBEDDING_MODEL=sentence-transformers/bert-base-nli-mean-tokens
ANTHROPIC_API_KEY=your-key
```

______________________________________________________________________

## Python API

Legacy batch-processing helpers are removed. Use the LightRAG interfaces exposed by the
`threegpp_ai` package for workspace processing and querying.

## Supported File Types

- **DOCX** - Primary format for extraction (via Docling)
- **PDF** - Supported via Docling
- **XLSX** - Handled as secondary files
- **PPTX** - Handled as secondary files

## Testing

Run AI tests:

```bash
# All AI tests
uv run pytest tests/ai -v

# Specific module
uv run pytest tests/ai/test_ai_extraction.py -v
```

Test data is located in `tests/ai/data/`.

______________________________________________________________________

## Troubleshooting

### Installation Issues

**Problem:** `ModuleNotFoundError: No module named 'opendataloader_pdf'`

**Solution:** Install the AI optional dependencies:

```bash
uv add 3gpp-crawler[ai]
```

**Problem:** `Java not found` or `opendataloader_pdf requires Java 11+`

**Solution:** Install Java 11 or later and ensure it is on your system `PATH`. Download it from <https://adoptium.net/> or use your system's package manager.

**Problem:** `lancedb not available`

**Solution:** LanceDB is included in the `[ai]` extra. Reinstall:

```bash
uv sync --extra ai
```

### Model Configuration Errors

**Problem:** `ValueError: TDC_AI_LLM_MODEL must be in '<provider>/<model_name>' format`

**Solution:** Ensure your model identifier includes a provider prefix:

```bash
# Wrong
TDC_AI_LLM_MODEL=gpt-4o-mini

# Correct
TDC_AI_LLM_MODEL=openai/gpt-4o-mini
```

**Problem:** `ValueError: provider 'xyz' is not in supported provider allowlist`

**Solution:** Check the provider name spelling. See [Model Providers](#model-providers) for the full list. Provider names are case-insensitive.

### API Key Errors

**Problem:** `litellm.AuthenticationError: Invalid API key`

**Solution:** Verify your API key is set correctly:

```bash
# For OpenAI
export OPENAI_API_KEY=sk-...

# For Anthropic
export ANTHROPIC_API_KEY=sk-ant-...

# For OpenRouter
export OPENROUTER_API_KEY=...

# Check if set
echo $OPENAI_API_KEY
```

**Problem:** `Missing API key for provider 'openai'`

**Solution:** LiteLLM expects the API key in a standard environment variable named `<PROVIDER>_API_KEY`. See the [Model Providers](#model-providers) table for the correct variable name for each provider.

**Alternative:** You can use `TDC_AI_LLM_API_KEY` as a universal API key that takes precedence over provider-specific keys. This is useful when you want to use a single API key across different providers (e.g., via OpenRouter or a proxy service):

```bash
export TDC_AI_LLM_API_KEY=your-key-here
```

If `TDC_AI_LLM_API_KEY` is set, it will be used instead of the provider-specific key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.).

### Embedding Model Issues

**Problem:** `OSError: No sentence-transformers model found`

**Solution:** If using a Hugging Face embedding model, ensure `sentence-transformers` is installed:

```bash
uv add sentence-transformers
```

### Workspace Issues

**Problem:** `Workspace 'my-project' not found`

**Solution:** Create the workspace first:

```bash
3gpp-ai workspace create my-project
```

**Alternative:** Use `summarize` or `convert` to work with individual TDocs directly; these commands fetch content from configured sources:

```bash
3gpp-ai summarize SP-240001
3gpp-ai convert SP-240001 --output SP-240001.md
```

### Query Errors

**Problem:** `TDoc 'SP-240001' not found`

**Solution:** Ensure the TDoc exists in your workspace or use `summarize`/`convert` which fetch from external sources:

```bash
3gpp-ai summarize SP-240001 --format markdown
```

**Problem:** `LLM API timeout`

**Solution:** Increase timeout or reduce token count:

```bash
# Increase timeout (if supported by provider)
export LITELLM_REQUEST_TIMEOUT=60

# Reduce max tokens
export TDC_AI_LLM_MAX_TOKENS=1000
```

### Performance Issues

**Problem:** Processing is very slow

**Solution:**

1. Increase parallelism:

   ```bash
   export TDC_AI_PARALLELISM=8
   ```

1. Use a faster LLM for summarization (e.g., `gpt-4o-mini` instead of `gpt-4o`)

1. For local models, ensure Ollama is running with GPU acceleration if available

### LanceDB Issues

**Problem:** `lancedb.errors.InternalError: Schema mismatch`

**Solution:** This can occur after upgrading the AI module. The LanceDB schema may need to be recreated. Backup your data and delete the LanceDB directory:

```bash
# Backup first!
cp -r ~/.3gpp-crawler/.ai/lancedb ~/.3gpp-crawler/.ai/lancedb.backup

# Delete and let it recreate
rm -rf ~/.3gpp-crawler/.ai/lancedb
```

**Note:** This will delete all processed embeddings and summaries. You'll need to re-process your documents.

______________________________________________________________________

## Additional Resources

- [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers) - Complete list of 100+ supported LLM providers
- [OpenDataLoader PDF Documentation](https://github.com/opendataloader-project/opendataloader-pdf) - PDF extraction library (#1 in benchmarks)
- [LanceDB Documentation](https://lancedb.github.io/lancedb/) - Vector database



+0 −5
@@ -7,7 +7,6 @@ Welcome to the documentation for **3gpp-crawler**, a command-line tool for query
## 📖 Table of Contents

- [**Crawl Documentation**](crawl.md) – How to fetch metadata from 3GPP servers and portal.
- [**AI Document Processing**](ai.md) – AI-powered document extraction and wiki-first architecture (legacy RAG deprecated).
- [**Query Documentation**](query.md) – How to search and display stored metadata.
- [**Utility Documentation**](utils.md) – File access, spec handling, and database inspection.
- [**WhatIsWhatTheSpec**](whatthespec.md) – Understanding the primary WhatTheSpec data source and 3GPP fallback.
@@ -25,9 +24,5 @@ Welcome to the documentation for **3gpp-crawler**, a command-line tool for query
- [**Query-Specs**](query.md#query-specs) (`qs`)
- [**Open TDoc**](utils.md#open)
- [**Checkout Specs**](utils.md#checkout-spec)
- **AI Commands** (via `3gpp-ai` CLI)
- [**AI Workspace**](ai.md#workspace-management) - Create and manage workspaces
- [**AI Query**](ai.md#querying) - Semantic search over TDocs
- [**AI Summarize/Convert**](ai.md#single-tdoc-operations) - Single TDoc operations

For a brief overview of all commands, see the [README.md](../README.md).