> **⚠️ Deprecated:** The `3gpp-ai` package has been removed from this repository. This documentation is kept for historical reference only. AI features (semantic search, knowledge graphs, summarization) are no longer available in the current codebase.
## Current Architecture
The project now uses a **wiki-first architecture**. Extraction artifacts are written directly to
`~/.3gpp-crawler/wiki/<workspace>/` during workspace processing. These artifacts can be consumed
by external wiki compiler tools such as `atomicmemory/llm-wiki-compiler` or `lucasastorian/llmwiki`.
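As a rough illustration, the artifact location described above can be resolved programmatically. Only the base path `~/.3gpp-crawler/wiki/<workspace>/` comes from this documentation; anything deeper in the layout is tool-specific:

```python
from pathlib import Path

def wiki_artifact_dir(workspace: str) -> Path:
    """Resolve the wiki artifact directory for a workspace.

    The base path ~/.3gpp-crawler/wiki/ is documented above; the
    contents written under it depend on the extraction step.
    """
    return Path.home() / ".3gpp-crawler" / "wiki" / workspace

# e.g. list what the extractor produced for a workspace
for artifact in sorted(wiki_artifact_dir("my-project").glob("*")):
    print(artifact.name)
```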
The AI module previously provided intelligent document processing capabilities for 3GPP document data, including semantic search, knowledge graph construction, and AI-powered summarization.
**Key Features:**
- **Classification** - Identify main documents in multi-file TDoc folders
- **Extraction** - Convert DOCX/PDF to Markdown with keyword extraction and language detection (via Docling)
- **Structured Elements** - Preserve tables, figures, and equations with stable markers and metadata
- **Embeddings** - Generate semantic vector representations for similarity search
- **Summarization** - Create AI-powered abstracts
- **Knowledge Graph** - Build relationships between TDocs
- **Workspaces** - Organize TDocs into logical groups for focused analysis
Embedding models are **local-only** (downloaded from HuggingFace) and require the model ID to be a valid HuggingFace model that works with sentence-transformers.
Note: If you created the workspace with `--auto-build`, documents are processed automatically when added.
### 3. Query Your Knowledge Base
Once you have a workspace with documents, query using the single RAG command that searches enriched text plus preserved table/figure/equation context:
```bash
# Query a workspace
3gpp-ai workspace query --workspace my-project "What are the bit rates in Table 3?"
# Same command for figure/equation questions
3gpp-ai workspace query --workspace my-project "Describe the architecture figure"
3gpp-ai workspace query --workspace my-project "What is the throughput equation?"
```
Note: `workspace query` is the only query entrypoint. Do not use separate table/figure/equation query commands.
### 4. Workspace Maintenance
Keep your workspace clean and manage artifacts:
```bash
# Get detailed workspace information (member counts by type)
3gpp-ai workspace info my-project
# Remove invalid/inactive members
3gpp-ai workspace clear-invalid -w my-project
# Clear all AI artifacts (embeddings, summaries) while preserving members
3gpp-ai workspace clear -w my-project
# After clearing, re-process to regenerate artifacts
3gpp-ai workspace process -w my-project --force
```
### 5. Single TDoc Operations
Process a single TDoc through the pipeline (classification, extraction, embeddings, graph). Use `--accelerate` to choose the sentence-transformers backend.
```bash
3gpp-ai convert SP-240001 --output ./SP-240001.md
3gpp-ai summarize SP-240001 --words 200
```
When structured extraction is enabled, conversion and workspace processing may generate sidecars next to markdown artifacts:
- `*_tables.json`
- `*_figures.json`
- `*_equations.json`
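The sidecars are plain JSON, so they can be consumed directly. A minimal sketch of loading one follows; the list-of-dicts shape and the `marker`/`caption` field names are assumptions for illustration, not the documented schema, so inspect a real sidecar before relying on specific fields:

```python
import json
from pathlib import Path

def load_sidecar(path: Path) -> list:
    """Load a *_tables.json / *_figures.json / *_equations.json sidecar.

    Assumes the sidecar is a JSON list of element records; the exact
    fields produced by extraction are not documented here.
    """
    return json.loads(path.read_text(encoding="utf-8"))
```

Typical use would be iterating the records to match stable markers back to the markdown artifact they sit next to.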
### VLM Features (Optional)
The AI module supports optional Vision-Language Model (VLM) features for enhanced document processing. These features are disabled by default and must be explicitly enabled.
#### What VLM Provides
| Feature | Description | Model |
|---------|-------------|-------|
| **Picture Description** | Generates detailed natural language descriptions of figures and diagrams | Granite Docling VLM |
VLM features require a GPU with sufficient VRAM. Without one, processing may fail or run very slowly. The standard pipeline (without VLM) works on CPU.
#### Enabling VLM
Use the `--vlm` flag with the workspace process command:
```bash
# Process with VLM features enabled
3gpp-ai workspace process -w my-project --vlm
# Force reprocess with VLM
3gpp-ai workspace process -w my-project --vlm --force
```
When `--vlm` is specified, both `enable_picture_description` and `enable_formula_enrichment` are activated.
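The two option names above are taken from the text; the config object itself is a hypothetical sketch (the module's real settings class is not shown here), just to make the flag's effect concrete:

```python
from dataclasses import dataclass

@dataclass
class PipelineOptions:
    """Illustrative stand-in for the pipeline configuration."""
    enable_picture_description: bool = False
    enable_formula_enrichment: bool = False

def apply_vlm_flag(vlm: bool) -> PipelineOptions:
    # --vlm switches on both enrichment features together,
    # as described in the paragraph above.
    return PipelineOptions(
        enable_picture_description=vlm,
        enable_formula_enrichment=vlm,
    )
```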
- Original models: <https://huggingface.co/models?num_parameters=min:0,max:3B&library=sentence-transformers,onnx&sort=trending&author=sentence-transformers>
- Community models: <https://huggingface.co/models?num_parameters=min:0,max:3B&library=sentence-transformers,onnx&sort=trending>
**Problem:** `ModuleNotFoundError: No module named 'opendataloader_pdf'`
**Solution:** Install the AI optional dependencies:
```bash
uv add "3gpp-crawler[ai]"
```
**Problem:** `Java not found` or `opendataloader_pdf requires Java 11+`
**Solution:** Install Java 11 or later and ensure it's on your system PATH. Download from <https://adoptium.net/> or use your system's package manager.
**Problem:** `lancedb not available`
**Solution:** LanceDB is included in the `[ai]` extra. Reinstall:
```bash
uv sync --extra ai
```
### Model Configuration Errors
**Problem:** `ValueError: TDC_AI_LLM_MODEL must be in '<provider>/<model_name>' format`
**Solution:** Ensure your model identifier includes a provider prefix:
```bash
# Wrong
TDC_AI_LLM_MODEL=gpt-4o-mini
# Correct
TDC_AI_LLM_MODEL=openai/gpt-4o-mini
```
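The validation described above can be sketched as a small parser. This is an illustrative re-implementation, not the module's actual code; the lowercasing reflects the note below that provider names are case-insensitive:

```python
def parse_model_id(model_id: str):
    """Split a '<provider>/<model_name>' identifier, as required
    by TDC_AI_LLM_MODEL, raising ValueError when either part is missing."""
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(
            "TDC_AI_LLM_MODEL must be in '<provider>/<model_name>' format"
        )
    return provider.lower(), name
```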
**Problem:** `ValueError: provider 'xyz' is not in supported provider allowlist`
**Solution:** Check the provider name spelling. See [Model Providers](#model-providers) for the full list. Provider names are case-insensitive.
### API Key Errors
**Problem:** `litellm.AuthenticationError: Invalid API key`
**Solution:** Verify your API key is set correctly:
```bash
# For OpenAI
export OPENAI_API_KEY=sk-...
# For Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
# For OpenRouter
export OPENROUTER_API_KEY=...
# Check if set
echo "$OPENAI_API_KEY"
```
**Problem:** `Missing API key for provider 'openai'`
**Solution:** LiteLLM expects the API key in a standard environment variable named `<PROVIDER>_API_KEY`. See the [Model Providers](#model-providers) table for the correct variable name for each provider.
**Alternative:** You can use `TDC_AI_LLM_API_KEY` as a universal API key that takes precedence over provider-specific keys. This is useful when you want to use a single API key across different providers (e.g., via OpenRouter or a proxy service):
```bash
export TDC_AI_LLM_API_KEY=your-key-here
```
If `TDC_AI_LLM_API_KEY` is set, it will be used instead of the provider-specific key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.).
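The precedence rule above can be summed up in a few lines. This is a sketch of the lookup behaviour as documented, not the module's actual resolution code:

```python
import os

def resolve_api_key(provider, env=None):
    """Return the API key for a provider, letting the universal
    TDC_AI_LLM_API_KEY take precedence over <PROVIDER>_API_KEY."""
    env = os.environ if env is None else env
    return env.get("TDC_AI_LLM_API_KEY") or env.get(f"{provider.upper()}_API_KEY")
```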
### Embedding Model Issues
**Problem:** `OSError: No sentence-transformers model found`
**Solution:** If using a Hugging Face embedding model, ensure `sentence-transformers` is installed:
```bash
uv add sentence-transformers
```
### Workspace Issues
**Problem:** `Workspace 'my-project' not found`
**Solution:** Create the workspace first:
```bash
3gpp-ai workspace create my-project
```
**Alternative:** Use `summarize` or `convert` to work with individual TDocs directly, without a workspace. These commands fetch content from configured sources:
```bash
3gpp-ai summarize SP-240001
3gpp-ai convert SP-240001 --output SP-240001.md
```
### Query Errors
**Problem:** `TDoc 'SP-240001' not found`
**Solution:** Ensure the TDoc exists in your workspace or use `summarize`/`convert` which fetch from external sources:
```bash
3gpp-ai summarize SP-240001 --format markdown
```
**Problem:** `LLM API timeout`
**Solution:** Increase timeout or reduce token count:
```bash
# Increase timeout (if supported by provider)
export LITELLM_REQUEST_TIMEOUT=60
# Reduce max tokens
export TDC_AI_LLM_MAX_TOKENS=1000
```
### Performance Issues
**Problem:** Processing is very slow
**Solution:**
1. Increase parallelism:

   ```bash
   export TDC_AI_PARALLELISM=8
   ```

2. Use a faster LLM for summarization (e.g., `gpt-4o-mini` instead of `gpt-4o`).
3. For local models, ensure Ollama is running with GPU acceleration if available.
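To show the kind of effect `TDC_AI_PARALLELISM` has, here is a generic sketch of fanning work out across that many workers. How the real pipeline schedules its jobs is not documented here, so treat this purely as an illustration of the setting:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_all(items, worker):
    """Run `worker` over items with the worker count taken from
    TDC_AI_PARALLELISM (defaulting to 4 when unset)."""
    workers = int(os.environ.get("TDC_AI_PARALLELISM", "4"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, items))
```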
**Solution:** This can occur after upgrading the AI module. The LanceDB schema may need to be recreated. Backup your data and delete the LanceDB directory: