Commit f65e215f authored by Jan Reimes

📝 docs(3gpp-ai): update documentation to reflect wiki-first architecture rule

parent 07935aa8
# 3gpp-ai

AI package for wiki-first processing of 3GPP TDocs and specs (knowledge graphs, semantic search, summarization).

## Project Structure

Generate the project structure on demand from the repository root:

```shell
rg --files | tree-cli --fromfile
```
## Architecture Rule

Use wiki-first as the only supported compile/query contract:

- Do not introduce additional query modes.
- Do not add fallback flags or alternate retrieval-mode toggles.
- Keep query metadata deterministic: `query_mode = "wiki-first"`.

## Key Design Patterns

### LightRAG Integration

The 3gpp-ai pipeline uses LightRAG for all document processing:

```python
from tdoc_ai import LightRAGConfig, TDocRAG, TDocProcessor

# Automatically reads TDC_AI_* environment variables
config = LightRAGConfig.from_env()
rag = TDocRAG(config)
await rag.start("my-workspace")
```
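One way to keep the query contract deterministic is a module-level constant that callers cannot override. The names below are an illustrative sketch, not part of the package API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: the query mode is a fixed constant, never a parameter.
QUERY_MODE = "wiki-first"

@dataclass(frozen=True)
class QueryMetadata:
    """Metadata attached to every query result (illustrative names)."""
    workspace: str
    # init=False keeps the mode out of the constructor signature entirely.
    query_mode: str = field(default=QUERY_MODE, init=False)

meta = QueryMetadata(workspace="my-workspace")
print(meta.query_mode)  # wiki-first
```

Because the dataclass is frozen and the field is excluded from `__init__`, there is no code path through which a caller can select another mode.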

### Main APIs

Use `TDocRAG` for workspace-level retrieval and `TDocProcessor` for per-document ingestion.

## Commands

| Task | Command |
|------|---------|
| Lint | `ruff check packages/3gpp-ai tests/ai` |
| Test (package) | `uv run pytest tests/ai -v` |
| Test (single) | `uv run pytest tests/ai/test_wiki_contracts.py -v` |

## Configuration

### LightRAG

Reads settings from `TDC_AI_*` environment variables (see `.env.example`) and uses `CacheManager` for path resolution:

- `TDC_AI_LLM_MODEL` - LLM model in `<provider>/<model>` format (default: `openrouter/openrouter/free`)
- `TDC_AI_LLM_API_BASE` - Custom LLM API base URL (optional)
- `TDC_AI_LLM_API_KEY` - LLM API key (optional, overrides provider-specific env vars)
- `TDC_AI_EMBEDDING_MODEL` - Embedding model ID (default: `sentence-transformers/all-MiniLM-L6-v2`)

LightRAG-specific variables:

- `LIGHTRAG_SHARED_STORAGE` - Enable shared embedding storage (default: `true`)
- `LIGHTRAG_DB_BACKEND` - Storage backend: `file` or `pg0` (default: `file`)
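As an illustration of how the `<provider>/<model>` convention for `TDC_AI_LLM_MODEL` can be consumed (a sketch under assumed names, not the actual `LightRAGConfig` implementation):

```python
import os

# Illustrative default mirroring the variable documented above.
DEFAULT_LLM_MODEL = "openrouter/openrouter/free"

def split_provider_model(model_id: str) -> tuple[str, str]:
    """Split '<provider>/<model>' at the first slash."""
    provider, _, model = model_id.partition("/")
    return provider, model

model_id = os.environ.get("TDC_AI_LLM_MODEL", DEFAULT_LLM_MODEL)
provider, model = split_provider_model(model_id)
print(provider, model)  # e.g. "openrouter" and "openrouter/free"
```

Splitting at the first slash only matters because model IDs themselves may contain slashes, as the `openrouter/openrouter/free` default shows.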

### Path Management

**CRITICAL:** All file paths use `CacheManager` from `tdoc_crawler.config`. Required pattern:

```python
from tdoc_crawler.config import resolve_cache_manager
from threegpp_ai.config import AiConfig

manager = resolve_cache_manager()
manager.ai_cache_dir         # ~/.3gpp-crawler/lightrag/
manager.ai_embed_dir(model)  # ~/.3gpp-crawler/lightrag/{model}/
config = AiConfig.from_env()
```

**NEVER hardcode paths** like `~/.3gpp-crawler` - see root `AGENTS.md` for the full CacheManager pattern.

## Single Source of Truth (SSOT) Principle

**Rule:** Every configuration value, constant, or shared resource must be defined **exactly once** and reused everywhere else.

### What Must Follow SSOT

| Category | Source | Usage |
|----------|--------|-------|
| **Paths** | `CacheManager` | All file/directory paths |
| **API Keys** | Environment variables via `LightRAGConfig` | `config.llm.api_key`, `config.embedding.api_key` |
| **API Base URLs** | Environment variables via `LightRAGConfig` | `config.llm.api_base`, `config.embedding.api_base` |
| **Model Names** | Environment variables via `LightRAGConfig` | `config.llm.model`, `config.embedding.model` |
| **Provider Functions** | `PROVIDERS` registry in `rag.py` | Use `_get_provider(name)` - never inline |
| **Provider Aliases** | `PROVIDER_ALIASES` in `rag.py` | Central mapping (e.g., `zai` → `zhipu`) |
| **Embedding Dimensions** | `EMBEDDING_DIMENSIONS` in `rag.py` or provider config | Never hardcode dimension values |

### Anti-Patterns (NEVER DO)

```python
# ❌ Hardcoded paths
Path.home() / ".3gpp-crawler" / "lightrag"

# ❌ Hardcoded API configuration
api_key = "sk-..."
api_base = "https://api.z.ai/..."

# ❌ Duplicated provider mapping
if provider == "ollama":
    func = ollama_model_complete
elif provider == "zhipu":
    func = zhipu_complete

# ❌ Hardcoded dimension values
if model == "qwen3":
    dim = 1024
```
## Code Guidelines

- Use type hints on all public functions.
- Keep imports at module top level.
- Use `logging` for diagnostics.
- Avoid introducing new dependencies unless required.

### Correct Patterns (ALWAYS DO)

```python
# ✅ Paths via CacheManager
manager = resolve_cache_manager()
working_dir = manager.ai_embed_dir(model_name)

# ✅ Configuration via LightRAGConfig
config = LightRAGConfig.from_env()
api_key = config.llm.api_key

# ✅ Provider functions from registry
provider_config = _get_provider(provider_name)
func = provider_config.complete_func

# ✅ Dimensions from config or registry
dim = _get_embedding_dimension(model_name, provider)
```

### Why This Matters

1. **Maintainability**: Change once, update everywhere automatically
2. **Consistency**: No drift between different parts of the code
3. **Testability**: Easy to swap values in tests
4. **Security**: Secrets live in environment variables, not code
5. **DRY**: Eliminates duplicated logic and magic strings/numbers

## Storage Layer

### LightRAG

File-based storage by default:

- NanoVectorDB for embeddings (file-based)
- JsonKVStorage for cache (file-based)
- NetworkX for knowledge graph

Optionally use pg0 for PostgreSQL-backed storage.
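Backend selection can be sketched as a simple env-driven switch; this is illustrative only, and the actual wiring lives in the LightRAG setup:

```python
import os

# The two backends documented for LIGHTRAG_DB_BACKEND.
VALID_BACKENDS = {"file", "pg0"}

def resolve_db_backend() -> str:
    """Pick the storage backend from LIGHTRAG_DB_BACKEND, defaulting to 'file'."""
    backend = os.environ.get("LIGHTRAG_DB_BACKEND", "file").lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"Unsupported LIGHTRAG_DB_BACKEND: {backend!r}")
    return backend
```

Failing fast on unknown values keeps a typo in the environment from silently falling back to file storage.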

## CLI Integration

The `3gpp-ai` package provides its own standalone CLI entrypoint:

```bash
3gpp-ai workspace process
3gpp-ai workspace query "your query"
3gpp-ai workspace status
3gpp-ai summarize <tdoc_id>
3gpp-ai convert <tdoc_id>
3gpp-ai providers list
```

## Import Guidelines

```python
from tdoc_ai import LightRAGConfig, TDocRAG, TDocProcessor
```
## Extraction

LightRAG uses `opendataloader-pdf` for text, table, formula, and figure extraction before chunking and ingestion.

## Testing Expectations

When changing contracts, update tests in `tests/ai/` in the same change set:

- Contract model updates: `tests/ai/test_wiki_contracts.py`
- CLI surface updates: `tests/ai/test_extraction_profiles.py` and relevant CLI tests
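A contract test can be as small as asserting the fixed mode; the sketch below is in the spirit of `tests/ai/test_wiki_contracts.py`, with an assumed constant rather than the package's real symbol:

```python
# Hypothetical contract test; the real constant lives in the package.
WIKI_QUERY_MODE = "wiki-first"

def test_query_mode_is_fixed():
    assert WIKI_QUERY_MODE == "wiki-first"

def test_no_alternate_modes_configured():
    # The contract allows exactly one mode, so the supported set is a singleton.
    supported_modes = {WIKI_QUERY_MODE}
    assert supported_modes == {"wiki-first"}
```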

## Deprecated/Removed

- `AiStorage`
- `EmbeddingsManager`
- `create_embeddings_manager()`
- `tdoc_ai.operations.pipeline` (legacy CLASSIFY/EXTRACT/EMBED/GRAPH flow)
- `tdoc_ai.storage.lancedb`
- `sentence-transformers`
- `tokenizers`
- `lancedb`
- `docling` (replaced by `opendataloader-pdf`)

## Never Do

- Add query contract values other than `wiki-first`.
- Add config or CLI switches that change the fixed query contract mode.
- Reintroduce retrieval-mode configuration that changes wiki-first behavior.
---

Optional AI extension package for `3gpp-crawler`.

This package contains AI-focused capabilities including:

- Document extraction and conversion
- Deterministic wiki compilation from extraction artifacts
- Citation-grounded wiki querying and summarization
- AI workspace management

Install via `3gpp-crawler` extras: