Commit f65e215f authored by Jan Reimes

📝 docs(3gpp-ai): update documentation to reflect wiki-first architecture rule

parent 07935aa8
# 3gpp-ai

AI package for wiki-first processing of 3GPP TDocs and specs (knowledge graphs, semantic search, summarization).

## Project Structure

Generate the project structure on demand from the repository root:

```shell
rg --files | tree-cli --fromfile
```
## Architecture Rule

Use wiki-first as the only supported compile/query contract:

- Do not introduce additional query modes.
- Do not add fallback flags or alternate retrieval-mode toggles.
- Keep query metadata deterministic: `query_mode = "wiki-first"`.

## Key Design Patterns

### LightRAG Integration

The 3gpp-ai pipeline uses LightRAG for all document processing:

```python
from tdoc_ai import LightRAGConfig, TDocRAG, TDocProcessor

# Automatically reads TDC_AI_* environment variables
config = LightRAGConfig.from_env()
rag = TDocRAG(config)
await rag.start("my-workspace")
```
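One way to keep the query contract deterministic is a module-level constant that callers cannot override. The names below are an illustrative sketch, not part of the package API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: the query mode is a fixed constant, never a parameter.
QUERY_MODE = "wiki-first"

@dataclass(frozen=True)
class QueryMetadata:
    """Metadata attached to every query result (illustrative names)."""
    workspace: str
    # init=False keeps the mode out of the constructor signature entirely.
    query_mode: str = field(default=QUERY_MODE, init=False)

meta = QueryMetadata(workspace="my-workspace")
print(meta.query_mode)  # wiki-first
```

Because the dataclass is frozen and the field is excluded from `__init__`, there is no code path through which a caller can select another mode.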

### Main APIs

Use `TDocRAG` for workspace-level retrieval and `TDocProcessor` for per-document ingestion.

## Commands

| Task | Command |
|------|---------|
| Lint | `ruff check packages/3gpp-ai tests/ai` |
| Test (package) | `uv run pytest tests/ai -v` |
| Test (single) | `uv run pytest tests/ai/test_wiki_contracts.py -v` |

## Configuration

### LightRAG

Reads settings from `TDC_AI_*` environment variables (see `.env.example`) and uses `CacheManager` for path resolution:

- `TDC_AI_LLM_MODEL` - LLM model in `<provider>/<model>` format (default: `openrouter/openrouter/free`)
- `TDC_AI_LLM_API_BASE` - Custom LLM API base URL (optional)
- `TDC_AI_LLM_API_KEY` - LLM API key (optional, overrides provider-specific env vars)
- `TDC_AI_EMBEDDING_MODEL` - Embedding model ID (default: `sentence-transformers/all-MiniLM-L6-v2`)

LightRAG-specific variables:

- `LIGHTRAG_SHARED_STORAGE` - Enable shared embedding storage (default: `true`)
- `LIGHTRAG_DB_BACKEND` - Storage backend: `file` or `pg0` (default: `file`)
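As an illustration of how the `<provider>/<model>` convention for `TDC_AI_LLM_MODEL` can be consumed (a sketch under assumed names, not the actual `LightRAGConfig` implementation):

```python
import os

# Illustrative default mirroring the variable documented above.
DEFAULT_LLM_MODEL = "openrouter/openrouter/free"

def split_provider_model(model_id: str) -> tuple[str, str]:
    """Split '<provider>/<model>' at the first slash."""
    provider, _, model = model_id.partition("/")
    return provider, model

model_id = os.environ.get("TDC_AI_LLM_MODEL", DEFAULT_LLM_MODEL)
provider, model = split_provider_model(model_id)
print(provider, model)  # e.g. "openrouter" and "openrouter/free"
```

Splitting at the first slash only matters because model IDs themselves may contain slashes, as the `openrouter/openrouter/free` default shows.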

### Path Management

**CRITICAL:** All file paths use `CacheManager` from `tdoc_crawler.config`. Required pattern:

```python
from tdoc_crawler.config import resolve_cache_manager
from threegpp_ai.config import AiConfig

manager = resolve_cache_manager()
manager.ai_cache_dir         # ~/.3gpp-crawler/lightrag/
manager.ai_embed_dir(model)  # ~/.3gpp-crawler/lightrag/{model}/
config = AiConfig.from_env()
```

**NEVER hardcode paths** like `~/.3gpp-crawler` - see root `AGENTS.md` for the full CacheManager pattern.

## Single Source of Truth (SSOT) Principle

**Rule:** Every configuration value, constant, or shared resource must be defined **exactly once** and reused everywhere else.

### What Must Follow SSOT

| Category | Source | Usage |
|----------|--------|-------|
| **Paths** | `CacheManager` | All file/directory paths |
| **API Keys** | Environment variables via `LightRAGConfig` | `config.llm.api_key`, `config.embedding.api_key` |
| **API Base URLs** | Environment variables via `LightRAGConfig` | `config.llm.api_base`, `config.embedding.api_base` |
| **Model Names** | Environment variables via `LightRAGConfig` | `config.llm.model`, `config.embedding.model` |
| **Provider Functions** | `PROVIDERS` registry in `rag.py` | Use `_get_provider(name)` - never inline |
| **Provider Aliases** | `PROVIDER_ALIASES` in `rag.py` | Central mapping (e.g., `zai` → `zhipu`) |
| **Embedding Dimensions** | `EMBEDDING_DIMENSIONS` in `rag.py` or provider config | Never hardcode dimension values |

### Anti-Patterns (NEVER DO)

```python
# ❌ Hardcoded paths
Path.home() / ".3gpp-crawler" / "lightrag"

# ❌ Hardcoded API configuration
api_key = "sk-..."
api_base = "https://api.z.ai/..."

# ❌ Duplicated provider mapping
if provider == "ollama":
    func = ollama_model_complete
elif provider == "zhipu":
    func = zhipu_complete

# ❌ Hardcoded dimension values
if model == "qwen3":
    dim = 1024
```
## Code Guidelines

- Use type hints on all public functions.
- Keep imports at module top level.
- Use `logging` for diagnostics.
- Avoid introducing new dependencies unless required.

### Correct Patterns (ALWAYS DO)

```python
# ✅ Paths via CacheManager
manager = resolve_cache_manager()
working_dir = manager.ai_embed_dir(model_name)

# ✅ Configuration via LightRAGConfig
config = LightRAGConfig.from_env()
api_key = config.llm.api_key

# ✅ Provider functions from registry
provider_config = _get_provider(provider_name)
func = provider_config.complete_func

# ✅ Dimensions from config or registry
dim = _get_embedding_dimension(model_name, provider)
```

### Why This Matters

1. **Maintainability**: Change once, update everywhere automatically
2. **Consistency**: No drift between different parts of the code
3. **Testability**: Easy to swap values in tests
4. **Security**: Secrets live in environment variables, not code
5. **DRY**: Eliminates duplicated logic and magic strings/numbers

## Storage Layer

### LightRAG

File-based storage by default:

- NanoVectorDB for embeddings (file-based)
- JsonKVStorage for cache (file-based)
- NetworkX for knowledge graph

Optionally use pg0 for PostgreSQL-backed storage.
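Backend selection can be sketched as a simple env-driven switch; this is illustrative only, and the actual wiring lives in the LightRAG setup:

```python
import os

# The two backends documented for LIGHTRAG_DB_BACKEND.
VALID_BACKENDS = {"file", "pg0"}

def resolve_db_backend() -> str:
    """Pick the storage backend from LIGHTRAG_DB_BACKEND, defaulting to 'file'."""
    backend = os.environ.get("LIGHTRAG_DB_BACKEND", "file").lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"Unsupported LIGHTRAG_DB_BACKEND: {backend!r}")
    return backend
```

Failing fast on unknown values keeps a typo in the environment from silently falling back to file storage.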

## CLI Integration

The `3gpp-ai` package provides its own standalone CLI entrypoint:

```bash
3gpp-ai workspace process
3gpp-ai workspace query "your query"
3gpp-ai workspace status
3gpp-ai summarize <tdoc_id>
3gpp-ai convert <tdoc_id>
3gpp-ai providers list
```

## Import Guidelines

```python
from tdoc_ai import LightRAGConfig, TDocRAG, TDocProcessor
```
## Extraction

LightRAG uses `opendataloader-pdf` for text, table, formula, and figure extraction before chunking and ingestion.

## Testing Expectations

When changing contracts, update tests in `tests/ai/` in the same change set:

- Contract model updates: `tests/ai/test_wiki_contracts.py`
- CLI surface updates: `tests/ai/test_extraction_profiles.py` and relevant CLI tests
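A contract test can be as small as asserting the fixed mode; the sketch below is in the spirit of `tests/ai/test_wiki_contracts.py`, with an assumed constant rather than the package's real symbol:

```python
# Hypothetical contract test; the real constant lives in the package.
WIKI_QUERY_MODE = "wiki-first"

def test_query_mode_is_fixed():
    assert WIKI_QUERY_MODE == "wiki-first"

def test_no_alternate_modes_configured():
    # The contract allows exactly one mode, so the supported set is a singleton.
    supported_modes = {WIKI_QUERY_MODE}
    assert supported_modes == {"wiki-first"}
```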

## Deprecated/Removed

- `AiStorage`
- `EmbeddingsManager`
- `create_embeddings_manager()`
- `tdoc_ai.operations.pipeline` (legacy CLASSIFY/EXTRACT/EMBED/GRAPH flow)
- `tdoc_ai.storage.lancedb`
- `sentence-transformers`
- `tokenizers`
- `lancedb`
- `docling` (replaced by `opendataloader-pdf`)

## Never Do

- Add query contract values other than `wiki-first`.
- Add config or CLI switches that change the fixed query contract mode.
- Reintroduce retrieval-mode configuration that changes wiki-first behavior.
---

Optional AI extension package for `3gpp-crawler`.

This package contains AI-focused capabilities including:

- Document extraction and conversion
- Deterministic wiki compilation from extraction artifacts
- Citation-grounded wiki querying and summarization
- AI workspace management

Install via `3gpp-crawler` extras: