**`packages/3gpp-ai/AGENTS.md`** (+34 −142)

The guide is rewritten around a single wiki-first contract; the previous LightRAG-centric content follows for reference.

New version:

# 3gpp-ai

AI package for wiki-first processing of 3GPP TDocs and specs.

## Architecture Rule

Use wiki-first as the only supported compile/query contract.

- Do not introduce additional query modes.
- Do not add fallback flags or alternate retrieval-mode toggles.
- Keep query metadata deterministic: `query_mode = "wiki-first"`.

## Project Structure

Generate structure on demand from repository root:

```shell
rg --files | tree-cli --fromfile
```

## Commands

| Task | Command |
|------|---------|
| Lint | `ruff check packages/3gpp-ai tests/ai` |
| Test (package) | `uv run pytest tests/ai -v` |
| Test (single) | `uv run pytest tests/ai/test_wiki_contracts.py -v` |

## Configuration

Read settings from `TDC_AI_*` environment variables and use `CacheManager` for path resolution.

Required pattern:

```python
from tdoc_crawler.config import resolve_cache_manager
from threegpp_ai.config import AiConfig

manager = resolve_cache_manager()
config = AiConfig.from_env()
```

Never hardcode paths such as `~/.3gpp-crawler`.

## Code Guidelines

- Use type hints on all public functions.
- Keep imports at module top level.
- Use `logging` for diagnostics.
- Avoid introducing new dependencies unless required.

## Testing Expectations

When changing contracts, update tests in `tests/ai/` in the same change set.

- Contract model updates: `tests/ai/test_wiki_contracts.py`
- CLI surface updates: `tests/ai/test_extraction_profiles.py` and relevant CLI tests

## Never Do

- Add query contract values other than `wiki-first`.
- Add config or CLI switches that change the fixed query contract mode.
- Reintroduce retrieval-mode configuration that changes wiki-first behavior.
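As an illustration of the required configuration pattern above (not part of the AGENTS.md text): a minimal sketch of how the two objects might be combined to pick a per-model working directory, assuming `ai_embed_dir()` behaves as documented in the previous version below and that `AiConfig` exposes the embedding model under an attribute such as `embedding_model` (an assumed name).

```python
from pathlib import Path

from tdoc_crawler.config import resolve_cache_manager
from threegpp_ai.config import AiConfig


def resolve_working_dir() -> Path:
    """Derive the per-model working directory without hardcoding ~/.3gpp-crawler."""
    manager = resolve_cache_manager()  # single source of truth for cache paths
    config = AiConfig.from_env()       # single source of truth for TDC_AI_* settings
    # `embedding_model` is an assumed attribute name on AiConfig.
    return manager.ai_embed_dir(config.embedding_model)
```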
Previous version (largely removed):

AI-powered document processing for 3GPP TDocs, specs, and other documents (knowledge graphs, semantic search, summarization).

## Project Structure

The project structure can be parsed using the following command from the root of the repository:

```shell
rg --files | tree-cli --fromfile
```

## Key Design Patterns

### LightRAG Integration

The 3gpp-ai pipeline uses LightRAG for all document processing:

```python
from tdoc_ai import LightRAGConfig, TDocRAG, TDocProcessor

# Automatically reads TDC_AI_* environment variables
config = LightRAGConfig.from_env()
rag = TDocRAG(config)
await rag.start("my-workspace")
```

### Main APIs

Use `TDocRAG` for workspace-level retrieval and `TDocProcessor` for per-document ingestion.

## Configuration

### LightRAG

Reads from `TDC_AI_*` environment variables (see `.env.example`):

- `TDC_AI_LLM_MODEL` - LLM model in `<provider>/<model>` format (default: `openrouter/openrouter/free`)
- `TDC_AI_LLM_API_BASE` - Custom LLM API base URL (optional)
- `TDC_AI_LLM_API_KEY` - LLM API key (optional, overrides provider-specific env vars)
- `TDC_AI_EMBEDDING_MODEL` - Embedding model ID (default: `sentence-transformers/all-MiniLM-L6-v2`)

LightRAG-specific variables:

- `LIGHTRAG_SHARED_STORAGE` - Enable shared embedding storage (default: `true`)
- `LIGHTRAG_DB_BACKEND` - Storage backend: `file` or `pg0` (default: `file`)

### Path Management

**CRITICAL:** All file paths use `CacheManager` from `tdoc_crawler.config`:

```python
from tdoc_crawler.config import resolve_cache_manager

manager = resolve_cache_manager()
manager.ai_cache_dir           # ~/.3gpp-crawler/lightrag/
manager.ai_embed_dir(model)    # ~/.3gpp-crawler/lightrag/{model}/
```

**NEVER hardcode paths** like `~/.3gpp-crawler` - see root `AGENTS.md` for the full CacheManager pattern.

## Single Source of Truth (SSOT) Principle

**Rule:** Every configuration value, constant, or shared resource must be defined **exactly once** and reused everywhere else.

### What Must Follow SSOT

| Category | Source | Usage |
|----------|--------|-------|
| **Paths** | `CacheManager` | All file/directory paths |
| **API Keys** | Environment variables via `LightRAGConfig` | `config.llm.api_key`, `config.embedding.api_key` |
| **API Base URLs** | Environment variables via `LightRAGConfig` | `config.llm.api_base`, `config.embedding.api_base` |
| **Model Names** | Environment variables via `LightRAGConfig` | `config.llm.model`, `config.embedding.model` |
| **Provider Functions** | `PROVIDERS` registry in `rag.py` | Use `_get_provider(name)` - never inline |
| **Provider Aliases** | `PROVIDER_ALIASES` in `rag.py` | Central mapping (e.g., `zai` → `zhipu`) |
| **Embedding Dimensions** | `EMBEDDING_DIMENSIONS` in `rag.py` or provider config | Never hardcode dimension values |

### Anti-Patterns (NEVER DO)

```python
# ❌ Hardcoded paths
Path.home() / ".3gpp-crawler" / "lightrag"

# ❌ Hardcoded API configuration
api_key = "sk-..."
api_base = "https://api.z.ai/..."

# ❌ Duplicated provider mapping
if provider == "ollama":
    func = ollama_model_complete
elif provider == "zhipu":
    func = zhipu_complete

# ❌ Hardcoded dimension values
if model == "qwen3":
    dim = 1024
```

### Correct Patterns (ALWAYS DO)

```python
# ✅ Paths via CacheManager
manager = resolve_cache_manager()
working_dir = manager.ai_embed_dir(model_name)

# ✅ Configuration via LightRAGConfig
config = LightRAGConfig.from_env()
api_key = config.llm.api_key

# ✅ Provider functions from registry
provider_config = _get_provider(provider_name)
func = provider_config.complete_func

# ✅ Dimensions from config or registry
dim = _get_embedding_dimension(model_name, provider)
```
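The SSOT table above points at a `PROVIDERS` registry, a `PROVIDER_ALIASES` mapping, and a `_get_provider()` helper in `rag.py` without showing their shape. A minimal sketch of what such a registry might look like; the dataclass fields and function bodies are assumptions, only the names `PROVIDERS`, `PROVIDER_ALIASES`, `_get_provider`, and `complete_func` come from the table and correct-patterns block above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProviderConfig:
    """Assumed shape: one entry per supported LLM provider."""
    complete_func: Callable[..., object]
    default_api_base: str | None = None


# Central registry: the only place provider functions are wired up.
PROVIDERS: dict[str, ProviderConfig] = {
    "ollama": ProviderConfig(complete_func=lambda *args, **kwargs: ...),
    "zhipu": ProviderConfig(complete_func=lambda *args, **kwargs: ...),
}

# Central alias mapping, e.g. `zai` -> `zhipu`.
PROVIDER_ALIASES: dict[str, str] = {"zai": "zhipu"}


def _get_provider(name: str) -> ProviderConfig:
    """Resolve aliases, then look the provider up exactly once."""
    canonical = PROVIDER_ALIASES.get(name, name)
    try:
        return PROVIDERS[canonical]
    except KeyError as exc:
        raise ValueError(f"Unknown provider: {name}") from exc
```

The point of the sketch is the lookup discipline: provider wiring lives in one mapping, so the `if`/`elif` chains shown in the anti-patterns block never reappear.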
### Why This Matters

1. **Maintainability**: Change once, update everywhere automatically
1. **Consistency**: No drift between different parts of the code
1. **Testability**: Easy to swap values in tests
1. **Security**: Secrets live in environment variables, not code
1. **DRY**: Eliminates duplicated logic and magic strings/numbers

## Storage Layer

### LightRAG

File-based storage by default:

- NanoVectorDB for embeddings (file-based)
- JsonKVStorage for cache (file-based)
- NetworkX for knowledge graph

Optionally use pg0 for PostgreSQL-backed storage.

## CLI Integration

The `3gpp-ai` package provides its own standalone CLI entrypoint:

```bash
3gpp-ai workspace process
3gpp-ai workspace query "your query"
3gpp-ai workspace status
3gpp-ai summarize <tdoc_id>
3gpp-ai convert <tdoc_id>
3gpp-ai providers list
```

## Import Guidelines

```python
from tdoc_ai import LightRAGConfig, TDocRAG, TDocProcessor
```

## Extraction

LightRAG uses `opendataloader-pdf` for text, table, formula, and figure extraction before chunking and ingestion.

## Deprecated/Removed

- `AiStorage`
- `EmbeddingsManager`
- `create_embeddings_manager()`
- `tdoc_ai.operations.pipeline` (legacy CLASSIFY/EXTRACT/EMBED/GRAPH flow)
- `tdoc_ai.storage.lancedb`
- `sentence-transformers`
- `tokenizers`
- `lancedb`
- `docling` (replaced by `opendataloader-pdf`)

**`packages/3gpp-ai/README.md`** (+2 −3, hunk `@@ -5,9 +5,8 @@`)

The capability list is updated to match the wiki-first pipeline:

```diff
 Optional AI extension package for `3gpp-crawler`.

 This package contains AI-focused capabilities including:

 - Document extraction and conversion
-- Summarization
-- Embeddings and semantic search
-- GraphRAG querying (via LightRAG)
+- Deterministic wiki compilation from extraction artifacts
+- Citation-grounded wiki querying and summarization
 - AI workspace management

 Install via `3gpp-crawler` extras:
```
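Tying the new Testing Expectations to the Architecture Rule, a contract check in the spirit of `tests/ai/test_wiki_contracts.py` might look like the sketch below. It is purely illustrative: `run_query` and the result fields are invented for the example rather than taken from the package, and only the fixed `query_mode` value and the idea of citation-grounded answers come from the documents above.

```python
def run_query(question: str) -> dict:
    """Hypothetical stand-in for the package's wiki-first query path."""
    return {
        "query_mode": "wiki-first",
        "citations": [{"source": "TS 38.300", "section": "5.1"}],
    }


def test_query_mode_is_fixed() -> None:
    # The contract allows exactly one mode; no fallback or alternate toggles.
    assert run_query("What does the TDoc propose?")["query_mode"] == "wiki-first"


def test_answers_are_citation_grounded() -> None:
    # Wiki-first answers should always carry citations back to their sources.
    assert run_query("Summarize the agreed changes.")["citations"]
```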