@@ -546,12 +549,14 @@ Replace the existing 3gpp-ai pipeline with **LightRAG** (Python GraphRAG framewo
The existing tdoc-crawler already stores structured 3GPP metadata (TDoc IDs, spec references, meeting codes, companies) in SQLite. Since the LightRAG stage receives converted PDF files plus fixed-key metadata dictionaries, enrichment should be schema-first and deterministic instead of ad hoc dictionary lookups.
- [ ] Define a typed metadata contract (Pydantic model) for enrichment with at least:
  - `tdoc_id` (required)
  - `title`, `source`, `meeting` (optional)
  - `spec_refs` (list[str], default empty list)
  - `release`, `wg` (optional, if available from SQLite)
- [ ] Add normalization before enrichment:
  - `tdoc_id` uppercased and stripped
  - `spec_refs` normalized into a stable display form
  - missing required fields cause an explicit skip/error status
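The contract and normalization rules above can be sketched as follows. This is a minimal illustration using a stdlib dataclass standing in for the Pydantic model; the field names follow the list above, and the `TS`/`TR` display form for `spec_refs` is an assumption about what "stable display form" means:

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class TdocMetadata:
    """Typed enrichment contract; in the pipeline this would be a Pydantic model."""

    tdoc_id: str
    title: str | None = None
    source: str | None = None
    meeting: str | None = None
    spec_refs: list[str] = field(default_factory=list)
    release: str | None = None
    wg: str | None = None

    def __post_init__(self) -> None:
        # tdoc_id: uppercase and strip; empty after stripping -> explicit error
        self.tdoc_id = self.tdoc_id.strip().upper()
        if not self.tdoc_id:
            raise ValueError("tdoc_id is required; document skipped")
        # spec_refs: normalize to a stable display form,
        # e.g. "ts23.501" -> "TS 23.501" (assumed convention)
        self.spec_refs = [
            re.sub(r"^(TS|TR)\s*", r"\1 ", ref.strip().upper())
            for ref in self.spec_refs
        ]
```

With this shape, `TdocMetadata(tdoc_id=" s2-2401234 ", spec_refs=["ts23.501"])` normalizes deterministically to `S2-2401234` and `["TS 23.501"]`, and a blank `tdoc_id` raises instead of silently proceeding.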
@@ -618,11 +623,13 @@ LightRAG supports `edit_entity()` and `create_entity()` APIs. For known 3GPP ent
#### 5.2 Update Dependencies
- [ ] Remove from `pyproject.toml`:
  - `lancedb` (replaced by pg0/pgvector)
  - `sentence-transformers` (handled by LightRAG embedding functions)
  - `edgequake-sdk` (dropped)
- [ ] Add to `pyproject.toml`:
  - `lightrag-hku`
  - `pg0-embedded`
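The resulting `pyproject.toml` dependency list could look roughly like this (a sketch only; version pins and any other existing dependencies are omitted):

```toml
[project]
dependencies = [
    # removed: lancedb, sentence-transformers, edgequake-sdk
    "lightrag-hku",
    "pg0-embedded",
]
```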
@@ -673,18 +680,23 @@ OpenSearch runs on Windows via zip archive (no Docker). See investigation report
## Open Questions
1. **LLM Provider Selection:** Default to Ollama (local, free) or require a cloud provider?
   - Recommendation: Ollama for development. LightRAG recommends 32B+ models for quality entity extraction. Qwen3-30B or similar.
2. **Embedding Provider:** Use Ollama (local) or sentence-transformers?
   - Recommendation: Ollama with qwen3-embedding:0.6b (available in the 3gpp-ai workspace)
3. **Graph Storage Scaling:** NetworkX (file-based) vs. PostgreSQL AGE vs. OpenSearch?
   - Recommendation: Start with NetworkX. For a typical 3GPP meeting scope (100-500 TDocs), a file-based graph is sufficient. Migrate to OpenSearch only if performance becomes an issue.
4. **Metadata Contract Strictness:** Which metadata keys are required to proceed with insertion?
   - Recommendation: Require `tdoc_id`; treat all other fields as optional but normalized. On missing required fields, emit an explicit skip/error result.
5. **LightRAG WebUI:** Deploy alongside the CLI?
   - Recommendation: Optional. `lightrag-server` provides an Ollama-compatible REST API plus a React WebUI for graph visualization. Useful for debugging, not required for the pipeline.
## Progress
@@ -714,10 +726,10 @@ OpenSearch runs on Windows via zip archive (no Docker). See investigation report
This plan addresses the adaptation of `summarize` and `convert` commands for the 3GPP AI document processing pipeline. The goal is to provide robust, user-friendly commands for document conversion and summarization with proper caching, error handling, and integration with the LightRAG knowledge graph.