Commit bf189dd5 authored by Jan Reimes

📝 docs: update development and migration documentation

parent 8552ef40
@@ -9,15 +9,18 @@ This guide describes how to set up your environment for contributing to `3gpp-cr
1. Clone the repository:

   ```bash
   git clone https://forge.3gpp.org/rep/reimes/3gpp-crawler.git
   cd 3gpp-crawler
   ```

2. Sync dependencies:

   ```bash
   uv sync --all-extras
   ```

3. Install pre-commit hooks:

@@ -43,20 +43,23 @@ Replace the existing 3gpp-ai pipeline with **LightRAG** (Python GraphRAG framewo
### Key Components

1. **LightRAG** (Python library)

   - `pip install lightrag-hku` / `uv add lightrag-hku`
   - Async-first API (`await rag.ainsert()`, `await rag.aquery()`)
   - LLM-powered entity-relationship extraction with gleaning
   - 6 query modes, citation tracking, workspace isolation
   - Optional REST API server via `lightrag-server`

2. **pg0** (Embedded PostgreSQL)

   - Zero-config, single binary (~50MB)
   - PostgreSQL 18 + pgvector 0.8.1 bundled
   - Python SDK: `from pg0 import Pg0`
   - Data stored in `~/.pg0/instances/<name>/data/`
   - Note: Does NOT include Apache AGE (graph extension)

3. **Storage Split**

   - **pg0 handles:** KV cache, vector embeddings, doc status (3 of 4 storage types)
   - **NetworkX handles:** Entity-relationship graph (file-based, LightRAG default)
   - This avoids the Apache AGE dependency entirely
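
The storage split above can be sketched as LightRAG constructor wiring. This is a configuration sketch only: the storage class names and keyword arguments follow LightRAG's pluggable-storage convention but should be verified against the installed `lightrag-hku` version before use.

```python
# Sketch only: backend names and parameters are assumptions to verify
# against the installed lightrag-hku release.
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    workspace="sa4-meeting-131",              # workspace isolation per meeting
    kv_storage="PGKVStorage",                 # pg0: LLM/embedding cache
    vector_storage="PGVectorStorage",         # pg0: pgvector embeddings
    doc_status_storage="PGDocStatusStorage",  # pg0: document status
    graph_storage="NetworkXStorage",          # file-based graph, no Apache AGE
)
```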
@@ -546,12 +549,14 @@ Replace the existing 3gpp-ai pipeline with **LightRAG** (Python GraphRAG framewo
The existing tdoc-crawler already stores structured 3GPP metadata (TDoc IDs, spec references, meeting codes, companies) in SQLite. Since the LightRAG stage receives converted PDF files plus fixed-key metadata dictionaries, enrichment should be schema-first and deterministic instead of ad hoc dictionary lookups.

- [ ] Define a typed metadata contract (Pydantic model) for enrichment with at least:

  - `tdoc_id` (required)
  - `title`, `source`, `meeting` (optional)
  - `spec_refs` (list[str], default empty list)
  - `release`, `wg` (optional, if available from SQLite)

- [ ] Add normalization before enrichment:

  - `tdoc_id` uppercased and stripped
  - `spec_refs` normalized into stable display form
  - missing required fields cause explicit skip/error status
@@ -618,11 +623,13 @@ LightRAG supports `edit_entity()` and `create_entity()` APIs. For known 3GPP ent
#### 5.2 Update Dependencies

- [ ] Remove from `pyproject.toml`:

  - `lancedb` (replaced by pg0/pgvector)
  - `sentence-transformers` (handled by LightRAG embedding functions)
  - `edgequake-sdk` (dropped)

- [ ] Add to `pyproject.toml`:

  - `lightrag-hku`
  - `pg0-embedded`

@@ -673,18 +680,23 @@ OpenSearch runs on Windows via zip archive (no Docker). See investigation report
## Open Questions

1. **LLM Provider Selection:** Default to Ollama (local, free) or require cloud provider?

   - Recommendation: Ollama for development. LightRAG recommends 32B+ models for quality entity extraction. Qwen3-30B or similar.

2. **Embedding Provider:** Use Ollama (local) or sentence-transformers?

   - Recommendation: Ollama with qwen3-embedding:0.6b (available in 3gpp-ai workspace)

3. **Graph Storage Scaling:** NetworkX (file-based) vs. PostgreSQL AGE vs. OpenSearch?

   - Recommendation: Start with NetworkX. For typical 3GPP meeting scope (100-500 TDocs), file-based graph is sufficient. Migrate to OpenSearch only if performance becomes an issue.

4. **Metadata Contract Strictness:** Which metadata keys are required to proceed with insertion?

   - Recommendation: Require `tdoc_id`; treat all other fields as optional but normalized. On missing required fields, emit explicit skip/error result.

5. **LightRAG WebUI:** Deploy alongside CLI?

   - Recommendation: Optional. `lightrag-server` provides an Ollama-compatible REST API + React WebUI for graph visualization. Useful for debugging, not required for pipeline.

## Progress
@@ -714,10 +726,10 @@ OpenSearch runs on Windows via zip archive (no Docker). See investigation report
- [x] (2026-03-23) Phase 5.4 complete: Removed broken CLI commands (ai query/process/status)
- [x] (2026-03-23) Phase 6.1 complete: Unit tests (test_lightrag_config.py: 11 tests, test_metadata.py: 11 tests)
- [x] (2026-03-23) Phase 6.2 complete: Integration tests (test_integration.py: 4 async tests)
- [x] (2026-03-23) Phase 6.3 validation: Startup <10s ✓, Insert/query works ✓, Workspace isolation ✓
- [x] (2026-03-23) Rename fix: Updated PLAN.md defaults from `tdoc-crawler` to `3gpp-crawler`
- [ ] Phase 7: 3gpp-ai package cleanup (dead code removal, linter issues)
- [ ] Phase 6.3 validation: Query latency <2s (manual validation needed)
- [ ] Phase 6.3 validation: Idempotency (manual validation needed)

### Phase 7: 3gpp-ai Package Cleanup
@@ -744,6 +756,7 @@ The following modules are obsolete (replaced by LightRAG pipeline):
Current state exports 40+ symbols including legacy ones. Target: clean separation.

**Remove from exports:**

- `AiConfig` (replaced by `LightRAGConfig`)
- `AiStorage` (LanceDB, dead)
- `EmbeddingsManager` (sentence-transformers, dead)
@@ -752,6 +765,7 @@ Current state exports 40+ symbols including legacy ones. Target: clean separatio
- `GraphNode`, `GraphEdge` (custom graph schema, dead)

**Keep exports:**

```python
# Document operations (still used by both pipelines)
from tdoc_ai.operations.convert import convert_tdoc as convert_document
@@ -805,6 +819,7 @@ Due to `tdoc-crawler` → `3gpp-crawler` rename:
#### 7.5 Verify No Regressions

Before declaring Phase 7 complete, verify:

- [ ] `3gpp-ai --help` shows all commands
- [ ] `3gpp-ai rag status` works
- [ ] `3gpp-ai rag query "test"` works
# PLAN: Adapt `summarize` and `convert` Commands

**Status:** ✅ All Phases Complete\
**Last Updated:** 2026-03-24

**Completion:** All phases implemented.\
**NOTE:** See "Known Limitations" section for current constraints and "Planned Improvements" for roadmap.

---

## Known Limitations

@@ -20,24 +20,28 @@
### Current Mitigations

1. **Large Document Handling**:

   - Documents >500K tokens are truncated to `MAX_TOKENS` (default 100K)
   - Truncation may lose important content at the end of documents
   - Cache stored in `.ai/` subdirectory

2. **Rate Limiting**:

   - 3GPP FTP may rate-limit requests
   - `fetch_tdoc_files()` implements retry logic with exponential backoff
   - Max 3 retries with 5-second delays

3. **LibreOffice Dependency**:

   - `convert-lo` requires LibreOffice installed on the system
   - Graceful error message if not available

4. **Memory Pressure**:

   - Large PDFs may consume significant memory during extraction
   - `kreuzberg` handles this internally but large documents (>100 pages) may be slow
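
The retry behavior under mitigation 2 can be sketched as follows; `fetch_with_retry`, the exception type, and the exact backoff schedule are illustrative stand-ins, not the `fetch_tdoc_files()` implementation:

```python
import time

def fetch_with_retry(fetch, max_retries=3, base_delay=5.0, sleep=time.sleep):
    """Retry a fetch callable with exponential backoff.

    One initial attempt plus up to `max_retries` retries, delayed
    base_delay * 2**attempt seconds (5s, 10s, 20s with defaults).
    The `sleep` hook is injectable so tests need not wait.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```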

---

## Planned Improvements

@@ -46,6 +50,7 @@
**Goal:** Preserve context across document boundaries instead of simple truncation.

**Design:**

```python
from enum import Enum
from dataclasses import dataclass
@@ -69,13 +74,14 @@ class ChunkingConfig:

**Dependencies:** `tiktoken` for token counting
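
The chunking idea can be sketched with a stdlib stand-in. The design above uses `tiktoken` for real token counting, and the `ChunkingConfig` field names and defaults here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ChunkingConfig:
    max_tokens: int = 1000  # illustrative default chunk size
    overlap: int = 100      # tokens carried across chunk boundaries

def chunk_tokens(tokens: list[str], cfg: ChunkingConfig) -> list[list[str]]:
    """Split a token list into overlapping chunks so context survives
    document-boundary cuts instead of being truncated away."""
    if not tokens:
        return []
    if cfg.overlap >= cfg.max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = cfg.max_tokens - cfg.overlap
    return [tokens[i:i + cfg.max_tokens]
            for i in range(0, max(len(tokens) - cfg.overlap, 1), step)]
```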

---

### 2. Connection Pooling for HTTP Requests

**Goal:** Improve throughput and reduce rate-limiting for batch operations.

**Design:**

```python
from dataclasses import dataclass
import aiohttp
@@ -94,7 +100,7 @@ class PoolConfig:

**Dependencies:** `aiohttp` for async pooling
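
The plan relies on `aiohttp` for pooling; the underlying idea of capping the number of in-flight requests can be sketched with the stdlib alone:

```python
import asyncio

async def gather_limited(coro_fns, limit=4):
    """Run coroutine factories with at most `limit` in flight,
    approximating a connection-pool cap on concurrent requests."""
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:  # blocks when `limit` requests are already running
            return await fn()

    return await asyncio.gather(*(run(fn) for fn in coro_fns))
```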

---

### 3. PDF Remote Converter (LibreOffice Alternative)

@@ -103,6 +109,7 @@ class PoolConfig:
**Repository:** https://forge.3gpp.org/rep/reimes/pdf-remote-converter

**Design:**

```python
from enum import Enum
from dataclasses import dataclass
@@ -126,6 +133,7 @@ class ConverterConfig:
**Dependencies:** Add `pdf-remote-converter` as optional extra

**Installation:**

```bash
# Users without LibreOffice can install:
uv sync --extra pdf-remote
@@ -134,7 +142,7 @@ uv sync --extra pdf-remote
export PDF_REMOTE_API_KEY="your-api-key"
```
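
A minimal sketch of the intended fallback chain (LibreOffice primary, remote converter as fallback); the function name and converter signatures are illustrative, not the package API:

```python
def convert_with_fallback(path, converters):
    """Try converters in order until one succeeds.

    Each entry is (name, callable); a converter raises RuntimeError
    when unavailable (e.g. LibreOffice not installed).
    """
    errors = []
    for name, convert in converters:
        try:
            return convert(path)
        except RuntimeError as exc:
            errors.append(f"{name}: {exc}")  # record and try the next one
    raise RuntimeError("all converters failed: " + "; ".join(errors))
```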

---

## Error Handling Patterns

@@ -173,7 +181,7 @@ class ClassificationError(TDocError):
| Conversion failed | `[red]Error: Failed to convert S4-260001: LibreOffice not available[/red]` | 1 |
| Classification failed | `[red]Error: Cannot determine primary file for 'S4-260001'[/red]` | 1 |
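
The table maps failures to CLI messages and exit code 1. A minimal sketch of the hierarchy behind it; only `TDocError` and `ClassificationError` appear in the source, so `ConversionError` and the exit-code helper (including exit 2 for unexpected errors) are assumptions:

```python
class TDocError(Exception):
    """Base class for pipeline errors (rendered as the CLI messages above)."""
    exit_code = 1

class ConversionError(TDocError):
    """Document conversion failed (e.g. LibreOffice not available)."""

class ClassificationError(TDocError):
    """The primary file for a TDoc could not be determined."""

def exit_code_for(exc: Exception) -> int:
    # Known pipeline errors exit 1; anything else is treated as unexpected.
    return exc.exit_code if isinstance(exc, TDocError) else 2
```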

---

## Rollback Plan

@@ -199,7 +207,7 @@ rm ~/.3gpp-crawler/checkout/TSG_SA/WG4_CODEC/TSGS4_131-bis-e/Docs/S4-260001/.ai/
3gpp-ai workspace clear
```

---

## CLI Examples

@@ -252,7 +260,7 @@ echo "## Summary" && 3gpp-ai summarize S4-260001 --words 100
3gpp-ai rag query "What are the codec requirements?"
```

---

## Performance Metrics

@@ -331,7 +339,7 @@ except Exception as e:
    ))
```

---

## Test Results

@@ -352,38 +360,42 @@ except Exception as e:
| `3gpp-ai summarize S4-260001 --words 100` | 100-word summary | ⏳ Ready to run |
| `3gpp-ai workspace create test-ws` | Workspace created | ⏳ Ready to run |

---

## Context

This plan addresses the adaptation of `summarize` and `convert` commands for the 3GPP AI document processing pipeline. The goal is to provide robust, user-friendly commands for document conversion and summarization with proper caching, error handling, and integration with the LightRAG knowledge graph.

---

## Phases

1. **Phase 1: Core Implementation**

   - Implement `convert_tdoc_to_markdown()` with caching
   - Implement `summarize_document()` with LLM integration
   - Add CLI commands with proper options

2. **Phase 2: Error Handling**

   - Define exception hierarchy
   - Add CLI error messages
   - Implement rollback patterns

3. **Phase 3: Performance**

   - Implement semantic chunking
   - Add connection pooling
   - Integrate pdf-remote-converter
   - Fix soffice.exe console window on Windows

4. **Phase 4: Metrics & Monitoring**

   - Implement MetricsTracker
   - Add performance logging
   - Create dashboard/reporting
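
Phase 4 names a `MetricsTracker` but does not define its API, so everything in this sketch is an assumption about what a minimal timing tracker could look like:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class MetricsTracker:
    """Per-operation wall-clock timings; API is illustrative only."""

    def __init__(self):
        self.timings = defaultdict(list)  # operation name -> list of durations

    @contextmanager
    def track(self, operation: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[operation].append(time.perf_counter() - start)

    def report(self) -> dict[str, float]:
        """Average duration per operation, in seconds."""
        return {op: sum(t) / len(t) for op, t in self.timings.items()}
```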

---

## Key Files to Create/Modify

@@ -396,7 +408,7 @@ This plan addresses the adaptation of `summarize` and `convert` commands for the
| `src/tdoc_crawler/http_client/session.py` | Add connection pooling | ✅ Implemented |
| `packages/convert-lo/convert_lo/converter.py` | soffice.exe window fix | ✅ Fixed |

---

## Dependencies

@@ -408,7 +420,7 @@ This plan addresses the adaptation of `summarize` and `convert` commands for the
| `aiohttp` | Connection pooling | ✅ (via requests adapter) |
| `pdf-remote-converter` | Remote PDF conversion | ✅ Implemented |

---

## Risks and Unknowns

@@ -420,16 +432,16 @@ This plan addresses the adaptation of `summarize` and `convert` commands for the
| Memory pressure on large PDFs | Keep kreuzberg approach | ✅ Current |
| soffice.exe console popup on Windows | CREATE_NO_WINDOW flag | ✅ Fixed |

---

## Decisions

1. **Caching Strategy:** Store converted markdown in `.ai/` subdirectory per document
2. **Error Handling:** Custom exception hierarchy with specific error types
3. **PDF Conversion:** LibreOffice primary, remote converter fallback
4. **Chunking:** Semantic chunking for large documents (Phase 3)

---

## Progress

@@ -443,7 +455,7 @@ This plan addresses the adaptation of `summarize` and `convert` commands for the
- [x] soffice.exe console window fix
- [x] Performance metrics tracking

---

## Implementation Status

@@ -459,20 +471,23 @@ This plan addresses the adaptation of `summarize` and `convert` commands for the
| soffice.exe window fix | ✅ Fixed | - |
| Metrics tracking | ✅ Implemented | - |

---

## References

### Core Implementation

- `packages/3gpp-ai/threegpp_ai/operations/convert.py` - Convert operations
- `packages/3gpp-ai/threegpp_ai/operations/summarize.py` - Summarize operations
- `src/tdoc_crawler/cli/ai_app.py` - CLI commands

### Dependencies

- [kreuzberg](https://github.com/Goldziher/kreuzberg) - Text extraction
- [convert-lo](https://github.com/monim67/convert-lo) - LibreOffice conversion
- [pdf-remote-converter](https://forge.3gpp.org/rep/reimes/pdf-remote-converter) - Remote PDF conversion

### Related Work

- LightRAG integration for knowledge graph
- Workspace management for batch processing