Commit bf189dd5 authored by Jan Reimes

📝 docs: update development and migration documentation

parent 8552ef40
@@ -9,15 +9,18 @@ This guide describes how to set up your environment for contributing to `3gpp-cr
1. Clone the repository:

   ```bash
   git clone https://forge.3gpp.org/rep/reimes/3gpp-crawler.git
   cd 3gpp-crawler
   ```

2. Sync dependencies:

   ```bash
   uv sync --all-extras
   ```

3. Install pre-commit hooks:

@@ -43,20 +43,23 @@ Replace the existing 3gpp-ai pipeline with **LightRAG** (Python GraphRAG framewo
### Key Components

1. **LightRAG** (Python library)

   - `pip install lightrag-hku` / `uv add lightrag-hku`
   - Async-first API (`await rag.ainsert()`, `await rag.aquery()`)
   - LLM-powered entity-relationship extraction with gleaning
   - 6 query modes, citation tracking, workspace isolation
   - Optional REST API server via `lightrag-server`

2. **pg0** (Embedded PostgreSQL)

   - Zero-config, single binary (~50MB)
   - PostgreSQL 18 + pgvector 0.8.1 bundled
   - Python SDK: `from pg0 import Pg0`
   - Data stored in `~/.pg0/instances/<name>/data/`
   - Note: Does NOT include Apache AGE (graph extension)

3. **Storage Split**

   - **pg0 handles:** KV cache, vector embeddings, doc status (3 of 4 storage types)
   - **NetworkX handles:** Entity-relationship graph (file-based, LightRAG default)
   - This avoids the Apache AGE dependency entirely
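
The storage split above can be sketched as LightRAG constructor wiring. This is a configuration sketch only: the storage class names and keyword arguments follow LightRAG's pluggable-storage convention but should be verified against the installed `lightrag-hku` version before use.

```python
# Sketch only: backend names and parameters are assumptions to verify
# against the installed lightrag-hku release.
from lightrag import LightRAG

rag = LightRAG(
    working_dir="./rag_storage",
    workspace="sa4-meeting-131",              # workspace isolation per meeting
    kv_storage="PGKVStorage",                 # pg0: LLM/embedding cache
    vector_storage="PGVectorStorage",         # pg0: pgvector embeddings
    doc_status_storage="PGDocStatusStorage",  # pg0: document status
    graph_storage="NetworkXStorage",          # file-based graph, no Apache AGE
)
```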
@@ -546,12 +549,14 @@ Replace the existing 3gpp-ai pipeline with **LightRAG** (Python GraphRAG framewo
The existing tdoc-crawler already stores structured 3GPP metadata (TDoc IDs, spec references, meeting codes, companies) in SQLite. Since the LightRAG stage receives converted PDF files plus fixed-key metadata dictionaries, enrichment should be schema-first and deterministic instead of ad hoc dictionary lookups.

- [ ] Define a typed metadata contract (Pydantic model) for enrichment with at least:

  - `tdoc_id` (required)
  - `title`, `source`, `meeting` (optional)
  - `spec_refs` (list[str], default empty list)
  - `release`, `wg` (optional, if available from SQLite)

- [ ] Add normalization before enrichment:

  - `tdoc_id` uppercased and stripped
  - `spec_refs` normalized into stable display form
  - missing required fields cause explicit skip/error status
@@ -618,11 +623,13 @@ LightRAG supports `edit_entity()` and `create_entity()` APIs. For known 3GPP ent
#### 5.2 Update Dependencies

- [ ] Remove from `pyproject.toml`:

  - `lancedb` (replaced by pg0/pgvector)
  - `sentence-transformers` (handled by LightRAG embedding functions)
  - `edgequake-sdk` (dropped)

- [ ] Add to `pyproject.toml`:

  - `lightrag-hku`
  - `pg0-embedded`

@@ -673,18 +680,23 @@ OpenSearch runs on Windows via zip archive (no Docker). See investigation report
## Open Questions

1. **LLM Provider Selection:** Default to Ollama (local, free) or require cloud provider?

   - Recommendation: Ollama for development. LightRAG recommends 32B+ models for quality entity extraction. Qwen3-30B or similar.

2. **Embedding Provider:** Use Ollama (local) or sentence-transformers?

   - Recommendation: Ollama with qwen3-embedding:0.6b (available in 3gpp-ai workspace)

3. **Graph Storage Scaling:** NetworkX (file-based) vs. PostgreSQL AGE vs. OpenSearch?

   - Recommendation: Start with NetworkX. For typical 3GPP meeting scope (100-500 TDocs), file-based graph is sufficient. Migrate to OpenSearch only if performance becomes an issue.

4. **Metadata Contract Strictness:** Which metadata keys are required to proceed with insertion?

   - Recommendation: Require `tdoc_id`; treat all other fields as optional but normalized. On missing required fields, emit explicit skip/error result.

5. **LightRAG WebUI:** Deploy alongside CLI?

   - Recommendation: Optional. `lightrag-server` provides an Ollama-compatible REST API + React WebUI for graph visualization. Useful for debugging, not required for pipeline.

## Progress
@@ -714,10 +726,10 @@ OpenSearch runs on Windows via zip archive (no Docker). See investigation report
- [x] (2026-03-23) Phase 5.4 complete: Removed broken CLI commands (ai query/process/status)
- [x] (2026-03-23) Phase 6.1 complete: Unit tests (test_lightrag_config.py: 11 tests, test_metadata.py: 11 tests)
- [x] (2026-03-23) Phase 6.2 complete: Integration tests (test_integration.py: 4 async tests)
- [x] (2026-03-23) Phase 6.3 validation: Startup <10s ✓, Insert/query works ✓, Workspace isolation ✓
- [x] (2026-03-23) Rename fix: Updated PLAN.md defaults from `tdoc-crawler` to `3gpp-crawler`
- [ ] Phase 7: 3gpp-ai package cleanup (dead code removal, linter issues)
- [ ] Phase 6.3 validation: Query latency <2s (manual validation needed)
- [ ] Phase 6.3 validation: Idempotency (manual validation needed)

### Phase 7: 3gpp-ai Package Cleanup
@@ -744,6 +756,7 @@ The following modules are obsolete (replaced by LightRAG pipeline):
Current state exports 40+ symbols including legacy ones. Target: clean separation.

**Remove from exports:**

- `AiConfig` (replaced by `LightRAGConfig`)
- `AiStorage` (LanceDB, dead)
- `EmbeddingsManager` (sentence-transformers, dead)
@@ -752,6 +765,7 @@ Current state exports 40+ symbols including legacy ones. Target: clean separatio
- `GraphNode`, `GraphEdge` (custom graph schema, dead)

**Keep exports:**

```python
# Document operations (still used by both pipelines)
from tdoc_ai.operations.convert import convert_tdoc as convert_document
@@ -805,6 +819,7 @@ Due to `tdoc-crawler` → `3gpp-crawler` rename:
#### 7.5 Verify No Regressions

Before declaring Phase 7 complete, verify:

- [ ] `3gpp-ai --help` shows all commands
- [ ] `3gpp-ai rag status` works
- [ ] `3gpp-ai rag query "test"` works
# PLAN: Adapt `summarize` and `convert` Commands

**Status:** ✅ All Phases Complete\
**Last Updated:** 2026-03-24

**Completion:** All phases implemented.\
**NOTE:** See "Known Limitations" section for current constraints and "Planned Improvements" for roadmap.

---

## Known Limitations

@@ -20,24 +20,28 @@
### Current Mitigations

1. **Large Document Handling**:

   - Documents >500K tokens are truncated to `MAX_TOKENS` (default 100K)
   - Truncation may lose important content at the end of documents
   - Cache stored in `.ai/` subdirectory

2. **Rate Limiting**:

   - 3GPP FTP may rate-limit requests
   - `fetch_tdoc_files()` implements retry logic with exponential backoff
   - Max 3 retries with 5-second delays

3. **LibreOffice Dependency**:

   - `convert-lo` requires LibreOffice installed on the system
   - Graceful error message if not available

4. **Memory Pressure**:

   - Large PDFs may consume significant memory during extraction
   - `kreuzberg` handles this internally but large documents (>100 pages) may be slow
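
The retry behavior under mitigation 2 can be sketched as follows; `fetch_with_retry`, the exception type, and the exact backoff schedule are illustrative stand-ins, not the `fetch_tdoc_files()` implementation:

```python
import time

def fetch_with_retry(fetch, max_retries=3, base_delay=5.0, sleep=time.sleep):
    """Retry a fetch callable with exponential backoff.

    One initial attempt plus up to `max_retries` retries, delayed
    base_delay * 2**attempt seconds (5s, 10s, 20s with defaults).
    The `sleep` hook is injectable so tests need not wait.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```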

---

## Planned Improvements

@@ -46,6 +50,7 @@
**Goal:** Preserve context across document boundaries instead of simple truncation.

**Design:**

```python
from enum import Enum
from dataclasses import dataclass
@@ -69,13 +74,14 @@ class ChunkingConfig:

**Dependencies:** `tiktoken` for token counting
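
The chunking idea can be sketched with a stdlib stand-in. The design above uses `tiktoken` for real token counting, and the `ChunkingConfig` field names and defaults here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ChunkingConfig:
    max_tokens: int = 1000  # illustrative default chunk size
    overlap: int = 100      # tokens carried across chunk boundaries

def chunk_tokens(tokens: list[str], cfg: ChunkingConfig) -> list[list[str]]:
    """Split a token list into overlapping chunks so context survives
    document-boundary cuts instead of being truncated away."""
    if not tokens:
        return []
    if cfg.overlap >= cfg.max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = cfg.max_tokens - cfg.overlap
    return [tokens[i:i + cfg.max_tokens]
            for i in range(0, max(len(tokens) - cfg.overlap, 1), step)]
```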

---

### 2. Connection Pooling for HTTP Requests

**Goal:** Improve throughput and reduce rate-limiting for batch operations.

**Design:**

```python
from dataclasses import dataclass
import aiohttp
@@ -94,7 +100,7 @@ class PoolConfig:

**Dependencies:** `aiohttp` for async pooling
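
The plan relies on `aiohttp` for pooling; the underlying idea of capping the number of in-flight requests can be sketched with the stdlib alone:

```python
import asyncio

async def gather_limited(coro_fns, limit=4):
    """Run coroutine factories with at most `limit` in flight,
    approximating a connection-pool cap on concurrent requests."""
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:  # blocks when `limit` requests are already running
            return await fn()

    return await asyncio.gather(*(run(fn) for fn in coro_fns))
```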

---

### 3. PDF Remote Converter (LibreOffice Alternative)

@@ -103,6 +109,7 @@ class PoolConfig:
**Repository:** https://forge.3gpp.org/rep/reimes/pdf-remote-converter

**Design:**

```python
from enum import Enum
from dataclasses import dataclass
@@ -126,6 +133,7 @@ class ConverterConfig:
**Dependencies:** Add `pdf-remote-converter` as optional extra

**Installation:**

```bash
# Users without LibreOffice can install:
uv sync --extra pdf-remote
@@ -134,7 +142,7 @@ uv sync --extra pdf-remote
export PDF_REMOTE_API_KEY="your-api-key"
```
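
A minimal sketch of the intended fallback chain (LibreOffice primary, remote converter as fallback); the function name and converter signatures are illustrative, not the package API:

```python
def convert_with_fallback(path, converters):
    """Try converters in order until one succeeds.

    Each entry is (name, callable); a converter raises RuntimeError
    when unavailable (e.g. LibreOffice not installed).
    """
    errors = []
    for name, convert in converters:
        try:
            return convert(path)
        except RuntimeError as exc:
            errors.append(f"{name}: {exc}")  # record and try the next one
    raise RuntimeError("all converters failed: " + "; ".join(errors))
```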

---

## Error Handling Patterns

@@ -173,7 +181,7 @@ class ClassificationError(TDocError):
| Conversion failed | `[red]Error: Failed to convert S4-260001: LibreOffice not available[/red]` | 1 |
| Classification failed | `[red]Error: Cannot determine primary file for 'S4-260001'[/red]` | 1 |
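
The table maps failures to CLI messages and exit code 1. A minimal sketch of the hierarchy behind it; only `TDocError` and `ClassificationError` appear in the source, so `ConversionError` and the exit-code helper (including exit 2 for unexpected errors) are assumptions:

```python
class TDocError(Exception):
    """Base class for pipeline errors (rendered as the CLI messages above)."""
    exit_code = 1

class ConversionError(TDocError):
    """Document conversion failed (e.g. LibreOffice not available)."""

class ClassificationError(TDocError):
    """The primary file for a TDoc could not be determined."""

def exit_code_for(exc: Exception) -> int:
    # Known pipeline errors exit 1; anything else is treated as unexpected.
    return exc.exit_code if isinstance(exc, TDocError) else 2
```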

---

## Rollback Plan

@@ -199,7 +207,7 @@ rm ~/.3gpp-crawler/checkout/TSG_SA/WG4_CODEC/TSGS4_131-bis-e/Docs/S4-260001/.ai/
3gpp-ai workspace clear
```

---

## CLI Examples

@@ -252,7 +260,7 @@ echo "## Summary" && 3gpp-ai summarize S4-260001 --words 100
3gpp-ai rag query "What are the codec requirements?"
```

---

## Performance Metrics

@@ -331,7 +339,7 @@ except Exception as e:
    ))
```

---

## Test Results

@@ -352,38 +360,42 @@ except Exception as e:
| `3gpp-ai summarize S4-260001 --words 100` | 100-word summary | ⏳ Ready to run |
| `3gpp-ai workspace create test-ws` | Workspace created | ⏳ Ready to run |

---

## Context

This plan addresses the adaptation of `summarize` and `convert` commands for the 3GPP AI document processing pipeline. The goal is to provide robust, user-friendly commands for document conversion and summarization with proper caching, error handling, and integration with the LightRAG knowledge graph.

---

## Phases

1. **Phase 1: Core Implementation**

   - Implement `convert_tdoc_to_markdown()` with caching
   - Implement `summarize_document()` with LLM integration
   - Add CLI commands with proper options

2. **Phase 2: Error Handling**

   - Define exception hierarchy
   - Add CLI error messages
   - Implement rollback patterns

3. **Phase 3: Performance**

   - Implement semantic chunking
   - Add connection pooling
   - Integrate pdf-remote-converter
   - Fix soffice.exe console window on Windows

4. **Phase 4: Metrics & Monitoring**

   - Implement MetricsTracker
   - Add performance logging
   - Create dashboard/reporting
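
Phase 4 names a `MetricsTracker` but does not define its API, so everything in this sketch is an assumption about what a minimal timing tracker could look like:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class MetricsTracker:
    """Per-operation wall-clock timings; API is illustrative only."""

    def __init__(self):
        self.timings = defaultdict(list)  # operation name -> list of durations

    @contextmanager
    def track(self, operation: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[operation].append(time.perf_counter() - start)

    def report(self) -> dict[str, float]:
        """Average duration per operation, in seconds."""
        return {op: sum(t) / len(t) for op, t in self.timings.items()}
```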

---

## Key Files to Create/Modify

@@ -396,7 +408,7 @@ This plan addresses the adaptation of `summarize` and `convert` commands for the
| `src/tdoc_crawler/http_client/session.py` | Add connection pooling | ✅ Implemented |
| `packages/convert-lo/convert_lo/converter.py` | soffice.exe window fix | ✅ Fixed |

---

## Dependencies

@@ -408,7 +420,7 @@ This plan addresses the adaptation of `summarize` and `convert` commands for the
| `aiohttp` | Connection pooling | ✅ (via requests adapter) |
| `pdf-remote-converter` | Remote PDF conversion | ✅ Implemented |

---

## Risks and Unknowns

@@ -420,16 +432,16 @@ This plan addresses the adaptation of `summarize` and `convert` commands for the
| Memory pressure on large PDFs | Keep kreuzberg approach | ✅ Current |
| soffice.exe console popup on Windows | CREATE_NO_WINDOW flag | ✅ Fixed |

---

## Decisions

1. **Caching Strategy:** Store converted markdown in `.ai/` subdirectory per document
2. **Error Handling:** Custom exception hierarchy with specific error types
3. **PDF Conversion:** LibreOffice primary, remote converter fallback
4. **Chunking:** Semantic chunking for large documents (Phase 3)

---

## Progress

@@ -443,7 +455,7 @@ This plan addresses the adaptation of `summarize` and `convert` commands for the
- [x] soffice.exe console window fix
- [x] Performance metrics tracking

---

## Implementation Status

@@ -459,20 +471,23 @@ This plan addresses the adaptation of `summarize` and `convert` commands for the
| soffice.exe window fix | ✅ Fixed | - |
| Metrics tracking | ✅ Implemented | - |

---

## References

### Core Implementation

- `packages/3gpp-ai/threegpp_ai/operations/convert.py` - Convert operations
- `packages/3gpp-ai/threegpp_ai/operations/summarize.py` - Summarize operations
- `src/tdoc_crawler/cli/ai_app.py` - CLI commands

### Dependencies

- [kreuzberg](https://github.com/Goldziher/kreuzberg) - Text extraction
- [convert-lo](https://github.com/monim67/convert-lo) - LibreOffice conversion
- [pdf-remote-converter](https://forge.3gpp.org/rep/reimes/pdf-remote-converter) - Remote PDF conversion

### Related Work

- LightRAG integration for knowledge graph
- Workspace management for batch processing