Commit c92cc3f8 authored by Jan Reimes

🔥 chore(docs): remove FUTURE-PLAN.md and PLAN.md (moved into beads issues)

parent 90403586
+2 −1
Original line number Diff line number Diff line
@@ -253,3 +253,4 @@ src/teddi-mcp/uv.lock

# GSD planning docs (local only)
.planning/
/PLAN.md
\ No newline at end of file

FUTURE-PLAN.md

deleted 100644 → 0
+0 −126
# Future Plan: 3GPP AI Pipeline Enhancements

**Status:** Backlog
**Last Updated:** 2026-03-24

This document captures future enhancements that are not currently prioritized but may be valuable in future development cycles.

---

## 1. LightRAG Integration Details

Document the internal architecture of LightRAG integration, including entity extraction patterns, relationship types, and graph traversal strategies. This would help developers understand how TDoc content flows through the knowledge graph and enable customization of entity types for domain-specific concepts like "codec," "specification," and "working group."

---

## 2. Multi-File TDoc Handling

Enhance `classify.py` to handle TDocs with multiple files (e.g., presentation + document + spreadsheet) by implementing priority rules and content merging strategies. Currently, the system picks a primary file, but future versions could combine content from multiple files or allow users to specify which file to process.
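A minimal sketch of one possible priority rule. The extension ranking and the helper name are illustrative only, not the actual `classify.py` logic:

```python
from pathlib import Path

# Hypothetical priority table: lower rank wins; unknown extensions rank last.
PRIORITY = {".docx": 0, ".doc": 1, ".pdf": 2, ".pptx": 3, ".xlsx": 4}

def pick_primary_file(files: list[Path]) -> Path:
    """Return the highest-priority file from a multi-file TDoc."""
    return min(files, key=lambda p: PRIORITY.get(p.suffix.lower(), 99))
```

A content-merging strategy would replace the `min()` selection with a loop that extracts from each file in rank order and concatenates the results.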

---

## 3. Cache Behavior and Invalidation

Implement automatic cache invalidation when source documents change, and add size limits for the `.ai/` cache directory. This would include TTL-based expiration, checksum-based change detection, and a CLI command to inspect and manage cache state across workspaces.

---

## 4. Workspace Integration Examples

Create comprehensive examples showing how to integrate 3GPP AI commands into CI/CD pipelines, automated reporting workflows, and research tools. These examples would demonstrate batch processing patterns, scheduled workspace updates, and integration with external analysis tools.

---

## 5. Dependency Version Compatibility Matrix

Document which versions of LibreOffice, Python, and other dependencies are known to work with each release of the 3GPP AI pipeline. This matrix would help users troubleshoot compatibility issues and plan upgrades, especially for the LibreOffice conversion layer which has version-specific behaviors.

---

## 6. Troubleshooting Guide

Create a dedicated troubleshooting document covering common issues like "LibreOffice not found," "rate limiting errors," "out of memory on large PDFs," and "LightRAG query returns no results." Each issue would include symptoms, root causes, diagnostic commands, and resolution steps.

---

## 7. Streaming Extraction for Large Documents

Implement streaming extraction that processes documents in chunks rather than loading entirely into memory. This would enable handling of very large specifications (>500 pages) without memory pressure, using kreuzberg's streaming capabilities combined with incremental LightRAG ingestion.
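A sketch of the chunking side only, assuming the extractor can be driven by page ranges; kreuzberg's actual streaming API is not shown here:

```python
from collections.abc import Iterator

def iter_page_batches(page_count: int, batch_size: int = 50) -> Iterator[range]:
    """Yield page ranges so a large document is never held in memory whole."""
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))
```

Each batch would be extracted and fed to LightRAG before the next is read, which is what bounds peak memory.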

---

## 8. Multi-Language Document Support

Add support for processing TDocs in languages other than English, including language detection, translation integration, and language-aware summarization. This would be particularly useful for regional contributions and historical documents that may not be in English.

---

## 9. Incremental Graph Updates

Implement incremental updates to the LightRAG knowledge graph when documents are modified or added, rather than rebuilding the entire graph. This would significantly reduce processing time for large workspaces and enable near-real-time updates when new TDocs are published.

---

## 10. Export and Integration APIs

Add export capabilities for the knowledge graph in formats like GraphML, RDF, or JSON-LD to enable integration with external tools like Neo4j, Gephi, or custom analysis pipelines. This would also include webhook support for notifying external systems when processing completes.
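A stdlib-only sketch of the GraphML direction; the node and edge tuples are placeholders for whatever the LightRAG entity/relation export would produce:

```python
import xml.etree.ElementTree as ET

def to_graphml(nodes: list[str], edges: list[tuple[str, str]]) -> str:
    """Serialize a tiny directed graph to a GraphML string."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for n in nodes:
        ET.SubElement(graph, "node", id=n)
    for i, (src, dst) in enumerate(edges):
        ET.SubElement(graph, "edge", id=f"e{i}", source=src, target=dst)
    return ET.tostring(root, encoding="unicode")
```

The output imports directly into Gephi or Neo4j's GraphML loader; richer attributes (entity type, weight) would go in GraphML `<key>`/`<data>` elements.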

---

## 11. Improve Workspace Member Listing

Improve the output of the `3gpp-ai workspace list-members` command to include more metadata about each member: file type (docx, pdf, etc.), size, whether a PDF conversion exists, whether a Markdown extraction is available, and whether extracted metadata (figures, tables, equations, etc.) is present.

---

## 12. Non-3gpp-ai Repo Sweep Findings (2026-03-25)

Commands executed:

- `uv run ruff check src tests packages/convert-lo packages/pool_executors`
- `uv run pytest tests tests/convert_lo tests/pool_executor -v`

Summary:

- Lint: 14 errors (Ruff)
- Tests: 9 failed, 14 errors, 309 passed, 12 skipped

Lint findings (grouped):

- `src/tdoc_crawler/cli/ai_app.py`
	- `PLC0415` imports not at top level of file (multiple locations)
	- `PLR0915` too many statements (`workspace_process`)
	- `PLW0603` use of `global _cache_manager`
- `src/tdoc_crawler/cli/crawl.py`
	- `PLR0915` too many statements (`crawl_tdocs`)

Test findings (grouped):

- Fixture mismatch in convert-lo tests (14 errors)
	- Missing fixture: `example_docx_path`
	- Affected files:
		- `tests/convert_lo/test_converter.py`
		- `tests/convert_lo/test_hybrid_converter.py`

- CLI behavior regressions (8 failures)
	- `tests/test_cli.py`
		- `TestStatsCommand::test_stats_basic`
		- `TestOpenCommand::test_open_existing_tdoc`
		- `TestOpenCommand::test_open_with_whatthespec_fallback`
		- `TestOpenCommand::test_open_with_whatthespec_no_credentials_required`
		- `TestCheckoutCommand::test_checkout_with_whatthespec_fallback`
		- `TestEnvironmentVariables::test_env_var_credentials`
		- `TestEnvironmentVariables::test_env_var_prompt_credentials`
		- `TestEnvironmentVariables::test_env_var_multiple_credentials`

- WhatTheSpec resolution regression (1 failure)
	- `tests/test_whatthespec.py`
		- `TestWhatTheSpecResolution::test_meeting_id_lazy_resolution`

Follow-up backlog tasks:

- Add/restore a canonical `example_docx_path` fixture or align convert-lo tests to existing fixture names.
- Refactor `ai_app.py` and `crawl.py` for top-level imports and reduced cyclomatic complexity.
- Investigate CLI open/checkout execution path to restore `prepare_tdoc_file`/`checkout_tdoc` call expectations in tests.
- Investigate credentials env resolution path in CLI tests (`test_env_var_*credentials`).
- Investigate meeting ID lazy resolution logic in WhatTheSpec path.

PLAN.md

deleted 100644 → 0
+0 −315
# PLAN: Structured Extraction Artifact Storage

**Feature:** Store extracted tables, figures, equations in organized `.ai` subfolders with temp-then-commit pattern

**Created:** 2026-03-26 00:30

---

## Quality Guidelines (Phase 0)

**MUST follow for all implementation work:**

| Tool/Skill | Use For | When |
|------------|---------|------|
| `CytoScnPy` MCP server | Static analysis (unused code, complexity, security) | Before committing code |
| `debugging-code` skill + `dab` CLI | Interactive debugging (step-through, breakpoints, inspect state) | Instead of guessing/printing |
| `grepai` MCP server | Find similar code, methods, existing functionality | Before writing new code |
| `test-driven-development` skill | Writing tests first, red-green-refactor | All new functionality |
| `python-pro` skill | Modern Python 3.12+ patterns, async | Implementation |
| `python-standards` skill | Code quality, type hints, linting | All Python code |
| `python-testing-patterns` skill | pytest fixtures, mocking, isolation | Tests |
| `stop-slop` skill | Remove AI writing patterns from prose | Documentation/comments |
| `code-deduplication` skill | Prevent duplicate code semantically | Before adding new code |
| `visual-explainer` skill | Generate flowcharts/diagrams | Documentation |

**Rule:** Use debugging tools instead of print-statements or guessing. Use grepai to find existing patterns before implementing new ones.

---

## Quality Gates (Per Phase)

Every phase MUST pass these gates before moving to the next:

1. **Lint:** `ruff check packages/3gpp-ai/`
2. **Type annotations:** `ruff check packages/3gpp-ai/ --select=ANN`
3. **Tests:** `uv run pytest tests/ai/ -v`
4. **Static analysis:** `cytoscnpy analyze-path packages/3gpp-ai/`

If any gate fails, fix before proceeding.

---

## Goal

Refactor extraction artifact storage in both `process` and `convert_md` flows so that:

1. Tables, figures, equations are stored as individual files in `.ai/{tables,figures,equations}/` subfolders
2. Extraction uses a temp folder during processing; artifacts are moved on success
3. Re-extraction is skipped per-artifact-type if the corresponding subfolder already contains data
4. Both flows (`process` command and `convert_md`) share the same storage logic

---

## Context

### Current Storage Patterns

| Flow | File | Storage Location |
|------|------|------------------|
| `convert_md` | markdown | `.ai/{doc_id}.md` |
| `convert_md` | figures bytes | `.ai/figures/figure_N.png` |
| `convert_md` | tables | `.ai/{doc_id}_tables.json` (JSON sidecar) |
| `convert_md` | figures metadata | `.ai/{doc_id}_figures.json` (JSON sidecar) |
| `convert_md` | equations | `.ai/{doc_id}_equations.json` (JSON sidecar) |
| `process` | markdown | `.ai/{stem}.md` |
| `process` | figures bytes | `.ai/figures/figure_N.png` |
| `process` | tables | ❌ NOT stored |
| `process` | equations | ❌ NOT stored |

### Key Files

| File | Role |
|------|------|
| `packages/3gpp-ai/threegpp_ai/operations/extraction_result.py` | Utilities: `persist_figures_from_kreuzberg_result()`, `from_kreuzberg_result()`, `build_structured_extraction_result()` |
| `packages/3gpp-ai/threegpp_ai/operations/convert.py` | `extract_tdoc_structured()`, `_write_structured_sidecars()`, `_read_cached_structured()`, `_build_structured_from_result()` |
| `packages/3gpp-ai/threegpp_ai/lightrag/processor.py` | `TDocProcessor.extract_text()` - main extraction for `process` command |

### Dependencies

- `kreuzberg` for extraction
- `convert-lo` for PDF conversion
- Existing `ExtractedTableElement`, `ExtractedFigureElement`, `ExtractedEquationElement` models

---

## Phases

### Phase 1: Define Artifact Storage Utilities

**File:** `packages/3gpp-ai/threegpp_ai/operations/extraction_result.py`

**Deliverables:**
- Add `ArtifactStorage` class or module-level functions for:
  - `persist_tables_from_extraction(tables: list[ExtractedTableElement], ai_dir: Path, doc_stem: str) -> list[Path]`
  - `persist_equations_from_extraction(equations: list[ExtractedEquationElement], ai_dir: Path, doc_stem: str) -> list[Path]`
  - `read_cached_artifacts(ai_dir: Path, doc_stem: str) -> StructuredExtractionResult`
  - `has_cached_artifacts(ai_dir: Path, doc_stem: str, artifact_types: set[str]) -> bool`
- Use temp folder pattern: write to `.{artifact_type}.tmp`, then atomic move on success
- Naming: `{doc_stem}_{type}_{page}_{index}.json` (e.g., `S4-250638_table_5_1.json`)
- Figure bytes: `{doc_stem}_figure_{page}_{index}.png`
- Unit tests in `tests/ai/test_extraction_result.py` (or extend existing tests)
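A minimal sketch of the naming rule above (an assumed helper, not yet part of `extraction_result.py`): artifacts live in `.ai/{type}s/` and encode doc stem, type, page, and index in the filename.

```python
from pathlib import Path

def artifact_path(ai_dir: Path, doc_stem: str, kind: str,
                  page: int, index: int, suffix: str = ".json") -> Path:
    """Build .ai/{kind}s/{doc_stem}_{kind}_{page}_{index}{suffix}."""
    return ai_dir / f"{kind}s" / f"{doc_stem}_{kind}_{page}_{index}{suffix}"
```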

**Validation:**
```bash
cd /path/to/3gpp-crawler
python -c "from threegpp_ai.operations.extraction_result import ArtifactStorage; print('ArtifactStorage imported')"
uv run pytest tests/ai/test_extraction_result.py -v
```

---

### Phase 2: Refactor `convert.py` to Use Folder Storage

**File:** `packages/3gpp-ai/threegpp_ai/operations/convert.py`

**Deliverables:**
- Modify `extract_tdoc_structured()` to:
  - Check for existing artifact folders before extraction
  - Call new storage functions instead of `_write_structured_sidecars()`
  - Use temp-then-commit pattern
- Replace `_write_structured_sidecars()` with calls to new `ArtifactStorage` functions
- Update `_read_cached_structured()` to read from folder structure
- No backward migration - old sidecars ignored, re-generated if needed
- Extend `tests/ai/test_operations_metrics.py` if it exists, or add tests for conversion

**Validation:**
```bash
cd /path/to/3gpp-crawler
uv run python -c "
from threegpp_ai.operations.convert import extract_tdoc_structured
from pathlib import Path
# Test with a known TDoc that has tables
# extraction = extract_tdoc_structured('S4-250638')
# assert (ai_dir / 'tables').exists()
print('convert.py validation passed')
"
uv run pytest tests/ai/ -v -k "convert or conversion"
```

---

### Phase 3: Refactor `processor.py` to Use Structured Storage

**File:** `packages/3gpp-ai/threegpp_ai/lightrag/processor.py`

**Deliverables:**
- Modify `TDocProcessor.extract_text()` to:
  - Check for existing artifacts in `.ai/tables/` for the doc stem to skip table extraction
  - Check for existing artifacts in `.ai/equations/` for the doc stem to skip equation extraction
  - Persist tables and equations using new storage utilities
  - Store figure metadata alongside figure bytes (not just sidecar)
- Align with `convert.py` storage pattern
- Add/update tests in `tests/ai/test_lightrag_processor.py` or similar

**Validation:**
```bash
cd /path/to/3gpp-crawler
uv run pytest tests/ai/test_lightrag_processor.py -v
# If test file doesn't exist, run: uv run pytest tests/ai/ -v -k "processor or extraction"
```

---

### Phase 4: Add Skip Logic for Re-extraction

**Files:** Both `convert.py` and `processor.py`

**Deliverables:**
- Extraction functions accept `extract_types: set[str]` parameter (e.g., `{"tables", "figures", "equations"}`)
- Skip extraction for types whose subfolders already contain artifacts
- `force=True` bypasses skip logic and re-extracts everything

**Validation:**
```bash
cd /path/to/3gpp-crawler
# Run process twice on same doc - second run should skip extraction
uv run tdoc-crawler ai workspace process -w test-skip
# Check logs show "Skipping X extraction - artifacts exist"
```

---

### Phase 5: Integration Test

**Files:** N/A (end-to-end validation)

**Deliverables:**
- Create integration test that:
  1. Creates a temporary workspace
  2. Adds a TDoc with known tables/figures/equations
  3. Runs `process` command
  4. Verifies all artifact folders contain expected files
  5. Runs `process` again - verifies skip logic works
  6. Cleans up workspace
- Or manual validation with real TDoc

**Validation:**
```bash
cd /path/to/3gpp-crawler
# Manual test:
uv run tdoc-crawler ai workspace create test-integration --no-activate
uv run tdoc-crawler ai workspace add-members -w test-integration --kind tdoc S4-250638
uv run tdoc-crawler ai workspace process -w test-integration
# Verify .ai/{tables,figures,equations}/ folders exist and contain files
```

---

## Storage Structure (Target)

```
.ai/
  {doc_id}.md                           # Main markdown content
  {doc_id}.pdf                          # Converted PDF (office formats)
  
  tables/
    {doc_id}_table_{page}_{index}.json
    ...
    
  figures/
    {doc_id}_figure_{page}_{index}.png
    {doc_id}_figure_{page}_{index}.json  # Figure metadata
    ...
    
  equations/
    {doc_id}_equation_{page}_{index}.json
    ...
```

### File Formats

**Table JSON** (`{doc_id}_table_{page}_{index}.json`):
```json
{
  "element_id": "table_1",
  "page_number": 5,
  "row_count": 3,
  "column_count": 4,
  "cells": [["A1", "B1"], ["A2", "B2"]],
  "markdown": "| A1 | B1 |...",
  "caption": "Table caption if available"
}
```

**Equation JSON** (`{doc_id}_equation_{page}_{index}.json`):
```json
{
  "element_id": "equation_1",
  "page_number": 3,
  "latex": "E = mc^2",
  "raw_text": "E = mc^2"
}
```

**Figure JSON** (`{doc_id}_figure_{page}_{index}.json`):
```json
{
  "element_id": "figure_1",
  "page_number": 7,
  "image_path": "S4-250638_figure_7_1.png",
  "image_format": "png",
  "caption": "Figure caption",
  "description": "Vision-LLM description if available",
  "metadata": {}
}
```

---

## Decisions

| Decision | Rationale | Status |
|----------|-----------|--------|
| Folder structure over JSON sidecars | Easier to inspect/debug, individual file access | ✅ Decided |
| Temp-then-commit pattern | Avoids incomplete artifact sets on failure | ✅ Decided |
| Per-type skip logic | Allows selective re-extraction | ✅ Decided |
| Backward compatibility with old sidecars | No migration - re-generation is fine | ✅ Decided |
| Filename must link to document location | Page number in filename or as first-class field | ✅ Decided |

---

## Artifact Naming Convention

Filenames MUST enable clear assignment back to the source document location (especially page number).

**Pattern:** `{doc_id}_{type}_{page}_{index}.json`

| Artifact | Filename Pattern | Example |
|----------|-----------------|---------|
| Table | `{doc_id}_table_{page}_{index}.json` | `S4-250638_table_5_1.json` |
| Equation | `{doc_id}_equation_{page}_{index}.json` | `S4-250638_equation_3_1.json` |
| Figure | `{doc_id}_figure_{page}_{index}.json` | `S4-250638_figure_7_1.json` |

**Rationale:** Page number in filename allows correlating artifact to source without parsing JSON.
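Correlating without JSON parsing could look like this (illustrative regex; assumes doc IDs such as `S4-250638` contain no underscore followed by an artifact type keyword):

```python
import re

ARTIFACT_RE = re.compile(
    r"^(?P<doc>.+)_(?P<kind>table|figure|equation)_(?P<page>\d+)_(?P<index>\d+)\.\w+$"
)

def parse_artifact_name(name: str) -> dict[str, str]:
    """Recover doc, kind, page, and index from an artifact filename."""
    m = ARTIFACT_RE.match(name)
    if m is None:
        raise ValueError(f"not an artifact filename: {name}")
    return m.groupdict()
```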

---

## Notes

- Temp-then-commit: use `tempfile.mkdtemp()` for temp extraction, then `shutil.move()` to the target folder
- Ensure `pathlib.Path` operations are used (no `os.path`)
- If extraction fails mid-way, temp folder is cleaned up automatically
- On success, move temp folder contents to final location atomically where possible
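The notes above boil down to this pattern (a sketch; `produce` stands in for the real extraction callable, and the move is atomic only when source and target share a filesystem):

```python
import shutil
import tempfile
from pathlib import Path

def commit_artifacts(produce, target: Path) -> None:
    """Run produce() against a scratch dir; move into place only on success."""
    tmp = Path(tempfile.mkdtemp(prefix=".extract-"))
    try:
        produce(tmp)                          # write artifacts into tmp
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(tmp), str(target))    # commit on success
    except BaseException:
        shutil.rmtree(tmp, ignore_errors=True)  # clean up on failure
        raise
```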

---

## Progress

- [x] (2026-03-26 00:30) Created PLAN.md
- [x] Phase 0: Quality gates defined
- [x] Phase 1: Implement ArtifactStorage utilities (extraction_result.py)
- [x] Phase 2: Refactor convert.py to folder storage
- [x] Phase 3: Refactor processor.py to folder storage
- [x] Phase 4: Add skip logic for re-extraction (extract_types parameter)
- [x] Phase 5: Integration test / end-to-end validation (`tests/ai/test_ai_extraction_artifacts.py`, 10/10 passing)