Commit c92cc3f8 authored by Jan Reimes

🔥 chore(docs): remove FUTURE-PLAN.md and PLAN.md (moved into beads issues)

parent 90403586
+2 −1
Original line number Diff line number Diff line
@@ -253,3 +253,4 @@ src/teddi-mcp/uv.lock

# GSD planning docs (local only)
.planning/
/PLAN.md
\ No newline at end of file

FUTURE-PLAN.md

deleted 100644 → 0
+0 −126
# Future Plan: 3GPP AI Pipeline Enhancements

**Status:** Backlog
**Last Updated:** 2026-03-24

This document captures future enhancements that are not currently prioritized but may be valuable in future development cycles.

---

## 1. LightRAG Integration Details

Document the internal architecture of LightRAG integration, including entity extraction patterns, relationship types, and graph traversal strategies. This would help developers understand how TDoc content flows through the knowledge graph and enable customization of entity types for domain-specific concepts like "codec," "specification," and "working group."

---

## 2. Multi-File TDoc Handling

Enhance `classify.py` to handle TDocs with multiple files (e.g., presentation + document + spreadsheet) by implementing priority rules and content merging strategies. Currently, the system picks a primary file, but future versions could combine content from multiple files or allow users to specify which file to process.
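A minimal sketch of one possible priority rule. The extension ranking and the helper name are illustrative only, not the actual `classify.py` logic:

```python
from pathlib import Path

# Hypothetical priority table: lower rank wins; unknown extensions rank last.
PRIORITY = {".docx": 0, ".doc": 1, ".pdf": 2, ".pptx": 3, ".xlsx": 4}

def pick_primary_file(files: list[Path]) -> Path:
    """Return the highest-priority file from a multi-file TDoc."""
    return min(files, key=lambda p: PRIORITY.get(p.suffix.lower(), 99))
```

A content-merging strategy would replace the `min()` selection with a loop that extracts from each file in rank order and concatenates the results.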

---

## 3. Cache Behavior and Invalidation

Implement automatic cache invalidation when source documents change, and add size limits for the `.ai/` cache directory. This would include TTL-based expiration, checksum-based change detection, and a CLI command to inspect and manage cache state across workspaces.

---

## 4. Workspace Integration Examples

Create comprehensive examples showing how to integrate 3GPP AI commands into CI/CD pipelines, automated reporting workflows, and research tools. These examples would demonstrate batch processing patterns, scheduled workspace updates, and integration with external analysis tools.

---

## 5. Dependency Version Compatibility Matrix

Document which versions of LibreOffice, Python, and other dependencies are known to work with each release of the 3GPP AI pipeline. This matrix would help users troubleshoot compatibility issues and plan upgrades, especially for the LibreOffice conversion layer which has version-specific behaviors.

---

## 6. Troubleshooting Guide

Create a dedicated troubleshooting document covering common issues like "LibreOffice not found," "rate limiting errors," "out of memory on large PDFs," and "LightRAG query returns no results." Each issue would include symptoms, root causes, diagnostic commands, and resolution steps.

---

## 7. Streaming Extraction for Large Documents

Implement streaming extraction that processes documents in chunks rather than loading entirely into memory. This would enable handling of very large specifications (>500 pages) without memory pressure, using kreuzberg's streaming capabilities combined with incremental LightRAG ingestion.
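A sketch of the chunking side only, assuming the extractor can be driven by page ranges; kreuzberg's actual streaming API is not shown here:

```python
from collections.abc import Iterator

def iter_page_batches(page_count: int, batch_size: int = 50) -> Iterator[range]:
    """Yield page ranges so a large document is never held in memory whole."""
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))
```

Each batch would be extracted and fed to LightRAG before the next is read, which is what bounds peak memory.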

---

## 8. Multi-Language Document Support

Add support for processing TDocs in languages other than English, including language detection, translation integration, and language-aware summarization. This would be particularly useful for regional contributions and historical documents that may not be in English.

---

## 9. Incremental Graph Updates

Implement incremental updates to the LightRAG knowledge graph when documents are modified or added, rather than rebuilding the entire graph. This would significantly reduce processing time for large workspaces and enable near-real-time updates when new TDocs are published.

---

## 10. Export and Integration APIs

Add export capabilities for the knowledge graph in formats like GraphML, RDF, or JSON-LD to enable integration with external tools like Neo4j, Gephi, or custom analysis pipelines. This would also include webhook support for notifying external systems when processing completes.
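A stdlib-only sketch of the GraphML direction; the node and edge tuples are placeholders for whatever the LightRAG entity/relation export would produce:

```python
import xml.etree.ElementTree as ET

def to_graphml(nodes: list[str], edges: list[tuple[str, str]]) -> str:
    """Serialize a tiny directed graph to a GraphML string."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for n in nodes:
        ET.SubElement(graph, "node", id=n)
    for i, (src, dst) in enumerate(edges):
        ET.SubElement(graph, "edge", id=f"e{i}", source=src, target=dst)
    return ET.tostring(root, encoding="unicode")
```

The output imports directly into Gephi or Neo4j's GraphML loader; richer attributes (entity type, weight) would go in GraphML `<key>`/`<data>` elements.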

---

## 11. Improve Workspace Member Listing

Improve the output of the `3gpp-ai workspace list-members` command to include more metadata about each member: file type (docx, pdf, etc.), size, whether a PDF conversion exists, whether a Markdown extraction is available, and whether extracted metadata (figures, tables, equations, etc.) is present.

---

## 12. Non-3gpp-ai Repo Sweep Findings (2026-03-25)

Commands executed:

- `uv run ruff check src tests packages/convert-lo packages/pool_executors`
- `uv run pytest tests tests/convert_lo tests/pool_executor -v`

Summary:

- Lint: 14 errors (Ruff)
- Tests: 9 failed, 14 errors, 309 passed, 12 skipped

Lint findings (grouped):

- `src/tdoc_crawler/cli/ai_app.py`
	- `PLC0415` imports not at top level of file (multiple locations)
	- `PLR0915` too many statements (`workspace_process`)
	- `PLW0603` use of `global _cache_manager`
- `src/tdoc_crawler/cli/crawl.py`
	- `PLR0915` too many statements (`crawl_tdocs`)

Test findings (grouped):

- Fixture mismatch in convert-lo tests (14 errors)
	- Missing fixture: `example_docx_path`
	- Affected files:
		- `tests/convert_lo/test_converter.py`
		- `tests/convert_lo/test_hybrid_converter.py`

- CLI behavior regressions (8 failures)
	- `tests/test_cli.py`
		- `TestStatsCommand::test_stats_basic`
		- `TestOpenCommand::test_open_existing_tdoc`
		- `TestOpenCommand::test_open_with_whatthespec_fallback`
		- `TestOpenCommand::test_open_with_whatthespec_no_credentials_required`
		- `TestCheckoutCommand::test_checkout_with_whatthespec_fallback`
		- `TestEnvironmentVariables::test_env_var_credentials`
		- `TestEnvironmentVariables::test_env_var_prompt_credentials`
		- `TestEnvironmentVariables::test_env_var_multiple_credentials`

- WhatTheSpec resolution regression (1 failure)
	- `tests/test_whatthespec.py`
		- `TestWhatTheSpecResolution::test_meeting_id_lazy_resolution`

Follow-up backlog tasks:

- Add/restore a canonical `example_docx_path` fixture or align convert-lo tests to existing fixture names.
- Refactor `ai_app.py` and `crawl.py` for top-level imports and reduced cyclomatic complexity.
- Investigate CLI open/checkout execution path to restore `prepare_tdoc_file`/`checkout_tdoc` call expectations in tests.
- Investigate credentials env resolution path in CLI tests (`test_env_var_*credentials`).
- Investigate meeting ID lazy resolution logic in WhatTheSpec path.

PLAN.md

deleted 100644 → 0
+0 −315
# PLAN: Structured Extraction Artifact Storage

**Feature:** Store extracted tables, figures, equations in organized `.ai` subfolders with temp-then-commit pattern

**Created:** 2026-03-26 00:30

---

## Quality Guidelines (Phase 0)

**MUST follow for all implementation work:**

| Tool/Skill | Use For | When |
|------------|---------|------|
| `CytoScnPy` MCP server | Static analysis (unused code, complexity, security) | Before committing code |
| `debugging-code` skill + `dab` CLI | Interactive debugging (step-through, breakpoints, inspect state) | Instead of guessing/printing |
| `grepai` MCP server | Find similar code, methods, existing functionality | Before writing new code |
| `test-driven-development` skill | Writing tests first, red-green-refactor | All new functionality |
| `python-pro` skill | Modern Python 3.12+ patterns, async | Implementation |
| `python-standards` skill | Code quality, type hints, linting | All Python code |
| `python-testing-patterns` skill | pytest fixtures, mocking, isolation | Tests |
| `stop-slop` skill | Remove AI writing patterns from prose | Documentation/comments |
| `code-deduplication` skill | Prevent duplicate code semantically | Before adding new code |
| `visual-explainer` skill | Generate flowcharts/diagrams | Documentation |

**Rule:** Use debugging tools instead of print-statements or guessing. Use grepai to find existing patterns before implementing new ones.

---

## Quality Gates (Per Phase)

Every phase MUST pass these gates before moving to the next:

1. **Lint:** `ruff check packages/3gpp-ai/`
2. **Type annotations:** `ruff check packages/3gpp-ai/ --select=ANN`
3. **Tests:** `uv run pytest tests/ai/ -v`
4. **Static analysis:** `cytoscnpy analyze-path packages/3gpp-ai/`

If any gate fails, fix before proceeding.

---

## Goal

Refactor extraction artifact storage in both `process` and `convert_md` flows so that:

1. Tables, figures, equations are stored as individual files in `.ai/{tables,figures,equations}/` subfolders
2. Extraction uses a temp folder during processing; artifacts are moved on success
3. Re-extraction is skipped per-artifact-type if the corresponding subfolder already contains data
4. Both flows (`process` command and `convert_md`) share the same storage logic

---

## Context

### Current Storage Patterns

| Flow | File | Storage Location |
|------|------|------------------|
| `convert_md` | markdown | `.ai/{doc_id}.md` |
| `convert_md` | figures bytes | `.ai/figures/figure_N.png` |
| `convert_md` | tables | `.ai/{doc_id}_tables.json` (JSON sidecar) |
| `convert_md` | figures metadata | `.ai/{doc_id}_figures.json` (JSON sidecar) |
| `convert_md` | equations | `.ai/{doc_id}_equations.json` (JSON sidecar) |
| `process` | markdown | `.ai/{stem}.md` |
| `process` | figures bytes | `.ai/figures/figure_N.png` |
| `process` | tables | ❌ NOT stored |
| `process` | equations | ❌ NOT stored |

### Key Files

| File | Role |
|------|------|
| `packages/3gpp-ai/threegpp_ai/operations/extraction_result.py` | Utilities: `persist_figures_from_kreuzberg_result()`, `from_kreuzberg_result()`, `build_structured_extraction_result()` |
| `packages/3gpp-ai/threegpp_ai/operations/convert.py` | `extract_tdoc_structured()`, `_write_structured_sidecars()`, `_read_cached_structured()`, `_build_structured_from_result()` |
| `packages/3gpp-ai/threegpp_ai/lightrag/processor.py` | `TDocProcessor.extract_text()` - main extraction for `process` command |

### Dependencies

- `kreuzberg` for extraction
- `convert-lo` for PDF conversion
- Existing `ExtractedTableElement`, `ExtractedFigureElement`, `ExtractedEquationElement` models

---

## Phases

### Phase 1: Define Artifact Storage Utilities

**File:** `packages/3gpp-ai/threegpp_ai/operations/extraction_result.py`

**Deliverables:**
- Add `ArtifactStorage` class or module-level functions for:
  - `persist_tables_from_extraction(tables: list[ExtractedTableElement], ai_dir: Path, doc_stem: str) -> list[Path]`
  - `persist_equations_from_extraction(equations: list[ExtractedEquationElement], ai_dir: Path, doc_stem: str) -> list[Path]`
  - `read_cached_artifacts(ai_dir: Path, doc_stem: str) -> StructuredExtractionResult`
  - `has_cached_artifacts(ai_dir: Path, doc_stem: str, artifact_types: set[str]) -> bool`
- Use temp folder pattern: write to `.{artifact_type}.tmp`, then atomic move on success
- Naming: `{doc_stem}_{type}_{page}_{index}.json` (e.g., `S4-250638_table_5_1.json`)
- Figure bytes: `{doc_stem}_figure_{page}_{index}.png`
- Unit tests in `tests/ai/test_extraction_result.py` (or extend existing tests)
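A minimal sketch of the naming rule above (an assumed helper, not yet part of `extraction_result.py`): artifacts live in `.ai/{type}s/` and encode doc stem, type, page, and index in the filename.

```python
from pathlib import Path

def artifact_path(ai_dir: Path, doc_stem: str, kind: str,
                  page: int, index: int, suffix: str = ".json") -> Path:
    """Build .ai/{kind}s/{doc_stem}_{kind}_{page}_{index}{suffix}."""
    return ai_dir / f"{kind}s" / f"{doc_stem}_{kind}_{page}_{index}{suffix}"
```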

**Validation:**
```bash
cd /path/to/3gpp-crawler
python -c "from threegpp_ai.operations.extraction_result import ArtifactStorage; print('ArtifactStorage imported')"
uv run pytest tests/ai/test_extraction_result.py -v
```

---

### Phase 2: Refactor `convert.py` to Use Folder Storage

**File:** `packages/3gpp-ai/threegpp_ai/operations/convert.py`

**Deliverables:**
- Modify `extract_tdoc_structured()` to:
  - Check for existing artifact folders before extraction
  - Call new storage functions instead of `_write_structured_sidecars()`
  - Use temp-then-commit pattern
- Replace `_write_structured_sidecars()` with calls to new `ArtifactStorage` functions
- Update `_read_cached_structured()` to read from folder structure
- No backward migration - old sidecars ignored, re-generated if needed
- Extend `tests/ai/test_operations_metrics.py` if it exists, or add tests for conversion

**Validation:**
```bash
cd /path/to/3gpp-crawler
uv run python -c "
from threegpp_ai.operations.convert import extract_tdoc_structured
from pathlib import Path
# Test with a known TDoc that has tables
# extraction = extract_tdoc_structured('S4-250638')
# assert (ai_dir / 'tables').exists()
print('convert.py validation passed')
"
uv run pytest tests/ai/ -v -k "convert or conversion"
```

---

### Phase 3: Refactor `processor.py` to Use Structured Storage

**File:** `packages/3gpp-ai/threegpp_ai/lightrag/processor.py`

**Deliverables:**
- Modify `TDocProcessor.extract_text()` to:
  - Check for existing artifacts in `.ai/tables/` for the doc stem to skip table extraction
  - Check for existing artifacts in `.ai/equations/` for the doc stem to skip equation extraction
  - Persist tables and equations using new storage utilities
  - Store figure metadata alongside figure bytes (not just sidecar)
- Align with `convert.py` storage pattern
- Add/update tests in `tests/ai/test_lightrag_processor.py` or similar

**Validation:**
```bash
cd /path/to/3gpp-crawler
uv run pytest tests/ai/test_lightrag_processor.py -v
# If test file doesn't exist, run: uv run pytest tests/ai/ -v -k "processor or extraction"
```

---

### Phase 4: Add Skip Logic for Re-extraction

**Files:** Both `convert.py` and `processor.py`

**Deliverables:**
- Extraction functions accept `extract_types: set[str]` parameter (e.g., `{"tables", "figures", "equations"}`)
- Skip extraction for types whose subfolders already contain artifacts
- `force=True` bypasses skip logic and re-extracts everything

**Validation:**
```bash
cd /path/to/3gpp-crawler
# Run process twice on same doc - second run should skip extraction
uv run tdoc-crawler ai workspace process -w test-skip
# Check logs show "Skipping X extraction - artifacts exist"
```

---

### Phase 5: Integration Test

**Files:** N/A (end-to-end validation)

**Deliverables:**
- Create integration test that:
  1. Creates a temporary workspace
  2. Adds a TDoc with known tables/figures/equations
  3. Runs `process` command
  4. Verifies all artifact folders contain expected files
  5. Runs `process` again - verifies skip logic works
  6. Cleans up workspace
- Or manual validation with real TDoc

**Validation:**
```bash
cd /path/to/3gpp-crawler
# Manual test:
uv run tdoc-crawler ai workspace create test-integration --no-activate
uv run tdoc-crawler ai workspace add-members -w test-integration --kind tdoc S4-250638
uv run tdoc-crawler ai workspace process -w test-integration
# Verify .ai/{tables,figures,equations}/ folders exist and contain files
```

---

## Storage Structure (Target)

```
.ai/
  {doc_id}.md                           # Main markdown content
  {doc_id}.pdf                          # Converted PDF (office formats)
  
  tables/
    {doc_id}_table_{page}_{index}.json
    ...
    
  figures/
    {doc_id}_figure_{page}_{index}.png
    {doc_id}_figure_{page}_{index}.json  # Figure metadata
    ...
    
  equations/
    {doc_id}_equation_{page}_{index}.json
    ...
```

### File Formats

**Table JSON** (`{doc_id}_table_{page}_{index}.json`):
```json
{
  "element_id": "table_1",
  "page_number": 5,
  "row_count": 3,
  "column_count": 4,
  "cells": [["A1", "B1"], ["A2", "B2"]],
  "markdown": "| A1 | B1 |...",
  "caption": "Table caption if available"
}
```

**Equation JSON** (`{doc_id}_equation_{page}_{index}.json`):
```json
{
  "element_id": "equation_1",
  "page_number": 3,
  "latex": "E = mc^2",
  "raw_text": "E = mc^2"
}
```

**Figure JSON** (`{doc_id}_figure_{page}_{index}.json`):
```json
{
  "element_id": "figure_1",
  "page_number": 7,
  "image_path": "S4-250638_figure_7_1.png",
  "image_format": "png",
  "caption": "Figure caption",
  "description": "Vision-LLM description if available",
  "metadata": {}
}
```

---

## Decisions

| Decision | Rationale | Status |
|----------|-----------|--------|
| Folder structure over JSON sidecars | Easier to inspect/debug, individual file access | ✅ Decided |
| Temp-then-commit pattern | Avoids incomplete artifact sets on failure | ✅ Decided |
| Per-type skip logic | Allows selective re-extraction | ✅ Decided |
| Backward compatibility with old sidecars | No migration - re-generation is fine | ✅ Decided |
| Filename must link to document location | Page number in filename or as first-class field | ✅ Decided |

---

## Artifact Naming Convention

Filenames MUST enable clear assignment back to the source document location (especially page number).

**Pattern:** `{doc_id}_{type}_{page}_{index}.json`

| Artifact | Filename Pattern | Example |
|----------|-----------------|---------|
| Table | `{doc_id}_table_{page}_{index}.json` | `S4-250638_table_5_1.json` |
| Equation | `{doc_id}_equation_{page}_{index}.json` | `S4-250638_equation_3_1.json` |
| Figure | `{doc_id}_figure_{page}_{index}.json` | `S4-250638_figure_7_1.json` |

**Rationale:** Page number in filename allows correlating artifact to source without parsing JSON.
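Correlating without JSON parsing could look like this (illustrative regex; assumes doc IDs such as `S4-250638` contain no underscore followed by an artifact type keyword):

```python
import re

ARTIFACT_RE = re.compile(
    r"^(?P<doc>.+)_(?P<kind>table|figure|equation)_(?P<page>\d+)_(?P<index>\d+)\.\w+$"
)

def parse_artifact_name(name: str) -> dict[str, str]:
    """Recover doc, kind, page, and index from an artifact filename."""
    m = ARTIFACT_RE.match(name)
    if m is None:
        raise ValueError(f"not an artifact filename: {name}")
    return m.groupdict()
```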

---

## Notes

- Temp-then-commit: use `tempfile.mkdtemp()` for temp extraction, then `shutil.move()` to the target folder
- Ensure `pathlib.Path` operations are used (no `os.path`)
- If extraction fails mid-way, temp folder is cleaned up automatically
- On success, move temp folder contents to final location atomically where possible
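The notes above boil down to this pattern (a sketch; `produce` stands in for the real extraction callable, and the move is atomic only when source and target share a filesystem):

```python
import shutil
import tempfile
from pathlib import Path

def commit_artifacts(produce, target: Path) -> None:
    """Run produce() against a scratch dir; move into place only on success."""
    tmp = Path(tempfile.mkdtemp(prefix=".extract-"))
    try:
        produce(tmp)                          # write artifacts into tmp
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(tmp), str(target))    # commit on success
    except BaseException:
        shutil.rmtree(tmp, ignore_errors=True)  # clean up on failure
        raise
```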

---

## Progress

- [x] (2026-03-26 00:30) Created PLAN.md
- [x] Phase 0: Quality gates defined
- [x] Phase 1: Implement ArtifactStorage utilities (extraction_result.py)
- [x] Phase 2: Refactor convert.py to folder storage
- [x] Phase 3: Refactor processor.py to folder storage
- [x] Phase 4: Add skip logic for re-extraction (extract_types parameter)
- [x] Phase 5: Integration test / end-to-end validation (`tests/ai/test_ai_extraction_artifacts.py`, 10/10 passing)