Commit 15c9fda9 authored by Jan Reimes's avatar Jan Reimes

feat(ai): unify structured extraction and fix rag query compatibility

parent 8421d18f

FUTURE-PLAN.md

0 → 100644
+120 −0
# Future Plan: 3GPP AI Pipeline Enhancements

**Status:** Backlog  
**Last Updated:** 2026-03-24

This document captures future enhancements that are not currently prioritized but may be valuable in future development cycles.

---

## 1. LightRAG Integration Details

Document the internal architecture of LightRAG integration, including entity extraction patterns, relationship types, and graph traversal strategies. This would help developers understand how TDoc content flows through the knowledge graph and enable customization of entity types for domain-specific concepts like "codec," "specification," and "working group."

---

## 2. Multi-File TDoc Handling

Enhance `classify.py` to handle TDocs with multiple files (e.g., presentation + document + spreadsheet) by implementing priority rules and content merging strategies. Currently, the system picks a primary file, but future versions could combine content from multiple files or allow users to specify which file to process.
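The priority rules could start as simple as an extension ranking; a minimal sketch (the `pick_primary` helper and its ranking are hypothetical, not the current logic in `classify.py`):

```python
from pathlib import Path

# Illustrative ranking only: lower rank wins, unknown extensions rank last.
PRIORITY = {".docx": 0, ".doc": 1, ".pdf": 2, ".pptx": 3, ".xlsx": 4}

def pick_primary(files: list[str]) -> str:
    """Pick the primary file of a multi-file TDoc by extension priority,
    breaking ties by preferring the shortest name (often the main document)."""
    return min(files, key=lambda f: (PRIORITY.get(Path(f).suffix.lower(), 99), len(f)))
```

Content merging would then build on this by concatenating the non-primary files' extracted text in rank order.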

---

## 3. Cache Behavior and Invalidation

Implement automatic cache invalidation when source documents change, and add size limits for the `.ai/` cache directory. This would include TTL-based expiration, checksum-based change detection, and a CLI command to inspect and manage cache state across workspaces.
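Checksum-based change detection could be sketched as follows (the function names and JSON manifest layout are assumptions, not the existing `.ai/` cache format):

```python
import hashlib
import json
from pathlib import Path

def checksum_bytes(data: bytes) -> str:
    """Stable content fingerprint used for change detection."""
    return hashlib.sha256(data).hexdigest()

def is_stale(source: Path, manifest: Path) -> bool:
    """True if `source` changed since the checksum recorded in `manifest`
    (a JSON map of file name -> checksum), or if no manifest exists yet."""
    if not manifest.exists():
        return True
    recorded = json.loads(manifest.read_text()).get(source.name)
    return recorded != checksum_bytes(source.read_bytes())
```

TTL expiration would add a timestamp next to each checksum; the CLI inspection command would simply pretty-print the manifest.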

---

## 4. Workspace Integration Examples

Create comprehensive examples showing how to integrate 3GPP AI commands into CI/CD pipelines, automated reporting workflows, and research tools. These examples would demonstrate batch processing patterns, scheduled workspace updates, and integration with external analysis tools.
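A batch-processing example might generate the CLI invocations for a CI job; a sketch using the commands documented elsewhere in this commit (the `build_batch_commands` helper itself is hypothetical):

```python
import shlex

def build_batch_commands(workspace: str, tdoc_ids: list[str]) -> list[str]:
    """Build the shell commands a CI job would run to add and process TDocs.
    Commands are returned rather than executed so the calling pipeline
    controls retries and logging."""
    return [
        f"tdoc-crawler ai workspace add-members -w {shlex.quote(workspace)} --kind tdoc "
        + " ".join(shlex.quote(t) for t in tdoc_ids),
        f"tdoc-crawler ai workspace process -w {shlex.quote(workspace)}",
    ]
```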

---

## 5. Dependency Version Compatibility Matrix

Document which versions of LibreOffice, Python, and other dependencies are known to work with each release of the 3GPP AI pipeline. This matrix would help users troubleshoot compatibility issues and plan upgrades, especially for the LibreOffice conversion layer which has version-specific behaviors.

---

## 6. Troubleshooting Guide

Create a dedicated troubleshooting document covering common issues like "LibreOffice not found," "rate limiting errors," "out of memory on large PDFs," and "LightRAG query returns no results." Each issue would include symptoms, root causes, diagnostic commands, and resolution steps.

---

## 7. Streaming Extraction for Large Documents

Implement streaming extraction that processes documents in chunks rather than loading them entirely into memory. This would enable handling of very large specifications (>500 pages) without memory pressure, using Kreuzberg's streaming capabilities combined with incremental LightRAG ingestion.
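The chunking side of such a pipeline can be sketched independently of any extraction library (the page-iterator interface and size bound are illustrative):

```python
from collections.abc import Iterator

def iter_chunks(pages: Iterator[str], max_chars: int = 4000) -> Iterator[str]:
    """Group streamed page texts into bounded chunks so a large document
    never has to be materialized in memory at once."""
    buf: list[str] = []
    size = 0
    for page in pages:
        # Flush the buffer before it would exceed the bound.
        if buf and size + len(page) > max_chars:
            yield "\n".join(buf)
            buf, size = [], 0
        buf.append(page)
        size += len(page)
    if buf:
        yield "\n".join(buf)
```

Each yielded chunk could then be handed to LightRAG ingestion as it is produced.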

---

## 8. Multi-Language Document Support

Add support for processing TDocs in languages other than English, including language detection, translation integration, and language-aware summarization. This would be particularly useful for regional contributions and historical documents that may not be in English.
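Language detection could initially be as crude as stopword matching; a deliberately naive sketch (a real implementation would use a trained detector library):

```python
# Tiny stopword sets per language code; illustrative only.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in"},
    "fr": {"le", "la", "et", "les", "des"},
    "de": {"der", "die", "und", "das", "ein"},
}

def guess_language(text: str) -> str:
    """Guess the language by counting stopword overlaps per candidate."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)
```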

---

## 9. Incremental Graph Updates

Implement incremental updates to the LightRAG knowledge graph when documents are modified or added, rather than rebuilding the entire graph. This would significantly reduce processing time for large workspaces and enable near-real-time updates when new TDocs are published.
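Deciding what to re-ingest reduces to diffing per-document checksum maps between runs; a sketch (the `tdoc_id -> checksum` map layout is an assumption):

```python
def diff_workspace(old: dict[str, str], new: dict[str, str]) -> tuple[list[str], list[str], list[str]]:
    """Compare checksum maps (tdoc_id -> checksum) across runs and return
    which documents need ingestion, re-ingestion, and graph removal."""
    added = [k for k in new if k not in old]
    changed = [k for k in new if k in old and old[k] != new[k]]
    removed = [k for k in old if k not in new]
    return added, changed, removed
```

Only `added` and `changed` documents would pass through extraction and graph insertion; `removed` entries trigger targeted node deletion instead of a rebuild.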

---

## 10. Export and Integration APIs

Add export capabilities for the knowledge graph in formats like GraphML, RDF, or JSON-LD to enable integration with external tools like Neo4j, Gephi, or custom analysis pipelines. This would also include webhook support for notifying external systems when processing completes.
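GraphML export needs no external dependency; a stdlib sketch (node/edge attributes are omitted for brevity, and the flat node/edge-list input is an assumption about the graph's internal shape):

```python
import xml.etree.ElementTree as ET

def to_graphml(nodes: list[str], edges: list[tuple[str, str]]) -> str:
    """Serialize a simple directed graph to GraphML text, suitable for
    import into tools such as Gephi or Neo4j."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for n in nodes:
        ET.SubElement(graph, "node", id=n)
    for i, (src, dst) in enumerate(edges):
        ET.SubElement(graph, "edge", id=f"e{i}", source=src, target=dst)
    return ET.tostring(root, encoding="unicode")
```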

---

## 11. Non-3gpp-ai Repo Sweep Findings (2026-03-25)

Commands executed:

- `uv run ruff check src tests packages/convert-lo packages/pool_executors`
- `uv run pytest tests tests/convert_lo tests/pool_executor -v`

Summary:

- Lint: 14 errors (Ruff)
- Tests: 9 failed, 14 errors, 309 passed, 12 skipped

Lint findings (grouped):

- `src/tdoc_crawler/cli/ai_app.py`
	- `PLC0415` imports inside functions instead of at module top level (multiple locations)
	- `PLR0915` too many statements in `workspace_process`
	- `PLW0603` `global _cache_manager` statement
- `src/tdoc_crawler/cli/crawl.py`
	- `PLR0915` too many statements in `crawl_tdocs`

Test findings (grouped):

- Fixture mismatch in convert-lo tests (14 errors)
	- Missing fixture: `example_docx_path`
	- Affected files:
		- `tests/convert_lo/test_converter.py`
		- `tests/convert_lo/test_hybrid_converter.py`

- CLI behavior regressions (8 failures)
	- `tests/test_cli.py`
		- `TestStatsCommand::test_stats_basic`
		- `TestOpenCommand::test_open_existing_tdoc`
		- `TestOpenCommand::test_open_with_whatthespec_fallback`
		- `TestOpenCommand::test_open_with_whatthespec_no_credentials_required`
		- `TestCheckoutCommand::test_checkout_with_whatthespec_fallback`
		- `TestEnvironmentVariables::test_env_var_credentials`
		- `TestEnvironmentVariables::test_env_var_prompt_credentials`
		- `TestEnvironmentVariables::test_env_var_multiple_credentials`

- WhatTheSpec resolution regression (1 failure)
	- `tests/test_whatthespec.py`
		- `TestWhatTheSpecResolution::test_meeting_id_lazy_resolution`

Follow-up backlog tasks:

- Add/restore a canonical `example_docx_path` fixture or align convert-lo tests to existing fixture names.
- Refactor `ai_app.py` and `crawl.py` to move imports to module top level and split the oversized functions flagged by `PLR0915`.
- Investigate CLI open/checkout execution path to restore `prepare_tdoc_file`/`checkout_tdoc` call expectations in tests.
- Investigate credentials env resolution path in CLI tests (`test_env_var_*credentials`).
- Investigate meeting ID lazy resolution logic in WhatTheSpec path.

PLAN.md

0 → 100644
+556 −0


Preview size limit exceeded, changes collapsed.

+49 −44
@@ -6,6 +6,7 @@ The AI module provides intelligent document processing capabilities for 3GPP doc

- **Classification** - Identify main documents in multi-file TDoc folders
- **Extraction** - Convert DOCX/PDF to Markdown with keyword extraction and language detection (via Kreuzberg)
+- **Structured Elements** - Preserve tables, figures, and equations with stable markers and metadata
- **Embeddings** - Generate semantic vector representations for similarity search
- **Summarization** - Create AI-powered abstracts
- **Knowledge Graph** - Build relationships between TDocs
@@ -99,14 +100,16 @@ ______________________________________________________________________

The AI module follows a workspace-based workflow for organizing and querying your document collection:

+All examples below use the current CLI entrypoint: `tdoc-crawler ai ...`.

### 1. Create and Activate Workspace

```bash
# Create a new workspace for your project
-3gpp-ai workspace create my-project
+tdoc-crawler ai workspace create my-project

# Activate it so you don't need --workspace for other commands
-3gpp-ai workspace activate my-project
+tdoc-crawler ai workspace activate my-project
```

Once activated, all workspace commands use the active workspace by default. No need to pass `-w` every time.
@@ -117,30 +120,31 @@ After adding TDocs to your workspace, process them to generate RAG/GraphRAG embe

```bash
# Add TDocs to the active workspace
-3gpp-ai workspace add-members S4-251971 S4-251972
+tdoc-crawler ai workspace add-members --kind tdoc S4-251971 S4-251972

# Process all TDocs in workspace (only new ones)
-3gpp-ai workspace process -w my-project
+tdoc-crawler ai workspace process -w my-project

# Force reprocess all TDocs
-3gpp-ai workspace process -w my-project --force
+tdoc-crawler ai workspace process -w my-project --force
```

Note: If you created the workspace with `--auto-build`, documents are processed automatically when added.

### 3. Query Your Knowledge Base

-Once you have a workspace with documents, query using semantic search and knowledge graph (RAG + GraphRAG):
+Once you have a workspace with documents, query using the single RAG command that searches enriched text plus preserved table/figure/equation context:

```bash
-# Query the active workspace
-3gpp-ai query "your query here"
+# Query a workspace
+tdoc-crawler ai rag query --workspace my-project "What are the bit rates in Table 3?"

-# Or specify a workspace explicitly
-3gpp-ai query -w my-project "your query here"
+# Same command for figure/equation questions
+tdoc-crawler ai rag query --workspace my-project "Describe the architecture figure"
+tdoc-crawler ai rag query --workspace my-project "What is the throughput equation?"
```

-Note: Uses active workspace if `-w` is not provided. Results combine vector embeddings (RAG) and knowledge graph (GraphRAG).
+Note: `ai rag query` is the only query entrypoint. Do not use separate table/figure/equation query commands.

### 4. Workspace Maintenance

@@ -148,16 +152,16 @@ Keep your workspace clean and manage artifacts:

```bash
# Get detailed workspace information (member counts by type)
-3gpp-ai workspace info my-project
+tdoc-crawler ai workspace info my-project

# Remove invalid/inactive members
-3gpp-ai workspace clear-invalid -w my-project
+tdoc-crawler ai workspace clear-invalid -w my-project

# Clear all AI artifacts (embeddings, summaries) while preserving members
-3gpp-ai workspace clear -w my-project
+tdoc-crawler ai workspace clear -w my-project

# After clearing, re-process to regenerate artifacts
-3gpp-ai workspace process -w my-project --force
+tdoc-crawler ai workspace process -w my-project --force
```

### 5. Single TDoc Operations
@@ -165,9 +169,16 @@ Keep your workspace clean and manage artifacts:
Process a single TDoc through the pipeline (classification, extraction, embeddings, graph). Use `--accelerate` to choose the sentence-transformers backend.

```bash
-3gpp-ai process --tdoc-id SP-240001 --accelerate onnx
+tdoc-crawler ai convert SP-240001 --output ./SP-240001.md
+tdoc-crawler ai summarize SP-240001 --words 200
```

+When structured extraction is enabled, conversion and workspace processing may generate sidecars next to markdown artifacts:
+
+- `*_tables.json`
+- `*_figures.json`
+- `*_equations.json`

______________________________________________________________________

## CLI Commands
@@ -176,7 +187,7 @@ ______________________________________________________________________

````bash
# Create a new workspace
-3gpp-ai workspace create <name> [--auto-build]
+tdoc-crawler ai workspace create <name> [--auto-build]

Options:
- `name`: Workspace name
@@ -184,48 +195,42 @@ Options:

# List all workspaces
# Shows (*) next to the active workspace
-3gpp-ai workspace list
+tdoc-crawler ai workspace list

# Activate a workspace (sets as default for workspace commands)
-3gpp-ai workspace activate <name>
+tdoc-crawler ai workspace activate <name>

# Deactivate the active workspace
-3gpp-ai workspace deactivate
+tdoc-crawler ai workspace deactivate

# Get workspace details (name, status, member counts)
-3gpp-ai workspace info <name>
+tdoc-crawler ai workspace info <name>

# Remove invalid/inactive members from workspace
-3gpp-ai workspace clear-invalid [-w <name>]
+tdoc-crawler ai workspace clear-invalid [-w <name>]

# Clear all AI artifacts while preserving members
-3gpp-ai workspace clear [-w <name>]
+tdoc-crawler ai workspace clear [-w <name>]

# Delete a workspace
-3gpp-ai workspace delete <name>
+tdoc-crawler ai workspace delete <name>
### Querying

Query the knowledge base using semantic embeddings and knowledge graph (RAG + GraphRAG).

```bash
-# Query the active workspace
-3gpp-ai query "your query here"

-# Query a specific workspace
-3gpp-ai query -w <workspace_name> "your query here"

-# Specify number of results
-3gpp-ai query "your query here" -k 10
+# Query a specific workspace (single query command)
+tdoc-crawler ai rag query --workspace <workspace_name> "your query here"
````

-Note: Uses active workspace if `-w` is not provided. Combines vector embeddings (RAG) and knowledge graph (GraphRAG). The query is a **positional argument** (no `--query` flag needed).
+Note: Keep `ai rag query` as the single query interface. The query is a positional argument (no `--query` flag).

#### Summarize a TDoc

Summarize a single TDoc with specified word count.

```bash
-3gpp-ai summarize <tdoc_id> [--words N] [--format markdown|json|yaml] [--json-output]
+tdoc-crawler ai summarize <tdoc_id> [--words N] [--format markdown|json|yaml] [--json-output]
```

Options:
@@ -240,7 +245,7 @@ Options:
Convert a single TDoc to markdown format.

```bash
-3gpp-ai convert <tdoc_id> [--output FILE.md] [--json-output]
+tdoc-crawler ai convert <tdoc_id> [--output FILE.md] [--json-output]
```

Options:
@@ -255,31 +260,31 @@ Add TDocs to workspaces and process them to generate embeddings and knowledge gr

```bash
# Add members to the active workspace
-3gpp-ai workspace add-members S4-251971 S4-251972
+tdoc-crawler ai workspace add-members --kind tdoc S4-251971 S4-251972

# Add members to a specific workspace
-3gpp-ai workspace add-members -w my-project S4-251971 S4-251972
+tdoc-crawler ai workspace add-members -w my-project --kind tdoc S4-251971 S4-251972

# List members in the active workspace
-3gpp-ai workspace list-members
+tdoc-crawler ai workspace list-members

# List members including inactive ones
-3gpp-ai workspace list-members --include-inactive
+tdoc-crawler ai workspace list-members --include-inactive

# Process all TDocs in the active workspace
-3gpp-ai workspace process
+tdoc-crawler ai workspace process

# Process with options
-3gpp-ai workspace process -w my-project --force
+tdoc-crawler ai workspace process -w my-project --force

# Get workspace information with member counts
-3gpp-ai workspace info my-project
+tdoc-crawler ai workspace info my-project

# Remove invalid members (failed checkouts, etc.)
-3gpp-ai workspace clear-invalid -w my-project
+tdoc-crawler ai workspace clear-invalid -w my-project

# Clear AI artifacts (keep members, remove embeddings/summaries)
-3gpp-ai workspace clear -w my-project
+tdoc-crawler ai workspace clear -w my-project
```

______________________________________________________________________
+15 −0
@@ -139,3 +139,18 @@ converter.convert(
)
"
```

## Relationship to AI Conversion Artifacts

`convert-lo` handles format conversion only. Structured AI extraction artifacts are produced by the AI pipeline commands:

```bash
tdoc-crawler ai convert <tdoc_id> --output <file>.md
tdoc-crawler ai workspace process --workspace <workspace_name>
```

When structured extraction is enabled, these AI commands may emit sidecars next to markdown output:

- `*_tables.json`
- `*_figures.json`
- `*_equations.json`
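A downstream consumer could load whichever sidecars exist next to the markdown artifact; a sketch (the `load_sidecars` helper is hypothetical, and the sidecar payloads are treated as opaque lists since their schema is not specified here):

```python
import json
from pathlib import Path

def load_sidecars(markdown_path: Path) -> dict[str, list]:
    """Load any *_tables.json / *_figures.json / *_equations.json files
    sitting next to a converted markdown artifact. Missing sidecars yield
    empty lists."""
    stem = markdown_path.with_suffix("")
    out: dict[str, list] = {}
    for kind in ("tables", "figures", "equations"):
        sidecar = Path(f"{stem}_{kind}.json")
        out[kind] = json.loads(sidecar.read_text()) if sidecar.exists() else []
    return out
```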
+21 −0
@@ -2,6 +2,27 @@

Query commands allow you to search and display metadata stored in your local database. They support various output formats like tables, JSON, and YAML.

## AI RAG Query

Use a single command for AI-assisted retrieval across text, tables, figures, and equations:

```bash
tdoc-crawler ai rag query --workspace <workspace_name> "your query here"
```

Examples:

```bash
tdoc-crawler ai rag query --workspace test-rag-elements "What are the bit rates in Table 3?"
tdoc-crawler ai rag query --workspace test-rag-elements "Describe the architecture figure"
tdoc-crawler ai rag query --workspace test-rag-elements "What is the throughput equation?"
```

Notes:

- Keep `ai rag query` as the single query entrypoint (no separate table/figure/equation query commands).
- Retrieval uses enriched chunk content and element-aware metadata when available.

## Commands

### `query-tdocs` (alias: `qt`)