Commit 15c9fda9 authored by Jan Reimes's avatar Jan Reimes

feat(ai): unify structured extraction and fix rag query compatibility

parent 8421d18f

FUTURE-PLAN.md

0 → 100644
+120 −0
# Future Plan: 3GPP AI Pipeline Enhancements

**Status:** Backlog  
**Last Updated:** 2026-03-24

This document captures future enhancements that are not currently prioritized but may be valuable in future development cycles.

---

## 1. LightRAG Integration Details

Document the internal architecture of LightRAG integration, including entity extraction patterns, relationship types, and graph traversal strategies. This would help developers understand how TDoc content flows through the knowledge graph and enable customization of entity types for domain-specific concepts like "codec," "specification," and "working group."

---

## 2. Multi-File TDoc Handling

Enhance `classify.py` to handle TDocs with multiple files (e.g., presentation + document + spreadsheet) by implementing priority rules and content merging strategies. Currently, the system picks a primary file, but future versions could combine content from multiple files or allow users to specify which file to process.
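The priority rules could start as simple as an extension ranking; a minimal sketch (the `pick_primary` helper and its ranking are hypothetical, not the current logic in `classify.py`):

```python
from pathlib import Path

# Illustrative ranking only: lower rank wins, unknown extensions rank last.
PRIORITY = {".docx": 0, ".doc": 1, ".pdf": 2, ".pptx": 3, ".xlsx": 4}

def pick_primary(files: list[str]) -> str:
    """Pick the primary file of a multi-file TDoc by extension priority,
    breaking ties by preferring the shortest name (often the main document)."""
    return min(files, key=lambda f: (PRIORITY.get(Path(f).suffix.lower(), 99), len(f)))
```

Content merging would then build on this by concatenating the non-primary files' extracted text in rank order.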

---

## 3. Cache Behavior and Invalidation

Implement automatic cache invalidation when source documents change, and add size limits for the `.ai/` cache directory. This would include TTL-based expiration, checksum-based change detection, and a CLI command to inspect and manage cache state across workspaces.
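Checksum-based change detection could be sketched as follows (the function names and JSON manifest layout are assumptions, not the existing `.ai/` cache format):

```python
import hashlib
import json
from pathlib import Path

def checksum_bytes(data: bytes) -> str:
    """Stable content fingerprint used for change detection."""
    return hashlib.sha256(data).hexdigest()

def is_stale(source: Path, manifest: Path) -> bool:
    """True if `source` changed since the checksum recorded in `manifest`
    (a JSON map of file name -> checksum), or if no manifest exists yet."""
    if not manifest.exists():
        return True
    recorded = json.loads(manifest.read_text()).get(source.name)
    return recorded != checksum_bytes(source.read_bytes())
```

TTL expiration would add a timestamp next to each checksum; the CLI inspection command would simply pretty-print the manifest.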

---

## 4. Workspace Integration Examples

Create comprehensive examples showing how to integrate 3GPP AI commands into CI/CD pipelines, automated reporting workflows, and research tools. These examples would demonstrate batch processing patterns, scheduled workspace updates, and integration with external analysis tools.
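A batch-processing example might generate the CLI invocations for a CI job; a sketch using the commands documented elsewhere in this commit (the `build_batch_commands` helper itself is hypothetical):

```python
import shlex

def build_batch_commands(workspace: str, tdoc_ids: list[str]) -> list[str]:
    """Build the shell commands a CI job would run to add and process TDocs.
    Commands are returned rather than executed so the calling pipeline
    controls retries and logging."""
    return [
        f"tdoc-crawler ai workspace add-members -w {shlex.quote(workspace)} --kind tdoc "
        + " ".join(shlex.quote(t) for t in tdoc_ids),
        f"tdoc-crawler ai workspace process -w {shlex.quote(workspace)}",
    ]
```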

---

## 5. Dependency Version Compatibility Matrix

Document which versions of LibreOffice, Python, and other dependencies are known to work with each release of the 3GPP AI pipeline. This matrix would help users troubleshoot compatibility issues and plan upgrades, especially for the LibreOffice conversion layer which has version-specific behaviors.

---

## 6. Troubleshooting Guide

Create a dedicated troubleshooting document covering common issues like "LibreOffice not found," "rate limiting errors," "out of memory on large PDFs," and "LightRAG query returns no results." Each issue would include symptoms, root causes, diagnostic commands, and resolution steps.

---

## 7. Streaming Extraction for Large Documents

Implement streaming extraction that processes documents in chunks rather than loading them entirely into memory. This would enable handling of very large specifications (>500 pages) without memory pressure, using Kreuzberg's streaming capabilities combined with incremental LightRAG ingestion.
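The chunking side of such a pipeline can be sketched independently of any extraction library (the page-iterator interface and size bound are illustrative):

```python
from collections.abc import Iterator

def iter_chunks(pages: Iterator[str], max_chars: int = 4000) -> Iterator[str]:
    """Group streamed page texts into bounded chunks so a large document
    never has to be materialized in memory at once."""
    buf: list[str] = []
    size = 0
    for page in pages:
        # Flush the buffer before it would exceed the bound.
        if buf and size + len(page) > max_chars:
            yield "\n".join(buf)
            buf, size = [], 0
        buf.append(page)
        size += len(page)
    if buf:
        yield "\n".join(buf)
```

Each yielded chunk could then be handed to LightRAG ingestion as it is produced.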

---

## 8. Multi-Language Document Support

Add support for processing TDocs in languages other than English, including language detection, translation integration, and language-aware summarization. This would be particularly useful for regional contributions and historical documents that may not be in English.
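Language detection could initially be as crude as stopword matching; a deliberately naive sketch (a real implementation would use a trained detector library):

```python
# Tiny stopword sets per language code; illustrative only.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in"},
    "fr": {"le", "la", "et", "les", "des"},
    "de": {"der", "die", "und", "das", "ein"},
}

def guess_language(text: str) -> str:
    """Guess the language by counting stopword overlaps per candidate."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)
```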

---

## 9. Incremental Graph Updates

Implement incremental updates to the LightRAG knowledge graph when documents are modified or added, rather than rebuilding the entire graph. This would significantly reduce processing time for large workspaces and enable near-real-time updates when new TDocs are published.
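Deciding what to re-ingest reduces to diffing per-document checksum maps between runs; a sketch (the `tdoc_id -> checksum` map layout is an assumption):

```python
def diff_workspace(old: dict[str, str], new: dict[str, str]) -> tuple[list[str], list[str], list[str]]:
    """Compare checksum maps (tdoc_id -> checksum) across runs and return
    which documents need ingestion, re-ingestion, and graph removal."""
    added = [k for k in new if k not in old]
    changed = [k for k in new if k in old and old[k] != new[k]]
    removed = [k for k in old if k not in new]
    return added, changed, removed
```

Only `added` and `changed` documents would pass through extraction and graph insertion; `removed` entries trigger targeted node deletion instead of a rebuild.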

---

## 10. Export and Integration APIs

Add export capabilities for the knowledge graph in formats like GraphML, RDF, or JSON-LD to enable integration with external tools like Neo4j, Gephi, or custom analysis pipelines. This would also include webhook support for notifying external systems when processing completes.
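GraphML export needs no external dependency; a stdlib sketch (node/edge attributes are omitted for brevity, and the flat node/edge-list input is an assumption about the graph's internal shape):

```python
import xml.etree.ElementTree as ET

def to_graphml(nodes: list[str], edges: list[tuple[str, str]]) -> str:
    """Serialize a simple directed graph to GraphML text, suitable for
    import into tools such as Gephi or Neo4j."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for n in nodes:
        ET.SubElement(graph, "node", id=n)
    for i, (src, dst) in enumerate(edges):
        ET.SubElement(graph, "edge", id=f"e{i}", source=src, target=dst)
    return ET.tostring(root, encoding="unicode")
```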

---

## 11. Non-3gpp-ai Repo Sweep Findings (2026-03-25)

Commands executed:

- `uv run ruff check src tests packages/convert-lo packages/pool_executors`
- `uv run pytest tests tests/convert_lo tests/pool_executor -v`

Summary:

- Lint: 14 errors (Ruff)
- Tests: 9 failed, 14 errors, 309 passed, 12 skipped

Lint findings (grouped):

- `src/tdoc_crawler/cli/ai_app.py`
	- `PLC0415` imports inside functions instead of at module top level (multiple locations)
	- `PLR0915` too many statements in `workspace_process`
	- `PLW0603` `global _cache_manager` statement
- `src/tdoc_crawler/cli/crawl.py`
	- `PLR0915` too many statements in `crawl_tdocs`

Test findings (grouped):

- Fixture mismatch in convert-lo tests (14 errors)
	- Missing fixture: `example_docx_path`
	- Affected files:
		- `tests/convert_lo/test_converter.py`
		- `tests/convert_lo/test_hybrid_converter.py`

- CLI behavior regressions (8 failures)
	- `tests/test_cli.py`
		- `TestStatsCommand::test_stats_basic`
		- `TestOpenCommand::test_open_existing_tdoc`
		- `TestOpenCommand::test_open_with_whatthespec_fallback`
		- `TestOpenCommand::test_open_with_whatthespec_no_credentials_required`
		- `TestCheckoutCommand::test_checkout_with_whatthespec_fallback`
		- `TestEnvironmentVariables::test_env_var_credentials`
		- `TestEnvironmentVariables::test_env_var_prompt_credentials`
		- `TestEnvironmentVariables::test_env_var_multiple_credentials`

- WhatTheSpec resolution regression (1 failure)
	- `tests/test_whatthespec.py`
		- `TestWhatTheSpecResolution::test_meeting_id_lazy_resolution`

Follow-up backlog tasks:

- Add/restore a canonical `example_docx_path` fixture or align convert-lo tests to existing fixture names.
- Refactor `ai_app.py` and `crawl.py` to move imports to module top level and split the oversized functions flagged by `PLR0915`.
- Investigate CLI open/checkout execution path to restore `prepare_tdoc_file`/`checkout_tdoc` call expectations in tests.
- Investigate credentials env resolution path in CLI tests (`test_env_var_*credentials`).
- Investigate meeting ID lazy resolution logic in WhatTheSpec path.

PLAN.md

0 → 100644
+556 −0


Preview size limit exceeded, changes collapsed.

+49 −44
@@ -6,6 +6,7 @@ The AI module provides intelligent document processing capabilities for 3GPP doc

- **Classification** - Identify main documents in multi-file TDoc folders
- **Extraction** - Convert DOCX/PDF to Markdown with keyword extraction and language detection (via Kreuzberg)
+- **Structured Elements** - Preserve tables, figures, and equations with stable markers and metadata
- **Embeddings** - Generate semantic vector representations for similarity search
- **Summarization** - Create AI-powered abstracts
- **Knowledge Graph** - Build relationships between TDocs
@@ -99,14 +100,16 @@ ______________________________________________________________________

The AI module follows a workspace-based workflow for organizing and querying your document collection:

+All examples below use the current CLI entrypoint: `tdoc-crawler ai ...`.

### 1. Create and Activate Workspace

```bash
# Create a new workspace for your project
-3gpp-ai workspace create my-project
+tdoc-crawler ai workspace create my-project

# Activate it so you don't need --workspace for other commands
-3gpp-ai workspace activate my-project
+tdoc-crawler ai workspace activate my-project
```

Once activated, all workspace commands use the active workspace by default. No need to pass `-w` every time.
@@ -117,30 +120,31 @@ After adding TDocs to your workspace, process them to generate RAG/GraphRAG embe

```bash
# Add TDocs to the active workspace
-3gpp-ai workspace add-members S4-251971 S4-251972
+tdoc-crawler ai workspace add-members --kind tdoc S4-251971 S4-251972

# Process all TDocs in workspace (only new ones)
-3gpp-ai workspace process -w my-project
+tdoc-crawler ai workspace process -w my-project

# Force reprocess all TDocs
-3gpp-ai workspace process -w my-project --force
+tdoc-crawler ai workspace process -w my-project --force
```

Note: If you created the workspace with `--auto-build`, documents are processed automatically when added.

### 3. Query Your Knowledge Base

-Once you have a workspace with documents, query using semantic search and knowledge graph (RAG + GraphRAG):
+Once you have a workspace with documents, query using the single RAG command that searches enriched text plus preserved table/figure/equation context:

```bash
-# Query the active workspace
-3gpp-ai query "your query here"
+# Query a workspace
+tdoc-crawler ai rag query --workspace my-project "What are the bit rates in Table 3?"

-# Or specify a workspace explicitly
-3gpp-ai query -w my-project "your query here"
+# Same command for figure/equation questions
+tdoc-crawler ai rag query --workspace my-project "Describe the architecture figure"
+tdoc-crawler ai rag query --workspace my-project "What is the throughput equation?"
```

-Note: Uses active workspace if `-w` is not provided. Results combine vector embeddings (RAG) and knowledge graph (GraphRAG).
+Note: `ai rag query` is the only query entrypoint. Do not use separate table/figure/equation query commands.

### 4. Workspace Maintenance

@@ -148,16 +152,16 @@ Keep your workspace clean and manage artifacts:

```bash
# Get detailed workspace information (member counts by type)
-3gpp-ai workspace info my-project
+tdoc-crawler ai workspace info my-project

# Remove invalid/inactive members
-3gpp-ai workspace clear-invalid -w my-project
+tdoc-crawler ai workspace clear-invalid -w my-project

# Clear all AI artifacts (embeddings, summaries) while preserving members
-3gpp-ai workspace clear -w my-project
+tdoc-crawler ai workspace clear -w my-project

# After clearing, re-process to regenerate artifacts
-3gpp-ai workspace process -w my-project --force
+tdoc-crawler ai workspace process -w my-project --force
```

### 5. Single TDoc Operations
@@ -165,9 +169,16 @@ Keep your workspace clean and manage artifacts:
Process a single TDoc through the pipeline (classification, extraction, embeddings, graph). Use `--accelerate` to choose the sentence-transformers backend.

```bash
-3gpp-ai process --tdoc-id SP-240001 --accelerate onnx
+tdoc-crawler ai convert SP-240001 --output ./SP-240001.md
+tdoc-crawler ai summarize SP-240001 --words 200
```

+When structured extraction is enabled, conversion and workspace processing may generate sidecars next to markdown artifacts:
+
+- `*_tables.json`
+- `*_figures.json`
+- `*_equations.json`

______________________________________________________________________

## CLI Commands
@@ -176,7 +187,7 @@ ______________________________________________________________________

````bash
# Create a new workspace
-3gpp-ai workspace create <name> [--auto-build]
+tdoc-crawler ai workspace create <name> [--auto-build]

Options:
- `name`: Workspace name
@@ -184,48 +195,42 @@ Options:

# List all workspaces
# Shows (*) next to the active workspace
-3gpp-ai workspace list
+tdoc-crawler ai workspace list

# Activate a workspace (sets as default for workspace commands)
-3gpp-ai workspace activate <name>
+tdoc-crawler ai workspace activate <name>

# Deactivate the active workspace
-3gpp-ai workspace deactivate
+tdoc-crawler ai workspace deactivate

# Get workspace details (name, status, member counts)
-3gpp-ai workspace info <name>
+tdoc-crawler ai workspace info <name>

# Remove invalid/inactive members from workspace
-3gpp-ai workspace clear-invalid [-w <name>]
+tdoc-crawler ai workspace clear-invalid [-w <name>]

# Clear all AI artifacts while preserving members
-3gpp-ai workspace clear [-w <name>]
+tdoc-crawler ai workspace clear [-w <name>]

# Delete a workspace
-3gpp-ai workspace delete <name>
+tdoc-crawler ai workspace delete <name>
### Querying

Query the knowledge base using semantic embeddings and knowledge graph (RAG + GraphRAG).

```bash
-# Query the active workspace
-3gpp-ai query "your query here"

-# Query a specific workspace
-3gpp-ai query -w <workspace_name> "your query here"

-# Specify number of results
-3gpp-ai query "your query here" -k 10
+# Query a specific workspace (single query command)
+tdoc-crawler ai rag query --workspace <workspace_name> "your query here"
````

-Note: Uses active workspace if `-w` is not provided. Combines vector embeddings (RAG) and knowledge graph (GraphRAG). The query is a **positional argument** (no `--query` flag needed).
+Note: Keep `ai rag query` as the single query interface. The query is a positional argument (no `--query` flag).

#### Summarize a TDoc

Summarize a single TDoc with specified word count.

```bash
-3gpp-ai summarize <tdoc_id> [--words N] [--format markdown|json|yaml] [--json-output]
+tdoc-crawler ai summarize <tdoc_id> [--words N] [--format markdown|json|yaml] [--json-output]
```

Options:
@@ -240,7 +245,7 @@ Options:
Convert a single TDoc to markdown format.

```bash
-3gpp-ai convert <tdoc_id> [--output FILE.md] [--json-output]
+tdoc-crawler ai convert <tdoc_id> [--output FILE.md] [--json-output]
```

Options:
@@ -255,31 +260,31 @@ Add TDocs to workspaces and process them to generate embeddings and knowledge gr

```bash
# Add members to the active workspace
-3gpp-ai workspace add-members S4-251971 S4-251972
+tdoc-crawler ai workspace add-members --kind tdoc S4-251971 S4-251972

# Add members to a specific workspace
-3gpp-ai workspace add-members -w my-project S4-251971 S4-251972
+tdoc-crawler ai workspace add-members -w my-project --kind tdoc S4-251971 S4-251972

# List members in the active workspace
-3gpp-ai workspace list-members
+tdoc-crawler ai workspace list-members

# List members including inactive ones
-3gpp-ai workspace list-members --include-inactive
+tdoc-crawler ai workspace list-members --include-inactive

# Process all TDocs in the active workspace
-3gpp-ai workspace process
+tdoc-crawler ai workspace process

# Process with options
-3gpp-ai workspace process -w my-project --force
+tdoc-crawler ai workspace process -w my-project --force

# Get workspace information with member counts
-3gpp-ai workspace info my-project
+tdoc-crawler ai workspace info my-project

# Remove invalid members (failed checkouts, etc.)
-3gpp-ai workspace clear-invalid -w my-project
+tdoc-crawler ai workspace clear-invalid -w my-project

# Clear AI artifacts (keep members, remove embeddings/summaries)
-3gpp-ai workspace clear -w my-project
+tdoc-crawler ai workspace clear -w my-project
```

______________________________________________________________________
+15 −0
@@ -139,3 +139,18 @@ converter.convert(
)
"
```

## Relationship to AI Conversion Artifacts

`convert-lo` handles format conversion only. Structured AI extraction artifacts are produced by the AI pipeline commands:

```bash
tdoc-crawler ai convert <tdoc_id> --output <file>.md
tdoc-crawler ai workspace process --workspace <workspace_name>
```

When structured extraction is enabled, these AI commands may emit sidecars next to markdown output:

- `*_tables.json`
- `*_figures.json`
- `*_equations.json`
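A downstream consumer could load whichever sidecars exist next to the markdown artifact; a sketch (the `load_sidecars` helper is hypothetical, and the sidecar payloads are treated as opaque lists since their schema is not specified here):

```python
import json
from pathlib import Path

def load_sidecars(markdown_path: Path) -> dict[str, list]:
    """Load any *_tables.json / *_figures.json / *_equations.json files
    sitting next to a converted markdown artifact. Missing sidecars yield
    empty lists."""
    stem = markdown_path.with_suffix("")
    out: dict[str, list] = {}
    for kind in ("tables", "figures", "equations"):
        sidecar = Path(f"{stem}_{kind}.json")
        out[kind] = json.loads(sidecar.read_text()) if sidecar.exists() else []
    return out
```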
+21 −0
@@ -2,6 +2,27 @@

Query commands allow you to search and display metadata stored in your local database. They support various output formats like tables, JSON, and YAML.

## AI RAG Query

Use a single command for AI-assisted retrieval across text, tables, figures, and equations:

```bash
tdoc-crawler ai rag query --workspace <workspace_name> "your query here"
```

Examples:

```bash
tdoc-crawler ai rag query --workspace test-rag-elements "What are the bit rates in Table 3?"
tdoc-crawler ai rag query --workspace test-rag-elements "Describe the architecture figure"
tdoc-crawler ai rag query --workspace test-rag-elements "What is the throughput equation?"
```

Notes:

- Keep `ai rag query` as the single query entrypoint (no separate table/figure/equation query commands).
- Retrieval uses enriched chunk content and element-aware metadata when available.

## Commands

### `query-tdocs` (alias: `qt`)