This document captures future enhancements that are not currently prioritized but may be valuable in future development cycles.
---
## 1. LightRAG Integration Details
Document the internal architecture of LightRAG integration, including entity extraction patterns, relationship types, and graph traversal strategies. This would help developers understand how TDoc content flows through the knowledge graph and enable customization of entity types for domain-specific concepts like "codec," "specification," and "working group."
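As a sketch of what that customization could look like, the pattern table and `extract_entities` helper below are hypothetical; the regexes are chosen only to illustrate codec, specification, and working-group entity types.

```python
import re

# Hypothetical domain-specific entity patterns for 3GPP content.
# Names and regexes are illustrative only, not part of the current code.
DOMAIN_ENTITY_PATTERNS = {
    "codec": re.compile(r"\b(EVS|AMR-WB|IVAS)\b"),
    "specification": re.compile(r"\bTS\s?\d{2}\.\d{3}\b"),
    "working_group": re.compile(r"\bSA\d+\b"),
}

def extract_entities(text):
    """Return {entity_type: [matches]} for every pattern that fires."""
    found = {}
    for entity_type, pattern in DOMAIN_ENTITY_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[entity_type] = matches
    return found
```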
---
## 2. Multi-File TDoc Handling
Enhance `classify.py` to handle TDocs with multiple files (e.g., presentation + document + spreadsheet) by implementing priority rules and content merging strategies. Currently, the system picks a primary file, but future versions could combine content from multiple files or allow users to specify which file to process.
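A minimal sketch of one possible priority rule, assuming extension-based ranking (prose documents before presentations before spreadsheets); the priority list and helper name are illustrative, not the current implementation.

```python
# Hypothetical priority order for choosing the primary file of a
# multi-file TDoc (e.g. presentation + document + spreadsheet).
PRIMARY_FILE_PRIORITY = [".docx", ".doc", ".pdf", ".pptx", ".xlsx"]

def pick_primary_file(filenames):
    """Return the highest-priority file, or None for an empty list."""
    def rank(name):
        ext = name[name.rfind("."):].lower() if "." in name else ""
        if ext in PRIMARY_FILE_PRIORITY:
            return PRIMARY_FILE_PRIORITY.index(ext)
        return len(PRIMARY_FILE_PRIORITY)  # unknown types rank last
    return min(filenames, key=rank) if filenames else None
```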
---
## 3. Cache Behavior and Invalidation
Implement automatic cache invalidation when source documents change, and add size limits for the `.ai/` cache directory. This would include TTL-based expiration, checksum-based change detection, and a CLI command to inspect and manage cache state across workspaces.
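One way the staleness check could combine the two mechanisms is sketched below; the function and field names are hypothetical placeholders, not existing cache APIs.

```python
import hashlib
import time

# Hypothetical cache-entry check combining TTL-based expiration with
# checksum-based change detection of the source document.
def cache_entry_is_stale(cached_at, ttl_seconds, cached_checksum, source_bytes):
    if (time.time() - cached_at) > ttl_seconds:
        return True  # TTL expired
    # Source changed since the artifact was cached
    return hashlib.sha256(source_bytes).hexdigest() != cached_checksum
```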
---
## 4. Workspace Integration Examples
Create comprehensive examples showing how to integrate 3GPP AI commands into CI/CD pipelines, automated reporting workflows, and research tools. These examples would demonstrate batch processing patterns, scheduled workspace updates, and integration with external analysis tools.
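A batch-processing sketch for a CI job: the helper below only builds the command lines (mirroring the `tdoc-crawler ai workspace` commands documented later in this file) so the pipeline can run them with `subprocess.run`; the helper itself is hypothetical.

```python
# Hypothetical CI helper: build the tdoc-crawler command lines that add a
# batch of TDocs to a workspace and then process it.
def batch_commands(workspace, tdoc_ids):
    return [
        ["tdoc-crawler", "ai", "workspace", "add-members", "--kind", "tdoc", *tdoc_ids],
        ["tdoc-crawler", "ai", "workspace", "process", "-w", workspace],
    ]
```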
---
## 5. Dependency Version Compatibility Matrix
Document which versions of LibreOffice, Python, and other dependencies are known to work with each release of the 3GPP AI pipeline. This matrix would help users troubleshoot compatibility issues and plan upgrades, especially for the LibreOffice conversion layer which has version-specific behaviors.
---
## 6. Troubleshooting Guide
Create a dedicated troubleshooting document covering common issues like "LibreOffice not found," "rate limiting errors," "out of memory on large PDFs," and "LightRAG query returns no results." Each issue would include symptoms, root causes, diagnostic commands, and resolution steps.
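Such a guide could ship a first-pass diagnostic along these lines; the function is a hypothetical sketch, and the Python version floor shown is an illustrative placeholder rather than a documented requirement.

```python
import shutil
import sys

# Hypothetical first-pass diagnostic covering the dependency most often
# behind "LibreOffice not found" reports.
def diagnose_environment():
    issues = []
    if shutil.which("soffice") is None:
        issues.append("LibreOffice not found: install it or add 'soffice' to PATH")
    if sys.version_info < (3, 9):  # placeholder version floor
        issues.append("Python version may be too old for the pipeline")
    return issues
```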
---
## 7. Streaming Extraction for Large Documents
Implement streaming extraction that processes documents in chunks rather than loading entirely into memory. This would enable handling of very large specifications (>500 pages) without memory pressure, using kreuzberg's streaming capabilities combined with incremental LightRAG ingestion.
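The chunking idea can be sketched as below; a real implementation would delegate to kreuzberg's streaming extraction rather than this plain-text reader, which is illustrative only.

```python
# Minimal sketch of chunked reading for streaming extraction: the document
# is consumed in fixed-size character windows instead of one full read,
# so memory use stays bounded for very large files.
def iter_chunks(path, chunk_chars=4000):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            yield chunk
```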
---
## 8. Multi-Language Document Support
Add support for processing TDocs in languages other than English, including language detection, translation integration, and language-aware summarization. This would be particularly useful for regional contributions and historical documents that may not be in English.
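As a naive illustration of the language-detection gate, the heuristic below flags text with a high share of non-ASCII letters for the translation path; a real implementation would use a proper language-detection library, and the threshold is an arbitrary placeholder.

```python
# Naive language gate, for illustration only: a high share of non-ASCII
# letters suggests the TDoc may need translation before summarization.
def probably_non_english(text, threshold=0.3):
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    non_ascii = sum(1 for c in letters if ord(c) > 127)
    return non_ascii / len(letters) > threshold
```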
---
## 9. Incremental Graph Updates
Implement incremental updates to the LightRAG knowledge graph when documents are modified or added, rather than rebuilding the entire graph. This would significantly reduce processing time for large workspaces and enable near-real-time updates when new TDocs are published.
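The change-detection half of this could look like the sketch below, assuming per-document checksums are stored at ingest time; the data shapes and function name are illustrative.

```python
# Sketch of change detection for incremental graph updates: compare stored
# per-document checksums against current ones and only re-ingest documents
# that are new or modified, instead of rebuilding the whole graph.
def docs_to_reingest(stored_checksums, current_checksums):
    return sorted(
        doc_id
        for doc_id, checksum in current_checksums.items()
        if stored_checksums.get(doc_id) != checksum
    )
```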
---
## 10. Export and Integration APIs
Add export capabilities for the knowledge graph in formats like GraphML, RDF, or JSON-LD to enable integration with external tools like Neo4j, Gephi, or custom analysis pipelines. This would also include webhook support for notifying external systems when processing completes.
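A minimal GraphML export can be sketched with only the standard library; a real exporter would also emit typed attribute keys for node and edge properties, and the node/edge shapes below are illustrative.

```python
import xml.etree.ElementTree as ET

# Minimal GraphML writer sketch: nodes are ids, edges are (source, target)
# pairs, serialized into the GraphML namespace for tools like Gephi.
def to_graphml(nodes, edges):
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for node_id in nodes:
        ET.SubElement(graph, "node", id=node_id)
    for source, target in edges:
        ET.SubElement(graph, "edge", source=source, target=target)
    return ET.tostring(root, encoding="unicode")
```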
---
## Workspace Workflow

The AI module follows a workspace-based workflow for organizing and querying your document collection.
All examples below use the current CLI entrypoint: `tdoc-crawler ai ...`.
### 1. Create and Activate Workspace
```bash
# Create a new workspace for your project
tdoc-crawler ai workspace create my-project

# Activate it so you don't need --workspace for other commands
tdoc-crawler ai workspace activate my-project
```
Once activated, all workspace commands use the active workspace by default. No need to pass `-w` every time.
### 2. Add and Process TDocs

After adding TDocs to your workspace, process them to generate RAG/GraphRAG embeddings:
```bash
# Add TDocs to the active workspace
tdoc-crawler ai workspace add-members --kind tdoc S4-251971 S4-251972

# Process all TDocs in the workspace (only new ones)
tdoc-crawler ai workspace process -w my-project

# Force reprocess all TDocs
tdoc-crawler ai workspace process -w my-project --force
```
Note: If you created the workspace with `--auto-build`, documents are processed automatically when added.
### 3. Query Your Knowledge Base
Once you have a workspace with documents, query it with the single RAG command, which searches enriched text plus preserved table/figure/equation context and combines semantic search with the knowledge graph (RAG + GraphRAG):
```bash
# Query a workspace (omit --workspace to use the active one)
tdoc-crawler ai rag query --workspace my-project "What are the bit rates in Table 3?"

# The same command handles figure and equation questions
tdoc-crawler ai rag query --workspace my-project "Describe the architecture figure"
tdoc-crawler ai rag query --workspace my-project "What is the throughput equation?"
```
Note: Uses the active workspace if `-w` is not provided. Results combine vector embeddings (RAG) and the knowledge graph (GraphRAG). `ai rag query` is the only query entrypoint; do not use separate table/figure/equation query commands, and pass the query as a positional argument (no `--query` flag).
### 4. Workspace Maintenance
Keep your workspace clean and manage artifacts:
```bash
# Get detailed workspace information (member counts by type)
tdoc-crawler ai workspace info my-project

# Remove invalid/inactive members
tdoc-crawler ai workspace clear-invalid -w my-project

# Clear all AI artifacts (embeddings, summaries) while preserving members
tdoc-crawler ai workspace clear -w my-project

# After clearing, re-process to regenerate artifacts
tdoc-crawler ai workspace process -w my-project --force
```
### 5. Single TDoc Operations
Process a single TDoc through the pipeline (classification, extraction, embeddings, graph). Use `--accelerate` to choose the sentence-transformers backend.
```bash
tdoc-crawler ai process --tdoc-id SP-240001 --accelerate onnx
tdoc-crawler ai convert SP-240001 --output ./SP-240001.md
tdoc-crawler ai summarize SP-240001 --words 200
```
When structured extraction is enabled, conversion and workspace processing may generate sidecar files next to the markdown artifacts.
#### Summarize a TDoc
Summarize a single TDoc with a specified word count.
## Query Commands

Query commands let you search and display metadata stored in your local database. They support output formats such as tables, JSON, and YAML.
## AI RAG Query
Use a single command for AI-assisted retrieval across text, tables, figures, and equations:
```bash
tdoc-crawler ai rag query --workspace <workspace_name> "your query here"
```
Examples:
```bash
tdoc-crawler ai rag query --workspace test-rag-elements "What are the bit rates in Table 3?"
tdoc-crawler ai rag query --workspace test-rag-elements "Describe the architecture figure"
tdoc-crawler ai rag query --workspace test-rag-elements "What is the throughput equation?"
```
Notes:
- Keep `ai rag query` as the single query entrypoint (no separate table/figure/equation query commands).
- Retrieval uses enriched chunk content and element-aware metadata when available.