docs/ai.md (new file, +215 −0)

# AI Document Processing

The AI module provides intelligent document processing capabilities for TDoc data, including:

- **Classification** - Identify main documents in multi-file TDoc folders
- **Extraction** - Convert DOCX to Markdown for easier analysis
- **Embeddings** - Generate semantic vector representations
- **Summarization** - Create AI-powered summaries
- **Knowledge Graph** - Build relationships between TDocs

## Table of Contents

- [Installation](#installation)
- [Configuration](#configuration)
- [CLI Commands](#cli-commands)
- [Python API](#python-api)
- [Environment Variables](#environment-variables)

## Installation

Install the required dependencies:

```bash
# Core AI dependencies
uv add docling sentence-transformers litellm

# Optional: for vector storage
uv add lancedb
```

## Configuration

### Environment Variables

Configure AI processing via environment variables (see `.env.example`):

```bash
# LLM Configuration
TDC_LLM_MODEL=openai/gpt-4o-mini
TDC_LLM_API_KEY=your-api-key
TDC_LLM_BASE_URL=  # For proxy/custom endpoints
TDC_LLM_MAX_TOKENS=2000
TDC_LLM_TEMPERATURE=0.3

# Embedding Model
TDC_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
TDC_EMBEDDING_API_KEY=

# Storage
TDC_LANCEDB_PATH=
TDC_CHUNK_SIZE=500
TDC_CHUNK_OVERLAP=50
```

### Model Format

Both LLM and embedding models use the `<provider>/<model_name>` format:

| Provider | Example Models | Description |
|----------|----------------|-------------|
| `openai` | `gpt-4o-mini`, `gpt-4o` | OpenAI models |
| `anthropic` | `claude-3-haiku`, `claude-3-sonnet` | Anthropic models |
| `azure` | `gpt-4o` | Azure OpenAI |
| `google` | `gemini-1.5-flash` | Google models |
| `cohere` | `command-r` | Cohere models |
| `BAAI` | `bge-small-en-v1.5` | BGE embeddings |
| `ollama` | `llama3`, `mistral` | Local Ollama models |

**Note:** LiteLLM is used as the backend, supporting 100+ providers.
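Since the `<provider>/<model_name>` convention is just a prefix split, it can be sanity-checked in a few lines of Python. This is an illustrative sketch only; `split_model_id` is a hypothetical helper, not part of tdoc-crawler's or LiteLLM's API:

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split a '<provider>/<model_name>' identifier at the first slash.

    Hypothetical helper for illustration; not part of tdoc-crawler's API.
    """
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>/<model_name>', got {model_id!r}")
    return provider, name

print(split_model_id("openai/gpt-4o-mini"))      # ('openai', 'gpt-4o-mini')
print(split_model_id("BAAI/bge-small-en-v1.5"))  # ('BAAI', 'bge-small-en-v1.5')
```

Splitting at the *first* slash matters because some model names (and most embedding IDs) contain dashes and dots but the provider prefix never contains a slash.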
See the [LiteLLM documentation](https://docs.litellm.ai/) for the full list.

## CLI Commands

### Process a TDoc

```bash
tdoc-crawler ai process --tdoc-id SP-123456 --checkout-path /path/to/checkout
```

Options:

- `--tdoc-id`: TDoc identifier (e.g., SP-123456)
- `--checkout-path`: Path to the TDoc checkout folder
- `--force`: Force re-processing even if already completed
- `--json`: Output as JSON

### Get Status

```bash
tdoc-crawler ai status --tdoc-id SP-123456
```

Options:

- `--tdoc-id`: TDoc identifier
- `--json`: Output as JSON

### Semantic Search

```bash
tdoc-crawler ai query --query "5G architecture overview" --top-k 5
```

Options:

- `--query`: Search query
- `--top-k`: Number of results (default: 5)
- `--json`: Output as JSON

### Knowledge Graph

```bash
tdoc-crawler ai graph --query "evolution of 5G NR"
```

Options:

- `--query`: Graph query
- `--json`: Output as JSON

## Python API

```python
from tdoc_crawler.ai import (
    process_tdoc,
    process_all,
    get_status,
    query_embeddings,
    query_graph,
)

# Process a single TDoc
status = process_tdoc("SP-123456", "/path/to/checkout")

# Batch processing
results = process_all(
    ["SP-123456", "SP-123457"],
    "/base/checkout/path",
)

# Get status
status = get_status("SP-123456")

# Semantic search
results = query_embeddings("5G architecture", top_k=5)

# Query the knowledge graph
graph_data = query_graph("evolution of 5G NR")
```

### Models

```python
from tdoc_crawler.ai import (
    ProcessingStatus,
    PipelineStage,
    DocumentClassification,
    DocumentSummary,
    DocumentChunk,
)
```

## Pipeline Stages

The AI processing pipeline consists of these stages:

1. **CLASSIFY** - Identify the main document among multiple files
2. **EXTRACT** - Convert DOCX to Markdown
3. **EMBED** - Generate vector embeddings
4. **SUMMARIZE** - Create AI summaries
5. **GRAPH** - Build knowledge graph relationships

## Supported File Types

- **DOCX** - Primary format for extraction (via Docling)
- **XLSX** - Handled as secondary files
- **PPTX** - Handled as secondary files
- **PDF** - Supported via Docling

## Testing

Run the AI tests:

```bash
# All AI tests
uv run pytest tests/test_ai*.py -v

# Specific module
uv run pytest tests/test_ai_extraction.py -v
```

Test data is located in `tests/data/ai/`.

## Troubleshooting

### Docling not available

```bash
uv add docling
```

### Embedding model issues

```bash
uv add sentence-transformers torch
```

### LLM API errors

Check that your API key is set correctly:

```bash
export OPENAI_API_KEY=your-key
# or
export ANTHROPIC_API_KEY=your-key
```

docs/index.md (+4 −1)

```diff
@@ -4,7 +4,10 @@ Welcome to the documentation for **tdoc-crawler**, a command-line tool for query
 ## 📖 Table of Contents

 - [**Crawl Documentation**](crawl.md) – How to fetch metadata from 3GPP servers and portal.
+- [**AI Document Processing**](ai.md) – AI-powered TDoc analysis and embeddings.
 - [**Query Documentation**](query.md) – How to search and display stored metadata.
 - [**Utility Documentation**](utils.md) – File access, spec handling, and database inspection.
 - [**WhatIsWhatTheSpec**](whatthespec.md) – Understanding the primary WhatTheSpec data source and 3GPP fallback.
```
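The `TDC_CHUNK_SIZE` / `TDC_CHUNK_OVERLAP` settings documented in docs/ai.md describe overlapping chunking before embedding. As a minimal sketch of those semantics only — `chunk_text` is a hypothetical helper, not tdoc-crawler's actual chunker, and real chunkers usually operate on tokens rather than plain words:

```python
def chunk_text(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split a token sequence into overlapping chunks.

    Illustrates TDC_CHUNK_SIZE / TDC_CHUNK_OVERLAP semantics; hypothetical
    sketch, not tdoc-crawler's implementation.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # each chunk advances by size - overlap tokens
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# With size=4 and overlap=1, consecutive chunks share one token:
chunks = chunk_text([f"w{i}" for i in range(10)], size=4, overlap=1)
# chunks == [['w0','w1','w2','w3'], ['w3','w4','w5','w6'], ['w6','w7','w8','w9']]
```

Each chunk then receives its own embedding; the overlap keeps text that straddles a chunk boundary represented in both neighboring chunks, so boundary sentences remain findable in semantic search.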