Commit 52c7bbb8 authored by Jan Reimes's avatar Jan Reimes

feat(docs): add initial AI Document Processing section and update index links

- Introduced new documentation for AI Document Processing capabilities.
- Updated index.md to include links to the new AI Document Processing section.
- not yet done!
parent dda1999c

docs/ai.md

# AI Document Processing

The AI module provides intelligent document processing capabilities for TDoc data, including:

- **Classification** - Identify main documents in multi-file TDoc folders
- **Extraction** - Convert DOCX to Markdown for easier analysis
- **Embeddings** - Generate semantic vector representations
- **Summarization** - Create AI-powered summaries
- **Knowledge Graph** - Build relationships between TDocs

## Table of Contents

- [Installation](#installation)
- [Configuration](#configuration)
- [CLI Commands](#cli-commands)
- [Python API](#python-api)
- [Environment Variables](#environment-variables)

## Installation

Install the required dependencies:

```bash
# Core AI dependencies
uv add docling sentence-transformers litellm

# Optional: for vector storage
uv add lancedb
```

## Configuration

### Environment Variables

Configure AI processing via environment variables (see `.env.example`):

```bash
# LLM Configuration
TDC_LLM_MODEL=openai/gpt-4o-mini
TDC_LLM_API_KEY=your-api-key
TDC_LLM_BASE_URL=  # For proxy/custom endpoints
TDC_LLM_MAX_TOKENS=2000
TDC_LLM_TEMPERATURE=0.3

# Embedding Model
TDC_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
TDC_EMBEDDING_API_KEY=

# Storage
TDC_LANCEDB_PATH=
TDC_CHUNK_SIZE=500
TDC_CHUNK_OVERLAP=50
```
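The chunking settings above control how extracted Markdown is split before embedding. As a rough illustration, this is how `TDC_CHUNK_SIZE` and `TDC_CHUNK_OVERLAP` interact; `chunk_text` is a hypothetical helper for this document, not part of the tdoc-crawler API:

```python
import os

# Read the chunking settings with the documented defaults
# (illustrative only; the real config loading may differ).
CHUNK_SIZE = int(os.getenv("TDC_CHUNK_SIZE", "500"))
CHUNK_OVERLAP = int(os.getenv("TDC_CHUNK_OVERLAP", "50"))

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into fixed-size windows, each overlapping the previous by `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

With the defaults, consecutive chunks share their last/first 50 characters, which helps embeddings preserve context across chunk boundaries.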

### Model Format

Both LLM and embedding models use the `<provider>/<model_name>` format:

| Provider | Example Model | Description |
|----------|---------------|-------------|
| `openai` | `gpt-4o-mini`, `gpt-4o` | OpenAI models |
| `anthropic` | `claude-3-haiku`, `claude-3-sonnet` | Anthropic models |
| `azure` | `gpt-4o` | Azure OpenAI |
| `google` | `gemini-1.5-flash` | Google models |
| `cohere` | `command-r` | Cohere models |
| `BAAI` | `bge-small-en-v1.5` | BGE embeddings |
| `ollama` | `llama3`, `mistral` | Local Ollama models |

**Note:** LiteLLM is used as the backend and supports 100+ providers. See the [LiteLLM documentation](https://docs.litellm.ai/) for the full list.
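The `<provider>/<model_name>` convention splits at the first slash, so provider-specific paths with further slashes stay intact. A minimal sketch of that parsing rule (`split_model_id` is a hypothetical helper; tdoc-crawler delegates the real parsing to LiteLLM):

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split '<provider>/<model_name>' at the first '/' into (provider, model_name)."""
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>/<model_name>', got {model_id!r}")
    return provider, name
```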

## CLI Commands

### Process a TDoc

```bash
tdoc-crawler ai process --tdoc-id SP-123456 --checkout-path /path/to/checkout
```

Options:

- `--tdoc-id`: TDoc identifier (e.g., SP-123456)
- `--checkout-path`: Path to TDoc checkout folder
- `--force`: Force re-processing even if completed
- `--json`: Output as JSON

### Get Status

```bash
tdoc-crawler ai status --tdoc-id SP-123456
```

Options:

- `--tdoc-id`: TDoc identifier
- `--json`: Output as JSON

### Semantic Search

```bash
tdoc-crawler ai query --query "5G architecture overview" --top-k 5
```

Options:

- `--query`: Search query
- `--top-k`: Number of results (default: 5)
- `--json`: Output as JSON

### Knowledge Graph

```bash
tdoc-crawler ai graph --query "evolution of 5G NR"
```

Options:

- `--query`: Graph query
- `--json`: Output as JSON

## Python API

```python
from tdoc_crawler.ai import (
    process_tdoc,
    process_all,
    get_status,
    query_embeddings,
    query_graph,
)

# Process single TDoc
status = process_tdoc("SP-123456", "/path/to/checkout")

# Batch processing
results = process_all(
    ["SP-123456", "SP-123457"],
    "/base/checkout/path"
)

# Get status
status = get_status("SP-123456")

# Semantic search
results = query_embeddings("5G architecture", top_k=5)

# Query knowledge graph
graph_data = query_graph("evolution of 5G NR")
```

### Models

```python
from tdoc_crawler.ai import (
    ProcessingStatus,
    PipelineStage,
    DocumentClassification,
    DocumentSummary,
    DocumentChunk,
)
```

## Pipeline Stages

The AI processing pipeline consists of these stages:

1. **CLASSIFY** - Identify main document among multiple files
1. **EXTRACT** - Convert DOCX to Markdown
1. **EMBED** - Generate vector embeddings
1. **SUMMARIZE** - Create AI summaries
1. **GRAPH** - Build knowledge graph relationships
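The stage sequence above can be sketched as a simple loop that stops at the first failing stage. The enum values and the `run_stage` callable here are assumptions for illustration; the real `PipelineStage` is exported from `tdoc_crawler.ai` and may differ:

```python
from enum import Enum

class PipelineStage(Enum):
    # Assumed values mirroring the stage list above.
    CLASSIFY = "classify"
    EXTRACT = "extract"
    EMBED = "embed"
    SUMMARIZE = "summarize"
    GRAPH = "graph"

def run_pipeline(run_stage) -> list["PipelineStage"]:
    """Run each stage in definition order; stop at the first stage that fails."""
    completed = []
    for stage in PipelineStage:
        if not run_stage(stage):
            break
        completed.append(stage)
    return completed
```

Because later stages depend on earlier outputs (e.g. EMBED needs the extracted Markdown), stopping early keeps the recorded status consistent with what actually ran.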

## Supported File Types

- **DOCX** - Primary format for extraction (via Docling)
- **XLSX** - Handled as secondary files
- **PPTX** - Handled as secondary files
- **PDF** - Supported via Docling
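A rough sketch of how files might be routed by extension, based on the roles listed above (the table and `role_for` helper are hypothetical; the actual dispatch lives inside the AI module):

```python
from pathlib import Path

# Illustrative routing table derived from the list above.
EXTRACTABLE = {".docx", ".pdf"}   # converted to Markdown via Docling
SECONDARY = {".xlsx", ".pptx"}    # kept alongside, but not extracted

def role_for(path: str) -> str:
    """Classify a file as 'extract', 'secondary', or 'ignore' by its suffix."""
    suffix = Path(path).suffix.lower()
    if suffix in EXTRACTABLE:
        return "extract"
    if suffix in SECONDARY:
        return "secondary"
    return "ignore"
```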

## Testing

Run AI tests:

```bash
# All AI tests
uv run pytest tests/test_ai*.py -v

# Specific module
uv run pytest tests/test_ai_extraction.py -v
```

Test data is located in `tests/data/ai/`.

## Troubleshooting

### Docling not available

```bash
uv add docling
```

### Embedding model issues

```bash
uv add sentence-transformers torch
```

### LLM API errors

Check your API key is set correctly:

```bash
export OPENAI_API_KEY=your-key
# or
export ANTHROPIC_API_KEY=your-key
```
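A quick sanity check from Python can confirm that at least one of these key variables is visible to the process (`llm_key_configured` is an illustrative helper, not part of the tdoc-crawler API; the variable names checked are the ones mentioned in this document):

```python
import os

def llm_key_configured() -> bool:
    """Return True if any of the usual API-key environment variables is set."""
    return any(
        os.getenv(name)
        for name in ("TDC_LLM_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")
    )
```

Remember that variables exported in one shell are not visible to processes started elsewhere (IDEs, cron, systemd), which is a common source of "missing key" errors.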
index.md

## 📖 Table of Contents

- [**Crawl Documentation**](crawl.md) – How to fetch metadata from 3GPP servers and portal.
- [**AI Document Processing**](ai.md) – AI-powered TDoc analysis and embeddings.
- [**Query Documentation**](query.md) – How to search and display stored metadata.
- [**Utility Documentation**](utils.md) – File access, spec handling, and database inspection.
- [**WhatIsWhatTheSpec**](whatthespec.md) – Understanding the primary WhatTheSpec data source and 3GPP fallback.