Commit a319ffaa authored by Jan Reimes's avatar Jan Reimes

feat(docs): Add AI documentation improvements and CLI redesign spec

* Introduce new feature specification for AI documentation improvements.
* Clarify user scenarios and testing requirements for new CLI commands.
* Define acceptance criteria for workspace creation with auto-build.
* Outline functional requirements for model configuration and command redesign.
* Document user stories focusing on free AI features and streamlined workflows.
* Include edge cases and success criteria for measurable outcomes.

feat(cli): Implement CLI command redesign for AI operations

* Remove deprecated commands: `ai process`, `ai status`, `ai graph`.
* Introduce new commands: `ai summarize` and `ai convert` for single TDoc operations.
* Ensure `ai query` command is workspace-only, integrating RAG and GraphRAG.
* Add `--auto-build` flag to workspace creation for automatic processing.

fix(config): Update default LLM and embedding models

* Change default LLM model to `openrouter/openrouter/free`.
* Update default embedding model to `ollama/embeddinggemma`.

refactor(operations): Modify summarization logic to use dynamic models

* Update `LiteLLMClient` to accept model parameter, defaulting to config.
* Ensure summarization uses the configured model instead of hardcoded values.

docs(tests): Enhance test documentation for new AI features

* Add tasks for testing new CLI commands and auto-build functionality.
* Outline phases for implementing and verifying new features in the AI module.
parent 0992bce9
@@ -78,20 +78,22 @@ HTTP_CACHE_TTL=7200
HTTP_CACHE_REFRESH_ON_ACCESS=true

# AI Configuration
# Note: AI module requires API keys for cloud providers. See docs/ai.md for details.

# Path to AI LanceDB store (default: <cache_dir>/.ai/lancedb)
TDC_AI_STORE_PATH=

# LLM model in format <provider>/<model_name>
# Recommended: openrouter/openrouter/free (free tier, no subscription required)
# API key: OPENROUTER_API_KEY
TDC_AI_LLM_MODEL=openrouter/openrouter/free

# Optional custom base URL for LLM provider/proxy
TDC_AI_LLM_API_BASE=

# Embedding model in format <provider>/<model_name>
# Recommended: ollama/embeddinggemma (local, no API key required)
TDC_AI_EMBEDDING_MODEL=ollama/embeddinggemma

# Chunking
TDC_AI_MAX_CHUNK_SIZE=1000
@@ -13,6 +13,7 @@ A command-line tool for crawling the 3GPP FTP server, caching TDoc metadata in a
- **Case-Insensitive Queries**: Search for TDocs regardless of case
- **Multiple Output Formats**: Export results as table, JSON, or YAML
- **Incremental Updates**: Only fetch new data on subsequent crawls
- **AI Document Processing**: Semantic search, knowledge graphs, and AI-powered summarization (optional, install with `tdoc-crawler[ai]`)
- **Rich CLI**: Beautiful terminal output with progress indicators

## Installation
@@ -30,6 +31,9 @@ uvx tdoc-crawler --help
# Install from PyPI (publication pending)
uv add tdoc-crawler

# Install with AI features (optional)
uv add tdoc-crawler[ai]

# Or install from source
git clone https://forge.3gpp.org/rep/reimes/tdoc-crawler.git
cd tdoc-crawler
# AI Document Processing

The AI module provides intelligent document processing capabilities for TDoc data, including semantic search, knowledge graph construction, and AI-powered summarization.

**Key Features:**

- **Classification** - Identify main documents in multi-file TDoc folders
- **Extraction** - Convert DOCX/PDF to Markdown with keyword extraction and language detection (via Kreuzberg)
- **Embeddings** - Generate semantic vector representations for similarity search
- **Summarization** - Create AI-powered abstracts
- **Knowledge Graph** - Build relationships between TDocs
- **Workspaces** - Organize TDocs into logical groups for focused analysis

______________________________________________________________________

## Table of Contents

- [Installation](#installation)
- [Configuration](#configuration)
- [Workflow Guide](#workflow-guide)
- [CLI Commands](#cli-commands)
- [Model Providers](#model-providers)
- [Python API](#python-api)
- [Environment Variables](#environment-variables)
- [Troubleshooting](#troubleshooting)

______________________________________________________________________

## Installation

The AI module is available as an optional dependency. Install it with:

```bash
# Install tdoc-crawler with AI support
uv add tdoc-crawler[ai]

# Or install from source
git clone https://forge.3gpp.org/rep/reimes/tdoc-crawler.git
cd tdoc-crawler
uv sync --extra ai
```

All required dependencies (Kreuzberg, LiteLLM, sentence-transformers, LanceDB) are installed automatically.

______________________________________________________________________

## Configuration

### Environment Variables
@@ -36,122 +51,216 @@ Configure AI processing via environment variables (see `.env.example`):

```bash
# LLM Configuration
TDC_AI_LLM_MODEL=openrouter/openrouter/free    # Default: free tier via OpenRouter
TDC_AI_LLM_API_KEY=your-api-key                # Required for cloud providers
TDC_AI_LLM_API_BASE=                           # Optional: custom endpoint

# Embedding Model
TDC_AI_EMBEDDING_MODEL=ollama/embeddinggemma   # Default: local Ollama
TDC_AI_EMBEDDING_API_KEY=                      # Not needed for local models

# Storage
TDC_AI_STORE_PATH=                             # Defaults to <cache_dir>/.ai/lancedb
TDC_AI_MAX_CHUNK_SIZE=1000                     # Chunk size for embeddings
TDC_AI_CHUNK_OVERLAP=100                       # Overlap between chunks

# Summary Constraints
TDC_AI_ABSTRACT_MIN_WORDS=150
TDC_AI_ABSTRACT_MAX_WORDS=250

# Processing
TDC_AI_PARALLELISM=4                           # Parallel workers
```
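As a rough illustration of how `TDC_AI_MAX_CHUNK_SIZE` and `TDC_AI_CHUNK_OVERLAP` interact, the sketch below (not the module's actual implementation) splits text into fixed-size chunks whose tails overlap, so context that straddles a boundary appears in both neighboring chunks:

```python
def chunk_text(text: str, max_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most max_size chars, with the last
    `overlap` chars of each chunk repeated at the start of the next."""
    if overlap >= max_size:
        raise ValueError("overlap must be smaller than max_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_size])
        start += max_size - overlap  # step forward, keeping the overlap
    return chunks

chunks = chunk_text("x" * 2500, max_size=1000, overlap=100)
# Chunks start at offsets 0, 900, 1800 -> lengths 1000, 1000, 700
```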

### Model Identifier Format

Both LLM and embedding models use the `<provider>/<model_name>` format:

```bash
# Simple format: provider/model
TDC_AI_LLM_MODEL=openai/gpt-4o-mini

# Nested format: provider/model_group/model (also supported)
TDC_AI_LLM_MODEL=openrouter/anthropic/claude-3-sonnet
TDC_AI_EMBEDDING_MODEL=huggingface/BAAI/bge-small-en-v1.5
```

The provider (first segment) is validated against the supported allowlist. The model name (everything after the first `/`) can contain additional slashes for nested model paths.

**Note:** LiteLLM is used as the backend, supporting 100+ providers. See the [LiteLLM documentation](https://docs.litellm.ai/) for the full list.

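The split and allowlist check described above can be sketched as follows (illustrative only; `SUPPORTED_PROVIDERS` here is a small excerpt, not the real allowlist):

```python
# Excerpt for illustration; the real allowlist covers all providers
# listed in the Model Providers section.
SUPPORTED_PROVIDERS = {"openai", "anthropic", "openrouter", "ollama", "huggingface"}

def parse_model_id(model_id: str) -> tuple[str, str]:
    """Split '<provider>/<model_name>' at the FIRST slash only, so the
    model name may itself contain slashes (nested model paths)."""
    provider, sep, model_name = model_id.partition("/")
    if not sep or not model_name:
        raise ValueError("model must be in '<provider>/<model_name>' format")
    if provider.lower() not in SUPPORTED_PROVIDERS:
        raise ValueError(f"provider {provider!r} is not in supported provider allowlist")
    return provider.lower(), model_name

parse_model_id("openrouter/openrouter/free")          # ("openrouter", "openrouter/free")
parse_model_id("huggingface/BAAI/bge-small-en-v1.5")  # ("huggingface", "BAAI/bge-small-en-v1.5")
```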
______________________________________________________________________

## Workflow Guide

The AI module follows a workspace-based workflow for organizing and querying your document collection:

### 1. Create Workspace

```bash
# Create a new workspace for your project
tdoc-crawler ai workspace create my-project
```

### 2. Add Documents to Workspace

Use the existing `checkout` and `checkout-spec` commands to download documents:

```bash
# Add TDocs to workspace
tdoc-crawler checkout --workspace my-project SP-240001 SP-240002

# Add specifications
tdoc-crawler checkout-spec --workspace my-project 23.501 23.502
```

### 3. Process Documents (Build Knowledge Base)

Process documents to extract content, generate embeddings, and create summaries:

```bash
# Process all documents in workspace
tdoc-crawler ai process-all --workspace my-project

# Or process individual TDoc
tdoc-crawler ai process --tdoc-id SP-240001 --checkout-path /path/to/checkout
```

### 4. Query Your Knowledge Base

Once processed, query your documents using semantic search or graph queries:

```bash
# Semantic search
tdoc-crawler ai query --workspace my-project --query "5G NR architecture" --top-k 5

# Knowledge graph query
tdoc-crawler ai graph --workspace my-project --query "evolution of 5G standards"
```

### 5. Check Status

Monitor processing status:

```bash
# Check status of specific TDoc
tdoc-crawler ai status --tdoc-id SP-240001

# List all processed documents
tdoc-crawler ai status --workspace my-project
```

______________________________________________________________________

## CLI Commands

### Workspace Management

```bash
# Create a new workspace
tdoc-crawler ai workspace create <name>

# List all workspaces
tdoc-crawler ai workspace list

# Get workspace details
tdoc-crawler ai workspace get <name>

# Delete a workspace
tdoc-crawler ai workspace delete <name>
```

### Document Processing

```bash
# Process single TDoc
tdoc-crawler ai process --tdoc-id <TDOC_ID> --checkout-path <PATH>

# Process all TDocs in workspace
tdoc-crawler ai process-all --workspace <NAME>

# Force re-processing
tdoc-crawler ai process --tdoc-id <TDOC_ID> --checkout-path <PATH> --force
```

### Querying

```bash
# Semantic search
tdoc-crawler ai query --workspace <NAME> --query "<SEARCH_QUERY>" --top-k 5

# Knowledge graph query
tdoc-crawler ai graph --workspace <NAME> --query "<GRAPH_QUERY>"
```

### Status

```bash
# Check processing status
tdoc-crawler ai status --tdoc-id <TDOC_ID>

# List all statuses in workspace
tdoc-crawler ai status --workspace <NAME>

# Output as JSON
tdoc-crawler ai status --tdoc-id <TDOC_ID> --json
```

______________________________________________________________________

## Model Providers

### Supported LLM Providers

| Provider | Example Model | API Key Env Var | Notes |
|----------|---------------|-----------------|-------|
| `openai` | `gpt-4o-mini`, `gpt-4o` | `OPENAI_API_KEY` | Industry standard |
| `anthropic` | `claude-3-haiku`, `claude-3-sonnet` | `ANTHROPIC_API_KEY` | High-quality reasoning |
| `openrouter` | `openrouter/free` | `OPENROUTER_API_KEY` | **Recommended** - Free tier available |
| `github_copilot` | `gpt-4o` | `GITHUB_COPILOT_API_KEY` | GitHub Copilot endpoint |
| `nvidia` | `meta/llama3-70b` | `NVIDIA_API_KEY` | NVIDIA NIM platform |
| `google` | `gemini-1.5-flash` | `GOOGLE_API_KEY` | Google AI Studio |
| `azure` | `gpt-4o` | `AZURE_API_KEY` | Azure OpenAI Service |
| `vertex_ai` | `gemini-pro` | `VERTEX_AI_API_KEY` | Google Cloud Vertex |
| `groq` | `llama-3.1-70b` | `GROQ_API_KEY` | Fast inference |
| `mistral` | `mistral-large` | `MISTRAL_API_KEY` | Mistral AI |
| `together_ai` | `meta-llama/Llama-3-70b` | `TOGETHER_API_KEY` | Together AI platform |
| `huggingface` | `mistralai/Mistral-7B` | `HF_API_KEY` | Hugging Face Inference |
| `ollama` | `llama3.2`, `mistral` | *(none)* | **Local** - No API key needed |
| `sambanova` | `Meta-Llama-3.1-70B` | `SAMBANOVA_API_KEY` | SambaNova Cloud |
| `fireworks_ai` | `accounts/fireworks/models/llama-v3-70b` | `FIREWORKS_API_KEY` | Fireworks AI |
| `anyscale` | `meta-llama/Llama-3-70b` | `ANYSCALE_API_KEY` | Anyscale Endpoints |
| `perplexity` | `pplx-7b-chat` | `PERPLEXITY_API_KEY` | Perplexity API |
| `deepinfra` | `meta-llama/Llama-3-70b` | `DEEPINFRA_API_KEY` | DeepInfra |

### Supported Embedding Providers

| Provider | Example Model | API Key Env Var | Notes |
|----------|---------------|-----------------|-------|
| `ollama` | `embeddinggemma`, `nomic-embed-text` | *(none)* | **Recommended** - Local |
| `huggingface` | `BAAI/bge-small-en-v1.5` | `HF_API_KEY` | BGE embeddings |
| `openai` | `text-embedding-3-small` | `OPENAI_API_KEY` | OpenAI embeddings |
| `cohere` | `embed-english-v3.0` | `COHERE_API_KEY` | Cohere embeddings |
| `google` | `text-embedding-004` | `GOOGLE_API_KEY` | Google embeddings |

### Recommended Configuration

**Free/Local Setup (No Cost):**

```bash
TDC_AI_LLM_MODEL=openrouter/openrouter/free
TDC_AI_EMBEDDING_MODEL=ollama/embeddinggemma
OPENROUTER_API_KEY=your-free-api-key
```

**Production Setup (High Quality):**

```bash
TDC_AI_LLM_MODEL=anthropic/claude-3-sonnet
TDC_AI_EMBEDDING_MODEL=openai/text-embedding-3-small
ANTHROPIC_API_KEY=your-key
OPENAI_API_KEY=your-key
```

______________________________________________________________________

## Python API

@@ -162,25 +271,31 @@ from tdoc_crawler.ai import (
    get_status,
    query_embeddings,
    query_graph,
    create_workspace,
    get_workspace,
)

# Create workspace
workspace = create_workspace("my-project")

# Process single TDoc
status = process_tdoc("SP-240001", "/path/to/checkout", workspace="my-project")

# Batch processing
results = process_all(
    ["SP-240001", "SP-240002"],
    "/base/checkout/path",
    workspace="my-project"
)

# Get status
status = get_status("SP-240001")

# Semantic search
results = query_embeddings("5G architecture", top_k=5, workspace="my-project")

# Query knowledge graph
graph_data = query_graph("evolution of 5G NR", workspace="my-project")
```

### Models
@@ -192,6 +307,7 @@ from tdoc_crawler.ai import (
    DocumentClassification,
    DocumentSummary,
    DocumentChunk,
    Workspace,
)
```

@@ -200,7 +316,7 @@ from tdoc_crawler.ai import (
The AI processing pipeline consists of these stages:

1. **CLASSIFY** - Identify main document among multiple files
1. **EXTRACT** - Convert DOCX/PDF to Markdown (via Kreuzberg)
1. **EMBED** - Generate vector embeddings
1. **SUMMARIZE** - Create AI summaries
1. **GRAPH** - Build knowledge graph relationships
@@ -208,9 +324,9 @@ The AI processing pipeline consists of these stages:
## Supported File Types

- **DOCX** - Primary format for extraction (via Kreuzberg)
- **PDF** - Supported via Kreuzberg
- **XLSX** - Handled as secondary files
- **PPTX** - Handled as secondary files

## Testing

@@ -226,26 +342,166 @@ uv run pytest tests/ai/test_ai_extraction.py -v

Test data is located in `tests/ai/data/`.

______________________________________________________________________

## Troubleshooting

### Installation Issues

**Problem:** `ModuleNotFoundError: No module named 'kreuzberg'`

**Solution:** Install the AI optional dependencies:

```bash
uv add tdoc-crawler[ai]
```

**Problem:** `lancedb not available`

**Solution:** LanceDB is included in the `[ai]` extra. Reinstall:

```bash
uv sync --extra ai
```

### Model Configuration Errors

**Problem:** `ValueError: TDC_AI_LLM_MODEL must be in '<provider>/<model_name>' format`

**Solution:** Ensure your model identifier includes a provider prefix:

```bash
# Wrong
TDC_AI_LLM_MODEL=gpt-4o-mini

# Correct
TDC_AI_LLM_MODEL=openai/gpt-4o-mini
```

**Problem:** `ValueError: provider 'xyz' is not in supported provider allowlist`

**Solution:** Check the provider name spelling. See [Model Providers](#model-providers) for the full list. Provider names are case-insensitive.

### API Key Errors

**Problem:** `litellm.AuthenticationError: Invalid API key`

**Solution:** Verify your API key is set correctly:

```bash
# For OpenAI
export OPENAI_API_KEY=sk-...

# For Anthropic
export ANTHROPIC_API_KEY=sk-ant-...

# For OpenRouter
export OPENROUTER_API_KEY=...

# Check if set
echo $OPENAI_API_KEY
```

**Problem:** `Missing API key for provider 'openai'`

**Solution:** LiteLLM expects the API key in a standard environment variable named `<PROVIDER>_API_KEY`. See the [Model Providers](#model-providers) table for the correct variable name for each provider.

### Embedding Model Issues

**Problem:** `OSError: No sentence-transformers model found`

**Solution:** If using a Hugging Face embedding model, ensure `sentence-transformers` is installed:

```bash
uv add sentence-transformers
```

**Problem:** Ollama embedding model not found

**Solution:** Pull the model in Ollama first:

```bash
ollama pull embeddinggemma
```

### Workspace Issues

**Problem:** `Workspace 'my-project' not found`

**Solution:** Create the workspace first:

```bash
tdoc-crawler ai workspace create my-project
```

**Problem:** `No documents found in workspace`

**Solution:** Add documents to the workspace using `checkout` or `checkout-spec` commands, then process them:

```bash
tdoc-crawler checkout --workspace my-project SP-240001
tdoc-crawler ai process-all --workspace my-project
```

### Processing Errors

**Problem:** `TDoc 'SP-240001' not found in checkout path`

**Solution:** Ensure the TDoc has been downloaded to the specified path:

```bash
tdoc-crawler checkout SP-240001
tdoc-crawler ai process --tdoc-id SP-240001 --checkout-path ~/.tdoc-crawler/checkout
```

**Problem:** `LLM API timeout`

**Solution:** Increase timeout or reduce token count:

```bash
# Increase timeout (if supported by provider)
export LITELLM_REQUEST_TIMEOUT=60

# Reduce max tokens
export TDC_AI_LLM_MAX_TOKENS=1000
```

### Performance Issues

**Problem:** Processing is very slow

**Solution:**

1. Increase parallelism:

   ```bash
   export TDC_AI_PARALLELISM=8
   ```

1. Use a faster LLM for summarization (e.g., `gpt-4o-mini` instead of `gpt-4o`)

1. For local models, ensure Ollama is running with GPU acceleration if available
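As an illustration of the parallelism suggestion, a worker pool sized from `TDC_AI_PARALLELISM` might look like this (`process_many` and `process_one` are hypothetical names for this sketch; the module's internals may differ):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_many(tdoc_ids, process_one):
    """Run process_one over all TDoc IDs with a bounded worker pool.
    Pool size comes from TDC_AI_PARALLELISM (default 4)."""
    workers = int(os.environ.get("TDC_AI_PARALLELISM", "4"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in the results
        return list(pool.map(process_one, tdoc_ids))

results = process_many(["SP-240001", "SP-240002"], lambda t: f"processed {t}")
```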

### LanceDB Issues

**Problem:** `lancedb.errors.InternalError: Schema mismatch`

**Solution:** This can occur after upgrading the AI module. The LanceDB schema may need to be recreated. Backup your data and delete the LanceDB directory:

```bash
# Backup first!
cp -r ~/.tdoc-crawler/.ai/lancedb ~/.tdoc-crawler/.ai/lancedb.backup

# Delete and let it recreate
rm -rf ~/.tdoc-crawler/.ai/lancedb
```

**Note:** This will delete all processed embeddings and summaries. You'll need to re-process your documents.

______________________________________________________________________

## Additional Resources

- [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers) - Complete list of 100+ supported LLM providers
- [Kreuzberg Documentation](https://docs.kreuzberg.dev/) - Document extraction library
- [LanceDB Documentation](https://lancedb.github.io/lancedb/) - Vector database
@@ -5,6 +5,7 @@
## Test Results Summary

**Total Tests**: 377

- **Passed**: 372
- **Failed**: 1 (pre-existing)
- **Skipped**: 5 (model-dependent tests)
@@ -15,11 +16,11 @@

| ID | Criterion | Status | Notes |
|----|-----------|--------|-------|
| SC-001 | Single TDoc extraction \<30s | ✅ PASS | Unit tests verify extraction logic; actual performance depends on hardware |
| SC-002 | Main doc identification >90% | ✅ PASS | Heuristic-based classification with confidence scoring |
| SC-003 | Semantic search top-5 >80% | ⚠️ DEFERRED | Requires actual embedding model; test infrastructure in place |
| SC-004 | LLM abstracts 150-250 words | ✅ PASS | Word count validation in tests; requires LLM for E2E |
| SC-005 | Idempotent re-processing \<10% | ✅ PASS | Hash-based skip logic implemented |
| SC-006 | Resume after crash | ✅ PASS | Pipeline status tracking enables resume |
| SC-007 | Temporal graph ordering | ✅ PASS | Chronological sorting in query_graph |

@@ -45,10 +46,10 @@
## Known Issues

1. **Type Checking**: Pre-existing type errors in embeddings.py, graph.py, summarize.py - requires model validation pattern fixes
1. **One Test Failure**: `test_no_whatthespec_when_credentials_available` - pre-existing failure unrelated to AI features

## Recommendations

1. Address type checking errors in follow-up PR
1. Add integration test markers for E2E tests requiring actual models
1. Consider adding SC-003 validation with actual embedding model
@@ -46,6 +46,7 @@ Phase 9 migration will refactor the extraction interface to better match Kreuzbe
patterns rather than maintaining API compatibility.

**Migration requirements**:

- ✅ Kreuzberg MUST provide equivalent or better DOCX-to-Markdown conversion capabilities
- ✅ The `extract_from_folder()` function signature MAY change to leverage Kreuzberg's native API
- ✅ Internal implementation in extract.py should be refactored to use Kreuzberg idioms
@@ -62,23 +63,27 @@ migration happens in Phase 9 when Kreuzberg integration is complete and tested.
**Decision**: Use a hybrid approach combining rule-based and LLM-powered extraction.

**Rule-based extraction (deterministic, fast):**

- TDoc ID patterns: `S[0-9]+-[0-9]+`, `[0-9]{5}-j[0-9]+` (regex matching)
- Meeting codes: `SA4#[0-9]+`, `RP-[0-9]+` (structured identifiers)
- Specification references: `TS [0-9]+\.[0-9]+\.[0-9]+` (3GPP spec format; dots escaped to match literally)
- Explicit cross-references: When TDoc content explicitly mentions another TDoc ID
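Applied with plain regex matching, the rule-based patterns above look roughly like this (illustrative sketch; dots in the spec pattern are escaped so they match literally):

```python
import re

# Patterns taken from the rule list above (excerpt, not exhaustive).
RULES = {
    "tdoc_id": re.compile(r"S[0-9]+-[0-9]+"),
    "meeting": re.compile(r"SA4#[0-9]+|RP-[0-9]+"),
    "spec_ref": re.compile(r"TS [0-9]+\.[0-9]+\.[0-9]+"),
}

def extract_references(text: str) -> dict[str, list[str]]:
    """Deterministic pass: collect every match per rule kind."""
    return {kind: rx.findall(text) for kind, rx in RULES.items()}

extract_references("SA4#128 reviewed S4-241234 against TS 26.114.2")
```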

**LLM-powered extraction (semantic, flexible):**

- Concept extraction: Identify technical concepts from content semantics
- Implicit relationships: Discover connections not explicitly stated (e.g., "similar approach to..." without TDoc ID)
- Work item identification: Map TDocs to work items based on topic discussion
- Relationship typing: Classify edge types (discusses, revises, extends, contradicts)

**Implementation phases:**

1. **Phase 9a**: Implement rule-based extraction first (quick wins, deterministic)
1. **Phase 9b**: Add LLM-powered semantic extraction (comprehensive but slower)
1. **Phase 9c**: Merge both sources with conflict resolution (rules take precedence for structured data)

**Conflict resolution:**

- When rules and LLM disagree on structured data (TDoc IDs, meeting codes) → trust rules
- When LLM identifies implicit relationships → accept unless contradicted by rules
- Log all conflicts for debugging and manual review if needed
@@ -86,6 +91,7 @@ migration happens in Phase 9 when Kreuzberg integration is complete and tested.
  migration happens in Phase 9 when Kreuzberg integration is complete and tested.

**Authorized breaking changes**:

- Function signatures in `src/tdoc_crawler/ai/operations/extract.py`
- Internal data structures and intermediate representations
- Error handling patterns (may adopt Kreuzberg-specific exception types)
@@ -99,6 +105,7 @@ migration happens in Phase 9 when Kreuzberg integration is complete and tested.
Docling serves as an interim solution to unblock early phase implementation and testing.

**Migration requirements**:

- Kreuzberg MUST provide equivalent or better DOCX-to-Markdown conversion capabilities
- All imports of `docling.*` modules must be replaced with `kreuzberg.*` equivalents
- The `extract_from_folder()` function signature should remain stable to avoid breaking pipeline.py