Commit a319ffaa authored by Jan Reimes's avatar Jan Reimes

feat(docs): Add AI documentation improvements and CLI redesign spec

* Introduce new feature specification for AI documentation improvements.
* Clarify user scenarios and testing requirements for new CLI commands.
* Define acceptance criteria for workspace creation with auto-build.
* Outline functional requirements for model configuration and command redesign.
* Document user stories focusing on free AI features and streamlined workflows.
* Include edge cases and success criteria for measurable outcomes.

feat(cli): Implement CLI command redesign for AI operations

* Remove deprecated commands: `ai process`, `ai status`, `ai graph`.
* Introduce new commands: `ai summarize` and `ai convert` for single TDoc operations.
* Ensure `ai query` command is workspace-only, integrating RAG and GraphRAG.
* Add `--auto-build` flag to workspace creation for automatic processing.

fix(config): Update default LLM and embedding models

* Change default LLM model to `openrouter/openrouter/free`.
* Update default embedding model to `ollama/embeddinggemma`.

refactor(operations): Modify summarization logic to use dynamic models

* Update `LiteLLMClient` to accept model parameter, defaulting to config.
* Ensure summarization uses the configured model instead of hardcoded values.

docs(tests): Enhance test documentation for new AI features

* Add tasks for testing new CLI commands and auto-build functionality.
* Outline phases for implementing and verifying new features in the AI module.
parent 0992bce9
@@ -78,20 +78,22 @@ HTTP_CACHE_TTL=7200
HTTP_CACHE_REFRESH_ON_ACCESS=true

# AI Configuration
# Note: AI module requires API keys for cloud providers. See docs/ai.md for details.

# Path to AI LanceDB store (default: <cache_dir>/.ai/lancedb)
TDC_AI_STORE_PATH=

# LLM model in format <provider>/<model_name>
# Recommended: openrouter/openrouter/free (free tier, no subscription required)
# API key: OPENROUTER_API_KEY
TDC_AI_LLM_MODEL=openrouter/openrouter/free

# Optional custom base URL for LLM provider/proxy
TDC_AI_LLM_API_BASE=

# Embedding model in format <provider>/<model_name>
# Recommended: ollama/embeddinggemma (local, no API key required)
TDC_AI_EMBEDDING_MODEL=ollama/embeddinggemma

# Chunking
TDC_AI_MAX_CHUNK_SIZE=1000
@@ -13,6 +13,7 @@ A command-line tool for crawling the 3GPP FTP server, caching TDoc metadata in a
- **Case-Insensitive Queries**: Search for TDocs regardless of case
- **Multiple Output Formats**: Export results as table, JSON, or YAML
- **Incremental Updates**: Only fetch new data on subsequent crawls
- **AI Document Processing**: Semantic search, knowledge graphs, and AI-powered summarization (optional, install with `tdoc-crawler[ai]`)
- **Rich CLI**: Beautiful terminal output with progress indicators

## Installation
@@ -30,6 +31,9 @@ uvx tdoc-crawler --help
# Install from PyPI (publication pending)
uv add tdoc-crawler

# Install with AI features (optional)
uv add tdoc-crawler[ai]

# Or install from source
git clone https://forge.3gpp.org/rep/reimes/tdoc-crawler.git
cd tdoc-crawler
# AI Document Processing

The AI module provides intelligent document processing capabilities for TDoc data, including semantic search, knowledge graph construction, and AI-powered summarization.

**Key Features:**

- **Classification** - Identify main documents in multi-file TDoc folders
- **Extraction** - Convert DOCX/PDF to Markdown with keyword extraction and language detection (via Kreuzberg)
- **Embeddings** - Generate semantic vector representations for similarity search
- **Summarization** - Create AI-powered abstracts
- **Knowledge Graph** - Build relationships between TDocs
- **Workspaces** - Organize TDocs into logical groups for focused analysis

______________________________________________________________________

## Table of Contents

- [Installation](#installation)
- [Configuration](#configuration)
- [Workflow Guide](#workflow-guide)
- [CLI Commands](#cli-commands)
- [Model Providers](#model-providers)
- [Python API](#python-api)
- [Environment Variables](#environment-variables)
- [Troubleshooting](#troubleshooting)

______________________________________________________________________

## Installation

The AI module is available as an optional dependency. Install it with:

```bash
# Install tdoc-crawler with AI support
uv add tdoc-crawler[ai]

# Or install from source
git clone https://forge.3gpp.org/rep/reimes/tdoc-crawler.git
cd tdoc-crawler
uv sync --extra ai
```

All required dependencies (Kreuzberg, LiteLLM, sentence-transformers, LanceDB) are installed automatically.

______________________________________________________________________

## Configuration

### Environment Variables
@@ -36,122 +51,216 @@ Configure AI processing via environment variables (see `.env.example`):

```bash
# LLM Configuration
TDC_AI_LLM_MODEL=openrouter/openrouter/free    # Default: free tier via OpenRouter
TDC_AI_LLM_API_KEY=your-api-key                # Required for cloud providers
TDC_AI_LLM_API_BASE=                           # Optional: custom endpoint

# Embedding Model
TDC_AI_EMBEDDING_MODEL=ollama/embeddinggemma   # Default: local Ollama
TDC_AI_EMBEDDING_API_KEY=                      # Not needed for local models

# Storage
TDC_AI_STORE_PATH=                             # Defaults to <cache_dir>/.ai/lancedb
TDC_AI_MAX_CHUNK_SIZE=1000                     # Chunk size for embeddings
TDC_AI_CHUNK_OVERLAP=100                       # Overlap between chunks

# Summary Constraints
TDC_AI_ABSTRACT_MIN_WORDS=150
TDC_AI_ABSTRACT_MAX_WORDS=250

# Processing
TDC_AI_PARALLELISM=4                           # Parallel workers
```
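As a rough illustration of how `TDC_AI_MAX_CHUNK_SIZE` and `TDC_AI_CHUNK_OVERLAP` interact, the sketch below (not the module's actual implementation) splits text into fixed-size chunks whose tails overlap, so context that straddles a boundary appears in both neighboring chunks:

```python
def chunk_text(text: str, max_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most max_size chars, with the last
    `overlap` chars of each chunk repeated at the start of the next."""
    if overlap >= max_size:
        raise ValueError("overlap must be smaller than max_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_size])
        start += max_size - overlap  # step forward, keeping the overlap
    return chunks

chunks = chunk_text("x" * 2500, max_size=1000, overlap=100)
# Chunks start at offsets 0, 900, 1800 -> lengths 1000, 1000, 700
```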

### Model Identifier Format

Both LLM and embedding models use the `<provider>/<model_name>` format:

```bash
# Simple format: provider/model
TDC_AI_LLM_MODEL=openai/gpt-4o-mini

# Nested format: provider/model_group/model (also supported)
TDC_AI_LLM_MODEL=openrouter/anthropic/claude-3-sonnet
TDC_AI_EMBEDDING_MODEL=huggingface/BAAI/bge-small-en-v1.5
```

The provider (first segment) is validated against the supported allowlist. The model name (everything after the first `/`) can contain additional slashes for nested model paths.

**Note:** LiteLLM is used as the backend, supporting 100+ providers. See the [LiteLLM documentation](https://docs.litellm.ai/) for the full list.

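The split and allowlist check described above can be sketched as follows (illustrative only; `SUPPORTED_PROVIDERS` here is a small excerpt, not the real allowlist):

```python
# Excerpt for illustration; the real allowlist covers all providers
# listed in the Model Providers section.
SUPPORTED_PROVIDERS = {"openai", "anthropic", "openrouter", "ollama", "huggingface"}

def parse_model_id(model_id: str) -> tuple[str, str]:
    """Split '<provider>/<model_name>' at the FIRST slash only, so the
    model name may itself contain slashes (nested model paths)."""
    provider, sep, model_name = model_id.partition("/")
    if not sep or not model_name:
        raise ValueError("model must be in '<provider>/<model_name>' format")
    if provider.lower() not in SUPPORTED_PROVIDERS:
        raise ValueError(f"provider {provider!r} is not in supported provider allowlist")
    return provider.lower(), model_name

parse_model_id("openrouter/openrouter/free")          # ("openrouter", "openrouter/free")
parse_model_id("huggingface/BAAI/bge-small-en-v1.5")  # ("huggingface", "BAAI/bge-small-en-v1.5")
```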
______________________________________________________________________

## Workflow Guide

The AI module follows a workspace-based workflow for organizing and querying your document collection:

### 1. Create Workspace

```bash
# Create a new workspace for your project
tdoc-crawler ai workspace create my-project
```

### 2. Add Documents to Workspace

Use the existing `checkout` and `checkout-spec` commands to download documents:

```bash
# Add TDocs to workspace
tdoc-crawler checkout --workspace my-project SP-240001 SP-240002

# Add specifications
tdoc-crawler checkout-spec --workspace my-project 23.501 23.502
```

### 3. Process Documents (Build Knowledge Base)

Process documents to extract content, generate embeddings, and create summaries:

```bash
# Process all documents in workspace
tdoc-crawler ai process-all --workspace my-project

# Or process individual TDoc
tdoc-crawler ai process --tdoc-id SP-240001 --checkout-path /path/to/checkout
```

### 4. Query Your Knowledge Base

Once processed, query your documents using semantic search or graph queries:

```bash
# Semantic search
tdoc-crawler ai query --workspace my-project --query "5G NR architecture" --top-k 5

# Knowledge graph query
tdoc-crawler ai graph --workspace my-project --query "evolution of 5G standards"
```

### 5. Check Status

Monitor processing status:

```bash
# Check status of specific TDoc
tdoc-crawler ai status --tdoc-id SP-240001

# List all processed documents
tdoc-crawler ai status --workspace my-project
```

______________________________________________________________________

## CLI Commands

### Workspace Management

```bash
# Create a new workspace
tdoc-crawler ai workspace create <name>

# List all workspaces
tdoc-crawler ai workspace list

# Get workspace details
tdoc-crawler ai workspace get <name>

# Delete a workspace
tdoc-crawler ai workspace delete <name>
```

### Document Processing

```bash
# Process single TDoc
tdoc-crawler ai process --tdoc-id <TDOC_ID> --checkout-path <PATH>

# Process all TDocs in workspace
tdoc-crawler ai process-all --workspace <NAME>

# Force re-processing
tdoc-crawler ai process --tdoc-id <TDOC_ID> --checkout-path <PATH> --force
```

### Querying

```bash
# Semantic search
tdoc-crawler ai query --workspace <NAME> --query "<SEARCH_QUERY>" --top-k 5

# Knowledge graph query
tdoc-crawler ai graph --workspace <NAME> --query "<GRAPH_QUERY>"
```

### Status

```bash
# Check processing status
tdoc-crawler ai status --tdoc-id <TDOC_ID>

# List all statuses in workspace
tdoc-crawler ai status --workspace <NAME>

# Output as JSON
tdoc-crawler ai status --tdoc-id <TDOC_ID> --json
```

______________________________________________________________________

## Model Providers

### Supported LLM Providers

| Provider | Example Model | API Key Env Var | Notes |
|----------|---------------|-----------------|-------|
| `openai` | `gpt-4o-mini`, `gpt-4o` | `OPENAI_API_KEY` | Industry standard |
| `anthropic` | `claude-3-haiku`, `claude-3-sonnet` | `ANTHROPIC_API_KEY` | High-quality reasoning |
| `openrouter` | `openrouter/free` | `OPENROUTER_API_KEY` | **Recommended** - Free tier available |
| `github_copilot` | `gpt-4o` | `GITHUB_COPILOT_API_KEY` | GitHub Copilot endpoint |
| `nvidia` | `meta/llama3-70b` | `NVIDIA_API_KEY` | NVIDIA NIM platform |
| `google` | `gemini-1.5-flash` | `GOOGLE_API_KEY` | Google AI Studio |
| `azure` | `gpt-4o` | `AZURE_API_KEY` | Azure OpenAI Service |
| `vertex_ai` | `gemini-pro` | `VERTEX_AI_API_KEY` | Google Cloud Vertex |
| `groq` | `llama-3.1-70b` | `GROQ_API_KEY` | Fast inference |
| `mistral` | `mistral-large` | `MISTRAL_API_KEY` | Mistral AI |
| `together_ai` | `meta-llama/Llama-3-70b` | `TOGETHER_API_KEY` | Together AI platform |
| `huggingface` | `mistralai/Mistral-7B` | `HF_API_KEY` | Hugging Face Inference |
| `ollama` | `llama3.2`, `mistral` | *(none)* | **Local** - No API key needed |
| `sambanova` | `Meta-Llama-3.1-70B` | `SAMBANOVA_API_KEY` | SambaNova Cloud |
| `fireworks_ai` | `accounts/fireworks/models/llama-v3-70b` | `FIREWORKS_API_KEY` | Fireworks AI |
| `anyscale` | `meta-llama/Llama-3-70b` | `ANYSCALE_API_KEY` | Anyscale Endpoints |
| `perplexity` | `pplx-7b-chat` | `PERPLEXITY_API_KEY` | Perplexity API |
| `deepinfra` | `meta-llama/Llama-3-70b` | `DEEPINFRA_API_KEY` | DeepInfra |

### Supported Embedding Providers

| Provider | Example Model | API Key Env Var | Notes |
|----------|---------------|-----------------|-------|
| `ollama` | `embeddinggemma`, `nomic-embed-text` | *(none)* | **Recommended** - Local |
| `huggingface` | `BAAI/bge-small-en-v1.5` | `HF_API_KEY` | BGE embeddings |
| `openai` | `text-embedding-3-small` | `OPENAI_API_KEY` | OpenAI embeddings |
| `cohere` | `embed-english-v3.0` | `COHERE_API_KEY` | Cohere embeddings |
| `google` | `text-embedding-004` | `GOOGLE_API_KEY` | Google embeddings |

### Recommended Configuration

**Free/Local Setup (No Cost):**

```bash
TDC_AI_LLM_MODEL=openrouter/openrouter/free
TDC_AI_EMBEDDING_MODEL=ollama/embeddinggemma
OPENROUTER_API_KEY=your-free-api-key
```

**Production Setup (High Quality):**

```bash
TDC_AI_LLM_MODEL=anthropic/claude-3-sonnet
TDC_AI_EMBEDDING_MODEL=openai/text-embedding-3-small
ANTHROPIC_API_KEY=your-key
OPENAI_API_KEY=your-key
```

______________________________________________________________________

## Python API

@@ -162,25 +271,31 @@ from tdoc_crawler.ai import (
    get_status,
    query_embeddings,
    query_graph,
    create_workspace,
    get_workspace,
)

# Create workspace
workspace = create_workspace("my-project")

# Process single TDoc
status = process_tdoc("SP-240001", "/path/to/checkout", workspace="my-project")

# Batch processing
results = process_all(
    ["SP-240001", "SP-240002"],
    "/base/checkout/path",
    workspace="my-project"
)

# Get status
status = get_status("SP-240001")

# Semantic search
results = query_embeddings("5G architecture", top_k=5, workspace="my-project")

# Query knowledge graph
graph_data = query_graph("evolution of 5G NR", workspace="my-project")
```

### Models
@@ -192,6 +307,7 @@ from tdoc_crawler.ai import (
    DocumentClassification,
    DocumentSummary,
    DocumentChunk,
    Workspace,
)
```

@@ -200,7 +316,7 @@ from tdoc_crawler.ai import (
The AI processing pipeline consists of these stages:

1. **CLASSIFY** - Identify main document among multiple files
1. **EXTRACT** - Convert DOCX/PDF to Markdown (via Kreuzberg)
1. **EMBED** - Generate vector embeddings
1. **SUMMARIZE** - Create AI summaries
1. **GRAPH** - Build knowledge graph relationships
@@ -208,9 +324,9 @@ The AI processing pipeline consists of these stages:
## Supported File Types

- **DOCX** - Primary format for extraction (via Kreuzberg)
- **PDF** - Supported via Kreuzberg
- **XLSX** - Handled as secondary files
- **PPTX** - Handled as secondary files

## Testing

@@ -226,26 +342,166 @@ uv run pytest tests/ai/test_ai_extraction.py -v

Test data is located in `tests/ai/data/`.

______________________________________________________________________

## Troubleshooting

### Installation Issues

**Problem:** `ModuleNotFoundError: No module named 'kreuzberg'`

**Solution:** Install the AI optional dependencies:

```bash
uv add tdoc-crawler[ai]
```

**Problem:** `lancedb not available`

**Solution:** LanceDB is included in the `[ai]` extra. Reinstall:

```bash
uv sync --extra ai
```

### Model Configuration Errors

**Problem:** `ValueError: TDC_AI_LLM_MODEL must be in '<provider>/<model_name>' format`

**Solution:** Ensure your model identifier includes a provider prefix:

```bash
# Wrong
TDC_AI_LLM_MODEL=gpt-4o-mini

# Correct
TDC_AI_LLM_MODEL=openai/gpt-4o-mini
```

**Problem:** `ValueError: provider 'xyz' is not in supported provider allowlist`

**Solution:** Check the provider name spelling. See [Model Providers](#model-providers) for the full list. Provider names are case-insensitive.

### API Key Errors

**Problem:** `litellm.AuthenticationError: Invalid API key`

**Solution:** Verify your API key is set correctly:

```bash
# For OpenAI
export OPENAI_API_KEY=sk-...

# For Anthropic
export ANTHROPIC_API_KEY=sk-ant-...

# For OpenRouter
export OPENROUTER_API_KEY=...

# Check if set
echo $OPENAI_API_KEY
```

**Problem:** `Missing API key for provider 'openai'`

**Solution:** LiteLLM expects the API key in a standard environment variable named `<PROVIDER>_API_KEY`. See the [Model Providers](#model-providers) table for the correct variable name for each provider.

### Embedding Model Issues

**Problem:** `OSError: No sentence-transformers model found`

**Solution:** If using a Hugging Face embedding model, ensure `sentence-transformers` is installed:

```bash
uv add sentence-transformers
```

**Problem:** Ollama embedding model not found

**Solution:** Pull the model in Ollama first:

```bash
ollama pull embeddinggemma
```

### Workspace Issues

**Problem:** `Workspace 'my-project' not found`

**Solution:** Create the workspace first:

```bash
tdoc-crawler ai workspace create my-project
```

**Problem:** `No documents found in workspace`

**Solution:** Add documents to the workspace using `checkout` or `checkout-spec` commands, then process them:

```bash
tdoc-crawler checkout --workspace my-project SP-240001
tdoc-crawler ai process-all --workspace my-project
```

### Processing Errors

**Problem:** `TDoc 'SP-240001' not found in checkout path`

**Solution:** Ensure the TDoc has been downloaded to the specified path:

```bash
tdoc-crawler checkout SP-240001
tdoc-crawler ai process --tdoc-id SP-240001 --checkout-path ~/.tdoc-crawler/checkout
```

**Problem:** `LLM API timeout`

**Solution:** Increase timeout or reduce token count:

```bash
# Increase timeout (if supported by provider)
export LITELLM_REQUEST_TIMEOUT=60

# Reduce max tokens
export TDC_AI_LLM_MAX_TOKENS=1000
```

### Performance Issues

**Problem:** Processing is very slow

**Solution:**

1. Increase parallelism:

   ```bash
   export TDC_AI_PARALLELISM=8
   ```

1. Use a faster LLM for summarization (e.g., `gpt-4o-mini` instead of `gpt-4o`)

1. For local models, ensure Ollama is running with GPU acceleration if available
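As an illustration of the parallelism suggestion, a worker pool sized from `TDC_AI_PARALLELISM` might look like this (`process_many` and `process_one` are hypothetical names for this sketch; the module's internals may differ):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_many(tdoc_ids, process_one):
    """Run process_one over all TDoc IDs with a bounded worker pool.
    Pool size comes from TDC_AI_PARALLELISM (default 4)."""
    workers = int(os.environ.get("TDC_AI_PARALLELISM", "4"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in the results
        return list(pool.map(process_one, tdoc_ids))

results = process_many(["SP-240001", "SP-240002"], lambda t: f"processed {t}")
```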

### LanceDB Issues

**Problem:** `lancedb.errors.InternalError: Schema mismatch`

**Solution:** This can occur after upgrading the AI module. The LanceDB schema may need to be recreated. Backup your data and delete the LanceDB directory:

```bash
# Backup first!
cp -r ~/.tdoc-crawler/.ai/lancedb ~/.tdoc-crawler/.ai/lancedb.backup

# Delete and let it recreate
rm -rf ~/.tdoc-crawler/.ai/lancedb
```

**Note:** This will delete all processed embeddings and summaries. You'll need to re-process your documents.

______________________________________________________________________

## Additional Resources

- [LiteLLM Provider Documentation](https://docs.litellm.ai/docs/providers) - Complete list of 100+ supported LLM providers
- [Kreuzberg Documentation](https://docs.kreuzberg.dev/) - Document extraction library
- [LanceDB Documentation](https://lancedb.github.io/lancedb/) - Vector database
@@ -5,6 +5,7 @@
## Test Results Summary

**Total Tests**: 377

- **Passed**: 372
- **Failed**: 1 (pre-existing)
- **Skipped**: 5 (model-dependent tests)
@@ -15,11 +16,11 @@

| ID | Criterion | Status | Notes |
|----|-----------|--------|-------|
| SC-001 | Single TDoc extraction \<30s | ✅ PASS | Unit tests verify extraction logic; actual performance depends on hardware |
| SC-002 | Main doc identification >90% | ✅ PASS | Heuristic-based classification with confidence scoring |
| SC-003 | Semantic search top-5 >80% | ⚠️ DEFERRED | Requires actual embedding model; test infrastructure in place |
| SC-004 | LLM abstracts 150-250 words | ✅ PASS | Word count validation in tests; requires LLM for E2E |
| SC-005 | Idempotent re-processing \<10% | ✅ PASS | Hash-based skip logic implemented |
| SC-006 | Resume after crash | ✅ PASS | Pipeline status tracking enables resume |
| SC-007 | Temporal graph ordering | ✅ PASS | Chronological sorting in query_graph |

@@ -45,10 +46,10 @@
## Known Issues

1. **Type Checking**: Pre-existing type errors in embeddings.py, graph.py, summarize.py - requires model validation pattern fixes
1. **One Test Failure**: `test_no_whatthespec_when_credentials_available` - pre-existing failure unrelated to AI features

## Recommendations

1. Address type checking errors in follow-up PR
1. Add integration test markers for E2E tests requiring actual models
1. Consider adding SC-003 validation with actual embedding model
@@ -46,6 +46,7 @@ Phase 9 migration will refactor the extraction interface to better match Kreuzbe
patterns rather than maintaining API compatibility.

**Migration requirements**:

- ✅ Kreuzberg MUST provide equivalent or better DOCX-to-Markdown conversion capabilities
- ✅ The `extract_from_folder()` function signature MAY change to leverage Kreuzberg's native API
- ✅ Internal implementation in extract.py should be refactored to use Kreuzberg idioms
@@ -62,23 +63,27 @@ migration happens in Phase 9 when Kreuzberg integration is complete and tested.
**Decision**: Use a hybrid approach combining rule-based and LLM-powered extraction.

**Rule-based extraction (deterministic, fast):**

- TDoc ID patterns: `S[0-9]+-[0-9]+`, `[0-9]{5}-j[0-9]+` (regex matching)
- Meeting codes: `SA4#[0-9]+`, `RP-[0-9]+` (structured identifiers)
- Specification references: `TS [0-9]+\.[0-9]+\.[0-9]+` (3GPP spec format; dots escaped to match literally)
- Explicit cross-references: When TDoc content explicitly mentions another TDoc ID
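Applied with plain regex matching, the rule-based patterns above look roughly like this (illustrative sketch; dots in the spec pattern are escaped so they match literally):

```python
import re

# Patterns taken from the rule list above (excerpt, not exhaustive).
RULES = {
    "tdoc_id": re.compile(r"S[0-9]+-[0-9]+"),
    "meeting": re.compile(r"SA4#[0-9]+|RP-[0-9]+"),
    "spec_ref": re.compile(r"TS [0-9]+\.[0-9]+\.[0-9]+"),
}

def extract_references(text: str) -> dict[str, list[str]]:
    """Deterministic pass: collect every match per rule kind."""
    return {kind: rx.findall(text) for kind, rx in RULES.items()}

extract_references("SA4#128 reviewed S4-241234 against TS 26.114.2")
```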

**LLM-powered extraction (semantic, flexible):**

- Concept extraction: Identify technical concepts from content semantics
- Implicit relationships: Discover connections not explicitly stated (e.g., "similar approach to..." without TDoc ID)
- Work item identification: Map TDocs to work items based on topic discussion
- Relationship typing: Classify edge types (discusses, revises, extends, contradicts)

**Implementation phases:**

1. **Phase 9a**: Implement rule-based extraction first (quick wins, deterministic)
1. **Phase 9b**: Add LLM-powered semantic extraction (comprehensive but slower)
1. **Phase 9c**: Merge both sources with conflict resolution (rules take precedence for structured data)

**Conflict resolution:**

- When rules and LLM disagree on structured data (TDoc IDs, meeting codes) → trust rules
- When LLM identifies implicit relationships → accept unless contradicted by rules
- Log all conflicts for debugging and manual review if needed
@@ -86,6 +91,7 @@ migration happens in Phase 9 when Kreuzberg integration is complete and tested.
  migration happens in Phase 9 when Kreuzberg integration is complete and tested.

**Authorized breaking changes**:

- Function signatures in `src/tdoc_crawler/ai/operations/extract.py`
- Internal data structures and intermediate representations
- Error handling patterns (may adopt Kreuzberg-specific exception types)
@@ -99,6 +105,7 @@ migration happens in Phase 9 when Kreuzberg integration is complete and tested.
Docling serves as an interim solution to unblock early phase implementation and testing.

**Migration requirements**:

- Kreuzberg MUST provide equivalent or better DOCX-to-Markdown conversion capabilities
- All imports of `docling.*` modules must be replaced with `kreuzberg.*` equivalents
- The `extract_from_folder()` function signature should remain stable to avoid breaking pipeline.py