Commit 4214a8c4 authored by Jan Reimes

feat(04-01): Implement config validate/docs commands and auto-gen docs

- Add config validate command with exit codes (0=valid, 1=syntax, 2=error, 3=warnings)
- Add config docs command showing all config sections with field info
- Add --file and --strict options to validate command
- Integrate config_app into tdoc_app via add_typer
- Create scripts/generate_config_docs.py for pydantic introspection
- Create docs/config.md with full config reference and migration guide
- Update .env.example with all TDC_* variables organized by section
parent e6b53c99
.env.example

+106 −77
# Environment Variables for TDoc Crawler
# =======================================
# This file documents all supported environment variables.
# For configuration file approach, see docs/config.md
#
# Copy to .env and replace values:
#   cp .env.example .env

# ============================================================================
# PATH CONFIGURATION
# ============================================================================

# Cache directory for storing downloaded metadata and files (default: ~/.3gpp-crawler)
# TDC_CACHE_DIR=~/.3gpp-crawler

# SQLite database filename (default: 3gpp_crawler.db)
# TDC_DB_FILENAME=3gpp_crawler.db

# Checkout directory name (default: checkout)
# TDC_CHECKOUT_DIRNAME=checkout

# AI cache directory name (default: lightrag)
# TDC_AI_CACHE_DIRNAME=lightrag

# ============================================================================
# ETSI ONLINE (EOL) CREDENTIALS
# ============================================================================
# Required for accessing protected 3GPP portal resources
# Sign up at: https://portal.etsi.org/

# Your ETSI Online username
TDC_EOL_USERNAME=

# Your ETSI Online password
TDC_EOL_PASSWORD=

# Custom prompt message for interactive credential entry
# TDC_EOL_PROMPT=

# ============================================================================
# HTTP SETTINGS
# ============================================================================

# HTTP request timeout in seconds (default: 30)
TDC_TIMEOUT=30

# Verify SSL certificates for HTTPS requests (default: true)
# TDC_VERIFY_SSL=true

# Maximum number of retry attempts for failed requests (default: 3)
TDC_MAX_RETRIES=3

# Enable HTTP response caching (default: true)
# HTTP_CACHE_ENABLED=true

# Time-to-live for cached HTTP responses in seconds (default: 7200 = 2 hours)
# HTTP_CACHE_TTL=7200

# Refresh TTL when a cached response is accessed (default: true)
# HTTP_CACHE_REFRESH_ON_ACCESS=true

# ============================================================================
# CRAWL FILTER CONFIGURATION
# ============================================================================

# Filter by working group (e.g., SA2, RAN1, CT3)
# TDC_WORKING_GROUP=SA4

# Filter by sub-working group (e.g., RAN1, RAN2)
# TDC_SUB_GROUP=

# Start date filter (YYYY, YYYY-MM, or YYYY-MM-DD format)
# TDC_START_DATE=2024-01-01

# End date filter (YYYY, YYYY-MM, or YYYY-MM-DD format)
# TDC_END_DATE=2024-12-31

# SQL LIKE pattern to match document source
# TDC_SOURCE_LIKE=

# SQL LIKE pattern to match agenda item
# TDC_AGENDA_LIKE=

# SQL LIKE pattern to match document title
# TDC_TITLE_LIKE=

# Maximum number of documents to crawl (default: 1000)
# TDC_LIMIT_TDOCS=1000

# Number of parallel subinterpreter workers (default: 4)
TDC_WORKERS=4

# ============================================================================
# AI CONFIGURATION (3GPP-AI)
# ============================================================================
# These settings are used by the 3gpp-ai package for document embeddings,
# knowledge graphs, and LLM-based processing.
# See: packages/3gpp-ai/docs/config.md

# LLM model in format <provider>/<model_name>
# Recommended: openrouter/openrouter/free (free tier, no subscription required)
TDC_AI_LLM_MODEL=openrouter/openrouter/free

# Optional custom base URL for LLM provider/proxy
# TDC_AI_LLM_API_BASE=

# Optional API key for LLM provider (overrides default provider-specific env vars)
# TDC_AI_LLM_API_KEY=

# Embedding model in format <provider>/<model_name>
# Recommended: ollama/qwen3-embedding:0.6b (self-hosted, no subscription required)
TDC_AI_EMBEDDING_MODEL=ollama/vuongnguyen2212/CodeRankEmbed:latest

# Maximum tokens per chunk (default: 1000)
TDC_AI_MAX_CHUNK_SIZE=1000

# Token overlap between chunks (default: 100)
TDC_AI_CHUNK_OVERLAP=100

# Minimum abstract word count (default: 150)
TDC_AI_ABSTRACT_MIN_WORDS=150

# Maximum abstract word count (default: 250)
TDC_AI_ABSTRACT_MAX_WORDS=250

# Number of parallel workers for AI processing (default: 4)
TDC_AI_PARALLELISM=4

# Convert office documents to PDF during workspace add-members (default: false)
# Set to "true", "1", or "yes" to enable
# TDC_AI_CONVERT_PDF=false

# Extract markdown from PDFs during workspace add-members (default: false)
# When enabled, implies TDC_AI_CONVERT_PDF=true
# TDC_AI_CONVERT_MD=false

# Enable VLM for figure descriptions and formula enrichment (default: false)
# Set to "true", "1", or "yes" to enable
# TDC_AI_VLM=false

# Graph query level: simple|medium|advanced (default: simple)
# simple: Return count and list without synthesis
# medium: Parse query keywords, filter nodes, generate simple text summary
# advanced: Use LLM to synthesize answer from graph + embeddings (GraphRAG)
# TDC_GRAPH_QUERY_LEVEL=simple

# Enable shared embedding storage across workspaces (default: true)
# TDC_LIGHTRAG_SHARED_STORAGE=true

# ============================================================================
# ADDITIONAL SERVICES
# ============================================================================

# Hugging Face API Token
# Recommended for authentication with Hugging Face Hub to avoid rate limits
# when downloading models and datasets.
# Sign up at: https://huggingface.co/
# HF_TOKEN=

# ============================================================================
# NOTE
# ============================================================================
# For full configuration documentation, see: docs/config.md
# For migration from .env to config files, see: docs/config.md#migration-from-env

docs/config.md

0 → 100644
+261 −0
# Configuration Guide

This guide covers the configuration system for 3gpp-crawler, including config file discovery, validation, and migration from environment variables.

## Overview

The 3gpp-crawler uses a composable configuration system with two complementary parts:

1. **TDocCrawlerConfig** (pydantic-settings) — Type-safe configuration from files/env vars
2. **CacheManager** (runtime paths) — File system path resolution

Configuration can be provided via:
- Config files (TOML, YAML, JSON)
- Environment variables
- CLI arguments

### Config File Discovery

Config files are discovered in this order (later files override earlier):

1. `~/.config/3gpp-crawler/config.toml` (global)
2. `3gpp-crawler.toml`, `.3gpp-crawler.toml`, `.3gpp-crawler/config.toml` (project)
3. `.config/3gpp-crawler/conf.d/*.toml` (config directory, alphabetically)

**Precedence:** CLI args > Config file > Environment variables > Defaults
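
The precedence above can be sketched as a layered dictionary merge (the key names and function below are illustrative, not the crawler's actual internals):

```python
# Layered merge implementing: CLI args > config file > env vars > defaults.
# Later layers win; None means "not set at this layer".
def resolve(defaults: dict, env: dict, config_file: dict, cli: dict) -> dict:
    merged = dict(defaults)
    for layer in (env, config_file, cli):  # applied lowest to highest priority
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

resolve(
    defaults={"timeout": 30, "workers": 4},
    env={"timeout": 60},
    config_file={"timeout": 45},
    cli={"workers": 8},
)
# {"timeout": 45, "workers": 8}: the config file beat the env var,
# and the CLI flag beat the default.
```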

## Configuration Options

### Path Settings

*File system paths for cache, database, checkout, and AI storage*

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `cache_dir` | Path | ~/.3gpp-crawler | Root cache directory for storing downloaded files and metadata |
| `db_filename` | str | "3gpp_crawler.db" | SQLite database filename for storing crawl metadata |
| `checkout_dirname` | str | "checkout" | Subdirectory name for checked-out documents |
| `ai_cache_dirname` | str | "lightrag" | Subdirectory name for AI-related cache (embeddings, graphs) |
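
These four settings compose into concrete locations roughly like this (a sketch of the layout only, not the actual `CacheManager` implementation):

```python
from pathlib import Path

# Defaults from the table above: everything lives under cache_dir.
cache_dir = Path("~/.3gpp-crawler").expanduser()   # path.cache_dir
db_path = cache_dir / "3gpp_crawler.db"            # path.db_filename
checkout_dir = cache_dir / "checkout"              # path.checkout_dirname
ai_cache_dir = cache_dir / "lightrag"              # path.ai_cache_dirname
```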

### HTTP Settings

*HTTP client behavior, caching, timeouts, and retries*

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `cache_ttl` | int | 7200 | Time-to-live for HTTP cache entries in seconds |
| `cache_enabled` | bool | true | Enable HTTP response caching |
| `cache_refresh_on_access` | bool | true | Refresh cache TTL on each access |
| `verify_ssl` | bool | true | Verify SSL certificates for HTTPS requests |
| `max_retries` | int | 3 | Maximum number of retry attempts for failed requests |
| `timeout` | int | 30 | HTTP request timeout in seconds |

### Credentials Settings

*ETSI Online (EOL) portal authentication credentials*

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `username` | str | (none) | Username for ETSI Online (EOL) portal authentication |
| `password` | str | (none) | Password for ETSI Online (EOL) portal authentication |
| `prompt` | str | (none) | Custom prompt message for interactive credential entry |

### Crawl Settings

*Crawling filters, limits, and worker configuration*

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `working_group` | str | (none) | Filter by working group (e.g., S4, RAN1, CT3) |
| `sub_group` | str | (none) | Filter by sub-working group |
| `date_start` | str | (none) | Start date filter (YYYY-MM-DD, YYYY-MM, or YYYY format) |
| `date_end` | str | (none) | End date filter (YYYY-MM-DD, YYYY-MM, or YYYY format) |
| `source_like` | str | (none) | SQL LIKE pattern to match document source |
| `agenda_like` | str | (none) | SQL LIKE pattern to match agenda item |
| `title_like` | str | (none) | SQL LIKE pattern to match document title |
| `limit` | int | 1000 | Maximum number of documents to crawl |
| `workers` | int | 4 | Number of concurrent workers for crawling |

## Config Validate Command

Validate your configuration files to catch errors before running:

```bash
# Validate discovered config
tdoc-crawler config validate

# Validate specific file
tdoc-crawler config validate --file custom.toml

# Treat warnings as errors
tdoc-crawler config validate --strict
```

**Exit codes:**

| Code | Meaning |
|------|---------|
| 0 | All valid |
| 1 | Syntax error (file doesn't parse) |
| 2 | Validation error (bad values) |
| 3 | Warnings only (ok to proceed) |
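
In scripts or CI, these exit codes can drive a simple gate. A sketch (the helper name and `allow_warnings` flag are illustrative, not part of the CLI):

```python
# Map the documented exit codes of `tdoc-crawler config validate`
# to a proceed/stop decision.
def may_proceed(returncode: int, allow_warnings: bool = True) -> bool:
    if returncode == 0:   # all valid
        return True
    if returncode == 3:   # warnings only
        return allow_warnings
    return False          # 1 = syntax error, 2 = validation error
```

A shell wrapper would run the command, capture `$?`, and apply a check like this.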

## Config Docs Command

Show configuration documentation from the command line:

```bash
# Show all sections
tdoc-crawler config docs

# Show specific section
tdoc-crawler config docs --section path
```

**Available sections:** `path`, `http`, `credentials`, `crawl`

## Config Init Command

Generate a default configuration file:

```bash
# Generate TOML config (default)
tdoc-crawler config init --output 3gpp-crawler.toml

# Generate YAML config
tdoc-crawler config init --output 3gpp-crawler.yaml --format yaml

# Force overwrite existing
tdoc-crawler config init --output 3gpp-crawler.toml --force
```

## Config Show Command

Display the current resolved configuration:

```bash
# Show as TOML (default)
tdoc-crawler config show

# Show as YAML
tdoc-crawler config show --format yaml

# Show as JSON
tdoc-crawler config show --format json
```

## Migration from .env

If you have an existing `.env` file, you can migrate to the new config file approach.

### Why Migrate?

- **Single source of truth:** Config files are validated by pydantic; raw env vars are not
- **Type safety:** Config files get validation errors before runtime
- **Documentation:** Config files can include comments explaining each option
- **IDE support:** Autocomplete and validation in editors

### Migration Steps

1. **Generate a default config file:**
   ```bash
   tdoc-crawler config init --output 3gpp-crawler.toml
   ```

2. **Copy values from your .env:**
   | .env Variable | Config File Setting |
   |----------------|---------------------|
   | TDC_CACHE_DIR | path.cache_dir |
   | TDC_EOL_USERNAME | credentials.username |
   | TDC_EOL_PASSWORD | credentials.password |
   | TDC_TIMEOUT | http.timeout |
   | TDC_WORKERS | crawl.workers |
   | TDC_VERIFY_SSL | http.verify_ssl |
   | TDC_MAX_RETRIES | http.max_retries |
   | HTTP_CACHE_TTL | http.cache_ttl |
   | TDC_WORKING_GROUP | crawl.working_group |
   | TDC_LIMIT_TDOCS | crawl.limit |

3. **Validate your config:**
   ```bash
   tdoc-crawler config validate
   ```

4. **Remove .env when ready:**
   ```bash
   rm .env  # after confirming config works
   ```

### Config File Precedence

Config files override env vars (later files override earlier):

1. `~/.config/3gpp-crawler/config.toml` (global)
2. `./3gpp-crawler.toml` (project)
3. `./.3gpp-crawler.toml` (project alternative)
4. `./.config/3gpp-crawler/conf.d/*.toml` (config dir)

**CLI args always win:** `--cache-dir` overrides everything.

## Examples

### Minimal Config

Just specify the cache directory:

```toml
[path]
cache_dir = "~/.3gpp-crawler"
```

### Full Config with Credentials

```toml
[path]
cache_dir = "~/.3gpp-crawler"

[http]
timeout = 60
max_retries = 5

[credentials]
username = "your_username"
password = "your_password"

[crawl]
working_group = "SA4"
limit = 500
```

### Project-Level Override

Create `.3gpp-crawler.toml` in your project directory to override defaults just for that project:

```toml
[crawl]
workers = 8
limit = 100
```

## Environment Variables

For backward compatibility, environment variables are still supported:

| Variable | Description |
|----------|-------------|
| `TDC_CACHE_DIR` | Cache directory path |
| `TDC_EOL_USERNAME` | ETSI Online username |
| `TDC_EOL_PASSWORD` | ETSI Online password |
| `TDC_TIMEOUT` | HTTP timeout in seconds |
| `TDC_MAX_RETRIES` | Max retry attempts |
| `TDC_VERIFY_SSL` | Verify SSL certificates |
| `HTTP_CACHE_TTL` | HTTP cache TTL in seconds |
| `TDC_WORKING_GROUP` | Working group filter |
| `TDC_LIMIT_TDOCS` | Document crawl limit |
| `TDC_WORKERS` | Number of workers |

See `.env.example` for the complete list.

---

*This reference is auto-generated. Run `uv run python scripts/generate_config_docs.py` to update.*
packages/3gpp-ai/docs/config.md

+131 −0
# 3GPP-AI Configuration

This document describes configuration for the 3gpp-ai package, which provides AI-powered document processing including embeddings, knowledge graphs, and LLM-based analysis.

## Shared Configuration

The 3gpp-ai package shares cache paths with the main 3gpp-crawler:

| Path | Description |
|------|-------------|
| `<cache_dir>/lightrag/` | AI cache directory |
| `<cache_dir>/lightrag/<model>/` | Embedding model-specific storage |

These paths are managed by `CacheManager` (from `tdoc_crawler.config`) and are the **single source of truth** for all file paths.

**Cache directory:** Determined by `TDC_CACHE_DIR` or `path.cache_dir` in `3gpp-crawler.toml`

## Configuration Methods

The 3gpp-ai package supports two configuration approaches:

### 1. Environment Variables (Default)

3gpp-ai reads `TDC_AI_*` environment variables directly:

| Variable | Description | Default |
|----------|-------------|---------|
| `TDC_AI_LLM_MODEL` | LLM model in `<provider>/<model>` format | `openrouter/openrouter/free` |
| `TDC_AI_LLM_API_BASE` | Custom LLM API base URL | (none) |
| `TDC_AI_LLM_API_KEY` | LLM API key (overrides provider-specific env vars) | (none) |
| `TDC_AI_EMBEDDING_MODEL` | Embedding model ID | `sentence-transformers/all-MiniLM-L6-v2` |
| `TDC_AI_MAX_CHUNK_SIZE` | Max tokens per chunk | `1000` |
| `TDC_AI_CHUNK_OVERLAP` | Token overlap between chunks | `100` |
| `TDC_AI_ABSTRACT_MIN_WORDS` | Minimum abstract word count | `150` |
| `TDC_AI_ABSTRACT_MAX_WORDS` | Maximum abstract word count | `250` |
| `TDC_AI_PARALLELISM` | Parallel workers for processing | `4` |
| `TDC_AI_CONVERT_PDF` | Convert Office docs to PDF | `false` |
| `TDC_AI_CONVERT_MD` | Extract markdown from PDFs | `false` |
| `TDC_AI_VLM` | Enable vision for figure descriptions | `false` |
| `TDC_GRAPH_QUERY_LEVEL` | Graph query level | `simple` |
| `TDC_LIGHTRAG_SHARED_STORAGE` | Shared embedding storage | `true` |
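
A consumer reading these variables might apply the documented defaults like this (the loader below is an illustrative sketch, not the package's actual API):

```python
import os

_TRUTHY = {"true", "1", "yes", "on"}

def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean env var; unset falls back to the documented default."""
    raw = os.environ.get(name)
    return default if raw is None else raw.strip().lower() in _TRUTHY

def load_ai_settings() -> dict:
    # Defaults mirror the table above.
    return {
        "llm_model": os.environ.get("TDC_AI_LLM_MODEL", "openrouter/openrouter/free"),
        "max_chunk_size": int(os.environ.get("TDC_AI_MAX_CHUNK_SIZE", "1000")),
        "chunk_overlap": int(os.environ.get("TDC_AI_CHUNK_OVERLAP", "100")),
        "convert_pdf": _env_bool("TDC_AI_CONVERT_PDF", False),
        "shared_storage": _env_bool("TDC_LIGHTRAG_SHARED_STORAGE", True),
    }
```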

### 2. Config File Approach

You can use `3gpp-crawler.toml` as base config and `3gpp-ai.toml` for AI-specific overrides:

**3gpp-crawler.toml (base):**
```toml
[path]
cache_dir = "~/.3gpp-crawler"

[http]
timeout = 30
```

**3gpp-ai.toml (override):**
```toml
[ai]
llm_model = "openrouter/anthropic/claude-3-sonnet"
embedding_model = "ollama/nomic-embed-text"
```

## Path Configuration

All paths use `CacheManager` from `tdoc_crawler.config`:

```python
from tdoc_crawler.config import resolve_cache_manager

manager = resolve_cache_manager()
manager.ai_cache_dir       # ~/.3gpp-crawler/lightrag/
manager.ai_embed_dir("qwen3-embedding:0.6b")  # ~/.3gpp-crawler/lightrag/qwen3-embedding:0.6b/
```

**NEVER hardcode paths** like `~/.3gpp-crawler` - always use `CacheManager`.

## Model Formats

### LLM Models

Format: `<provider>/<model_name>`

Examples:
- `openrouter/openrouter/free` - Free tier
- `openrouter/anthropic/claude-3-sonnet` - Anthropic via OpenRouter
- `ollama/llama3` - Local Ollama

### Embedding Models

Format: `<provider>/<model_name>`

Examples:
- `sentence-transformers/all-MiniLM-L6-v2` - Default
- `ollama/nomic-embed-text` - Local Ollama
- `ollama/qwen3-embedding:0.6b` - Qwen embedding

## Processing Options

### Document Conversion

| Option | Description |
|--------|-------------|
| `TDC_AI_CONVERT_PDF` | Convert Office docs (Word, Excel, PowerPoint) to PDF |
| `TDC_AI_CONVERT_MD` | Extract markdown from PDFs using Docling |
| `TDC_AI_VLM` | Use vision model for figure descriptions |

### Chunking

| Option | Description | Default |
|--------|-------------|---------|
| `TDC_AI_MAX_CHUNK_SIZE` | Maximum tokens per chunk | 1000 |
| `TDC_AI_CHUNK_OVERLAP` | Overlap between chunks | 100 |
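
With the defaults above, chunking behaves like a sliding window: each chunk holds up to 1000 tokens and repeats the last 100 tokens of its predecessor. A sketch of that arithmetic (the actual chunker in the package may differ in detail):

```python
def chunk_spans(n_tokens: int, max_chunk_size: int = 1000, chunk_overlap: int = 100):
    """Yield (start, end) token spans covering n_tokens with overlapping windows."""
    step = max_chunk_size - chunk_overlap  # 900 fresh tokens per chunk
    start = 0
    while start < n_tokens:
        yield (start, min(start + max_chunk_size, n_tokens))
        if start + max_chunk_size >= n_tokens:
            break
        start += step

list(chunk_spans(2500))
# [(0, 1000), (900, 1900), (1800, 2500)]
```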

### Graph Query Levels

| Level | Behavior |
|-------|----------|
| `simple` | Return count and list without synthesis |
| `medium` | Parse query keywords, filter nodes, generate simple text summary |
| `advanced` | Use LLM to synthesize answer from graph + embeddings (GraphRAG) |

## Decoupled Design

The 3gpp-ai package is designed to be **independently installable**:

- It reads `TDC_AI_*` env vars directly (not `TDocCrawlerConfig`)
- It uses `CacheManager` from tdoc_crawler for paths only
- This keeps packages decoupled while sharing infrastructure

For shared settings, use the main `3gpp-crawler.toml` file.
For AI-specific settings, use `TDC_AI_*` env vars or `3gpp-ai.toml`.
+209 −0

File added (preview collapsed: size limit exceeded).

+333 −0

File added (preview collapsed: size limit exceeded).