Commit 4214a8c4 authored by Jan Reimes

feat(04-01): Implement config validate/docs commands and auto-gen docs

- Add config validate command with exit codes (0=valid, 1=syntax, 2=error, 3=warnings)
- Add config docs command showing all config sections with field info
- Add --file and --strict options to validate command
- Integrate config_app into tdoc_app via add_typer
- Create scripts/generate_config_docs.py for pydantic introspection
- Create docs/config.md with full config reference and migration guide
- Update .env.example with all TDC_* variables organized by section
parent e6b53c99
.env.example

+106 −77
# Environment Variables for TDoc Crawler
# =======================================
# This file documents all supported environment variables.
# For configuration file approach, see docs/config.md
#
# Copy to .env and replace values:
#   cp .env.example .env

# ============================================================================
# PATH CONFIGURATION
# ============================================================================

# Cache directory for storing downloaded metadata and files (default: ~/.3gpp-crawler)
# TDC_CACHE_DIR=~/.3gpp-crawler

# SQLite database filename (default: 3gpp_crawler.db)
# TDC_DB_FILENAME=3gpp_crawler.db

# Checkout directory name (default: checkout)
# TDC_CHECKOUT_DIRNAME=checkout

# AI cache directory name (default: lightrag)
# TDC_AI_CACHE_DIRNAME=lightrag

# ============================================================================
# ETSI ONLINE (EOL) CREDENTIALS
# ============================================================================
# Required for accessing protected 3GPP portal resources
# Sign up at: https://portal.etsi.org/

# Your ETSI Online username
TDC_EOL_USERNAME=

# Your ETSI Online password
TDC_EOL_PASSWORD=

# Custom prompt message for interactive credential entry
# TDC_EOL_PROMPT=

# ============================================================================
# HTTP SETTINGS
# ============================================================================

# HTTP request timeout in seconds (default: 30)
TDC_TIMEOUT=30

# Verify SSL certificates for HTTPS requests (default: true)
# TDC_VERIFY_SSL=true

# Maximum number of retry attempts for failed requests (default: 3)
TDC_MAX_RETRIES=3

# Enable HTTP response caching (default: true)
# HTTP_CACHE_ENABLED=true

# Time-to-live for cached HTTP responses in seconds (default: 7200 = 2 hours)
# HTTP_CACHE_TTL=7200

# Refresh TTL when a cached response is accessed (default: true)
# HTTP_CACHE_REFRESH_ON_ACCESS=true

# ============================================================================
# CRAWL FILTER CONFIGURATION
# ============================================================================

# Filter by working group (e.g., SA2, RAN1, CT3)
# TDC_WORKING_GROUP=SA4

# Filter by sub-working group (e.g., RAN1, RAN2)
# TDC_SUB_GROUP=

# Start date filter (YYYY, YYYY-MM, or YYYY-MM-DD format)
# TDC_START_DATE=2024-01-01

# End date filter (YYYY, YYYY-MM, or YYYY-MM-DD format)
# TDC_END_DATE=2024-12-31

# SQL LIKE pattern to match document source
# TDC_SOURCE_LIKE=

# SQL LIKE pattern to match agenda item
# TDC_AGENDA_LIKE=

# SQL LIKE pattern to match document title
# TDC_TITLE_LIKE=

# Maximum number of documents to crawl (default: 1000)
# TDC_LIMIT_TDOCS=1000

# Number of parallel subinterpreter workers (default: 4)
TDC_WORKERS=4

# ============================================================================
# AI CONFIGURATION (3GPP-AI)
# ============================================================================
# These settings are used by the 3gpp-ai package for document embeddings,
# knowledge graphs, and LLM-based processing.
# See: packages/3gpp-ai/docs/config.md

# LLM model in format <provider>/<model_name>
# Recommended: openrouter/openrouter/free (free tier, no subscription required)
TDC_AI_LLM_MODEL=openrouter/openrouter/free

# Optional custom base URL for LLM provider/proxy
# TDC_AI_LLM_API_BASE=

# Optional API key for LLM provider (overrides default provider-specific env vars)
# TDC_AI_LLM_API_KEY=

# Embedding model in format <provider>/<model_name>
# Recommended: ollama/qwen3-embedding:0.6b (self-hosted, no subscription required)
TDC_AI_EMBEDDING_MODEL=ollama/vuongnguyen2212/CodeRankEmbed:latest

# Maximum tokens per chunk (default: 1000)
TDC_AI_MAX_CHUNK_SIZE=1000

# Token overlap between chunks (default: 100)
TDC_AI_CHUNK_OVERLAP=100

# Minimum abstract word count (default: 150)
TDC_AI_ABSTRACT_MIN_WORDS=150

# Maximum abstract word count (default: 250)
TDC_AI_ABSTRACT_MAX_WORDS=250

# Number of parallel workers for AI processing (default: 4)
TDC_AI_PARALLELISM=4

# Convert office documents to PDF during workspace add-members (default: false)
# Set to "true", "1", or "yes" to enable
# TDC_AI_CONVERT_PDF=false

# Extract markdown from PDFs during workspace add-members (default: false)
# When enabled, implies TDC_AI_CONVERT_PDF=true
# TDC_AI_CONVERT_MD=false

# Enable VLM for figure descriptions and formula enrichment (default: false)
# Set to "true", "1", or "yes" to enable
# TDC_AI_VLM=false

# Graph query level: simple|medium|advanced (default: simple)
# simple: Return count and list without synthesis
# medium: Parse query keywords, filter nodes, generate simple text summary
# advanced: Use LLM to synthesize answer from graph + embeddings (GraphRAG)
# TDC_GRAPH_QUERY_LEVEL=simple

# Enable shared embedding storage across workspaces (default: true)
# TDC_LIGHTRAG_SHARED_STORAGE=true

# ============================================================================
# ADDITIONAL SERVICES
# ============================================================================

# Hugging Face API Token
# Recommended for authentication with Hugging Face Hub to avoid rate limits
# when downloading models and datasets.
# Sign up at: https://huggingface.co/
# HF_TOKEN=

# ============================================================================
# NOTE
# ============================================================================
# For full configuration documentation, see: docs/config.md
# For migration from .env to config files, see: docs/config.md#migration-from-env

docs/config.md

0 → 100644
+261 −0
# Configuration Guide

This guide covers the configuration system for 3gpp-crawler, including config file discovery, validation, and migration from environment variables.

## Overview

The 3gpp-crawler uses a composable configuration system with two complementary parts:

1. **TDocCrawlerConfig** (pydantic-settings) — Type-safe configuration from files/env vars
2. **CacheManager** (runtime paths) — File system path resolution

Configuration can be provided via:
- Config files (TOML, YAML, JSON)
- Environment variables
- CLI arguments

### Config File Discovery

Config files are discovered in this order (later files override earlier):

1. `~/.config/3gpp-crawler/config.toml` (global)
2. `3gpp-crawler.toml`, `.3gpp-crawler.toml`, `.3gpp-crawler/config.toml` (project)
3. `.config/3gpp-crawler/conf.d/*.toml` (config directory, alphabetically)

**Precedence:** CLI args > Config file > Environment variables > Defaults
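
The precedence above can be sketched as a layered dictionary merge (the key names and function below are illustrative, not the crawler's actual internals):

```python
# Layered merge implementing: CLI args > config file > env vars > defaults.
# Later layers win; None means "not set at this layer".
def resolve(defaults: dict, env: dict, config_file: dict, cli: dict) -> dict:
    merged = dict(defaults)
    for layer in (env, config_file, cli):  # applied lowest to highest priority
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

resolve(
    defaults={"timeout": 30, "workers": 4},
    env={"timeout": 60},
    config_file={"timeout": 45},
    cli={"workers": 8},
)
# {"timeout": 45, "workers": 8}: the config file beat the env var,
# and the CLI flag beat the default.
```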

## Configuration Options

### Path Settings

*File system paths for cache, database, checkout, and AI storage*

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `cache_dir` | Path | ~/.3gpp-crawler | Root cache directory for storing downloaded files and metadata |
| `db_filename` | str | "3gpp_crawler.db" | SQLite database filename for storing crawl metadata |
| `checkout_dirname` | str | "checkout" | Subdirectory name for checked-out documents |
| `ai_cache_dirname` | str | "lightrag" | Subdirectory name for AI-related cache (embeddings, graphs) |
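
These four settings compose into concrete locations roughly like this (a sketch of the layout only, not the actual `CacheManager` implementation):

```python
from pathlib import Path

# Defaults from the table above: everything lives under cache_dir.
cache_dir = Path("~/.3gpp-crawler").expanduser()   # path.cache_dir
db_path = cache_dir / "3gpp_crawler.db"            # path.db_filename
checkout_dir = cache_dir / "checkout"              # path.checkout_dirname
ai_cache_dir = cache_dir / "lightrag"              # path.ai_cache_dirname
```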

### HTTP Settings

*HTTP client behavior, caching, timeouts, and retries*

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `cache_ttl` | int | 7200 | Time-to-live for HTTP cache entries in seconds |
| `cache_enabled` | bool | true | Enable HTTP response caching |
| `cache_refresh_on_access` | bool | true | Refresh cache TTL on each access |
| `verify_ssl` | bool | true | Verify SSL certificates for HTTPS requests |
| `max_retries` | int | 3 | Maximum number of retry attempts for failed requests |
| `timeout` | int | 30 | HTTP request timeout in seconds |

### Credentials Settings

*ETSI Online (EOL) portal authentication credentials*

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `username` | str | (none) | Username for ETSI Online (EOL) portal authentication |
| `password` | str | (none) | Password for ETSI Online (EOL) portal authentication |
| `prompt` | str | (none) | Custom prompt message for interactive credential entry |

### Crawl Settings

*Crawling filters, limits, and worker configuration*

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `working_group` | str | (none) | Filter by working group (e.g., S4, RAN1, CT3) |
| `sub_group` | str | (none) | Filter by sub-working group |
| `date_start` | str | (none) | Start date filter (YYYY-MM-DD, YYYY-MM, or YYYY format) |
| `date_end` | str | (none) | End date filter (YYYY-MM-DD, YYYY-MM, or YYYY format) |
| `source_like` | str | (none) | SQL LIKE pattern to match document source |
| `agenda_like` | str | (none) | SQL LIKE pattern to match agenda item |
| `title_like` | str | (none) | SQL LIKE pattern to match document title |
| `limit` | int | 1000 | Maximum number of documents to crawl |
| `workers` | int | 4 | Number of concurrent workers for crawling |

## Config Validate Command

Validate your configuration files to catch errors before running:

```bash
# Validate discovered config
tdoc-crawler config validate

# Validate specific file
tdoc-crawler config validate --file custom.toml

# Treat warnings as errors
tdoc-crawler config validate --strict
```

**Exit codes:**

| Code | Meaning |
|------|---------|
| 0 | All valid |
| 1 | Syntax error (file doesn't parse) |
| 2 | Validation error (bad values) |
| 3 | Warnings only (ok to proceed) |
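
In scripts or CI, these exit codes can drive a simple gate. A sketch (the helper name and `allow_warnings` flag are illustrative, not part of the CLI):

```python
# Map the documented exit codes of `tdoc-crawler config validate`
# to a proceed/stop decision.
def may_proceed(returncode: int, allow_warnings: bool = True) -> bool:
    if returncode == 0:   # all valid
        return True
    if returncode == 3:   # warnings only
        return allow_warnings
    return False          # 1 = syntax error, 2 = validation error
```

A shell wrapper would run the command, capture `$?`, and apply a check like this.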

## Config Docs Command

Show configuration documentation from the command line:

```bash
# Show all sections
tdoc-crawler config docs

# Show specific section
tdoc-crawler config docs --section path
```

**Available sections:** `path`, `http`, `credentials`, `crawl`

## Config Init Command

Generate a default configuration file:

```bash
# Generate TOML config (default)
tdoc-crawler config init --output 3gpp-crawler.toml

# Generate YAML config
tdoc-crawler config init --output 3gpp-crawler.yaml --format yaml

# Force overwrite existing
tdoc-crawler config init --output 3gpp-crawler.toml --force
```

## Config Show Command

Display the current resolved configuration:

```bash
# Show as TOML (default)
tdoc-crawler config show

# Show as YAML
tdoc-crawler config show --format yaml

# Show as JSON
tdoc-crawler config show --format json
```

## Migration from .env

If you have an existing `.env` file, you can migrate to the new config file approach.

### Why Migrate?

- **Single source of truth:** Config files are validated by pydantic; raw env vars are not
- **Type safety:** Config files get validation errors before runtime
- **Documentation:** Config files can include comments explaining each option
- **IDE support:** Autocomplete and validation in editors

### Migration Steps

1. **Generate a default config file:**
   ```bash
   tdoc-crawler config init --output 3gpp-crawler.toml
   ```

2. **Copy values from your .env:**
   | .env Variable | Config File Setting |
   |----------------|---------------------|
   | TDC_CACHE_DIR | path.cache_dir |
   | TDC_EOL_USERNAME | credentials.username |
   | TDC_EOL_PASSWORD | credentials.password |
   | TDC_TIMEOUT | http.timeout |
   | TDC_WORKERS | crawl.workers |
   | TDC_VERIFY_SSL | http.verify_ssl |
   | TDC_MAX_RETRIES | http.max_retries |
   | HTTP_CACHE_TTL | http.cache_ttl |
   | TDC_WORKING_GROUP | crawl.working_group |
   | TDC_LIMIT_TDOCS | crawl.limit |

3. **Validate your config:**
   ```bash
   tdoc-crawler config validate
   ```

4. **Remove .env when ready:**
   ```bash
   rm .env  # after confirming config works
   ```

### Config File Precedence

Config files override env vars (later files override earlier):

1. `~/.config/3gpp-crawler/config.toml` (global)
2. `./3gpp-crawler.toml` (project)
3. `./.3gpp-crawler.toml` (project alternative)
4. `./.config/3gpp-crawler/conf.d/*.toml` (config dir)

**CLI args always win:** `--cache-dir` overrides everything.

## Examples

### Minimal Config

Just specify the cache directory:

```toml
[path]
cache_dir = "~/.3gpp-crawler"
```

### Full Config with Credentials

```toml
[path]
cache_dir = "~/.3gpp-crawler"

[http]
timeout = 60
max_retries = 5

[credentials]
username = "your_username"
password = "your_password"

[crawl]
working_group = "SA4"
limit = 500
```

### Project-Level Override

Create `.3gpp-crawler.toml` in your project directory to override defaults just for that project:

```toml
[crawl]
workers = 8
limit = 100
```

## Environment Variables

For backward compatibility, environment variables are still supported:

| Variable | Description |
|----------|-------------|
| `TDC_CACHE_DIR` | Cache directory path |
| `TDC_EOL_USERNAME` | ETSI Online username |
| `TDC_EOL_PASSWORD` | ETSI Online password |
| `TDC_TIMEOUT` | HTTP timeout in seconds |
| `TDC_MAX_RETRIES` | Max retry attempts |
| `TDC_VERIFY_SSL` | Verify SSL certificates |
| `HTTP_CACHE_TTL` | HTTP cache TTL in seconds |
| `TDC_WORKING_GROUP` | Working group filter |
| `TDC_LIMIT_TDOCS` | Document crawl limit |
| `TDC_WORKERS` | Number of workers |

See `.env.example` for the complete list.

---

*This reference is auto-generated. Run `uv run python scripts/generate_config_docs.py` to update.*
packages/3gpp-ai/docs/config.md

+131 −0
# 3GPP-AI Configuration

This document describes configuration for the 3gpp-ai package, which provides AI-powered document processing including embeddings, knowledge graphs, and LLM-based analysis.

## Shared Configuration

The 3gpp-ai package shares cache paths with the main 3gpp-crawler:

| Path | Description |
|------|-------------|
| `<cache_dir>/lightrag/` | AI cache directory |
| `<cache_dir>/lightrag/<model>/` | Embedding model-specific storage |

These paths are managed by `CacheManager` (from `tdoc_crawler.config`) and are the **single source of truth** for all file paths.

**Cache directory:** Determined by `TDC_CACHE_DIR` or `path.cache_dir` in `3gpp-crawler.toml`

## Configuration Methods

The 3gpp-ai package supports two configuration approaches:

### 1. Environment Variables (Default)

3gpp-ai reads `TDC_AI_*` environment variables directly:

| Variable | Description | Default |
|----------|-------------|---------|
| `TDC_AI_LLM_MODEL` | LLM model in `<provider>/<model>` format | `openrouter/openrouter/free` |
| `TDC_AI_LLM_API_BASE` | Custom LLM API base URL | (none) |
| `TDC_AI_LLM_API_KEY` | LLM API key (overrides provider-specific env vars) | (none) |
| `TDC_AI_EMBEDDING_MODEL` | Embedding model ID | `sentence-transformers/all-MiniLM-L6-v2` |
| `TDC_AI_MAX_CHUNK_SIZE` | Max tokens per chunk | `1000` |
| `TDC_AI_CHUNK_OVERLAP` | Token overlap between chunks | `100` |
| `TDC_AI_ABSTRACT_MIN_WORDS` | Minimum abstract word count | `150` |
| `TDC_AI_ABSTRACT_MAX_WORDS` | Maximum abstract word count | `250` |
| `TDC_AI_PARALLELISM` | Parallel workers for processing | `4` |
| `TDC_AI_CONVERT_PDF` | Convert Office docs to PDF | `false` |
| `TDC_AI_CONVERT_MD` | Extract markdown from PDFs | `false` |
| `TDC_AI_VLM` | Enable vision for figure descriptions | `false` |
| `TDC_GRAPH_QUERY_LEVEL` | Graph query level | `simple` |
| `TDC_LIGHTRAG_SHARED_STORAGE` | Shared embedding storage | `true` |
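
A consumer reading these variables might apply the documented defaults like this (the loader below is an illustrative sketch, not the package's actual API):

```python
import os

_TRUTHY = {"true", "1", "yes", "on"}

def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean env var; unset falls back to the documented default."""
    raw = os.environ.get(name)
    return default if raw is None else raw.strip().lower() in _TRUTHY

def load_ai_settings() -> dict:
    # Defaults mirror the table above.
    return {
        "llm_model": os.environ.get("TDC_AI_LLM_MODEL", "openrouter/openrouter/free"),
        "max_chunk_size": int(os.environ.get("TDC_AI_MAX_CHUNK_SIZE", "1000")),
        "chunk_overlap": int(os.environ.get("TDC_AI_CHUNK_OVERLAP", "100")),
        "convert_pdf": _env_bool("TDC_AI_CONVERT_PDF", False),
        "shared_storage": _env_bool("TDC_LIGHTRAG_SHARED_STORAGE", True),
    }
```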

### 2. Config File Approach

You can use `3gpp-crawler.toml` as base config and `3gpp-ai.toml` for AI-specific overrides:

**3gpp-crawler.toml (base):**
```toml
[path]
cache_dir = "~/.3gpp-crawler"

[http]
timeout = 30
```

**3gpp-ai.toml (override):**
```toml
[ai]
llm_model = "openrouter/anthropic/claude-3-sonnet"
embedding_model = "ollama/nomic-embed-text"
```

## Path Configuration

All paths use `CacheManager` from `tdoc_crawler.config`:

```python
from tdoc_crawler.config import resolve_cache_manager

manager = resolve_cache_manager()
manager.ai_cache_dir       # ~/.3gpp-crawler/lightrag/
manager.ai_embed_dir("qwen3-embedding:0.6b")  # ~/.3gpp-crawler/lightrag/qwen3-embedding:0.6b/
```

**NEVER hardcode paths** like `~/.3gpp-crawler` - always use `CacheManager`.

## Model Formats

### LLM Models

Format: `<provider>/<model_name>`

Examples:
- `openrouter/openrouter/free` - Free tier
- `openrouter/anthropic/claude-3-sonnet` - Anthropic via OpenRouter
- `ollama/llama3` - Local Ollama

### Embedding Models

Format: `<provider>/<model_name>`

Examples:
- `sentence-transformers/all-MiniLM-L6-v2` - Default
- `ollama/nomic-embed-text` - Local Ollama
- `ollama/qwen3-embedding:0.6b` - Qwen embedding

## Processing Options

### Document Conversion

| Option | Description |
|--------|-------------|
| `TDC_AI_CONVERT_PDF` | Convert Office docs (Word, Excel, PowerPoint) to PDF |
| `TDC_AI_CONVERT_MD` | Extract markdown from PDFs using Docling |
| `TDC_AI_VLM` | Use vision model for figure descriptions |

### Chunking

| Option | Description | Default |
|--------|-------------|---------|
| `TDC_AI_MAX_CHUNK_SIZE` | Maximum tokens per chunk | 1000 |
| `TDC_AI_CHUNK_OVERLAP` | Overlap between chunks | 100 |
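
With the defaults above, chunking behaves like a sliding window: each chunk holds up to 1000 tokens and repeats the last 100 tokens of its predecessor. A sketch of that arithmetic (the actual chunker in the package may differ in detail):

```python
def chunk_spans(n_tokens: int, max_chunk_size: int = 1000, chunk_overlap: int = 100):
    """Yield (start, end) token spans covering n_tokens with overlapping windows."""
    step = max_chunk_size - chunk_overlap  # 900 fresh tokens per chunk
    start = 0
    while start < n_tokens:
        yield (start, min(start + max_chunk_size, n_tokens))
        if start + max_chunk_size >= n_tokens:
            break
        start += step

list(chunk_spans(2500))
# [(0, 1000), (900, 1900), (1800, 2500)]
```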

### Graph Query Levels

| Level | Behavior |
|-------|----------|
| `simple` | Return count and list without synthesis |
| `medium` | Parse query keywords, filter nodes, generate simple text summary |
| `advanced` | Use LLM to synthesize answer from graph + embeddings (GraphRAG) |

## Decoupled Design

The 3gpp-ai package is designed to be **independently installable**:

- It reads `TDC_AI_*` env vars directly (not `TDocCrawlerConfig`)
- It uses `CacheManager` from tdoc_crawler for paths only
- This keeps packages decoupled while sharing infrastructure

For shared settings, use the main `3gpp-crawler.toml` file.
For AI-specific settings, use `TDC_AI_*` env vars or `3gpp-ai.toml`.
+209 −0

File added (preview collapsed: size limit exceeded).

+333 −0

File added (preview collapsed: size limit exceeded).