Commit 5cae87f3 authored by Jan Reimes's avatar Jan Reimes

chore: Integrate unified extraction pipeline and cleanup

- cli.py: Load .env variables, update imports, use DocumentProcessor
- __init__.py: Export DocumentProcessor
- demo.bat: Fix workspace delete flag (--delete-artifacts)
- ruff.toml: Add configuration for 3gpp-ai package
- Test files: Minor updates for compatibility with refactored code
- Various operation files: Remove extra blank lines (formatting)

Integration changes to support the unified extraction pipeline refactoring.
parent 3333538a
# Architecture

**Analysis Date:** 2026-03-26

## Pattern Overview

**Overall:** Domain-Oriented Modular Architecture with Source Abstraction

**Key Characteristics:**
- Domain-oriented packages (`tdocs/`, `meetings/`, `specs/`) with consistent internal structure
- Source abstraction pattern for data fetching (Protocol-based)
- Layered database with inheritance (`DocDatabase` → `MeetingDatabase` → `TDocDatabase`)
- Pydantic models for all data structures (validation and serialization)
- Centralized configuration via the `CacheManager` singleton
- CLI as a thin orchestration layer over the core library

## Layers

**CLI Layer:**
- Purpose: Command definitions, argument parsing, user interaction (Typer framework)
- Location: `src/tdoc_crawler/cli/`
- Contains: Typer commands, Rich console output, argument types
- Depends on: Domain and database layers
- Used by: End users via `tdoc-crawler` and `spec-crawler` commands

**Domain Layer:**
- Purpose: Business logic for 3GPP entities (TDocs, Meetings, Specs)
- Location: `src/tdoc_crawler/{tdocs,meetings,specs}/`
- Contains: Operations, sources, models, utilities per domain
- Depends on: Database layer, config, http_client, parsers
- Used by: CLI layer

**Database Layer:**
- Purpose: Persistent storage and query operations (SQLite via pydantic-sqlite)
- Location: `src/tdoc_crawler/database/`
- Contains: `TDocDatabase`, `MeetingDatabase`, `SpecDatabase`, `WorkingGroupDatabase` with upsert/query operations
- Depends on: Models layer, `config/` (for path resolution), pydantic-sqlite
- Used by: All domain operations

**Models Layer:**
- Purpose: Data structures and validation
- Location: `src/tdoc_crawler/models/`, `src/tdoc_crawler/{domain}/models.py`
- Contains: Pydantic models for metadata, configs, query parameters
- Depends on: Pydantic
- Used by: All layers

**Infrastructure Layer:**
- Purpose: Cross-cutting concerns (HTTP caching, logging, configuration)
- Location: `src/tdoc_crawler/{config,http_client,logging,clients,utils}/`
- Contains: `CacheManager`, `resolve_cache_manager()`, `create_cached_session()`, `PoolConfig`, logging setup
- Depends on: hishel, httpx, requests; environment variables (`TDC_CACHE_DIR`, `TDC_AI_STORE_PATH`)
- Used by: All layers (the `CacheManager` must be registered at startup)

**Parsers Layer:**
- Purpose: Data extraction from HTML, Excel, and portal pages
- Location: `src/tdoc_crawler/parsers/`
- Contains: `MeetingParser`, `PortalParser`, `parse_tdoc_portal_page()`
- Depends on: BeautifulSoup, lxml, python-calamine
- Used by: Domain sources

## Data Flow

**TDoc Crawl Flow:**

1. CLI `crawl-tdocs` command receives arguments
2. Creates `TDocCrawlConfig` from parsed arguments
3. `CacheManager` registered with paths
4. `TDocCrawler.crawl()` orchestrates:
   - Query meetings from database
   - For each meeting, download Excel document list
   - Parse Excel via `DocumentListSource`
   - Upsert TDoc metadata to database
5. Results displayed via Rich console
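
The steps above can be sketched as a single orchestration loop. All names here are illustrative stand-ins, not the actual `TDocCrawler` API:

```python
from dataclasses import dataclass, field


@dataclass
class CrawlResult:
    # Mirrors the processed/inserted/error counters the real crawler reports
    processed: int = 0
    inserted: int = 0
    errors: list[str] = field(default_factory=list)


def crawl_tdocs(meetings, fetch_doc_list, upsert) -> CrawlResult:
    """Orchestrate a crawl: fetch each meeting's document list and upsert it."""
    result = CrawlResult()
    for meeting in meetings:
        try:
            tdocs = fetch_doc_list(meeting)  # e.g. download + parse the Excel list
        except Exception as exc:             # a real crawler would log and continue
            result.errors.append(f"{meeting}: {exc}")
            continue
        result.processed += 1
        result.inserted += upsert(tdocs)     # upsert reports the number of new rows
    return result
```

With stub callables this loop yields a `CrawlResult` whose counters the CLI can render.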

**TDoc Query Flow:**

1. CLI `query-tdocs` creates `TDocQueryConfig`
2. `TDocDatabase.query_tdocs()` filters by:
   - TDoc IDs (exact match)
   - Working groups (via meeting join)
   - Date ranges (retrieved date, meeting date)
   - Glob patterns (source, title, agenda)
3. Returns list of `TDocMetadata`
4. Output formatted via Rich tables or JSON/YAML

**TDoc Fetch Flow (Missing TDocs):**

1. `fetch_missing_tdocs()` receives requested IDs that are not in the DB
2. Creates the appropriate source: `WhatTheSpecSource` (fast, unauthenticated) or `PortalSource` (authenticated, full metadata)
3. Source fetches metadata via HTTP
4. Upserts the results to the database
5. Falls back to WhatTheSpec if the portal fails

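
The Portal → WhatTheSpec fallback can be sketched as follows; the source callables here are stand-ins for the real source classes:

```python
def fetch_with_fallback(tdoc_id, primary, fallback):
    """Try the primary source first; on failure or a miss, use the fallback."""
    try:
        metadata = primary(tdoc_id)
    except Exception:        # e.g. portal unreachable or authentication failure
        metadata = None
    if metadata is None:     # primary returned nothing: consult the fallback
        metadata = fallback(tdoc_id)
    return metadata
```
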
**Spec Crawl Flow:**

1. CLI `crawl-specs` collects spec numbers
2. `SpecDatabase.crawl_specs()` iterates sources
3. Each `SpecSource.fetch()` retrieves metadata
4. Results stored and optionally checked out

**State Management:**
- SQLite database (via pydantic-sqlite) for all persistent state
- File-based HTTP cache (hishel) for response caching
- Workspace state in `workspaces.json` for AI processing
- `CacheManager` as the single source of truth for paths
- No in-memory state between commands

**Spec Checkout Flow:**

1. `SpecDatabase.query()` retrieves spec info
2. `checkout_specs()` initiates download process
3. For each spec:
   - FTP server queried for latest version
   - `SpecDownloads` table checked for existing files
   - Download to `checkout_dir` if needed
4. ZIP extraction and file organization
5. Results returned with checkout paths
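
The "download only if needed" check in step 3 reduces to a file-existence test, sketched here with a hypothetical naming scheme (the real layout of `checkout_dir` may differ):

```python
from pathlib import Path


def needs_download(checkout_dir: Path, spec: str, version: str) -> bool:
    """Skip the FTP download when the versioned ZIP is already checked out."""
    return not (checkout_dir / f"{spec}-{version}.zip").exists()
```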

**AI Processing Flow:**

1. `WorkspaceRegistry` loads/creates workspace
2. `TDocProcessor.process()` iterates through documents
3. For each TDoc:
   - Checkout from database (metadata)
   - Download document if missing
   - Extract text via `kreuzberg`
   - Chunk text
   - Generate embeddings
   - Build knowledge graph (LightRAG)
4. Results stored in workspace storage
5. Query via `TDocRAG.query()` performs semantic search

## Key Abstractions

**CacheManager:**
- Purpose: Central path management for all file operations (single source of truth)
- Examples: `src/tdoc_crawler/config/__init__.py`
- Pattern: Singleton registry with name-based resolution
- Properties: `root`, `db_file`, `http_cache_file`, `checkout_dir`, `ai_cache_dir`
- Usage: `resolve_cache_manager(name)` returns the registered instance

```python
manager = resolve_cache_manager()
manager.db_file           # Database path
manager.http_cache_file   # HTTP cache path
manager.checkout_dir      # Document checkout directory
manager.ai_cache_dir      # AI storage directory
```

**TDocSource Protocol:**
- Purpose: Abstract interface for TDoc metadata sources
- Examples: `src/tdoc_crawler/tdocs/sources/base.py`
- Pattern: Protocol-based abstraction
- Implementations: `DocumentListSource`, `PortalSource`, `WhatTheSpecSource`
- Methods: `fetch_by_id()`, `fetch_by_meeting()`, `source_name`, `requires_authentication`

**SpecSource Protocol:**
- Purpose: Abstract interface for specification metadata sources
- Examples: `src/tdoc_crawler/specs/sources/base.py`
- Pattern: Protocol-based abstraction
- Implementations: `ThreeGppSpecSource`, `WhatTheSpecSpecSource`
- Methods: `name`, `fetch()`

**Database Inheritance:**
- Purpose: Layered database operations with a shared base
- Pattern: Inheritance chain
- Hierarchy: `DocDatabase` → `MeetingDatabase` → `TDocDatabase`
- Shared: Connection management, table operations, crawl logging
- Specialized: Domain-specific queries and upserts

```python
# Each domain database opens the same SQLite file
db = TDocDatabase(manager.db_file)
tdocs = db.query_tdocs(config)
db.upsert_tdocs(metadata)

# Meeting operations
db = MeetingDatabase(manager.db_file)
meetings = db.query_meetings(filters)

# Spec operations
db = SpecDatabase(manager.db_file)
specs = db.query_specs(filters)
```

**Crawler Pattern:**
- Purpose: Orchestrate crawling operations with progress tracking
- Examples: `src/tdoc_crawler/tdocs/operations/crawl.py`, `src/tdoc_crawler/meetings/operations/crawl.py`
- Pattern: Class with `crawl()` method taking config
- Returns: `TDocCrawlResult` with processed/inserted/updated/error counts

**BaseConfigModel:**
- Purpose: Shared configuration base with HTTP cache and cache manager
- Examples: `src/tdoc_crawler/models/base.py`
- Pattern: Pydantic model with common fields
- Subclasses: `TDocCrawlConfig`, `TDocQueryConfig`, `MeetingCrawlConfig`, `MeetingQueryConfig`
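
The shared-base pattern can be sketched as follows, using stdlib `dataclasses` to stay self-contained (the real models are Pydantic `BaseModel` subclasses, and the field names here are illustrative):

```python
from dataclasses import dataclass


@dataclass
class BaseConfig:
    # Fields shared by every operation config
    cache_manager_name: str = "default"
    http_cache_enabled: bool = True


@dataclass
class ExampleCrawlConfig(BaseConfig):
    # Operation-specific fields extend the shared base
    meeting_ids: tuple[str, ...] = ()
    force_refresh: bool = False
```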

## Entry Points

**CLI Main Entry:**
- Location: `src/tdoc_crawler/cli/app.py`
- Triggers: `tdoc-crawler` command
- Responsibilities: Register commands, handle global options, set up the `CacheManager`

**Spec CLI Entry:**
- Location: `src/tdoc_crawler/cli/spec_app.py`
- Triggers: `spec-crawler` command
- Responsibilities: Spec-specific commands (checkout, download)
**Domain Crawler Entry:**
- Location: `src/tdoc_crawler/{domain}/operations/crawl.py`
- Triggers: CLI crawl commands
- Responsibilities: Orchestrate domain-specific crawling

**Library Entry:**
- Location: `src/tdoc_crawler/__init__.py`
- Triggers: `import tdoc_crawler` in external code
- Responsibilities: Export public API components
**Database Entry:**
- Location: `src/tdoc_crawler/database/{domain}.py`
- Triggers: Domain operations
- Responsibilities: CRUD operations, queries, upserts

## Error Handling

**Strategy:** Layered exception handling with domain-specific errors

**Patterns:**
- `DatabaseError` with typed error codes (`connection_not_open`, etc.)
- `PortalParsingError` and `PortalAuthenticationError` for portal failures
- `ConversionError` hierarchy in the convert-lo package
- Validation errors via Pydantic field validators (`ValidationError`)
- Graceful fallbacks (Portal → WhatTheSpec)

```python
# Source pattern - raise clear errors
if metadata is None:
    raise ValueError(f"TDoc {tdoc_id} not found")

# Database pattern - let Pydantic handle validation
try:
    db.upsert_tdocs([metadata])
except ValidationError as exc:
    raise ValueError(f"Invalid metadata: {exc}") from exc
```

## Cross-Cutting Concerns

**Logging:**
- Module: `src/tdoc_crawler/logging/`
- Pattern: `get_logger(__name__)` returns a configured logger
- Levels controlled via the `--verbosity` CLI flag (`set_verbosity()`)
- Default level from the `TDC_VERBOSITY` environment variable
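
A hypothetical sketch of the verbosity mapping (the real module's behavior may differ in details):

```python
import logging

# Map a repeated -v / --verbosity count to logging levels
_LEVELS = {0: logging.WARNING, 1: logging.INFO, 2: logging.DEBUG}


def set_verbosity(count: int) -> None:
    """Translate the CLI verbosity count into a package-level log level."""
    logging.getLogger("tdoc_crawler").setLevel(_LEVELS.get(count, logging.DEBUG))


def get_logger(name: str) -> logging.Logger:
    """Child loggers inherit the package-level verbosity."""
    return logging.getLogger("tdoc_crawler").getChild(name)
```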

**Validation:**
- Pydantic validators in models; all data models inherit from `BaseModel`
- Field validators for normalization (uppercase IDs, enum conversion)
- Model validators for relationship integrity
- Type hints mandatory (`T | None`, not `Optional[T]`)

**Authentication:**
- Portal credentials via environment variables (`EOL_USERNAME`, `EOL_PASSWORD`) or CLI flags
- `resolve_credentials()` for centralized credential access
- `PromptCredentialsOption` for interactive credential input
- Sources declare a `requires_authentication` property

**HTTP Caching:**
- Module: `src/tdoc_crawler/http_client/`
- Pattern: `create_cached_session()` with hishel (file-based SQLite cache)
- Configuration via `HttpCacheConfig` (TTL, refresh on access)
- Must be used for ALL 3gpp.org requests; prevents rate-limiting
- Makes incremental crawls 50-90% faster

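
The cache's core decision, serve from disk versus refetch, reduces to a staleness check. This stdlib sketch shows only the TTL idea; hishel itself implements full HTTP cache semantics internally:

```python
import time
from pathlib import Path


def is_fresh(cached_file: Path, ttl_seconds: float) -> bool:
    """A cached response is usable if it exists and is younger than the TTL."""
    if not cached_file.exists():
        return False
    age = time.time() - cached_file.stat().st_mtime
    return age < ttl_seconds
```
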
## Package Architecture

**3gpp-ai Package:**
- Location: `packages/3gpp-ai/threegpp_ai/`
- Purpose: AI-powered document processing (LightRAG integration)
- Key modules: `lightrag/`, `operations/`
- Uses: CacheManager for paths, environment for LLM config

**convert-lo Package:**
- Location: `packages/convert-lo/convert_lo/`
- Purpose: LibreOffice headless document conversion
- Key class: `Converter` with `convert()` method

**pool_executors Package:**
- Location: `packages/pool_executors/pool_executors/`
- Purpose: Serial/parallel execution utilities
- Key APIs: `SerialPoolExecutor`, `Runner`, `create_executor()`

---

*Architecture analysis: 2026-03-26*