Commit 5cae87f3 authored by Jan Reimes's avatar Jan Reimes

chore: Integrate unified extraction pipeline and cleanup

- cli.py: Load .env variables, update imports, use DocumentProcessor
- __init__.py: Export DocumentProcessor
- demo.bat: Fix workspace delete flag (--delete-artifacts)
- ruff.toml: Add configuration for 3gpp-ai package
- Test files: Minor updates for compatibility with refactored code
- Various operation files: Remove extra blank lines (formatting)

Integration changes to support the unified extraction pipeline refactoring.
parent 3333538a
# Architecture

**Analysis Date:** 2026-03-26

## Pattern Overview

**Overall:** Domain-Oriented Modular Architecture with Source Abstraction

**Key Characteristics:**
- Domain-oriented packages (`tdocs/`, `meetings/`, `specs/`) with consistent internal structure
- Source abstraction pattern for data fetching (Protocol-based)
- Layered database with inheritance (`DocDatabase` → `MeetingDatabase` → `TDocDatabase`)
- Pydantic models for all data structures (validation and serialization)
- Centralized configuration via the `CacheManager` singleton
- CLI as a thin orchestration layer over the core library

## Layers

**CLI Layer:**
- Purpose: Command definitions, argument parsing, user interaction (Typer framework)
- Location: `src/tdoc_crawler/cli/`
- Contains: Typer commands, Rich console output, argument types
- Depends on: Domain and database layers
- Used by: End users via `tdoc-crawler` and `spec-crawler` commands

**Domain Layer:**
- Purpose: Business logic for 3GPP entities (TDocs, Meetings, Specs)
- Location: `src/tdoc_crawler/{tdocs,meetings,specs}/`
- Contains: Operations, sources, models, utilities per domain
- Depends on: Database layer, config, http_client, parsers
- Used by: CLI layer

**Database Layer:**
- Purpose: Persistent storage and query operations (SQLite via pydantic-sqlite)
- Location: `src/tdoc_crawler/database/`
- Contains: `TDocDatabase`, `MeetingDatabase`, `SpecDatabase`, `WorkingGroupDatabase` with upsert/query operations
- Depends on: Models layer, `config/` (for path resolution), pydantic-sqlite
- Used by: All domain operations

**Models Layer:**
- Purpose: Data structures and validation
- Location: `src/tdoc_crawler/models/`, `src/tdoc_crawler/{domain}/models.py`
- Contains: Pydantic models for metadata, configs, query parameters
- Depends on: Pydantic
- Used by: All layers

**Infrastructure Layer:**
- Purpose: Cross-cutting concerns (HTTP caching, logging, configuration)
- Location: `src/tdoc_crawler/{config,http_client,logging,clients,utils}/`
- Contains: `CacheManager`, `resolve_cache_manager()`, `create_cached_session()`, `PoolConfig`, logging setup
- Depends on: hishel, httpx, requests; environment variables (`TDC_CACHE_DIR`, `TDC_AI_STORE_PATH`)
- Used by: All layers (the `CacheManager` must be registered at startup)

**Parsers Layer:**
- Purpose: Data extraction from HTML, Excel, and portal pages
- Location: `src/tdoc_crawler/parsers/`
- Contains: `MeetingParser`, `PortalParser`, `parse_tdoc_portal_page()`
- Depends on: BeautifulSoup, lxml, python-calamine
- Used by: Domain sources

## Data Flow

**TDoc Crawl Flow:**

1. CLI `crawl-tdocs` command receives arguments
2. Creates `TDocCrawlConfig` from parsed arguments
3. `CacheManager` registered with paths
4. `TDocCrawler.crawl()` orchestrates:
   - Query meetings from database
   - For each meeting, download Excel document list
   - Parse Excel via `DocumentListSource`
   - Upsert TDoc metadata to database
5. Results displayed via Rich console
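
The steps above can be sketched as a single orchestration loop. All names here are illustrative stand-ins, not the actual `TDocCrawler` API:

```python
from dataclasses import dataclass, field


@dataclass
class CrawlResult:
    # Mirrors the processed/inserted/error counters the real crawler reports
    processed: int = 0
    inserted: int = 0
    errors: list[str] = field(default_factory=list)


def crawl_tdocs(meetings, fetch_doc_list, upsert) -> CrawlResult:
    """Orchestrate a crawl: fetch each meeting's document list and upsert it."""
    result = CrawlResult()
    for meeting in meetings:
        try:
            tdocs = fetch_doc_list(meeting)  # e.g. download + parse the Excel list
        except Exception as exc:             # a real crawler would log and continue
            result.errors.append(f"{meeting}: {exc}")
            continue
        result.processed += 1
        result.inserted += upsert(tdocs)     # upsert reports the number of new rows
    return result
```

With stub callables this loop yields a `CrawlResult` whose counters the CLI can render.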

**TDoc Query Flow:**

1. CLI `query-tdocs` creates `TDocQueryConfig`
2. `TDocDatabase.query_tdocs()` filters by:
   - TDoc IDs (exact match)
   - Working groups (via meeting join)
   - Date ranges (retrieved date, meeting date)
   - Glob patterns (source, title, agenda)
3. Returns list of `TDocMetadata`
4. Output formatted via Rich tables or JSON/YAML

**TDoc Fetch Flow (Missing TDocs):**

1. `fetch_missing_tdocs()` receives requested IDs that are not in the DB
2. Creates the appropriate source: `WhatTheSpecSource` (fast, unauthenticated) or `PortalSource` (authenticated, full metadata)
3. Source fetches metadata via HTTP
4. Upserts the results to the database
5. Falls back to WhatTheSpec if the portal fails

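
The Portal → WhatTheSpec fallback can be sketched as follows; the source callables here are stand-ins for the real source classes:

```python
def fetch_with_fallback(tdoc_id, primary, fallback):
    """Try the primary source first; on failure or a miss, use the fallback."""
    try:
        metadata = primary(tdoc_id)
    except Exception:        # e.g. portal unreachable or authentication failure
        metadata = None
    if metadata is None:     # primary returned nothing: consult the fallback
        metadata = fallback(tdoc_id)
    return metadata
```
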
**Spec Crawl Flow:**

1. CLI `crawl-specs` collects spec numbers
2. `SpecDatabase.crawl_specs()` iterates sources
3. Each `SpecSource.fetch()` retrieves metadata
4. Results stored and optionally checked out

**State Management:**
- SQLite database (via pydantic-sqlite) for all persistent state
- File-based HTTP cache (hishel) for response caching
- Workspace state in `workspaces.json` for AI processing
- `CacheManager` as the single source of truth for paths
- No in-memory state between commands

**Spec Checkout Flow:**

1. `SpecDatabase.query()` retrieves spec info
2. `checkout_specs()` initiates download process
3. For each spec:
   - FTP server queried for latest version
   - `SpecDownloads` table checked for existing files
   - Download to `checkout_dir` if needed
4. ZIP extraction and file organization
5. Results returned with checkout paths
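
The "download only if needed" check in step 3 reduces to a file-existence test, sketched here with a hypothetical naming scheme (the real layout of `checkout_dir` may differ):

```python
from pathlib import Path


def needs_download(checkout_dir: Path, spec: str, version: str) -> bool:
    """Skip the FTP download when the versioned ZIP is already checked out."""
    return not (checkout_dir / f"{spec}-{version}.zip").exists()
```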

**AI Processing Flow:**

1. `WorkspaceRegistry` loads/creates workspace
2. `TDocProcessor.process()` iterates through documents
3. For each TDoc:
   - Checkout from database (metadata)
   - Download document if missing
   - Extract text via `kreuzberg`
   - Chunk text
   - Generate embeddings
   - Build knowledge graph (LightRAG)
4. Results stored in workspace storage
5. Query via `TDocRAG.query()` performs semantic search

## Key Abstractions

**CacheManager:**
- Purpose: Central path management for all file operations (single source of truth)
- Examples: `src/tdoc_crawler/config/__init__.py`
- Pattern: Singleton registry with name-based resolution
- Properties: `root`, `db_file`, `http_cache_file`, `checkout_dir`, `ai_cache_dir`
- Usage: `resolve_cache_manager(name)` returns the registered instance

```python
manager = resolve_cache_manager()
manager.db_file           # Database path
manager.http_cache_file   # HTTP cache path
manager.checkout_dir      # Document checkout directory
manager.ai_cache_dir      # AI storage directory
```

**TDocSource Protocol:**
- Purpose: Abstract interface for TDoc metadata sources
- Examples: `src/tdoc_crawler/tdocs/sources/base.py`
- Pattern: Protocol-based abstraction
- Implementations: `DocumentListSource`, `PortalSource`, `WhatTheSpecSource`
- Methods: `fetch_by_id()`, `fetch_by_meeting()`, `source_name`, `requires_authentication`

**SpecSource Protocol:**
- Purpose: Abstract interface for specification metadata sources
- Examples: `src/tdoc_crawler/specs/sources/base.py`
- Pattern: Protocol-based abstraction
- Implementations: `ThreeGppSpecSource`, `WhatTheSpecSpecSource`
- Methods: `name`, `fetch()`

**Database Inheritance:**
- Purpose: Layered database operations with a shared base
- Pattern: Inheritance chain
- Hierarchy: `DocDatabase` → `MeetingDatabase` → `TDocDatabase`
- Shared: Connection management, table operations, crawl logging
- Specialized: Domain-specific queries and upserts

```python
# Each domain database opens the same SQLite file
db = TDocDatabase(manager.db_file)
tdocs = db.query_tdocs(config)
db.upsert_tdocs(metadata)

# Meeting operations
db = MeetingDatabase(manager.db_file)
meetings = db.query_meetings(filters)

# Spec operations
db = SpecDatabase(manager.db_file)
specs = db.query_specs(filters)
```

**Crawler Pattern:**
- Purpose: Orchestrate crawling operations with progress tracking
- Examples: `src/tdoc_crawler/tdocs/operations/crawl.py`, `src/tdoc_crawler/meetings/operations/crawl.py`
- Pattern: Class with `crawl()` method taking config
- Returns: `TDocCrawlResult` with processed/inserted/updated/error counts

**BaseConfigModel:**
- Purpose: Shared configuration base with HTTP cache and cache manager
- Examples: `src/tdoc_crawler/models/base.py`
- Pattern: Pydantic model with common fields
- Subclasses: `TDocCrawlConfig`, `TDocQueryConfig`, `MeetingCrawlConfig`, `MeetingQueryConfig`
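
The shared-base pattern can be sketched as follows, using stdlib `dataclasses` to stay self-contained (the real models are Pydantic `BaseModel` subclasses, and the field names here are illustrative):

```python
from dataclasses import dataclass


@dataclass
class BaseConfig:
    # Fields shared by every operation config
    cache_manager_name: str = "default"
    http_cache_enabled: bool = True


@dataclass
class ExampleCrawlConfig(BaseConfig):
    # Operation-specific fields extend the shared base
    meeting_ids: tuple[str, ...] = ()
    force_refresh: bool = False
```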

## Entry Points

**CLI Main Entry:**
- Location: `src/tdoc_crawler/cli/app.py`
- Triggers: `tdoc-crawler` command
- Responsibilities: Register commands, handle global options, set up the `CacheManager`

**Spec CLI Entry:**
- Location: `src/tdoc_crawler/cli/spec_app.py`
- Triggers: `spec-crawler` command
- Responsibilities: Spec-specific commands (checkout, download)
**Domain Crawler Entry:**
- Location: `src/tdoc_crawler/{domain}/operations/crawl.py`
- Triggers: CLI crawl commands
- Responsibilities: Orchestrate domain-specific crawling

**Library Entry:**
- Location: `src/tdoc_crawler/__init__.py`
- Triggers: `import tdoc_crawler` in external code
- Responsibilities: Export public API components
**Database Entry:**
- Location: `src/tdoc_crawler/database/{domain}.py`
- Triggers: Domain operations
- Responsibilities: CRUD operations, queries, upserts

## Error Handling

**Strategy:** Layered exception handling with domain-specific errors

**Patterns:**
- `DatabaseError` with typed error codes (`connection_not_open`, etc.)
- `PortalParsingError` and `PortalAuthenticationError` for portal failures
- `ConversionError` hierarchy in the convert-lo package
- Validation errors via Pydantic field validators (`ValidationError`)
- Graceful fallbacks (Portal → WhatTheSpec)

```python
# Source pattern - raise clear errors
if metadata is None:
    raise ValueError(f"TDoc {tdoc_id} not found")

# Database pattern - let Pydantic handle validation
try:
    db.upsert_tdocs([metadata])
except ValidationError as exc:
    raise ValueError(f"Invalid metadata: {exc}") from exc
```

## Cross-Cutting Concerns

**Logging:**
- Module: `src/tdoc_crawler/logging/`
- Pattern: `get_logger(__name__)` returns a configured logger
- Levels controlled via the `--verbosity` CLI flag (`set_verbosity()`)
- Default level from the `TDC_VERBOSITY` environment variable
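
A hypothetical sketch of the verbosity mapping (the real module's behavior may differ in details):

```python
import logging

# Map a repeated -v / --verbosity count to logging levels
_LEVELS = {0: logging.WARNING, 1: logging.INFO, 2: logging.DEBUG}


def set_verbosity(count: int) -> None:
    """Translate the CLI verbosity count into a package-level log level."""
    logging.getLogger("tdoc_crawler").setLevel(_LEVELS.get(count, logging.DEBUG))


def get_logger(name: str) -> logging.Logger:
    """Child loggers inherit the package-level verbosity."""
    return logging.getLogger("tdoc_crawler").getChild(name)
```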

**Validation:**
- Pydantic validators in models; all data models inherit from `BaseModel`
- Field validators for normalization (uppercase IDs, enum conversion)
- Model validators for relationship integrity
- Type hints mandatory (`T | None`, not `Optional[T]`)

**Authentication:**
- Portal credentials via environment variables (`EOL_USERNAME`, `EOL_PASSWORD`) or CLI flags
- `resolve_credentials()` for centralized credential access
- `PromptCredentialsOption` for interactive credential input
- Sources declare a `requires_authentication` property

**HTTP Caching:**
- Module: `src/tdoc_crawler/http_client/`
- Pattern: `create_cached_session()` with hishel (file-based SQLite cache)
- Configuration via `HttpCacheConfig` (TTL, refresh on access)
- Must be used for ALL 3gpp.org requests; prevents rate-limiting
- Makes incremental crawls 50-90% faster

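
The cache's core decision, serve from disk versus refetch, reduces to a staleness check. This stdlib sketch shows only the TTL idea; hishel itself implements full HTTP cache semantics internally:

```python
import time
from pathlib import Path


def is_fresh(cached_file: Path, ttl_seconds: float) -> bool:
    """A cached response is usable if it exists and is younger than the TTL."""
    if not cached_file.exists():
        return False
    age = time.time() - cached_file.stat().st_mtime
    return age < ttl_seconds
```
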
## Package Architecture

**3gpp-ai Package:**
- Location: `packages/3gpp-ai/threegpp_ai/`
- Purpose: AI-powered document processing (LightRAG integration)
- Key modules: `lightrag/`, `operations/`
- Uses: CacheManager for paths, environment for LLM config

**convert-lo Package:**
- Location: `packages/convert-lo/convert_lo/`
- Purpose: LibreOffice headless document conversion
- Key class: `Converter` with `convert()` method

**pool_executors Package:**
- Location: `packages/pool_executors/pool_executors/`
- Purpose: Serial/parallel execution utilities
- Key APIs: `SerialPoolExecutor`, `Runner`, `create_executor()`

---

*Architecture analysis: 2026-03-26*