Commit d1973b0c authored by Jan Reimes

refresh: codebase analysis — 7 mapper agents (stack, integrations, architecture, structure, conventions, testing, concerns)

parent 2569a7af

# Architecture

**Analysis Date:** 2026-04-27

## Pattern Overview

**Overall:** Layered CLI-first monorepo with domain modules and shared infrastructure services.

**Key Characteristics:**

- Command entrypoints are thin Typer adapters that delegate to domain operations in `src/tdoc_crawler/cli/tdoc_app.py`, `src/tdoc_crawler/cli/spec_app.py`, and `packages/3gpp-ai/threegpp_ai/cli.py` (see the sketch after this list).
- Domain logic is split by business area (`tdocs`, `meetings`, `specs`) with consistent `models`/`operations`/`sources` separation under `src/tdoc_crawler/`.
- Persistence and transport are centralized in reusable layers (`src/tdoc_crawler/database/`, `src/tdoc_crawler/http_client/`, `src/tdoc_crawler/config/`).
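
The thin-adapter shape, as a minimal sketch (command and operation names are illustrative, not the repo's actual signatures):

```python
import typer

app = typer.Typer()


def run_crawl(meeting: str, limit: int) -> int:
    """Stand-in for a domain operation; real logic lives under src/tdoc_crawler/."""
    return 0


@app.command()
def crawl(
    meeting: str = typer.Argument(..., help="Meeting code to crawl"),
    limit: int = typer.Option(100, help="Maximum documents to fetch"),
) -> None:
    """Thin adapter: parse options, delegate to the domain layer, render output."""
    count = run_crawl(meeting, limit)
    typer.echo(f"Crawled {count} documents for {meeting}")


if __name__ == "__main__":
    app()
```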

## Layers

**CLI Layer:**

- Purpose: Parse options, configure runtime context, and render output.
- Location: `src/tdoc_crawler/cli/`, `packages/3gpp-ai/threegpp_ai/cli/`
- Contains: Typer apps, option aliases, output formatting adapters.
- Depends on: Domain models/operations, config loader, logging.
- Used by: Script entrypoints in `pyproject.toml` (`tdoc-crawler`, `spec-crawler`, `3gpp-ai`).

**Domain Operations Layer:**

- Purpose: Execute crawl/query/checkout workflows.
- Location: `src/tdoc_crawler/tdocs/operations/`, `src/tdoc_crawler/meetings/operations/`, `src/tdoc_crawler/specs/operations/`
- Contains: Orchestrators such as `TDocCrawler`, `MeetingCrawler`, and spec checkout orchestration.
- Depends on: Database facades, source adapters, parsers, utility normalization.
- Used by: CLI layer and package integrations (notably `packages/3gpp-ai/threegpp_ai/operations/`).

**Source/Client Layer:**

- Purpose: Fetch and normalize data from external systems.
- Location: `src/tdoc_crawler/tdocs/sources/`, `src/tdoc_crawler/specs/sources/`, `src/tdoc_crawler/clients/`, `src/tdoc_crawler/parsers/`
- Contains: Portal/WhatTheSpec/doclist source implementations and HTML parsing.
- Depends on: HTTP client and credential resolution.
- Used by: Domain operations.

**Infrastructure Layer:**

- Purpose: Provide shared runtime services (config, HTTP cache/session, logging, worker execution).
- Location: `src/tdoc_crawler/config/`, `src/tdoc_crawler/http_client/`, `src/tdoc_crawler/logging/`, `src/tdoc_crawler/workers/`, `packages/pool_executors/pool_executors/`
- Contains: `ThreeGPPConfig`, `CacheManager`, cached session factory, subinterpreter worker functions.
- Depends on: pydantic-settings, niquests/hishel, pool executor package.
- Used by: CLI and domain operations.

**Persistence Layer:**

- Purpose: Store and query crawler state and metadata.
- Location: `src/tdoc_crawler/database/`
- Contains: `DocDatabase` lifecycle, table management, and typed facades (`TDocDatabase`, `MeetingDatabase`, `SpecDatabase`).
- Depends on: Oxyde async ORM and model definitions in `src/tdoc_crawler/database/oxyde_models.py`.
- Used by: Domain operations and some CLI query paths.

## Data Flow

**TDoc Crawl Flow:**

1. User executes `tdoc-crawler crawl` (`src/tdoc_crawler/cli/tdoc_app.py` routes to `crawl_tdocs` in `src/tdoc_crawler/cli/crawl.py`).
2. CLI builds `TDocCrawlConfig`, opens `TDocDatabase`, and instantiates `TDocCrawler`.
3. `TDocCrawler.crawl()` loads meetings from DB and dispatches per-meeting worker tasks through `pool_executors.create_executor()` in `src/tdoc_crawler/tdocs/operations/crawl.py`.
4. Worker entrypoint `fetch_meeting_document_list_subinterpreter()` in `src/tdoc_crawler/workers/tdoc_worker.py` fetches doclists and returns JSON payloads.
5. Orchestrator normalizes/deduplicates metadata and persists via `TDocDatabase.bulk_upsert_tdocs()`.
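
Steps 3–5 form a fan-out/normalize/persist loop. A rough sketch using the stdlib `ProcessPoolExecutor` as a stand-in for `pool_executors.create_executor()` (payload shape and helper names are assumptions):

```python
import json
from concurrent.futures import ProcessPoolExecutor, as_completed


def fetch_doclist(meeting: str) -> str:
    """Worker boundary returns a serialized JSON payload, not live objects."""
    return json.dumps([{"tdoc_id": f"{meeting}-0001", "title": "Example"}])


def crawl(meetings: list[str]) -> list[dict]:
    seen: set[str] = set()
    rows: list[dict] = []
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(fetch_doclist, m): m for m in meetings}
        for fut in as_completed(futures):
            for row in json.loads(fut.result()):  # deserialize at the boundary
                if row["tdoc_id"] not in seen:    # normalize/deduplicate
                    seen.add(row["tdoc_id"])
                    rows.append(row)
    return rows  # caller persists via TDocDatabase.bulk_upsert_tdocs()


if __name__ == "__main__":
    print(crawl(["SA2-160", "RAN1-116"]))
```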

**TDoc Query + On-Demand Fetch Flow:**

1. User executes `tdoc-crawler query` handled by `query_tdocs` in `src/tdoc_crawler/cli/query.py`.
2. CLI queries `TDocDatabase.query_tdocs()` with `TDocQueryConfig`.
3. Missing IDs are resolved by `fetch_missing_tdocs()` in `src/tdoc_crawler/tdocs/operations/fetch.py` using source strategy/fallback.
4. Output is rendered through `src/tdoc_crawler/cli/printing.py` and `src/tdoc_crawler/cli/formatting.py`.
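
The fallback in step 3 reduces to: query first, fetch only the gaps, merge. A simplified sketch with in-memory stand-ins for `TDocDatabase.query_tdocs()` and `fetch_missing_tdocs()`:

```python
def fetch_missing(ids: list[str]) -> list[dict]:
    """Stand-in for fetch_missing_tdocs(): source strategy with fallback."""
    return [{"tdoc_id": i, "title": "fetched on demand"} for i in ids]


def query_with_fallback(db: dict[str, dict], requested: list[str]) -> list[dict]:
    found = [db[i] for i in requested if i in db]
    missing = [i for i in requested if i not in db]
    if missing:
        fetched = fetch_missing(missing)
        db.update({row["tdoc_id"]: row for row in fetched})  # persist for next query
        found.extend(fetched)
    return found


print(query_with_fallback({"S2-2400001": {"tdoc_id": "S2-2400001"}}, ["S2-2400001", "S2-2400002"]))
```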

**State Management:**

- Runtime state is file-backed and config-driven (`PathConfig` in `src/tdoc_crawler/config/settings.py`).
- Shared mutable runtime objects are minimized; DB and HTTP sessions are short-lived context-managed instances.
- Parallel crawl state exchange uses serialized JSON payloads between worker boundaries.

## Key Abstractions

**Configuration Abstraction (`ThreeGPPConfig` + `CacheManager`):**

- Purpose: Centralize config loading and path resolution.
- Examples: `src/tdoc_crawler/config/settings.py`, `src/tdoc_crawler/config/cache_manager.py`
- Pattern: Pydantic settings model + registered runtime path manager.

**Source Abstraction (`TDocSource` protocol):**

- Purpose: Hide source-specific fetch details behind a common interface.
- Examples: `src/tdoc_crawler/tdocs/sources/base.py`, `src/tdoc_crawler/tdocs/sources/portal.py`, `src/tdoc_crawler/tdocs/sources/whatthespec.py`
- Pattern: Protocol-driven adapters selected by fetch orchestrators.
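
A minimal sketch of the protocol-driven adapter pattern; the `fetch()` signature is an assumption, not the actual interface in `base.py`:

```python
from typing import Protocol


class TDocSource(Protocol):
    def fetch(self, tdoc_id: str) -> dict | None: ...


class WhatTheSpecSource:
    def fetch(self, tdoc_id: str) -> dict | None:
        return {"tdoc_id": tdoc_id, "source": "whatthespec"}


class PortalSource:
    def fetch(self, tdoc_id: str) -> dict | None:
        return {"tdoc_id": tdoc_id, "source": "portal"}


def fetch_with_fallback(sources: list[TDocSource], tdoc_id: str) -> dict | None:
    for source in sources:  # preferred source first, then fallbacks
        if (result := source.fetch(tdoc_id)) is not None:
            return result
    return None


print(fetch_with_fallback([WhatTheSpecSource(), PortalSource()], "S2-2400001"))
```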

**Database Facade Abstraction:**

- Purpose: Expose domain-friendly methods over Oxyde models and SQL lifecycle.
- Examples: `src/tdoc_crawler/database/base.py`, `src/tdoc_crawler/database/tdocs.py`, `src/tdoc_crawler/database/specs.py`
- Pattern: Async facade classes inheriting shared lifecycle behavior.
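
The facade shape in miniature, with `sqlite3` standing in for the Oxyde async ORM (schema and method bodies are illustrative; only `bulk_upsert_tdocs` is a name taken from the codebase):

```python
import asyncio
import sqlite3  # stand-in for Oxyde's AsyncDatabase


class DocDatabase:
    """Shared lifecycle: open/close the connection via async context manager."""

    def __init__(self, path: str) -> None:
        self._path = path

    async def __aenter__(self) -> "DocDatabase":
        self._conn = sqlite3.connect(self._path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS tdocs (tdoc_id TEXT PRIMARY KEY, title TEXT)"
        )
        return self

    async def __aexit__(self, *exc: object) -> None:
        self._conn.close()


class TDocDatabase(DocDatabase):
    """Domain-friendly methods layered over the shared lifecycle."""

    async def bulk_upsert_tdocs(self, rows: list[dict]) -> None:
        with self._conn:  # one transaction per batch
            self._conn.executemany(
                "INSERT OR REPLACE INTO tdocs VALUES (:tdoc_id, :title)", rows
            )


async def main() -> None:
    async with TDocDatabase(":memory:") as db:
        await db.bulk_upsert_tdocs([{"tdoc_id": "S2-2400001", "title": "Example"}])


asyncio.run(main())
```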

## Entry Points

**TDoc/Meeting CLI:**

- Location: `src/tdoc_crawler/cli/tdoc_app.py`
- Triggers: `tdoc-crawler` script in root `pyproject.toml` and `python -m tdoc_crawler` via `src/tdoc_crawler/__main__.py`
- Responsibilities: Register command groups, initialize config/cache manager, dispatch to crawl/query/open/checkout paths.

**Spec CLI:**

- Location: `src/tdoc_crawler/cli/spec_app.py`
- Triggers: `spec-crawler` script in root `pyproject.toml`
- Responsibilities: Spec crawl/query/checkout/open workflows.

**AI Extension CLI:**

- Location: `packages/3gpp-ai/threegpp_ai/cli.py`
- Triggers: `3gpp-ai` script in `packages/3gpp-ai/pyproject.toml`
- Responsibilities: Workspace/document AI workflows reusing core crawler storage/query components.

## Error Handling

**Strategy:** Boundary-level exception handling with typed domain errors and CLI-friendly exit behavior.

**Patterns:**

- Database lifecycle wraps failures in `DatabaseError` in `src/tdoc_crawler/database/base.py`.
- Source/client fetch paths catch transport and parse exceptions and either return `None` or aggregate error messages (`src/tdoc_crawler/tdocs/operations/fetch.py`, `src/tdoc_crawler/clients/portal.py`).
- CLI commands convert validation/runtime failures to `typer.Exit` with user-facing Rich output (`src/tdoc_crawler/cli/config_app.py`, `src/tdoc_crawler/cli/query.py`).
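
The CLI boundary pattern in miniature (`DatabaseError` and `typer.Exit` are the names cited above; everything else is illustrative):

```python
import typer
from rich.console import Console

console = Console()


class DatabaseError(RuntimeError):
    """Typed domain error raised by the persistence layer."""


def query_command(tdoc_id: str) -> None:
    try:
        raise DatabaseError(f"no row for {tdoc_id}")  # simulate a lookup failure
    except DatabaseError as exc:
        console.print(f"[red]Query failed:[/red] {exc}")  # user-facing Rich output
        raise typer.Exit(code=1) from exc                 # CLI-friendly exit
```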

## Cross-Cutting Concerns

**Logging:** `tdoc_crawler.logging` logger setup is consumed across core and AI package modules.
**Validation:** Pydantic models/settings validate CLI inputs, config, and metadata schemas.
**Authentication:** Credentials are resolved via `src/tdoc_crawler/credentials.py`; authenticated portal flows run through `PortalClient`.

---

*Architecture analysis: 2026-04-27*
# External Integrations

**Analysis Date:** 2026-04-30

## APIs & External Services

**3GPP Public Website (`www.3gpp.org`):**

- Purpose: Meeting pages, spec FTP archive access, TDoc search
- URLs:
  - Meetings: `https://www.3gpp.org/dynareport?code=Meetings-{code}.htm`
  - Spec archive: `https://www.3gpp.org/ftp/Specs/archive/{series}/{normalized}/{file_name}`
- SDK/Client: `niquests` sessions created by `create_cached_session()` in `src/tdoc_crawler/http_client/session.py`; URL templates in `src/tdoc_crawler/constants/urls.py`; spec downloads through `src/tdoc_crawler/specs/downloads.py`
- Auth: None (public, but a User-Agent header is required)
- Data formats: HTML (parsed via BeautifulSoup), Excel (document lists), PDF/Office documents

**3GPP Portal (`portal.3gpp.org`):**

- Purpose: Authenticated fallback for TDoc metadata; unauthenticated document URL extraction
- Client: `PortalClient` in `src/tdoc_crawler/clients/portal.py`
- URLs:
  - Login: `https://portal.3gpp.org/login.aspx`
  - TDoc view: `https://portal.3gpp.org/ngppapp/CreateTdoc.Aspx`
  - TDoc download: `https://portal.3gpp.org/ngppapp/DownloadTDoc.aspx`
- Auth: EOL (Escape Online) credentials via `TDC_EOL_USERNAME`/`TDC_EOL_PASSWORD` (mapped in `src/tdoc_crawler/config/env_vars.py`)
- Auth pattern: HTTP POST with username/password, session cookie retention
- Session management:
  - Cached session reused across requests
  - Cache disabled for login requests (explicit `http_cache_enabled=False`)
  - Browser user-agent required to avoid 403 Forbidden

**Document List Service (3GPP Portal, `GenerateDocumentList.aspx`):**

- Purpose: Unauthenticated meeting TDoc metadata (title, status, contact, URL)
- Format: Excel (.xlsx) document lists fetched via HTTP GET
- Client: `src/tdoc_crawler/tdocs/sources/doclist.py`
- Parsing: `calamine` (Rust-backed Excel reader) → `TDocMetadata` models
- Auth: None

**WhatTheSpec (`whatthespec.net`):**

- Purpose: Preferred unauthenticated TDoc/spec metadata source and fallback path
- Client: `src/tdoc_crawler/tdocs/sources/whatthespec.py`, `src/tdoc_crawler/specs/sources/whatthespec.py`
- Auth: None

**AI and conversion services:**

- LLM providers through LiteLLM: summarization, figure description, completions
  - SDK/Client: `litellm` via `packages/3gpp-ai/threegpp_ai/operations/llm_client.py`
  - Auth: `TDC_AI_LLM_API_KEY` or provider-specific API key env vars
- Remote Office-to-PDF conversion API (`pdf-convert.3gpp.org`): fallback when local LibreOffice conversion fails or is unavailable
  - SDK/Client: `packages/3gpp-ai/threegpp_ai/operations/conversion.py`
  - Base URL: `https://pdf-convert.3gpp.org` (env: `PDF_REMOTE_API_BASE`)
  - Auth: `PDF_REMOTE_API_KEY` (Bearer token, optional)
  - Trigger: Automatic fallback when `ConverterBackend.AUTO` is selected
  - Formats: DOCX, PPTX, XLSX, DOC, PPT, XLS

## Data Storage

**Databases:**

- SQLite 3 (primary metadata store, file-based: `~/.3gpp-crawler/3gpp_crawler.db`)
  - Connection: local file path via `PathConfig.db_file` in `src/tdoc_crawler/config/settings.py`
  - Client: Oxyde async ORM (`AsyncDatabase` in `src/tdoc_crawler/database/base.py`) with a `sqlite:///...` URL
  - Tables: TDocMetadata, MeetingMetadata, Specification, SpecificationVersion, SpecificationDownload, SpecificationSourceRecord, CrawlLogEntry, WorkingGroupRecord, SubWorkingGroupRecord
  - Schema: auto-migrated via `extract_current_schema()` at startup

**Caching:**

- HTTP response caching via `hishel` + SQLite in `src/tdoc_crawler/http_client/session.py`
  - Cache DB: `~/.3gpp-crawler/http-cache.sqlite3` (path via `PathConfig.http_cache_file`)
  - Backend: `hishel.SyncSqliteStorage`
  - Stores: full HTTP responses (headers + body) keyed by request URL
  - Expiration: honors Cache-Control, ETag, and Last-Modified headers
  - Fallback: returns stale cache on 5xx errors (conditional)

**File Storage:**

- Local filesystem only (cache, checkout, AI workspace folders) managed by `PathConfig` and `CacheManager` in `src/tdoc_crawler/config/settings.py` and `src/tdoc_crawler/config/cache_manager.py`
- Checkout directory: `~/.3gpp-crawler/checkout/`
  - TDoc files: `checkout/{meeting}_{tdoc_id}/`
  - Spec files: `checkout/Specs/{series}/{normalized}/`
  - Wiki extraction: `wiki/{workspace_id}/members/{member_id}/`

**Workspace Registry:**

- File-based JSON: `~/.3gpp-crawler/workspaces.json`
- Stores: workspace metadata, extraction profiles, member specs
- Mutation: in-memory model serialized back to disk

## Authentication & Identity

**Auth Provider:**

- Custom credential-based auth for the 3GPP EOL portal
  - Implementation: username/password in `CredentialsConfig` (`src/tdoc_crawler/config/settings.py`) consumed by `PortalClient` (`src/tdoc_crawler/clients/portal.py`)
  - Pattern: username + password sent as an HTTP POST form
  - Storage: `PortalCredentials` model (username, password fields)
  - Initialization: `set_credentials(username, password)` called at CLI startup
  - Retrieval: `resolve_credentials()` from a global registry
  - Validation: attempted on first portal request; `PortalAuthenticationError` on failure

**Session Management:**

- Stateful session: requests-compatible `niquests` session with hishel `CacheAdapter`
- Cookie jar: automatic, per the session API
- Retry logic: `urllib3.Retry` with exponential backoff (5xx errors, timeouts)
- SSL verification: configurable (default: system CA bundle)
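
A hedged sketch of the login flow described above, using the requests-compatible `niquests` API (form field names and the helper shape are assumptions; the real logic lives in `PortalClient` and `create_cached_session()`):

```python
import niquests

LOGIN_URL = "https://portal.3gpp.org/login.aspx"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"  # plain client UAs get 403 Forbidden


def login(username: str, password: str) -> niquests.Session:
    session = niquests.Session()                # cookie jar is automatic
    session.headers["User-Agent"] = BROWSER_UA
    resp = session.post(                        # login POST must bypass the HTTP cache
        LOGIN_URL,
        data={"username": username, "password": password},  # illustrative field names
    )
    resp.raise_for_status()                     # auth failures surface here
    return session                              # session cookie retained for reuse
```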

## Monitoring & Observability

**Error Tracking:**

- None (no Sentry/New Relic/Rollbar integration detected)
- Errors logged via the `logging` module (`get_logger(__name__)`)
- Crawl failures recorded in the `CrawlLogEntry` table with exception text

**Logs:**

- Python logging with centralized helpers (`tdoc_crawler.logging.get_logger`) used across `src/tdoc_crawler/` and `packages/3gpp-ai/threegpp_ai/`
- File-based: `~/.3gpp-crawler/logs/` (optional, configured via settings)
- Console: Rich-formatted output (colors, tables, progress bars)
- Levels: DEBUG, INFO, WARNING, ERROR (controlled via the `--verbosity` CLI flag)
- Structured detail via Pydantic validation errors and logger context-passing

## CI/CD & Deployment

**Hosting:**

- Not detected; the repository is CLI/package oriented (on-premises or user-managed Docker deployment)
- CLI entry points:
  - `tdoc-crawler` - TDoc crawling and querying
  - `spec-crawler` - Specification crawling and querying
  - `3gpp-crawler` - Workspace management and configuration

**CI Pipeline:**

- No `.github/workflows/` directory detected (GitHub Actions not checked in)
- Pre-commit hooks available (ruff, pytest, deptry)
- Local/portable multi-Python test orchestration via `tox.ini` (`tox` + `tox-uv`)

**Docker Support:**

- Not pre-configured; users supply their own `Dockerfile`
- Key dependencies: Python 3.14, libssl-dev, libxml2-dev (for lxml)

## Environment Configuration

**Required env vars:**

- Portal auth fallback: `TDC_EOL_USERNAME` (3GPP portal username), `TDC_EOL_PASSWORD` (3GPP portal password)
- AI pipeline (`3gpp-ai` package): `TDC_AI_LLM_MODEL`, `TDC_AI_LLM_API_KEY`, `TDC_AI_LLM_API_BASE`, `TDC_AI_PARALLELISM`

**Optional env vars:**

- `TDC_CACHE_DIR` - Cache directory (default: `~/.3gpp-crawler`)
- `HTTP_CACHE_TTL` - HTTP cache time-to-live
- `TDC_HTTP_TIMEOUT` - Request timeout in seconds (default: 30)
- `TDC_HTTP_RETRIES` - Max retries on 5xx (default: 3)
- `TDC_OUTPUT` - Default output format: `table`, `json`, `toon`, `yaml`
- `TDC_LOG_LEVEL` - Logging level (default: `INFO`)
- `PDF_REMOTE_API_KEY` - Remote PDF converter API key
- `PDF_REMOTE_API_BASE` - Remote PDF converter base URL (default: `https://pdf-convert.3gpp.org`)
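
A sketch of how pydantic-settings can map these variables (the class itself is illustrative; field names mirror the list above):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class CrawlerSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="TDC_", env_file=".env")

    eol_username: str = ""               # TDC_EOL_USERNAME
    eol_password: str = ""               # TDC_EOL_PASSWORD
    cache_dir: str = "~/.3gpp-crawler"   # TDC_CACHE_DIR
    http_timeout: int = 30               # TDC_HTTP_TIMEOUT (seconds)
    http_retries: int = 3                # TDC_HTTP_RETRIES
    log_level: str = "INFO"              # TDC_LOG_LEVEL


settings = CrawlerSettings()  # reads the environment first, then .env
```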

**Secrets location:**

- Environment variables (direct export; primary)
- `.env` file (auto-loaded by `python-dotenv`)
- `~/.3gpp-crawler/config.toml` (optional, includes a credentials section), discovered by settings loading in `src/tdoc_crawler/config/sources.py`; secrets should remain environment-backed

## Webhooks & Callbacks

**Incoming:**

- None (CLI-only, no HTTP server)

**Outgoing:**

- None (no push notifications, webhooks, or async job submission)

## Data Sources & Crawling

**TDoc Sources:**
1. **Portal-based (authenticated):**
   - `PortalClient.fetch_tdoc_metadata()` — Crawl TDoc URLs from portal search
   - Returns: Full metadata (title, contact, status, file size, dates)
   - Frequency: Once per meeting per crawl session

2. **Document List (Excel):**
   - `fetch_meeting_tdocs_from_doclist()` — Excel document list parsing
   - Source: `{portal_base}/ngppapp/DownloadTDocFile.aspx?filename={meeting_doclist}.xlsx`
   - Returns: Flattened TDoc metadata from Excel rows
   - Faster than a portal crawl, but requires Excel parsing

**Specification Sources:**
1. **Direct 3GPP FTP Download:**
   - Base URL: `https://www.3gpp.org/ftp/Specs/archive/{series}/{normalized}/`
   - Pattern: Follows 3GPP FTP directory structure (TS/TR numbering)
   - Formats: PDF, Word, ZIP archives

2. **Meeting Document Sources (optional):**
   - Specs referenced in TDoc metadata
   - Cross-indexed via working group + release

**Meeting Sources:**
1. **3GPP Meetings Portal:**
   - URL: `https://www.3gpp.org/dynareport?code=Meetings-{code}.htm`
   - Format: HTML table (parsed via BeautifulSoup)
   - Fields: Meeting ID, date range, location, files URL
   - Frequency: Scanned annually (meeting data is stable)

## Performance Characteristics

**HTTP Caching:**
- Layer: Hishel `SyncSqliteStorage`
- Backend: SQLite3 with in-memory index
- Cache ratio: ~60-80% hit rate on re-crawls (depends on ETags)
- TTL: Per Cache-Control header (3GPP files: often indefinite/immutable)
- Warmup: First crawl populates cache; subsequent runs use cached responses

**Async Architecture:**
- Crawling: Async/await for concurrent network I/O (`asyncio`)
- Database: Async context manager (`AsyncDatabase.__aenter__/__aexit__`)
- Executor pools: `aiointerpreters` + `pool-executors` for CPU-bound work (PDF parsing)
- Concurrency: Up to 10 concurrent requests (configurable via `PoolConfig`)
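
The concurrency cap above reduces to a semaphore-bounded gather; a minimal sketch (the HTTP call is a placeholder):

```python
import asyncio


async def fetch_one(url: str, limiter: asyncio.Semaphore) -> str:
    async with limiter:              # at most `limit` requests in flight
        await asyncio.sleep(0.1)     # placeholder for a real HTTP request
        return url


async def fetch_all(urls: list[str], limit: int = 10) -> list[str]:
    limiter = asyncio.Semaphore(limit)
    return await asyncio.gather(*(fetch_one(u, limiter) for u in urls))


asyncio.run(fetch_all([f"https://example.org/{i}" for i in range(25)]))
```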

**Document Processing Pipeline:**
- PDF extraction: OpenDataLoader (fast, hybrid LLM-based)
- Format conversion: LibreOffice UNO (5–15s per file) or remote API (2–5s)
- Caching: Converted PDFs cached in checkout directory (no re-conversion)
- Performance: ~2–5 documents/second (dependent on file size and LLM load)

## Failure Modes & Resilience

**Portal Authentication Failures:**
- Symptom: 401/403 on metadata fetch
- Recovery: Automatic retry with fresh session (up to 3 attempts)
- Fallback: Use document list (Excel) as alternative source

**Network Timeouts:**
- Default: 30 seconds per request
- Retry: Exponential backoff (1s, 2s, 4s)
- Fallback: Cached response (if available and not stale)
- User notice: Progress bar shows "⚠ timeout, retrying..."
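
The 1s/2s/4s backoff above, as a minimal sketch (the failing request is simulated):

```python
import random
import time


def fetch() -> str:
    """Placeholder request that sometimes times out."""
    if random.random() < 0.5:
        raise TimeoutError("simulated timeout")
    return "response"


def get_with_retries(retries: int = 3) -> str:
    delay = 1.0
    for attempt in range(retries + 1):
        try:
            return fetch()
        except TimeoutError:
            if attempt == retries:
                raise              # attempts exhausted, surface the error
            time.sleep(delay)      # 1s, 2s, 4s backoff
            delay *= 2
    raise AssertionError("unreachable")


print(get_with_retries())
```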

**PDF Conversion Failures:**
- Local LibreOffice crashes: Fallback to remote API
- Remote API 500s: Manual intervention required (or skip document)
- Logging: Full exception stack in `~/.3gpp-crawler/logs/`

**Database Corruption:**
- Strategy: WAL mode (write-ahead logging) enabled by default
- Recovery: Manual backup + schema re-creation via `--clear-db` flag
- Prevention: No concurrent writers (single CLI process)

---

*Integration audit: 2026-04-30*