Commit d1973b0c authored by Jan Reimes

refresh: codebase analysis — 7 mapper agents (stack, integrations, architecture, structure, conventions, testing, concerns)

parent 2569a7af

# Architecture

**Analysis Date:** 2026-04-27

## Pattern Overview

**Overall:** Layered CLI-first monorepo with domain modules and shared infrastructure services.

**Key Characteristics:**

- Command entrypoints are thin Typer adapters that delegate to domain operations in `src/tdoc_crawler/cli/tdoc_app.py`, `src/tdoc_crawler/cli/spec_app.py`, and `packages/3gpp-ai/threegpp_ai/cli.py` (see the sketch after this list).
- Domain logic is split by business area (`tdocs`, `meetings`, `specs`) with consistent `models`/`operations`/`sources` separation under `src/tdoc_crawler/`.
- Persistence and transport are centralized in reusable layers (`src/tdoc_crawler/database/`, `src/tdoc_crawler/http_client/`, `src/tdoc_crawler/config/`).
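
The thin-adapter shape, as a minimal sketch (command and operation names are illustrative, not the repo's actual signatures):

```python
import typer

app = typer.Typer()


def run_crawl(meeting: str, limit: int) -> int:
    """Stand-in for a domain operation; real logic lives under src/tdoc_crawler/."""
    return 0


@app.command()
def crawl(
    meeting: str = typer.Argument(..., help="Meeting code to crawl"),
    limit: int = typer.Option(100, help="Maximum documents to fetch"),
) -> None:
    """Thin adapter: parse options, delegate to the domain layer, render output."""
    count = run_crawl(meeting, limit)
    typer.echo(f"Crawled {count} documents for {meeting}")


if __name__ == "__main__":
    app()
```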

## Layers

**CLI Layer:**

- Purpose: Parse options, configure runtime context, and render output.
- Location: `src/tdoc_crawler/cli/`, `packages/3gpp-ai/threegpp_ai/cli/`
- Contains: Typer apps, option aliases, output formatting adapters.
- Depends on: Domain models/operations, config loader, logging.
- Used by: Script entrypoints in `pyproject.toml` (`tdoc-crawler`, `spec-crawler`, `3gpp-ai`).

**Domain Operations Layer:**

- Purpose: Execute crawl/query/checkout workflows.
- Location: `src/tdoc_crawler/tdocs/operations/`, `src/tdoc_crawler/meetings/operations/`, `src/tdoc_crawler/specs/operations/`
- Contains: Orchestrators such as `TDocCrawler`, `MeetingCrawler`, and spec checkout orchestration.
- Depends on: Database facades, source adapters, parsers, utility normalization.
- Used by: CLI layer and package integrations (notably `packages/3gpp-ai/threegpp_ai/operations/`).

**Source/Client Layer:**

- Purpose: Fetch and normalize data from external systems.
- Location: `src/tdoc_crawler/tdocs/sources/`, `src/tdoc_crawler/specs/sources/`, `src/tdoc_crawler/clients/`, `src/tdoc_crawler/parsers/`
- Contains: Portal/WhatTheSpec/doclist source implementations and HTML parsing.
- Depends on: HTTP client and credential resolution.
- Used by: Domain operations.

**Infrastructure Layer:**

- Purpose: Provide shared runtime services (config, HTTP cache/session, logging, worker execution).
- Location: `src/tdoc_crawler/config/`, `src/tdoc_crawler/http_client/`, `src/tdoc_crawler/logging/`, `src/tdoc_crawler/workers/`, `packages/pool_executors/pool_executors/`
- Contains: `ThreeGPPConfig`, `CacheManager`, cached session factory, subinterpreter worker functions.
- Depends on: pydantic-settings, niquests/hishel, pool executor package.
- Used by: CLI and domain operations.

**Persistence Layer:**

- Purpose: Store and query crawler state and metadata.
- Location: `src/tdoc_crawler/database/`
- Contains: `DocDatabase` lifecycle, table management, and typed facades (`TDocDatabase`, `MeetingDatabase`, `SpecDatabase`).
- Depends on: Oxyde async ORM and model definitions in `src/tdoc_crawler/database/oxyde_models.py`.
- Used by: Domain operations and some CLI query paths.

## Data Flow

**TDoc Crawl Flow:**

1. User executes `tdoc-crawler crawl` (`src/tdoc_crawler/cli/tdoc_app.py` routes to `crawl_tdocs` in `src/tdoc_crawler/cli/crawl.py`).
2. CLI builds `TDocCrawlConfig`, opens `TDocDatabase`, and instantiates `TDocCrawler`.
3. `TDocCrawler.crawl()` loads meetings from DB and dispatches per-meeting worker tasks through `pool_executors.create_executor()` in `src/tdoc_crawler/tdocs/operations/crawl.py`.
4. Worker entrypoint `fetch_meeting_document_list_subinterpreter()` in `src/tdoc_crawler/workers/tdoc_worker.py` fetches doclists and returns JSON payloads.
5. Orchestrator normalizes/deduplicates metadata and persists via `TDocDatabase.bulk_upsert_tdocs()`.
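
Steps 3–5 form a fan-out/normalize/persist loop. A rough sketch using the stdlib `ProcessPoolExecutor` as a stand-in for `pool_executors.create_executor()` (payload shape and helper names are assumptions):

```python
import json
from concurrent.futures import ProcessPoolExecutor, as_completed


def fetch_doclist(meeting: str) -> str:
    """Worker boundary returns a serialized JSON payload, not live objects."""
    return json.dumps([{"tdoc_id": f"{meeting}-0001", "title": "Example"}])


def crawl(meetings: list[str]) -> list[dict]:
    seen: set[str] = set()
    rows: list[dict] = []
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(fetch_doclist, m): m for m in meetings}
        for fut in as_completed(futures):
            for row in json.loads(fut.result()):  # deserialize at the boundary
                if row["tdoc_id"] not in seen:    # normalize/deduplicate
                    seen.add(row["tdoc_id"])
                    rows.append(row)
    return rows  # caller persists via TDocDatabase.bulk_upsert_tdocs()


if __name__ == "__main__":
    print(crawl(["SA2-160", "RAN1-116"]))
```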

**TDoc Query + On-Demand Fetch Flow:**

1. User executes `tdoc-crawler query` handled by `query_tdocs` in `src/tdoc_crawler/cli/query.py`.
2. CLI queries `TDocDatabase.query_tdocs()` with `TDocQueryConfig`.
3. Missing IDs are resolved by `fetch_missing_tdocs()` in `src/tdoc_crawler/tdocs/operations/fetch.py` using source strategy/fallback.
4. Output is rendered through `src/tdoc_crawler/cli/printing.py` and `src/tdoc_crawler/cli/formatting.py`.
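
The fallback in step 3 reduces to: query first, fetch only the gaps, merge. A simplified sketch with in-memory stand-ins for `TDocDatabase.query_tdocs()` and `fetch_missing_tdocs()`:

```python
def fetch_missing(ids: list[str]) -> list[dict]:
    """Stand-in for fetch_missing_tdocs(): source strategy with fallback."""
    return [{"tdoc_id": i, "title": "fetched on demand"} for i in ids]


def query_with_fallback(db: dict[str, dict], requested: list[str]) -> list[dict]:
    found = [db[i] for i in requested if i in db]
    missing = [i for i in requested if i not in db]
    if missing:
        fetched = fetch_missing(missing)
        db.update({row["tdoc_id"]: row for row in fetched})  # persist for next query
        found.extend(fetched)
    return found


print(query_with_fallback({"S2-2400001": {"tdoc_id": "S2-2400001"}}, ["S2-2400001", "S2-2400002"]))
```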

**State Management:**

- Runtime state is file-backed and config-driven (`PathConfig` in `src/tdoc_crawler/config/settings.py`).
- Shared mutable runtime objects are minimized; DB and HTTP sessions are short-lived context-managed instances.
- Parallel crawl state exchange uses serialized JSON payloads between worker boundaries.

## Key Abstractions

**Configuration Abstraction (`ThreeGPPConfig` + `CacheManager`):**

- Purpose: Centralize config loading and path resolution.
- Examples: `src/tdoc_crawler/config/settings.py`, `src/tdoc_crawler/config/cache_manager.py`
- Pattern: Pydantic settings model + registered runtime path manager.

**Source Abstraction (`TDocSource` protocol):**

- Purpose: Hide source-specific fetch details behind a common interface.
- Examples: `src/tdoc_crawler/tdocs/sources/base.py`, `src/tdoc_crawler/tdocs/sources/portal.py`, `src/tdoc_crawler/tdocs/sources/whatthespec.py`
- Pattern: Protocol-driven adapters selected by fetch orchestrators.
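
A minimal sketch of the protocol-driven adapter pattern; the `fetch()` signature is an assumption, not the actual interface in `base.py`:

```python
from typing import Protocol


class TDocSource(Protocol):
    def fetch(self, tdoc_id: str) -> dict | None: ...


class WhatTheSpecSource:
    def fetch(self, tdoc_id: str) -> dict | None:
        return {"tdoc_id": tdoc_id, "source": "whatthespec"}


class PortalSource:
    def fetch(self, tdoc_id: str) -> dict | None:
        return {"tdoc_id": tdoc_id, "source": "portal"}


def fetch_with_fallback(sources: list[TDocSource], tdoc_id: str) -> dict | None:
    for source in sources:  # preferred source first, then fallbacks
        if (result := source.fetch(tdoc_id)) is not None:
            return result
    return None


print(fetch_with_fallback([WhatTheSpecSource(), PortalSource()], "S2-2400001"))
```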

**Database Facade Abstraction:**

- Purpose: Expose domain-friendly methods over Oxyde models and SQL lifecycle.
- Examples: `src/tdoc_crawler/database/base.py`, `src/tdoc_crawler/database/tdocs.py`, `src/tdoc_crawler/database/specs.py`
- Pattern: Async facade classes inheriting shared lifecycle behavior.
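
The facade shape in miniature, with `sqlite3` standing in for the Oxyde async ORM (schema and method bodies are illustrative; only `bulk_upsert_tdocs` is a name taken from the codebase):

```python
import asyncio
import sqlite3  # stand-in for Oxyde's AsyncDatabase


class DocDatabase:
    """Shared lifecycle: open/close the connection via async context manager."""

    def __init__(self, path: str) -> None:
        self._path = path

    async def __aenter__(self) -> "DocDatabase":
        self._conn = sqlite3.connect(self._path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS tdocs (tdoc_id TEXT PRIMARY KEY, title TEXT)"
        )
        return self

    async def __aexit__(self, *exc: object) -> None:
        self._conn.close()


class TDocDatabase(DocDatabase):
    """Domain-friendly methods layered over the shared lifecycle."""

    async def bulk_upsert_tdocs(self, rows: list[dict]) -> None:
        with self._conn:  # one transaction per batch
            self._conn.executemany(
                "INSERT OR REPLACE INTO tdocs VALUES (:tdoc_id, :title)", rows
            )


async def main() -> None:
    async with TDocDatabase(":memory:") as db:
        await db.bulk_upsert_tdocs([{"tdoc_id": "S2-2400001", "title": "Example"}])


asyncio.run(main())
```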

## Entry Points

**TDoc/Meeting CLI:**

- Location: `src/tdoc_crawler/cli/tdoc_app.py`
- Triggers: `tdoc-crawler` script in root `pyproject.toml` and `python -m tdoc_crawler` via `src/tdoc_crawler/__main__.py`
- Responsibilities: Register command groups, initialize config/cache manager, dispatch to crawl/query/open/checkout paths.

**Spec CLI:**

- Location: `src/tdoc_crawler/cli/spec_app.py`
- Triggers: `spec-crawler` script in root `pyproject.toml`
- Responsibilities: Spec crawl/query/checkout/open workflows.

**AI Extension CLI:**

- Location: `packages/3gpp-ai/threegpp_ai/cli.py`
- Triggers: `3gpp-ai` script in `packages/3gpp-ai/pyproject.toml`
- Responsibilities: Workspace/document AI workflows reusing core crawler storage/query components.

## Error Handling

**Strategy:** Boundary-level exception handling with typed domain errors and CLI-friendly exit behavior.

**Patterns:**

- Database lifecycle wraps failures in `DatabaseError` in `src/tdoc_crawler/database/base.py`.
- Source/client fetch paths catch transport and parse exceptions and either return `None` or aggregate error messages (`src/tdoc_crawler/tdocs/operations/fetch.py`, `src/tdoc_crawler/clients/portal.py`).
- CLI commands convert validation/runtime failures to `typer.Exit` with user-facing Rich output (`src/tdoc_crawler/cli/config_app.py`, `src/tdoc_crawler/cli/query.py`).
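
The CLI boundary pattern in miniature (`DatabaseError` and `typer.Exit` are the names cited above; everything else is illustrative):

```python
import typer
from rich.console import Console

console = Console()


class DatabaseError(RuntimeError):
    """Typed domain error raised by the persistence layer."""


def query_command(tdoc_id: str) -> None:
    try:
        raise DatabaseError(f"no row for {tdoc_id}")  # simulate a lookup failure
    except DatabaseError as exc:
        console.print(f"[red]Query failed:[/red] {exc}")  # user-facing Rich output
        raise typer.Exit(code=1) from exc                 # CLI-friendly exit
```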

## Cross-Cutting Concerns

**Logging:** `tdoc_crawler.logging` logger setup is consumed across core and AI package modules.
**Validation:** Pydantic models/settings validate CLI inputs, config, and metadata schemas.
**Authentication:** Credentials are resolved via `src/tdoc_crawler/credentials.py`; authenticated portal flows run through `PortalClient`.

---

*Architecture analysis: 2026-04-27*
# External Integrations

**Analysis Date:** 2026-04-30

## APIs & External Services

**3GPP Public Website (`www.3gpp.org`):**

- Purpose: Meeting pages, spec FTP archive access, TDoc search
- URLs:
  - Meetings: `https://www.3gpp.org/dynareport?code=Meetings-{code}.htm`
  - Spec archive: `https://www.3gpp.org/ftp/Specs/archive/{series}/{normalized}/{file_name}`
- SDK/Client: `niquests` sessions created by `create_cached_session()` in `src/tdoc_crawler/http_client/session.py`; URL templates in `src/tdoc_crawler/constants/urls.py`; spec downloads through `src/tdoc_crawler/specs/downloads.py`
- Auth: None (public, but a User-Agent header is required)
- Data formats: HTML (parsed via BeautifulSoup), Excel (document lists), PDF/Office documents

**3GPP Portal (`portal.3gpp.org`):**

- Purpose: Authenticated fallback for TDoc metadata; unauthenticated document URL extraction
- Client: `PortalClient` in `src/tdoc_crawler/clients/portal.py`
- URLs:
  - Login: `https://portal.3gpp.org/login.aspx`
  - TDoc view: `https://portal.3gpp.org/ngppapp/CreateTdoc.Aspx`
  - TDoc download: `https://portal.3gpp.org/ngppapp/DownloadTDoc.aspx`
- Auth: EOL (Escape Online) credentials via `TDC_EOL_USERNAME`/`TDC_EOL_PASSWORD` (mapped in `src/tdoc_crawler/config/env_vars.py`)
- Auth pattern: HTTP POST with username/password, session cookie retention
- Session management:
  - Cached session reused across requests
  - Cache disabled for login requests (explicit `http_cache_enabled=False`)
  - Browser user-agent required to avoid 403 Forbidden

**Document List Service (3GPP Portal, `GenerateDocumentList.aspx`):**

- Purpose: Unauthenticated meeting TDoc metadata (title, status, contact, URL)
- Format: Excel (.xlsx) document lists fetched via HTTP GET
- Client: `src/tdoc_crawler/tdocs/sources/doclist.py`
- Parsing: `calamine` (Rust-backed Excel reader) → `TDocMetadata` models
- Auth: None

**WhatTheSpec (`whatthespec.net`):**

- Purpose: Preferred unauthenticated TDoc/spec metadata source and fallback path
- Client: `src/tdoc_crawler/tdocs/sources/whatthespec.py`, `src/tdoc_crawler/specs/sources/whatthespec.py`
- Auth: None

**AI and conversion services:**

- LLM providers through LiteLLM: summarization, figure description, completions
  - SDK/Client: `litellm` via `packages/3gpp-ai/threegpp_ai/operations/llm_client.py`
  - Auth: `TDC_AI_LLM_API_KEY` or provider-specific API key env vars
- Remote Office-to-PDF conversion API (`pdf-convert.3gpp.org`): fallback when local LibreOffice conversion fails or is unavailable
  - SDK/Client: `packages/3gpp-ai/threegpp_ai/operations/conversion.py`
  - Base URL: `https://pdf-convert.3gpp.org` (env: `PDF_REMOTE_API_BASE`)
  - Auth: `PDF_REMOTE_API_KEY` (Bearer token, optional)
  - Trigger: Automatic fallback when `ConverterBackend.AUTO` is selected
  - Formats: DOCX, PPTX, XLSX, DOC, PPT, XLS

## Data Storage

**Databases:**

- SQLite 3 (primary metadata store, file-based: `~/.3gpp-crawler/3gpp_crawler.db`)
  - Connection: local file path via `PathConfig.db_file` in `src/tdoc_crawler/config/settings.py`
  - Client: Oxyde async ORM (`AsyncDatabase` in `src/tdoc_crawler/database/base.py`) with a `sqlite:///...` URL
  - Tables: TDocMetadata, MeetingMetadata, Specification, SpecificationVersion, SpecificationDownload, SpecificationSourceRecord, CrawlLogEntry, WorkingGroupRecord, SubWorkingGroupRecord
  - Schema: auto-migrated via `extract_current_schema()` at startup

**Caching:**

- HTTP response caching via `hishel` + SQLite in `src/tdoc_crawler/http_client/session.py`
  - Cache DB: `~/.3gpp-crawler/http-cache.sqlite3` (path via `PathConfig.http_cache_file`)
  - Backend: `hishel.SyncSqliteStorage`
  - Stores: full HTTP responses (headers + body) keyed by request URL
  - Expiration: honors Cache-Control, ETag, and Last-Modified headers
  - Fallback: returns stale cache on 5xx errors (conditional)

**File Storage:**

- Local filesystem only (cache, checkout, AI workspace folders) managed by `PathConfig` and `CacheManager` in `src/tdoc_crawler/config/settings.py` and `src/tdoc_crawler/config/cache_manager.py`
- Checkout directory: `~/.3gpp-crawler/checkout/`
  - TDoc files: `checkout/{meeting}_{tdoc_id}/`
  - Spec files: `checkout/Specs/{series}/{normalized}/`
  - Wiki extraction: `wiki/{workspace_id}/members/{member_id}/`

**Workspace Registry:**

- File-based JSON: `~/.3gpp-crawler/workspaces.json`
- Stores: workspace metadata, extraction profiles, member specs
- Mutation: in-memory model serialized back to disk

## Authentication & Identity

**Auth Provider:**

- Custom credential-based auth for the 3GPP EOL portal
  - Implementation: username/password in `CredentialsConfig` (`src/tdoc_crawler/config/settings.py`) consumed by `PortalClient` (`src/tdoc_crawler/clients/portal.py`)
  - Pattern: username + password sent as an HTTP POST form
  - Storage: `PortalCredentials` model (username, password fields)
  - Initialization: `set_credentials(username, password)` called at CLI startup
  - Retrieval: `resolve_credentials()` from a global registry
  - Validation: attempted on first portal request; `PortalAuthenticationError` on failure

**Session Management:**

- Stateful session: requests-compatible `niquests` session with hishel `CacheAdapter`
- Cookie jar: automatic, per the session API
- Retry logic: `urllib3.Retry` with exponential backoff (5xx errors, timeouts)
- SSL verification: configurable (default: system CA bundle)
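
A hedged sketch of the login flow described above, using the requests-compatible `niquests` API (form field names and the helper shape are assumptions; the real logic lives in `PortalClient` and `create_cached_session()`):

```python
import niquests

LOGIN_URL = "https://portal.3gpp.org/login.aspx"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"  # plain client UAs get 403 Forbidden


def login(username: str, password: str) -> niquests.Session:
    session = niquests.Session()                # cookie jar is automatic
    session.headers["User-Agent"] = BROWSER_UA
    resp = session.post(                        # login POST must bypass the HTTP cache
        LOGIN_URL,
        data={"username": username, "password": password},  # illustrative field names
    )
    resp.raise_for_status()                     # auth failures surface here
    return session                              # session cookie retained for reuse
```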

## Monitoring & Observability

**Error Tracking:**

- None (no Sentry/New Relic/Rollbar integration detected)
- Errors logged via the `logging` module (`get_logger(__name__)`)
- Crawl failures recorded in the `CrawlLogEntry` table with exception text

**Logs:**

- Python logging with centralized helpers (`tdoc_crawler.logging.get_logger`) used across `src/tdoc_crawler/` and `packages/3gpp-ai/threegpp_ai/`
- File-based: `~/.3gpp-crawler/logs/` (optional, configured via settings)
- Console: Rich-formatted output (colors, tables, progress bars)
- Levels: DEBUG, INFO, WARNING, ERROR (controlled via the `--verbosity` CLI flag)
- Structured detail via Pydantic validation errors and logger context-passing

## CI/CD & Deployment

**Hosting:**

- Not detected; the repository is CLI/package oriented (on-premises or user-managed Docker deployment)
- CLI entry points:
  - `tdoc-crawler` - TDoc crawling and querying
  - `spec-crawler` - Specification crawling and querying
  - `3gpp-crawler` - Workspace management and configuration

**CI Pipeline:**

- No `.github/workflows/` directory detected (GitHub Actions not checked in)
- Pre-commit hooks available (ruff, pytest, deptry)
- Local/portable multi-Python test orchestration via `tox.ini` (`tox` + `tox-uv`)

**Docker Support:**

- Not pre-configured; users supply their own `Dockerfile`
- Key dependencies: Python 3.14, libssl-dev, libxml2-dev (for lxml)

## Environment Configuration

**Required env vars:**

- Portal auth fallback: `TDC_EOL_USERNAME` (3GPP portal username), `TDC_EOL_PASSWORD` (3GPP portal password)
- AI pipeline (`3gpp-ai` package): `TDC_AI_LLM_MODEL`, `TDC_AI_LLM_API_KEY`, `TDC_AI_LLM_API_BASE`, `TDC_AI_PARALLELISM`

**Optional env vars:**

- `TDC_CACHE_DIR` - Cache directory (default: `~/.3gpp-crawler`)
- `HTTP_CACHE_TTL` - HTTP cache time-to-live
- `TDC_HTTP_TIMEOUT` - Request timeout in seconds (default: 30)
- `TDC_HTTP_RETRIES` - Max retries on 5xx (default: 3)
- `TDC_OUTPUT` - Default output format: `table`, `json`, `toon`, `yaml`
- `TDC_LOG_LEVEL` - Logging level (default: `INFO`)
- `PDF_REMOTE_API_KEY` - Remote PDF converter API key
- `PDF_REMOTE_API_BASE` - Remote PDF converter base URL (default: `https://pdf-convert.3gpp.org`)
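
A sketch of how pydantic-settings can map these variables (the class itself is illustrative; field names mirror the list above):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class CrawlerSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="TDC_", env_file=".env")

    eol_username: str = ""               # TDC_EOL_USERNAME
    eol_password: str = ""               # TDC_EOL_PASSWORD
    cache_dir: str = "~/.3gpp-crawler"   # TDC_CACHE_DIR
    http_timeout: int = 30               # TDC_HTTP_TIMEOUT (seconds)
    http_retries: int = 3                # TDC_HTTP_RETRIES
    log_level: str = "INFO"              # TDC_LOG_LEVEL


settings = CrawlerSettings()  # reads the environment first, then .env
```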

**Secrets location:**

- Environment variables (direct export; primary)
- `.env` file (auto-loaded by `python-dotenv`)
- `~/.3gpp-crawler/config.toml` (optional, includes a credentials section), discovered by settings loading in `src/tdoc_crawler/config/sources.py`; secrets should remain environment-backed

## Webhooks & Callbacks

**Incoming:**

- None (CLI-only, no HTTP server)

**Outgoing:**

- None (no push notifications, webhooks, or async job submission)

## Data Sources & Crawling

**TDoc Sources:**
1. **Portal-based (authenticated):**
   - `PortalClient.fetch_tdoc_metadata()` — Crawl TDoc URLs from portal search
   - Returns: Full metadata (title, contact, status, file size, dates)
   - Frequency: Once per meeting per crawl session

2. **Document List (Excel):**
   - `fetch_meeting_tdocs_from_doclist()` — Excel document list parsing
   - Source: `{portal_base}/ngppapp/DownloadTDocFile.aspx?filename={meeting_doclist}.xlsx`
   - Returns: Flattened TDoc metadata from Excel rows
   - Faster than a portal crawl, but requires Excel parsing

**Specification Sources:**
1. **Direct 3GPP FTP Download:**
   - Base URL: `https://www.3gpp.org/ftp/Specs/archive/{series}/{normalized}/`
   - Pattern: Follows 3GPP FTP directory structure (TS/TR numbering)
   - Formats: PDF, Word, ZIP archives

2. **Meeting Document Sources (optional):**
   - Specs referenced in TDoc metadata
   - Cross-indexed via working group + release

**Meeting Sources:**
1. **3GPP Meetings Portal:**
   - URL: `https://www.3gpp.org/dynareport?code=Meetings-{code}.htm`
   - Format: HTML table (parsed via BeautifulSoup)
   - Fields: Meeting ID, date range, location, files URL
   - Frequency: Scanned annually (meeting data is stable)

## Performance Characteristics

**HTTP Caching:**
- Layer: Hishel `SyncSqliteStorage`
- Backend: SQLite3 with in-memory index
- Cache ratio: ~60-80% hit rate on re-crawls (depends on ETags)
- TTL: Per Cache-Control header (3GPP files: often indefinite/immutable)
- Warmup: First crawl populates cache; subsequent runs use cached responses

**Async Architecture:**
- Crawling: Async/await for concurrent network I/O (`asyncio`)
- Database: Async context manager (`AsyncDatabase.__aenter__/__aexit__`)
- Executor pools: `aiointerpreters` + `pool-executors` for CPU-bound work (PDF parsing)
- Concurrency: Up to 10 concurrent requests (configurable via `PoolConfig`)
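
The concurrency cap above reduces to a semaphore-bounded gather; a minimal sketch (the HTTP call is a placeholder):

```python
import asyncio


async def fetch_one(url: str, limiter: asyncio.Semaphore) -> str:
    async with limiter:              # at most `limit` requests in flight
        await asyncio.sleep(0.1)     # placeholder for a real HTTP request
        return url


async def fetch_all(urls: list[str], limit: int = 10) -> list[str]:
    limiter = asyncio.Semaphore(limit)
    return await asyncio.gather(*(fetch_one(u, limiter) for u in urls))


asyncio.run(fetch_all([f"https://example.org/{i}" for i in range(25)]))
```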

**Document Processing Pipeline:**
- PDF extraction: OpenDataLoader (fast, hybrid LLM-based)
- Format conversion: LibreOffice UNO (5–15s per file) or remote API (2–5s)
- Caching: Converted PDFs cached in checkout directory (no re-conversion)
- Performance: ~2–5 documents/second (dependent on file size and LLM load)

## Failure Modes & Resilience

**Portal Authentication Failures:**
- Symptom: 401/403 on metadata fetch
- Recovery: Automatic retry with fresh session (up to 3 attempts)
- Fallback: Use document list (Excel) as alternative source

**Network Timeouts:**
- Default: 30 seconds per request
- Retry: Exponential backoff (1s, 2s, 4s)
- Fallback: Cached response (if available and not stale)
- User notice: Progress bar shows "⚠ timeout, retrying..."
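
The 1s/2s/4s backoff above, as a minimal sketch (the failing request is simulated):

```python
import random
import time


def fetch() -> str:
    """Placeholder request that sometimes times out."""
    if random.random() < 0.5:
        raise TimeoutError("simulated timeout")
    return "response"


def get_with_retries(retries: int = 3) -> str:
    delay = 1.0
    for attempt in range(retries + 1):
        try:
            return fetch()
        except TimeoutError:
            if attempt == retries:
                raise              # attempts exhausted, surface the error
            time.sleep(delay)      # 1s, 2s, 4s backoff
            delay *= 2
    raise AssertionError("unreachable")


print(get_with_retries())
```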

**PDF Conversion Failures:**
- Local LibreOffice crashes: Fallback to remote API
- Remote API 500s: Manual intervention required (or skip document)
- Logging: Full exception stack in `~/.3gpp-crawler/logs/`

**Database Corruption:**
- Strategy: WAL mode (write-ahead logging) enabled by default
- Recovery: Manual backup + schema re-creation via `--clear-db` flag
- Prevention: No concurrent writers (single CLI process)

---

*Integration audit: 2026-04-30*