Commit 1f10675f authored by Jan Reimes

refresh: codebase analysis — rewrite 7 architecture docs (stack, integrations, architecture, structure, conventions, testing, concerns)

parent d1973b0c
# Architecture

**Analysis Date:** 2026-04-27

## Pattern Overview

**Overall:** Layered CLI-first monorepo with domain modules and shared infrastructure services.

**Key Characteristics:**

- Command entrypoints are thin Typer adapters that delegate to domain operations in `src/tdoc_crawler/cli/tdoc_app.py`, `src/tdoc_crawler/cli/spec_app.py`, and `packages/3gpp-ai/threegpp_ai/cli.py`.
- Domain logic is split by business area (`tdocs`, `meetings`, `specs`) with consistent `models`/`operations`/`sources` separation under `src/tdoc_crawler/`.
- Persistence and transport are centralized in reusable layers (`src/tdoc_crawler/database/`, `src/tdoc_crawler/http_client/`, `src/tdoc_crawler/config/`).
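
A minimal sketch of the thin-adapter pattern from the first bullet (hypothetical command and operation names; the real handlers live in `src/tdoc_crawler/cli/tdoc_app.py`):

```python
from __future__ import annotations

import typer

app = typer.Typer()


def run_crawl(meeting: str, limit: int | None) -> int:
    """Stand-in for a domain operation such as TDocCrawler.crawl()."""
    # Real operations open the database and dispatch worker tasks here.
    return 0


@app.command("crawl")
def crawl(
    meeting: str = typer.Option(..., "--meeting", "-m"),
    limit: int | None = typer.Option(None, "--limit"),
) -> None:
    # The adapter only parses options and renders output; all crawl
    # logic stays in the domain operation it delegates to.
    count = run_crawl(meeting, limit)
    typer.echo(f"Crawled {count} documents")
```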

## Layers

**CLI Layer:**

- Purpose: Parse options, configure runtime context, and render output.
- Location: `src/tdoc_crawler/cli/`, `packages/3gpp-ai/threegpp_ai/cli/`
- Contains: Typer apps, option aliases, output formatting adapters.
- Depends on: Domain models/operations, config loader, logging.
- Used by: Script entrypoints in `pyproject.toml` (`tdoc-crawler`, `spec-crawler`, `3gpp-ai`).

**Domain Operations Layer:**

- Purpose: Execute crawl/query/checkout workflows.
- Location: `src/tdoc_crawler/tdocs/operations/`, `src/tdoc_crawler/meetings/operations/`, `src/tdoc_crawler/specs/operations/`
- Contains: Orchestrators such as `TDocCrawler`, `MeetingCrawler`, and spec checkout orchestration.
- Depends on: Database facades, source adapters, parsers, utility normalization.
- Used by: CLI layer and package integrations (notably `packages/3gpp-ai/threegpp_ai/operations/`).

**Source/Client Layer:**

- Purpose: Fetch and normalize data from external systems.
- Location: `src/tdoc_crawler/tdocs/sources/`, `src/tdoc_crawler/specs/sources/`, `src/tdoc_crawler/clients/`, `src/tdoc_crawler/parsers/`
- Contains: Portal/WhatTheSpec/doclist source implementations and HTML parsing.
- Depends on: HTTP client and credential resolution.
- Used by: Domain operations.

**Infrastructure Layer:**

- Purpose: Provide shared runtime services (config, HTTP cache/session, logging, worker execution).
- Location: `src/tdoc_crawler/config/`, `src/tdoc_crawler/http_client/`, `src/tdoc_crawler/logging/`, `src/tdoc_crawler/workers/`, `packages/pool_executors/pool_executors/`
- Contains: `ThreeGPPConfig`, `CacheManager`, cached session factory, subinterpreter worker functions.
- Depends on: pydantic-settings, niquests/hishel, pool executor package.
- Used by: CLI and domain operations.

**Persistence Layer:**

- Purpose: Store and query crawler state and metadata.
- Location: `src/tdoc_crawler/database/`
- Contains: `DocDatabase` lifecycle, table management, and typed facades (`TDocDatabase`, `MeetingDatabase`, `SpecDatabase`).
- Depends on: Oxyde async ORM and model definitions in `src/tdoc_crawler/database/oxyde_models.py`.
- Used by: Domain operations and some CLI query paths.

## Data Flow

**TDoc Crawl Flow:**

1. User executes `tdoc-crawler crawl` (`src/tdoc_crawler/cli/tdoc_app.py` routes to `crawl_tdocs` in `src/tdoc_crawler/cli/crawl.py`).
2. CLI builds `TDocCrawlConfig`, opens `TDocDatabase`, and instantiates `TDocCrawler`.
3. `TDocCrawler.crawl()` loads meetings from DB and dispatches per-meeting worker tasks through `pool_executors.create_executor()` in `src/tdoc_crawler/tdocs/operations/crawl.py`.
4. Worker entrypoint `fetch_meeting_document_list_subinterpreter()` in `src/tdoc_crawler/workers/tdoc_worker.py` fetches doclists and returns JSON payloads.
5. Orchestrator normalizes/deduplicates metadata and persists via `TDocDatabase.bulk_upsert_tdocs()`.
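
A minimal sketch of steps 3-5, with the stdlib `ProcessPoolExecutor` standing in for `pool_executors.create_executor()` and a fake worker in place of the subinterpreter entrypoint; the JSON-string boundary mirrors the real flow:

```python
from __future__ import annotations

import json
from concurrent.futures import ProcessPoolExecutor, as_completed


def fetch_meeting_document_list(meeting_id: int) -> str:
    """Stand-in for fetch_meeting_document_list_subinterpreter()."""
    # The real worker fetches and parses a doclist; this fakes one record.
    return json.dumps({"meeting_id": meeting_id, "tdocs": [f"S2-{meeting_id}-0001"]})


def crawl(meeting_ids: list[int]) -> list[dict]:
    records: list[dict] = []
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(fetch_meeting_document_list, m) for m in meeting_ids]
        for future in as_completed(futures):
            payload = json.loads(future.result())  # deserialize at the worker boundary
            for tdoc_id in payload["tdocs"]:
                records.append({"tdoc_id": tdoc_id, "meeting_id": payload["meeting_id"]})
    # The orchestrator would now normalize, deduplicate, and bulk-upsert.
    return records


if __name__ == "__main__":
    print(crawl([129, 130]))
```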

**TDoc Query + On-Demand Fetch Flow:**

1. User executes `tdoc-crawler query` handled by `query_tdocs` in `src/tdoc_crawler/cli/query.py`.
2. CLI queries `TDocDatabase.query_tdocs()` with `TDocQueryConfig`.
3. Missing IDs are resolved by `fetch_missing_tdocs()` in `src/tdoc_crawler/tdocs/operations/fetch.py` using source strategy/fallback.
4. Output is rendered through `src/tdoc_crawler/cli/printing.py` and `src/tdoc_crawler/cli/formatting.py`.
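
The fallback in step 3 reduces to a query-then-fetch shape; a sketch with a plain dict standing in for the database and a caller-supplied fetch function (the real signatures in `fetch.py` differ):

```python
from __future__ import annotations

from collections.abc import Callable


def query_with_fallback(
    db: dict[str, dict],
    requested_ids: list[str],
    fetch_missing: Callable[[list[str]], list[dict]],
) -> list[dict]:
    found = [db[tdoc_id] for tdoc_id in requested_ids if tdoc_id in db]
    missing = [tdoc_id for tdoc_id in requested_ids if tdoc_id not in db]
    if missing:
        # fetch_missing_tdocs() tries each configured source in turn
        # and persists whatever it resolves.
        fetched = fetch_missing(missing)
        db.update({row["tdoc_id"]: row for row in fetched})
        found.extend(fetched)
    return found
```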

**State Management:**

- Runtime state is file-backed and config-driven (`PathConfig` in `src/tdoc_crawler/config/settings.py`).
- Shared mutable runtime objects are minimized; DB and HTTP sessions are short-lived context-managed instances.
- Parallel crawl state exchange uses serialized JSON payloads between worker boundaries.

## Key Abstractions

**Configuration Abstraction (`ThreeGPPConfig` + `CacheManager`):**

- Purpose: Centralize config loading and path resolution.
- Examples: `src/tdoc_crawler/config/settings.py`, `src/tdoc_crawler/config/cache_manager.py`
- Pattern: Pydantic settings model + registered runtime path manager.
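
A minimal pydantic-settings sketch of this pattern (field names are illustrative, not the actual `ThreeGPPConfig` schema):

```python
from __future__ import annotations

from pathlib import Path

from pydantic_settings import BaseSettings, SettingsConfigDict


class PathSettings(BaseSettings):
    """Illustrative settings model; values may also come from the environment."""

    model_config = SettingsConfigDict(env_prefix="TDC_")

    cache_dir: Path = Path("~/.3gpp-crawler/cache")
    db_file: Path = Path("~/.3gpp-crawler/tdocs.db")

    def resolved_cache_dir(self) -> Path:
        # Paths are expanded and resolved before use.
        return self.cache_dir.expanduser().resolve()
```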

**Source Abstraction (`TDocSource` protocol):**

- Purpose: Hide source-specific fetch details behind a common interface.
- Examples: `src/tdoc_crawler/tdocs/sources/base.py`, `src/tdoc_crawler/tdocs/sources/portal.py`, `src/tdoc_crawler/tdocs/sources/whatthespec.py`
- Pattern: Protocol-driven adapters selected by fetch orchestrators.
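
A sketch of protocol-driven adapter selection (method name and return shape are assumptions; the actual interface is defined in `sources/base.py`):

```python
from __future__ import annotations

from typing import Protocol


class TDocSource(Protocol):
    def fetch_tdoc(self, tdoc_id: str) -> dict | None:
        """Return metadata for one TDoc, or None if this source lacks it."""
        ...


class PortalSource:
    def fetch_tdoc(self, tdoc_id: str) -> dict | None:
        return {"tdoc_id": tdoc_id, "source": "portal"}


class WhatTheSpecSource:
    def fetch_tdoc(self, tdoc_id: str) -> dict | None:
        return None  # pretend this source cannot resolve the ID


def fetch_with_fallback(sources: list[TDocSource], tdoc_id: str) -> dict | None:
    # The orchestrator walks the configured sources until one answers.
    for source in sources:
        if (result := source.fetch_tdoc(tdoc_id)) is not None:
            return result
    return None
```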

**Database Facade Abstraction:**

- Purpose: Expose domain-friendly methods over Oxyde models and SQL lifecycle.
- Examples: `src/tdoc_crawler/database/base.py`, `src/tdoc_crawler/database/tdocs.py`, `src/tdoc_crawler/database/specs.py`
- Pattern: Async facade classes inheriting shared lifecycle behavior.
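
A simplified sketch of the facade shape (stdlib sqlite3 stands in for the Oxyde ORM, and the schema is invented for illustration):

```python
from __future__ import annotations

import sqlite3
from pathlib import Path
from typing import Self


class DatabaseError(RuntimeError):
    pass


class DocDatabase:
    """Shared lifecycle behavior inherited by the typed facades."""

    def __init__(self, db_file: Path) -> None:
        self.db_file = db_file
        self._conn: sqlite3.Connection | None = None

    async def __aenter__(self) -> Self:
        self._conn = sqlite3.connect(self.db_file)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS tdocs (tdoc_id TEXT PRIMARY KEY, title TEXT)"
        )
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        if self._conn is not None:
            self._conn.close()

    @property
    def conn(self) -> sqlite3.Connection:
        if self._conn is None:
            raise DatabaseError("connection not open")
        return self._conn


class TDocDatabase(DocDatabase):
    async def bulk_upsert_tdocs(self, rows: list[tuple[str, str]]) -> None:
        # Domain-friendly method layered over the raw connection.
        self.conn.executemany("INSERT OR REPLACE INTO tdocs VALUES (?, ?)", rows)
        self.conn.commit()
```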

## Entry Points

**TDoc/Meeting CLI:**

- Location: `src/tdoc_crawler/cli/tdoc_app.py`
- Triggers: `tdoc-crawler` script in root `pyproject.toml` and `python -m tdoc_crawler` via `src/tdoc_crawler/__main__.py`
- Responsibilities: Register command groups, initialize config/cache manager, dispatch to crawl/query/open/checkout paths.

**Spec CLI:**

- Location: `src/tdoc_crawler/cli/spec_app.py`
- Triggers: `spec-crawler` script in root `pyproject.toml`
- Responsibilities: Spec crawl/query/checkout/open workflows.

**AI Extension CLI:**

- Location: `packages/3gpp-ai/threegpp_ai/cli.py`
- Triggers: `3gpp-ai` script in `packages/3gpp-ai/pyproject.toml`
- Responsibilities: Workspace/document AI workflows reusing core crawler storage/query components.

## Error Handling

**Strategy:** Boundary-level exception handling with typed domain errors and CLI-friendly exit behavior.

**Patterns:**

- Database lifecycle wraps failures in `DatabaseError` in `src/tdoc_crawler/database/base.py`.
- Source/client fetch paths catch transport and parse exceptions and either return `None` or aggregate error messages (`src/tdoc_crawler/tdocs/operations/fetch.py`, `src/tdoc_crawler/clients/portal.py`).
- CLI commands convert validation/runtime failures to `typer.Exit` with user-facing Rich output (`src/tdoc_crawler/cli/config_app.py`, `src/tdoc_crawler/cli/query.py`).
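
A sketch of the boundary conversion from the last bullet (hypothetical command body; the real handlers live in `src/tdoc_crawler/cli/query.py`):

```python
from __future__ import annotations

import typer
from rich.console import Console

console = Console()


class DatabaseError(RuntimeError):
    pass


def query_command(tdoc_id: str) -> None:
    try:
        if not tdoc_id:
            raise DatabaseError("empty TDoc id")
        console.print(f"[green]OK[/green] {tdoc_id}")
    except DatabaseError as exc:
        # Typed domain errors become user-facing Rich output plus a
        # non-zero exit code instead of a traceback.
        console.print(f"[red]Error:[/red] {exc}")
        raise typer.Exit(code=1) from exc
```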

## Cross-Cutting Concerns

- **Logging:** `tdoc_crawler.logging` logger setup is consumed across core and AI package modules.
- **Validation:** Pydantic models/settings validate CLI inputs, config, and metadata schemas.
- **Authentication:** Credentials are resolved via `src/tdoc_crawler/credentials.py`; authenticated portal flows run through `PortalClient`.

---

*Architecture analysis: 2026-04-27*
# CONVENTIONS

Code style, naming, error handling, imports, CLI patterns, logging, docstrings, and
modelling conventions for the 3gpp-crawler project.

## Python Version

Python 3.14+ only (`requires-python = ">=3.14,<4.0"`).
Code can use all 3.14 features (`T | None` is native and preferred).
## Code Style

### Type Hints

- **Use T | None, never Optional[T].** The project uses Python 3.14 and
  `from __future__ import annotations` in all files, so `str | None` is universal.
- **Use list[X], dict[K, V], tuple[X, ...]** from builtins -- never
  typing.List, typing.Dict, etc.
- **Use Self** for return type annotations of `__aenter__` and factory methods
  (`from typing import Self`).
- **Use TYPE_CHECKING guards** for expensive type-only imports
  (e.g., from collections.abc import Iterable, from pathlib import Path).
  NOTE: TYPE_CHECKING is acceptable here (unlike the older tdoc-crawler project).
- **Avoid Any where possible.** Narrow types. Use cast() from typing
  when the type checker cannot infer.
- **Function return types are always annotated.**
  No bare `def foo()` without `-> ReturnType`.
- **Parameters use Annotated[type, ...]** for Typer CLI options/arguments.
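
Most of these rules in one illustrative fragment (names are invented):

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Self

if TYPE_CHECKING:  # expensive, type-only imports
    from collections.abc import Iterable
    from pathlib import Path


class Checkout:
    def __init__(self, checkout_dir: Path | None = None) -> None:
        self.checkout_dir = checkout_dir

    async def __aenter__(self) -> Self:
        return self

    async def __aexit__(self, *exc_info: object) -> None: ...

    def add_files(self, files: Iterable[Path]) -> list[str]:
        # Builtin generics (list[str]), never typing.List.
        return [str(file) for file in files]
```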

### f-strings

- **Prefer f-strings** over % formatting or .format().
  Logging exception: `_logger.exception("Failed: %s", var)`
  (lazy evaluation in stdlib logging).
- No str() calls where an f-string would do.

### pathlib

- **Use pathlib.Path, never os.path.**
  No os.path.join(), os.path.exists(), etc.
  Directories: path.mkdir(parents=True, exist_ok=True).
- Paths resolved with .resolve() and expanded with .expanduser().
- ~ expansion: Path("~/.3gpp-crawler").expanduser(), not os.path.expanduser.
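
The same rules as a short fragment (the cache location is illustrative), which also shows the `_dir`/`_file` suffix convention described below:

```python
from __future__ import annotations

from pathlib import Path

cache_dir = Path("~/.3gpp-crawler/cache").expanduser().resolve()
cache_dir.mkdir(parents=True, exist_ok=True)  # never os.makedirs
http_cache_file = cache_dir / "http_cache.sqlite"  # '/' join, not os.path.join
```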

### Naming

- **Snake case everywhere:** snake_case for variables, functions, methods, modules.
- **PascalCase for classes:** Pydantic models, database classes, custom exceptions.
- **UPPER_CASE for module-level constants.**
- **Private methods/functions** prefixed with `_`
  (e.g., `_normalize_tdoc_id`, `_ensure_tables_exist`).

#### Path Variable Suffixes (CRITICAL)

Variables holding file paths MUST use a `_file` suffix;
variables holding directory paths MUST use a `_dir` suffix. Examples:

- `db_file: Path` -- path to a database file
- `cache_dir: Path` -- path to a cache directory
- `checkout_dir: Path` -- path to a checkout directory
- `config_file: Path | None` -- path to a config file
- `http_cache_file: Path` -- path to an HTTP cache file
- `output_dir: Path` -- path to an output directory

Applies to all variables: function parameters, locals, and dataclass/pydantic fields.

#### Test Naming

Tests follow the `test_<scenario>` pattern for functions.
Test classes use `Test<ClassName>` or `Test<Feature>`:

```python
class TestTDocDatabase:
    async def test_upsert_tdoc(self) -> None: ...
    async def test_case_insensitive_query(self) -> None: ...
```

### Code Size Limits (Soft)

| Scope | Limit |
|-------|-------|
| Module | < 250 lines |
| Function | < 75 lines |
| Class | < 200 lines |

## Error Handling

### Exception Hierarchy

- **DatabaseError(RuntimeError)** -- Base exception for database failures.
  Uses `code: str` and optional `detail: str`.
  Factory classmethods like `DatabaseError.connection_not_open()`.
- **CacheManagerNotRegisteredError(RuntimeError)** -- Singleton not yet registered.
- **NormalizationError(ValueError)** -- Input normalization failures.
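
A sketch of the factory-classmethod style (the message text is invented; fields per the description above):

```python
from __future__ import annotations

from typing import Self


class DatabaseError(RuntimeError):
    """Base exception for database failures."""

    def __init__(self, code: str, detail: str | None = None) -> None:
        self.code = code
        self.detail = detail
        super().__init__(f"{code}: {detail}" if detail else code)

    @classmethod
    def connection_not_open(cls) -> Self:
        return cls("connection_not_open", "open the database with 'async with' first")
```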

### Patterns

- **Raise specific exceptions.** Never raise Exception("...").
  Each domain has its own exception class.
- **No bare except:**.
- **raise ... from exc** for exception chaining in context managers.
- **Let it fail:** Do not defensively wrap resolve_cache_manager().
  If not registered, that is a dev error.
- **pytest.raises(SomeError, match="...")** for asserting specific errors.

### Return Type Discipline

- **Avoid inconsistent return types.** A function returns X | None,
  not sometimes X, sometimes str, sometimes None.
  See docs/development.md for the documented antipattern.
- **Raise, don't return error strings.**
  Return None for "not found"; raise for invalid input.
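
For example (illustrative lookup):

```python
from __future__ import annotations


def find_tdoc(db: dict[str, dict], tdoc_id: str) -> dict | None:
    if not tdoc_id.strip():
        raise ValueError("tdoc_id must be non-empty")  # invalid input: raise
    return db.get(tdoc_id)  # not found: return None, never an error string
```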

## Imports

### Order

1. `from __future__ import annotations` (always first, mandatory)
2. Standard library
3. Third-party (pydantic, typer, pytest, etc.)
4. First-party: `tdoc_crawler.*`
5. Local (relative imports only in tests: `from .conftest import ...`)

### Style

- **Absolute imports** within tdoc_crawler. No relative imports in src/.
- **Never `import *`** -- the project uses explicit `__all__` in every module.
- **Prefer importing classes/functions over modules:**
  `from tdoc_crawler.config import CacheManager`.
- **No TYPE_CHECKING for circular import workarounds.**
  Extract shared types to models/ instead.
- **TYPE_CHECKING acceptable** for expensive type-only imports
  (Iterable, Path, Version) in non-hot paths.

## Pydantic Models vs Dataclasses

### Pydantic Models (pydantic.BaseModel)

Use for:
- Data crossing system boundaries (database, JSON, YAML, CLI)
- Data needing validation (field validators, type coercion)
- Configuration classes (BaseSettings via pydantic-settings)

Pattern:

```python
class TDocMetadata(BaseModel):
    tdoc_id: str = Field(..., description="...")
    meeting_id: int = Field(..., description="...")

    @field_validator("tdoc_id")
    @classmethod
    def _normalize_tdoc_id(cls, value: str) -> str:
        return normalize_tdoc_id(value)

    @field_serializer("agenda_item_nbr")
    def _serialize_agenda_item_nbr(self, value: AgendaItemNumber) -> str: ...
```

Key practices:

- `Field(..., description="...")` on every field
- Validators: `_validate_<field>` or `_normalize_<field>`
- Serializers: `_serialize_<field>`
- `model_config = {"str_strip_whitespace": True}` on config models
- `model_dump(mode="json")` for serialization

### Dataclasses

Use @dataclass for:
- Simple DTOs without validation
- Internal data transfer objects
- Example: PortalCredentials(username: str, password: str)

## CLI Conventions

### Architecture

- **cli/ is thin** -- Only Typer command definitions and Rich formatting.
  All logic belongs in core library modules.
- **Never duplicate core library logic in CLI.**
  See src/tdoc_crawler/cli/AGENTS.md.

### Typer Annotated Pattern

All CLI parameters use `Annotated[type, typer.Option(...)]` in cli/args.py:

```python
# In args.py:
WorkingGroupOption = Annotated[
    list[str] | None,
    typer.Option("--working-group", "-w", help="...", envvar="TDC_WORKING_GROUP"),
]

# In command definition:
def crawl_tdocs(
    working_group: WorkingGroupOption = None,
    limit_tdocs: LimitTDocsOption = None,
) -> None: ...
```

### Rich Output

- console from tdoc_crawler.cli._shared or tdoc_crawler.logging.get_console()
- Rich Tables with TableColumnSpec for structured data
- Rich Markup: [red]Error[/red], [green]OK[/green]
- Progress bars via create_progress_bar() from tdoc_crawler.cli._shared

### Command Registration

Commands registered with app.command("name", rich_help_panel=...).
Aliases use hidden=True:
```python
tdoc_app.command("crawl", rich_help_panel=HELP_PANEL_CRAWLING)(crawl_tdocs)
tdoc_app.command("ct", rich_help_panel=HELP_PANEL_CRAWLING, hidden=True)(crawl_tdocs)
```

## Logging

- **Use get_logger(__name__)** from tdoc_crawler.logging.
- **Logger variable always named _logger.**
- **%-style formatting** for lazy evaluation:
  _logger.info("Found %d items", count).
- **_logger.exception(...)** for exception logging (includes traceback).
- **No logging.basicConfig()** -- root logger configured via configure_logger().
- Verbosity via set_verbosity(level).
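
The conventions end to end, with stdlib `logging.getLogger` standing in for `get_logger` (the patterns are identical):

```python
from __future__ import annotations

import logging

_logger = logging.getLogger(__name__)  # project code uses get_logger(__name__)


def parse_count(raw: str) -> int:
    try:
        count = int(raw)
    except ValueError:
        _logger.exception("Invalid count: %s", raw)  # logs with traceback
        raise
    _logger.info("Parsed %d items", count)  # %-style, lazily formatted
    return count
```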

## Docstrings

- **Google-style** (enforced by ruff: [lint.pydocstyle] convention = "google").
- **Required** for all public functions, classes, and methods (ruff D group).
- **Optional** for private/dunder methods (D100-D107 selectively ignored).
- Module-level docstrings are optional (D100 ignored).

## Ruff Configuration

See ruff.toml:
- target-version = "py314"
- line-length = 160
- preview = true
- Selected rule sets: E, F, C4, C90, D, I, PT, PL, SIM, UP, W, S, ANN, B, NPY
- Tests ignore: S101 (assert), S106, PLR6301, S603, PLW1510

## HTTP Caching

- **All HTTP requests MUST use create_cached_session()** from tdoc_crawler.http_client.
- Cache: SQLite via hishel.
- Session uses niquests (not requests).
- Pool configuration via PoolConfig.
- SSL verification via resolve_ssl_verify().
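
A hypothetical usage sketch -- `create_cached_session()` is the project's factory, but the no-argument call and context-manager use shown here are assumptions:

```python
from __future__ import annotations

from tdoc_crawler.http_client import create_cached_session


def download(url: str) -> bytes:
    # The factory wires up a niquests session backed by a hishel SQLite cache.
    with create_cached_session() as session:
        response = session.get(url)
        response.raise_for_status()
        return response.content
```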