Commit 1f10675f authored by Jan Reimes

refresh: codebase analysis — rewrite 7 architecture docs (stack, integrations, architecture, structure, conventions, testing, concerns)

parent d1973b0c
# Architecture

**Analysis Date:** 2026-04-27

## Pattern Overview

**Overall:** Layered CLI-first monorepo with domain modules and shared infrastructure services.

**Key Characteristics:**

- Command entrypoints are thin Typer adapters that delegate to domain operations in `src/tdoc_crawler/cli/tdoc_app.py`, `src/tdoc_crawler/cli/spec_app.py`, and `packages/3gpp-ai/threegpp_ai/cli.py`.
- Domain logic is split by business area (`tdocs`, `meetings`, `specs`) with consistent `models`/`operations`/`sources` separation under `src/tdoc_crawler/`.
- Persistence and transport are centralized in reusable layers (`src/tdoc_crawler/database/`, `src/tdoc_crawler/http_client/`, `src/tdoc_crawler/config/`).
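
A minimal sketch of the thin-adapter pattern from the first bullet (hypothetical command and operation names; the real handlers live in `src/tdoc_crawler/cli/tdoc_app.py`):

```python
from __future__ import annotations

import typer

app = typer.Typer()


def run_crawl(meeting: str, limit: int | None) -> int:
    """Stand-in for a domain operation such as TDocCrawler.crawl()."""
    # Real operations open the database and dispatch worker tasks here.
    return 0


@app.command("crawl")
def crawl(
    meeting: str = typer.Option(..., "--meeting", "-m"),
    limit: int | None = typer.Option(None, "--limit"),
) -> None:
    # The adapter only parses options and renders output; all crawl
    # logic stays in the domain operation it delegates to.
    count = run_crawl(meeting, limit)
    typer.echo(f"Crawled {count} documents")
```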

## Layers

**CLI Layer:**

- Purpose: Parse options, configure runtime context, and render output.
- Location: `src/tdoc_crawler/cli/`, `packages/3gpp-ai/threegpp_ai/cli/`
- Contains: Typer apps, option aliases, output formatting adapters.
- Depends on: Domain models/operations, config loader, logging.
- Used by: Script entrypoints in `pyproject.toml` (`tdoc-crawler`, `spec-crawler`, `3gpp-ai`).

**Domain Operations Layer:**

- Purpose: Execute crawl/query/checkout workflows.
- Location: `src/tdoc_crawler/tdocs/operations/`, `src/tdoc_crawler/meetings/operations/`, `src/tdoc_crawler/specs/operations/`
- Contains: Orchestrators such as `TDocCrawler`, `MeetingCrawler`, and spec checkout orchestration.
- Depends on: Database facades, source adapters, parsers, utility normalization.
- Used by: CLI layer and package integrations (notably `packages/3gpp-ai/threegpp_ai/operations/`).

**Source/Client Layer:**

- Purpose: Fetch and normalize data from external systems.
- Location: `src/tdoc_crawler/tdocs/sources/`, `src/tdoc_crawler/specs/sources/`, `src/tdoc_crawler/clients/`, `src/tdoc_crawler/parsers/`
- Contains: Portal/WhatTheSpec/doclist source implementations and HTML parsing.
- Depends on: HTTP client and credential resolution.
- Used by: Domain operations.

**Infrastructure Layer:**

- Purpose: Provide shared runtime services (config, HTTP cache/session, logging, worker execution).
- Location: `src/tdoc_crawler/config/`, `src/tdoc_crawler/http_client/`, `src/tdoc_crawler/logging/`, `src/tdoc_crawler/workers/`, `packages/pool_executors/pool_executors/`
- Contains: `ThreeGPPConfig`, `CacheManager`, cached session factory, subinterpreter worker functions.
- Depends on: pydantic-settings, niquests/hishel, pool executor package.
- Used by: CLI and domain operations.

**Persistence Layer:**

- Purpose: Store and query crawler state and metadata.
- Location: `src/tdoc_crawler/database/`
- Contains: `DocDatabase` lifecycle, table management, and typed facades (`TDocDatabase`, `MeetingDatabase`, `SpecDatabase`).
- Depends on: Oxyde async ORM and model definitions in `src/tdoc_crawler/database/oxyde_models.py`.
- Used by: Domain operations and some CLI query paths.

## Data Flow

**TDoc Crawl Flow:**

1. User executes `tdoc-crawler crawl` (`src/tdoc_crawler/cli/tdoc_app.py` routes to `crawl_tdocs` in `src/tdoc_crawler/cli/crawl.py`).
2. CLI builds `TDocCrawlConfig`, opens `TDocDatabase`, and instantiates `TDocCrawler`.
3. `TDocCrawler.crawl()` loads meetings from DB and dispatches per-meeting worker tasks through `pool_executors.create_executor()` in `src/tdoc_crawler/tdocs/operations/crawl.py`.
4. Worker entrypoint `fetch_meeting_document_list_subinterpreter()` in `src/tdoc_crawler/workers/tdoc_worker.py` fetches doclists and returns JSON payloads.
5. Orchestrator normalizes/deduplicates metadata and persists via `TDocDatabase.bulk_upsert_tdocs()`.
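
A minimal sketch of steps 3-5, with the stdlib `ProcessPoolExecutor` standing in for `pool_executors.create_executor()` and a fake worker in place of the subinterpreter entrypoint; the JSON-string boundary mirrors the real flow:

```python
from __future__ import annotations

import json
from concurrent.futures import ProcessPoolExecutor, as_completed


def fetch_meeting_document_list(meeting_id: int) -> str:
    """Stand-in for fetch_meeting_document_list_subinterpreter()."""
    # The real worker fetches and parses a doclist; this fakes one record.
    return json.dumps({"meeting_id": meeting_id, "tdocs": [f"S2-{meeting_id}-0001"]})


def crawl(meeting_ids: list[int]) -> list[dict]:
    records: list[dict] = []
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(fetch_meeting_document_list, m) for m in meeting_ids]
        for future in as_completed(futures):
            payload = json.loads(future.result())  # deserialize at the worker boundary
            for tdoc_id in payload["tdocs"]:
                records.append({"tdoc_id": tdoc_id, "meeting_id": payload["meeting_id"]})
    # The orchestrator would now normalize, deduplicate, and bulk-upsert.
    return records


if __name__ == "__main__":
    print(crawl([129, 130]))
```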

**TDoc Query + On-Demand Fetch Flow:**

1. User executes `tdoc-crawler query` handled by `query_tdocs` in `src/tdoc_crawler/cli/query.py`.
2. CLI queries `TDocDatabase.query_tdocs()` with `TDocQueryConfig`.
3. Missing IDs are resolved by `fetch_missing_tdocs()` in `src/tdoc_crawler/tdocs/operations/fetch.py` using source strategy/fallback.
4. Output is rendered through `src/tdoc_crawler/cli/printing.py` and `src/tdoc_crawler/cli/formatting.py`.
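
The fallback in step 3 reduces to a query-then-fetch shape; a sketch with a plain dict standing in for the database and a caller-supplied fetch function (the real signatures in `fetch.py` differ):

```python
from __future__ import annotations

from collections.abc import Callable


def query_with_fallback(
    db: dict[str, dict],
    requested_ids: list[str],
    fetch_missing: Callable[[list[str]], list[dict]],
) -> list[dict]:
    found = [db[tdoc_id] for tdoc_id in requested_ids if tdoc_id in db]
    missing = [tdoc_id for tdoc_id in requested_ids if tdoc_id not in db]
    if missing:
        # fetch_missing_tdocs() tries each configured source in turn
        # and persists whatever it resolves.
        fetched = fetch_missing(missing)
        db.update({row["tdoc_id"]: row for row in fetched})
        found.extend(fetched)
    return found
```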

**State Management:**

- Runtime state is file-backed and config-driven (`PathConfig` in `src/tdoc_crawler/config/settings.py`).
- Shared mutable runtime objects are minimized; DB and HTTP sessions are short-lived context-managed instances.
- Parallel crawl state exchange uses serialized JSON payloads between worker boundaries.

## Key Abstractions

**Configuration Abstraction (`ThreeGPPConfig` + `CacheManager`):**

- Purpose: Centralize config loading and path resolution.
- Examples: `src/tdoc_crawler/config/settings.py`, `src/tdoc_crawler/config/cache_manager.py`
- Pattern: Pydantic settings model + registered runtime path manager.
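
A minimal pydantic-settings sketch of this pattern (field names are illustrative, not the actual `ThreeGPPConfig` schema):

```python
from __future__ import annotations

from pathlib import Path

from pydantic_settings import BaseSettings, SettingsConfigDict


class PathSettings(BaseSettings):
    """Illustrative settings model; values may also come from the environment."""

    model_config = SettingsConfigDict(env_prefix="TDC_")

    cache_dir: Path = Path("~/.3gpp-crawler/cache")
    db_file: Path = Path("~/.3gpp-crawler/tdocs.db")

    def resolved_cache_dir(self) -> Path:
        # Paths are expanded and resolved before use.
        return self.cache_dir.expanduser().resolve()
```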

**Source Abstraction (`TDocSource` protocol):**

- Purpose: Hide source-specific fetch details behind a common interface.
- Examples: `src/tdoc_crawler/tdocs/sources/base.py`, `src/tdoc_crawler/tdocs/sources/portal.py`, `src/tdoc_crawler/tdocs/sources/whatthespec.py`
- Pattern: Protocol-driven adapters selected by fetch orchestrators.
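
A sketch of protocol-driven adapter selection (method name and return shape are assumptions; the actual interface is defined in `sources/base.py`):

```python
from __future__ import annotations

from typing import Protocol


class TDocSource(Protocol):
    def fetch_tdoc(self, tdoc_id: str) -> dict | None:
        """Return metadata for one TDoc, or None if this source lacks it."""
        ...


class PortalSource:
    def fetch_tdoc(self, tdoc_id: str) -> dict | None:
        return {"tdoc_id": tdoc_id, "source": "portal"}


class WhatTheSpecSource:
    def fetch_tdoc(self, tdoc_id: str) -> dict | None:
        return None  # pretend this source cannot resolve the ID


def fetch_with_fallback(sources: list[TDocSource], tdoc_id: str) -> dict | None:
    # The orchestrator walks the configured sources until one answers.
    for source in sources:
        if (result := source.fetch_tdoc(tdoc_id)) is not None:
            return result
    return None
```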

**Database Facade Abstraction:**

- Purpose: Expose domain-friendly methods over Oxyde models and SQL lifecycle.
- Examples: `src/tdoc_crawler/database/base.py`, `src/tdoc_crawler/database/tdocs.py`, `src/tdoc_crawler/database/specs.py`
- Pattern: Async facade classes inheriting shared lifecycle behavior.
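
A simplified sketch of the facade shape (stdlib sqlite3 stands in for the Oxyde ORM, and the schema is invented for illustration):

```python
from __future__ import annotations

import sqlite3
from pathlib import Path
from typing import Self


class DatabaseError(RuntimeError):
    pass


class DocDatabase:
    """Shared lifecycle behavior inherited by the typed facades."""

    def __init__(self, db_file: Path) -> None:
        self.db_file = db_file
        self._conn: sqlite3.Connection | None = None

    async def __aenter__(self) -> Self:
        self._conn = sqlite3.connect(self.db_file)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS tdocs (tdoc_id TEXT PRIMARY KEY, title TEXT)"
        )
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        if self._conn is not None:
            self._conn.close()

    @property
    def conn(self) -> sqlite3.Connection:
        if self._conn is None:
            raise DatabaseError("connection not open")
        return self._conn


class TDocDatabase(DocDatabase):
    async def bulk_upsert_tdocs(self, rows: list[tuple[str, str]]) -> None:
        # Domain-friendly method layered over the raw connection.
        self.conn.executemany("INSERT OR REPLACE INTO tdocs VALUES (?, ?)", rows)
        self.conn.commit()
```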

## Entry Points

**TDoc/Meeting CLI:**

- Location: `src/tdoc_crawler/cli/tdoc_app.py`
- Triggers: `tdoc-crawler` script in root `pyproject.toml` and `python -m tdoc_crawler` via `src/tdoc_crawler/__main__.py`
- Responsibilities: Register command groups, initialize config/cache manager, dispatch to crawl/query/open/checkout paths.

**Spec CLI:**

- Location: `src/tdoc_crawler/cli/spec_app.py`
- Triggers: `spec-crawler` script in root `pyproject.toml`
- Responsibilities: Spec crawl/query/checkout/open workflows.

**AI Extension CLI:**

- Location: `packages/3gpp-ai/threegpp_ai/cli.py`
- Triggers: `3gpp-ai` script in `packages/3gpp-ai/pyproject.toml`
- Responsibilities: Workspace/document AI workflows reusing core crawler storage/query components.

## Error Handling

**Strategy:** Boundary-level exception handling with typed domain errors and CLI-friendly exit behavior.

**Patterns:**

- Database lifecycle wraps failures in `DatabaseError` in `src/tdoc_crawler/database/base.py`.
- Source/client fetch paths catch transport and parse exceptions and either return `None` or aggregate error messages (`src/tdoc_crawler/tdocs/operations/fetch.py`, `src/tdoc_crawler/clients/portal.py`).
- CLI commands convert validation/runtime failures to `typer.Exit` with user-facing Rich output (`src/tdoc_crawler/cli/config_app.py`, `src/tdoc_crawler/cli/query.py`).
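
A sketch of the boundary conversion from the last bullet (hypothetical command body; the real handlers live in `src/tdoc_crawler/cli/query.py`):

```python
from __future__ import annotations

import typer
from rich.console import Console

console = Console()


class DatabaseError(RuntimeError):
    pass


def query_command(tdoc_id: str) -> None:
    try:
        if not tdoc_id:
            raise DatabaseError("empty TDoc id")
        console.print(f"[green]OK[/green] {tdoc_id}")
    except DatabaseError as exc:
        # Typed domain errors become user-facing Rich output plus a
        # non-zero exit code instead of a traceback.
        console.print(f"[red]Error:[/red] {exc}")
        raise typer.Exit(code=1) from exc
```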

## Cross-Cutting Concerns

- **Logging:** `tdoc_crawler.logging` logger setup is consumed across core and AI package modules.
- **Validation:** Pydantic models/settings validate CLI inputs, config, and metadata schemas.
- **Authentication:** Credentials are resolved via `src/tdoc_crawler/credentials.py`; authenticated portal flows run through `PortalClient`.

---

*Architecture analysis: 2026-04-27*
# CONVENTIONS

Code style, naming, error handling, imports, CLI patterns, logging, docstrings, and
modelling conventions for the 3gpp-crawler project.

## Python Version

Python 3.14+ only (`requires-python = ">=3.14,<4.0"`).
Code can use all 3.14 features (`T | None` is native and preferred).
## Code Style

### Type Hints

- **Use T | None, never Optional[T].** The project uses Python 3.14 and
  `from __future__ import annotations` in all files, so `str | None` is universal.
- **Use list[X], dict[K, V], tuple[X, ...]** from builtins -- never
  typing.List, typing.Dict, etc.
- **Use Self** for return type annotations of `__aenter__` and factory methods
  (`from typing import Self`).
- **Use TYPE_CHECKING guards** for expensive type-only imports
  (e.g., from collections.abc import Iterable, from pathlib import Path).
  NOTE: TYPE_CHECKING is acceptable here (unlike the older tdoc-crawler project).
- **Avoid Any where possible.** Narrow types. Use cast() from typing
  when the type checker cannot infer.
- **Function return types are always annotated.**
  No bare `def foo()` without `-> ReturnType`.
- **Parameters use Annotated[type, ...]** for Typer CLI options/arguments.
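
Most of these rules in one illustrative fragment (names are invented):

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Self

if TYPE_CHECKING:  # expensive, type-only imports
    from collections.abc import Iterable
    from pathlib import Path


class Checkout:
    def __init__(self, checkout_dir: Path | None = None) -> None:
        self.checkout_dir = checkout_dir

    async def __aenter__(self) -> Self:
        return self

    async def __aexit__(self, *exc_info: object) -> None: ...

    def add_files(self, files: Iterable[Path]) -> list[str]:
        # Builtin generics (list[str]), never typing.List.
        return [str(file) for file in files]
```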

### f-strings

- **Prefer f-strings** over % formatting or .format().
  Logging exception: `_logger.exception("Failed: %s", var)`
  (lazy evaluation in stdlib logging).
- No str() calls where an f-string would do.

### pathlib

- **Use pathlib.Path, never os.path.**
  No os.path.join(), os.path.exists(), etc.
  Directories: path.mkdir(parents=True, exist_ok=True).
- Paths resolved with .resolve() and expanded with .expanduser().
- ~ expansion: Path("~/.3gpp-crawler").expanduser(), not os.path.expanduser.
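
The same rules as a short fragment (the cache location is illustrative), which also shows the `_dir`/`_file` suffix convention described below:

```python
from __future__ import annotations

from pathlib import Path

cache_dir = Path("~/.3gpp-crawler/cache").expanduser().resolve()
cache_dir.mkdir(parents=True, exist_ok=True)  # never os.makedirs
http_cache_file = cache_dir / "http_cache.sqlite"  # '/' join, not os.path.join
```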

### Naming

- **Snake case everywhere:** snake_case for variables, functions, methods, modules.
- **PascalCase for classes:** Pydantic models, database classes, custom exceptions.
- **UPPER_CASE for module-level constants.**
- **Private methods/functions** prefixed with `_`
  (e.g., `_normalize_tdoc_id`, `_ensure_tables_exist`).

#### Path Variable Suffixes (CRITICAL)

Variables holding file paths MUST use a `_file` suffix;
variables holding directory paths MUST use a `_dir` suffix. Examples:

- `db_file: Path` -- path to a database file
- `cache_dir: Path` -- path to a cache directory
- `checkout_dir: Path` -- path to a checkout directory
- `config_file: Path | None` -- path to a config file
- `http_cache_file: Path` -- path to an HTTP cache file
- `output_dir: Path` -- path to an output directory

Applies to all variables: function parameters, locals, and dataclass/pydantic fields.

#### Test Naming

Tests follow the `test_<scenario>` pattern for functions.
Test classes use `Test<ClassName>` or `Test<Feature>`:

```python
class TestTDocDatabase:
    async def test_upsert_tdoc(self) -> None: ...
    async def test_case_insensitive_query(self) -> None: ...
```

### Code Size Limits (Soft)

| Scope | Limit |
|-------|-------|
| Module | < 250 lines |
| Function | < 75 lines |
| Class | < 200 lines |

## Error Handling

### Exception Hierarchy

- **DatabaseError(RuntimeError)** -- Base exception for database failures.
  Uses `code: str` and optional `detail: str`.
  Factory classmethods like `DatabaseError.connection_not_open()`.
- **CacheManagerNotRegisteredError(RuntimeError)** -- Singleton not yet registered.
- **NormalizationError(ValueError)** -- Input normalization failures.
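
A sketch of the factory-classmethod style (the message text is invented; fields per the description above):

```python
from __future__ import annotations

from typing import Self


class DatabaseError(RuntimeError):
    """Base exception for database failures."""

    def __init__(self, code: str, detail: str | None = None) -> None:
        self.code = code
        self.detail = detail
        super().__init__(f"{code}: {detail}" if detail else code)

    @classmethod
    def connection_not_open(cls) -> Self:
        return cls("connection_not_open", "open the database with 'async with' first")
```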

### Patterns

- **Raise specific exceptions.** Never raise Exception("...").
  Each domain has its own exception class.
- **No bare except:**.
- **raise ... from exc** for exception chaining in context managers.
- **Let it fail:** Do not defensively wrap resolve_cache_manager().
  If not registered, that is a dev error.
- **pytest.raises(SomeError, match="...")** for asserting specific errors.

### Return Type Discipline

- **Avoid inconsistent return types.** A function returns X | None,
  not sometimes X, sometimes str, sometimes None.
  See docs/development.md for the documented antipattern.
- **Raise, don't return error strings.**
  Return None for "not found"; raise for invalid input.
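
For example (illustrative lookup):

```python
from __future__ import annotations


def find_tdoc(db: dict[str, dict], tdoc_id: str) -> dict | None:
    if not tdoc_id.strip():
        raise ValueError("tdoc_id must be non-empty")  # invalid input: raise
    return db.get(tdoc_id)  # not found: return None, never an error string
```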

## Imports

### Order

1. `from __future__ import annotations` (always first, mandatory)
2. Standard library
3. Third-party (pydantic, typer, pytest, etc.)
4. First-party: `tdoc_crawler.*`
5. Local (relative imports only in tests: `from .conftest import ...`)

### Style

- **Absolute imports** within tdoc_crawler. No relative imports in src/.
- **Never `import *`** -- the project uses explicit `__all__` in every module.
- **Prefer importing classes/functions over modules:**
  `from tdoc_crawler.config import CacheManager`.
- **No TYPE_CHECKING for circular import workarounds.**
  Extract shared types to models/ instead.
- **TYPE_CHECKING acceptable** for expensive type-only imports
  (Iterable, Path, Version) in non-hot paths.

## Pydantic Models vs Dataclasses

### Pydantic Models (pydantic.BaseModel)

Use for:
- Data crossing system boundaries (database, JSON, YAML, CLI)
- Data needing validation (field validators, type coercion)
- Configuration classes (BaseSettings via pydantic-settings)

Pattern:

```python
class TDocMetadata(BaseModel):
    tdoc_id: str = Field(..., description="...")
    meeting_id: int = Field(..., description="...")

    @field_validator("tdoc_id")
    @classmethod
    def _normalize_tdoc_id(cls, value: str) -> str:
        return normalize_tdoc_id(value)

    @field_serializer("agenda_item_nbr")
    def _serialize_agenda_item_nbr(self, value: AgendaItemNumber) -> str: ...
```

Key practices:

- `Field(..., description="...")` on every field
- Validators: `_validate_<field>` or `_normalize_<field>`
- Serializers: `_serialize_<field>`
- `model_config = {"str_strip_whitespace": True}` on config models
- `model_dump(mode="json")` for serialization

### Dataclasses

Use @dataclass for:
- Simple DTOs without validation
- Internal data transfer objects
- Example: PortalCredentials(username: str, password: str)

## CLI Conventions

### Architecture

- **cli/ is thin** -- Only Typer command definitions and Rich formatting.
  All logic belongs in core library modules.
- **Never duplicate core library logic in CLI.**
  See src/tdoc_crawler/cli/AGENTS.md.

### Typer Annotated Pattern

All CLI parameters use `Annotated[type, typer.Option(...)]` in cli/args.py:

```python
# In args.py:
WorkingGroupOption = Annotated[
    list[str] | None,
    typer.Option("--working-group", "-w", help="...", envvar="TDC_WORKING_GROUP"),
]

# In command definition:
def crawl_tdocs(
    working_group: WorkingGroupOption = None,
    limit_tdocs: LimitTDocsOption = None,
) -> None: ...
```

### Rich Output

- console from tdoc_crawler.cli._shared or tdoc_crawler.logging.get_console()
- Rich Tables with TableColumnSpec for structured data
- Rich Markup: [red]Error[/red], [green]OK[/green]
- Progress bars via create_progress_bar() from tdoc_crawler.cli._shared

### Command Registration

Commands registered with app.command("name", rich_help_panel=...).
Aliases use hidden=True:
```python
tdoc_app.command("crawl", rich_help_panel=HELP_PANEL_CRAWLING)(crawl_tdocs)
tdoc_app.command("ct", rich_help_panel=HELP_PANEL_CRAWLING, hidden=True)(crawl_tdocs)
```

## Logging

- **Use get_logger(__name__)** from tdoc_crawler.logging.
- **Logger variable always named _logger.**
- **%-style formatting** for lazy evaluation:
  _logger.info("Found %d items", count).
- **_logger.exception(...)** for exception logging (includes traceback).
- **No logging.basicConfig()** -- root logger configured via configure_logger().
- Verbosity via set_verbosity(level).
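
The conventions end to end, with stdlib `logging.getLogger` standing in for `get_logger` (the patterns are identical):

```python
from __future__ import annotations

import logging

_logger = logging.getLogger(__name__)  # project code uses get_logger(__name__)


def parse_count(raw: str) -> int:
    try:
        count = int(raw)
    except ValueError:
        _logger.exception("Invalid count: %s", raw)  # logs with traceback
        raise
    _logger.info("Parsed %d items", count)  # %-style, lazily formatted
    return count
```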

## Docstrings

- **Google-style** (enforced by ruff: [lint.pydocstyle] convention = "google").
- **Required** for all public functions, classes, and methods (ruff D group).
- **Optional** for private/dunder methods (D100-D107 selectively ignored).
- Module-level docstrings are optional (D100 ignored).

## Ruff Configuration

See ruff.toml:
- target-version = "py314"
- line-length = 160
- preview = true
- Selected rule sets: E, F, C4, C90, D, I, PT, PL, SIM, UP, W, S, ANN, B, NPY
- Tests ignore: S101 (assert), S106, PLR6301, S603, PLW1510

## HTTP Caching

- **All HTTP requests MUST use create_cached_session()** from tdoc_crawler.http_client.
- Cache: SQLite via hishel.
- Session uses niquests (not requests).
- Pool configuration via PoolConfig.
- SSL verification via resolve_ssl_verify().
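
A hypothetical usage sketch -- `create_cached_session()` is the project's factory, but the no-argument call and context-manager use shown here are assumptions:

```python
from __future__ import annotations

from tdoc_crawler.http_client import create_cached_session


def download(url: str) -> bytes:
    # The factory wires up a niquests session backed by a hishel SQLite cache.
    with create_cached_session() as session:
        response = session.get(url)
        response.raise_for_status()
        return response.content
```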