Commit a575d180 authored by Jan Reimes

📝 docs: refresh codebase map (7 documents)

parent d36f6bf8
+239 −87

File changed.

Preview size limit exceeded, changes collapsed.

+186 −117

File changed.

Preview size limit exceeded, changes collapsed.

+150 −60
# Coding Conventions

**Analysis Date:** 2026-04-27
**Analysis Date:** 2026-05-03

## Naming Patterns

**Files:**

- `snake_case.py` for source modules (`src/tdoc_crawler/http_client/session.py`, `src/tdoc_crawler/config/settings.py`)
- `test_*.py` for tests (`tests/test_crawler.py`, `tests/test_http_client.py`)
- package `__init__.py` files used for controlled re-exports (`src/tdoc_crawler/http_client/__init__.py`)
- Snake_case for Python modules: `cache_manager.py`, `crawl.py`, `date_parser.py`
- Test files prefixed with `test_`: `test_models.py`, `test_database.py`
- Sub-packages as directories with `__init__.py`

**Functions:**

- `snake_case` for functions and methods (`create_cached_session`, `resolve_ssl_verify`, `test_crawl_collects_tdocs`)
- private helpers prefixed with `_` (`_load_tdoc_metadata` in `src/tdoc_crawler/cli/tdoc_app.py`)
- snake_case for all functions: `normalize_tdoc_id()`, `resolve_cache_manager()`
- Private helpers prefixed with single underscore: `_parse_date()`, `_build_scope_description()`
- Field validators use `_validate_` prefix: `_validate_status()`, `_validate_agenda_item_nbr()`
- Field serializers use `_serialize_` prefix: `_serialize_status()`

**Variables:**

- `snake_case` locals and parameters (`test_cache_dir`, `http_cache_file`, `pool_config`)
- constants in `UPPER_SNAKE_CASE` (`DEFAULT_LEVEL`, `WORKSPACE_REGISTRY_FILENAME`)
- Path variables use suffix convention: `cache_dir`, `db_file`, `checkout_dir`, `http_cache_file` (never `dir_cache`, `file_db`)
- Module-level private defaults use `_UPPER_SNAKE` prefix: `_DEFAULT_CACHE_DIR_STR`, `_DEFAULT_DATABASE_FILENAME`
- Regex patterns use `UPPER_SNAKE`: `TDOC_PATTERN`, `DATE_PATTERN`, `_DOTTED_BODY_PATTERN`
- Constants use `UPPER_SNAKE`: `DEFAULT_LEVEL`, `BROWSER_HEADERS`, `TDOC_SUBDIRS`

**Types:**
- Pipe syntax for unions: `str | None`, `Path | str | None`; **never** `Optional[T]` or `Union[T, None]`
- Type aliases at module level: `StructuredData = dict | list | str | int | float | bool | None`
- Pydantic models use PascalCase: `TDocMetadata`, `MeetingCrawlConfig`, `ThreeGPPConfig`
- Enums use PascalCase class names with uppercase members: `WorkingGroup.RAN`, `TDocStatus.AGREED`
- `Final` for constants: `DEFAULT_LEVEL: Final[int] = logging.WARNING`

- `PascalCase` for classes/dataclasses (`PathConfig`, `HttpConfig`, `PoolConfig`, `TestTDocCrawler`)
**CLI Option Types:**
- Typer options defined as module-level `Annotated` aliases in `src/tdoc_crawler/cli/args.py`
- Naming convention: `{Name}Option`, `{Name}Argument` (PascalCase): `WorkingGroupOption`, `TDocIdsArgument`, `ForceOption`

## Code Style

**Formatting:**

- Tool: Ruff formatter/linter via `ruff.toml`
- Key settings:
  - `target-version = "py314"`
  - `line-length = 160`
  - `fix = true`
  - `unsafe-fixes = true`
  - `docstring-code-format = true`
- Ruff formatter (replaces Black)
- Line length: 160 characters
- Target: Python 3.14
- Docstring code blocks: also 160 chars line length

**Linting:**

- Tool: Ruff (`[lint]` in `ruff.toml`)
- Key rules:
  - broad selection includes `E`, `F`, `I`, `D`, `PT`, `PL`, `ANN`, `B`, `S`
  - import-at-top enforced by `E402` and `PLC0415` (with explicit per-file exceptions)
  - Google docstring convention via `[lint.pydocstyle] convention = "google"`
  - tests are exempt from the assertion, hardcoded-secret, and subprocess rules via per-file ignores for `tests/**/*.py` in `ruff.toml`
- Ruff with preview rules enabled
- Key rule categories: E, F, C4, C90, D, I, PT, PL, SIM, UP, W, S, ANN, B, NPY
- Extended: RUF022 (sorted `__all__`), PLR6301 (no-self-use), PLC0415, E402
- Docstring convention: Google style (`[lint.pydocstyle] convention = "google"`)

**Key Ignored Rules (intentional):**
- D100–D107 (module/class/function docstrings not required)
- PLR0913 (many parameters allowed for config models)
- ANN401 (dynamically typed expressions allowed)
- C901 (complex structure — mccabe)

**Forbidden Patterns (non-negotiable):**
- PLC0415 (`import-outside-top-level`) — **NEVER** introduce inline imports
- No `# noqa` comments to suppress legitimate linting issues
- No `TYPE_CHECKING` guards
- No `.format()` or `%` string formatting — **f-strings only**
- No hardcoded paths like `~/.3gpp-crawler` — use `CacheManager`
- No hardcoded string values — use existing enums/structures from `constants/`, `models/`, or `ConfigEnvVar`

## Import Organization

**Order:**

1. future import (`from __future__ import annotations`)
2. standard library imports
3. third-party imports
4. local package imports (`tdoc_crawler.*`)

**Order (enforced by Ruff isort):**
1. Standard library: `from __future__ import annotations`, then `pathlib`, `re`, `asyncio`, etc.
2. Third-party: `pydantic`, `typer`, `rich`, `pytest`, etc.
3. Local application: `from tdoc_crawler.config import ...`, `from tdoc_crawler.models import ...`

**Pattern:**
```python
from __future__ import annotations

from pathlib import Path
from typing import Annotated, Final

import typer
from pydantic import BaseModel, Field

from tdoc_crawler.config import CacheManager
from tdoc_crawler.logging import get_logger
```

**Path Aliases:**
- No path aliases configured (no `tsconfig`-style paths)
- All imports use full package path: `from tdoc_crawler.cli.args import ...`
- Workspace packages imported by name: `from pool_executors.pool_executors import ...`

- Not detected; repository uses package-qualified absolute imports (`from tdoc_crawler...`) rather than alias mapping.
**`__all__` exports:**
- Every module defines `__all__` as a list of public names
- Ruff RUF022 enforces sorted `__all__`

## Error Handling

**Patterns:**

- boundary failures are converted to typed exits in CLI (`raise typer.Exit(code=1) from exc` in `src/tdoc_crawler/cli/tdoc_app.py`)
- validation errors use explicit exceptions (for example unsupported URL scheme in `download_to_file` in `src/tdoc_crawler/http_client/session.py`)
- tests assert concrete failure paths with `pytest.raises(...)` (`tests/test_checkout.py`, `tests/test_cache_manager.py`)
- Custom exceptions inherit from domain-appropriate base: `CacheManagerNotRegisteredError(RuntimeError)`, `NormalizationError(ValueError)`, `ConfigLoadError(Exception)`
- Raise with `from` for exception chaining: `raise ValueError(f"...") from exc` (see `src/tdoc_crawler/utils/normalization.py`)
- Database errors wrapped in `DatabaseError`
- Never silently swallow errors; let `CacheManagerNotRegisteredError` propagate when no manager is registered
- Pydantic validators return normalized values or raise `ValueError`
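
A minimal sketch of the exception conventions above: a domain error that inherits from `ValueError`, raised with `from` so the original cause stays attached. `parse_meeting_number` is a hypothetical helper introduced only for this example; it is not a function from the repository.

```python
class NormalizationError(ValueError):
    """Raised when an identifier cannot be normalized."""


def parse_meeting_number(raw: str) -> int:
    """Hypothetical helper: convert a string such as ' 112 ' to an int."""
    try:
        return int(raw.strip())
    except ValueError as exc:
        # Chain the original error so the traceback keeps the root cause.
        raise NormalizationError(f"Invalid meeting number: {raw!r}") from exc
```

Because the custom error subclasses `ValueError`, callers that already catch `ValueError` keep working, while the chained `__cause__` preserves the low-level failure for debugging.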

**Anti-patterns:**
```python
# ❌ WRONG — defensive fallback for CacheManager
try:
    manager = resolve_cache_manager()
except CacheManagerNotRegisteredError:
    manager = CacheManager(default_cache_dir).register()

# ✅ CORRECT — let it fail, registration is a dev concern
manager = resolve_cache_manager()
```

## Logging

**Framework:** `logging` with Rich integration (`src/tdoc_crawler/logging/__init__.py`)
**Framework:** Standard `logging` with Rich handler via `src/tdoc_crawler/logging/__init__.py`

**Logger acquisition:**
```python
from tdoc_crawler.logging import get_logger

_logger = get_logger(__name__)
```

**Patterns:**
- module logger retrieval via `get_logger(__name__)`
- root logger configured once (`configure_logger`) with `RichHandler`
- verbosity controlled centrally via `set_verbosity(...)`, used by CLI command handlers in `src/tdoc_crawler/cli/tdoc_app.py`
- Module-level logger: `_logger = get_logger(__name__)` at top of module
- Use `_logger.debug()`, `_logger.info()`, `_logger.warning()` etc.
- Rich markup supported in log messages
- Verbosity set via `set_verbosity(level)` at CLI entry points
- Default level: `WARNING` (constant `DEFAULT_LEVEL`)

## Comments

**When to Comment:**

- comments are used to explain intent or behavior boundaries, especially in tests and adapters (for example niquests/hishel bridge rationale in `src/tdoc_crawler/http_client/session.py`)
- Google-style docstrings for all public classes and functions
- Module-level docstring in every `.py` file
- Docstrings include `Args:`, `Returns:`, `Raises:` sections
- Include `Examples:` section for utility functions

**JSDoc/TSDoc:**

- Not applicable (Python codebase). Python docstrings are standard and widely used in source and tests (`tests/conftest.py`, `src/tdoc_crawler/config/settings.py`).
- Not applicable (Python project)
- Pydantic `Field(description=...)` serves as inline documentation for model fields

## Function Design

**Size:**
**Size:** Functions range from single-line helpers to ~50 line methods. Complex operations broken into helper functions.

- mixed; module functions are typically medium-sized, while command handlers in `src/tdoc_crawler/cli/tdoc_app.py` can be larger and orchestrate nested helpers.

**Parameters:**

- explicit type hints are common in both source and tests (`Path | None`, `list[TDocMetadata]`, `requests.Session | None`)
- config-heavy functions use keyword-friendly optional parameters (for example `create_cached_session(...)`)
**Parameters:** Use Pydantic config models (e.g., `TDocCrawlConfig`, `TDocQueryConfig`) for functions with many parameters rather than long parameter lists.

**Return Values:**

- typed return annotations are standard (`-> requests.Session`, `-> bool`, `-> list[TDocMetadata]`)
- async database workflows return domain models and are usually wrapped in `async with` contexts (`tests/test_crawler.py`, `tests/conftest.py`)
- Pydantic models for structured data: `TDocMetadata`, `MeetingMetadata`
- Result dataclasses/tuples for operation outcomes: `TDocCrawlResult(processed, inserted, updated, errors)`
- `bool` for success/failure where appropriate
- `None` for void operations
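
As an illustration of a result dataclass for operation outcomes, the sketch below uses the four field names quoted above. The class name `CrawlResultSketch`, the field types, and the `ok` property are assumptions; the repository's `TDocCrawlResult` may be defined differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CrawlResultSketch:
    """Outcome of one crawl run; field types are assumed, not confirmed."""

    processed: int
    inserted: int
    updated: int
    errors: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        """Hypothetical convenience: True when no errors were recorded."""
        return not self.errors
```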

## Module Design

**Exports:**

- controlled exports via `__all__` are used for package-facing APIs (`src/tdoc_crawler/http_client/__init__.py`, `src/tdoc_crawler/http_client/session.py`)
- Every module defines `__all__` with all public names
- Package `__init__.py` re-exports from submodules for convenient imports
- Example: `from tdoc_crawler.config import CacheManager, resolve_cache_manager`

**Barrel Files:**
- Used selectively in `__init__.py` files for major subpackages
- Domain packages (`tdocs/`, `meetings/`, `specs/`) do **not** re-export operations to avoid circular imports

**CLI thin layer:**
- `src/tdoc_crawler/cli/` contains only Typer/Rich wrappers
- All domain logic belongs in `src/tdoc_crawler/<domain>/`
- CLI commands call core library functions; never implement business logic in CLI

## Async Patterns

- Database operations are async: `async with TDocDatabase(db_file) as db:`
- CLI bridges sync→async with `asyncio.run()`: `result = asyncio.run(run_tdoc_crawl())`
- Async context managers for database connections
- `@pytest.mark.asyncio` for async test methods
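
The sync-to-async bridge described above can be sketched as follows. `InMemoryDatabase`, `run_demo_crawl`, and `crawl_command` are hypothetical stand-ins that replace the real `TDocDatabase` and CLI handler so the example runs without the project installed.

```python
import asyncio


class InMemoryDatabase:
    """Hypothetical stand-in for an async database client."""

    def __init__(self) -> None:
        self.is_open = False
        self._rows: list[str] = []

    async def __aenter__(self) -> "InMemoryDatabase":
        self.is_open = True
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Always release the connection, even when the body raised.
        self.is_open = False

    async def insert(self, tdoc_id: str) -> None:
        self._rows.append(tdoc_id)

    async def count(self) -> int:
        return len(self._rows)


async def run_demo_crawl(tdoc_ids: list[str]) -> int:
    """Async core: all database work happens inside the context manager."""
    async with InMemoryDatabase() as db:
        for tdoc_id in tdoc_ids:
            await db.insert(tdoc_id)
        return await db.count()


def crawl_command(tdoc_ids: list[str]) -> int:
    """Synchronous entry point, as a CLI handler would be."""
    return asyncio.run(run_demo_crawl(tdoc_ids))
```

Keeping `asyncio.run()` at the outermost synchronous layer means the domain code stays fully async and can be awaited directly from async tests.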

## Configuration Patterns

**Config precedence (highest to lowest):**
1. CLI explicit `--config` parameter
2. Discovered config files (later overrides earlier)
3. Environment variables (`TDC_*`, `TDC_EOL_*`, `TDC_CRAWL_*`, `HTTP_CACHE_*`)
4. Hard-coded defaults

**Environment variables:**
- Defined as `ConfigEnvVar` enum in `src/tdoc_crawler/config/env_vars.py`
- CLI options reference env vars via `ConfigEnvVar.TDC_WORKING_GROUP.name`
- pydantic-settings `AliasChoices` maps env vars to config fields

**Singleton pattern:**
- `CacheManager` uses class-level `_instance` singleton
- Registered once at CLI startup via `CacheManager(cache_dir).register()`
- Resolved everywhere via `resolve_cache_manager()`
- Reset in tests via `_reset_cache_manager` autouse fixture
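
A minimal sketch of the register/resolve singleton described above. `CacheManagerSketch` and its `reset()` classmethod are illustrative; the real `CacheManager` has more responsibilities and its reset mechanism (the `_reset_cache_manager` fixture) may work differently.

```python
from pathlib import Path


class CacheManagerNotRegisteredError(RuntimeError):
    """Raised when the manager is resolved before registration."""


class CacheManagerSketch:
    """Illustrative singleton holder; not the project's CacheManager."""

    _instance: "CacheManagerSketch | None" = None

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir

    def register(self) -> "CacheManagerSketch":
        """Store this instance at class level and return it for chaining."""
        type(self)._instance = self
        return self

    @classmethod
    def reset(cls) -> None:
        """Hypothetical hook a test fixture could call between tests."""
        cls._instance = None


def resolve_cache_manager() -> CacheManagerSketch:
    """Return the registered manager or fail loudly; no silent fallback."""
    manager = CacheManagerSketch._instance
    if manager is None:
        raise CacheManagerNotRegisteredError("register() has not been called")
    return manager
```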

## Pre-commit Hooks

- used selectively for compatibility and API clarity (for example `src/tdoc_crawler/http_client/__init__.py` re-exports session-level functions and types)
**Config:** `.pre-commit-config.yaml`
- `ruff-check` with `--exit-non-zero-on-fix`
- `ruff-format`
- `undersort` (method visibility ordering: public → protected → private)
- Standard hooks: check-toml, check-yaml, check-json, end-of-file-fixer, trailing-whitespace

---

*Convention analysis: 2026-04-27*
*Convention analysis: 2026-05-03*
+118 −67
# External Integrations

**Analysis Date:** 2026-04-27
**Analysis Date:** 2026-05-03

## APIs & External Services

**3GPP public endpoints:**

- 3GPP meetings/spec pages (`www.3gpp.org`) - Meeting/spec metadata lookups
  - SDK/Client: `niquests` sessions created by `create_cached_session()` in `src/tdoc_crawler/http_client/session.py`
  - Auth: None
- 3GPP FTP spec archive (`https://www.3gpp.org/ftp/Specs/archive/...`) - Spec file downloads
  - SDK/Client: URL templates in `src/tdoc_crawler/constants/urls.py`, downloads through `src/tdoc_crawler/specs/downloads.py`
  - Auth: None

**3GPP portal endpoints:**

- Portal login and TDoc endpoints (`portal.3gpp.org`) - Authenticated fallback for TDoc metadata
  - SDK/Client: `PortalClient` in `src/tdoc_crawler/clients/portal.py`
  - Auth: `TDC_EOL_USERNAME`, `TDC_EOL_PASSWORD` (mapped in `src/tdoc_crawler/config/env_vars.py`)
- Meeting document list endpoint (`GenerateDocumentList.aspx`) - Unauthenticated Excel document list fetch
  - SDK/Client: `src/tdoc_crawler/tdocs/sources/doclist.py`
  - Auth: None

**Community metadata API:**

- WhatTheSpec (`whatthespec.net`) - Preferred unauthenticated TDoc/spec metadata source and fallback path
  - SDK/Client: `src/tdoc_crawler/tdocs/sources/whatthespec.py`, `src/tdoc_crawler/specs/sources/whatthespec.py`
  - Auth: None

**AI and conversion services:**

- LLM providers through LiteLLM - Summarization/figure description/completions
  - SDK/Client: `litellm` via `packages/3gpp-ai/threegpp_ai/operations/llm_client.py`
  - Auth: `TDC_AI_LLM_API_KEY` or provider-specific API key env vars
- Remote Office-to-PDF conversion service (`pdf-convert.3gpp.org`) - Fallback when local LibreOffice conversion fails
  - SDK/Client: `packages/3gpp-ai/threegpp_ai/operations/conversion.py`
  - Auth: `PDF_REMOTE_API_KEY` (Bearer token, optional)
### 3GPP Portal (Primary)
- **Service:** 3GPP EOL (ETSI Online) authenticated portal
- **Base URL:** `https://portal.3gpp.org` (defined in `src/tdoc_crawler/constants/urls.py`)
- **Endpoints:**
  - `/ngppapp/CreateTdoc.Aspx` — TDoc metadata view (authenticated)
  - `/ngppapp/DownloadTDoc.aspx` — TDoc file download URL extraction
  - `/ngppapp/GenerateDocumentList.aspx` — Meeting document list Excel export
  - `/ETSIPages/LoginEOL.ashx` — AJAX login endpoint (JSON POST)
  - `/login.aspx` — Login page (session establishment)
- **Client:** `PortalClient` in `src/tdoc_crawler/clients/portal.py`
- **Auth:** EOL username/password via `TDC_EOL_USERNAME` / `TDC_EOL_PASSWORD`
- **Resolution order:** CLI args > config file > env vars > interactive prompt (`src/tdoc_crawler/credentials.py`)

### WhatTheSpec.net (Fallback)
- **Service:** Community 3GPP metadata API
- **Base URL:** `https://whatthespec.net/3gpp/`
- **Endpoints:**
  - `tdoc.php?name={tdoc}&api=1` — TDoc metadata lookup
  - `spec.php?q={compact}&api=1` — Spec metadata lookup
- **Client:** `src/tdoc_crawler/tdocs/sources/whatthespec.py`, `src/tdoc_crawler/specs/sources/whatthespec.py`
- **Auth:** None required (public API)
- **Role:** Fallback for TDoc/spec resolution; primary is 3GPP portal
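
To make the endpoint shape above concrete, here is a small sketch that builds the TDoc lookup URL with the standard library. The constant and function names are hypothetical; only the base URL and the `name`/`api` query parameters come from the text above.

```python
from urllib.parse import urlencode

# Base URL as listed above; the constant name is an assumption.
WHATTHESPEC_BASE_URL = "https://whatthespec.net/3gpp/"


def build_tdoc_lookup_url(tdoc_id: str) -> str:
    """Hypothetical helper: build the `tdoc.php` metadata lookup URL."""
    # urlencode escapes the identifier so unusual characters stay URL-safe.
    query = urlencode({"name": tdoc_id, "api": 1})
    return f"{WHATTHESPEC_BASE_URL}tdoc.php?{query}"
```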

### 3GPP dynareport (Spec Metadata)
- **Service:** 3GPP specification version listing
- **Base URL:** `https://www.3gpp.org/dynareport/{compact}.htm`
- **Client:** `src/tdoc_crawler/specs/sources/threegpp.py`
- **Auth:** None required
- **Role:** Primary source for spec metadata and version information

### 3GPP FTP (Spec Downloads)
- **Service:** 3GPP specification archive FTP server
- **Base URL:** `https://www.3gpp.org/ftp/Specs/archive/{series}/{normalized}/{file_name}`
- **Template:** Defined in `src/tdoc_crawler/constants/urls.py` as `SPEC_URL_TEMPLATE`
- **Auth:** None (public FTP over HTTPS)
- **Role:** Download specification documents

### 3GPP Meetings Page
- **Service:** 3GPP meetings listing
- **Base URL:** `https://www.3gpp.org/dynareport?code=Meetings-{code}.htm`
- **Template:** Defined in `src/tdoc_crawler/constants/urls.py` as `MEETINGS_BASE_URL`
- **Auth:** None required

### PDF Remote Converter (Optional Fallback)
- **Service:** Remote PDF conversion API
- **Base URL:** `https://pdf-convert.3gpp.org` (configurable via `PDF_REMOTE_API_BASE`)
- **Client:** `src/tdoc_crawler/extraction/conversion.py`
- **Auth:** API key via `PDF_REMOTE_API_KEY` env var
- **Role:** Fallback when LibreOffice is not available locally

### Hugging Face Hub (Optional)
- **Service:** Model hosting for AI/embedding features
- **Auth:** `HF_TOKEN` env var (optional, avoids rate limits)
- **Role:** Download AI models for opendataloader-pdf hybrid mode

## Data Storage

**Databases:**

- SQLite (primary metadata store)
  - Connection: local file path via `PathConfig.db_file` in `src/tdoc_crawler/config/settings.py`
  - Client: Oxyde `AsyncDatabase` in `src/tdoc_crawler/database/base.py` with `sqlite:///...` URL
- SQLite (HTTP cache store)
  - Connection: local cache DB path via `PathConfig.http_cache_file`
  - Client: `hishel.SyncSqliteStorage` in `src/tdoc_crawler/http_client/session.py`
- SQLite (via Oxyde ORM)
  - Connection: `CacheManager.db_file` → `~/.3gpp-crawler/3gpp_crawler.db`
  - ORM Client: `oxyde.AsyncDatabase` (`src/tdoc_crawler/database/base.py`)
  - Models: `src/tdoc_crawler/database/oxyde_models.py`
  - Tables: `tdocs`, `meetings`, `specs`, `spec_versions`, `spec_downloads`, `spec_source_records`, `working_groups`, `subworking_groups`, `crawl_log`
- SQLite (HTTP Cache)
  - Connection: `CacheManager.http_cache_file` → `~/.3gpp-crawler/http-cache.sqlite3`
  - Managed by: `hishel.SyncSqliteStorage`
  - Purpose: HTTP response caching with TTL and refresh-on-access

**File Storage:**

- Local filesystem only (cache, checkout, AI workspace folders) managed by `PathConfig` and `CacheManager` in `src/tdoc_crawler/config/settings.py` and `src/tdoc_crawler/config/cache_manager.py`
- Local filesystem via `CacheManager`
  - Cache root: `~/.3gpp-crawler/` (configurable via `TDC_CACHE_DIR`)
  - Checkout dir: `~/.3gpp-crawler/checkout/` (downloaded/converted documents)
  - Workspaces dir: `~/.3gpp-crawler/workspaces/` (AI workspace data)

**Caching:**

- HTTP response caching via `hishel` + SQLite in `src/tdoc_crawler/http_client/session.py`
- HTTP response cache — hishel SQLite-backed (configurable TTL, default 7200s / 2 hours)
- File-based cache for downloaded documents (checkout directory)

## Authentication & Identity

**Auth Provider:**

- Custom credential-based portal auth for 3GPP EOL portal
  - Implementation: username/password in `CredentialsConfig` (`src/tdoc_crawler/config/settings.py`) consumed by `PortalClient` (`src/tdoc_crawler/clients/portal.py`)
- ETSI Online (EOL) — 3GPP portal authentication
  - Implementation: `PortalClient.authenticate()` in `src/tdoc_crawler/clients/portal.py`
  - Credentials model: `PortalCredentials` in `src/tdoc_crawler/models/base.py`
  - Resolution: `resolve_credentials()` in `src/tdoc_crawler/credentials.py`
  - Flow: Session-based (cookies), AJAX login API with JSON payload
  - Resolution order: CLI > config file > env vars > interactive prompt

## Monitoring & Observability

**Error Tracking:**

- None detected (no external Sentry/New Relic/etc. integration)
- None (external)

**Logs:**

- Python logging with centralized helpers (`tdoc_crawler.logging.get_logger`) used across `src/tdoc_crawler/` and `packages/3gpp-ai/threegpp_ai/`
- Rich-based console logging (`src/tdoc_crawler/logging/__init__.py`)
- `get_logger(__name__)` pattern throughout codebase
- `set_verbosity()` for runtime log level control
- Verbosity controlled via `TDC_VERBOSITY` or CLI `--verbose` flag

## CI/CD & Deployment

**Hosting:**

- Not detected (repository is CLI/package oriented)
- Local CLI tool (no deployment target)
- Repository: `https://forge.etsi.org/rep/reimes/3gpp-crawler`

**CI Pipeline:**

- No `.github/workflows/` directory detected
- Local/portable test orchestration via `tox.ini` (`tox` + `tox-uv`)
- tox for multi-Python version testing (py39–py313)
- Codecov integration (`codecov.yaml`, 90% coverage target)
- pre-commit hooks (ruff check/format, undersort, standard checks)

## Environment Configuration

**Required env vars:**

- Base crawler/runtime: `TDC_CACHE_DIR`, `HTTP_CACHE_TTL`, `TDC_TIMEOUT`, `TDC_MAX_RETRIES`
- Portal auth fallback: `TDC_EOL_USERNAME`, `TDC_EOL_PASSWORD`
- AI pipeline: `TDC_AI_LLM_MODEL`, `TDC_AI_LLM_API_KEY`, `TDC_AI_LLM_API_BASE`, `TDC_AI_PARALLELISM`
- Optional remote conversion: `PDF_REMOTE_API_KEY`, `PDF_REMOTE_API_BASE`
- `TDC_EOL_USERNAME` — EOL portal username (for authenticated features)
- `TDC_EOL_PASSWORD` — EOL portal password (for authenticated features)

**Optional env vars:**
- `TDC_CACHE_DIR` — Cache directory override
- `TDC_DB_FILENAME` — SQLite database filename
- `TDC_TIMEOUT` — HTTP request timeout (default: 30)
- `TDC_MAX_RETRIES` — Retry count (default: 3)
- `TDC_VERIFY_SSL` — SSL verification toggle (default: true)
- `TDC_WORKERS` — Parallel worker count (default: 4)
- `HTTP_CACHE_ENABLED` — HTTP cache toggle (default: true)
- `HTTP_CACHE_TTL` — Cache TTL in seconds (default: 7200)
- `PDF_REMOTE_API_KEY` — Remote PDF converter API key
- `PDF_REMOTE_API_BASE` — Remote converter base URL
- `HF_TOKEN` — Hugging Face API token
- `LIBREOFFICE_PATH` — LibreOffice executable path override

**Secrets location:**

- Environment variables (primary)
- Optional local config files loaded by settings discovery (`src/tdoc_crawler/config/sources.py`); secrets should remain environment-backed
- `.env` file (gitignored, documented in `.env.example`)
- Config files support `${ENV_VAR}` interpolation (`src/tdoc_crawler/config/sources.py`)
- Interactive prompt fallback for EOL credentials
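
As a rough illustration of `${ENV_VAR}` interpolation in config values, the sketch below substitutes references using a regular expression. It is not the code in `src/tdoc_crawler/config/sources.py`; the helper name and the choice to raise `KeyError` for unset variables are assumptions.

```python
import re
from collections.abc import Mapping

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_REF_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate_env(value: str, env: Mapping[str, str]) -> str:
    """Hypothetical helper: replace each ${NAME} with its value from env."""

    def _replace(match: re.Match[str]) -> str:
        name = match.group(1)
        if name not in env:
            # Assumed behavior: fail loudly rather than leave a placeholder.
            raise KeyError(f"Environment variable {name} is not set")
        return env[name]

    return _ENV_REF_PATTERN.sub(_replace, value)
```

In practice the mapping would be `os.environ`; passing it explicitly keeps the helper easy to test and keeps secrets out of the config file itself.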

## Webhooks & Callbacks

**Incoming:**

- None

**Outgoing:**

- None

## HTTP Infrastructure

**Session Factory:**
- `create_cached_session()` in `src/tdoc_crawler/http_client/session.py`
- Uses niquests (HTTP/2-capable, requests-compatible API)
- Hishel CacheAdapter bridges niquests↔hishel for HTTP caching
- `_NiquetsCacheAdapter` custom subclass handles type incompatibility
- Browser headers injected from `src/tdoc_crawler/constants/urls.py` to avoid 403 responses
- Decompression: gzip, deflate, brotli handled explicitly (`_decompress_body()`)
- Download helper: `download_to_file()` with streaming and empty-download detection
- Pool config: `PoolConfig` dataclass for connection pool tuning

---

*Integration audit: 2026-04-27*
*Integration audit: 2026-05-03*
+83 −53

File changed.

Preview size limit exceeded, changes collapsed.
