Commit 4a0ac5ac authored by Jan Reimes

Replace str version types with Version subclasses (AgendaItem, SpecVersion)

AgendaItem and SpecVersion subclass packaging.version.Version and
implement __get_pydantic_core_schema__ so Oxyde/Pydantic models can
use them directly — stored as strings in SQLite, full Version
semantics in Python.

Eliminates the parse_*_version → Version → str() roundtrip that
lost type safety. The type aliases AgendaItemNumber and
SpecificationVersionNumber now resolve to the Version subclasses.

Also removes redundant field_validator calls in specs/models.py that
were doing what the SpecVersion type now handles during validation.

Adds database/AGENTS.md documenting model choice guidance:
Oxyde Model for DB, dataclass for in-memory, Pydantic only when
validators/serializers are needed.
parent 56fa82a0
+92 −0
# Database — Persistence Layer

SQLite via Oxyde ORM. All models live in `oxyde_models.py`.

## Model Choice Guide

| Use When | Tool | Location |
|----------|------|----------|
| Persisting to SQLite | **Oxyde `Model`** | `database/oxyde_models.py` |
| Passing data between functions, no DB | **`dataclass`** | Domain module (e.g. `workspaces/`, `specs/models.py`) |
| Need Pydantic validators/serializers for external I/O | **Pydantic `BaseModel`** | Domain module |

### Why each choice

**Oxyde `Model`** — required for anything that goes into SQLite. Oxyde handles CRUD, migrations, and query building. All DB-facing types must be Oxyde-compatible (primitive types, or custom types with `__get_pydantic_core_schema__`).

**`dataclass`** — default for pure in-memory data transfer. No serialization overhead. Use for filter objects, query configs, result containers. If it doesn't need `model_dump()` or validators, use a dataclass.

**Pydantic `BaseModel`** — only when you need Pydantic-specific features: `field_validator`, `model_validator`, custom serializers, JSON schema generation, or interop with external APIs. The `TDocCrawlConfig` and `TDocQueryConfig` models use Pydantic because they have complex validation logic.

### Decision flow

```
Does it go into SQLite?
  → Yes: Oxyde Model in oxyde_models.py
  → No: Does it need validators/serializers/JSON schema?
         → Yes: Pydantic BaseModel in domain module
         → No: dataclass in domain module
```
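
To illustrate the last branch of the flow, here is a minimal sketch of an in-memory filter object written as a plain dataclass. `MeetingFilter`, its fields, and `matches` are hypothetical names chosen for this example and are not taken from the codebase.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MeetingFilter:
    """Hypothetical filter object: never persisted, no validators needed."""

    working_group: str | None = None
    limit: int = 50
    statuses: tuple[str, ...] = ()


def matches(flt: MeetingFilter, working_group: str, status: str) -> bool:
    """Return True when a record passes the filter; empty criteria match everything."""
    if flt.working_group is not None and flt.working_group != working_group:
        return False
    return not flt.statuses or status in flt.statuses


flt = MeetingFilter(working_group="SA4", statuses=("agreed",))
print(matches(flt, "SA4", "agreed"))  # passes both criteria
print(asdict(flt))                    # plain dict, no Pydantic involved
```

If an occasional dict is all that is needed, `dataclasses.asdict` covers it, so `model_dump()` alone is rarely a reason to switch to Pydantic.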

## Single Model Per Entity

**One Oxyde model per database table. No duplicates.**

The canonical example: there used to be two `TDocMetadata` classes — a Pydantic one for the pipeline and an Oxyde one for the DB. This caused `AttributeError` when the DB layer accessed fields (`tbid`, `file_size`) that only existed on the Oxyde model. The fix: delete the Pydantic duplicate, use the Oxyde model everywhere.

If a field needs custom behavior (e.g. `Version` subclass for `agenda_item_nbr`), make the Oxyde model's field type the custom type. Don't create a parallel Pydantic model with different field types.
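
The failure mode described above can be reproduced in miniature. This is only a sketch: the two classes below are hypothetical dataclass stand-ins for the former Pydantic/Oxyde pair, the field types of `tbid` and `file_size` are assumed, and `stored_size` is an invented helper that mimics DB-layer code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PipelineTDocMetadata:
    """Stand-in for the deleted duplicate: lacks the DB-only fields."""

    tdoc_id: str


@dataclass
class DbTDocMetadata:
    """Stand-in for the single remaining model (field types are assumed)."""

    tdoc_id: str
    tbid: int | None = None
    file_size: int | None = None


def stored_size(record) -> int:
    """Hypothetical DB-layer helper that assumes the DB model's fields exist."""
    return record.file_size or 0


print(stored_size(DbTDocMetadata("S4-123", file_size=2048)))

try:
    stored_size(PipelineTDocMetadata("S4-123"))
except AttributeError as exc:
    # The duplicate has no `file_size`, so the DB layer breaks at runtime.
    print(f"AttributeError: {exc}")
```

With a single model per entity there is only one set of fields, so this class of error cannot arise.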

## Custom Field Types

For version-like values stored as strings in SQLite (agenda item numbers, spec versions), create a `Version` subclass in `utils/version_types.py`:

```python
from packaging.version import Version
from pydantic import GetCoreSchemaHandler
from pydantic_core import CoreSchema, core_schema

class AgendaItem(Version):
    @classmethod
    def __get_pydantic_core_schema__(cls, source_type, handler: GetCoreSchemaHandler) -> CoreSchema:
        def validate(value):
            if isinstance(value, Version):
                return cls(str(value))
            return cls(str(value).strip())
        return core_schema.no_info_plain_validator_function(
            validate,
            serialization=core_schema.plain_serializer_function_ser_schema(str),
        )
```

This gives you:
- Full `Version` comparison/sorting in Python
- Automatic `str` serialization for DB storage
- Pydantic validation from `str`/`int`/`Version` input
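
A short usage sketch of those three properties follows. It makes one assumption: a plain Pydantic `BaseModel` (the hypothetical `Row` class) stands in for an Oxyde `Model`, on the premise that Oxyde resolves field types through the same `__get_pydantic_core_schema__` hook. The `AgendaItem` class is repeated so the snippet runs on its own.

```python
from packaging.version import Version
from pydantic import BaseModel, GetCoreSchemaHandler, ValidationError
from pydantic_core import CoreSchema, core_schema


class AgendaItem(Version):
    @classmethod
    def __get_pydantic_core_schema__(cls, source_type, handler: GetCoreSchemaHandler) -> CoreSchema:
        def validate(value):
            if isinstance(value, Version):
                return cls(str(value))
            return cls(str(value).strip())

        return core_schema.no_info_plain_validator_function(
            validate,
            serialization=core_schema.plain_serializer_function_ser_schema(str),
        )


class Row(BaseModel):
    """Hypothetical stand-in for an Oxyde model with one custom-typed field."""

    agenda_item_nbr: AgendaItem


# Validation accepts str, int, and Version input.
rows = [Row(agenda_item_nbr=v) for v in ("11.10", 2, Version("11.3.2"))]

# Version ordering, not string ordering: 2 < 11.3.2 < 11.10.
ordered = sorted(rows, key=lambda r: r.agenda_item_nbr)

# Serialization yields plain strings, ready for a TEXT column.
dumped = [r.model_dump()["agenda_item_nbr"] for r in ordered]
print(dumped)

# Invalid input surfaces as a normal Pydantic ValidationError,
# because packaging's InvalidVersion is a ValueError subclass.
try:
    Row(agenda_item_nbr="not-a-version")
except ValidationError as exc:
    error_count = exc.error_count()
    print(f"{error_count} validation error")
```

Sorting the same values as raw strings would put `"11.10"` before `"11.3.2"` and `"2"` last, which is the ordering bug the custom type avoids.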

## Key Rules

- **Import `TDocMetadata` from `database.oxyde_models`**, never from `tdocs.models`.
- **Don't create parallel models** for the same entity. If the Oxyde model lacks a field, add it there.
- **Custom types go in `utils/version_types.py`**, not inline in models.
- **Tests that construct DB models** must import from `database.oxyde_models`.

## Database Classes

| Class | File | Purpose |
|-------|------|---------|
| `TDocDatabase` | `tdocs.py` | TDoc CRUD, queries, bulk operations |
| `MeetingDatabase` | `meetings.py` | Meeting CRUD, queries, statistics |
| `SpecDatabase` | `specs.py` | Spec CRUD, version management |
| `DocDatabase` | `base.py` | Base class with shared helpers |

## Common Mistakes

```python
# ❌ WRONG: importing the old parallel Pydantic model (now deleted)
from tdoc_crawler.tdocs.models import TDocMetadata
metadata = TDocMetadata(tdoc_id="S4-123", ...)

# ✅ CORRECT: use the single Oxyde model
from tdoc_crawler.database.oxyde_models import TDocMetadata
metadata = TDocMetadata(tdoc_id="S4-123", ...)
```
+10 −8
@@ -13,6 +13,8 @@ from typing import Any
 from oxyde import Field, Model
 from pydantic import field_validator

+from tdoc_crawler.utils.version_types import AgendaItem, SpecVersion
+

 def _utcnow() -> datetime:
     return datetime.now(UTC)
@@ -102,7 +104,7 @@ class TDocMetadata(Model):
     contact: str
     tdoc_type: str = Field(default="unknown")
     for_purpose: str = Field(default="unknown")
-    agenda_item_nbr: str
+    agenda_item_nbr: AgendaItem
     agenda_item_text: str = Field(default="Unknown")
     status: str | None = None
     is_revision_of: str | None = None
@@ -129,7 +131,7 @@ class Specification(Model):
     status: str
     working_group: str
     series: str
-    latest_version: str | None = None
+    latest_version: SpecVersion | None = None

     class Meta:
         """Oxyde table configuration."""
@@ -145,7 +147,7 @@ class SpecificationSourceRecord(Model):
     source_name: str
     source_identifier: str | None = None
     metadata_payload: dict[str, Any] = Field(default_factory=dict)
-    versions: list[str] = Field(default_factory=list)
+    versions: list[SpecVersion] = Field(default_factory=list)
     fetched_at: datetime | None = None

     @field_validator("metadata_payload", mode="before")
@@ -161,14 +163,14 @@ class SpecificationSourceRecord(Model):

     @field_validator("versions", mode="before")
     @classmethod
-    def _parse_versions(cls, value: list[str] | str) -> list[str]:
+    def _parse_versions(cls, value: list[str | SpecVersion] | str) -> list[SpecVersion]:
         if isinstance(value, str):
             try:
                 parsed = json.loads(value)
             except json.JSONDecodeError:
                 return []
-            return [str(item) for item in parsed] if isinstance(parsed, list) else []
-        return [str(item) for item in value]
+            return [SpecVersion(str(item)) for item in parsed] if isinstance(parsed, list) else []
+        return [SpecVersion(str(item)) for item in value]

     class Meta:
         """Oxyde table configuration."""
@@ -181,7 +183,7 @@ class SpecificationVersion(Model):

     record_id: str = Field(db_pk=True)
     spec_number: str
-    version: str
+    version: SpecVersion
     file_name: str
     source_name: str

@@ -196,7 +198,7 @@ class SpecificationDownload(Model):

     record_id: str = Field(db_pk=True)
     spec_number: str
-    version: str
+    version: SpecVersion
     download_url: str
     checkout_path: str
     document_path: str
+6 −35
@@ -10,12 +10,11 @@ from datetime import datetime
 from pathlib import Path
 from typing import Any

-from packaging.version import Version
-from pydantic import BaseModel, Field, field_validator
+from pydantic import BaseModel, Field
 from rich.console import Console, ConsoleOptions, RenderResult
 from rich.text import Text

-from tdoc_crawler.utils.parse import SpecificationVersionNumber, parse_spec_version_nbr
+from tdoc_crawler.utils.version_types import SpecVersion


 class Specification(BaseModel):
@@ -28,20 +27,12 @@ class Specification(BaseModel):
     status: str
     working_group: str
     series: str
-    latest_version: SpecificationVersionNumber | None = None
+    latest_version: SpecVersion | None = None

     def __rich_console__(self, console: Console, options: ConsoleOptions) -> RenderResult:
         _ = (console, options)
         yield Text(f"{self.spec_number} - {self.title}")

-    @field_validator("latest_version", mode="before")
-    @classmethod
-    def _normalize_latest_version(cls, value: Version | SpecificationVersionNumber | None) -> SpecificationVersionNumber | None:
-        """Normalize latest version to canonical three-part string when provided."""
-        if value is None:
-            return None
-        return parse_spec_version_nbr(value)
-

 class SpecificationSourceRecord(BaseModel):
     """Source-specific metadata snapshot."""
@@ -51,40 +42,26 @@ class SpecificationSourceRecord(BaseModel):
     source_name: str
     source_identifier: str | None = None
     metadata_payload: dict[str, Any] = Field(default_factory=dict)
-    versions: list[SpecificationVersionNumber] = Field(default_factory=list)
+    versions: list[SpecVersion] = Field(default_factory=list)
     fetched_at: datetime | None = None

-    @field_validator("versions", mode="before")
-    @classmethod
-    def _normalize_versions(cls, value: list[Version | SpecificationVersionNumber] | None) -> list[SpecificationVersionNumber]:
-        """Normalize source versions to canonical three-part strings."""
-        if value is None:
-            return []
-        return [parse_spec_version_nbr(item) for item in value]
-

 class SpecificationVersion(BaseModel):
     """Spec version details."""

     record_id: str | None = None
     spec_number: str
-    version: SpecificationVersionNumber
+    version: SpecVersion
     file_name: str
     source_name: str

-    @field_validator("version", mode="before")
-    @classmethod
-    def _normalize_version(cls, value: Version | SpecificationVersionNumber) -> SpecificationVersionNumber:
-        """Normalize version to canonical three-part string."""
-        return parse_spec_version_nbr(value)
-

 class SpecificationDownload(BaseModel):
     """Download and extraction outcome for a spec version."""

     record_id: str | None = None
     spec_number: str
-    version: SpecificationVersionNumber
+    version: SpecVersion
     download_url: str
     checkout_path: Path
     document_path: Path
@@ -94,12 +71,6 @@ class SpecificationDownload(BaseModel):
     outcome_message: str | None = None
     extracted_at: datetime | None = None

-    @field_validator("version", mode="before")
-    @classmethod
-    def _normalize_version(cls, value: Version | SpecificationVersionNumber) -> SpecificationVersionNumber:
-        """Normalize downloaded version to canonical three-part string."""
-        return parse_spec_version_nbr(value)
-

 @dataclass
 class SpecQueryFilters:
+25 −25
@@ -8,44 +8,44 @@ from packaging.version import InvalidVersion, Version
 from tdoc_crawler.meetings.utils import normalize_subgroup_alias, normalize_working_group_alias
 from tdoc_crawler.models.working_groups import WorkingGroup
 from tdoc_crawler.utils.normalization import expand_spec_ranges_batch
+from tdoc_crawler.utils.version_types import AgendaItem, SpecVersion

-type AgendaItemNumber = str
-type SpecificationVersionNumber = str
+type AgendaItemNumber = AgendaItem
+type SpecificationVersionNumber = SpecVersion


-def parse_agenda_item_nbr(value: str | int | None) -> AgendaItemNumber:
-    """Parse agenda item number as canonical string."""
-    return str(parse_agenda_item_version(value))
-
-
-def parse_agenda_item_version(value: str | int | None) -> Version:
-    """Parse agenda item number as Version."""
+def parse_agenda_item(value: str | int | None) -> AgendaItem:
+    """Parse agenda item number as AgendaItem (Version subclass)."""
     if value is None:
-        return Version("0")
+        return AgendaItem("0")
     try:
-        return Version(str(value).strip())
+        return AgendaItem(str(value).strip())
     except InvalidVersion:
-        return Version("0")
+        return AgendaItem("0")


-def parse_spec_version_nbr(value: str | int | None) -> SpecificationVersionNumber:
-    """Parse specification version as canonical three-part string."""
-    return str(parse_spec_version(value))
+# Backward-compatible aliases
+parse_agenda_item_nbr = parse_agenda_item
+parse_agenda_item_version = parse_agenda_item


-def parse_spec_version(value: str | int | None) -> Version:
-    """Parse specification version as Version, normalized to major.minor.patch."""
-    if value is None:
-        return Version("0.0.0")
+def parse_spec_version(value: str | int | None) -> SpecVersion:
+    """Parse specification version as SpecVersion (Version subclass).

-    parsed = Version(str(value).strip())
-    release_parts = list(parsed.release)
+    Normalizes to major.minor.patch.
+    """
+    if value is None:
+        return SpecVersion("0.0.0")
+    try:
+        v = Version(str(value).strip())
+    except InvalidVersion:
+        return SpecVersion("0.0.0")
+    parts = list(v.release) + [0] * (3 - len(v.release))
+    return SpecVersion(".".join(str(p) for p in parts[:3]))

-    if len(release_parts) >= 3:
-        return Version(".".join(str(part) for part in release_parts[:3]))

-    padded_parts = release_parts + [0] * (3 - len(release_parts))
-    return Version(".".join(str(part) for part in padded_parts))
+# Backward-compatible aliases
+parse_spec_version_nbr = parse_spec_version


 def infer_working_groups_from_subgroups(subgroups: list[str]) -> list[WorkingGroup]:
+65 −0
"""Pydantic-aware Version subclasses for 3GPP domain types.

Provides ``AgendaItem`` and ``SpecVersion`` — both subclass
``packaging.version.Version`` and implement
``__get_pydantic_core_schema__`` so Oxyde/Pydantic models can store them
as strings while exposing full ``Version`` comparison semantics in Python.
"""

from __future__ import annotations

from packaging.version import Version
from pydantic import GetCoreSchemaHandler
from pydantic_core import CoreSchema, core_schema


class _VersionField(Version):
    """Base class that makes ``Version`` work as a Pydantic field type.

    Serializes to ``str`` for DB/JSON storage, deserializes from
    ``str | int | Version`` input.
    """

    @classmethod
    def __get_pydantic_core_schema__(cls, source_type: type, handler: GetCoreSchemaHandler) -> CoreSchema:
        def validate(value: str | int | Version) -> _VersionField:
            if isinstance(value, Version):
                return cls(str(value))
            return cls(str(value).strip())

        return core_schema.no_info_plain_validator_function(
            validate,
            serialization=core_schema.plain_serializer_function_ser_schema(str),
        )


class AgendaItem(_VersionField):
    """Agenda item number (e.g. ``11.3.2``).

    A ``Version`` subclass — supports ordering, comparison, and
    ``str()`` serialization for DB storage.
    """


class SpecVersion(_VersionField):
    """Three-part specification version (e.g. ``18.1.0``).

    Input is normalized to ``major.minor.patch`` when validated as a
    Pydantic/Oxyde field. Direct construction (``SpecVersion("18.1")``)
    does not pad the release segment.
    A ``Version`` subclass that supports ordering, comparison, and
    ``str()`` serialization for DB storage.
    """

    @classmethod
    def __get_pydantic_core_schema__(cls, source_type: type, handler: GetCoreSchemaHandler) -> CoreSchema:
        def validate(value: str | int | Version) -> SpecVersion:
            v = value if isinstance(value, Version) else Version(str(value).strip())
            parts = list(v.release) + [0] * (3 - len(v.release))
            return cls(".".join(str(p) for p in parts[:3]))

        return core_schema.no_info_plain_validator_function(
            validate,
            serialization=core_schema.plain_serializer_function_ser_schema(str),
        )


__all__ = ["AgendaItem", "SpecVersion"]
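
The three-part normalization used by `SpecVersion` validation and by `parse_spec_version` can be tried in isolation. The sketch below needs only `packaging`; `normalize_release` is a hypothetical helper name that mirrors the padding and truncation expression from the diff, not a function in the codebase.

```python
from packaging.version import Version


def normalize_release(value) -> str:
    """Pad or truncate the release segment to exactly three parts."""
    v = value if isinstance(value, Version) else Version(str(value).strip())
    # A negative repeat count yields an empty list, so longer releases are only truncated.
    parts = list(v.release) + [0] * (3 - len(v.release))
    return ".".join(str(p) for p in parts[:3])


for raw in ("18", " 18.1 ", "18.1.0.5", 17):
    print(repr(raw), "->", normalize_release(raw))

# Plain construction does not pad, although the two forms still compare equal.
print(str(Version("18.1")), Version("18.1") == Version("18.1.0"))
```

Because `packaging` ignores trailing zeros when comparing, padding changes only the stored string, not ordering. Pre-release or other suffixes are dropped, since only `release` is used.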