Commit c1749592 authored by Jan Reimes

feat(specs): implement auto-crawl for missing spec metadata

- Modify `open-spec` and `checkout-spec` commands to trigger a metadata crawl if the requested specification is missing from the local database.
- Enhance `SpecDownloads` to utilize sources for metadata fetching.
- Improve 3GPP and WhatTheSpec fetchers with appropriate headers to avoid access issues.
- Update `SpecCatalog` to handle version-specific file names from metadata.
- Fix JSON deserialization in `TDocDatabase` for better data integrity.
parent d80e4a95
+3 −3
@@ -34,7 +34,7 @@ Use aliases for faster typing: `tdoc-crawler ct` instead of `tdoc-crawler crawl-
- Cache files and the SQLite database default to `~/.tdoc-crawler`; override with `--cache-dir`.
- Incremental crawls skip IDs already stored; pass `--full` to force reprocessing.
- Queries that specify `tdoc_ids` automatically launch a targeted crawl when a requested ID is missing; results refresh before output.
-- Targeted crawls infer working groups from the prefix of each TDoc ID (`R`, `S`, `T`, `C`).
+- Specification commands (`open-spec`, `checkout-spec`) automatically crawl metadata for unknown spec numbers before attempting downloads.
- Downloaded TDocs live under `<cache-dir>/tdocs/` and are reused when possible.

## Typical Workflow
@@ -523,7 +523,7 @@ tdoc-crawler checkout R1-2400001 --cache-dir /path/to/cache
tdoc-crawler open-spec <SPEC_NUMBER> [OPTIONS]
```

-Download and open the latest document for a specification.
+Download and open the latest document for a specification. If the spec metadata is missing from the local database, it automatically triggers a crawl before downloading.

**Options:**

@@ -548,7 +548,7 @@ tdoc-crawler open-spec 23.501 -r 17
tdoc-crawler checkout-spec <SPEC_NUMBERS...> [OPTIONS]
```

-Batch download specification documents to the checkout folder.
+Batch download specification documents to the checkout folder. Automatically crawls missing spec metadata before downloading.

**Examples:**

+39 −0
# Summary - 2026-02-07 - SPEC_DOWNLOAD_AUTO_CRAWL_AND_BUG_FIXES

## Changes

### 1. Auto-Crawl for Specification Commands

- Modified `open-spec` and `checkout-spec` to automatically trigger a metadata crawl if the requested specification is missing from the local database.
- Impacted files: `src/tdoc_crawler/specs/downloads.py`, `src/tdoc_crawler/cli/app.py`.
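
The fallback can be sketched roughly as follows. This is a minimal illustration of the lookup-then-crawl flow, not the actual `SpecDownloads` API: `DictDatabase`, `resolve_spec`, and the `(name, fetch)` source tuples are hypothetical stand-ins.

```python
# Sketch of the auto-crawl fallback used by open-spec/checkout-spec.
# All names below are illustrative stand-ins, not the real tdoc-crawler API.

class DictDatabase:
    """Minimal stand-in for the local spec metadata store."""

    def __init__(self):
        self._records = {}

    def get(self, spec_number):
        return self._records.get(spec_number)

    def store(self, spec_number, payload):
        self._records[spec_number] = payload


def resolve_spec(spec_number, database, sources):
    """Return metadata, crawling the sources only when the spec is unknown."""
    record = database.get(spec_number)
    if record is None:                      # missing locally -> targeted crawl
        for name, fetch in sources:         # e.g. ("3gpp", ...), ("whatthespec", ...)
            payload = fetch(spec_number)
            if payload:                     # first source with data wins
                database.store(spec_number, payload)
                record = payload
                break
    if record is None:
        raise LookupError(f"no metadata found for spec {spec_number}")
    return record
```

The key property is that sources are only consulted on a cache miss, so repeated invocations for a known spec never hit the network.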

### 2. 3GPP Metadata Fetcher Improvements

- Added browser-like `User-Agent` and `Accept` headers to the 3GPP metadata fetcher to avoid `403 Forbidden` errors from the 3GPP portal.
- Impacted file: `src/tdoc_crawler/specs/sources/threegpp.py`.
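
The exact header values used by the fetcher are not reproduced here; a typical browser-like header set looks like the following sketch (the values and the `fetch_page` helper are illustrative assumptions):

```python
import urllib.request

# Illustrative browser-like headers; some portals answer 403 Forbidden
# when they see a default HTTP-client User-Agent.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}


def fetch_page(url: str) -> bytes:
    """Fetch a URL with browser-like headers attached (hypothetical helper)."""
    request = urllib.request.Request(url, headers=BROWSER_HEADERS)
    with urllib.request.urlopen(request) as response:
        return response.read()
```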

### 3. WhatTheSpec Fetcher Bug Fixes

- Fixed field mapping for `whatthespec` source: changed `versions` to `vers` and added support for version-specific file names via the `specfile` field.
- Impacted file: `src/tdoc_crawler/specs/sources/whatthespec.py`.

### 4. Spec Catalog Logic Update

- Improved `SpecCatalog` to handle mappings between versions and their specific file names when provided by metadata sources (e.g., `whatthespec`).
- Impacted file: `src/tdoc_crawler/specs/catalog.py`.

### 5. Database Retrieval Robustness

- Fixed a bug in `TDocDatabase` where JSON fields (`metadata_payload`, `versions`) from the `spec_source_records` table were being returned as raw strings instead of parsed dictionaries/lists during model instantiation.
- Impacted file: `src/tdoc_crawler/database/connection.py`.
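
SQLite stores these columns as TEXT, so a fetched row carries JSON strings that must be parsed before the pydantic model is built. The essence of the fix, on a made-up example row:

```python
import json

# Illustrative row as it comes back from SQLite: JSON fields are raw strings.
row_dict = {"metadata_payload": '{"title": "Example spec"}', "versions": '["17.1.0"]'}

# Parse each JSON field, falling back to an empty default on bad data.
for field, default in (("metadata_payload", {}), ("versions", [])):
    value = row_dict.get(field)
    if isinstance(value, str):
        try:
            row_dict[field] = json.loads(value)
        except json.JSONDecodeError:
            row_dict[field] = default
```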

### 6. Official Archive Download Fix

- Added `User-Agent` header to official 3GPP archive downloads (zip files) to prevent `403 Forbidden` errors during file retrieval.
- Impacted file: `src/tdoc_crawler/specs/downloads.py`.

## Verification Results

- `uv run tdoc-crawler open-spec 26132`: Successfully crawls metadata, downloads zip, and opens docx.
- `uv run tdoc-crawler checkout-spec 26.131`: Successfully crawls metadata and downloads spec.
- Database records for specifications are now correctly stored and retrieved with full metadata payloads.
+12 −2
@@ -628,10 +628,15 @@ def checkout_spec(

    effective_checkout_dir = checkout_dir or (cache_dir / "checkout")

+    sources = [
+        FunctionSpecSource("3gpp", fetch_threegpp_metadata),
+        FunctionSpecSource("whatthespec", fetch_whatthespec_metadata),
+    ]
+
    db_path = database_path(cache_dir)
    with TDocDatabase(db_path) as database:
        downloader = SpecDownloads(database)
-        results = downloader.checkout_specs(specs, doc_only, effective_checkout_dir, release)
+        results = downloader.checkout_specs(specs, doc_only, effective_checkout_dir, release, sources=sources)

    # Output formatting
    print_checkout_results(results)
@@ -648,11 +653,16 @@ def open_spec(
    normalized = spec.strip()
    checkout_dir = cache_dir / "checkout"

+    sources = [
+        FunctionSpecSource("3gpp", fetch_threegpp_metadata),
+        FunctionSpecSource("whatthespec", fetch_whatthespec_metadata),
+    ]
+
    db_path = database_path(cache_dir)
    with TDocDatabase(db_path) as database:
        downloader = SpecDownloads(database)
        try:
-            path = downloader.open_spec(normalized, doc_only, checkout_dir, release)
+            path = downloader.open_spec(normalized, doc_only, checkout_dir, release, sources=sources)
            console.print(f"[green]Opening {path}[/green]")
            launch_file(path)
        except Exception as exc:
+29 −7
"""Database access layer backed by pydantic_sqlite."""

+import contextlib
import json
from collections import defaultdict
from collections.abc import Callable, Iterable
@@ -750,15 +751,36 @@ class TDocDatabase:

    def _get_spec_source_record(self, record_id: str) -> SpecificationSourceRecord | None:
        try:
-            record = self.connection.model_from_table("spec_source_records", record_id)  # type: ignore[arg-type]
-            # Handle JSON deserialization for metadata_payload field
-            if record is not None and isinstance(record.metadata_payload, str):
+            # Use raw query to handle JSON deserialization manually before model instantiation
+            cursor = self.connection._db.execute("SELECT * FROM spec_source_records WHERE record_id = ?", (record_id,))
+            row = cursor.fetchone()
+            if row is None:
+                return None
+
+            columns = [description[0] for description in cursor.description]
+            row_dict = dict(zip(columns, row, strict=False))
+
+            # Handle JSON fields
+            if "metadata_payload" in row_dict and isinstance(row_dict["metadata_payload"], str):
                try:
-                    record = record.model_copy(update={"metadata_payload": json.loads(record.metadata_payload)})
+                    row_dict["metadata_payload"] = json.loads(row_dict["metadata_payload"])
                except json.JSONDecodeError:
-                    record = record.model_copy(update={"metadata_payload": {}})
-            return record
-        except KeyError:
+                    row_dict["metadata_payload"] = {}
+
+            if "versions" in row_dict and isinstance(row_dict["versions"], str):
+                try:
+                    row_dict["versions"] = json.loads(row_dict["versions"])
+                except json.JSONDecodeError:
+                    row_dict["versions"] = []
+
+            # Handle datetime deserialization
+            if "fetched_at" in row_dict and isinstance(row_dict["fetched_at"], str):
+                with contextlib.suppress(ValueError, AttributeError):
+                    row_dict["fetched_at"] = datetime.fromisoformat(row_dict["fetched_at"])
+
+            return SpecificationSourceRecord(**row_dict)
+        except Exception as exc:
+            _logger.debug("Error fetching spec source record %s: %s", record_id, exc)
            return None

    def _get_spec_version(self, record_id: str) -> SpecificationVersion | None:
+9 −2
@@ -136,8 +136,15 @@ class SpecCatalog:
                elif aggregated.latest_version is None and candidate.latest_version is not None:
                    aggregated = aggregated.model_copy(update={"latest_version": candidate.latest_version})

-                for version in normalized_versions:
-                    file_name = str(metadata_payload.get("file_name", f"{compact}-unknown.zip"))
+                for i, version in enumerate(normalized_versions):
+                    # Try to get specific file name for this version from payload
+                    file_name = f"{compact}-unknown.zip"
+                    if "specfile" in metadata_payload and isinstance(metadata_payload["specfile"], list):
+                        if i < len(metadata_payload["specfile"]):
+                            file_name = str(metadata_payload["specfile"][i])
+                    elif "file_name" in metadata_payload:
+                        file_name = str(metadata_payload["file_name"])
+
                    spec_versions.append(
                        SpecificationVersion(
                            spec_number=normalized,