Commit c1749592 authored by Jan Reimes

feat(specs): implement auto-crawl for missing spec metadata

- Modify `open-spec` and `checkout-spec` commands to trigger a metadata crawl if the requested specification is missing from the local database.
- Enhance `SpecDownloads` to utilize sources for metadata fetching.
- Improve 3GPP and WhatTheSpec fetchers with appropriate headers to avoid access issues.
- Update `SpecCatalog` to handle version-specific file names from metadata.
- Fix JSON deserialization in `TDocDatabase` for better data integrity.
parent d80e4a95
+3 −3
@@ -34,7 +34,7 @@ Use aliases for faster typing: `tdoc-crawler ct` instead of `tdoc-crawler crawl-
- Cache files and the SQLite database default to `~/.tdoc-crawler`; override with `--cache-dir`.
- Incremental crawls skip IDs already stored; pass `--full` to force reprocessing.
- Queries that specify `tdoc_ids` automatically launch a targeted crawl when a requested ID is missing; results refresh before output.
-- Targeted crawls infer working groups from the prefix of each TDoc ID (`R`, `S`, `T`, `C`).
+- Specification commands (`open-spec`, `checkout-spec`) automatically crawl metadata for unknown spec numbers before attempting downloads.
- Downloaded TDocs live under `<cache-dir>/tdocs/` and are reused when possible.

## Typical Workflow
@@ -523,7 +523,7 @@ tdoc-crawler checkout R1-2400001 --cache-dir /path/to/cache
tdoc-crawler open-spec <SPEC_NUMBER> [OPTIONS]
```

-Download and open the latest document for a specification.
+Download and open the latest document for a specification. If the spec metadata is missing from the local database, it automatically triggers a crawl before downloading.

**Options:**

@@ -548,7 +548,7 @@ tdoc-crawler open-spec 23.501 -r 17
tdoc-crawler checkout-spec <SPEC_NUMBERS...> [OPTIONS]
```

-Batch download specification documents to the checkout folder.
+Batch download specification documents to the checkout folder. Automatically crawls missing spec metadata before downloading.

**Examples:**

+39 −0
# Summary - 2026-02-07 - SPEC_DOWNLOAD_AUTO_CRAWL_AND_BUG_FIXES

## Changes

### 1. Auto-Crawl for Specification Commands

- Modified `open-spec` and `checkout-spec` to automatically trigger a metadata crawl if the requested specification is missing from the local database.
- Impacted files: `src/tdoc_crawler/specs/downloads.py`, `src/tdoc_crawler/cli/app.py`.
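
The fallback can be sketched roughly as follows. This is a minimal illustration of the lookup-then-crawl flow, not the actual `SpecDownloads` API: `DictDatabase`, `resolve_spec`, and the `(name, fetch)` source tuples are hypothetical stand-ins.

```python
# Sketch of the auto-crawl fallback used by open-spec/checkout-spec.
# All names below are illustrative stand-ins, not the real tdoc-crawler API.

class DictDatabase:
    """Minimal stand-in for the local spec metadata store."""

    def __init__(self):
        self._records = {}

    def get(self, spec_number):
        return self._records.get(spec_number)

    def store(self, spec_number, payload):
        self._records[spec_number] = payload


def resolve_spec(spec_number, database, sources):
    """Return metadata, crawling the sources only when the spec is unknown."""
    record = database.get(spec_number)
    if record is None:                      # missing locally -> targeted crawl
        for name, fetch in sources:         # e.g. ("3gpp", ...), ("whatthespec", ...)
            payload = fetch(spec_number)
            if payload:                     # first source with data wins
                database.store(spec_number, payload)
                record = payload
                break
    if record is None:
        raise LookupError(f"no metadata found for spec {spec_number}")
    return record
```

The key property is that sources are only consulted on a cache miss, so repeated invocations for a known spec never hit the network.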

### 2. 3GPP Metadata Fetcher Improvements

- Added browser-like `User-Agent` and `Accept` headers to the 3GPP metadata fetcher to avoid `403 Forbidden` errors from the 3GPP portal.
- Impacted file: `src/tdoc_crawler/specs/sources/threegpp.py`.
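
The exact header values used by the fetcher are not reproduced here; a typical browser-like header set looks like the following sketch (the values and the `fetch_page` helper are illustrative assumptions):

```python
import urllib.request

# Illustrative browser-like headers; some portals answer 403 Forbidden
# when they see a default HTTP-client User-Agent.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}


def fetch_page(url: str) -> bytes:
    """Fetch a URL with browser-like headers attached (hypothetical helper)."""
    request = urllib.request.Request(url, headers=BROWSER_HEADERS)
    with urllib.request.urlopen(request) as response:
        return response.read()
```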

### 3. WhatTheSpec Fetcher Bug Fixes

- Fixed field mapping for `whatthespec` source: changed `versions` to `vers` and added support for version-specific file names via the `specfile` field.
- Impacted file: `src/tdoc_crawler/specs/sources/whatthespec.py`.

### 4. Spec Catalog Logic Update

- Improved `SpecCatalog` to handle mappings between versions and their specific file names when provided by metadata sources (e.g., `whatthespec`).
- Impacted file: `src/tdoc_crawler/specs/catalog.py`.

### 5. Database Retrieval Robustness

- Fixed a bug in `TDocDatabase` where JSON fields (`metadata_payload`, `versions`) from the `spec_source_records` table were being returned as raw strings instead of parsed dictionaries/lists during model instantiation.
- Impacted file: `src/tdoc_crawler/database/connection.py`.
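
SQLite stores these columns as TEXT, so a fetched row carries JSON strings that must be parsed before the pydantic model is built. The essence of the fix, on a made-up example row:

```python
import json

# Illustrative row as it comes back from SQLite: JSON fields are raw strings.
row_dict = {"metadata_payload": '{"title": "Example spec"}', "versions": '["17.1.0"]'}

# Parse each JSON field, falling back to an empty default on bad data.
for field, default in (("metadata_payload", {}), ("versions", [])):
    value = row_dict.get(field)
    if isinstance(value, str):
        try:
            row_dict[field] = json.loads(value)
        except json.JSONDecodeError:
            row_dict[field] = default
```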

### 6. Official Archive Download Fix

- Added `User-Agent` header to official 3GPP archive downloads (zip files) to prevent `403 Forbidden` errors during file retrieval.
- Impacted file: `src/tdoc_crawler/specs/downloads.py`.

## Verification Results

- `uv run tdoc-crawler open-spec 26132`: Successfully crawls metadata, downloads zip, and opens docx.
- `uv run tdoc-crawler checkout-spec 26.131`: Successfully crawls metadata and downloads spec.
- Database records for specifications are now correctly stored and retrieved with full metadata payloads.
+12 −2
@@ -628,10 +628,15 @@ def checkout_spec(

    effective_checkout_dir = checkout_dir or (cache_dir / "checkout")

+    sources = [
+        FunctionSpecSource("3gpp", fetch_threegpp_metadata),
+        FunctionSpecSource("whatthespec", fetch_whatthespec_metadata),
+    ]
+
    db_path = database_path(cache_dir)
    with TDocDatabase(db_path) as database:
        downloader = SpecDownloads(database)
-        results = downloader.checkout_specs(specs, doc_only, effective_checkout_dir, release)
+        results = downloader.checkout_specs(specs, doc_only, effective_checkout_dir, release, sources=sources)

    # Output formatting
    print_checkout_results(results)
@@ -648,11 +653,16 @@ def open_spec(
    normalized = spec.strip()
    checkout_dir = cache_dir / "checkout"

+    sources = [
+        FunctionSpecSource("3gpp", fetch_threegpp_metadata),
+        FunctionSpecSource("whatthespec", fetch_whatthespec_metadata),
+    ]
+
    db_path = database_path(cache_dir)
    with TDocDatabase(db_path) as database:
        downloader = SpecDownloads(database)
        try:
-            path = downloader.open_spec(normalized, doc_only, checkout_dir, release)
+            path = downloader.open_spec(normalized, doc_only, checkout_dir, release, sources=sources)
            console.print(f"[green]Opening {path}[/green]")
            launch_file(path)
        except Exception as exc:
+29 −7
"""Database access layer backed by pydantic_sqlite."""

+import contextlib
import json
from collections import defaultdict
from collections.abc import Callable, Iterable
@@ -750,15 +751,36 @@ class TDocDatabase:

    def _get_spec_source_record(self, record_id: str) -> SpecificationSourceRecord | None:
        try:
-            record = self.connection.model_from_table("spec_source_records", record_id)  # type: ignore[arg-type]
-            # Handle JSON deserialization for metadata_payload field
-            if record is not None and isinstance(record.metadata_payload, str):
+            # Use raw query to handle JSON deserialization manually before model instantiation
+            cursor = self.connection._db.execute("SELECT * FROM spec_source_records WHERE record_id = ?", (record_id,))
+            row = cursor.fetchone()
+            if row is None:
+                return None
+
+            columns = [description[0] for description in cursor.description]
+            row_dict = dict(zip(columns, row, strict=False))
+
+            # Handle JSON fields
+            if "metadata_payload" in row_dict and isinstance(row_dict["metadata_payload"], str):
                try:
-                    record = record.model_copy(update={"metadata_payload": json.loads(record.metadata_payload)})
+                    row_dict["metadata_payload"] = json.loads(row_dict["metadata_payload"])
                except json.JSONDecodeError:
-                    record = record.model_copy(update={"metadata_payload": {}})
-            return record
-        except KeyError:
+                    row_dict["metadata_payload"] = {}
+
+            if "versions" in row_dict and isinstance(row_dict["versions"], str):
+                try:
+                    row_dict["versions"] = json.loads(row_dict["versions"])
+                except json.JSONDecodeError:
+                    row_dict["versions"] = []
+
+            # Handle datetime deserialization
+            if "fetched_at" in row_dict and isinstance(row_dict["fetched_at"], str):
+                with contextlib.suppress(ValueError, AttributeError):
+                    row_dict["fetched_at"] = datetime.fromisoformat(row_dict["fetched_at"])
+
+            return SpecificationSourceRecord(**row_dict)
+        except Exception as exc:
+            _logger.debug("Error fetching spec source record %s: %s", record_id, exc)
            return None

    def _get_spec_version(self, record_id: str) -> SpecificationVersion | None:
+9 −2
@@ -136,8 +136,15 @@ class SpecCatalog:
                elif aggregated.latest_version is None and candidate.latest_version is not None:
                    aggregated = aggregated.model_copy(update={"latest_version": candidate.latest_version})

-                for version in normalized_versions:
-                    file_name = str(metadata_payload.get("file_name", f"{compact}-unknown.zip"))
+                for i, version in enumerate(normalized_versions):
+                    # Try to get specific file name for this version from payload
+                    file_name = f"{compact}-unknown.zip"
+                    if "specfile" in metadata_payload and isinstance(metadata_payload["specfile"], list):
+                        if i < len(metadata_payload["specfile"]):
+                            file_name = str(metadata_payload["specfile"][i])
+                    elif "file_name" in metadata_payload:
+                        file_name = str(metadata_payload["file_name"])
+
                    spec_versions.append(
                        SpecificationVersion(
                            spec_number=normalized,