Commit 50e75bf3 authored by Jan Reimes

feat(specs): enhance CLI commands and documentation for crawl/query

- Added support for `--output` argument in crawl and query commands.
- Updated CLI documentation to reflect new argument usage.
- Enhanced implementation plan with detailed library boundaries and CLI contract.
- Added new feature requirements for standalone library and output modes.
parent cb3ded1f
+3 −0
@@ -3,6 +3,8 @@
Auto-generated from all feature plans. Last updated: 2026-02-05

## Active Technologies
- Python 3.14 + typer, rich, pydantic, pydantic-sqlite, requests, (001-specs-crawl-query)
- SQLite via pydantic-sqlite (001-specs-crawl-query)

- Python 3.14 + typer, rich, requests, beautifulsoup4, lxml, pydantic, pydantic-sqlite, hishel, zipinspect (001-specs-crawl-query)

@@ -22,6 +24,7 @@ cd src; pytest; ruff check .
Python 3.14: Follow standard conventions

## Recent Changes
- 001-specs-crawl-query: Added Python 3.14 + typer, rich, pydantic, pydantic-sqlite, requests,

- 001-specs-crawl-query: Added Python 3.14 + typer, rich, requests, beautifulsoup4, lxml, pydantic, pydantic-sqlite, hishel, zipinspect

+14 −4
@@ -6,13 +6,16 @@

**Arguments**:

- `--spec`: One or more spec numbers (dotted or undotted).
- `--spec`: One or more spec numbers (dotted or undotted). Use `-` to read from stdin.
- `--spec-file`: Path to a file containing spec numbers (one per line).
- `--release`: Version selector; default `latest`.
- `--output`: Output format (table, json, yaml).

**Behavior**:

- Stores metadata per source and normalized spec numbers.
- If `--release` is not `latest`, the value must match parsed metadata versions.
- Emits a summary of matched specs and source outcomes in the requested output format.
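
The input rules above (repeatable `--spec`, `-` for stdin, and `--spec-file` with one number per line) could be combined roughly as follows. This is a minimal sketch, not the project's implementation: `collect_spec_inputs` is a hypothetical helper, and stripping whitespace, skipping blank lines, and dropping duplicates while keeping order are assumptions that the documentation does not state.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable, TextIO


def collect_spec_inputs(
    specs: Iterable[str],
    spec_file: Path | None = None,
    stdin: TextIO | None = None,
) -> list[str]:
    """Merge spec numbers from --spec values, stdin ("-"), and --spec-file."""
    collected: list[str] = []
    for value in specs:
        if value == "-":
            # "-" means: read one spec number per line from stdin.
            if stdin is not None:
                collected.extend(stdin.read().splitlines())
        else:
            collected.append(value)
    if spec_file is not None:
        collected.extend(spec_file.read_text(encoding="utf-8").splitlines())

    # Assumption: strip whitespace, skip blanks, drop duplicates, keep order.
    seen: set[str] = set()
    result: list[str] = []
    for raw in collected:
        item = raw.strip()
        if item and item not in seen:
            seen.add(item)
            result.append(item)
    return result
```

Passing the stream in as a parameter (instead of reading `sys.stdin` directly) keeps the helper easy to test with `io.StringIO`.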

## query-specs

@@ -20,7 +23,8 @@

**Arguments**:

- `--spec`: One or more spec numbers.
- `--spec`: One or more spec numbers. Use `-` to read from stdin.
- `--spec-file`: Path to a file containing spec numbers (one per line).
- `--working-group`: Filter by working group.
- `--status`: Filter by spec status.
- `--title`: Title keyword filter.
@@ -37,15 +41,18 @@

**Arguments**:

- `--spec`: One or more spec numbers.
- `--spec`: One or more spec numbers. Use `-` to read from stdin.
- `--spec-file`: Path to a file containing spec numbers (one per line).
- `--release`: Version selector; default `latest`.
- `--doc-only/--no-doc-only`: Attempt document-only download from the remote zip.
- `--checkout-dir`: Target checkout base directory.
- `--output`: Output format (table, json, yaml).

**Behavior**:

- Doc-only uses a simple spec-number match to locate `.doc` or `.docx` entries.
- If doc-only fails, warns and falls back to full zip download.
- Emits per-spec checkout results (paths, status, warnings) in the requested output format.
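
The doc-only match and the fallback described above can be illustrated with a small sketch. It makes several assumptions: it uses the standard-library `zipfile` module on an in-memory archive rather than the remote-zip access the project would need, `find_doc_entries` and `select_download` are hypothetical names, and the entry names are invented sample data.

```python
from __future__ import annotations

import io
import zipfile


def find_doc_entries(archive: zipfile.ZipFile, spec_number: str) -> list[str]:
    """Return .doc/.docx entries whose file name contains the undotted spec number."""
    needle = spec_number.replace(".", "")
    matches = []
    for name in archive.namelist():
        base = name.rsplit("/", 1)[-1].lower()
        if needle in base and base.endswith((".doc", ".docx")):
            matches.append(name)
    return matches


def select_download(archive: zipfile.ZipFile, spec_number: str) -> tuple[str, list[str]]:
    """Pick doc-only entries, or fall back to every entry when none match."""
    docs = find_doc_entries(archive, spec_number)
    if docs:
        return "doc-only", docs
    # Fallback: a real caller would log a warning and fetch the full zip.
    return "full-zip", archive.namelist()


# Build a tiny in-memory archive (invented names) to exercise both paths.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("38331-i00.docx", b"spec body")
    zf.writestr("readme.txt", b"notes")

with zipfile.ZipFile(buffer) as zf:
    hit = select_download(zf, "38.331")   # a matching .docx exists
    miss = select_download(zf, "23.501")  # no match, so fall back
```

Returning the chosen mode alongside the entry list lets the caller report a warning in the per-spec results when the fallback was taken.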

## open-spec

@@ -53,11 +60,14 @@

**Arguments**:

- `--spec`: Single spec number.
- `--spec`: Single spec number. Use `-` to read from stdin.
- `--spec-file`: Path to a file containing spec numbers (one per line).
- `--release`: Version selector; default `latest`.
- `--doc-only/--no-doc-only`: Attempt document-only download from the remote zip.
- `--checkout-dir`: Target checkout base directory.
- `--output`: Output format (table, json, yaml).

**Behavior**:

- Uses the system default application to open the document after checkout.
- Emits checkout/open results in the requested output format before launching the document.
+88 −37
# Implementation Plan: Crawl and Query Specs

**Branch**: `001-specs-crawl-query` | **Date**: 2026-02-05 | **Spec**: [specs/001-specs-crawl-query/spec.md](specs/001-specs-crawl-query/spec.md)
**Branch**: `001-specs-crawl-query` | **Date**: 2026-02-05 | **Spec**: [spec](spec.md)
**Input**: Feature specification from `/specs/001-specs-crawl-query/spec.md`

**Note**: This plan follows the updated constitution (library-first, CLI JSON output,
TDD gates, and Python standards).

## Summary

Add spec crawling, querying, checkout, and open commands for 3GPP specifications. Metadata comes from 3GPP.org redirects and whatthespec.net JSON. Downloads support doc-only mode for large zip files, release selection, and fallback to full zips when document extraction fails. New database entities capture spec metadata, source-specific records, versions, and download outcomes.
Implement spec crawling and querying as a standalone library that ingests metadata
from 3GPP.org redirects and whatthespec.net JSON, normalizes spec identifiers, stores
source-attributed records in SQLite, and exposes new CLI commands for crawl/query and
checkout/open with release selection and doc-only downloads.

## Technical Context

**Language/Version**: Python 3.14
**Primary Dependencies**: typer, rich, requests, beautifulsoup4, lxml, pydantic, pydantic-sqlite, hishel, zipinspect
**Primary Dependencies**: typer, rich, pydantic, pydantic-sqlite, requests,
beautifulsoup4, lxml, pandas, python-calamine, xlsxwriter, zipinspect, hishel
**Storage**: SQLite via pydantic-sqlite
**Testing**: pytest
**Target Platform**: Cross-platform CLI (Windows/Linux/macOS)
**Project Type**: single (CLI application)
**Performance Goals**: query exact match <2s for 10k specs; doc-only avoids full zip downloads in most cases
**Constraints**: handle multi-GB zips without full download in doc-only mode; warn/skip on unknown release; preserve archive-style checkout paths
**Scale/Scope**: 10k+ specs with multiple versions and per-source metadata
**Testing**: pytest, pytest-asyncio
**Target Platform**: Cross-platform CLI (Windows, macOS, Linux)
**Project Type**: single
**Performance Goals**: Query known spec in <2s; crawl success >=95% for known specs
**Constraints**: JSON output for all spec commands; no `print`; use `pathlib` and
logging; Ruff and Ty clean; doc-only fallback to full zip on mismatch
**Scale/Scope**: 10k+ specs, four new commands, two metadata sources

## Constitution Check

*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*

- Status: PASS (constitution template not populated; no enforceable principles found)
- Action: run `/speckit.constitution` to define project principles and gates.
- Post-design re-check: PASS (no defined gates to evaluate).
- [x] Library-first boundary documented (standalone library + integration points).
- [x] CLI contract defined (text input/output + JSON mode).
- [x] TDD evidence planned (tests written/approved + red phase before implementation).
- [x] Python standards planned (type hints, logging, uv/pyproject, Ruff, Ty, pathlib,
  dataclasses where appropriate, Typer CLI).

## Library Boundary & CLI Contract

**Library boundary**: Introduce `tdoc_crawler/specs/` as the standalone library layer.
It encapsulates parsing/normalization, source fetching, and download orchestration.

**Public API (library)**:

- `SpecCatalog.crawl_specs(spec_numbers: list[str], release: str, sources: list[SpecSource])`
- `SpecCatalog.query_specs(filters: SpecQueryFilters, release: str)`
- `SpecDownloads.checkout_specs(specs: list[SpecRef], doc_only: bool, checkout_dir: Path)`
- `SpecDownloads.open_spec(spec: SpecRef, doc_only: bool, checkout_dir: Path)`
- `normalize_spec_number(value: str) -> SpecNumber`
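
As an illustration of the last entry, dotted and undotted inputs could be normalized along these lines. The plan does not spell out the rules, so this sketch assumes a dotted canonical form (two-digit series, then a two- or three-digit number, with an optional `-N` part suffix) and uses a `NewType` alias as a placeholder for whatever `SpecNumber` actually is in the codebase.

```python
import re
from typing import NewType

# Placeholder type: the real SpecNumber may be a richer model.
SpecNumber = NewType("SpecNumber", str)

# Assumed shape: SS.NNN or SSNNN, optionally followed by "-<part>".
_SPEC_RE = re.compile(r"^(\d{2})\.?(\d{2,3})(-\d+)?$")


def normalize_spec_number(value: str) -> SpecNumber:
    """Return the dotted form of a spec number, e.g. '38331' becomes '38.331'."""
    match = _SPEC_RE.match(value.strip())
    if match is None:
        raise ValueError(f"Not a recognizable spec number: {value!r}")
    series, number, part = match.groups()
    return SpecNumber(f"{series}.{number}{part or ''}")
```

Raising `ValueError` on unrecognized input is one option; returning `None` and letting the caller warn and skip would also fit the plan.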

**CLI contract**: Add commands in `tdoc_crawler/cli/app.py` with arguments wired via
`tdoc_crawler/cli/args.py` using Typer `Annotated` patterns. Input supports `--spec`,
`--spec-file`, or stdin (`--spec -`). Output defaults to Rich tables and supports
`--output json|yaml` for structured output. Errors are reported on stderr.
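
The output side of this contract might be dispatched as in the sketch below. It is deliberately simplified and only an assumption about how the modes could be wired: `emit` is a hypothetical function, a plain-text table stands in for the Rich table, and the `yaml` mode is left out because it would need a third-party serializer. Unsupported formats are reported on stderr with a non-zero return code.

```python
from __future__ import annotations

import json
import sys
from typing import TextIO


def emit(rows: list[dict[str, str]], output: str = "table", stream: TextIO | None = None) -> int:
    """Write rows in the requested format; return a process-style exit code."""
    out = stream if stream is not None else sys.stdout
    if output == "json":
        out.write(json.dumps(rows, indent=2) + "\n")
        return 0
    if output == "table":
        if not rows:
            return 0
        headers = list(rows[0])
        # Column width is the longest of the header and all cell values.
        widths = [max(len(h), *(len(str(r[h])) for r in rows)) for h in headers]
        for cells in [headers, *([str(r[h]) for h in headers] for r in rows)]:
            line = "  ".join(c.ljust(w) for c, w in zip(cells, widths))
            out.write(line.rstrip() + "\n")
        return 0
    # Errors go to stderr so structured output on stdout stays parseable.
    print(f"unsupported output format: {output}", file=sys.stderr)
    return 2
```

Keeping errors off stdout means a consumer can pipe `--output json` straight into a JSON parser without filtering.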

## TDD Evidence

Tests are written and approved before implementation. The red phase is validated by
running the new tests and confirming failure before any production code changes.

Planned tests (initial red phase):

- `tests/test_specs_sources.py`: 3GPP redirect parsing and whatthespec JSON mapping
- `tests/test_specs_normalization.py`: dotted vs undotted normalization rules
- `tests/test_specs_database.py`: upsert/query behavior for new spec tables
- `tests/test_specs_downloads.py`: doc-only selection and fallback to full zip
- `tests/test_specs_cli.py`: CLI parsing, JSON output, and stdin/file input

## Project Structure

@@ -33,40 +74,50 @@ Add spec crawling, querying, checkout, and open commands for 3GPP specifications

```text
specs/001-specs-crawl-query/
├── plan.md              # This file (/speckit.plan command output)
├── research.md          # Phase 0 output (/speckit.plan command)
├── data-model.md        # Phase 1 output (/speckit.plan command)
├── quickstart.md        # Phase 1 output (/speckit.plan command)
├── contracts/           # Phase 1 output (/speckit.plan command)
└── tasks.md             # Phase 2 output (/speckit.tasks command - NOT created by /speckit.plan)
├── plan.md
├── research.md
├── data-model.md
├── quickstart.md
├── contracts/
└── tasks.md
```

### Source Code (repository root)

```text
src/
├── tdoc_crawler/
│   ├── cli/
│   ├── crawlers/
│   ├── database/
│   ├── models/
│   ├── checkout.py
│   ├── fetching.py
│   ├── http_client.py
│   └── credentials.py
└── pool_executors/
src/tdoc_crawler/
├── cli/
│   ├── app.py
│   └── args.py
├── crawlers/
│   └── specs.py
├── database/
│   └── connection.py
├── models/
│   └── specs.py
├── specs/
│   ├── __init__.py
│   ├── catalog.py
│   ├── downloads.py
│   ├── normalization.py
│   ├── query.py
│   └── sources/
│       ├── threegpp.py
│       └── whatthespec.py
└── checkout.py

tests/
├── test_cli.py
├── test_crawler.py
├── test_database.py
├── test_http_client.py
├── test_targeted_fetch.py
└── ...
├── test_specs_cli.py
├── test_specs_database.py
├── test_specs_downloads.py
├── test_specs_normalization.py
└── test_specs_sources.py
```

**Structure Decision**: Single CLI project; extend `src/tdoc_crawler` modules and tests.
**Structure Decision**: Single project. New spec functionality lives in
`tdoc_crawler/specs/` with data models in `tdoc_crawler/models/specs.py` and database
integration in `tdoc_crawler/database/connection.py`.

## Complexity Tracking

No constitution violations identified.
No constitution violations required for this feature.
+4 −0
@@ -112,6 +112,10 @@ As a maintainer, I want to see when metadata differs between 3GPP.org and whatth
- **FR-020**: The system MUST provide a `query-specs` command to search by spec number, title keyword, working group, and status.
- **FR-021**: The system MUST support querying a spec with multiple source records and expose any differences.
- **FR-022**: The system MUST record crawl and download outcomes, including success or failure, for auditing and troubleshooting.
- **FR-023**: Feature functionality MUST be implemented as a standalone library module before CLI integration.
- **FR-024**: The CLI MUST support text output and a JSON output mode for structured results; errors go to stderr.
- **FR-025**: Unit tests MUST be written, user-approved, and verified failing before implementation begins.
- **FR-026**: Python implementations MUST use `pyproject.toml` with `uv`, include type hints and Google-style docstrings for public code, use `logging` instead of `print`, rely on `pathlib` for file paths, and keep Ruff and Ty checks clean without suppressions unless explicitly approved.

### Key Entities *(include if feature involves data)*