Commit c4ddce12 authored by Jan Reimes

feat(cli): align --clear-tdocs, --clear-specs and --checkout across query/crawl commands (tdc-dxd)

parent 2e343869
+1 −3
@@ -35,12 +35,10 @@ beads.right.meta.json
# Sync state (local-only, per-machine)
# These files are machine-specific and should not be shared across clones
.sync.lock
.jsonl.lock
sync_base.jsonl
export-state/

# Process semaphore slot files (runtime concurrency limiting)
sem/

# NOTE: Do NOT add negation patterns (e.g., !issues.jsonl) here.
# They would override fork protection in .git/info/exclude, allowing
# contributors to accidentally commit upstream issue databases.
+17 −0
@@ -128,6 +128,8 @@ Crawl TDoc metadata from the 3GPP FTP server. TDocs are fetched based on the mee
| `-s, --sub-group SG` | Sub-working groups to crawl (repeatable). Example: `SA4`, `RAN1` |
| `--incremental/--full` | Incremental mode skips existing TDocs; `--full` forces reprocessing |
| `--clear-tdocs` | Clear all TDocs before crawling (start fresh) |
| `--clear-specs` | Clear all specs before crawling |
| `--checkout` | Download and extract crawled TDocs to checkout folder |
| `--limit-tdocs N` | Maximum number of TDocs to process |
| `--limit-meetings N` | Maximum number of meetings to consider |
| `--limit-meetings-per-wg N` | Maximum meetings per working group |
@@ -192,6 +194,9 @@ Crawl meeting metadata from the 3GPP portal. This command is required before cra
| `-s, --sub-group SG` | Sub-working groups to crawl (repeatable) |
| `--incremental/--full` | Incremental mode skips existing meetings; `--full` forces reprocessing |
| `--clear-db` | Clear all data (meetings and TDocs) before crawling |
| `--clear-tdocs` | Clear all TDocs before crawling |
| `--clear-specs` | Clear all specs before crawling |
| `--checkout` | Download and extract meeting TDocs to checkout folder |
| `--limit-meetings N` | Maximum number of meetings overall |
| `--limit-meetings-per-wg N` | Maximum meetings per working group |
| `--limit-wgs N` | Maximum number of working groups |
@@ -246,6 +251,9 @@ Crawl normalized technical specification (TS/TR) metadata from both 3GPP and com
| `-w, --working-group WG` | Working groups to crawl (repeatable) |
| `-s, --source SOURCE` | Metadata sources to use (`3gpp`, `whatthespec`). Default: both |
| `--full` | Force update of existing records |
| `--clear-tdocs` | Clear all TDocs before crawling |
| `--clear-specs` | Clear all specs before crawling |
| `--checkout` | Download and extract crawled specs to checkout folder |
| `-v, --verbose` | Enable verbose logging |

**Examples:**
@@ -290,6 +298,9 @@ Query TDoc metadata from the local database. If specific TDoc IDs are requested
|--------|-------------|
| `-c, --cache-dir PATH` | Database cache location (default: `~/.tdoc-crawler`) |
| `-w, --working-group WG` | Filter by working group (repeatable) |
| `--clear-tdocs` | Clear all TDocs before querying |
| `--clear-specs` | Clear all specs before querying |
| `--checkout` | Download and extract queried TDocs to checkout folder |
| `-o, --output FORMAT` | Output format: `table`, `json`, `yaml` (default: `table`) |
| `-l, --limit N` | Maximum number of results |
| `--order ORDER` | Sort order: `asc` or `desc` (default: `desc`) |
@@ -366,6 +377,9 @@ Query technical specification metadata from the local catalog. Supports filterin
| `-c, --cache-dir PATH` | Database cache location (default: `~/.tdoc-crawler`) |
| `-w, --group WG` | Filter by working group (repeatable) |
| `-s, --status STATUS` | Filter by status (e.g., `Under change control`) |
| `--clear-tdocs` | Clear all TDocs before querying |
| `--clear-specs` | Clear all specs before querying |
| `--checkout` | Download and extract queried specs to checkout folder |
| `-o, --output FORMAT` | Output format: `table`, `json`, `yaml` (default: `table`) |
| `-v, --verbose` | Show per-source discrepancy details |

@@ -410,6 +424,9 @@ Query meeting metadata from the local database. Unlike `query-tdocs`, this comma
| `-c, --cache-dir PATH` | Database cache location (default: `~/.tdoc-crawler`) |
| `-w, --working-group WG` | Filter by working group (repeatable). Supports aliases: `RP`, `SP`, `CP` |
| `-s, --sub-group SG` | Filter by sub-working group (repeatable) |
| `--clear-tdocs` | Clear all TDocs before querying |
| `--clear-specs` | Clear all specs before querying |
| `--checkout` | Download and extract meeting TDocs to checkout folder |
| `-o, --output FORMAT` | Output format: `table`, `json`, `yaml` (default: `table`) |
| `-l, --limit N` | Maximum number of results |
| `--order ORDER` | Sort order: `asc` or `desc` (default: `desc`) |
+41 −0
# Summary - Align CLI options across query/crawl commands (tdc-dxd)

Aligned `--clear-tdocs`, `--clear-specs`, and `--checkout` options across all query and crawl commands in the CLI to provide a consistent user experience and more granular data management.

## Changes

### CLI Layer

- **New Options**: Added `--clear-tdocs`, `--clear-specs`, and `--checkout` to `crawl-tdocs`, `crawl-meetings`, `crawl-specs`, `query-tdocs`, `query-meetings`, and `query-specs`.
- **Granular Clearing**:
  - `crawl-meetings` now supports `--clear-db` (everything), `--clear-tdocs` (only tdocs), and `--clear-specs` (only specs).
  - `crawl-tdocs` and `crawl-specs` support their respective clear flags.
  - All query commands now support clearing data before execution if requested.
- **Checkout Integration**: All crawl/query commands now support a `--checkout` flag that automatically downloads and extracts results to the `checkout` directory after the main operation.
- **Helper Functions**:
  - `_clear_checkout_tdocs`: Clears TDoc files from the checkout directory while preserving specifications.
  - `_clear_checkout_specs`: Clears specification files from the checkout directory.
  - `_checkout_tdocs`, `_checkout_specs`, `_checkout_meeting_tdocs`: Standardized helper functions for batch checkout operations.
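
A minimal sketch of how helpers like `_clear_checkout_tdocs` and `_clear_checkout_specs` might separate the two kinds of data. The directory layout (a `Specs` subfolder inside the checkout directory) and the function bodies are assumptions for illustration; they are not the code from this commit.

```python
import shutil
from pathlib import Path

# Assumption: specs live under <checkout>/Specs; everything else is TDoc data.
SPECS_DIRNAME = "Specs"


def clear_checkout_tdocs(checkout_dir: Path) -> int:
    """Remove every top-level entry except the specs folder; return the count removed."""
    removed = 0
    if not checkout_dir.is_dir():
        return removed
    for entry in checkout_dir.iterdir():
        if entry.name == SPECS_DIRNAME:
            continue  # preserve checked-out specifications
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


def clear_checkout_specs(checkout_dir: Path) -> bool:
    """Remove only the specs folder; return True if it existed."""
    specs_dir = checkout_dir / SPECS_DIRNAME
    if specs_dir.is_dir():
        shutil.rmtree(specs_dir)
        return True
    return False
```

Keeping the two helpers separate mirrors the granular flags: `--clear-tdocs` would call only the first, `--clear-specs` only the second.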

### Database Layer

- **`TDocDatabase`**:
  - Added `clear_specs()` method to selectively clear spec-related tables (`specs`, `spec_versions`, `spec_source_records`, `spec_downloads`).
  - Ensured `clear_tdocs()` and `clear_meetings()` are available for granular clearing.
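
The selective clearing described above could look roughly like the following. The table names come from this summary; the use of `sqlite3`, the free-function form, and the return value are assumptions made for this sketch rather than details of `TDocDatabase`.

```python
import sqlite3

# Table names are taken from the summary; child tables are listed first so the
# order would also be safe if foreign keys were enforced (an assumption).
SPEC_TABLES = ("spec_downloads", "spec_source_records", "spec_versions", "specs")


def clear_specs(conn: sqlite3.Connection) -> int:
    """Delete all rows from the spec tables and return how many rows were removed."""
    removed = 0
    with conn:  # single transaction: either every table is cleared or none
        for table in SPEC_TABLES:
            removed += conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
            conn.execute(f"DELETE FROM {table}")
    return removed
```

Tables outside `SPEC_TABLES`, such as those holding TDocs or meetings, are left untouched, which is what distinguishes this from a `--clear-db` style wipe.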

### Models and Protocols

- **`SpecQueryResult`**: Added `source_differences` field to store metadata discrepancies between sources.
- **`SpecSource`**: Updated protocol to use `@property` for `name`.
- **`SpecQueryFilters`**: Added `SpecQueryFilters` dataclass for structured spec querying.
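
As an illustration of these model changes, here is a small sketch of a protocol with a `name` property next to two dataclasses. All field names other than `source_differences` and `name`, the shape of the differences mapping, and the `ThreeGppSource` class are hypothetical. Note the explicit `field` import: omitting it produces the kind of `NameError` listed under Bug Fixes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class SpecSource(Protocol):
    """Structural type: any object exposing a read-only `name` qualifies."""

    @property
    def name(self) -> str: ...


@dataclass
class SpecQueryFilters:
    # Hypothetical filter fields; mutable defaults need field(default_factory=...).
    spec_numbers: list[str] = field(default_factory=list)
    status: str | None = None


@dataclass
class SpecQueryResult:
    spec_number: str
    # Hypothetical shape: metadata key -> {source name -> value reported by it}.
    source_differences: dict[str, dict[str, str]] = field(default_factory=dict)


class ThreeGppSource:
    """Hypothetical source; satisfies SpecSource without inheriting from it."""

    @property
    def name(self) -> str:
        return "3gpp"


def describe(source: SpecSource) -> str:
    return f"source={source.name}"
```

Declaring `name` as a property in the protocol means implementations can expose it as a read-only attribute, which a plain `name: str` member would not allow under a strict type checker.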

### Bug Fixes

- Fixed `NameError: name 'field' is not defined` in `src/tdoc_crawler/models/specs.py`.
- Corrected type hints in `src/tdoc_crawler/cli/args.py` by declaring the option aliases with the `type` statement (PEP 695).

## Verification Results

- All existing tests passed (217 passed).
- Manual audit confirmed all query/crawl commands in `app.py` implement the new options.
- Documentation in `QUICK_REFERENCE.md` updated to reflect the new CLI structure.
+268 −25

File changed.

Preview size limit exceeded, changes collapsed.

+39 −57
@@ -7,77 +7,59 @@ from typing import Annotated

import typer

CacheDirOption = Annotated[Path, typer.Option("--cache-dir", "-c", help="Cache directory")]
WorkingGroupOption = Annotated[list[str] | None, typer.Option("--working-group", "-w", help="Filter by working group")]
SubgroupOption = Annotated[list[str] | None, typer.Option("--sub-group", "-s", help="Filter by sub-working group")]
IncrementalOption = Annotated[bool, typer.Option("--incremental/--full", help="Toggle incremental mode")]
ClearTDocsOption = Annotated[bool, typer.Option("--clear-tdocs", help="Clear all TDocs before crawling")]
ClearDbOption = Annotated[bool, typer.Option("--clear-db", help="Clear all meetings and TDocs before crawling")]
LimitTDocsOption = Annotated[int | None, typer.Option("--limit-tdocs", help="Limit number of TDocs")]
LimitMeetingsOption = Annotated[int | None, typer.Option("--limit-meetings", help="Limit meetings overall")]
LimitMeetingsPerWgOption = Annotated[int | None, typer.Option("--limit-meetings-per-wg", help="Limit meetings per working group")]
LimitWgsOption = Annotated[int | None, typer.Option("--limit-wgs", help="Limit number of working groups")]
WorkersOption = Annotated[int, typer.Option("--workers", help="Number of parallel subinterpreter workers")]
OverallTimeoutOption = Annotated[
type CacheDirOption = Annotated[Path, typer.Option("--cache-dir", "-c", help="Cache directory")]
type WorkingGroupOption = Annotated[list[str] | None, typer.Option("--working-group", "-w", help="Filter by working group")]
type SubgroupOption = Annotated[list[str] | None, typer.Option("--sub-group", "-s", help="Filter by sub-working group")]
type IncrementalOption = Annotated[bool, typer.Option("--incremental/--full", help="Toggle incremental mode")]
type ClearTDocsOption = Annotated[bool, typer.Option("--clear-tdocs", help="Clear all TDocs before crawling")]
type ClearSpecsOption = Annotated[bool, typer.Option("--clear-specs", help="Clear all specs before crawling")]
type ClearDbOption = Annotated[bool, typer.Option("--clear-db", help="Clear all meetings and TDocs before crawling")]
type CheckoutOption = Annotated[bool, typer.Option("--checkout", help="Download and extract metadata results to checkout folder")]
type LimitTDocsOption = Annotated[int | None, typer.Option("--limit-tdocs", help="Limit number of TDocs")]
type LimitMeetingsOption = Annotated[int | None, typer.Option("--limit-meetings", help="Limit meetings overall")]
type LimitMeetingsPerWgOption = Annotated[int | None, typer.Option("--limit-meetings-per-wg", help="Limit meetings per working group")]
type LimitWgsOption = Annotated[int | None, typer.Option("--limit-wgs", help="Limit number of working groups")]
type WorkersOption = Annotated[int, typer.Option("--workers", help="Number of parallel subinterpreter workers")]
type OverallTimeoutOption = Annotated[
    int | None,
    typer.Option("--overall-timeout", help="Maximum total crawl duration in seconds (None = unlimited)"),
]
MaxRetriesOption = Annotated[int, typer.Option("--max-retries", help="HTTP retry attempts")]
TimeoutOption = Annotated[int, typer.Option("--timeout", help="HTTP timeout seconds")]
VerboseOption = Annotated[bool, typer.Option("--verbose", "-v", help="Enable verbose logging")]
type MaxRetriesOption = Annotated[int, typer.Option("--max-retries", help="HTTP retry attempts")]
type TimeoutOption = Annotated[int, typer.Option("--timeout", help="HTTP timeout seconds")]
type VerboseOption = Annotated[bool, typer.Option("--verbose", "-v", help="Enable verbose logging")]

TDocIdsArgument = Annotated[list[str] | None, typer.Argument(help="TDoc identifiers to query")]
OutputFormatOption = Annotated[str, typer.Option("--output", "-o", help="Output format")]
type TDocIdsArgument = Annotated[list[str] | None, typer.Argument(help="TDoc identifiers to query")]
type OutputFormatOption = Annotated[str, typer.Option("--output", "-o", help="Output format")]

# New options for TDoc fetching
FullMetadataOption = Annotated[bool, typer.Option("--full-metadata", help="Fetch full metadata instead of URL only")]
UseWhatTheSpecOption = Annotated[bool, typer.Option("--use-whatthespec", help="Use WhatTheSpec API for fetching")]
WorkingGroupOption = Annotated[list[str] | None, typer.Option("--working-group", "-w", help="Filter by working group")]
SubgroupOption = Annotated[list[str] | None, typer.Option("--sub-group", "-s", help="Filter by sub-working group")]
IncrementalOption = Annotated[bool, typer.Option("--incremental/--full", help="Toggle incremental mode")]
ClearTDocsOption = Annotated[bool, typer.Option("--clear-tdocs", help="Clear all TDocs before crawling")]
ClearDbOption = Annotated[bool, typer.Option("--clear-db", help="Clear all meetings and TDocs before crawling")]
LimitTDocsOption = Annotated[int | None, typer.Option("--limit-tdocs", help="Limit number of TDocs")]
LimitMeetingsOption = Annotated[int | None, typer.Option("--limit-meetings", help="Limit meetings overall")]
LimitMeetingsPerWgOption = Annotated[int | None, typer.Option("--limit-meetings-per-wg", help="Limit meetings per working group")]
LimitWgsOption = Annotated[int | None, typer.Option("--limit-wgs", help="Limit number of working groups")]
WorkersOption = Annotated[int, typer.Option("--workers", help="Number of parallel subinterpreter workers")]
OverallTimeoutOption = Annotated[
    int | None,
    typer.Option("--overall-timeout", help="Maximum total crawl duration in seconds (None = unlimited)"),
]
MaxRetriesOption = Annotated[int, typer.Option("--max-retries", help="HTTP retry attempts")]
TimeoutOption = Annotated[int, typer.Option("--timeout", help="HTTP timeout seconds")]
VerboseOption = Annotated[bool, typer.Option("--verbose", "-v", help="Enable verbose logging")]

TDocIdsArgument = Annotated[list[str] | None, typer.Argument(help="TDoc identifiers to query")]
OutputFormatOption = Annotated[str, typer.Option("--output", "-o", help="Output format")]
LimitOption = Annotated[int | None, typer.Option("--limit", "-l", help="Maximum number of rows")]
OrderOption = Annotated[str, typer.Option("--order", help="Sort order (asc|desc)")]
StartDateOption = Annotated[str | None, typer.Option("--start-date", help="Filter from ISO timestamp")]
EndDateOption = Annotated[str | None, typer.Option("--end-date", help="Filter until ISO timestamp")]
NoFetchOption = Annotated[
type FullMetadataOption = Annotated[bool, typer.Option("--full-metadata", help="Fetch full metadata instead of URL only")]
type UseWhatTheSpecOption = Annotated[bool, typer.Option("--use-whatthespec", help="Use WhatTheSpec API for fetching")]
type LimitOption = Annotated[int | None, typer.Option("--limit", "-l", help="Maximum number of rows")]
type OrderOption = Annotated[str, typer.Option("--order", help="Sort order (asc|desc)")]
type StartDateOption = Annotated[str | None, typer.Option("--start-date", help="Filter from ISO timestamp")]
type EndDateOption = Annotated[str | None, typer.Option("--end-date", help="Filter until ISO timestamp")]
type NoFetchOption = Annotated[
    bool,
    typer.Option("--no-fetch", help="Disable automatic fetching of missing TDocs from portal"),
]
EolUsernameOption = Annotated[str | None, typer.Option("--eol-username", help="ETSI Online account username")]
EolPasswordOption = Annotated[str | None, typer.Option("--eol-password", help="ETSI Online account password")]
PromptCredentialsOption = Annotated[
type EolUsernameOption = Annotated[str | None, typer.Option("--eol-username", help="ETSI Online account username")]
type EolPasswordOption = Annotated[str | None, typer.Option("--eol-password", help="ETSI Online account password")]
type PromptCredentialsOption = Annotated[
    bool | None,
    typer.Option("--prompt-credentials/--no-prompt-credentials", help="Prompt for credentials when missing"),
]
IncludeWithoutFilesOption = Annotated[
type IncludeWithoutFilesOption = Annotated[
    bool,
    typer.Option("--include-without-files", help="Include meetings without files URLs"),
]

TDocIdArgument = Annotated[str, typer.Argument(help="TDoc identifier to download and open")]
CheckoutTDocIdsArgument = Annotated[list[str], typer.Argument(help="TDoc identifier(s) to checkout")]
ForceOption = Annotated[bool, typer.Option("--force", "-f", help="Re-download even if already checked out")]
type TDocIdArgument = Annotated[str, typer.Argument(help="TDoc identifier to download and open")]
type CheckoutTDocIdsArgument = Annotated[list[str], typer.Argument(help="TDoc identifier(s) to checkout")]
type ForceOption = Annotated[bool, typer.Option("--force", "-f", help="Re-download even if already checked out")]

SpecOption = Annotated[list[str] | None, typer.Option("--spec", help="Spec number(s) (dotted or undotted)")]
SpecArgument = Annotated[list[str] | None, typer.Argument(help="Spec number(s) to query (dotted or undotted)")]
SpecFileOption = Annotated[Path | None, typer.Option("--spec-file", help="File with spec numbers")]
ReleaseOption = Annotated[str, typer.Option("--release", help="Spec release selector")]
DocOnlyOption = Annotated[bool, typer.Option("--doc-only/--no-doc-only", help="Attempt document-only download")]
CheckoutDirOption = Annotated[Path | None, typer.Option("--checkout-dir", help="Spec checkout base directory")]
type SpecOption = Annotated[list[str] | None, typer.Option("--spec", help="Spec number(s) (dotted or undotted)")]
type SpecArgument = Annotated[list[str] | None, typer.Argument(help="Spec number(s) to query (dotted or undotted)")]
type SpecFileOption = Annotated[Path | None, typer.Option("--spec-file", help="File with spec numbers")]
type ReleaseOption = Annotated[str, typer.Option("--release", help="Spec release selector")]
type DocOnlyOption = Annotated[bool, typer.Option("--doc-only/--no-doc-only", help="Attempt document-only download")]
type CheckoutDirOption = Annotated[Path | None, typer.Option("--checkout-dir", help="Spec checkout base directory")]