Commit 6747fbe4 authored by Jan Reimes's avatar Jan Reimes

feat(docs): add comprehensive documentation for crawling and querying

* Introduce `crawl.md` for crawling metadata from external sources.
* Add `development.md` for setup and contribution guidelines.
* Create `history.md` to log significant changes and improvements.
* Implement `index.md` as the main entry point for documentation.
* Establish `misc.md` for configuration and HTTP caching details.
* Develop `query.md` for querying stored metadata with examples.
* Add `utils.md` for utility commands related to file access and diagnostics.
* Include `whatthespec.md` to explain the primary data source used.
* Update `mkdocs.yml` and `pyproject.toml` to reflect new repository URL.
parent 07f7320b

CONTRIBUTING.md

deleted 100644 → 0 · +0 −126
# Contributing to `tdoc-crawler`

Contributions are welcome, and they are greatly appreciated!
Every little bit helps, and credit will always be given.

You can contribute in many ways:

# Types of Contributions

## Report Bugs

Report bugs at https://github.com/Jan-Reimes_HEAD/tdoc-crawler/issues

If you are reporting a bug, please include:

- Your operating system name and version.
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.

## Fix Bugs

Look through the GitHub issues for bugs.
Anything tagged with "bug" and "help wanted" is open to whoever wants to implement a fix for it.

## Implement Features

Look through the GitHub issues for features.
Anything tagged with "enhancement" and "help wanted" is open to whoever wants to implement it.

## Write Documentation

tdoc-crawler could always use more documentation, whether as part of the official docs, in docstrings, or even on the web in blog posts, articles, and such.

## Submit Feedback

The best way to send feedback is to file an issue at https://github.com/Jan-Reimes_HEAD/tdoc-crawler/issues.

If you are proposing a new feature:

- Explain in detail how it would work.
- Keep the scope as narrow as possible, to make it easier to implement.
- Remember that this is a volunteer-driven project, and that contributions
  are welcome :)

# Get Started!

Ready to contribute? Here's how to set up `tdoc-crawler` for local development.
Please note this documentation assumes you already have `uv` and `Git` installed and ready to go.

1. Fork the `tdoc-crawler` repo on GitHub.

2. Clone your fork locally:

```bash
cd <directory_in_which_repo_should_be_created>
git clone git@github.com:YOUR_NAME/tdoc-crawler.git
```

3. Install the environment. Navigate into the project directory:

```bash
cd tdoc-crawler
```

Then, create the virtual environment and install dependencies with:

```bash
uv sync
```

4. Install pre-commit to run linters/formatters at commit time:

```bash
uv run pre-commit install
```

5. Create a branch for local development:

```bash
git checkout -b name-of-your-bugfix-or-feature
```

Now you can make your changes locally.

6. Don't forget to add test cases for your added functionality to the `tests` directory.

7. When you're done making changes, check that your changes pass the formatting tests.

```bash
make check
```

8. Validate that all unit tests are passing:

```bash
make test
```

9. Before raising a pull request you should also run tox.
   This will run the tests across different versions of Python:

```bash
tox
```

This requires multiple versions of Python to be installed locally.
The same checks run in the CI/CD pipeline, so you may also choose to skip this step locally.

10. Commit your changes and push your branch to GitHub:

```bash
git add .
git commit -m "Your detailed description of your changes."
git push origin name-of-your-bugfix-or-feature
```

11. Submit a pull request through the GitHub website.

# Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

1. The pull request should include tests.

2. If the pull request adds functionality, the docs should be updated.
   Put your new functionality into a function with a docstring, and add the feature to the list in `README.md`.
README.md

+64 −150
# tdoc-crawler

[![Release](https://img.shields.io/github/v/release/Jan-Reimes_HEAD/tdoc-crawler)](https://img.shields.io/github/v/release/Jan-Reimes_HEAD/tdoc-crawler)
[![Build status](https://img.shields.io/github/actions/workflow/status/Jan-Reimes_HEAD/tdoc-crawler/main.yml?branch=main)](https://github.com/Jan-Reimes_HEAD/tdoc-crawler/actions/workflows/main.yml?query=branch%3Amain)
[![codecov](https://codecov.io/gh/Jan-Reimes_HEAD/tdoc-crawler/branch/main/graph/badge.svg)](https://codecov.io/gh/Jan-Reimes_HEAD/tdoc-crawler)
[![Commit activity](https://img.shields.io/github/commit-activity/m/Jan-Reimes_HEAD/tdoc-crawler)](https://img.shields.io/github/commit-activity/m/Jan-Reimes_HEAD/tdoc-crawler)
[![License](https://img.shields.io/github/license/Jan-Reimes_HEAD/tdoc-crawler)](https://img.shields.io/github/license/Jan-Reimes_HEAD/tdoc-crawler)

A command-line tool for crawling the 3GPP FTP server, caching TDoc metadata in a local database, and querying structured data via JSON/YAML output.

- **Github repository**: <https://github.com/Jan-Reimes_HEAD/tdoc-crawler/>
- **Documentation**: <https://Jan-Reimes_HEAD.github.io/tdoc-crawler/>
**Github repository**: <https://forge.3gpp.org/rep/reimes/tdoc-crawler/>

## Features


## Installation

### Install as uv tool (recommended)

```bash
uv tool install https://forge.3gpp.org/rep/reimes/tdoc-crawler.git
uvx tdoc-crawler --help
```

### Using uv

```bash
# Install from PyPI (publication pending)
uv add tdoc-crawler

# Or install from source
git clone https://forge.3gpp.org/rep/reimes/tdoc-crawler.git
cd tdoc-crawler
uv sync
```

### Using pip (not recommended)
```bash
pip install tdoc-crawler
```

(Usage is then the same as above.)

## Configuration

### Environment Variables
```bash
cp .env.example .env
# Edit .env and add your settings:

# ETSI Online (EOL) credentials (optional, for portal authentication)
# Acts as a "3GPP-compatible fallback": whatthespec.net is the primary
# data source, but it is community-maintained. Credentials allow falling
# back to official 3GPP portal endpoints if the primary source is unavailable.
EOL_USERNAME=your_username
EOL_PASSWORD=your_password
```

```bash
export EOL_PASSWORD=your_password
export HTTP_CACHE_TTL=3600
```

**NOTE:** If no credentials are provided, the tool will prompt you interactively when needed.
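The lookup order (environment variable first, interactive prompt as fallback) can be sketched in a few lines. This is a hypothetical helper for illustration, not the tool's actual API:

```python
import os
from getpass import getpass

def resolve_credential(env_var: str, prompt: str) -> str:
    """Return a credential from the environment, prompting interactively as a fallback."""
    value = os.environ.get(env_var)
    if value:
        return value
    # Hypothetical fallback: only ask the user when the variable is unset.
    return getpass(f"{prompt}: ")

# With the variable set, no prompt occurs:
os.environ["EOL_USERNAME"] = "alice"
print(resolve_credential("EOL_USERNAME", "ETSI Online username"))  # → alice
```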

## Quick Start

### Command Reference

| Command | Alias | Purpose |
|---------|-------|---------|
| **Crawling** | | |
| `crawl-meetings` | `cm` | Populate meeting database (Run this first!) |
| `crawl-tdocs` | `ct` | Crawl TDoc metadata from FTP |
| `crawl-specs` | `cs` | Crawl technical specification metadata |
| **Querying** | | |
| `query-meetings` | `qm` | Search and display meeting metadata |
| `query-tdocs` | `qt` | Search for TDocs (auto-fetches if missing) |
| `query-specs` | `qs` | Search technical specifications |
| **Utilities** | | |
| `open` | | Download and open a TDoc |
| `checkout` | | Batch download TDocs to local folder |
| `open-spec` | `os` | Download and open latest spec document |
| `checkout-spec` | `cos` | Batch download technical specifications |
| `stats` | | View database statistics |

### 1. Crawl Metadata

Gather metadata from 3GPP and WhatTheSpec:

```bash
# Populate meeting database (REQUIRED first step)
tdoc-crawler crawl-meetings
```

```bash
# Crawl TDoc metadata (RAN, SA, CT)
tdoc-crawler crawl-tdocs

# Populate spec catalog
tdoc-crawler crawl-specs
```

### 2. Query Metadata

Search and filter stored information:

```bash
# Query a specific TDoc (auto-fetches metadata if missing)
tdoc-crawler query R1-2400001

# Query specifications
tdoc-crawler query-specs 23.501

# List recent meetings
tdoc-crawler query-meetings --limit 10
```

### 3. Utilities & File Access

Open documents, batch download (checkout), and check database status:

```bash
# Download and open a TDoc with system default app
tdoc-crawler open R1-2400001

# Download and open latest version of a spec
tdoc-crawler open-spec 23.501

# Batch download (checkout) TDocs to local folder
tdoc-crawler checkout R1-2400001 S2-2400567

# Batch checkout specifications
tdoc-crawler checkout-spec 26130-26140

# View database statistics
tdoc-crawler stats
```

## Documentation

For detailed documentation, including command deep-dives, configuration, and architecture, see the [Documentation Index](docs/index.md).

## Development

For information on setting up the development environment, running tests, and code quality standards, please refer to the [Development Guide](docs/development.md).

## License

docs/QUICK_REFERENCE.md

deleted 100644 → 0 · +0 −889

File deleted.


docs/crawl.md

0 → 100644 · +105 −0
# Crawling Metadata

Crawling is the process of gathering metadata from external sources (3GPP FTP server, 3GPP Portal, WhatTheSpec) and storing it in your local SQLite database.

## Prerequisites

- **Meetings must be crawled first**: The `crawl-tdocs` command relies on the meetings database to know which meeting directories to visit on the FTP server.

## Commands

### `crawl-meetings` (alias: `cm`)

Crawl meeting metadata from the 3GPP portal.

**When to Use:**

- Initial setup (required before first TDoc crawl).
- Periodic updates to meeting schedules.
- Adding new working groups to crawl.

**Options:**

| Option | Description |
|--------|-------------|
| `-w, --working-group WG` | Working groups to crawl (repeatable). `RAN`, `SA`, `CT`. |
| `-s, --sub-group SG` | Sub-working groups to crawl (repeatable). |
| `--incremental/--full` | Incremental mode skips existing; `--full` forces reprocessing. |
| `--limit-meetings-per-wg N` | Maximum meetings per working group. |
| `--eol-username USER` | ETSI Online account username (faster authenticated access). |

**Examples:**

```bash
# Initial setup
tdoc-crawler crawl-meetings

# Specific working group
tdoc-crawler crawl-meetings -w SA

# Recent meetings only
tdoc-crawler crawl-meetings --limit-meetings-per-wg 10
```

---

### `crawl-tdocs` (alias: `ct`)

Crawl TDoc metadata from the 3GPP FTP server.

**When to Use:**

- Initial TDoc metadata population.
- Incremental updates to your local TDoc database.
- After running `crawl-meetings` to refresh TDoc data.

**Options:**

| Option | Description |
|--------|-------------|
| `-w, --working-group WG` | Working groups to crawl. |
| `-s, --sub-group SG` | Sub-working groups to crawl. |
| `--incremental/--full` | Incremental mode skips existing TDocs. |
| `--workers N` | Number of parallel workers (default: 4). |
| `--checkout` | Download and extract crawled TDocs to checkout folder. |

**Examples:**

```bash
# Crawl all TDocs (after crawl-meetings)
tdoc-crawler crawl-tdocs

# Specific subgroup
tdoc-crawler crawl-tdocs -w RAN -s R1 -s R2

# Faster crawl with more workers
tdoc-crawler crawl-tdocs -w RAN --workers 8
```
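The `--workers` option parallelizes FTP directory traversal. A minimal sketch of that fan-out pattern with a thread pool; `fetch_meeting_dir` is a hypothetical stand-in for the real FTP listing code:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_meeting_dir(meeting_id: int) -> list[str]:
    # Hypothetical stand-in: list the TDoc archives in one meeting directory.
    return [f"R1-24{meeting_id:05d}.zip"]

def crawl_parallel(meeting_ids: list[int], workers: int = 4) -> list[str]:
    """Fan directory listings out across a pool of worker threads, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch_meeting_dir, meeting_ids)
    return [name for listing in results for name in listing]

print(crawl_parallel([1, 2, 3], workers=2))
```

Because `pool.map` preserves input order, results are deterministic even though the listings are fetched concurrently.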

---

### `crawl-specs` (alias: `cs`)

Crawl technical specification (TS/TR) metadata.

**When to Use:**

- Populating the specs catalog for searching/viewing specs.
- Synchronizing latest spec versions and titles.

**Options:**

| Option | Description |
|--------|-------------|
| `-s, --source SOURCE` | Metadata sources: `3gpp`, `whatthespec`. Default: both. |
| `-w, --working-group WG` | Working groups to crawl. |

**Examples:**

```bash
# Crawl all specs from all sources
tdoc-crawler crawl-specs

# Crawl only RAN specs from whatthespec
tdoc-crawler crawl-specs -w RAN -s whatthespec
```
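The `--incremental/--full` switch described for the crawl commands boils down to a skip-if-known check against the local database. A sketch of that logic, under the assumption that `known_ids` stands in for a database lookup (names are illustrative, not the tool's internals):

```python
def select_tdocs_to_crawl(discovered: list[str], known_ids: set[str], full: bool = False) -> list[str]:
    """Incremental mode skips TDocs already in the database; --full reprocesses everything."""
    if full:
        return list(discovered)
    return [tdoc for tdoc in discovered if tdoc not in known_ids]

known = {"R1-2400001"}
print(select_tdocs_to_crawl(["R1-2400001", "R1-2400002"], known))             # incremental: only new IDs
print(select_tdocs_to_crawl(["R1-2400001", "R1-2400002"], known, full=True))  # full: everything
```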

docs/development.md

0 → 100644 · +70 −0
# Development Guide

This guide describes how to set up your environment for contributing to `tdoc-crawler`.

## Setup

### Using uv (recommended)

1. Clone the repository:

   ```bash
   git clone https://forge.3gpp.org/rep/reimes/tdoc-crawler.git
   cd tdoc-crawler
   ```

2. Sync dependencies:

   ```bash
   uv sync --all-extras
   ```

3. Install pre-commit hooks:

   ```bash
   uv run pre-commit install
   ```

## Workflow

### Running Tests

All tests use `pytest`. The project aims for 70%+ coverage.

```bash
# Run all tests
uv run pytest

# Run with coverage report
uv run pytest --cov=tdoc_crawler --cov-report=html

# Run specific test file
uv run pytest tests/test_database.py -v
```

### Code Quality

We use `ruff` for formatting and linting, and `ty` for type checking.

```bash
# Format and lint
uv run ruff format
uv run ruff check --fix

# Type checking
uv run ty check
```

## Documentation Standards

- Always update the relevant pages under `docs/` when adding or changing CLI commands (the former `QUICK_REFERENCE.md` has been split into these sub-docs).
- Write Google-style docstrings for all functions.
- Keep `history/` logs updated for significant changes.

## Architecture Overview

- **`models/`**: Pydantic models (the "source of truth" for data structures).
- **`crawlers/`**: External data acquisition (FTP, Portal, WhatTheSpec).
- **`database.py`**: SQLite/Pydantic-SQLite persistence layer.
- **`cli/`**: Typer-based command definitions.
- **`http_client.py`**: Cached HTTP session management.
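How the models and persistence layers fit together can be sketched as follows. This is an illustrative sketch only: a stdlib `dataclass` stands in for the project's Pydantic models, an in-memory SQLite database stands in for `database.py`, and all names and schemas are hypothetical:

```python
import sqlite3
from dataclasses import dataclass, astuple

@dataclass
class TDoc:  # models/: a typed "source of truth" for one record (illustrative)
    tdoc_id: str
    working_group: str
    title: str

def save(conn: sqlite3.Connection, tdoc: TDoc) -> None:
    # database.py-style typed wrapper: the model maps directly onto a table row.
    conn.execute("INSERT OR REPLACE INTO tdocs VALUES (?, ?, ?)", astuple(tdoc))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tdocs (tdoc_id TEXT PRIMARY KEY, working_group TEXT, title TEXT)")
save(conn, TDoc("R1-2400001", "RAN", "Example contribution"))
row = conn.execute("SELECT title FROM tdocs WHERE tdoc_id = ?", ("R1-2400001",)).fetchone()
print(row[0])  # → Example contribution
```

Keeping the typed model as the single source of truth means crawlers, the persistence layer, and the CLI all agree on field names and types.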