Commit 6747fbe4 authored by Jan Reimes's avatar Jan Reimes

feat(docs): add comprehensive documentation for crawling and querying

* Introduce `crawl.md` for crawling metadata from external sources.
* Add `development.md` for setup and contribution guidelines.
* Create `history.md` to log significant changes and improvements.
* Implement `index.md` as the main entry point for documentation.
* Establish `misc.md` for configuration and HTTP caching details.
* Develop `query.md` for querying stored metadata with examples.
* Add `utils.md` for utility commands related to file access and diagnostics.
* Include `whatthespec.md` to explain the primary data source used.
* Update `mkdocs.yml` and `pyproject.toml` to reflect new repository URL.
parent 07f7320b

CONTRIBUTING.md

deleted 100644 → 0 · +0 −126
# Contributing to `tdoc-crawler`

Contributions are welcome, and they are greatly appreciated!
Every little bit helps, and credit will always be given.

You can contribute in many ways:

# Types of Contributions

## Report Bugs

Report bugs at https://github.com/Jan-Reimes_HEAD/tdoc-crawler/issues

If you are reporting a bug, please include:

- Your operating system name and version.
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.

## Fix Bugs

Look through the GitHub issues for bugs.
Anything tagged with "bug" and "help wanted" is open to whoever wants to implement a fix for it.

## Implement Features

Look through the GitHub issues for features.
Anything tagged with "enhancement" and "help wanted" is open to whoever wants to implement it.

## Write Documentation

tdoc-crawler could always use more documentation, whether as part of the official docs, in docstrings, or even on the web in blog posts, articles, and such.

## Submit Feedback

The best way to send feedback is to file an issue at https://github.com/Jan-Reimes_HEAD/tdoc-crawler/issues.

If you are proposing a new feature:

- Explain in detail how it would work.
- Keep the scope as narrow as possible, to make it easier to implement.
- Remember that this is a volunteer-driven project, and that contributions
  are welcome :)

# Get Started!

Ready to contribute? Here's how to set up `tdoc-crawler` for local development.
Please note this documentation assumes you already have `uv` and `Git` installed and ready to go.

1. Fork the `tdoc-crawler` repo on GitHub.

2. Clone your fork locally:

```bash
cd <directory_in_which_repo_should_be_created>
git clone git@github.com:YOUR_NAME/tdoc-crawler.git
```

3. Install the environment. Navigate into the project directory:

```bash
cd tdoc-crawler
```

Then, create the virtual environment and install dependencies with:

```bash
uv sync
```

4. Install pre-commit to run linters/formatters at commit time:

```bash
uv run pre-commit install
```

5. Create a branch for local development:

```bash
git checkout -b name-of-your-bugfix-or-feature
```

Now you can make your changes locally.

6. Don't forget to add test cases for your added functionality to the `tests` directory.

7. When you're done making changes, check that your changes pass the formatting tests.

```bash
make check
```

8. Validate that all unit tests are passing:

```bash
make test
```

9. Before raising a pull request you should also run tox.
   This will run the tests across different versions of Python:

```bash
tox
```

This requires multiple versions of Python to be installed locally.
The same checks run in the CI/CD pipeline, so you may also choose to skip this step locally.

10. Commit your changes and push your branch to GitHub:

```bash
git add .
git commit -m "Your detailed description of your changes."
git push origin name-of-your-bugfix-or-feature
```

11. Submit a pull request through the GitHub website.

# Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

1. The pull request should include tests.

2. If the pull request adds functionality, the docs should be updated.
   Put your new functionality into a function with a docstring, and add the feature to the list in `README.md`.
README.md

+64 −150
# tdoc-crawler

[![Release](https://img.shields.io/github/v/release/Jan-Reimes_HEAD/tdoc-crawler)](https://img.shields.io/github/v/release/Jan-Reimes_HEAD/tdoc-crawler)
[![Build status](https://img.shields.io/github/actions/workflow/status/Jan-Reimes_HEAD/tdoc-crawler/main.yml?branch=main)](https://github.com/Jan-Reimes_HEAD/tdoc-crawler/actions/workflows/main.yml?query=branch%3Amain)
[![codecov](https://codecov.io/gh/Jan-Reimes_HEAD/tdoc-crawler/branch/main/graph/badge.svg)](https://codecov.io/gh/Jan-Reimes_HEAD/tdoc-crawler)
[![Commit activity](https://img.shields.io/github/commit-activity/m/Jan-Reimes_HEAD/tdoc-crawler)](https://img.shields.io/github/commit-activity/m/Jan-Reimes_HEAD/tdoc-crawler)
[![License](https://img.shields.io/github/license/Jan-Reimes_HEAD/tdoc-crawler)](https://img.shields.io/github/license/Jan-Reimes_HEAD/tdoc-crawler)

A command-line tool for crawling the 3GPP FTP server, caching TDoc metadata in a local database, and querying structured data via JSON/YAML output.

- **Github repository**: <https://github.com/Jan-Reimes_HEAD/tdoc-crawler/>
- **Documentation**: <https://Jan-Reimes_HEAD.github.io/tdoc-crawler/>
**Github repository**: <https://forge.3gpp.org/rep/reimes/tdoc-crawler/>

## Features


## Installation

### Install as uv tool (recommended)

```bash
uv tool install https://forge.3gpp.org/rep/reimes/tdoc-crawler.git
uvx tdoc-crawler --help
```

### Using uv

```bash
# Install from PyPI (publication pending)
uv add tdoc-crawler

# Or install from source
git clone https://forge.3gpp.org/rep/reimes/tdoc-crawler.git
cd tdoc-crawler
uv sync
```

### Using pip (not recommended)
```bash
pip install tdoc-crawler
```

(Usage is then the same as above.)

## Configuration

### Environment Variables
```bash
cp .env.example .env
# Edit .env and add your settings:

# ETSI Online (EOL) credentials (optional, for portal authentication)
# Acts as a "3GPP-compatible fallback": whatthespec.net is the primary
# data source, but it is community-maintained. Credentials allow falling
# back to official 3GPP portal endpoints if the primary source is unavailable.
EOL_USERNAME=your_username
EOL_PASSWORD=your_password
```

```bash
export EOL_PASSWORD=your_password
export HTTP_CACHE_TTL=3600
```

**NOTE:** If no credentials are provided, the tool will prompt you interactively when needed.
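The lookup order (environment variable first, interactive prompt as fallback) can be sketched in a few lines. This is a hypothetical helper for illustration, not the tool's actual API:

```python
import os
from getpass import getpass

def resolve_credential(env_var: str, prompt: str) -> str:
    """Return a credential from the environment, prompting interactively as a fallback."""
    value = os.environ.get(env_var)
    if value:
        return value
    # Hypothetical fallback: only ask the user when the variable is unset.
    return getpass(f"{prompt}: ")

# With the variable set, no prompt occurs:
os.environ["EOL_USERNAME"] = "alice"
print(resolve_credential("EOL_USERNAME", "ETSI Online username"))  # → alice
```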

## Quick Start

### Command Reference

| Command | Alias | Purpose |
|---------|-------|---------|
| **Crawling** | | |
| `crawl-meetings` | `cm` | Populate meeting database (Run this first!) |
| `crawl-tdocs` | `ct` | Crawl TDoc metadata from FTP |
| `crawl-specs` | `cs` | Crawl technical specification metadata |
| **Querying** | | |
| `query-meetings` | `qm` | Search and display meeting metadata |
| `query-tdocs` | `qt` | Search for TDocs (auto-fetches if missing) |
| `query-specs` | `qs` | Search technical specifications |
| **Utilities** | | |
| `open` | | Download and open a TDoc |
| `checkout` | | Batch download TDocs to local folder |
| `open-spec` | `os` | Download and open latest spec document |
| `checkout-spec` | `cos` | Batch download technical specifications |
| `stats` | | View database statistics |

### 1. Crawl Metadata

Gather metadata from 3GPP and WhatTheSpec:

```bash
# Populate meeting database (REQUIRED first step)
tdoc-crawler crawl-meetings
```

```bash
# Crawl TDoc metadata (RAN, SA, CT)
tdoc-crawler crawl-tdocs

# Populate spec catalog
tdoc-crawler crawl-specs
```

### 2. Query Metadata

Search and filter stored information:

```bash
# Query a specific TDoc (auto-fetches metadata if missing)
tdoc-crawler query R1-2400001

# Query specifications
tdoc-crawler query-specs 23.501

# List recent meetings
tdoc-crawler query-meetings --limit 10
```

### 3. Utilities & File Access

Open documents, batch download (checkout), and check database status:

```bash
# Download and open a TDoc with system default app
tdoc-crawler open R1-2400001

# Download and open latest version of a spec
tdoc-crawler open-spec 23.501

# Batch download (checkout) TDocs to local folder
tdoc-crawler checkout R1-2400001 S2-2400567

# Batch checkout specifications
tdoc-crawler checkout-spec 26130-26140

# View database statistics
tdoc-crawler stats
```

## Documentation

For detailed documentation, including command deep-dives, configuration, and architecture, see the [Documentation Index](docs/index.md).

## Development

For information on setting up the development environment, running tests, and code quality standards, please refer to the [Development Guide](docs/development.md).

## License

docs/QUICK_REFERENCE.md

deleted 100644 → 0 · +0 −889

File deleted.


docs/crawl.md

0 → 100644 · +105 −0
# Crawling Metadata

Crawling is the process of gathering metadata from external sources (3GPP FTP server, 3GPP Portal, WhatTheSpec) and storing it in your local SQLite database.

## Prerequisites

- **Meetings must be crawled first**: The `crawl-tdocs` command relies on the meetings database to know which meeting directories to visit on the FTP server.

## Commands

### `crawl-meetings` (alias: `cm`)

Crawl meeting metadata from the 3GPP portal.

**When to Use:**

- Initial setup (required before first TDoc crawl).
- Periodic updates to meeting schedules.
- Adding new working groups to crawl.

**Options:**

| Option | Description |
|--------|-------------|
| `-w, --working-group WG` | Working groups to crawl (repeatable). `RAN`, `SA`, `CT`. |
| `-s, --sub-group SG` | Sub-working groups to crawl (repeatable). |
| `--incremental/--full` | Incremental mode skips existing; `--full` forces reprocessing. |
| `--limit-meetings-per-wg N` | Maximum meetings per working group. |
| `--eol-username USER` | ETSI Online account username (faster authenticated access). |

**Examples:**

```bash
# Initial setup
tdoc-crawler crawl-meetings

# Specific working group
tdoc-crawler crawl-meetings -w SA

# Recent meetings only
tdoc-crawler crawl-meetings --limit-meetings-per-wg 10
```

---

### `crawl-tdocs` (alias: `ct`)

Crawl TDoc metadata from the 3GPP FTP server.

**When to Use:**

- Initial TDoc metadata population.
- Incremental updates to your local TDoc database.
- After running `crawl-meetings` to refresh TDoc data.

**Options:**

| Option | Description |
|--------|-------------|
| `-w, --working-group WG` | Working groups to crawl. |
| `-s, --sub-group SG` | Sub-working groups to crawl. |
| `--incremental/--full` | Incremental mode skips existing TDocs. |
| `--workers N` | Number of parallel workers (default: 4). |
| `--checkout` | Download and extract crawled TDocs to checkout folder. |

**Examples:**

```bash
# Crawl all TDocs (after crawl-meetings)
tdoc-crawler crawl-tdocs

# Specific subgroup
tdoc-crawler crawl-tdocs -w RAN -s R1 -s R2

# Faster crawl with more workers
tdoc-crawler crawl-tdocs -w RAN --workers 8
```
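The `--workers` option parallelizes FTP directory traversal. A minimal sketch of that fan-out pattern with a thread pool; `fetch_meeting_dir` is a hypothetical stand-in for the real FTP listing code:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_meeting_dir(meeting_id: int) -> list[str]:
    # Hypothetical stand-in: list the TDoc archives in one meeting directory.
    return [f"R1-24{meeting_id:05d}.zip"]

def crawl_parallel(meeting_ids: list[int], workers: int = 4) -> list[str]:
    """Fan directory listings out across a pool of worker threads, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch_meeting_dir, meeting_ids)
    return [name for listing in results for name in listing]

print(crawl_parallel([1, 2, 3], workers=2))
```

Because `pool.map` preserves input order, results are deterministic even though the listings are fetched concurrently.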

---

### `crawl-specs` (alias: `cs`)

Crawl technical specification (TS/TR) metadata.

**When to Use:**

- Populating the specs catalog for searching/viewing specs.
- Synchronizing latest spec versions and titles.

**Options:**

| Option | Description |
|--------|-------------|
| `-s, --source SOURCE` | Metadata sources: `3gpp`, `whatthespec`. Default: both. |
| `-w, --working-group WG` | Working groups to crawl. |

**Examples:**

```bash
# Crawl all specs from all sources
tdoc-crawler crawl-specs

# Crawl only RAN specs from whatthespec
tdoc-crawler crawl-specs -w RAN -s whatthespec
```
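The `--incremental/--full` switch described for the crawl commands boils down to a skip-if-known check against the local database. A sketch of that logic, under the assumption that `known_ids` stands in for a database lookup (names are illustrative, not the tool's internals):

```python
def select_tdocs_to_crawl(discovered: list[str], known_ids: set[str], full: bool = False) -> list[str]:
    """Incremental mode skips TDocs already in the database; --full reprocesses everything."""
    if full:
        return list(discovered)
    return [tdoc for tdoc in discovered if tdoc not in known_ids]

known = {"R1-2400001"}
print(select_tdocs_to_crawl(["R1-2400001", "R1-2400002"], known))             # incremental: only new IDs
print(select_tdocs_to_crawl(["R1-2400001", "R1-2400002"], known, full=True))  # full: everything
```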

docs/development.md

0 → 100644 · +70 −0
# Development Guide

This guide describes how to set up your environment for contributing to `tdoc-crawler`.

## Setup

### Using uv (recommended)

1. Clone the repository:

   ```bash
   git clone https://forge.3gpp.org/rep/reimes/tdoc-crawler.git
   cd tdoc-crawler
   ```

2. Sync dependencies:

   ```bash
   uv sync --all-extras
   ```

3. Install pre-commit hooks:

   ```bash
   uv run pre-commit install
   ```

## Workflow

### Running Tests

All tests use `pytest`. The project aims for 70%+ coverage.

```bash
# Run all tests
uv run pytest

# Run with coverage report
uv run pytest --cov=tdoc_crawler --cov-report=html

# Run specific test file
uv run pytest tests/test_database.py -v
```

### Code Quality

We use `ruff` for formatting and linting, and `ty` for type checking.

```bash
# Format and lint
uv run ruff format
uv run ruff check --fix

# Type checking
uv run ty check
```

## Documentation Standards

- Always update the relevant pages under `docs/` when adding or changing CLI commands (the former `QUICK_REFERENCE.md` has been split into these sub-docs).
- Write Google-style docstrings for all functions.
- Keep `history/` logs updated for significant changes.

## Architecture Overview

- **`models/`**: Pydantic models (the "source of truth" for data structures).
- **`crawlers/`**: External data acquisition (FTP, Portal, WhatTheSpec).
- **`database.py`**: SQLite/Pydantic-SQLite persistence layer.
- **`cli/`**: Typer-based command definitions.
- **`http_client.py`**: Cached HTTP session management.
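How the models and persistence layers fit together can be sketched as follows. This is an illustrative sketch only: a stdlib `dataclass` stands in for the project's Pydantic models, an in-memory SQLite database stands in for `database.py`, and all names and schemas are hypothetical:

```python
import sqlite3
from dataclasses import dataclass, astuple

@dataclass
class TDoc:  # models/: a typed "source of truth" for one record (illustrative)
    tdoc_id: str
    working_group: str
    title: str

def save(conn: sqlite3.Connection, tdoc: TDoc) -> None:
    # database.py-style typed wrapper: the model maps directly onto a table row.
    conn.execute("INSERT OR REPLACE INTO tdocs VALUES (?, ?, ?)", astuple(tdoc))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tdocs (tdoc_id TEXT PRIMARY KEY, working_group TEXT, title TEXT)")
save(conn, TDoc("R1-2400001", "RAN", "Example contribution"))
row = conn.execute("SELECT title FROM tdocs WHERE tdoc_id = ?", ("R1-2400001",)).fetchone()
print(row[0])  # → Example contribution
```

Keeping the typed model as the single source of truth means crawlers, the persistence layer, and the CLI all agree on field names and types.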