Commit 2fbd01b8 authored by jr2804

docs(README, QUICK_REFERENCE): update usage instructions for crawling meetings and TDocs

- Added `crawl-meetings` command to README and QUICK_REFERENCE.
- Updated examples for crawling meetings and TDocs with new options.
- Clarified environment variable usage for ETSI Online credentials.
- Enhanced command descriptions for better user guidance.
parent 93d10c5f
+66 −15
@@ -33,17 +33,24 @@ git clone https://github.com/Jan-Reimes_HEAD/tdoc-crawler.git
cd tdoc-crawler
uv sync

# Optional: Set up environment variables for ETSI Online credentials
cp .env.example .env
# Edit .env and add your credentials
# Or install it directly as a uv tool:
uv tool install https://github.com/Jan-Reimes_HEAD/tdoc-crawler.git
uvx tdoc-crawler --help





```

### Using pip
### Using pip (not recommended)

```bash
pip install tdoc-crawler
```

(Usage is then the same as above.)

## Configuration

### Environment Variables
@@ -51,8 +58,9 @@ pip install tdoc-crawler
For accessing certain 3GPP resources that require authentication, you can configure ETSI Online (EOL) credentials:

```bash
# Copy the example file
# Optional in general, but required for parsing document metadata: set up ETSI Online credentials
cp .env.example .env
# -> Edit .env and add your credentials

# Edit .env and add your credentials:
# EOL_USERNAME=your_username
@@ -60,25 +68,68 @@ cp .env.example .env
```

Alternatively, you can:
- Pass credentials via CLI options: `--eol-username` and `--eol-password`
- Let the tool prompt you interactively when needed

```bash
# Pass them via CLI options or let the tool prompt you interactively:
uvx tdoc-crawler crawl-meetings --eol-username your_username --eol-password your_password
```

```bash
# Or configure environment variables directly:
export EOL_USERNAME=your_username
export EOL_PASSWORD=your_password
```

... or let the tool prompt you interactively when needed
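
With three ways to supply credentials, it can help to check which one applies before a long crawl. The sketch below is a hypothetical helper, not part of `tdoc-crawler`. It assumes that exported environment variables are consulted first, then a `.env` file, then the interactive prompt; the tool's real precedence is not documented here, and `--eol-username`/`--eol-password` on the command line bypass this check entirely.

```bash
# Hypothetical helper: report where ETSI Online credentials would come from.
# The precedence (environment, then .env, then prompt) is an assumption.
# Secrets are never printed.
eol_credential_source() {
    env_file="${1:-.env}"
    if [ -n "${EOL_USERNAME:-}" ] && [ -n "${EOL_PASSWORD:-}" ]; then
        echo "environment"
    elif [ -f "$env_file" ] \
        && grep -q '^EOL_USERNAME=..*' "$env_file" \
        && grep -q '^EOL_PASSWORD=..*' "$env_file"; then
        echo "dotenv"
    else
        echo "prompt"
    fi
}
```

Calling `eol_credential_source` from the repository root prints `environment`, `dotenv`, or `prompt`; pass a path as the first argument to check a different `.env` file.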

## Quick Start

### 1. Crawl the 3GPP FTP Server
### 1. Crawl Meeting Metadata

First, populate your local database by crawling the 3GPP FTP server:
First, populate the meetings database by crawling the 3GPP portal:

```bash
tdoc-crawler crawl
tdoc-crawler crawl-meetings
```

This will:
- Connect to the 3GPP FTP server
- Retrieve TDoc links from RAN, SA, and CT working groups
- Store metadata in a local SQLite database at `~/.tdoc-crawler/tdocs.db`

### 2. Query TDoc Metadata
- Connect to the 3GPP portal
- Retrieve meeting metadata for RAN, SA, and CT working groups
- Store metadata in a local SQLite database at `~/.tdoc-crawler/tdoc_crawler.db`

Optional filters:

```bash
# Crawl meetings for specific working group
tdoc-crawler crawl-meetings -w SA

# Limit to recent meetings (e.g., 10 per working group)
tdoc-crawler crawl-meetings --limit-meetings-per-wg 10
```
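
To confirm that the crawl wrote something, you can inspect the SQLite file mentioned above. This is a minimal sketch that assumes the `sqlite3` command-line shell is installed. The function name is hypothetical, and because the database schema is not described here, it only lists table names.

```bash
# Hypothetical helper: list the tables in the crawler's SQLite database.
# Pass a different path as the first argument if you used -c/--cache-dir.
list_tdoc_tables() {
    db="${1:-$HOME/.tdoc-crawler/tdoc_crawler.db}"
    if [ ! -f "$db" ]; then
        echo "No database at $db; run 'tdoc-crawler crawl-meetings' first." >&2
        return 1
    fi
    if ! command -v sqlite3 >/dev/null 2>&1; then
        echo "sqlite3 CLI not found on PATH." >&2
        return 2
    fi
    # .tables is a built-in sqlite3 shell command; it assumes nothing about the schema.
    sqlite3 "$db" ".tables"
}
```

Run `list_tdoc_tables` with no arguments to use the default location.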

### 2. Crawl TDoc Metadata

Once meetings are populated, crawl TDocs from the 3GPP FTP server:

```bash
# Crawl all TDocs
tdoc-crawler crawl

# Crawl TDocs for specific working group
tdoc-crawler crawl -w SA

# Crawl TDocs for specific subgroup (e.g., SA4)
tdoc-crawler crawl -w SA -s S4

# Crawl TDocs from meetings in date range
tdoc-crawler crawl -w RAN --start-date 2024-01-01 --end-date 2024-12-31

# Crawl TDocs from specific meeting IDs
tdoc-crawler crawl --meeting-ids 60666 60667
```

### 3. Query TDoc Metadata

Once the database is populated, you can query TDoc information:

@@ -99,7 +150,7 @@ tdoc-crawler query R1-2301234 --format json --output results.json
tdoc-crawler query --working-group SA --format yaml
```

### 3. View Database Statistics
### 4. View Database Statistics

```bash
tdoc-crawler stats
+64 −9
@@ -12,32 +12,87 @@ Single source of truth for the CLI behaviour. All examples assume execution from

## Commands

### `crawl-meetings`

```bash
uv run tdoc-crawler crawl-meetings [OPTIONS]
```

Scrape the 3GPP portal for meeting metadata. **Run this first** to populate the meetings database before crawling TDocs.

**Options:**

- `-c, --cache-dir PATH` – Database and download cache location.
- `-w, --working-group WG` – Repeatable; defaults to all (`RAN`, `SA`, `CT`).
- `--full/--incremental` – Run a full crawl or an incremental one (defaults to incremental).
- `--limit-meetings`, `--limit-meetings-per-wg`, `--limit-wgs` – Throttle traversal.
- `--eol-username`, `--eol-password` – ETSI Online credentials (or use `EOL_USERNAME`/`EOL_PASSWORD` env vars).
- `-v, --verbose` – Emit debug logging.

**Examples:**

```bash
# Crawl all meetings for all working groups (initial setup)
uv run tdoc-crawler crawl-meetings

# Crawl meetings for specific working group
uv run tdoc-crawler crawl-meetings -w SA

# Crawl recent meetings only (limit to 10 per working group)
uv run tdoc-crawler crawl-meetings --limit-meetings-per-wg 10

# Update meetings database incrementally
uv run tdoc-crawler crawl-meetings --incremental
```

### `crawl`

```bash
uv run tdoc-crawler crawl [OPTIONS]
```

Crawl the 3GPP FTP hierarchy for TDocs and upsert metadata.
Crawl the 3GPP FTP hierarchy for TDocs and upsert metadata. **Requires the meetings database** to be populated first via `crawl-meetings`.

**Options:**

- `-c, --cache-dir PATH` – Database and download cache location.
- `-w, --working-group WG` – Repeatable; defaults to all (`RAN`, `SA`, `CT`).
- `--full/--incremental` – Disable incremental mode.
- `-s, --sub-group SG` – Filter by sub-working group (repeatable). Supports aliases like `S4` (SA4), `R1` (RAN1).
- `--full/--incremental` – Run a full crawl or an incremental one (defaults to incremental).
- `--limit-tdocs`, `--limit-meetings`, `--limit-meetings-per-wg`, `--limit-wgs` – Throttle traversal.
- `--start-date`, `--end-date` – ISO 8601 date range filter for meetings (e.g., `2024-01-01`).
- `--meeting-ids` – Specific meeting IDs to crawl (repeatable).
- `--workers` – Number of parallel workers (default: 4).
- `--max-retries`, `--timeout` – FTP connection controls.
- `-v, --verbose` – Emit debug logging.
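
The sub-group aliases accepted by `-s` follow the TDoc prefix letters. The sketch below illustrates the mapping for the two documented examples (`S4` to SA4, `R1` to RAN1). The `C` to `CT` case is an assumption by analogy, and the function is a hypothetical illustration rather than the tool's own resolver.

```bash
# Hypothetical illustration of sub-group alias expansion.
# Documented examples: S4 -> SA4, R1 -> RAN1. The C -> CT case is assumed.
expand_subgroup_alias() {
    case "$1" in
        S[0-9]*) echo "SA${1#S}" ;;
        R[0-9]*) echo "RAN${1#R}" ;;
        C[0-9]*) echo "CT${1#C}" ;;
        *) echo "$1" ;;  # already a full name, or unknown: pass through unchanged
    esac
}
```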

### `crawl-meetings`
**Examples:**

```bash
uv run tdoc-crawler crawl-meetings [OPTIONS]
```
# Crawl all TDocs from all working groups (after crawl-meetings)
uv run tdoc-crawler crawl

# Crawl TDocs for specific working group
uv run tdoc-crawler crawl -w SA

# Crawl TDocs for specific subgroup
uv run tdoc-crawler crawl -w SA -s S4

# Crawl TDocs from multiple subgroups
uv run tdoc-crawler crawl -w RAN -s R1 -s R2

Scrape the 3GPP portal for meeting metadata.
# Crawl TDocs from meetings in date range
uv run tdoc-crawler crawl -w SA --start-date 2024-01-01 --end-date 2024-12-31

- Shares the limiting flags from `crawl` (per-meeting options only).
- Supports ETSI Online credentials via CLI options, environment (`EOL_USERNAME`, `EOL_PASSWORD`), or interactive prompt.
- Uses incremental mode by default to skip previously seen meeting IDs.
# Crawl TDocs from specific meeting IDs
uv run tdoc-crawler crawl --meeting-ids 60666 60667

# Limit crawl to recent meetings and TDocs
uv run tdoc-crawler crawl --limit-meetings-per-wg 5 --limit-tdocs 100

# Crawl with more parallel workers for faster processing
uv run tdoc-crawler crawl -w RAN --workers 8
```
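
Because `crawl` depends on `crawl-meetings` having run first, the two steps can be wrapped together. This is a minimal sketch using only flags listed above. The function name and the `TDOC_CRAWLER` override variable are hypothetical conveniences, not features of the tool.

```bash
# Hypothetical wrapper: refresh meetings, then crawl TDocs, for one working group.
# Usage: crawl_working_group WG [MEETINGS_PER_WG]
crawl_working_group() {
    wg="$1"
    limit="${2:-10}"
    # TDOC_CRAWLER is a made-up override, e.g. "uv run tdoc-crawler".
    # $cli is left unquoted on purpose so a multi-word command splits into words (bash/sh).
    cli="${TDOC_CRAWLER:-tdoc-crawler}"
    # && ensures the TDoc crawl only starts if the meetings crawl succeeded.
    $cli crawl-meetings -w "$wg" --limit-meetings-per-wg "$limit" &&
        $cli crawl -w "$wg" --limit-meetings-per-wg "$limit"
}
```

For example, `crawl_working_group SA 5` limits both steps to five meetings per working group. Setting `TDOC_CRAWLER=echo` prints the two commands instead of running them, which is a cheap way to check the arguments.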

### `query`