Commit d0627d1e authored by Jan Reimes

Refactor TDoc Crawler to use HTTP instead of FTP

- Updated TDocCrawler to crawl TDoc metadata from 3GPP HTTP server instead of FTP.
- Replaced FTP connection logic with HTTP requests using the requests library.
- Enhanced error handling for HTTP requests and added retry logic.
- Modified TDoc filename regex to accommodate various formats.
- Updated tests to reflect changes in crawling logic and ensure proper functionality.
- Adjusted sample TDoc URLs in tests to use HTTPS.
- Improved logging and error messages for better debugging.
- Added support for subgroups in the crawl command line interface.
parent 6dbd04cb
+6 −4
@@ -87,9 +87,11 @@ TDocs are stored on the 3GPP web server and are publicly accessible to everyone.
 Note that ...
 
 - `<tdoc_nbr>` is the filename stem of the TDoc file, e.g., `R1-2301234`.
-- the first letter of the TDoc number indicates the working group, e.g., `R` for RAN, `S` for SA, and `T` for CT.
-- Any other files on the FTP server that do not follow this naming convention are not TDocs and should be ignored.
-- `<sub-working_group_identifier>` and `<meeting_identifier>` are just path names and do not have a fixed format or naming convention, they also do not correspond to the official SWG and meeting identifiers (i.e., they are arbitrary path names on the FTP server and thus do not need to be stored in the database).
+- The first letter of the TDoc number indicates the working group, e.g., `R` for RAN, `S` for SA, and `C` for CT. The second letter indicates the sub-working group or plenary, e.g., `1` for RAN1, `4` for SA4, and `P` for plenary. All other characters in the TDoc number can vary, but at least 4 additional arbitrary characters are required.
+- More than 99% of TDocs are in `.zip` format, with only a few rare cases of `.pdf` or `.txt` files.
+- Otherwise, there is no specific format for, or limit on, the number of digits, letters, dashes, etc.
+- Any other files on the server that do not follow this naming convention are not TDocs and should be ignored.
+- `<sub-working_group_identifier>` and `<meeting_identifier>` are just path names and do not have a fixed format or naming convention; they also do not correspond to the official SWG and meeting identifiers (i.e., they are arbitrary path names on the server and thus do not need to be stored in the database).
+- Each WG has multiple SWGs (sub-groups, simply numbered from 1 to n) as well as a so-called "plenary" group, which is not a SWG but just called "plenary". The plenary group usually has the identifier `TSG_<WG>`, e.g., `TSG_RAN` for RAN plenary TDocs.
 
 There are three main working groups in 3GPP that handle TDocs:
@@ -1007,7 +1009,7 @@ The project maintains three levels of documentation:
 **When adding/modifying features:**
 
 1. **Implement** the feature in code
-2. **Create** history file documenting the change in `docs/history/YYYY-MM-DD_SUMMARY_<topic>.md`
+2. **Create** history file documenting the change in `docs/history/YYYY-MM-DD_SUMMARY_<NN>_<topic>.md`. `<NN>` is a sequential number for multiple changes on the same day.
 3. **Update** `docs/QUICK_REFERENCE.md` immediately with the new/changed command documentation
 4. **Verify** README.md still links to QUICK_REFERENCE.md
 5. **Test** that all examples in documentation work correctly
+160 −0
# Add Subgroup Filtering to `crawl` Command

**Date:** October 21, 2025

## Summary

Added the missing `--sub-group` / `-s` option to the `crawl` command to enable filtering TDocs by sub-working groups (e.g., SA4, RAN1, CT Plenary).

## Changes Made

### 1. CLI Update (`src/tdoc_crawler/cli.py`)

**Added parameter to `crawl` command:**
```python
subgroup: list[str] | None = typer.Option(None, "--sub-group", "-s", help="Filter by sub-working group")
```

**Updated config instantiation:**
```python
subgroups = _parse_subgroups(subgroup)  # Parse and normalize aliases
config = TDocCrawlConfig(
    # ...
    subgroups=subgroups,  # Pass to config
    # ...
)
```

### 2. Crawler Logic (`src/tdoc_crawler/crawlers/tdocs.py`)

**Improved `_extract_subgroup` method:**
- Now returns **codes** (R1, S4, RP) instead of full names (RAN1, SA4, RAN Plenary)
- Properly extracts subgroup from FTP directory names:
  - `TSG_RAN` → `RP` (RAN Plenary)
  - `WG4_Codec` → `S4` (SA4)
  - `WG1_RL1` → `R1` (RAN1)

**Added filtering logic in `_process_file`:**
```python
# Filter by subgroups if specified
if config.subgroups is not None and subgroup is not None:
    normalized_subgroup = subgroup.upper().strip()
    if not any(normalized_subgroup == sg.upper().strip() for sg in config.subgroups):
        return  # Skip this TDoc
```

### 3. Test Updates (`tests/test_crawler.py`)

Updated `test_extract_subgroup` to verify correct code extraction:
- `TSG_RAN` → `RP`
- `WG1_RL1` → `R1`
- `WG4_Codec` → `S4`

## Usage Examples

### Filter by specific subgroup
```bash
# Crawl only SA4 TDocs
tdoc-crawler crawl -s S4

# Crawl only RAN1 and RAN2 TDocs
tdoc-crawler crawl -s R1 -s R2

# Crawl only RAN Plenary TDocs
tdoc-crawler crawl -s RP
```

### Filter by working group and subgroup
```bash
# Crawl SA4 TDocs (explicit working group)
tdoc-crawler crawl -w SA -s S4

# Crawl multiple subgroups within SA
tdoc-crawler crawl -w SA -s S1 -s S2 -s S4
```

### Subgroup without working group
The `-s` option works **without** explicit `-w` specification. The crawler:
1. Normalizes the subgroup alias (S4 → S4, RP → RP)
2. Walks all working groups' FTP directories
3. Filters TDocs based on extracted subgroup codes

Example:
```bash
# This works - crawls only S4 TDocs from SA working group
tdoc-crawler crawl -s S4
```

## Supported Aliases

All aliases from `MEETING_CODE_REGISTRY` are supported:

**RAN:**
- `RP` - RAN Plenary
- `R1` - RAN1
- `R2` - RAN2
- `R3` - RAN3
- `R4` - RAN4
- `R5` - RAN5
- `R6` - RAN6

**SA:**
- `SP` - SA Plenary
- `S1` - SA1
- `S2` - SA2
- `S3` - SA3
- `S4` - SA4
- `S5` - SA5
- `S6` - SA6

**CT:**
- `CP` - CT Plenary
- `C1` - CT1
- `C2` - CT2
- `C3` - CT3
- `C4` - CT4
- `C5` - CT5
- `C6` - CT6

## Technical Details

### FTP Directory Mapping

The crawler maps FTP directory names to subgroup codes:

| FTP Directory Pattern | Subgroup Code | Example |
|-----------------------|---------------|---------|
| `TSG_<WG>` | `<W>P` | `TSG_RAN` → `RP` |
| `WG<n>_<name>` | `<W><n>` | `WG4_Codec` → `S4` |

Where `<W>` is the first letter of the working group (R, S, or C).

### Filtering Algorithm

1. User specifies: `-s S4`
2. `_parse_subgroups()` normalizes to `["S4"]`
3. For each TDoc file found:
   - Extract subgroup code from FTP path
   - Compare (case-insensitive): `S4` == `S4` ?
   - If match: collect TDoc
   - If no match: skip TDoc

## Testing

All existing tests pass:
- ✅ `test_extract_subgroup` - Verifies code extraction from FTP paths
- ✅ `test_crawl_collects_tdocs` - Verifies basic crawl functionality
- ✅ Manual verification with `tdoc-crawler crawl -s S4 --help`

## Implementation Notes

1. **Consistency:** Subgroup filtering works the same way in both `crawl` and `query-meetings` commands
2. **Case-insensitive:** Comparisons use `.upper()` for robustness
3. **Multiple subgroups:** Users can specify `-s` multiple times
4. **No validation errors:** Invalid subgroup codes are passed through (crawler simply won't find matching TDocs)

## Related Commands

For reference, these commands already supported subgroup filtering:
- `query-meetings -s <subgroup>` - Query meeting metadata by subgroup
- (Now) `crawl -s <subgroup>` - Crawl TDocs by subgroup ✓
+683 −0

File added.

Preview size limit exceeded, changes collapsed.

+71 −6
@@ -69,7 +69,7 @@ def _parse_working_groups(values: list[str] | None) -> list[WorkingGroup]:


 def _parse_subgroups(values: list[str] | None) -> list[str] | None:
-    """Parse and normalize subgroup names, expanding aliases to canonical names."""
+    """Parse and normalize subgroup aliases to canonical names."""
     from tdoc_crawler.crawlers import normalize_subgroup_alias
 
     if not values:
@@ -83,6 +83,8 @@ def _parse_subgroups(values: list[str] | None) -> list[str] | None:
             raise typer.Exit(code=2)
         resolved.extend(normalized)
 
-    return resolved
 
+    # Remove duplicates while preserving order
+    seen = set()
+    unique_resolved = []
@@ -259,6 +261,11 @@ def _normalize_portal_meeting_name(portal_meeting: str) -> str:
 def _resolve_meeting_id(database: TDocDatabase, meeting_name: str) -> int | None:
     """Resolve meeting name to meeting_id from database.
 
+    Uses fuzzy matching to handle variations in meeting names:
+    - Exact match (case-insensitive)
+    - Normalized name match
+    - Prefix/suffix matching for variations like "SA4-e" vs "3GPPSA4-e"
+
     Args:
         database: Database connection
         meeting_name: Meeting identifier (e.g., "SA4#133-e" or "S4-133-e")
@@ -266,7 +273,7 @@ def _resolve_meeting_id(database: TDocDatabase, meeting_name: str) -> int | None
     Returns:
         Meeting ID if found, None otherwise
     """
-    # Try original name first
+    # Try exact match first (case-insensitive)
     cursor = database.connection.execute(
         "SELECT meeting_id FROM meetings WHERE short_name = ? COLLATE NOCASE",
         (meeting_name,),
@@ -286,6 +293,55 @@ def _resolve_meeting_id(database: TDocDatabase, meeting_name: str) -> int | None
         if row:
             return row[0]
 
+    # Try fuzzy matching with meeting names in database
+    # Use SQL LIKE for better performance
+    candidate_lower = meeting_name.lower()
+    normalized_lower = normalized.lower()
+
+    # Try fuzzy patterns with candidate
+    for pattern in [
+        f"{candidate_lower}%",  # candidate is prefix of cached
+        f"%{candidate_lower}",  # candidate is suffix of cached
+    ]:
+        cursor = database.connection.execute(
+            "SELECT meeting_id FROM meetings WHERE LOWER(short_name) LIKE ? LIMIT 1",
+            (pattern,),
+        )
+        row = cursor.fetchone()
+        if row:
+            return row[0]
+
+    # Try fuzzy patterns with normalized candidate
+    if normalized_lower != candidate_lower:
+        for pattern in [
+            f"{normalized_lower}%",  # normalized is prefix of cached
+            f"%{normalized_lower}",  # normalized is suffix of cached
+        ]:
+            cursor = database.connection.execute(
+                "SELECT meeting_id FROM meetings WHERE LOWER(short_name) LIKE ? LIMIT 1",
+                (pattern,),
+            )
+            row = cursor.fetchone()
+            if row:
+                return row[0]
+
+    # Try reverse patterns: cached is prefix/suffix of candidate
+    cursor = database.connection.execute("SELECT meeting_id, short_name FROM meetings")
+    for meeting_id, cached_name in cursor.fetchall():
+        cached_lower = cached_name.lower()
+        # Check if cached_name is prefix of candidate
+        if candidate_lower.startswith(cached_lower):
+            return meeting_id
+        # Check if cached_name is suffix of candidate
+        if candidate_lower.endswith(cached_lower):
+            return meeting_id
+        # Also check with normalized candidate
+        if normalized_lower != candidate_lower:
+            if normalized_lower.startswith(cached_lower):
+                return meeting_id
+            if normalized_lower.endswith(cached_lower):
+                return meeting_id
+
     return None
 
 
@@ -425,21 +481,23 @@ def _maybe_fetch_missing_tdocs(
 def crawl(
     cache_dir: Path = typer.Option(Path.home() / ".tdoc-crawler", "--cache-dir", "-c", help="Cache directory"),
     working_group: list[str] | None = typer.Option(None, "--working-group", "-w", help="Working groups to crawl"),
+    subgroup: list[str] | None = typer.Option(None, "--sub-group", "-s", help="Filter by sub-working group"),
     incremental: bool = typer.Option(True, "--incremental/--full", help="Toggle incremental mode"),
     limit_tdocs: int | None = typer.Option(None, "--limit-tdocs", help="Limit number of TDocs"),
     limit_meetings: int | None = typer.Option(None, "--limit-meetings", help="Limit meetings considered"),
     limit_meetings_per_wg: int | None = typer.Option(None, "--limit-meetings-per-wg", help="Limit meetings per working group"),
     limit_wgs: int | None = typer.Option(None, "--limit-wgs", help="Limit number of working groups"),
-    max_retries: int = typer.Option(3, "--max-retries", help="FTP retry attempts"),
-    timeout: int = typer.Option(30, "--timeout", help="FTP timeout seconds"),
+    max_retries: int = typer.Option(3, "--max-retries", help="HTTP connection retry attempts"),
+    timeout: int = typer.Option(30, "--timeout", help="HTTP request timeout seconds"),
     verbose: bool = typer.Option(False, "--verbose", "-v", help="Enable verbose logging"),
 ) -> None:
     working_groups = _parse_working_groups(working_group)
+    subgroups = _parse_subgroups(subgroup)
     limits = _build_limits(limit_tdocs, limit_meetings, limit_meetings_per_wg, limit_wgs)
     config = TDocCrawlConfig(
         cache_dir=cache_dir,
         working_groups=working_groups,
-        subgroups=None,
+        subgroups=subgroups,
         meeting_ids=None,
         start_date=None,
         end_date=None,
@@ -455,7 +513,14 @@ def crawl(
     )
 
     database_path = _database_path(config.cache_dir)
-    console.print(f"[cyan]Crawling TDocs for {', '.join(wg.value for wg in working_groups)}[/cyan]")
+
+    # Build descriptive message
+    scope_parts = []
+    if subgroups:
+        scope_parts.append(f"subgroups: {', '.join(subgroups)}")
+    else:
+        scope_parts.append(f"working groups: {', '.join(wg.value for wg in working_groups)}")
+    console.print(f"[cyan]Crawling TDocs ({', '.join(scope_parts)})[/cyan]")
 
     if config.verbose:
         logging.getLogger().setLevel(logging.DEBUG)
+243 −208

File changed.

Preview size limit exceeded, changes collapsed.
