Commit d0627d1e authored by Jan Reimes

Refactor TDoc Crawler to use HTTP instead of FTP

- Updated TDocCrawler to crawl TDoc metadata from 3GPP HTTP server instead of FTP.
- Replaced FTP connection logic with HTTP requests using the requests library.
- Enhanced error handling for HTTP requests and added retry logic.
- Modified TDoc filename regex to accommodate various formats.
- Updated tests to reflect changes in crawling logic and ensure proper functionality.
- Adjusted sample TDoc URLs in tests to use HTTPS.
- Improved logging and error messages for better debugging.
- Added support for subgroups in the crawl command line interface.
parent 6dbd04cb
+6 −4
@@ -87,9 +87,11 @@ TDocs are stored on the 3GPP web server and are publicly accessible to everyone.
 Note that ...
 
 - `<tdoc_nbr>` is the filename stem of the TDoc file, e.g., `R1-2301234`.
-- the first letter of the TDoc number indicates the working group, e.g., `R` for RAN, `S` for SA, and `T` for CT.
-- Any other files on the FTP server that do not follow this naming convention are not TDocs and should be ignored.
-- `<sub-working_group_identifier>` and `<meeting_identifier>` are just path names and do not have a fixed format or naming convention, they also do not correspond to the official SWG and meeting identifiers (i.e., they are arbitrary path names on the FTP server and thus do not need to be stored in the database).
+- The first letter of the TDoc number indicates the working group, e.g., `R` for RAN, `S` for SA, and `C` for CT. The second letter indicates the sub-working group or plenary, e.g., `1` for RAN1, `4` for SA4, and `P` for plenary. All other characters in the TDoc number can vary, but at least 4 additional arbitrary characters are required.
+- More than 99% of TDocs are in `.zip` format, with only a few rare cases of `.pdf` or `.txt` files.
+- Otherwise, there is no specific format for, or limit on, the number of digits, letters, dashes, etc.
+- Any other files on the server that do not follow this naming convention are not TDocs and should be ignored.
+- `<sub-working_group_identifier>` and `<meeting_identifier>` are just path names and do not have a fixed format or naming convention; they also do not correspond to the official SWG and meeting identifiers (i.e., they are arbitrary path names on the server and thus do not need to be stored in the database).
+- Each WG has multiple SWGs (sub-groups, simply numbered from 1 to n) as well as a so-called "plenary" group, which is not a SWG but just called "plenary". The plenary group usually has the identifier `TSG_<WG>`, e.g., `TSG_RAN` for RAN plenary TDocs.
 
 There are three main working groups in 3GPP that handle TDocs:
@@ -1007,7 +1009,7 @@ The project maintains three levels of documentation:
 **When adding/modifying features:**
 
 1. **Implement** the feature in code
-2. **Create** history file documenting the change in `docs/history/YYYY-MM-DD_SUMMARY_<topic>.md`
+2. **Create** history file documenting the change in `docs/history/YYYY-MM-DD_SUMMARY_<NN>_<topic>.md`. `<NN>` is a sequential number for multiple changes on the same day.
 3. **Update** `docs/QUICK_REFERENCE.md` immediately with the new/changed command documentation
 4. **Verify** README.md still links to QUICK_REFERENCE.md
 5. **Test** that all examples in documentation work correctly
+160 −0
# Add Subgroup Filtering to `crawl` Command

**Date:** October 21, 2025

## Summary

Added the missing `--sub-group` / `-s` option to the `crawl` command to enable filtering TDocs by sub-working groups (e.g., SA4, RAN1, CT Plenary).

## Changes Made

### 1. CLI Update (`src/tdoc_crawler/cli.py`)

**Added parameter to `crawl` command:**
```python
subgroup: list[str] | None = typer.Option(None, "--sub-group", "-s", help="Filter by sub-working group")
```

**Updated config instantiation:**
```python
subgroups = _parse_subgroups(subgroup)  # Parse and normalize aliases
config = TDocCrawlConfig(
    # ...
    subgroups=subgroups,  # Pass to config
    # ...
)
```

### 2. Crawler Logic (`src/tdoc_crawler/crawlers/tdocs.py`)

**Improved `_extract_subgroup` method:**
- Now returns **codes** (R1, S4, RP) instead of full names (RAN1, SA4, RAN Plenary)
- Properly extracts subgroup from FTP directory names:
  - `TSG_RAN` → `RP` (RAN Plenary)
  - `WG4_Codec` → `S4` (SA4)
  - `WG1_RL1` → `R1` (RAN1)

**Added filtering logic in `_process_file`:**
```python
# Filter by subgroups if specified
if config.subgroups is not None and subgroup is not None:
    normalized_subgroup = subgroup.upper().strip()
    if not any(normalized_subgroup == sg.upper().strip() for sg in config.subgroups):
        return  # Skip this TDoc
```

### 3. Test Updates (`tests/test_crawler.py`)

Updated `test_extract_subgroup` to verify correct code extraction:
- `TSG_RAN` → `RP`
- `WG1_RL1` → `R1`
- `WG4_Codec` → `S4`

## Usage Examples

### Filter by specific subgroup
```bash
# Crawl only SA4 TDocs
tdoc-crawler crawl -s S4

# Crawl only RAN1 and RAN2 TDocs
tdoc-crawler crawl -s R1 -s R2

# Crawl only RAN Plenary TDocs
tdoc-crawler crawl -s RP
```

### Filter by working group and subgroup
```bash
# Crawl SA4 TDocs (explicit working group)
tdoc-crawler crawl -w SA -s S4

# Crawl multiple subgroups within SA
tdoc-crawler crawl -w SA -s S1 -s S2 -s S4
```

### Subgroup without working group
The `-s` option works **without** explicit `-w` specification. The crawler:
1. Normalizes the subgroup alias (S4 → S4, RP → RP)
2. Walks all working groups' FTP directories
3. Filters TDocs based on extracted subgroup codes

Example:
```bash
# This works - crawls only S4 TDocs from SA working group
tdoc-crawler crawl -s S4
```

## Supported Aliases

All aliases from `MEETING_CODE_REGISTRY` are supported:

**RAN:**
- `RP` - RAN Plenary
- `R1` - RAN1
- `R2` - RAN2
- `R3` - RAN3
- `R4` - RAN4
- `R5` - RAN5
- `R6` - RAN6

**SA:**
- `SP` - SA Plenary
- `S1` - SA1
- `S2` - SA2
- `S3` - SA3
- `S4` - SA4
- `S5` - SA5
- `S6` - SA6

**CT:**
- `CP` - CT Plenary
- `C1` - CT1
- `C2` - CT2
- `C3` - CT3
- `C4` - CT4
- `C5` - CT5
- `C6` - CT6

## Technical Details

### FTP Directory Mapping

The crawler maps FTP directory names to subgroup codes:

| FTP Directory Pattern | Subgroup Code | Example |
|-----------------------|---------------|---------|
| `TSG_<WG>` | `<W>P` | `TSG_RAN` → `RP` |
| `WG<n>_<name>` | `<W><n>` | `WG4_Codec` → `S4` |

Where `<W>` is the first letter of the working group (R, S, or C).

### Filtering Algorithm

1. User specifies: `-s S4`
2. `_parse_subgroups()` normalizes to `["S4"]`
3. For each TDoc file found:
   - Extract subgroup code from FTP path
   - Compare (case-insensitive): `S4` == `S4` ?
   - If match: collect TDoc
   - If no match: skip TDoc

## Testing

All existing tests pass:
- ✅ `test_extract_subgroup` - Verifies code extraction from FTP paths
- ✅ `test_crawl_collects_tdocs` - Verifies basic crawl functionality
- ✅ Manual verification with `tdoc-crawler crawl -s S4 --help`

## Implementation Notes

1. **Consistency:** Subgroup filtering works the same way in both `crawl` and `query-meetings` commands
2. **Case-insensitive:** Comparisons use `.upper()` for robustness
3. **Multiple subgroups:** Users can specify `-s` multiple times
4. **No validation errors:** Invalid subgroup codes are passed through (crawler simply won't find matching TDocs)

## Related Commands

For reference, these commands already supported subgroup filtering:
- `query-meetings -s <subgroup>` - Query meeting metadata by subgroup
- (Now) `crawl -s <subgroup>` - Crawl TDocs by subgroup ✓
+683 −0

File added.

Preview size limit exceeded, changes collapsed.

+71 −6
@@ -69,7 +69,7 @@ def _parse_working_groups(values: list[str] | None) -> list[WorkingGroup]:


 def _parse_subgroups(values: list[str] | None) -> list[str] | None:
-    """Parse and normalize subgroup names, expanding aliases to canonical names."""
+    """Parse and normalize subgroup aliases to canonical names."""
     from tdoc_crawler.crawlers import normalize_subgroup_alias
 
     if not values:
@@ -83,6 +83,8 @@ def _parse_subgroups(values: list[str] | None) -> list[str] | None:
             raise typer.Exit(code=2)
         resolved.extend(normalized)
 
-    return resolved
 
+    # Remove duplicates while preserving order
+    seen = set()
+    unique_resolved = []
@@ -259,6 +261,11 @@ def _normalize_portal_meeting_name(portal_meeting: str) -> str:
 def _resolve_meeting_id(database: TDocDatabase, meeting_name: str) -> int | None:
     """Resolve meeting name to meeting_id from database.
 
+    Uses fuzzy matching to handle variations in meeting names:
+    - Exact match (case-insensitive)
+    - Normalized name match
+    - Prefix/suffix matching for variations like "SA4-e" vs "3GPPSA4-e"
+
     Args:
         database: Database connection
         meeting_name: Meeting identifier (e.g., "SA4#133-e" or "S4-133-e")
@@ -266,7 +273,7 @@ def _resolve_meeting_id(database: TDocDatabase, meeting_name: str) -> int | None
     Returns:
         Meeting ID if found, None otherwise
     """
-    # Try original name first
+    # Try exact match first (case-insensitive)
     cursor = database.connection.execute(
         "SELECT meeting_id FROM meetings WHERE short_name = ? COLLATE NOCASE",
         (meeting_name,),
@@ -286,6 +293,55 @@ def _resolve_meeting_id(database: TDocDatabase, meeting_name: str) -> int | None
         if row:
             return row[0]
 
+    # Try fuzzy matching with meeting names in database
+    # Use SQL LIKE for better performance
+    candidate_lower = meeting_name.lower()
+    normalized_lower = normalized.lower()
+
+    # Try fuzzy patterns with candidate
+    for pattern in [
+        f"{candidate_lower}%",  # candidate is prefix of cached
+        f"%{candidate_lower}",  # candidate is suffix of cached
+    ]:
+        cursor = database.connection.execute(
+            "SELECT meeting_id FROM meetings WHERE LOWER(short_name) LIKE ? LIMIT 1",
+            (pattern,),
+        )
+        row = cursor.fetchone()
+        if row:
+            return row[0]
+
+    # Try fuzzy patterns with normalized candidate
+    if normalized_lower != candidate_lower:
+        for pattern in [
+            f"{normalized_lower}%",  # normalized is prefix of cached
+            f"%{normalized_lower}",  # normalized is suffix of cached
+        ]:
+            cursor = database.connection.execute(
+                "SELECT meeting_id FROM meetings WHERE LOWER(short_name) LIKE ? LIMIT 1",
+                (pattern,),
+            )
+            row = cursor.fetchone()
+            if row:
+                return row[0]
+
+    # Try reverse patterns: cached is prefix/suffix of candidate
+    cursor = database.connection.execute("SELECT meeting_id, short_name FROM meetings")
+    for meeting_id, cached_name in cursor.fetchall():
+        cached_lower = cached_name.lower()
+        # Check if cached_name is prefix of candidate
+        if candidate_lower.startswith(cached_lower):
+            return meeting_id
+        # Check if cached_name is suffix of candidate
+        if candidate_lower.endswith(cached_lower):
+            return meeting_id
+        # Also check with normalized candidate
+        if normalized_lower != candidate_lower:
+            if normalized_lower.startswith(cached_lower):
+                return meeting_id
+            if normalized_lower.endswith(cached_lower):
+                return meeting_id
+
     return None
 
 
@@ -425,21 +481,23 @@ def _maybe_fetch_missing_tdocs(
 def crawl(
     cache_dir: Path = typer.Option(Path.home() / ".tdoc-crawler", "--cache-dir", "-c", help="Cache directory"),
     working_group: list[str] | None = typer.Option(None, "--working-group", "-w", help="Working groups to crawl"),
+    subgroup: list[str] | None = typer.Option(None, "--sub-group", "-s", help="Filter by sub-working group"),
     incremental: bool = typer.Option(True, "--incremental/--full", help="Toggle incremental mode"),
     limit_tdocs: int | None = typer.Option(None, "--limit-tdocs", help="Limit number of TDocs"),
     limit_meetings: int | None = typer.Option(None, "--limit-meetings", help="Limit meetings considered"),
     limit_meetings_per_wg: int | None = typer.Option(None, "--limit-meetings-per-wg", help="Limit meetings per working group"),
     limit_wgs: int | None = typer.Option(None, "--limit-wgs", help="Limit number of working groups"),
-    max_retries: int = typer.Option(3, "--max-retries", help="FTP retry attempts"),
-    timeout: int = typer.Option(30, "--timeout", help="FTP timeout seconds"),
+    max_retries: int = typer.Option(3, "--max-retries", help="HTTP connection retry attempts"),
+    timeout: int = typer.Option(30, "--timeout", help="HTTP request timeout seconds"),
     verbose: bool = typer.Option(False, "--verbose", "-v", help="Enable verbose logging"),
 ) -> None:
     working_groups = _parse_working_groups(working_group)
+    subgroups = _parse_subgroups(subgroup)
     limits = _build_limits(limit_tdocs, limit_meetings, limit_meetings_per_wg, limit_wgs)
     config = TDocCrawlConfig(
         cache_dir=cache_dir,
         working_groups=working_groups,
-        subgroups=None,
+        subgroups=subgroups,
         meeting_ids=None,
         start_date=None,
         end_date=None,
@@ -455,7 +513,14 @@ def crawl(
     )
 
     database_path = _database_path(config.cache_dir)
-    console.print(f"[cyan]Crawling TDocs for {', '.join(wg.value for wg in working_groups)}[/cyan]")
+
+    # Build descriptive message
+    scope_parts = []
+    if subgroups:
+        scope_parts.append(f"subgroups: {', '.join(subgroups)}")
+    else:
+        scope_parts.append(f"working groups: {', '.join(wg.value for wg in working_groups)}")
+    console.print(f"[cyan]Crawling TDocs ({', '.join(scope_parts)})[/cyan]")
 
     if config.verbose:
         logging.getLogger().setLevel(logging.DEBUG)
+243 −208

File changed.

Preview size limit exceeded, changes collapsed.
