Commit acff0dbd authored by Jan Reimes's avatar Jan Reimes

feat(crawler): enhance progress tracking and subgroup alias normalization

- Updated progress callback to provide accurate completion metrics
- Improved progress bar visibility during TDoc and meeting crawls
- Normalized subgroup aliases to support both short and long formats
- Ensured consistent behavior across crawling commands
parent 8b3bd0cd
+80 −22
@@ -147,37 +147,59 @@ Crawling meetings (working groups: RAN, SA, CT)

## 🔧 Technical Implementation

### Progress Callback Pattern (Updated)

**Evolution:** Originally used `Callable[[], None]`; now uses `Callable[[float, float], None]`

**Updated Implementation:**

```python
# In database/connection.py
def bulk_upsert_meetings(
    self,
    meetings: Iterable[MeetingMetadata],
    progress_callback: Callable[[float, float], None] | None = None,
) -> tuple[int, int]:
    meetings_list = list(meetings)
    total = float(len(meetings_list))

    for index, meeting in enumerate(meetings_list, start=1):
        created, changed = self.upsert_meeting(meeting)
        # ... counting logic ...
        if progress_callback:
            progress_callback(float(index), total)
    return inserted, updated

# In cli/app.py
with Progress(
    SpinnerColumn(),
    TextColumn("[progress.description]{task.description}"),
    BarColumn(),
    MofNCompleteColumn(),
    console=console,
) as progress:
    task = progress.add_task("[cyan]Crawling...", total=100)

    def update_progress(completed: float, total: float) -> None:
        progress.update(task, completed=completed, total=total)

    result = crawler.crawl(config, progress_callback=update_progress)
```

**Key Improvements:**

1. **Accurate Progress Bar**: Callback now receives `(completed, total)` instead of just being invoked
2. **Visual Feedback**: Shows actual progress bar with "N/M" format (e.g., "365/365")
3. **Database-Level Tracking**: Progress tracked during database insertion, not collection
4. **Consistent Pattern**: Both `crawl-meetings` and `crawl-tdocs` use identical approach

**Benefits:**

- Crawler remains UI-agnostic (no Rich dependency in crawler layer)
- Callback can be omitted for non-interactive use cases
- Easy to test (mock callback, verify the reported `(completed, total)` values)
- Progress updates reflect actual work done (database operations completed)
- Users see deterministic progress bar instead of indeterminate spinner

### Raw SQL Execution with pydantic_sqlite

@@ -302,11 +324,47 @@ No migration required. Existing workflows continue unchanged. New features are o

None discovered during implementation or testing.

## 🔄 Additional Improvements (Same Session)

### Progress Bar Enhancement: Database-Level Tracking

**Issue:** Initial implementation tracked progress during collection phase, but total count was unknown at callback time, resulting in indeterminate progress bars.

**Solution:** Moved progress tracking from collection to database insertion phase:

1. **Database Methods Updated**:
   - `bulk_upsert_meetings()` and `bulk_upsert_tdocs()` now accept `Callable[[float, float], None]`
   - Convert iterable to list to determine total count before processing
   - Call callback with `(completed, total)` after each database operation

2. **Crawler Methods Updated**:
   - `MeetingCrawler.crawl()` signature changed to use new callback type
   - `TDocCrawler.crawl()` signature changed to use new callback type
   - Removed progress callbacks from internal methods (`_crawl_meeting`, `_scan_directory_for_tdocs`)

3. **CLI Progress Bars**:
   - Added `BarColumn` and `MofNCompleteColumn` for visual progress
   - Shows "N/M" format (e.g., "365/365 meetings", "1745/1745 TDocs")
   - Callback updates task with exact completed/total values

**Result:** Users now see accurate progress bars showing exactly how many items have been processed and how many remain.

**Example Output:**

```text
Crawling meetings (working groups: RAN, SA, CT)
⠋ Crawling... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 365/365 meetings

Crawling TDocs from 5 meetings
⠋ Crawling... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1745/1745 TDocs
```

## 🚀 Future Enhancements

**Potential Improvements:**

1. Add `--clear-tdocs-by-wg` for selective clearing per working group
2. ~~Add progress bar to `crawl-meetings` command (similar pattern)~~ ✅ Implemented
3. Add confirmation prompt for destructive clear operations
4. Add `--dry-run` flag to preview what would be cleared
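The `--dry-run` idea (item 4) would typically compute the affected set without executing deletes. A minimal sketch, assuming a hypothetical `clear_tdocs` helper and placeholder TDoc IDs, not the project's actual CLI wiring:

```python
import argparse

def clear_tdocs(tdoc_ids: list[str], dry_run: bool = False) -> int:
    """Return how many TDocs were (or would be) cleared."""
    if dry_run:
        for tdoc_id in tdoc_ids:
            print(f"[dry-run] would clear {tdoc_id}")
        return len(tdoc_ids)
    # ... actual DELETE statements would run here ...
    return len(tdoc_ids)

parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true",
                    help="Preview what would be cleared without deleting")
args = parser.parse_args(["--dry-run"])
assert clear_tdocs(["S4-000001", "R1-000002"], dry_run=args.dry_run) == 2
```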

+99 −0
@@ -183,10 +183,109 @@ All 63 tests pass with no changes required:
   - `crawl_meetings()`: Reordered parsing to call `parse_subgroups()` before `parse_working_groups()`
   - `crawl_tdocs()`: Same reordering for consistency

## 🔄 Additional Improvements (Same Session)

### Subgroup Alias Normalization Support

**Issue:** Users reported that full subgroup names (SA4, RAN1, CT3) were not recognized; only short codes (S4, R1, C3) worked.

**Example:**
```bash
# This worked
uv run tdoc-crawler crawl-meetings -s S4 --limit-meetings 5

# This did NOT work (0 meetings processed)
uv run tdoc-crawler crawl-meetings -s SA4 --limit-meetings 5
```

**Root Cause:** The `normalize_subgroup_alias()` function in `src/tdoc_crawler/crawlers/meetings.py` only checked for exact matches with short codes stored in `MEETING_CODE_REGISTRY`. The registry stores codes like "S4", "R1", "C3" (not "SA4", "RAN1", "CT3").

**Solution:** Enhanced `normalize_subgroup_alias()` to normalize long format to short format before matching:

```python
def normalize_subgroup_alias(alias: str) -> list[str]:
    """Normalize subgroup aliases to canonical names.
    Supports: S4, SA4, R1, RAN1, C3, CT3, RP, SP, CP, etc.
    """
    alias_upper = alias.strip().upper()
    matches: list[str] = []

    # Normalize long form to short form
    normalized_alias = alias_upper
    if alias_upper.startswith("SA") and len(alias_upper) > 2:
        normalized_alias = "S" + alias_upper[2:]  # SA4 → S4
    elif alias_upper.startswith("RAN") and len(alias_upper) > 3:
        normalized_alias = "R" + alias_upper[3:]  # RAN1 → R1
    elif alias_upper.startswith("CT") and len(alias_upper) > 2:
        normalized_alias = "C" + alias_upper[2:]  # CT1 → C1

    # Check all registries for matches
    for _working_group, codes in MEETING_CODE_REGISTRY.items():
        for code, subgroup in codes:
            if code.upper() == normalized_alias or code.upper() == alias_upper:
                if subgroup:
                    matches.append(subgroup)
            elif subgroup and subgroup.upper() == alias_upper:
                matches.append(subgroup)

    if not matches:
        matches.append(alias_upper)
    return matches
```

**Normalization Rules:**

- SA4 → S4, SA1 → S1, SA2 → S2, etc.
- RAN1 → R1, RAN2 → R2, RAN3 → R3, etc.
- CT1 → C1, CT2 → C2, CT3 → C3, etc.
- Preserves short forms: S4 → S4, R1 → R1, C3 → C3
- Handles plenary codes: SP, RP, CP (unchanged)
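The prefix rules above can be isolated into a small pure helper. A sketch with a hypothetical name `to_short_code`; the real `normalize_subgroup_alias()` additionally matches the result against `MEETING_CODE_REGISTRY`:

```python
def to_short_code(alias: str) -> str:
    """Map long subgroup prefixes to short ones: SA→S, RAN→R, CT→C."""
    a = alias.strip().upper()
    if a.startswith("SA") and len(a) > 2:
        return "S" + a[2:]   # SA4 → S4
    if a.startswith("RAN") and len(a) > 3:
        return "R" + a[3:]   # RAN1 → R1
    if a.startswith("CT") and len(a) > 2:
        return "C" + a[2:]   # CT3 → C3
    return a                 # S4, R1, SP, RP, CP pass through unchanged

assert to_short_code("SA4") == "S4"
assert to_short_code("RAN1") == "R1"
assert to_short_code("ct3") == "C3"
assert to_short_code("SP") == "SP"
```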

**Behavior After Fix:**

| Command | Result |
|---------|--------|
| `-s SA4` | ✅ Works (normalizes to S4) |
| `-s RAN1` | ✅ Works (normalizes to R1) |
| `-s CT3` | ✅ Works (normalizes to C3) |
| `-s S4` | ✅ Still works (already canonical) |
| `-s R2` | ✅ Still works |
| `-s SP` | ✅ Works (plenary code) |

**Query Behavior:**

When querying meetings with normalized aliases, the database stores and returns canonical short codes:

```bash
# Input: SA4, Output: meetings with subgroup "S4"
uv run tdoc-crawler query-meetings -s SA4 --include-without-files

# Output shows:
│ S4-141      │ SA │ S4 │ 2027-11-15 - 2027-11-19 │
│ S4-140-bis  │ SA │ S4 │ 2027-10-11 - 2027-10-15 │
```

**Testing:**

- Manual testing confirmed all formats work correctly
- All 63 tests passing
- Updated `test_query_meetings_with_subgroup_filter` to expect normalized value `["S4"]` instead of input value `["SA4"]`

**Files Modified:**

3. **`src/tdoc_crawler/crawlers/meetings.py`**:
   - Enhanced `normalize_subgroup_alias()` with long-to-short format transformation
   - Added normalization logic for SA→S, RAN→R, CT→C prefixes

4. **`tests/test_cli.py`**:
   - Updated `test_query_meetings_with_subgroup_filter` assertion to expect normalized subgroup code

## Migration Notes

No migration needed: this is a bug fix that makes the behavior more intuitive. Users who were working around the bug by using `-w` explicitly can continue to do so.

**For subgroup aliases:** Both short and long formats now work identically. Internal database storage remains in canonical short format (S4, R1, C3) for consistency.

---

**Status:** ✅ Implemented and tested
+6 −9
@@ -109,25 +109,22 @@ def crawl_tdocs(
        crawl_id = database.log_crawl_start("tdoc", config.working_groups, config.incremental)

        # Create progress bar for TDoc crawling
        with Progress(
            SpinnerColumn(),
            TextColumn("[progress.description]{task.description}"),
            BarColumn(),
            MofNCompleteColumn(),
            console=console,
        ) as progress:
            # Add progress task (total will be set by callback during DB insertion)
            task = progress.add_task(
                "[cyan]Crawling TDocs...",
                total=100,  # Initial placeholder, will be updated by callback
            )

            # Define progress callback that receives completed and total
            def update_progress(completed: float, total: float) -> None:
                progress.update(task, completed=completed, total=total)

            # Run crawl with progress callback
            result = crawler.crawl(config, progress_callback=update_progress)
+25 −3
@@ -77,21 +77,43 @@ def normalize_subgroup_alias(alias: str) -> list[str]:
    """Normalize subgroup aliases to canonical names.

    Returns a list of possible matching subgroup names.
    Supports:
    - Short codes: S4, R1, C3
    - Full names: SA4, RAN1, CT3
    - Plenary codes: RP, SP, CP
    - Full plenary names: RAN PLENARY, SA PLENARY, CT PLENARY
    """
    alias_upper = alias.strip().upper()
    matches: list[str] = []

    # Try to normalize long form to short form (SA4 → S4, RAN1 → R1, CT3 → C3)
    normalized_alias = alias_upper
    if alias_upper.startswith("SA") and len(alias_upper) > 2:
        # SA4 → S4, SA1 → S1, etc.
        normalized_alias = "S" + alias_upper[2:]
    elif alias_upper.startswith("RAN") and len(alias_upper) > 3:
        # RAN1 → R1, RAN2 → R2, etc.
        normalized_alias = "R" + alias_upper[3:]
    elif alias_upper.startswith("CT") and len(alias_upper) > 2:
        # CT1 → C1, CT2 → C2, etc.
        normalized_alias = "C" + alias_upper[2:]

    # Check all registries for matches
    for _working_group, codes in MEETING_CODE_REGISTRY.items():
        for code, subgroup in codes:
            # Check if normalized alias matches the code
            if code.upper() == normalized_alias:
                if subgroup:
                    matches.append(subgroup)
            # Also check direct match with the original alias
            elif code.upper() == alias_upper:
                if subgroup:
                    matches.append(subgroup)
            # Check if subgroup name matches
            elif subgroup and subgroup.upper() == alias_upper:
                matches.append(subgroup)

    # If no matches found, return the original alias (might be exact name)
    if not matches:
        matches.append(alias_upper)

+8 −12
@@ -47,14 +47,15 @@ class TDocCrawler:
    def __init__(self, database: TDocDatabase) -> None:
        self.database = database

    def crawl(self, config: TDocCrawlConfig, progress_callback: Callable[[float, float], None] | None = None) -> TDocCrawlResult:
        """Execute a crawl using the provided configuration.

        Queries meetings from the database and crawls their HTTP directories for TDocs.

        Args:
            config: Crawl configuration
            progress_callback: Optional callback function called after each TDoc is processed.
                Takes (completed, total) as float parameters.
        """
        errors: list[str] = []
        collected: list[TDocMetadata] = []
@@ -87,7 +88,6 @@ class TDocCrawler:
                        seen_ids,
                        existing_ids,
                        targets,
                    )
                    if targets is not None and not targets:
                        break
@@ -104,7 +104,8 @@ class TDocCrawler:
        inserted = 0
        updated = 0
        if collected:
            # Pass progress callback to bulk_upsert_tdocs to update after each DB operation
            inserted, updated = self.database.bulk_upsert_tdocs(collected, progress_callback=progress_callback)

        return TDocCrawlResult(
            processed=len(collected),
@@ -180,7 +181,6 @@ class TDocCrawler:
        seen_ids: set[str],
        existing_ids: set[str],
        targets: set[str] | None,
    ) -> None:
        """Crawl a specific meeting's HTTP directory for TDocs.

@@ -232,10 +232,10 @@ class TDocCrawler:
        # Crawl subdirectories if found, otherwise crawl base directory
        if subdirs_found:
            for subdir_url in subdirs_found:
                self._scan_directory_for_tdocs(session, subdir_url, meeting, config, collected, seen_ids, existing_ids, targets)
        else:
            # No subdirectories found, scan base directory directly
            self._scan_directory_for_tdocs(session, base_url, meeting, config, collected, seen_ids, existing_ids, targets)

    def _scan_directory_for_tdocs(
        self,
@@ -247,7 +247,6 @@ class TDocCrawler:
        seen_ids: set[str],
        existing_ids: set[str],
        targets: set[str] | None,
    ) -> None:
        """Scan a specific directory URL for TDoc files."""
        if not directory_url.endswith("/"):
@@ -323,6 +322,7 @@ class TDocCrawler:
                tdoc_id=tdoc_id,
                url=file_url,
                meeting_id=meeting.meeting_id,
                meeting_name=None,  # Not resolved from HTTP directory listing
                file_size=file_size,
                title="Pending validation",  # Will be updated after portal validation
                source="Unknown",  # Will be updated after portal validation
@@ -345,10 +345,6 @@ class TDocCrawler:
            if config.verbose:
                logger.debug("Collected TDoc %s from meeting %s", tdoc_id, meeting.short_name)

    def _should_store_tdoc(
        self,
        tdoc_id: str,