Commit acff0dbd authored by Jan Reimes's avatar Jan Reimes

feat(crawler): enhance progress tracking and subgroup alias normalization

- Updated progress callback to provide accurate completion metrics
- Improved progress bar visibility during TDoc and meeting crawls
- Normalized subgroup aliases to support both short and long formats
- Ensured consistent behavior across crawling commands
parent 8b3bd0cd
+80 −22
@@ -147,37 +147,59 @@ Crawling meetings (working groups: RAN, SA, CT)

## 🔧 Technical Implementation

### Progress Callback Pattern (Updated)

**Evolution:** Originally used `Callable[[], None]`; now uses `Callable[[float, float], None]`

**Updated Implementation:**

```python
# In database/connection.py
def bulk_upsert_meetings(
    self,
    meetings: Iterable[MeetingMetadata],
    progress_callback: Callable[[float, float], None] | None = None,
) -> tuple[int, int]:
    meetings_list = list(meetings)
    total = float(len(meetings_list))

    for index, meeting in enumerate(meetings_list, start=1):
        created, changed = self.upsert_meeting(meeting)
        # ... counting logic ...
        if progress_callback:
            progress_callback(float(index), total)
    return inserted, updated

# In cli/app.py
with Progress(
    SpinnerColumn(),
    TextColumn("[progress.description]{task.description}"),
    BarColumn(),
    MofNCompleteColumn(),
    console=console,
) as progress:
    task = progress.add_task("[cyan]Crawling...", total=100)

    def update_progress(completed: float, total: float) -> None:
        progress.update(task, completed=completed, total=total)

    result = crawler.crawl(config, progress_callback=update_progress)
```

**Key Improvements:**

1. **Accurate Progress Bar**: Callback now receives `(completed, total)` instead of just being invoked
2. **Visual Feedback**: Shows actual progress bar with "N/M" format (e.g., "365/365")
3. **Database-Level Tracking**: Progress tracked during database insertion, not collection
4. **Consistent Pattern**: Both `crawl-meetings` and `crawl-tdocs` use identical approach

**Benefits:**

- Crawler remains UI-agnostic (no Rich dependency in crawler layer)
- Callback can be omitted for non-interactive use cases
- Easy to test (mock callback, verify the reported `(completed, total)` values)
- Progress updates reflect actual work done (database operations completed)
- Users see deterministic progress bar instead of indeterminate spinner

### Raw SQL Execution with pydantic_sqlite

@@ -302,11 +324,47 @@ No migration required. Existing workflows continue unchanged. New features are o

None discovered during implementation or testing.

## 🔄 Additional Improvements (Same Session)

### Progress Bar Enhancement: Database-Level Tracking

**Issue:** Initial implementation tracked progress during collection phase, but total count was unknown at callback time, resulting in indeterminate progress bars.

**Solution:** Moved progress tracking from collection to database insertion phase:

1. **Database Methods Updated**:
   - `bulk_upsert_meetings()` and `bulk_upsert_tdocs()` now accept `Callable[[float, float], None]`
   - Convert iterable to list to determine total count before processing
   - Call callback with `(completed, total)` after each database operation

2. **Crawler Methods Updated**:
   - `MeetingCrawler.crawl()` signature changed to use new callback type
   - `TDocCrawler.crawl()` signature changed to use new callback type
   - Removed progress callbacks from internal methods (`_crawl_meeting`, `_scan_directory_for_tdocs`)

3. **CLI Progress Bars**:
   - Added `BarColumn` and `MofNCompleteColumn` for visual progress
   - Shows "N/M" format (e.g., "365/365 meetings", "1745/1745 TDocs")
   - Callback updates task with exact completed/total values

**Result:** Users now see accurate progress bars showing exactly how many items have been processed and how many remain.

**Example Output:**

```text
Crawling meetings (working groups: RAN, SA, CT)
⠋ Crawling... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 365/365 meetings

Crawling TDocs from 5 meetings
⠋ Crawling... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1745/1745 TDocs
```

## 🚀 Future Enhancements

**Potential Improvements:**

1. Add `--clear-tdocs-by-wg` for selective clearing per working group
2. ~~Add progress bar to `crawl-meetings` command (similar pattern)~~ ✅ Implemented
3. Add confirmation prompt for destructive clear operations
4. Add `--dry-run` flag to preview what would be cleared
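The `--dry-run` idea (item 4) would typically compute the affected set without executing deletes. A minimal sketch, assuming a hypothetical `clear_tdocs` helper and placeholder TDoc IDs, not the project's actual CLI wiring:

```python
import argparse

def clear_tdocs(tdoc_ids: list[str], dry_run: bool = False) -> int:
    """Return how many TDocs were (or would be) cleared."""
    if dry_run:
        for tdoc_id in tdoc_ids:
            print(f"[dry-run] would clear {tdoc_id}")
        return len(tdoc_ids)
    # ... actual DELETE statements would run here ...
    return len(tdoc_ids)

parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true",
                    help="Preview what would be cleared without deleting")
args = parser.parse_args(["--dry-run"])
assert clear_tdocs(["S4-000001", "R1-000002"], dry_run=args.dry_run) == 2
```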

+99 −0
@@ -183,10 +183,109 @@ All 63 tests pass with no changes required:
   - `crawl_meetings()`: Reordered parsing to call `parse_subgroups()` before `parse_working_groups()`
   - `crawl_tdocs()`: Same reordering for consistency

## 🔄 Additional Improvements (Same Session)

### Subgroup Alias Normalization Support

**Issue:** Users reported that full subgroup names (SA4, RAN1, CT3) were not recognized; only short codes (S4, R1, C3) worked.

**Example:**
```bash
# This worked
uv run tdoc-crawler crawl-meetings -s S4 --limit-meetings 5

# This did NOT work (0 meetings processed)
uv run tdoc-crawler crawl-meetings -s SA4 --limit-meetings 5
```

**Root Cause:** The `normalize_subgroup_alias()` function in `src/tdoc_crawler/crawlers/meetings.py` only checked for exact matches with short codes stored in `MEETING_CODE_REGISTRY`. The registry stores codes like "S4", "R1", "C3" (not "SA4", "RAN1", "CT3").

**Solution:** Enhanced `normalize_subgroup_alias()` to normalize long format to short format before matching:

```python
def normalize_subgroup_alias(alias: str) -> list[str]:
    """Normalize subgroup aliases to canonical names.
    Supports: S4, SA4, R1, RAN1, C3, CT3, RP, SP, CP, etc.
    """
    alias_upper = alias.strip().upper()
    matches: list[str] = []

    # Normalize long form to short form
    normalized_alias = alias_upper
    if alias_upper.startswith("SA") and len(alias_upper) > 2:
        normalized_alias = "S" + alias_upper[2:]  # SA4 → S4
    elif alias_upper.startswith("RAN") and len(alias_upper) > 3:
        normalized_alias = "R" + alias_upper[3:]  # RAN1 → R1
    elif alias_upper.startswith("CT") and len(alias_upper) > 2:
        normalized_alias = "C" + alias_upper[2:]  # CT1 → C1

    # Check all registries for matches
    for _working_group, codes in MEETING_CODE_REGISTRY.items():
        for code, subgroup in codes:
            if code.upper() == normalized_alias or code.upper() == alias_upper:
                if subgroup:
                    matches.append(subgroup)
            elif subgroup and subgroup.upper() == alias_upper:
                matches.append(subgroup)

    if not matches:
        matches.append(alias_upper)
    return matches
```

**Normalization Rules:**

- SA4 → S4, SA1 → S1, SA2 → S2, etc.
- RAN1 → R1, RAN2 → R2, RAN3 → R3, etc.
- CT1 → C1, CT2 → C2, CT3 → C3, etc.
- Preserves short forms: S4 → S4, R1 → R1, C3 → C3
- Handles plenary codes: SP, RP, CP (unchanged)
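The prefix rules above can be isolated into a small pure helper. A sketch with a hypothetical name `to_short_code`; the real `normalize_subgroup_alias()` additionally matches the result against `MEETING_CODE_REGISTRY`:

```python
def to_short_code(alias: str) -> str:
    """Map long subgroup prefixes to short ones: SA→S, RAN→R, CT→C."""
    a = alias.strip().upper()
    if a.startswith("SA") and len(a) > 2:
        return "S" + a[2:]   # SA4 → S4
    if a.startswith("RAN") and len(a) > 3:
        return "R" + a[3:]   # RAN1 → R1
    if a.startswith("CT") and len(a) > 2:
        return "C" + a[2:]   # CT3 → C3
    return a                 # S4, R1, SP, RP, CP pass through unchanged

assert to_short_code("SA4") == "S4"
assert to_short_code("RAN1") == "R1"
assert to_short_code("ct3") == "C3"
assert to_short_code("SP") == "SP"
```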

**Behavior After Fix:**

| Command | Result |
|---------|--------|
| `-s SA4` | ✅ Works (normalizes to S4) |
| `-s RAN1` | ✅ Works (normalizes to R1) |
| `-s CT3` | ✅ Works (normalizes to C3) |
| `-s S4` | ✅ Still works (already canonical) |
| `-s R2` | ✅ Still works |
| `-s SP` | ✅ Works (plenary code) |

**Query Behavior:**

When querying meetings with normalized aliases, the database stores and returns canonical short codes:

```bash
# Input: SA4, Output: meetings with subgroup "S4"
uv run tdoc-crawler query-meetings -s SA4 --include-without-files

# Output shows:
│ S4-141      │ SA │ S4 │ 2027-11-15 - 2027-11-19 │
│ S4-140-bis  │ SA │ S4 │ 2027-10-11 - 2027-10-15 │
```

**Testing:**

- Manual testing confirmed all formats work correctly
- All 63 tests passing
- Updated `test_query_meetings_with_subgroup_filter` to expect normalized value `["S4"]` instead of input value `["SA4"]`

**Files Modified:**

3. **`src/tdoc_crawler/crawlers/meetings.py`**:
   - Enhanced `normalize_subgroup_alias()` with long-to-short format transformation
   - Added normalization logic for SA→S, RAN→R, CT→C prefixes

4. **`tests/test_cli.py`**:
   - Updated `test_query_meetings_with_subgroup_filter` assertion to expect normalized subgroup code

## Migration Notes

No migration needed: this is a bug fix that makes the behavior more intuitive. Users who were working around the bug by using `-w` explicitly can continue to do so.

**For subgroup aliases:** Both short and long formats now work identically. Internal database storage remains in canonical short format (S4, R1, C3) for consistency.

---

**Status:** ✅ Implemented and tested
+6 −9
@@ -109,25 +109,22 @@ def crawl_tdocs(
        crawl_id = database.log_crawl_start("tdoc", config.working_groups, config.incremental)

        # Create progress bar for TDoc crawling
        with Progress(
            SpinnerColumn(),
            TextColumn("[progress.description]{task.description}"),
            BarColumn(),
            MofNCompleteColumn(),
            console=console,
        ) as progress:
            # Add progress task (total will be set by callback during DB insertion)
            task = progress.add_task(
                "[cyan]Crawling TDocs...",
                total=100,  # Initial placeholder, will be updated by callback
            )

            # Define progress callback that receives completed and total
            def update_progress(completed: float, total: float) -> None:
                progress.update(task, completed=completed, total=total)

            # Run crawl with progress callback
            result = crawler.crawl(config, progress_callback=update_progress)
+25 −3
@@ -77,21 +77,43 @@ def normalize_subgroup_alias(alias: str) -> list[str]:
    """Normalize subgroup aliases to canonical names.

    Returns a list of possible matching subgroup names.
    Supports:
    - Short codes: S4, R1, C3
    - Full names: SA4, RAN1, CT3
    - Plenary codes: RP, SP, CP
    - Full plenary names: RAN PLENARY, SA PLENARY, CT PLENARY
    """
    alias_upper = alias.strip().upper()
    matches: list[str] = []

    # Try to normalize long form to short form (SA4 → S4, RAN1 → R1, CT3 → C3)
    normalized_alias = alias_upper
    if alias_upper.startswith("SA") and len(alias_upper) > 2:
        # SA4 → S4, SA1 → S1, etc.
        normalized_alias = "S" + alias_upper[2:]
    elif alias_upper.startswith("RAN") and len(alias_upper) > 3:
        # RAN1 → R1, RAN2 → R2, etc.
        normalized_alias = "R" + alias_upper[3:]
    elif alias_upper.startswith("CT") and len(alias_upper) > 2:
        # CT1 → C1, CT2 → C2, etc.
        normalized_alias = "C" + alias_upper[2:]

    # Check all registries for matches
    for _working_group, codes in MEETING_CODE_REGISTRY.items():
        for code, subgroup in codes:
            # Check if normalized alias matches the code
            if code.upper() == normalized_alias:
                if subgroup:
                    matches.append(subgroup)
            # Also check direct match with the original alias
            elif code.upper() == alias_upper:
                if subgroup:
                    matches.append(subgroup)
            # Check if subgroup name matches
            elif subgroup and subgroup.upper() == alias_upper:
                matches.append(subgroup)

    # If no matches found, return the original alias (might be exact name)
    if not matches:
        matches.append(alias_upper)

+8 −12
@@ -47,14 +47,15 @@ class TDocCrawler:
    def __init__(self, database: TDocDatabase) -> None:
        self.database = database

    def crawl(self, config: TDocCrawlConfig, progress_callback: Callable[[float, float], None] | None = None) -> TDocCrawlResult:
        """Execute a crawl using the provided configuration.

        Queries meetings from the database and crawls their HTTP directories for TDocs.

        Args:
            config: Crawl configuration
            progress_callback: Optional callback function called after each TDoc is processed.
                Takes (completed, total) as float parameters.
        """
        errors: list[str] = []
        collected: list[TDocMetadata] = []
@@ -87,7 +88,6 @@ class TDocCrawler:
                        seen_ids,
                        existing_ids,
                        targets,
                    )
                    if targets is not None and not targets:
                        break
@@ -104,7 +104,8 @@ class TDocCrawler:
        inserted = 0
        updated = 0
        if collected:
            # Pass progress callback to bulk_upsert_tdocs to update after each DB operation
            inserted, updated = self.database.bulk_upsert_tdocs(collected, progress_callback=progress_callback)

        return TDocCrawlResult(
            processed=len(collected),
@@ -180,7 +181,6 @@ class TDocCrawler:
        seen_ids: set[str],
        existing_ids: set[str],
        targets: set[str] | None,
    ) -> None:
        """Crawl a specific meeting's HTTP directory for TDocs.

@@ -232,10 +232,10 @@ class TDocCrawler:
        # Crawl subdirectories if found, otherwise crawl base directory
        if subdirs_found:
            for subdir_url in subdirs_found:
                self._scan_directory_for_tdocs(session, subdir_url, meeting, config, collected, seen_ids, existing_ids, targets)
        else:
            # No subdirectories found, scan base directory directly
            self._scan_directory_for_tdocs(session, base_url, meeting, config, collected, seen_ids, existing_ids, targets)

    def _scan_directory_for_tdocs(
        self,
@@ -247,7 +247,6 @@ class TDocCrawler:
        seen_ids: set[str],
        existing_ids: set[str],
        targets: set[str] | None,
    ) -> None:
        """Scan a specific directory URL for TDoc files."""
        if not directory_url.endswith("/"):
@@ -323,6 +322,7 @@ class TDocCrawler:
                tdoc_id=tdoc_id,
                url=file_url,
                meeting_id=meeting.meeting_id,
                meeting_name=None,  # Not resolved from HTTP directory listing
                file_size=file_size,
                title="Pending validation",  # Will be updated after portal validation
                source="Unknown",  # Will be updated after portal validation
@@ -345,10 +345,6 @@ class TDocCrawler:
            if config.verbose:
                logger.debug("Collected TDoc %s from meeting %s", tdoc_id, meeting.short_name)

    def _should_store_tdoc(
        self,
        tdoc_id: str,