Commit 98fdd5b0 authored by Jan Reimes's avatar Jan Reimes

📚 docs(history): add summary for TDoc crawling via document list feature

- Summarizes design and implementation of TDoc crawling
- Details key features, CLI integration, and implementation notes
- Includes database changes and testing information
parent 938c1049
# 2025-11-11 SUMMARY 01 — TDoc Crawling via Document List

Summary
-------
This document summarizes the design and implementation of the "TDoc crawling via document list" feature. The feature enables the `tdoc-crawler` CLI to discover, validate, and persist TDocs by parsing the meeting document lists (HTTP directory pages) on the 3GPP FTP/HTTP site and by validating metadata against the 3GPP portal when available.

Key Features
------------
- Scans meeting `files_url` directories and detected subdirectories (e.g., `Docs/`, `Documents/`) for candidate TDoc files.
- Uses a robust filename pattern (`TDOC_PATTERN`) to identify candidate TDoc files: case-insensitive, accepts `.zip`, `.txt`, `.pdf`.
- Normalizes TDoc IDs to uppercase for case-insensitive storage and lookup.
- Validates candidate TDocs by fetching portal metadata from `https://portal.3gpp.org/ngppapp/CreateTdoc.Aspx?mode=view&contributionUid=<tdoc_id>` when credentials are available.
- Caches negative validation results to avoid repeated portal requests for invalid IDs.
- Supports incremental crawling (skip already-known TDocs) and full re-validation via CLI flags.
- Parallel crawling with configurable worker count (default: 4) to speed up harvesting across meetings.
- Graceful handling of network errors and partial failures; logging includes processed/inserted counts and errors.

CLI Integration
---------------
Commands involved:

- `crawl-tdocs` (alias `ct`) — main command to crawl TDocs from meeting directories.
  - Options implemented/used:
    - `--cache-dir, -c` : Directory for HTTP cache and DB (default: `~/.tdoc-crawler`).
    - `--working-group, -w` : Filter by working group(s).
    - `--sub-group, -s` : Filter by subgroup(s).
    - `--incremental/--full` : Incremental mode (default: incremental).
    - `--clear-tdocs` : Clear existing TDoc records before crawling.
    - `--limit-tdocs` : Limit number of TDocs to crawl.
    - `--limit-meetings` / `--limit-meetings-per-wg` / `--limit-wgs` : Limits to scope the crawl.
    - `--workers` : Number of parallel workers (default: 4).
    - `--max-retries`, `--timeout` : HTTP resilience settings.
    - `--cache-ttl`, `--cache-refresh/--no-cache-refresh` : HTTP caching control.
    - `--verbose, -v` : Verbose logging.
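
The option surface above can be approximated with a stdlib `argparse` parser. The project's actual CLI framework is not named in this summary, so the wiring below is illustrative, not the real implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the crawl-tdocs option surface;
    # only a representative subset of the flags listed above is shown.
    p = argparse.ArgumentParser(prog="tdoc-crawler crawl-tdocs")
    p.add_argument("--cache-dir", "-c", default="~/.tdoc-crawler",
                   help="Directory for HTTP cache and DB")
    p.add_argument("--working-group", "-w", action="append", default=[],
                   help="Filter by working group(s); may be repeated")
    # Paired flags: --incremental (default) vs. --full share one destination.
    p.add_argument("--incremental", dest="incremental", action="store_true",
                   default=True)
    p.add_argument("--full", dest="incremental", action="store_false")
    p.add_argument("--workers", type=int, default=4,
                   help="Number of parallel workers")
    p.add_argument("--verbose", "-v", action="store_true")
    return p

args = build_parser().parse_args(["--full", "--workers", "6", "-w", "RAN1"])
print(args.incremental, args.workers, args.working_group)  # → False 6 ['RAN1']
```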

Implementation Notes
--------------------

1. Directory scanning and subdirectory detection

   - The crawler first fetches the meeting `files_url` and inspects links via BeautifulSoup.
   - It detects TDoc-specific subdirectories by matching link names case-insensitively against a known set (e.g., `Docs`, `Documents`, `TDocs`).
   - If subdirectories are found, each is scanned for candidate files; otherwise the base directory is scanned.
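
The subdirectory detection step might be sketched as follows, assuming the listing HTML has already been fetched; the helper name and link-matching details are illustrative, not the project's actual code:

```python
from bs4 import BeautifulSoup

TDOC_SUBDIRS = {"docs", "documents", "tdocs"}  # compared case-insensitively

def find_tdoc_subdirs(listing_html: str) -> list[str]:
    # Collect links in the HTML directory listing whose last path
    # component matches a known TDoc subdirectory name.
    soup = BeautifulSoup(listing_html, "html.parser")
    hits = []
    for a in soup.find_all("a", href=True):
        name = a["href"].rstrip("/").rsplit("/", 1)[-1]
        if name.lower() in TDOC_SUBDIRS:
            hits.append(a["href"])
    return hits
```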

2. File detection and normalization

   - The `TDOC_PATTERN` regex is used to extract the filename stem (the TDoc ID) and extension.
   - Candidate filenames are normalized using `normalize_tdoc_id()` → uppercase and trimmed.
   - Excluded directory names such as `Inbox`, `Draft`, `Agenda` are ignored.

3. Validation against 3GPP portal

   - When portal credentials are available (CLI/env/prompt), the crawler opens an authenticated `PortalSession` and fetches the TDoc portal page to extract metadata fields (title, meeting, contact, source, tdoc_type, for, agenda_item, status, is_revision_of, etc.).
   - Portal parsing is defensive: missing optional fields are tolerated, required fields are validated before marking a TDoc as validated.
   - Negative results (invalid IDs or parsing failures) are cached in the DB as `validation_failed` to avoid repeated checks.
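
The validation flow with negative caching can be sketched as below, where `fetch_metadata` stands in for the authenticated `PortalSession` fetch-and-parse step that this summary does not show in detail, and the required-field list is an illustrative subset:

```python
from typing import Callable, Optional

REQUIRED_FIELDS = ("title", "meeting", "source")  # illustrative subset

def validate_tdoc(tdoc_id: str,
                  fetch_metadata: Callable[[str], Optional[dict]],
                  negative_cache: set[str]) -> Optional[dict]:
    """Validate one TDoc via the portal, caching negative results."""
    if tdoc_id in negative_cache:
        return None  # known-bad ID: skip the portal round-trip
    meta = fetch_metadata(tdoc_id)
    if meta is None or any(not meta.get(k) for k in REQUIRED_FIELDS):
        negative_cache.add(tdoc_id)  # recorded as validation_failed in the DB
        return None
    return meta
```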

4. Incremental / Revalidate modes

   - Incremental mode skips TDocs already present and validated in the database.
   - `--force-revalidate` / running in full mode will re-fetch portal metadata for existing TDocs and update DB records.
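
Incremental skipping reduces to a single query against the `tdocs` table (column names follow the Database Changes section below; the helper itself is illustrative):

```python
import sqlite3

def known_validated_ids(conn: sqlite3.Connection) -> set[str]:
    # Incremental mode: load IDs that are already validated so the crawler
    # can skip them without a portal request.
    rows = conn.execute("SELECT tdoc_id FROM tdocs WHERE validated = 1")
    return {r[0] for r in rows}
```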

5. Parallelism

   - Uses a worker pool (configurable size) and processes meetings/TDoc files in parallel while keeping DB upserts serialized in the DB layer.
   - The crawler accepts an optional progress callback to report accurate progress for rich terminal UIs.
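
A sketch of the worker-pool pattern with a progress callback, assuming per-meeting work is encapsulated in a callable and DB writes remain serialized elsewhere (this is not the project's actual worker code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional

def crawl_meetings(meetings: list[str],
                   process_one: Callable[[str], int],
                   workers: int = 4,
                   progress: Optional[Callable[[int, int], None]] = None) -> int:
    """Process meetings in parallel; report completion counts via callback."""
    total_inserted = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_one, m): m for m in meetings}
        # as_completed yields futures in finish order, so the callback
        # reports accurate progress regardless of scheduling.
        for done, fut in enumerate(as_completed(futures), start=1):
            total_inserted += fut.result()
            if progress:
                progress(done, len(meetings))
    return total_inserted
```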

Database Changes
----------------

- TDocs table (`tdocs`) stores `tdoc_id` (case-insensitive primary key), `meeting_id` (FK into `meetings`), `url`, `validated` (bool), `validation_failed` (bool), `title`, `contact`, `source`, `tdoc_type`, `for`, `agenda_item_nbr`, `agenda_item_title`, `is_revision_of`, and timestamps (`created_at`, `updated_at`).
- Meetings must exist before TDocs are inserted; the crawler enforces the foreign-key constraint by querying the `meetings` table first.
- A `crawl_log` record is created per run capturing counts of processed meetings, discovered TDocs, validated, invalid, and errors.
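
A possible SQLite rendering of the schema above, inferred from the column list (the project's actual types, defaults, and indexes may differ). Note that `for` must be quoted because it is an SQL reserved word, and `COLLATE NOCASE` gives the case-insensitive key lookup:

```python
import sqlite3

# DDL sketch inferred from the column list above; not the project's real schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tdocs (
    tdoc_id           TEXT PRIMARY KEY COLLATE NOCASE,
    meeting_id        INTEGER NOT NULL REFERENCES meetings(id),
    url               TEXT,
    validated         INTEGER NOT NULL DEFAULT 0,
    validation_failed INTEGER NOT NULL DEFAULT 0,
    title             TEXT,
    contact           TEXT,
    source            TEXT,
    tdoc_type         TEXT,
    "for"             TEXT,
    agenda_item_nbr   TEXT,
    agenda_item_title TEXT,
    is_revision_of    TEXT,
    created_at        TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at        TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the meetings FK
conn.execute("CREATE TABLE IF NOT EXISTS meetings (id INTEGER PRIMARY KEY)")
conn.executescript(SCHEMA)
```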

Testing
-------

- Unit tests mock `requests.Session` and `PortalSession` to exercise directory parsing, subdirectory detection, and portal metadata parsing.
- Integration tests use sample HTML directory listings under `tests/data` and ensure the crawler extracts expected TDoc IDs.
- Key tests added/updated:
  - `tests/test_crawler.py` — verifies scanning base directories and subdirectory detection.
  - `tests/test_targeted_fetch.py` — tests portal metadata retrieval and negative caching behavior.
  - `tests/test_database.py` — verifies case-insensitive TDoc lookup and FK enforcement.
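
The session-mocking approach can be illustrated with `unittest.mock`; the fetch helper below is a stand-in, not the crawler's real function:

```python
from unittest import mock
import requests

LISTING = '<html><a href="R1-2500001.zip">R1-2500001.zip</a></html>'

def fetch_listing(session: requests.Session, url: str) -> str:
    # Trivial stand-in for the crawler's directory fetch.
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text

def test_fetch_listing_uses_mocked_session():
    # No network: the mocked session returns canned directory HTML.
    session = mock.Mock(spec=requests.Session)
    session.get.return_value = mock.Mock(status_code=200, text=LISTING)
    html = fetch_listing(session, "https://example.invalid/Docs/")
    assert "R1-2500001.zip" in html
    session.get.assert_called_once_with("https://example.invalid/Docs/",
                                        timeout=30)
```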

QA Notes and Known Limitations
------------------------------

- The crawler relies on HTML directory listings. If a meeting's `files_url` redirects to a non-HTML storage backend, detection may fail — a fallback to direct FTP or alternative listing could be added later.
- Portal authentication relies on JavaScript/AJAX endpoints, so portal-side changes may break the scraper; tests mock portal responses, but the portal should be monitored for changes.
- Filename conventions are broad but intentionally conservative; some valid but rare TDoc filenames may require pattern updates.
- Large harvests can be IO-bound; increase `--workers` and tune `--timeout`/`--max-retries` for better throughput on high-latency networks.

Deployment / Usage
------------------

Typical crawling invocation (example):

```bash
tdoc-crawler crawl-tdocs --cache-dir ~/.tdoc-crawler --workers 6 --limit-wgs 2 --limit-meetings-per-wg 5
```

To force revalidation of known TDocs:

```bash
tdoc-crawler crawl-tdocs --incremental --force-revalidate
```

History / Related Design Docs
----------------------------

- Design notes and architecture rationale located in `docs/design_meeting_doclist_architecture.md`.
- Feature specification and user-facing behavior in `docs/MEETING_DOCUMENT_LIST_FEATURE.md`.

Author: tdoc-crawler team
Date: 2025-11-11