Commit d1cf1d80 authored by Jan Reimes's avatar Jan Reimes

docs(agents-md): applied mdformat on documentation

parent c48b579c
+55 −49
@@ -7,13 +7,19 @@ This document orients coding agents in the repository and clarifies where the
The Python package is under `src/tdoc_crawler/`:

- [src/tdoc_crawler/cli/app.py](../../src/tdoc_crawler/cli/app.py): Typer app and command registration

- [src/tdoc_crawler/cli/helpers.py](../../src/tdoc_crawler/cli/helpers.py): cache-dir/db-path resolution, credentials, WG/subgroup parsing

- [src/tdoc_crawler/cli/fetching.py](../../src/tdoc_crawler/cli/fetching.py): targeted fetch orchestration

- [src/tdoc_crawler/cli/printing.py](../../src/tdoc_crawler/cli/printing.py): output formats (table/json/yaml/csv)

- [src/tdoc_crawler/crawlers/](../../src/tdoc_crawler/crawlers/): meeting crawler, TDoc crawler, portal session

- [src/tdoc_crawler/models/](../../src/tdoc_crawler/models/): Pydantic models and config

- [src/tdoc_crawler/database/](../../src/tdoc_crawler/database/): database facade and query helpers

- [src/tdoc_crawler/http_client.py](../../src/tdoc_crawler/http_client.py): cached HTTP session factory

Tests live in [tests/](../../tests/).
+105 −105
@@ -13,10 +13,10 @@ This document describes the database contract that crawlers and CLI code must ma
The database has five tables with foreign keys:

1. `working_groups` (reference)
2. `subworking_groups` (reference)
3. `meetings`
4. `tdocs`
5. `crawl_log`
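A minimal `sqlite3` sketch of this layout; only the table names and the general reference/FK relationships come from this document, while every column beyond the keys is illustrative:

```python
import sqlite3

# Sketch of the five-table layout; the real columns are documented in the
# sections below. Column names here are placeholders, not the actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE working_groups    (wg_id TEXT PRIMARY KEY);
CREATE TABLE subworking_groups (swg_id TEXT PRIMARY KEY,
                                wg_id  TEXT REFERENCES working_groups(wg_id));
CREATE TABLE meetings          (meeting_id TEXT PRIMARY KEY,
                                swg_id TEXT REFERENCES subworking_groups(swg_id));
CREATE TABLE tdocs             (tdoc_id TEXT PRIMARY KEY,
                                meeting_id TEXT REFERENCES meetings(meeting_id));
CREATE TABLE crawl_log         (run_id INTEGER PRIMARY KEY AUTOINCREMENT,
                                started_at TEXT);
""")
tables = {r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
```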

## Reference tables

+69 −69
@@ -43,9 +43,9 @@ The portal meeting label may differ from the meeting name stored in the database
Resolution strategy should be multi-stage (in order):

1. Exact match (case-insensitive)
2. Normalized match (e.g., replace `#` with `-`, normalize `SA4` → `S4`)
3. Prefix/suffix matching for common portal vs stored naming variants
4. Edit-distance fallback (use carefully to avoid false positives)

Avoid substring “contains” matching that can create false positives.

+25 −22
@@ -4,7 +4,7 @@
**Version:** 0.6.0 (Proposed)
**Status:** ✅ Complete and Tested

---

## 🎯 Overview

@@ -17,7 +17,7 @@ Implemented comprehensive HTTP caching functionality using the hishel library, p
- **Flexible configuration** - Control cache behavior via CLI parameters or environment variables
- **Zero breaking changes** - Fully backward compatible with existing workflows

---

## 🚀 Features Implemented

@@ -54,24 +54,27 @@ HTTP_CACHE_REFRESH_ON_ACCESS=true # Default: true
- **Default:** `~/.tdoc-crawler/http-cache.sqlite3`
- **Customizable:** Via `--cache-dir` parameter

---

## 📦 New Components

### Core Modules

1. **`src/tdoc_crawler/http_client.py`** (New)

   - `create_cached_session()` factory function
   - Centralizes HTTP session creation with caching enabled
   - Built-in retry logic with exponential backoff
   - Uses hishel's `SyncSqliteStorage` backend

2. **`src/tdoc_crawler/models/base.py`** (Modified)

   - New `HttpCacheConfig` model
   - Default TTL: 7200 seconds (2 hours)
   - Default refresh on access: True

3. **`src/tdoc_crawler/cli/helpers.py`** (Modified)

   - New `resolve_http_cache_config()` function
   - Configuration priority: CLI > Environment > Defaults

@@ -84,7 +87,7 @@ Modified to use cached sessions:
- `src/tdoc_crawler/crawlers/portal.py` - Portal authentication
- `src/tdoc_crawler/cli/app.py` - CLI command integration

---

## 🧪 Testing

@@ -106,7 +109,7 @@ Added comprehensive test coverage in `tests/test_http_client.py`:
- **No regressions** in existing functionality
- **All linting checks pass**

---

## 📊 Performance Improvements

@@ -127,7 +130,7 @@ For a typical incremental crawl checking 100 meetings:
- **Bandwidth savings** especially significant for large crawls
- **Reduced load** on 3GPP servers

---

## 🔧 Configuration

@@ -136,8 +139,8 @@ For a typical incremental crawl checking 100 meetings:
The system uses this priority order (highest to lowest):

1. **CLI Parameters** - Explicit `--cache-ttl` and `--cache-refresh` options
2. **Environment Variables** - `HTTP_CACHE_TTL` and `HTTP_CACHE_REFRESH_ON_ACCESS`
3. **Default Values** - TTL=7200, refresh=True

### Configuration Examples

@@ -159,7 +162,7 @@ tdoc-crawler crawl-tdocs --cache-ttl 86400 --working-group SA
tdoc-crawler crawl-tdocs --cache-ttl 2592000 --no-cache-refresh
```

---

## 📚 Documentation

@@ -170,7 +173,7 @@ tdoc-crawler crawl-tdocs --cache-ttl 2592000 --no-cache-refresh
- **`README.md`** - Updated with caching feature mention
- **`docs/QUICK_REFERENCE.md`** - Integrated HTTP caching section

---

## 🔄 Migration Guide

@@ -202,7 +205,7 @@ HTTP_CACHE_REFRESH_ON_ACCESS=false
tdoc-crawler crawl-tdocs --cache-ttl 3600 --no-cache-refresh
```

---

## 📝 Technical Details

@@ -249,13 +252,13 @@ HTTP requests include automatic retry with exponential backoff:
- **Retry on:** 429, 500, 502, 503, 504 status codes
- **Allowed methods:** HEAD, GET, OPTIONS
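The retry policy above maps naturally onto `urllib3`'s `Retry` helper mounted on a `requests` session. Whether `http_client.py` wires it exactly this way, and the retry budget of 3, are assumptions:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch of the documented retry policy; the attempt count is an assumption.
retry = Retry(
    total=3,                                      # illustrative retry budget
    backoff_factor=1.0,                           # exponential backoff
    status_forcelist=[429, 500, 502, 503, 504],   # retry on these statuses
    allowed_methods=["HEAD", "GET", "OPTIONS"],   # idempotent methods only
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
session.mount("http://", HTTPAdapter(max_retries=retry))
```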

---

## 🐛 Bug Fixes

None - This is a pure feature addition with no bug fixes.

---

## ⚠️ Breaking Changes

@@ -263,7 +266,7 @@ None - This is a pure feature addition with no bug fixes.

All existing commands, parameters, and workflows continue to work exactly as before. The caching layer is transparent and requires no code changes.

---

## 📊 Statistics

@@ -288,7 +291,7 @@ All existing commands, parameters, and workflows continue to work exactly as bef
| Tests skipped | 3 (integration tests) |
| New test coverage | 100% of new code |

---

## 🔮 Future Enhancements

@@ -301,7 +304,7 @@ Potential improvements for future releases:
- [ ] Distributed cache for multi-machine setups
- [ ] Cache compression for space efficiency

---

## 🙏 Acknowledgments

@@ -309,7 +312,7 @@ Potential improvements for future releases:
- **SQLite** - Reliable persistent storage
- **requests** - Foundation HTTP library

---

## 📖 Additional Resources

@@ -317,12 +320,12 @@ Potential improvements for future releases:
- [RFC 9111: HTTP Caching](https://www.rfc-editor.org/rfc/rfc9111.html)
- [SQLite Documentation](https://www.sqlite.org/docs.html)

---

## 📞 Support

For questions or issues related to HTTP caching:

1. Check the HTTP Caching section in QUICK_REFERENCE.md
2. Review the FAQ section for common questions
3. Open an issue on GitHub
+108 −114
# 2025-11-11 SUMMARY 01 — TDoc Crawling via Document List

## Summary

This document summarizes the design and implementation of the "TDoc crawling via document list" feature. The feature enables the `tdoc-crawler` CLI to discover, validate and persist TDocs by parsing the meeting document lists (HTTP directory pages) provided on the 3GPP FTP/HTTP site and by validating metadata via the 3GPP portal when available.

## Key Features

- Scans meeting `files_url` directories and detected subdirectories (e.g., `Docs/`, `Documents/`) for candidate TDoc files.
- Uses a robust filename pattern (`TDOC_PATTERN`) to identify candidate TDoc files: case-insensitive, accepts `.zip`, `.txt`, `.pdf`.
- Normalizes TDoc IDs to uppercase for case-insensitive storage and lookup.
@@ -15,8 +15,8 @@ Key Features
- Parallel crawling with configurable worker count (default: 4) to speed up harvesting across meetings.
- Graceful handling of network errors and partial failures; logging includes processed/inserted counts and errors.
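As a rough illustration of the detection and normalization rules above, here is a hypothetical `TDOC_PATTERN`-style regex; the real pattern in the crawler is broader, and the `S4-` ID shape shown here is an assumption. Only "case-insensitive, accepts `.zip`/`.txt`/`.pdf`, normalize to uppercase" comes from the text:

```python
import re

# Hypothetical stand-in for the crawler's TDOC_PATTERN; the actual regex
# lives in the crawler module and accepts a wider range of ID shapes.
TDOC_PATTERN = re.compile(
    r"^(?P<tdoc_id>S4-\d{6}[a-z]?)\.(?:zip|txt|pdf)$", re.IGNORECASE)

def normalize_tdoc_id(raw: str) -> str:
    """Uppercase and trim, per the documented normalization rule."""
    return raw.strip().upper()

m = TDOC_PATTERN.match("s4-241234.ZIP")
tdoc_id = normalize_tdoc_id(m.group("tdoc_id")) if m else None
```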

## CLI Integration

Commands involved:

- `crawl-tdocs` (alias `ct`) — main command to crawl TDocs from meeting directories.
@@ -33,8 +33,7 @@ Commands involved:
    - `--cache-ttl`, `--cache-refresh/--no-cache-refresh` : HTTP caching control.
    - `--verbose, -v` : Verbose logging.

## Implementation Notes

1. Directory scanning and subdirectory detection

@@ -42,37 +41,35 @@ Implementation Notes
   - It detects TDoc-specific subdirectories using a case-insensitive set like `{Docs, Documents, TDocs, DOCS}`.
   - If subdirectories are found, each is scanned for candidate files; otherwise the base directory is scanned.

2. File detection and normalization

   - The `TDOC_PATTERN` regex is used to extract the filename stem (the TDoc ID) and extension.
   - Candidate filenames are normalized using `normalize_tdoc_id()` → uppercase and trimmed.
   - Excluded directory names such as `Inbox`, `Draft`, `Agenda` are ignored.

3. Validation against 3GPP portal

   - When portal credentials are available (CLI/env/prompt), the crawler opens an authenticated `PortalSession` and fetches the TDoc portal page to extract metadata fields (title, meeting, contact, source, tdoc_type, for, agenda_item, status, is_revision_of, etc.).
   - Portal parsing is defensive: missing optional fields are tolerated, required fields are validated before marking a TDoc as validated.
   - Negative results (invalid IDs or parsing failures) are cached in the DB as `validation_failed` to avoid repeated checks.

4. Incremental / Revalidate modes

   - Incremental mode skips TDocs already present and validated in the database.
   - `--force-revalidate` / running in full mode will re-fetch portal metadata for existing TDocs and update DB records.

5. Parallelism

   - Uses a worker pool (configurable size) and processes meetings/TDoc files in parallel while keeping DB upserts serialized in the DB layer.
   - The crawler accepts an optional progress callback to report accurate progress for rich terminal UIs.
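The worker-pool shape described above can be sketched with `concurrent.futures`; `crawl_meeting` and `upsert` are stand-ins for the real crawler and DB-layer functions, and the lock stands in for the DB layer's serialized upserts:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock

db_lock = Lock()
results: list[str] = []

def crawl_meeting(meeting: str) -> list[str]:
    # Placeholder for directory scanning of one meeting's files_url
    return [f"{meeting}-TDOC-1"]

def upsert(tdocs: list[str]) -> None:
    with db_lock:                 # DB upserts stay serialized
        results.extend(tdocs)

meetings = ["SA4-120", "SA4-121", "SA4-122"]
with ThreadPoolExecutor(max_workers=4) as pool:   # documented default: 4
    futures = [pool.submit(crawl_meeting, m) for m in meetings]
    for fut in as_completed(futures):
        upsert(fut.result())
```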

## Database Changes

- TDocs table (`tdocs`) stores `tdoc_id` (case-insensitive primary key), `meeting_id` (FK into `meetings`), `url`, `validated` (bool), `validation_failed` (bool), `title`, `contact`, `source`, `tdoc_type`, `for`, `agenda_item_nbr`, `agenda_item_title`, `is_revision_of`, and timestamps (`created_at`, `updated_at`).
- Meetings must be present before TDocs are inserted — the crawler enforces the foreign key constraint by querying the `meetings` table first.
- A `crawl_log` record is created per run capturing counts of processed meetings, discovered TDocs, validated, invalid, and errors.
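A minimal sketch of the case-insensitive primary key and the FK requirement, assuming a `COLLATE NOCASE` column; the real schema carries many more columns (title, contact, source, timestamps, ...) than shown here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE meetings (meeting_id TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE tdocs (
        tdoc_id    TEXT PRIMARY KEY COLLATE NOCASE,  -- case-insensitive key
        meeting_id TEXT NOT NULL REFERENCES meetings(meeting_id),
        validated  INTEGER DEFAULT 0
    )""")
conn.execute("INSERT INTO meetings VALUES ('SA4-120')")
conn.execute(
    "INSERT INTO tdocs (tdoc_id, meeting_id) VALUES ('S4-241234', 'SA4-120')")
# Lowercase lookup still finds the uppercase-stored row
row = conn.execute(
    "SELECT tdoc_id FROM tdocs WHERE tdoc_id = ?", ("s4-241234",)).fetchone()
```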

## Testing

- Unit tests mock `requests.Session` and `PortalSession` to exercise directory parsing, subdirectory detection, and portal metadata parsing.
- Integration tests use sample HTML directory listings under `tests/data` and ensure the crawler extracts expected TDoc IDs.
@@ -81,16 +78,14 @@ Testing
  - `tests/test_targeted_fetch.py` — tests portal metadata retrieval and negative caching behavior.
  - `tests/test_database.py` — verifies case-insensitive TDoc lookup and FK enforcement.

## QA Notes and Known Limitations

- The crawler relies on HTML directory listings. If a meeting's `files_url` redirects to a non-HTML storage backend, detection may fail — a fallback to direct FTP or alternative listing could be added later.
- Portal authentication uses JavaScript/AJAX endpoints; changes on the portal may break the scraper — tests mock portal responses but keep an eye on portal-side changes.
- Filename conventions are broad but intentionally conservative; some valid but rare TDoc filenames may require pattern updates.
- Large harvests can be IO-bound; increase `--workers` and tune `--timeout`/`--max-retries` for better throughput on high-latency networks.

## Deployment / Usage

Typical crawling invocation (example):

@@ -104,8 +99,7 @@ To force revalidation of known TDocs:
tdoc-crawler crawl-tdocs --incremental --force-revalidate
```

## History / Related Design Docs

- Design notes and architecture rationale located in `docs/design_meeting_doclist_architecture.md`.
- Feature specification and user-facing behavior in `docs/MEETING_DOCUMENT_LIST_FEATURE.md`.
+94 −94

File changed.

Contains only whitespace changes.

+46 −46

File changed.

Contains only whitespace changes.

+55 −55

File changed.

Contains only whitespace changes.
