Commit d1cf1d80 authored by Jan Reimes's avatar Jan Reimes

docs(agents-md): applied mdformat on documentation

parent c48b579c
+55 −49
@@ -7,13 +7,19 @@ This document orients coding agents in the repository and clarifies where the
The Python package is under `src/tdoc_crawler/`:

- [src/tdoc_crawler/cli/app.py](../../src/tdoc_crawler/cli/app.py): Typer app and command registration

- [src/tdoc_crawler/cli/helpers.py](../../src/tdoc_crawler/cli/helpers.py): cache-dir/db-path resolution, credentials, WG/subgroup parsing

- [src/tdoc_crawler/cli/fetching.py](../../src/tdoc_crawler/cli/fetching.py): targeted fetch orchestration

- [src/tdoc_crawler/cli/printing.py](../../src/tdoc_crawler/cli/printing.py): output formats (table/json/yaml/csv)

- [src/tdoc_crawler/crawlers/](../../src/tdoc_crawler/crawlers/): meeting crawler, TDoc crawler, portal session

- [src/tdoc_crawler/models/](../../src/tdoc_crawler/models/): Pydantic models and config

- [src/tdoc_crawler/database/](../../src/tdoc_crawler/database/): database facade and query helpers

- [src/tdoc_crawler/http_client.py](../../src/tdoc_crawler/http_client.py): cached HTTP session factory

Tests live in [tests/](../../tests/).
+105 −105
@@ -13,10 +13,10 @@ This document describes the database contract that crawlers and CLI code must ma
The database has five tables with foreign keys:

1. `working_groups` (reference)
2. `subworking_groups` (reference)
3. `meetings`
4. `tdocs`
5. `crawl_log`
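A minimal `sqlite3` sketch of this layout; only the table names and the general reference/FK relationships come from this document, while every column beyond the keys is illustrative:

```python
import sqlite3

# Sketch of the five-table layout; the real columns are documented in the
# sections below. Column names here are placeholders, not the actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE working_groups    (wg_id TEXT PRIMARY KEY);
CREATE TABLE subworking_groups (swg_id TEXT PRIMARY KEY,
                                wg_id  TEXT REFERENCES working_groups(wg_id));
CREATE TABLE meetings          (meeting_id TEXT PRIMARY KEY,
                                swg_id TEXT REFERENCES subworking_groups(swg_id));
CREATE TABLE tdocs             (tdoc_id TEXT PRIMARY KEY,
                                meeting_id TEXT REFERENCES meetings(meeting_id));
CREATE TABLE crawl_log         (run_id INTEGER PRIMARY KEY AUTOINCREMENT,
                                started_at TEXT);
""")
tables = {r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
```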

## Reference tables

+69 −69
@@ -43,9 +43,9 @@ The portal meeting label may differ from the meeting name stored in the database
Resolution strategy should be multi-stage (in order):

1. Exact match (case-insensitive)
2. Normalized match (e.g., replace `#` with `-`, normalize `SA4` → `S4`)
3. Prefix/suffix matching for common portal vs stored naming variants
4. Edit-distance fallback (use carefully to avoid false positives)

Avoid substring “contains” matching that can create false positives.

+25 −22
@@ -4,7 +4,7 @@
**Version:** 0.6.0 (Proposed)
**Status:** ✅ Complete and Tested

---

## 🎯 Overview

@@ -17,7 +17,7 @@ Implemented comprehensive HTTP caching functionality using the hishel library, p
- **Flexible configuration** - Control cache behavior via CLI parameters or environment variables
- **Zero breaking changes** - Fully backward compatible with existing workflows

---

## 🚀 Features Implemented

@@ -54,24 +54,27 @@ HTTP_CACHE_REFRESH_ON_ACCESS=true # Default: true
- **Default:** `~/.tdoc-crawler/http-cache.sqlite3`
- **Customizable:** Via `--cache-dir` parameter

---

## 📦 New Components

### Core Modules

1. **`src/tdoc_crawler/http_client.py`** (New)

   - `create_cached_session()` factory function
   - Centralizes HTTP session creation with caching enabled
   - Built-in retry logic with exponential backoff
   - Uses hishel's `SyncSqliteStorage` backend

2. **`src/tdoc_crawler/models/base.py`** (Modified)

   - New `HttpCacheConfig` model
   - Default TTL: 7200 seconds (2 hours)
   - Default refresh on access: True

3. **`src/tdoc_crawler/cli/helpers.py`** (Modified)

   - New `resolve_http_cache_config()` function
   - Configuration priority: CLI > Environment > Defaults

@@ -84,7 +87,7 @@ Modified to use cached sessions:
- `src/tdoc_crawler/crawlers/portal.py` - Portal authentication
- `src/tdoc_crawler/cli/app.py` - CLI command integration

---

## 🧪 Testing

@@ -106,7 +109,7 @@ Added comprehensive test coverage in `tests/test_http_client.py`:
- **No regressions** in existing functionality
- **All linting checks pass**

---

## 📊 Performance Improvements

@@ -127,7 +130,7 @@ For a typical incremental crawl checking 100 meetings:
- **Bandwidth savings** especially significant for large crawls
- **Reduced load** on 3GPP servers

---

## 🔧 Configuration

@@ -136,8 +139,8 @@ For a typical incremental crawl checking 100 meetings:
The system uses this priority order (highest to lowest):

1. **CLI Parameters** - Explicit `--cache-ttl` and `--cache-refresh` options
2. **Environment Variables** - `HTTP_CACHE_TTL` and `HTTP_CACHE_REFRESH_ON_ACCESS`
3. **Default Values** - TTL=7200, refresh=True

### Configuration Examples

@@ -159,7 +162,7 @@ tdoc-crawler crawl-tdocs --cache-ttl 86400 --working-group SA
tdoc-crawler crawl-tdocs --cache-ttl 2592000 --no-cache-refresh
```

---

## 📚 Documentation

@@ -170,7 +173,7 @@ tdoc-crawler crawl-tdocs --cache-ttl 2592000 --no-cache-refresh
- **`README.md`** - Updated with caching feature mention
- **`docs/QUICK_REFERENCE.md`** - Integrated HTTP caching section

---

## 🔄 Migration Guide

@@ -202,7 +205,7 @@ HTTP_CACHE_REFRESH_ON_ACCESS=false
tdoc-crawler crawl-tdocs --cache-ttl 3600 --no-cache-refresh
```

---

## 📝 Technical Details

@@ -249,13 +252,13 @@ HTTP requests include automatic retry with exponential backoff:
- **Retry on:** 429, 500, 502, 503, 504 status codes
- **Allowed methods:** HEAD, GET, OPTIONS
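The retry policy above maps naturally onto `urllib3`'s `Retry` helper mounted on a `requests` session. Whether `http_client.py` wires it exactly this way, and the retry budget of 3, are assumptions:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch of the documented retry policy; the attempt count is an assumption.
retry = Retry(
    total=3,                                      # illustrative retry budget
    backoff_factor=1.0,                           # exponential backoff
    status_forcelist=[429, 500, 502, 503, 504],   # retry on these statuses
    allowed_methods=["HEAD", "GET", "OPTIONS"],   # idempotent methods only
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
session.mount("http://", HTTPAdapter(max_retries=retry))
```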

---

## 🐛 Bug Fixes

None - This is a pure feature addition with no bug fixes.

---

## ⚠️ Breaking Changes

@@ -263,7 +266,7 @@ None - This is a pure feature addition with no bug fixes.

All existing commands, parameters, and workflows continue to work exactly as before. The caching layer is transparent and requires no code changes.

---

## 📊 Statistics

@@ -288,7 +291,7 @@ All existing commands, parameters, and workflows continue to work exactly as bef
| Tests skipped | 3 (integration tests) |
| New test coverage | 100% of new code |

---

## 🔮 Future Enhancements

@@ -301,7 +304,7 @@ Potential improvements for future releases:
- [ ] Distributed cache for multi-machine setups
- [ ] Cache compression for space efficiency

---

## 🙏 Acknowledgments

@@ -309,7 +312,7 @@ Potential improvements for future releases:
- **SQLite** - Reliable persistent storage
- **requests** - Foundation HTTP library

---

## 📖 Additional Resources

@@ -317,12 +320,12 @@ Potential improvements for future releases:
- [RFC 9111: HTTP Caching](https://www.rfc-editor.org/rfc/rfc9111.html)
- [SQLite Documentation](https://www.sqlite.org/docs.html)

---

## 📞 Support

For questions or issues related to HTTP caching:

1. Check the HTTP Caching section in QUICK_REFERENCE.md
2. Review the FAQ section for common questions
3. Open an issue on GitHub
+108 −114
# 2025-11-11 SUMMARY 01 — TDoc Crawling via Document List

## Summary

This document summarizes the design and implementation of the "TDoc crawling via document list" feature. The feature enables the `tdoc-crawler` CLI to discover, validate and persist TDocs by parsing the meeting document lists (HTTP directory pages) provided on the 3GPP FTP/HTTP site and by validating metadata via the 3GPP portal when available.

## Key Features

- Scans meeting `files_url` directories and detected subdirectories (e.g., `Docs/`, `Documents/`) for candidate TDoc files.
- Uses a robust filename pattern (`TDOC_PATTERN`) to identify candidate TDoc files: case-insensitive, accepts `.zip`, `.txt`, `.pdf`.
- Normalizes TDoc IDs to uppercase for case-insensitive storage and lookup.
@@ -15,8 +15,8 @@ Key Features
- Parallel crawling with configurable worker count (default: 4) to speed up harvesting across meetings.
- Graceful handling of network errors and partial failures; logging includes processed/inserted counts and errors.
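As a rough illustration of the detection and normalization rules above, here is a hypothetical `TDOC_PATTERN`-style regex; the real pattern in the crawler is broader, and the `S4-` ID shape shown here is an assumption. Only "case-insensitive, accepts `.zip`/`.txt`/`.pdf`, normalize to uppercase" comes from the text:

```python
import re

# Hypothetical stand-in for the crawler's TDOC_PATTERN; the actual regex
# lives in the crawler module and accepts a wider range of ID shapes.
TDOC_PATTERN = re.compile(
    r"^(?P<tdoc_id>S4-\d{6}[a-z]?)\.(?:zip|txt|pdf)$", re.IGNORECASE)

def normalize_tdoc_id(raw: str) -> str:
    """Uppercase and trim, per the documented normalization rule."""
    return raw.strip().upper()

m = TDOC_PATTERN.match("s4-241234.ZIP")
tdoc_id = normalize_tdoc_id(m.group("tdoc_id")) if m else None
```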

## CLI Integration

Commands involved:

- `crawl-tdocs` (alias `ct`) — main command to crawl TDocs from meeting directories.
@@ -33,8 +33,7 @@ Commands involved:
    - `--cache-ttl`, `--cache-refresh/--no-cache-refresh` : HTTP caching control.
    - `--verbose, -v` : Verbose logging.

## Implementation Notes

1. Directory scanning and subdirectory detection

@@ -42,37 +41,35 @@ Implementation Notes
   - It detects TDoc-specific subdirectories using a case-insensitive set like `{Docs, Documents, TDocs, DOCS}`.
   - If subdirectories are found, each is scanned for candidate files; otherwise the base directory is scanned.

2. File detection and normalization

   - The `TDOC_PATTERN` regex is used to extract the filename stem (the TDoc ID) and extension.
   - Candidate filenames are normalized using `normalize_tdoc_id()` → uppercase and trimmed.
   - Excluded directory names such as `Inbox`, `Draft`, `Agenda` are ignored.

3. Validation against 3GPP portal

   - When portal credentials are available (CLI/env/prompt), the crawler opens an authenticated `PortalSession` and fetches the TDoc portal page to extract metadata fields (title, meeting, contact, source, tdoc_type, for, agenda_item, status, is_revision_of, etc.).
   - Portal parsing is defensive: missing optional fields are tolerated, required fields are validated before marking a TDoc as validated.
   - Negative results (invalid IDs or parsing failures) are cached in the DB as `validation_failed` to avoid repeated checks.

4. Incremental / Revalidate modes

   - Incremental mode skips TDocs already present and validated in the database.
   - `--force-revalidate` / running in full mode will re-fetch portal metadata for existing TDocs and update DB records.

5. Parallelism

   - Uses a worker pool (configurable size) and processes meetings/TDoc files in parallel while keeping DB upserts serialized in the DB layer.
   - The crawler accepts an optional progress callback to report accurate progress for rich terminal UIs.
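The worker-pool shape described above can be sketched with `concurrent.futures`; `crawl_meeting` and `upsert` are stand-ins for the real crawler and DB-layer functions, and the lock stands in for the DB layer's serialized upserts:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import Lock

db_lock = Lock()
results: list[str] = []

def crawl_meeting(meeting: str) -> list[str]:
    # Placeholder for directory scanning of one meeting's files_url
    return [f"{meeting}-TDOC-1"]

def upsert(tdocs: list[str]) -> None:
    with db_lock:                 # DB upserts stay serialized
        results.extend(tdocs)

meetings = ["SA4-120", "SA4-121", "SA4-122"]
with ThreadPoolExecutor(max_workers=4) as pool:   # documented default: 4
    futures = [pool.submit(crawl_meeting, m) for m in meetings]
    for fut in as_completed(futures):
        upsert(fut.result())
```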

## Database Changes

- TDocs table (`tdocs`) stores `tdoc_id` (case-insensitive primary key), `meeting_id` (FK into `meetings`), `url`, `validated` (bool), `validation_failed` (bool), `title`, `contact`, `source`, `tdoc_type`, `for`, `agenda_item_nbr`, `agenda_item_title`, `is_revision_of`, and timestamps (`created_at`, `updated_at`).
- Meetings must be present before TDocs are inserted — the crawler enforces the foreign key constraint by querying the `meetings` table first.
- A `crawl_log` record is created per run capturing counts of processed meetings, discovered TDocs, validated, invalid, and errors.
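A minimal sketch of the case-insensitive primary key and the FK requirement, assuming a `COLLATE NOCASE` column; the real schema carries many more columns (title, contact, source, timestamps, ...) than shown here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE meetings (meeting_id TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE tdocs (
        tdoc_id    TEXT PRIMARY KEY COLLATE NOCASE,  -- case-insensitive key
        meeting_id TEXT NOT NULL REFERENCES meetings(meeting_id),
        validated  INTEGER DEFAULT 0
    )""")
conn.execute("INSERT INTO meetings VALUES ('SA4-120')")
conn.execute(
    "INSERT INTO tdocs (tdoc_id, meeting_id) VALUES ('S4-241234', 'SA4-120')")
# Lowercase lookup still finds the uppercase-stored row
row = conn.execute(
    "SELECT tdoc_id FROM tdocs WHERE tdoc_id = ?", ("s4-241234",)).fetchone()
```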

## Testing

- Unit tests mock `requests.Session` and `PortalSession` to exercise directory parsing, subdirectory detection, and portal metadata parsing.
- Integration tests use sample HTML directory listings under `tests/data` and ensure the crawler extracts expected TDoc IDs.
@@ -81,16 +78,14 @@ Testing
  - `tests/test_targeted_fetch.py` — tests portal metadata retrieval and negative caching behavior.
  - `tests/test_database.py` — verifies case-insensitive TDoc lookup and FK enforcement.

## QA Notes and Known Limitations

- The crawler relies on HTML directory listings. If a meeting's `files_url` redirects to a non-HTML storage backend, detection may fail — a fallback to direct FTP or alternative listing could be added later.
- Portal authentication uses JavaScript/AJAX endpoints; changes on the portal may break the scraper — tests mock portal responses but keep an eye on portal-side changes.
- Filename conventions are broad but intentionally conservative; some valid but rare TDoc filenames may require pattern updates.
- Large harvests can be IO-bound; increase `--workers` and tune `--timeout`/`--max-retries` for better throughput on high-latency networks.

## Deployment / Usage

Typical crawling invocation (example):

@@ -104,8 +99,7 @@ To force revalidation of known TDocs:
tdoc-crawler crawl-tdocs --incremental --force-revalidate
```

## History / Related Design Docs

- Design notes and architecture rationale located in `docs/design_meeting_doclist_architecture.md`.
- Feature specification and user-facing behavior in `docs/MEETING_DOCUMENT_LIST_FEATURE.md`.
+94 −94

File changed.

Contains only whitespace changes.

+46 −46

File changed.

Contains only whitespace changes.

+55 −55

File changed.

Contains only whitespace changes.
