Commit 9b1c03d4 authored by Jan Reimes

applied markdown formatting/linting

parent 423c1e3d
+14 −11
# Assistant Rules for TDoc-Crawler - Command line tool for querying structured 3GPP TDoc data

______________________________________________________________________

## description: General Guidelines alwaysApply: true

## Quick Start for Coding Assistants

Before implementing features, review these critical sections:

1. **Project Structure** - Understand the models/ and crawlers/ submodule organization
1. **CLI Commands** - Review the command signatures in `src/tdoc_crawler/cli/app.py`
1. **Database Schema** - Review models in `src/tdoc_crawler/models/` and database operations

**Key Files to Examine First:**

@@ -30,6 +29,7 @@ skill 3gpp --query "working groups"
```

The skill provides authoritative information on:

- Working groups (RAN, SA, CT) and their subgroups
- TDoc naming conventions and patterns
- Meeting identification and structure
@@ -56,8 +56,11 @@ Therefore:
- Avoid gratuitous enthusiasm or generalizations. Use thoughtful comparisons like saying which code is "cleaner" but don't congratulate yourself. Avoid subjective descriptions. For example, don't say "I've meticulously improved the code and it is in great shape!" That is useless generalization. Instead, specifically say what you've done, e.g., "I've added types, including generics, to all the methods in `Foo` and fixed all linter errors."

- Use `git` for version control. Use `main` as the main branch name.

- Use `git add ...` to add new files, **but only rarely** and **only those that are very likely** to be committed. **Do not add files** that are most likely to be deleted or changed significantly in the following steps. If in doubt, do not add the file; ask/confirm with the user.

- **Never** run `git commit` or `git push` on your own!

- `.env` files **MUST NOT** be committed to version control.

### Using Comments
@@ -117,9 +120,9 @@ src/tdoc_crawler/
### Module Design Principles

1. **Submodule Re-exports**: Both `models/` and `crawlers/` use `__init__.py` to re-export all public symbols
1. **Single Responsibility**: Each file focuses on one concern
1. **Type Safety**: All modules use comprehensive type hints with `from __future__ import annotations`
1. **Import Pattern**: Other modules import from `tdoc_crawler.models` and `tdoc_crawler.crawlers`, not from submodules directly
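The `from __future__ import annotations` convention noted under Type Safety can be sketched with a minimal, hypothetical module (the class name is invented for illustration; the project's real models live in `tdoc_crawler.models`):

```python
from __future__ import annotations  # annotations are stored as strings, so
                                    # forward references are safe at class time

from dataclasses import dataclass


@dataclass
class Node:
    """Toy model used only to demonstrate the future-import pattern."""
    value: int
    next: Node | None = None  # refers to Node before the class fully exists
```

Without the future import, the `Node | None` annotation would be evaluated eagerly and fail, because `Node` is not yet defined while the class body executes.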

## Usage of uv and project management

@@ -179,8 +182,8 @@ src/tdoc_crawler/
The project maintains three levels of documentation:

1. **README.md** - Project overview, installation, basic usage
1. **docs/QUICK_REFERENCE.md** - Comprehensive command reference (MUST be kept current)
1. **docs/history/** - Chronological changelog of all significant changes

**Critical Rules:**

+6 −4
@@ -207,6 +207,7 @@ uv run ty check
The project follows a modular structure:

1. **`models/`**: Pydantic models for data validation and configuration

   - `base.py`: Base configuration models, enums (OutputFormat, SortOrder)
   - `working_groups.py`: WorkingGroup enum with tbid/ftp_root properties
   - `subworking_groups.py`: SubworkingGroup reference data
@@ -214,19 +215,20 @@ The project follows a modular structure:
   - `meetings.py`: Meeting metadata models and configurations
   - `crawl_limits.py`: Crawl throttling configuration

1. **`crawlers/`**: Web scraping and FTP crawling logic

   - `tdocs.py`: TDocCrawler - FTP directory traversal, TDoc discovery
   - `meetings.py`: MeetingCrawler - HTML parsing from 3GPP portal
   - `portal.py`: Portal authentication and metadata extraction

1. **`database.py`**: SQLite database interface with typed wrappers

1. **`cli.py`**: Command-line interface using Typer and Rich
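The `WorkingGroup` enum with `tbid`/`ftp_root` properties described above follows the standard enum-with-properties pattern. A hedged sketch (member names, values, and the URL layout are assumptions for illustration; see `src/tdoc_crawler/models/working_groups.py` for the authoritative definitions):

```python
from enum import Enum


class WorkingGroup(Enum):
    # Hypothetical members; the real module defines the full 3GPP list.
    SA4 = "S4"
    RAN1 = "R1"

    @property
    def ftp_root(self) -> str:
        # Assumed URL scheme for illustration only.
        return f"https://www.3gpp.org/ftp/tsg_{self.name.lower()}"


print(WorkingGroup.SA4.ftp_root)
```

Attaching derived values as properties keeps reference data next to the enum members instead of in parallel lookup tables.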

## License

This project is licensed under the terms specified in the [LICENSE](LICENSE) file.

______________________________________________________________________

Repository initiated with [fpgmaas/cookiecutter-uv](https://github.com/fpgmaas/cookiecutter-uv).
+4 −4
@@ -118,10 +118,10 @@ uv run tdoc-crawler open <TDoc ID> [--cache-dir PATH]
Download (if needed), extract, and launch a TDoc with the system default application.

1. Queries the database; if missing, runs the same targeted crawl workflow as `query`.
1. Downloads the TDoc into `<cache-dir>/tdocs/`, only accepting `ftp://`, `http://`, or `https://` URLs.
1. `.zip` payloads are extracted into `<cache-dir>/tdocs/<TDoc ID>/`; the first file encountered is opened.
1. Non-archive payloads are saved once and reused.
1. Launches via `os.startfile` on Windows, `open` on macOS, or `xdg-open` on Linux. Errors halt the command with a non-zero exit code.
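The platform-specific launch step can be sketched as a small helper that selects the launcher command (function names are hypothetical; the real CLI may structure this differently):

```python
from __future__ import annotations

import os
import subprocess
import sys


def opener_command(path: str, platform: str = sys.platform) -> list[str] | None:
    """Return the launch command for `path`; None means use os.startfile (Windows)."""
    if platform.startswith("win"):
        return None  # launched via os.startfile(path)
    if platform == "darwin":
        return ["open", path]
    return ["xdg-open", path]


def launch(path: str) -> None:
    cmd = opener_command(path)
    if cmd is None:
        os.startfile(path)  # type: ignore[attr-defined]  # Windows-only API
    else:
        # check=True makes launcher failures raise, matching the
        # non-zero-exit-code behavior described above.
        subprocess.run(cmd, check=True)
```

Separating command selection from execution keeps the dispatch logic testable without actually opening files.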

### `query-meetings`

+62 −67
# Review and Improvements for AGENTS.md

**Date**: 2025-10-22 (Addendum)

**Reviewer**: AI Assistant
@@ -10,14 +8,12 @@

______________________________________________________________________

## 1. Executive Summary

The previous review (earlier part of this file) highlighted missing documentation on HTTP crawling, subdirectory detection, fuzzy meeting name matching, and portal authentication. Those concerns have since been largely addressed inside `AGENTS.md` (the document now contains sections titled “HTTP Directory Crawling and File Detection” and subdirectory logic, plus portal module references). However, substantial new divergences have appeared after the recent refactor:
@@ -28,30 +24,26 @@ High‑impact mismatches:

1. Project structure still assumes monolithic `cli.py` and `database.py`, whereas code is now split (e.g. `src/tdoc_crawler/cli/app.py`, `cli/helpers.py`, `cli/fetching.py`, `cli/printing.py`, and database submodules `database/schema.py`, `database/connection.py`, `database/tdocs.py`, `database/statistics.py`).

1. The removal of redundant columns (`working_group`, `subgroup`, `meeting`) from the `tdocs` table (schema v2) is not explicitly documented as a completed normalization step, nor are its consequences (JOIN-based derivation from `meetings`).

1. Testing guidance does not warn that foreign key integrity now requires inserting meetings before TDocs (the root cause of earlier failing tests).

1. Naming consistency: the live schema uses `for_purpose` but `AGENTS.md` still refers to `for_value`; this is a potential source of regenerated incorrect code.

1. Statistics and helper queries in current code derive working group counts via JOIN; `AGENTS.md` still suggests grouping directly on a removed column.

1. Crawl log structure changed (new fields: `crawl_type`, `start_time`, `end_time`, `incremental`, `items_added`, `items_updated`, `errors_count`, `status`), but the document lists an older, minimal form (`timestamp`, `tdocs_discovered`, etc.).

If left uncorrected, a coding assistant following the existing `AGENTS.md` would regenerate the obsolete schema, reintroduce denormalized columns, misname fields, and write incompatible queries/tests.
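The expanded `crawl_log` structure mentioned above can be sketched as DDL. Only the field names come from the review; the column types and defaults here are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical DDL for the expanded crawl_log table; the authoritative
# definition lives in database/schema.py.
con.execute("""
    CREATE TABLE crawl_log (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        crawl_type TEXT NOT NULL,
        start_time TEXT NOT NULL,
        end_time TEXT,
        incremental INTEGER NOT NULL DEFAULT 0,
        items_added INTEGER NOT NULL DEFAULT 0,
        items_updated INTEGER NOT NULL DEFAULT 0,
        errors_count INTEGER NOT NULL DEFAULT 0,
        status TEXT NOT NULL DEFAULT 'running'
    )
""")
con.execute(
    "INSERT INTO crawl_log (crawl_type, start_time) VALUES (?, ?)",
    ("tdocs", "2025-10-22T10:00:00"),
)
row = con.execute("SELECT status FROM crawl_log").fetchone()
```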

______________________________________________________________________

## 2. Recommended Structural Updates to AGENTS.md

| Area | Current Doc State | Required Update | Rationale |
|------|-------------------|-----------------|-----------|
@@ -80,8 +72,6 @@

______________________________________________________________________

## 3. Detailed Change Proposals

```text
src/tdoc_crawler/
    cli/
        app.py          # Typer application & command registration
        fetching.py     # Targeted fetch & portal orchestration
        helpers.py      # Path/credentials resolution, fuzzy resolution helpers
        printing.py     # Output formatting (table/json/yaml/csv)
    crawlers/
        tdocs.py        # HTTP traversal, subdirectory detection, pattern filtering
    database/
        schema.py       # SCHEMA_VERSION, table DDL, reference population
        connection.py   # TDocDatabase facade/context manager
    __main__.py
```

Add an explicit note: “Monolithic `cli.py` and `database.py` referenced elsewhere are legacy; new contributions MUST use the modular structure above.”

@@ -160,7 +156,7 @@ tdocs(
    validated INTEGER NOT NULL DEFAULT 0,
    validation_failed INTEGER NOT NULL DEFAULT 0
)
```

Add note: “Columns `working_group`, `subgroup`, `meeting` removed in v2 – derive via JOIN on `meetings`.”

@@ -202,16 +198,15 @@ Add “Foreign Key Preparation” subsection:
- Fixtures should centralize meeting creation (e.g., helper akin to `insert_sample_meetings`).
- Negative tests: intentionally omit meeting to assert FK failure only when testing integrity.
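The Foreign Key Preparation rule can be illustrated with a minimal sketch (simplified stand-in tables, not the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
    CREATE TABLE meetings (id INTEGER PRIMARY KEY);
    CREATE TABLE tdocs (
        tdoc_id TEXT PRIMARY KEY,
        meeting_id INTEGER NOT NULL REFERENCES meetings(id)
    );
""")

# Inserting a TDoc before its meeting violates the FK constraint
# (the negative-test case described above).
try:
    con.execute("INSERT INTO tdocs VALUES ('S4-251234', 99)")  # no such meeting
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True

# Correct order: parent meeting first, then the TDoc.
con.execute("INSERT INTO meetings VALUES (1)")
con.execute("INSERT INTO tdocs VALUES ('S4-251234', 1)")
```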


### 3.9 Migration Guidance (Schema v1→v2)

Outline minimal steps:

1. Backup DB.
1. Create new schema (v2) in empty DB.
1. Copy `meetings`, `working_groups`, `subworking_groups`.
1. Transform old `tdocs` rows: drop redundant columns, map `for_value` → `for_purpose`, inject `date_updated` = `date_retrieved` if missing.
1. Recompute statistics/crawl log if structure changed.

Include warning: direct in-place ALTER sequence is more complex due to column removals; prefer rebuild + migrate.
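The rebuild-and-migrate approach can be sketched as follows (table definitions are simplified stand-ins; the real v2 DDL lives in `database/schema.py`):

```python
import sqlite3

old = sqlite3.connect(":memory:")  # stands in for the backed-up v1 database
old.executescript("""
    CREATE TABLE tdocs (tdoc_id TEXT PRIMARY KEY, for_value TEXT, date_retrieved TEXT);
    INSERT INTO tdocs VALUES ('S4-251234', 'Approval', '2025-10-01');
""")

new = sqlite3.connect(":memory:")  # fresh database with the v2 schema
new.execute("""
    CREATE TABLE tdocs (
        tdoc_id TEXT PRIMARY KEY,
        for_purpose TEXT,            -- renamed from for_value
        date_retrieved TEXT,
        date_updated TEXT
    )
""")

# Transform old rows: map for_value -> for_purpose and default
# date_updated to date_retrieved when missing.
for tdoc_id, for_value, retrieved in old.execute(
    "SELECT tdoc_id, for_value, date_retrieved FROM tdocs"
):
    new.execute(
        "INSERT INTO tdocs VALUES (?, ?, ?, ?)",
        (tdoc_id, for_value, retrieved, retrieved),
    )
```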

@@ -223,7 +218,7 @@ Ensure AGENTS.md states: “Portal metadata fetch is optional for basic crawling

Add brief editorial checklist for maintainers: verify schema version constant, confirm normalized queries, ensure no resurrected legacy column names, validate tests insert meetings first, confirm docs reflect current CLI module split.

______________________________________________________________________

## 4. Redundant / Outdated Content to Remove or Condense

@@ -235,8 +230,7 @@ Add brief editorial checklist for maintainers: verify schema version constant, c
| Grouping examples over `working_group` column | Replace with JOIN-based examples |
| Outdated crawl_log schema (timestamp + counts) | Replace with expanded fields table |
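A hedged sketch of such a JOIN-based grouping, using hypothetical minimal tables in place of the real schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE meetings (id INTEGER PRIMARY KEY, working_group TEXT);
    CREATE TABLE tdocs (tdoc_id TEXT PRIMARY KEY,
                        meeting_id INTEGER REFERENCES meetings(id));
    INSERT INTO meetings VALUES (1, 'SA'), (2, 'RAN');
    INSERT INTO tdocs VALUES ('S4-251234', 1), ('S4-251235', 1), ('R1-250001', 2);
""")

# Working-group counts derived via JOIN on meetings, not from a
# (removed) tdocs.working_group column.
counts = dict(con.execute("""
    SELECT m.working_group, COUNT(*)
    FROM tdocs t
    JOIN meetings m ON t.meeting_id = m.id
    GROUP BY m.working_group
"""))
```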


______________________________________________________________________

## 5. Items Already Correct (No Change Needed)

@@ -247,8 +241,7 @@ These portions of `AGENTS.md` accurately reflect current design:
- Fuzzy meeting name matching principles (prefix/suffix strategy without substring matching).
- Credential resolution order (CLI args → env vars → prompt).
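The credential resolution order can be sketched as below; the environment variable names are hypothetical stand-ins, not necessarily what the tool reads:

```python
from __future__ import annotations

import os
from getpass import getpass


def get_credentials(
    cli_user: str | None = None, cli_password: str | None = None
) -> tuple[str, str]:
    """Resolve credentials in order: CLI args -> env vars -> interactive prompt."""
    user = cli_user or os.environ.get("TDOC_USER") or input("Username: ")
    password = cli_password or os.environ.get("TDOC_PASSWORD") or getpass("Password: ")
    return user, password
```

Each source is consulted only if the previous one yielded nothing, so non-interactive use (CI, scripts) never reaches the prompt.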


______________________________________________________________________

## 6. Risk Assessment if Not Updated

@@ -260,41 +253,43 @@ These portions of `AGENTS.md` accurately reflect current design:
| Missing FK setup in tests | High volume of failing tests, false negatives |
| Misleading project structure | Contributor confusion, bloated single-file rewrites |


______________________________________________________________________

## 7. Summary of Required Doc Changes (Action Checklist)

1. Update project structure tree → modular layout.
1. Replace TDocs table definition with v2 schema; add normalization rationale.
1. Add JOIN pattern for deriving working group/subgroup.
1. Update statistics aggregation section.
1. Document new crawl_log fields.
1. Standardize naming (`for_purpose`).
1. Add FK test preparation guidance.
1. Add migration guide v1→v2.
1. Clarify portal failure handling (non-fatal, mark `validation_failed`).
1. Insert consistency editorial checklist.

______________________________________________________________________

## 8. Final Recommendation

`AGENTS.md` should be revised to reflect schema v2 and modular architecture before further feature work. The changes are corrective (alignment with implemented code) rather than new feature proposals. Proceed with updating the document; no blocking concerns identified.

If implementers decide not to update immediately, at minimum pin a banner at top: “WARNING: Some sections are outdated for schema v2; consult `database/schema.py` and `cli/` submodules.”

______________________________________________________________________

## 9. If No Immediate Edits Are Possible

State plainly: regeneration attempts MUST read actual source modules for authoritative column lists; treat existing schema description in `AGENTS.md` as deprecated.

______________________________________________________________________

## 10. Conclusion

The current `AGENTS.md` is partially updated but still materially inconsistent with live code. Updating it per the checklist above will materially reduce regeneration errors and ensure assistants produce normalized, forward-compatible implementations.

______________________________________________________________________

End of addendum.

**Note**: Legacy content from earlier review iterations has been removed for clarity.
+15 −5
@@ -33,6 +33,7 @@ git commit -m "Initial commit: tdoc-crawler with comprehensive test suite and .e
```

**Files Committed:** 29 files, 6,198 insertions

- All source code
- Complete test suite (55 tests)
- Documentation (README, QUICK_REFERENCE, history files)
@@ -46,6 +47,7 @@ gh repo create tdoc-crawler --private --source=. --remote=origin \
```

**Repository Details:**

- **URL:** https://github.com/Jan-Reimes_HEAD/tdoc-crawler
- **Visibility:** Private (Enterprise Managed User account requirement)
- **Remote:** origin
@@ -54,10 +56,12 @@ gh repo create tdoc-crawler --private --source=. --remote=origin \
## Repository Status

### Current Branch

- **Branch:** main
- **Commits:** 1 (initial commit)

### Remote Configuration

```
origin  https://github.com/Jan-Reimes_HEAD/tdoc-crawler.git (fetch)
origin  https://github.com/Jan-Reimes_HEAD/tdoc-crawler.git (push)
@@ -72,6 +76,7 @@ git push -u origin main
```

This will:

- Push the main branch to the remote repository
- Set up upstream tracking for the main branch
- Make the code available on GitHub at https://github.com/Jan-Reimes_HEAD/tdoc-crawler
@@ -118,34 +123,39 @@ gh pr create --title "Your Feature" --body "Description"

1. **Private Repository:** The repository was created as private due to Enterprise Managed User restrictions.

1. **.gitignore:** Already configured to exclude:

   - `.env` files (as required by AGENTS.md)
   - Python cache files (`__pycache__`, `*.pyc`)
   - Virtual environments (`.venv`, `venv/`)
   - Build artifacts (`dist/`, `build/`)
   - IDE files (`.idea/`, `.vscode/`)

1. **Uncommitted Changes:** There is currently one modified file (`.gitignore`) that needs to be committed separately if changes were made.

## GitHub Repository Features to Enable

Consider enabling these on GitHub:

1. **Branch Protection Rules** for `main`:

   - Require pull request reviews
   - Require status checks to pass
   - Require branches to be up to date

1. **GitHub Actions**:

   - CI/CD workflow for running tests
   - Code coverage reporting
   - Automated releases

1. **Issue Templates**:

   - Bug report template
   - Feature request template

1. **Pull Request Template**:

   - Standardized PR description format

## Compliance with AGENTS.md
Loading