- Working groups (RAN, SA, CT) and their subgroups
- TDoc naming conventions and patterns
- Meeting identification and structure
@@ -56,8 +56,11 @@ Therefore:
- Avoid gratuitous enthusiasm or generalizations. Use thoughtful comparisons like saying which code is "cleaner" but don't congratulate yourself. Avoid subjective descriptions. For example, don't say "I've meticulously improved the code and it is in great shape!" That is a useless generalization. Instead, specifically say what you've done, e.g., "I've added types, including generics, to all the methods in `Foo` and fixed all linter errors."
- Use `git` for version control. Use `main` as the main branch name.
- Use `git add ...` to add new files, **but only rarely** and **only those that are very likely** to be committed. **Do not add files** that are most likely to be deleted or changed significantly in the following steps. When in doubt, do not add the file and ask/confirm with the user.
- **Never** run `git commit` or `git push` on your own!
- `.env` files **MUST NOT** be committed to version control.
### Using Comments
@@ -117,9 +120,9 @@ src/tdoc_crawler/
### Module Design Principles
1. **Submodule Re-exports**: Both `models/` and `crawlers/` use `__init__.py` to re-export all public symbols
2. **Single Responsibility**: Each file focuses on one concern
3. **Type Safety**: All modules use comprehensive type hints with `from __future__ import annotations`
4. **Import Pattern**: Other modules import from `tdoc_crawler.models` and `tdoc_crawler.crawlers`, not from submodules directly
## Usage of uv and project management
@@ -179,8 +182,8 @@ src/tdoc_crawler/
The project maintains three levels of documentation:
**Reviewer**: AI Assistant
## 1. Executive Summary
The previous review (earlier part of this file) highlighted missing documentation on HTTP crawling, subdirectory detection, fuzzy meeting name matching, and portal authentication. Those concerns have since been largely addressed inside `AGENTS.md` (the document now contains sections titled “HTTP Directory Crawling and File Detection” and subdirectory logic, plus portal module references). However, substantial new divergences have appeared after the recent refactor:
**Summary:** This document outlines proposed changes to `AGENTS.md` to align it with the current codebase. The recent implementations of progress bars, subgroup normalization, and CLI helper functions have introduced patterns that are not yet reflected in the instructions.
The `AGENTS.md` file currently places helper function implementations (`resolve_cache_dir`, `get_credentials`, `_infer_working_groups_from_ids`) inside a single `cli.py` code block. This does not match the current, more modular project structure.
### Current Implementation
The project has been refactored to use a dedicated `src/tdoc_crawler/cli/helpers.py` module, which contains these helper functions and more. A new helper, `infer_working_groups_from_subgroups`, was also introduced to improve CLI usability.
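
As a rough illustration of what such an inference helper might look like, the sketch below maps canonical subgroup codes to their parent working groups by first letter. The prefix table, the first-seen ordering, and the choice to silently skip unknown codes are assumptions made for this example; the real implementation in `cli/helpers.py` may differ.

```python
from __future__ import annotations

# Assumed mapping from the first letter of a subgroup code to its
# working group (R1 -> RAN, S4 -> SA, C3 -> CT). Illustrative only.
_PREFIX_TO_WORKING_GROUP = {"R": "RAN", "S": "SA", "C": "CT"}


def infer_working_groups_from_subgroups(subgroups: list[str]) -> list[str]:
    """Return the working groups implied by the given subgroup codes.

    Duplicates are dropped and first-seen order is kept. Codes with an
    unknown prefix are skipped (an assumption of this sketch).
    """
    inferred: list[str] = []
    for subgroup in subgroups:
        code = subgroup.strip().upper()
        working_group = _PREFIX_TO_WORKING_GROUP.get(code[:1])
        if working_group is not None and working_group not in inferred:
            inferred.append(working_group)
    return inferred
```

With a helper of this shape, a CLI command that receives only subgroups such as `S4` and `R1` could derive `SA` and `RAN` without the user naming the working groups explicitly.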
@@ -28,30 +24,26 @@ High‑impact mismatches:
2. The project structure still assumes a monolithic `cli.py` and `database.py`, whereas the code is now split (e.g. `src/tdoc_crawler/cli/app.py`, `cli/helpers.py`, `cli/fetching.py`, `cli/printing.py` and database submodules `database/schema.py`, `database/connection.py`, `database/tdocs.py`, `database/statistics.py`).
3. The removal of redundant columns (`working_group`, `subgroup`, `meeting`) from the `tdocs` table (schema v2) is not explicitly documented as a completed normalization step, nor are its consequences (JOIN-based derivation from `meetings`).
4. Testing guidance does not warn that foreign key integrity now requires inserting meetings before TDocs (this was the root cause of earlier failing tests).
5. Naming consistency: the live schema uses `for_purpose` but `AGENTS.md` still refers to `for_value`; this is a potential source of incorrect regenerated code.
6. Statistics and helper queries in current code derive working group counts via JOIN; `AGENTS.md` still suggests grouping directly on a removed column.
7. Crawl log structure changed (new fields: `crawl_type`, `start_time`, `end_time`, `incremental`, `items_added`, `items_updated`, `errors_count`, `status`), but the document lists an older minimal form (`timestamp`, `tdocs_discovered`, etc.).
If left uncorrected, a coding assistant following the existing `AGENTS.md` would regenerate obsolete schema, reintroduce denormalized columns, misname fields, and write incompatible queries/tests.
1. **Update Project Structure:** Modify the `Project Structure` section to show that `cli/` is a submodule containing `app.py` and `helpers.py`.
2. **Relocate Helper Functions:** Move the `Helper Function Implementations` section to describe the contents of `cli/helpers.py`.
3. **Add New Helper Function:** Document the `infer_working_groups_from_subgroups` function and its purpose.
4. **Update `parse_working_groups` Logic:** Explain that `parse_working_groups` in `helpers.py` should now accept an optional `subgroups` list to enable inference, and that CLI commands should parse subgroups *before* working groups.
## 2. Enhance Subgroup Alias Normalization Logic
### Current State in `AGENTS.md`
The `Working Group Alias Handling` section provides an inaccurate and overly simplistic implementation for `normalize_subgroup_alias`. It suggests the function just calls `normalize_working_group_alias`, which is incorrect.
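
For contrast with the simplistic version, here is a minimal sketch of the prefix transformation this review attributes to the real function (SA→S, RAN→R, CT→C) with a list return value. The set of known subgroups is an illustrative subset, and expanding a bare working group name to all of its subgroups is an assumption about what "matching canonical names" means; the code in `src/tdoc_crawler/crawlers/meetings.py` is authoritative.

```python
from __future__ import annotations

import re

# Long-form working group prefixes and their short-form equivalents.
_LONG_TO_SHORT = {"RAN": "R", "SA": "S", "CT": "C"}

# Illustrative subset of canonical subgroup codes, not a complete list.
_KNOWN_SUBGROUPS = ("R1", "R2", "R3", "R4", "S1", "S2", "S3", "S4", "C1", "C3", "C4")

_ALIAS_PATTERN = re.compile(r"(RAN|SA|CT|R|S|C)(\d*)")


def normalize_subgroup_alias(alias: str) -> list[str]:
    """Map an alias such as 'SA4' or 'ran1' to canonical short names."""
    match = _ALIAS_PATTERN.fullmatch(alias.strip().upper())
    if match is None:
        return []
    prefix, number = match.groups()
    short = _LONG_TO_SHORT.get(prefix, prefix)
    if number:
        candidate = short + number
        return [candidate] if candidate in _KNOWN_SUBGROUPS else []
    # Assumption: a bare working group name matches all of its subgroups.
    return [name for name in _KNOWN_SUBGROUPS if name.startswith(short)]
```

Returning a list rather than a single string lets callers treat "no match", "one match", and "several matches" uniformly.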
The actual `normalize_subgroup_alias` function in `src/tdoc_crawler/crawlers/meetings.py` is more sophisticated. It correctly transforms long-form subgroup names to their canonical short-form equivalents (e.g., `SA4` → `S4`, `RAN1` → `R1`).
### Proposed Changes to `AGENTS.md`
- **Replace `normalize_subgroup_alias`:** Update the example in the `Working Group Alias Handling` section to reflect the current logic, which includes transforming prefixes (SA→S, RAN→R, CT→C) and returning a list of matching canonical names. This ensures the assistant generates the correct, more robust function.
## 2. Recommended Structural Updates to AGENTS.md
| Area | Current Doc State | Required Update | Rationale |
|------|-------------------|-----------------|-----------|
@@ -80,8 +72,6 @@ Progress is tracked at the database level, not the collection level.| Statistics
- The CLI uses Rich's `Progress` with `BarColumn` and `MofNCompleteColumn` to display a determinate progress bar.
### Proposed Changes to `AGENTS.md`
- **Add a New Section:** Create a new section under `Implementation Patterns` titled "Progress Bar Implementation".
---
## 3. Detailed Change Proposals
@@ -92,7 +82,7 @@ Progress is tracked at the database level, not the collection level.| Statistics
- Provide the `Callable[[float, float], None]` signature.
- Show a conceptual example of the `bulk_upsert_*` method and the corresponding `Progress` block in the CLI.
- Emphasize that this pattern provides a much better user experience than an indeterminate spinner.
```text
src/tdoc_crawler/
```
@@ -100,13 +90,17 @@ Progress is tracked at the database level, not the collection level.| Statistics
### Current State in `AGENTS.md`
The `CLI Commands Implementation` section is missing the `--clear-db` and `--clear-tdocs` flags that were recently added to the `crawl-meetings` and `crawl-tdocs` commands, respectively.
```text
    helpers.py   # Path/credentials resolution, fuzzy resolution helpers
    printing.py  # Output formatting (table/json/yaml/csv)
```
@@ -116,7 +110,9 @@ The `CLI Commands Implementation` section is missing the `--clear-db` and `--cle
- The `TDocDatabase` class has corresponding `clear_all_data()` and `clear_tdocs()` methods.
### Proposed Changes to `AGENTS.md`
```text
  database/
    schema.py      # SCHEMA_VERSION, table DDL, reference population
    connection.py  # TDocDatabase facade/context manager
```
@@ -130,7 +126,7 @@ The `CLI Commands Implementation` section is missing the `--clear-db` and `--cle
By incorporating these changes, `AGENTS.md` will be more aligned with the current state of the project, enabling a coding assistant to reproduce the existing functionality more accurately.
```text
  __main__.py
```
Add an explicit note: “Monolithic `cli.py` and `database.py` referenced elsewhere are legacy; new contributions MUST use the modular structure above.”
@@ -160,7 +156,7 @@ tdocs(
```sql
  validated INTEGER NOT NULL DEFAULT 0,
  validation_failed INTEGER NOT NULL DEFAULT 0
)
```
Add note: “Columns `working_group`, `subgroup`, `meeting` removed in v2 – derive via JOIN on `meetings`.”
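
To make the consequences of the v2 normalization concrete, the following sketch uses an in-memory SQLite database to show both points raised in this review: a TDoc cannot be inserted before its meeting once foreign keys are enforced, and working group counts come from a JOIN on `meetings`. The table layout is a deliberately reduced, hypothetical one (column names such as `meeting_id`, `tdoc_id`, and `short_name` and all inserted values are placeholders); `database/schema.py` remains the authoritative source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on for the connection.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical, reduced schema for illustration only.
conn.executescript(
    """
    CREATE TABLE meetings (
        meeting_id    INTEGER PRIMARY KEY,
        working_group TEXT NOT NULL,
        subgroup      TEXT NOT NULL,
        short_name    TEXT NOT NULL
    );
    CREATE TABLE tdocs (
        tdoc_id           TEXT PRIMARY KEY,
        meeting_id        INTEGER NOT NULL REFERENCES meetings(meeting_id),
        validated         INTEGER NOT NULL DEFAULT 0,
        validation_failed INTEGER NOT NULL DEFAULT 0
    );
    """
)

# Inserting a TDoc before its meeting violates the foreign key.
try:
    conn.execute("INSERT INTO tdocs (tdoc_id, meeting_id) VALUES ('T-0001', 1)")
    fk_error = False
except sqlite3.IntegrityError:
    fk_error = True

# Correct order: meeting first, then the TDoc that references it.
conn.execute("INSERT INTO meetings VALUES (1, 'RAN', 'R1', 'placeholder-meeting')")
conn.execute("INSERT INTO tdocs (tdoc_id, meeting_id) VALUES ('T-0001', 1)")

# Working group counts are derived via JOIN, not from a column on tdocs.
rows = conn.execute(
    """
    SELECT m.working_group, COUNT(*) AS tdoc_count
    FROM tdocs AS t
    JOIN meetings AS m ON m.meeting_id = t.meeting_id
    GROUP BY m.working_group
    """
).fetchall()
```

Test fixtures that follow the same order (meetings, then TDocs) avoid the integrity failures mentioned in the testing guidance item.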
`AGENTS.md` should be revised to reflect schema v2 and modular architecture before further feature work. The changes are corrective (alignment with implemented code) rather than new feature proposals. Proceed with updating the document; no blocking concerns identified.
If implementers decide not to update immediately, at minimum pin a banner at the top: “WARNING: Some sections are outdated for schema v2; consult `database/schema.py` and `cli/` submodules.”
State plainly: regeneration attempts MUST read actual source modules for authoritative column lists; treat existing schema description in `AGENTS.md` as deprecated.
The current `AGENTS.md` is partially updated but still materially inconsistent with live code. Updating it per the checklist above will reduce regeneration errors and help assistants produce normalized, forward-compatible implementations.