docs(AGENTS): create AGENTS.md for tdoc_crawler package guidelines (5137f0e0) · Commits · Jan Reimes / 3gpp-crawler

AGENTS.md

+85 −163

		@@ -2,202 +2,142 @@

		Command line tool for querying structured 3GPP TDoc data.

		## Quick Start
		---

		Before implementing features, review:
		## Development Commands

		1. Project Structure - Domain-oriented architecture (`tdocs/`, `meetings/`, `specs/`)
		2. CLI Commands - Command signatures in `src/tdoc_crawler/cli/app.py`
		3. Database Schema - Models in `src/tdoc_crawler/models/` and database operations
		> All Python commands use `uv run` to activate the virtual environment.

		## Core Architecture Rules

		### Import Patterns

		Correct imports:

		```python
		# TDoc operations
		from tdoc_crawler.tdocs import TDocCrawler
		from tdoc_crawler.tdocs.operations.fetch import fetch_missing_tdocs
		from tdoc_crawler.tdocs.sources.whatthespec import resolve_via_whatthespec

		# Meeting operations
		from tdoc_crawler.meetings import MeetingCrawler, normalize_working_group_alias

		# Spec operations
		from tdoc_crawler.specs import SpecDatabase, SpecDownloads
		from tdoc_crawler.specs.operations.checkout import checkout_spec
		```bash
		uv run pytest -v # Run tests
		ruff check src/ tests/ # Lint after changes
		uv add <package> # Add dependency
		uv build # Package application
		```

		### Circular Import Prevention

		Rule: If you encounter a circular import, refactor the code to eliminate it. Never use `TYPE_CHECKING` guards or lazy imports as a permanent solution.

		Strategy:

		1. Identify the circular dependency
		2. Extract shared types to `models/` layer
		3. Both modules import from the neutral models layer
		4. Use lazy imports (inside functions) only for temporary fixes during refactoring

		### Anti-Duplication (DRY)

		CRITICAL: Code duplication drove major domain-oriented refactoring. Future assistants MUST NOT introduce duplicated logic.

		Search Before Implement:

		1. Use grep tools to check whether a similar implementation already exists
		2. Check the relevant domain package (`tdocs/`, `meetings/`, `specs/`)
		3. If the logic exists but needs modification, REFACTOR rather than creating a second version
		For package-specific commands, see the respective `AGENTS.md` in each package.

		Logic Placement Rules:
		---

		- Domain Logic: Must live in `src/tdoc_crawler/<domain>/`. NEVER in `cli/`, `parsers/`, or `utils/`
		- Parsing Logic: Must live in `src/tdoc_crawler/parsers/`
		- API Clients: Must live in `src/tdoc_crawler/clients/`
		- Shared Utilities: Must live in `src/tdoc_crawler/utils/` only if truly generic
		## Project Constraints

		Prohibited Patterns:

		- CLI Duplication: Do not copy domain logic into CLI. CLI handles I/O only
		- Test Duplication: Do not copy library code into tests. Use proper mocking
		- Helper Bloat: Do not create `utils.py` files that duplicate `src/tdoc_crawler/utils/`

		## Skills Usage

		This project uses specialized skills for domain-specific guidance. Load skills based on context:

		### 3GPP Domain Skills

		Located in `.agents/skills/3gpp/`:

		| Skill | When to Use |
		|-------|-------------|
		| `3gpp-basics` | 3GPP organization, hierarchy, releases, TDocs overview |
		| `3gpp-working-groups` | WG codes, tbid/SubTB identifiers, subgroup hierarchy |
		| `3gpp-meetings` | Meeting structure, naming conventions, quarterly plenaries |
		| `3gpp-tdocs` | TDoc patterns, metadata, FTP server access |
		| `3gpp-specifications` | TS/TR numbering, spec file formats, FTP directories |
		| `3gpp-releases` | Release structure, versioning, TSG rounds |
		| `3gpp-change-request` | CR procedure, workflow, status tracking |
		| `3gpp-portal-authentication` | EOL authentication, portal data fetching |

		### Programming Skills

		Located in `.agents/skills/`:
		### Virtual Environment (MANDATORY)

		| Skill | When to Use |
		|-------|-------------|
		| `python-standards` | Writing/reviewing Python code, type hints, linting |
		| `test-driven-development` | TDD with pytest, fixtures, mocking, coverage |
		| `code-deduplication` | Preventing semantic duplication, capability index |
		| `documentation-workflow` | Updating docs, structure, best practices |
		| `visual-explainer` | Creating diagrams, architecture overviews |
		Use `uv run <command>` for all Python commands. The virtual environment must be activated before running pytest, CLI, or any project scripts.

		## Mandatory Constraints
		### Linter Rules

		### Virtual Environment (MANDATORY)
		- NEVER suppress linter issues with `# noqa` in `src/` or `tests/`
		- MUST NOT introduce: `PLC0415`, `ANN001`, `E402`, `ANN201`, `ANN202`
		- Run `ruff check src/ tests/` after changes

		Whenever executing shell commands (via `uv`, `pytest`, CLI), you MUST ensure the Python virtual environment is activated. Use `uv run <command>` for all Python commands.
		### Git and Version Control

		### HTTP Caching (MANDATORY)
		- Use `git` with `main` as main branch
		- Use `git add` sparingly — only for files likely to be committed
		- Never run `git commit` or `git push` autonomously
		- `.env` files MUST NOT be committed

		For core crawler source traffic (3gpp.org, whatthespec.net, portal), all HTTP requests MUST use `create_cached_session()` from `tdoc_crawler.http_client` to enable hishel caching.
		---

		- Reduces network overhead for incremental crawls (50-90% faster)
		- Prevents rate-limiting/blocking from 3GPP servers
		- AI model-provider traffic is exempt (follows approved provider integration)
		## Code Style

		### Python Standards

		Use skill `python-standards` for all Python coding tasks. Key rules:
		Use skill `python-standards` for all Python coding tasks.

		Project-Specific Rules:

		- Type hints mandatory everywhere (use `T | None`, not `Optional[T]`)
		- Use f-strings, `pathlib`, `enumerate()`, `with` statements
		- Use `is`/`is not` for `None` comparisons
		- Keep modules < 250 lines, functions < 75 lines, classes < 200 lines
		- Use `logging` instead of `print()`
		- Use `typer` (CLI), `rich` (formatting), `pydantic` (models), `pydantic-sqlite` (DB)
		- Use `pandas` + `python-calamine` (Excel read), `xlsxwriter` (Excel write)

		Linter Rules:
		For package-specific libraries and patterns, see the respective `AGENTS.md` in each package.

		- NEVER suppress linter issues with `# noqa` in `src/` or `tests/`
		- MUST NOT introduce: PLC0415, ANN001, E402, ANN201, ANN202
		- Run `ruff check src/ tests/` after changes
		### Comments

		- Explain WHY, not WHAT (code is self-documenting)
		- DO NOT use numbered steps ("Step 3: ...") — hard to maintain
		- DO NOT use decorative headings ("===== TOOLS =====")
		- DO NOT use emojis/Unicode (①, •, –, —) in comments
		- Emojis in user-facing output only when they enhance clarity (✔︎, ✘, ∆, ‼︎)

		### Testing

		See `tests/AGENTS.md` for detailed test patterns and fixtures.

		- Use `pytest` with fixtures and parameterized tests
		- Use mocking to isolate external systems
		- Test locations: cache `./tests/test-cache`, DB `./tests/test-cache/tdoc_crawler.db`
		- Aim for 70%+ coverage
		- Run: `uv run pytest -v`

		### Database
		For package-specific test locations, see the respective `AGENTS.md` in each package.

		- SQLite for TDoc and meeting metadata
		- Pydantic models for schema, `pydantic-sqlite` for operations
		- Five tables: `working_groups`, `subworking_groups`, `meetings`, `tdocs`, `crawl_log`
		- Return Pydantic models, not raw tuples
		- Handle case-insensitive TDoc IDs via `.upper()` normalization
		---

		## TDoc Data Sources
		## Documentation

		The project uses three distinct mechanisms for fetching TDoc metadata. Do NOT add new crawl mechanisms without understanding why these three exist.
		### Code Documentation

		| Source | Module | Auth | Batch | Single | Use Case |
		|--------|--------|:----:|:-----:|:------:|----------|
		| Excel DocList | `tdocs/sources/doclist.py` | No | ✓ | ✗ | Batch crawl all TDocs per meeting (`crawl-tdocs`) |
		| WhatTheSpec API | `tdocs/sources/whatthespec.py` | No | ✗ | ✓ | Single/few TDoc lookups (`query`, `open`) |
		| 3GPP Portal | `tdocs/sources/portal.py` | Yes (EOL) | ✗ | ✓ | Authenticated fallback when WhatTheSpec unavailable |
		- Clear, concise docstrings using Google style
		- Include type hints in docstrings
		- Use examples for non-obvious usage

		Excel DocList (batch): Primary for `crawl-tdocs` — downloads per-meeting Excel from 3GPP FTP. Cannot resolve single TDoc without meeting.
		### User Documentation

		WhatTheSpec API (single): Community API at `whatthespec.net` — preferred for `query` and `open` commands. No auth required.
		Files:

		3GPP Portal (fallback): Official authenticated source via EOL portal. Use only as fallback or when explicitly requested.
		1. `README.md` — Project overview, installation, Quick Start
		2. `docs/index.md` — Main documentation entry (Jekyll-ready)
		3. `docs/*.md` — Modular task-oriented guides (crawl, query, utils)
		4. `docs/history/` — Chronological changelog

		## Documentation
		---

		### Code Documentation
		## Skills Reference

		- Write clear, concise docstrings using Google style
		- Include type hints in docstrings
		- Use examples in docstrings to illustrate usage
		Load skills based on context. Skills are located in `.agents/skills/`.

		### User Documentation
		### 3GPP Domain Skills

		Files:
		| Skill | When to Use |
		|-------|-------------|
		| `3gpp-basics` | 3GPP organization, hierarchy, releases, TDocs overview |
		| `3gpp-working-groups` | WG codes, tbid/SubTB identifiers, subgroup hierarchy |
		| `3gpp-meetings` | Meeting structure, naming conventions, quarterly plenaries |
		| `3gpp-tdocs` | TDoc patterns, metadata, FTP server access |
		| `3gpp-specifications` | TS/TR numbering, spec file formats, FTP directories |
		| `3gpp-releases` | Release structure, versioning, TSG rounds |
		| `3gpp-change-request` | CR procedure, workflow, status tracking |
		| `3gpp-portal-authentication` | EOL authentication, portal data fetching |

		1. README.md - Project overview, installation, Quick Start
		2. docs/index.md - Main documentation entry point (Jekyll-ready)
		3. docs/*.md - Modular task-oriented guides (crawl, query, utils)
		4. docs/history/ - Chronological changelog
		### Programming Skills

		Critical Rules:
		| Skill | When to Use |
		|-------|-------------|
		| `python-standards` | Writing/reviewing Python code, type hints, linting |
		| `test-driven-development` | TDD with pytest, fixtures, mocking, coverage |
		| `code-deduplication` | Preventing semantic duplication, capability index |
		| `documentation-workflow` | Updating docs, structure, best practices |
		| `visual-explainer` | Creating diagrams, architecture overviews |

		- Modular user documentation in `docs/` MUST reflect current CLI behavior
		- When adding/modifying commands, update BOTH history file AND relevant documentation files
		---

		## Git and Version Control
		## Packages

		- Use `git` with `main` as main branch
		- Use `git add` sparingly — only for files likely to be committed
		- Never run `git commit` or `git push` on your own
		- `.env` files MUST NOT be committed
		This workspace contains multiple packages with their own AGENTS.md:

		## Comments
		| Location | Purpose |
		|----------|---------|
		| `src/tdoc_crawler/AGENTS.md` | Core crawler library (TDocs, meetings, specs) |
		| `src/tdoc_crawler/cli/AGENTS.md` | CLI patterns and constraints |
		| `src/tdoc-ai/AGENTS.md` | AI document processing (embeddings, graphs) |
		| `src/teddi-mcp/AGENTS.md` | TEDDI MCP server patterns |
		| `tests/AGENTS.md` | Test organization and fixtures |

		- Explain intent or subtle constraints, not what's obvious from names
		- DO document WHY something is done, not WHAT it does
		- DO NOT use numbered steps ("Step 3: ...") — hard to maintain
		- DO NOT use decorative headings ("===== TOOLS =====")
		- DO NOT use emojis or Unicode (①, •, –, —) in comments
		- Use emojis in user-facing output only when they enhance clarity (✔︎, ✘, ∆, ‼︎)
		---

		## AGENTS.md File Design
		## AGENTS.md Maintenance

		This file serves as long-term memory for coding assistants. Principles:

		@@ -212,7 +152,7 @@ This file serves as long-term memory for coding assistants. Principles:
		What NOT to Include:

		- Checklists of completed items (belongs in git history)
		- Active TODO lists (use beads issues)
		- Active TODO lists (use issue tracker)
		- Step-by-step implementation plans
		- Temporary debugging notes
		- File directory trees (changes too often)
		@@ -221,21 +161,3 @@ This file serves as long-term memory for coding assistants. Principles:

		- Update after refactoring sessions with architectural insights
		- Document patterns and anti-patterns
		- When explicitly requested, review using `docs/REVIEW_AND_IMPROVEMENTS_AGENTS_MD.md`

		## Project Management

		uv commands:

		- `uv run <script>` — Run Python scripts in isolated environment
		- `uv run pytest` — Run tests
		- `uv add <package>` — Add dependency
		- `uv add <package> --dev` — Add dev dependency
		- `uv build` — Package application

		Code Size Limits:

		- Modules: < 250 lines (CLI command registration exempt)
		- Functions: < 75 lines
		- Classes: < 200 lines
		- Refactor when limits exceeded

src/tdoc_crawler/AGENTS.md

0 → 100644

+168 −0

		# Assistant Rules for tdoc_crawler Package

		Core library for crawling and querying 3GPP TDoc data.

		---

		## Import Patterns

		Correct imports:

		```python
		# TDoc operations (use explicit submodule imports)
		from tdoc_crawler.tdocs.operations.fetch import fetch_missing_tdocs
		from tdoc_crawler.tdocs.operations.checkout import checkout_tdoc
		from tdoc_crawler.tdocs.sources.whatthespec import resolve_via_whatthespec
		from tdoc_crawler.tdocs.models import TDocMetadata, TDocQueryConfig

		# Meeting operations
		from tdoc_crawler.meeting_crawler import MeetingCrawler
		from tdoc_crawler.meetings.models import MeetingMetadata

		# Spec operations
		from tdoc_crawler.database import SpecDatabase
		from tdoc_crawler.specs.operations.checkout import checkout_spec

		# Database operations
		from tdoc_crawler.database import TDocDatabase, MeetingDatabase
		from tdoc_crawler.models import WorkingGroup, PortalCredentials
		```

		Note: Domain packages (`tdocs/`, `meetings/`) do not re-export operations to avoid circular imports. Use explicit submodule imports.

		---

		## HTTP Caching (MANDATORY)

		For core crawler source traffic (3gpp.org, whatthespec.net, portal), all HTTP requests MUST use `create_cached_session()` from `tdoc_crawler.http_client`:

		```python
		from tdoc_crawler.http_client import create_cached_session

		with create_cached_session() as session:
		    # All HTTP calls made through this session use hishel caching
		    response = session.get(url)
		```

		Benefits:
		- 50-90% faster incremental crawls
		- Prevents rate-limiting from 3GPP servers

		Exemption: AI model-provider traffic is exempt from this rule.

		---

		## Anti-Duplication (DRY)

		CRITICAL: Code duplication drove major domain-oriented refactoring. Future assistants MUST NOT introduce duplicated logic.

		Search Before Implementing:

		1. Use grep/glob to check whether a similar implementation already exists
		2. Check the relevant domain package (`tdocs/`, `meetings/`, `specs/`)
		3. If the logic exists but needs modification, REFACTOR rather than creating a second version

		Logic Placement Rules:

		| Logic Type | Location |
		|------------|----------|
		| Domain logic | `src/tdoc_crawler/<domain>/` |
		| Parsing logic | `src/tdoc_crawler/parsers/` |
		| API clients | `src/tdoc_crawler/clients/` |
		| Shared utilities | `src/tdoc_crawler/utils/` (only if truly generic) |

		Prohibited Patterns:

		- CLI Duplication: Do not copy domain logic into CLI. CLI handles I/O only
		- Test Duplication: Do not copy library code into tests. Use proper mocking
		- Helper Bloat: Do not create `utils.py` files that duplicate `src/tdoc_crawler/utils/`

		---

		## TDoc Data Sources

		The package uses three distinct mechanisms for fetching TDoc metadata. Do NOT add new crawl mechanisms without understanding why these three exist.

		| Source | Module | Auth | Batch | Single | Use Case |
		|--------|--------|:----:|:-----:|:------:|----------|
		| Excel DocList | `tdocs/sources/doclist.py` | No | ✓ | ✗ | Batch crawl all TDocs per meeting (`crawl-tdocs`) |
		| WhatTheSpec API | `tdocs/sources/whatthespec.py` | No | ✗ | ✓ | Single/few TDoc lookups (`query`, `open`) |
		| 3GPP Portal | `tdocs/sources/portal.py` | Yes (EOL) | ✗ | ✓ | Authenticated fallback when WhatTheSpec unavailable |

		Excel DocList (batch): Primary for `crawl-tdocs` — downloads per-meeting Excel from 3GPP FTP. Cannot resolve single TDoc without meeting.

		WhatTheSpec API (single): Community API at `whatthespec.net` — preferred for `query` and `open` commands. No auth required.

		3GPP Portal (fallback): Official authenticated source via EOL portal. Use only as fallback or when explicitly requested.

		---

		## Database Schema

		SQLite database with Pydantic models. Key tables managed via `pydantic-sqlite`:

		| Table | Purpose |
		|-------|---------|
		| `tdocs` | TDoc metadata |
		| `meetings` | Meeting records (working group, dates) |
		| `working_groups` | Reference data (TBID → WG mapping) |
		| `subworking_groups` | Reference data (SubTB → subgroup mapping) |
		| `crawl_log` | Crawl history and audit |
		| `specs` | Technical specification catalog |
		| `spec_versions` | Spec version tracking |
		| `spec_downloads` | Download cache |

		Key Patterns:
		- Return Pydantic models, not raw tuples
		- Handle case-insensitive TDoc IDs via `.upper()` normalization
		- Working group derived from meeting via JOIN (not stored column)

		See `src/tdoc_crawler/models/` for current schema definition.

		---

		## Submodules

		This package has submodule-specific AGENTS.md files:

		| Location | Purpose |
		|----------|---------|
		| `src/tdoc_crawler/cli/AGENTS.md` | CLI patterns and constraints |

		---

		## Circular Import Prevention

		Rule: If you encounter a circular import, refactor the code to eliminate it. Never use `TYPE_CHECKING` guards or lazy imports as a permanent solution.

		Strategy:

		1. Identify the circular dependency
		2. Extract shared types to `models/` layer
		3. Both modules import from the neutral models layer
		4. Use lazy imports (inside functions) only for temporary fixes during refactoring
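
The strategy above might look like the following once the shared type has been extracted. This is a single-file sketch so that it runs as-is; the comments mark where each piece would live, and the module paths, `MeetingRef`, and both functions are hypothetical names rather than existing code in the package.

```python
from dataclasses import dataclass


# Would live in the neutral layer, e.g. a module under models/ (hypothetical).
# It imports nothing from tdocs/ or meetings/, so it cannot join a cycle.
@dataclass(frozen=True)
class MeetingRef:
    meeting_id: int
    working_group: str


# Would live under tdocs/ and import MeetingRef from the models layer,
# instead of importing anything from meetings/.
def describe_tdoc(tdoc_id: str, meeting: MeetingRef) -> str:
    return f"{tdoc_id.upper()} ({meeting.working_group} meeting {meeting.meeting_id})"


# Would live under meetings/ and also import MeetingRef from the models layer,
# instead of importing anything from tdocs/.
def describe_meeting(meeting: MeetingRef) -> str:
    return f"{meeting.working_group}#{meeting.meeting_id}"
```

Both domain modules now depend only on the shared type, so the import graph points in one direction and no `TYPE_CHECKING` guard is needed.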

		---

		## Libraries

		Primary dependencies for this package:

		| Library | Purpose |
		|---------|---------|
		| `typer` | CLI framework |
		| `rich` | Console formatting |
		| `pydantic` | Data models |
		| `pydantic-sqlite` | Database layer |
		| `hishel` | HTTP caching |
		| `httpx` | HTTP client |

		---

		## Testing

		See `tests/AGENTS.md` for detailed test patterns and fixtures.

		- Use `pytest` with fixtures and parameterized tests
		- Use mocking to isolate external systems
		- Test locations: `./tests/test-cache/`, `./tests/test-cache/tdoc_crawler.db`
		- Aim for 70%+ coverage
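
As a rough illustration of isolating an external system with mocking, the sketch below replaces the HTTP session with a `unittest.mock.Mock`. The `fetch_title` helper, the `HttpSession` protocol, and the URL are hypothetical and exist only for this example; the project's real fixtures are described in `tests/AGENTS.md`.

```python
from typing import Any, Protocol
from unittest.mock import Mock


class HttpSession(Protocol):
    """Minimal shape the helper needs from a session object (hypothetical)."""

    def get(self, url: str) -> Any: ...


def fetch_title(session: HttpSession, url: str) -> str:
    """Hypothetical helper under test: read a title from a JSON response."""
    response = session.get(url)
    response.raise_for_status()
    return response.json()["title"]


def test_fetch_title_uses_session_without_network() -> None:
    # The mock stands in for the cached session, so no request leaves the machine.
    session = Mock()
    session.get.return_value.json.return_value = {"title": "Example TDoc"}

    assert fetch_title(session, "https://example.invalid/tdoc") == "Example TDoc"
    session.get.assert_called_once_with("https://example.invalid/tdoc")
```

The test function follows the naming pytest collects by default, so it could be dropped into a test module unchanged.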
		No newline at end of file