Commit 5137f0e0 authored by Jan Reimes

docs(AGENTS): create AGENTS.md for tdoc_crawler package guidelines

- Establish import patterns and rules for domain logic
- Outline HTTP caching requirements for performance
- Define anti-duplication practices to prevent code redundancy
- Document TDoc data sources and database schema
- Include circular import prevention strategies
- List primary libraries and testing guidelines
parent 189a6593
+85 −163
@@ -2,202 +2,142 @@

Command line tool for querying structured 3GPP TDoc data.

## Quick Start
---

Before implementing features, review:
## Development Commands

1. **Project Structure** - Domain-oriented architecture (`tdocs/`, `meetings/`, `specs/`)
2. **CLI Commands** - Command signatures in `src/tdoc_crawler/cli/app.py`
3. **Database Schema** - Models in `src/tdoc_crawler/models/` and database operations
> All Python commands use `uv run` to activate the virtual environment.

## Core Architecture Rules

### Import Patterns

**Correct imports:**

```python
# TDoc operations
from tdoc_crawler.tdocs import TDocCrawler
from tdoc_crawler.tdocs.operations.fetch import fetch_missing_tdocs
from tdoc_crawler.tdocs.sources.whatthespec import resolve_via_whatthespec

# Meeting operations
from tdoc_crawler.meetings import MeetingCrawler, normalize_working_group_alias

# Spec operations
from tdoc_crawler.specs import SpecDatabase, SpecDownloads
from tdoc_crawler.specs.operations.checkout import checkout_spec
```bash
uv run pytest -v              # Run tests
ruff check src/ tests/         # Lint after changes
uv add <package>              # Add dependency
uv build                       # Package application
```

### Circular Import Prevention

**Rule:** If you encounter a circular import, refactor the code to eliminate it. Never use `TYPE_CHECKING` guards or lazy imports as a permanent solution.

**Strategy:**

1. Identify the circular dependency
2. Extract shared types to `models/` layer
3. Both modules import from the neutral models layer
4. Use lazy imports (inside functions) only for temporary fixes during refactoring

### Anti-Duplication (DRY)

**CRITICAL:** Code duplication drove major domain-oriented refactoring. Future assistants MUST NOT introduce duplicated logic.

**Search Before Implement:**

1. Use grep tools to check if similar implementation exists
2. Check relevant domain package (`tdocs/`, `meetings/`, `specs/`)
3. If logic exists but needs modification, REFACTOR rather than creating second version
For package-specific commands, see the respective `AGENTS.md` in each package.

**Logic Placement Rules:**
---

- **Domain Logic:** Must live in `src/tdoc_crawler/<domain>/`. NEVER in `cli/`, `parsers/`, or `utils/`
- **Parsing Logic:** Must live in `src/tdoc_crawler/parsers/`
- **API Clients:** Must live in `src/tdoc_crawler/clients/`
- **Shared Utilities:** Must live in `src/tdoc_crawler/utils/` only if truly generic
## Project Constraints

**Prohibited Patterns:**

- **CLI Duplication:** Do not copy domain logic into CLI. CLI handles I/O only
- **Test Duplication:** Do not copy library code into tests. Use proper mocking
- **Helper Bloat:** Do not create `utils.py` files that duplicate `src/tdoc_crawler/utils/`

## Skills Usage

This project uses specialized skills for domain-specific guidance. Load skills based on context:

### 3GPP Domain Skills

Located in `.agents/skills/3gpp/`:

| Skill | When to Use |
|-------|-------------|
| `3gpp-basics` | 3GPP organization, hierarchy, releases, TDocs overview |
| `3gpp-working-groups` | WG codes, tbid/SubTB identifiers, subgroup hierarchy |
| `3gpp-meetings` | Meeting structure, naming conventions, quarterly plenaries |
| `3gpp-tdocs` | TDoc patterns, metadata, FTP server access |
| `3gpp-specifications` | TS/TR numbering, spec file formats, FTP directories |
| `3gpp-releases` | Release structure, versioning, TSG rounds |
| `3gpp-change-request` | CR procedure, workflow, status tracking |
| `3gpp-portal-authentication` | EOL authentication, portal data fetching |

### Programming Skills

Located in `.agents/skills/`:
### Virtual Environment (MANDATORY)

| Skill | When to Use |
|-------|-------------|
| `python-standards` | Writing/reviewing Python code, type hints, linting |
| `test-driven-development` | TDD with pytest, fixtures, mocking, coverage |
| `code-deduplication` | Preventing semantic duplication, capability index |
| `documentation-workflow` | Updating docs, structure, best practices |
| `visual-explainer` | Creating diagrams, architecture overviews |
Use `uv run <command>` for all Python commands. The virtual environment must be activated before running pytest, CLI, or any project scripts.

## Mandatory Constraints
### Linter Rules

### Virtual Environment (MANDATORY)
- **NEVER** suppress linter issues with `# noqa` in `src/` or `tests/`
- **MUST NOT introduce:** `PLC0415`, `ANN001`, `E402`, `ANN201`, `ANN202`
- Run `ruff check src/ tests/` after changes

Whenever executing shell commands (via `uv`, `pytest`, CLI), you MUST ensure the Python virtual environment is activated. Use `uv run <command>` for all Python commands.
### Git and Version Control

### HTTP Caching (MANDATORY)
- Use `git` with `main` as main branch
- Use `git add` sparingly — only for files likely to be committed
- **Never** run `git commit` or `git push` autonomously
- `.env` files **MUST NOT** be committed

For core crawler source traffic (3gpp.org, whatthespec.net, portal), all HTTP requests **MUST** use `create_cached_session()` from `tdoc_crawler.http_client` to enable hishel caching.
---

- Reduces network overhead for incremental crawls (50-90% faster)
- Prevents rate-limiting/blocking from 3GPP servers
- AI model-provider traffic is exempt (follows approved provider integration)
## Code Style

### Python Standards

**Use skill `python-standards` for all Python coding tasks.** Key rules:
Use skill `python-standards` for all Python coding tasks.

**Project-Specific Rules:**

- Type hints mandatory everywhere (use `T | None`, not `Optional[T]`)
- Use f-strings, `pathlib`, `enumerate()`, `with` statements
- Use `is`/`is not` for `None` comparisons
- Keep modules < 250 lines, functions < 75 lines, classes < 200 lines
- Use `logging` instead of `print()`
- Use `typer` (CLI), `rich` (formatting), `pydantic` (models), `pydantic-sqlite` (DB)
- Use `pandas` + `python-calamine` (Excel read), `xlsxwriter` (Excel write)

**Linter Rules:**
For package-specific libraries and patterns, see the respective `AGENTS.md` in each package.

- NEVER suppress linter issues with `# noqa` in `src/` or `tests/`
- MUST NOT introduce: PLC0415, ANN001, E402, ANN201, ANN202
- Run `ruff check src/ tests/` after changes
### Comments

- Explain **WHY**, not WHAT (code is self-documenting)
- DO NOT use numbered steps ("Step 3: ...") — hard to maintain
- DO NOT use decorative headings ("===== TOOLS =====")
- DO NOT use emojis/Unicode (①, •, –, —) in comments
- Emojis in user-facing output only when they enhance clarity (✔︎, ✘, ∆, ‼︎)

### Testing

See `tests/AGENTS.md` for detailed test patterns and fixtures.

- Use `pytest` with fixtures and parameterized tests
- Use mocking to isolate external systems
- Test locations: cache `./tests/test-cache`, DB `./tests/test-cache/tdoc_crawler.db`
- Aim for 70%+ coverage
- Run: `uv run pytest -v`

### Database
For package-specific test locations, see the respective `AGENTS.md` in each package.

- SQLite for TDoc and meeting metadata
- Pydantic models for schema, `pydantic-sqlite` for operations
- Five tables: `working_groups`, `subworking_groups`, `meetings`, `tdocs`, `crawl_log`
- Return Pydantic models, not raw tuples
- Handle case-insensitive TDoc IDs via `.upper()` normalization
---

## TDoc Data Sources
## Documentation

The project uses **three distinct mechanisms** for fetching TDoc metadata. Do NOT add new crawl mechanisms without understanding why these three exist.
### Code Documentation

| Source | Module | Auth | Batch | Single | Use Case |
|--------|--------|:----:|:-----:|:------:|----------|
| Excel DocList | `tdocs/sources/doclist.py` | No | ✓ | ✗ | Batch crawl all TDocs per meeting (`crawl-tdocs`) |
| WhatTheSpec API | `tdocs/sources/whatthespec.py` | No | ✗ | ✓ | Single/few TDoc lookups (`query`, `open`) |
| 3GPP Portal | `tdocs/sources/portal.py` | Yes (EOL) | ✗ | ✓ | Authenticated fallback when WhatTheSpec unavailable |
- Clear, concise docstrings using Google style
- Include type hints in docstrings
- Use examples for non-obvious usage

**Excel DocList (batch):** Primary for `crawl-tdocs` — downloads per-meeting Excel from 3GPP FTP. Cannot resolve single TDoc without meeting.
### User Documentation

**WhatTheSpec API (single):** Community API at `whatthespec.net` — preferred for `query` and `open` commands. No auth required.
**Files:**

**3GPP Portal (fallback):** Official authenticated source via EOL portal. Use only as fallback or when explicitly requested.
1. `README.md` — Project overview, installation, Quick Start
2. `docs/index.md` — Main documentation entry (Jekyll-ready)
3. `docs/*.md` — Modular task-oriented guides (crawl, query, utils)
4. `docs/history/` — Chronological changelog

## Documentation
---

### Code Documentation
## Skills Reference

- Write clear, concise docstrings using Google style
- Include type hints in docstrings
- Use examples in docstrings to illustrate usage
Load skills based on context. Skills are located in `.agents/skills/`.

### User Documentation
### 3GPP Domain Skills

**Files:**
| Skill | When to Use |
|-------|-------------|
| `3gpp-basics` | 3GPP organization, hierarchy, releases, TDocs overview |
| `3gpp-working-groups` | WG codes, tbid/SubTB identifiers, subgroup hierarchy |
| `3gpp-meetings` | Meeting structure, naming conventions, quarterly plenaries |
| `3gpp-tdocs` | TDoc patterns, metadata, FTP server access |
| `3gpp-specifications` | TS/TR numbering, spec file formats, FTP directories |
| `3gpp-releases` | Release structure, versioning, TSG rounds |
| `3gpp-change-request` | CR procedure, workflow, status tracking |
| `3gpp-portal-authentication` | EOL authentication, portal data fetching |

1. **README.md** - Project overview, installation, Quick Start
2. **docs/index.md** - Main documentation entry point (Jekyll-ready)
3. **docs/*.md** - Modular task-oriented guides (crawl, query, utils)
4. **docs/history/** - Chronological changelog
### Programming Skills

**Critical Rules:**
| Skill | When to Use |
|-------|-------------|
| `python-standards` | Writing/reviewing Python code, type hints, linting |
| `test-driven-development` | TDD with pytest, fixtures, mocking, coverage |
| `code-deduplication` | Preventing semantic duplication, capability index |
| `documentation-workflow` | Updating docs, structure, best practices |
| `visual-explainer` | Creating diagrams, architecture overviews |

- Modular user documentation in `docs/` **MUST** reflect current CLI behavior
- When adding/modifying commands, update **BOTH** history file AND relevant documentation files
---

## Git and Version Control
## Packages

- Use `git` with `main` as main branch
- Use `git add` sparingly — only for files likely to be committed
- **Never** run `git commit` or `git push` on your own
- `.env` files **MUST NOT** be committed
This workspace contains multiple packages with their own AGENTS.md:

## Comments
| Location | Purpose |
|----------|---------|
| `src/tdoc_crawler/AGENTS.md` | Core crawler library (TDocs, meetings, specs) |
| `src/tdoc_crawler/cli/AGENTS.md` | CLI patterns and constraints |
| `src/tdoc-ai/AGENTS.md` | AI document processing (embeddings, graphs) |
| `src/teddi-mcp/AGENTS.md` | TEDDI MCP server patterns |
| `tests/AGENTS.md` | Test organization and fixtures |

- Explain intent or subtle constraints, not what's obvious from names
- DO document WHY something is done, not WHAT it does
- DO NOT use numbered steps ("Step 3: ...") — hard to maintain
- DO NOT use decorative headings ("===== TOOLS =====")
- DO NOT use emojis or Unicode (①, •, –, —) in comments
- Use emojis in user-facing output only when they enhance clarity (✔︎, ✘, ∆, ‼︎)
---

## AGENTS.md File Design
## AGENTS.md Maintenance

This file serves as long-term memory for coding assistants. Principles:

@@ -212,7 +152,7 @@ This file serves as long-term memory for coding assistants. Principles:
**What NOT to Include:**

- Checklists of completed items (belongs in git history)
- Active TODO lists (use beads issues)
- Active TODO lists (use issue tracker)
- Step-by-step implementation plans
- Temporary debugging notes
- File directory trees (changes too often)
@@ -221,21 +161,3 @@ This file serves as long-term memory for coding assistants. Principles:

- Update after refactoring sessions with architectural insights
- Document patterns and anti-patterns
- When explicitly requested, review using `docs/REVIEW_AND_IMPROVEMENTS_AGENTS_MD.md`

## Project Management

**uv commands:**

- `uv run <script>` — Run Python scripts in isolated environment
- `uv run pytest` — Run tests
- `uv add <package>` — Add dependency
- `uv add <package> --dev` — Add dev dependency
- `uv build` — Package application

**Code Size Limits:**

- Modules: < 250 lines (CLI command registration exempt)
- Functions: < 75 lines
- Classes: < 200 lines
- Refactor when limits exceeded
+168 −0
# Assistant Rules for tdoc_crawler Package

Core library for crawling and querying 3GPP TDoc data.

---

## Import Patterns

**Correct imports:**

```python
# TDoc operations (use explicit submodule imports)
from tdoc_crawler.tdocs.operations.fetch import fetch_missing_tdocs
from tdoc_crawler.tdocs.operations.checkout import checkout_tdoc
from tdoc_crawler.tdocs.sources.whatthespec import resolve_via_whatthespec
from tdoc_crawler.tdocs.models import TDocMetadata, TDocQueryConfig

# Meeting operations
from tdoc_crawler.meeting_crawler import MeetingCrawler
from tdoc_crawler.meetings.models import MeetingMetadata

# Spec operations
from tdoc_crawler.database import SpecDatabase
from tdoc_crawler.specs.operations.checkout import checkout_spec

# Database operations
from tdoc_crawler.database import TDocDatabase, MeetingDatabase
from tdoc_crawler.models import WorkingGroup, PortalCredentials
```

**Note:** Domain packages (`tdocs/`, `meetings/`) do not re-export operations to avoid circular imports. Use explicit submodule imports.

---

## HTTP Caching (MANDATORY)

For core crawler source traffic (3gpp.org, whatthespec.net, portal), all HTTP requests **MUST** use `create_cached_session()` from `tdoc_crawler.http_client`:

```python
from tdoc_crawler.http_client import create_cached_session

with create_cached_session() as session:
    # All HTTP calls use hishel caching
    response = session.get(url)
```

Benefits:

- Incremental crawls run 50-90% faster
- Avoids rate-limiting by 3GPP servers

AI model-provider traffic is exempt from this requirement.

---

## Anti-Duplication (DRY)

**CRITICAL:** Code duplication drove major domain-oriented refactoring. Future assistants MUST NOT introduce duplicated logic.

**Search Before Implementing:**

1. Use grep/glob to check whether a similar implementation already exists
2. Check the relevant domain package (`tdocs/`, `meetings/`, `specs/`)
3. If the logic exists but needs modification, REFACTOR it rather than creating a second version

**Logic Placement Rules:**

| Logic Type | Location |
|------------|----------|
| Domain logic | `src/tdoc_crawler/<domain>/` |
| Parsing logic | `src/tdoc_crawler/parsers/` |
| API clients | `src/tdoc_crawler/clients/` |
| Shared utilities | `src/tdoc_crawler/utils/` (only if truly generic) |

**Prohibited Patterns:**

- **CLI Duplication:** Do not copy domain logic into CLI. CLI handles I/O only
- **Test Duplication:** Do not copy library code into tests. Use proper mocking
- **Helper Bloat:** Do not create `utils.py` files that duplicate `src/tdoc_crawler/utils/`

---

## TDoc Data Sources

The package uses **three distinct mechanisms** for fetching TDoc metadata. Do NOT add new crawl mechanisms without understanding why these three exist.

| Source | Module | Auth | Batch | Single | Use Case |
|--------|--------|:----:|:-----:|:------:|----------|
| Excel DocList | `tdocs/sources/doclist.py` | No | ✓ | ✗ | Batch crawl all TDocs per meeting (`crawl-tdocs`) |
| WhatTheSpec API | `tdocs/sources/whatthespec.py` | No | ✗ | ✓ | Single/few TDoc lookups (`query`, `open`) |
| 3GPP Portal | `tdocs/sources/portal.py` | Yes (EOL) | ✗ | ✓ | Authenticated fallback when WhatTheSpec unavailable |

**Excel DocList (batch):** Primary source for `crawl-tdocs`; downloads the per-meeting Excel file from 3GPP FTP. Cannot resolve a single TDoc without knowing its meeting.

**WhatTheSpec API (single):** Community API at `whatthespec.net` — preferred for `query` and `open` commands. No auth required.

**3GPP Portal (fallback):** Official authenticated source via EOL portal. Use only as fallback or when explicitly requested.

---

## Database Schema

SQLite database with Pydantic models. Key tables managed via `pydantic-sqlite`:

| Table | Purpose |
|-------|---------|
| `tdocs` | TDoc metadata |
| `meetings` | Meeting records (working group, dates) |
| `working_groups` | Reference data (TBID → WG mapping) |
| `subworking_groups` | Reference data (SubTB → subgroup mapping) |
| `crawl_log` | Crawl history and audit |
| `specs` | Technical specification catalog |
| `spec_versions` | Spec version tracking |
| `spec_downloads` | Download cache |

**Key Patterns:**
- Return Pydantic models, not raw tuples
- Handle case-insensitive TDoc IDs via `.upper()` normalization
- Working group derived from meeting via JOIN (not stored column)

See `src/tdoc_crawler/models/` for current schema definition.

---

## Submodules

This package has submodule-specific AGENTS.md files:

| Location | Purpose |
|----------|---------|
| `src/tdoc_crawler/cli/AGENTS.md` | CLI patterns and constraints |

---

## Circular Import Prevention

**Rule:** If you encounter a circular import, refactor the code to eliminate it. Never use `TYPE_CHECKING` guards or lazy imports as a permanent solution.

**Strategy:**

1. Identify the circular dependency
2. Extract shared types to `models/` layer
3. Both modules import from the neutral models layer
4. Use lazy imports (inside functions) only for temporary fixes during refactoring
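
The strategy above can be sketched in a single file, with comments marking where each part would live as a separate module. The module paths, `MeetingRef`, `find_meeting`, and `describe_tdoc` are hypothetical names chosen for the illustration.

```python
# --- models/shared.py: neutral layer, imports nothing from the domain packages ---
from dataclasses import dataclass


@dataclass(frozen=True)
class MeetingRef:
    """Shared type that both domains need; lives in the models layer."""

    meeting_id: int
    working_group: str


# --- meetings/lookup.py: would import MeetingRef from the models layer ---
def find_meeting(meeting_id: int) -> MeetingRef:
    """Return a placeholder meeting reference (no real lookup in this sketch)."""
    return MeetingRef(meeting_id=meeting_id, working_group="RAN1")


# --- tdocs/summary.py: would also import MeetingRef from the models layer ---
def describe_tdoc(tdoc_id: str, meeting: MeetingRef) -> str:
    """Format a TDoc using only the shared type, not the meetings module."""
    return f"{tdoc_id} ({meeting.working_group}, meeting {meeting.meeting_id})"
```

Because the TDoc code depends on `MeetingRef` rather than on the meetings module, neither domain module has to import the other, and no `TYPE_CHECKING` guard or function-level import is needed.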

---

## Libraries

Primary dependencies for this package:

| Library | Purpose |
|---------|---------|
| `typer` | CLI framework |
| `rich` | Console formatting |
| `pydantic` | Data models |
| `pydantic-sqlite` | Database layer |
| `hishel` | HTTP caching |
| `httpx` | HTTP client |

---

## Testing

See `tests/AGENTS.md` for detailed test patterns and fixtures.

- Use `pytest` with fixtures and parameterized tests
- Use mocking to isolate external systems
- Test locations: cache `./tests/test-cache/`, DB `./tests/test-cache/tdoc_crawler.db`
- Aim for 70%+ coverage
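
A minimal sketch of isolating an external system with a mock. `fetch_title` and the URL are hypothetical; the test is a plain function that pytest would collect by name, and it uses only `unittest.mock` from the standard library.

```python
from typing import Any
from unittest.mock import Mock


def fetch_title(session: Any, url: str) -> str:
    """Hypothetical function under test: read a title from a JSON response."""
    response = session.get(url)
    response.raise_for_status()
    return response.json()["title"]


def test_fetch_title_uses_session_without_network() -> None:
    # The mock replaces the HTTP session, so no request leaves the machine.
    session = Mock()
    session.get.return_value.json.return_value = {"title": "Example"}

    assert fetch_title(session, "https://example.invalid/tdoc") == "Example"
    session.get.assert_called_once_with("https://example.invalid/tdoc")
```

Injecting the session as an argument is what makes the substitution possible without patching module globals.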