Commit a968acdc authored by Jan Reimes

docs(agents): update guidelines for domain-oriented architecture and imports

- Revise project structure to reflect the removal of the legacy `crawlers/` package.
- Emphasize the importance of domain-specific imports and avoiding duplication.
- Provide clear rules for organizing code and tests to maintain a clean architecture.
parent 91b7273e
+100 −1
@@ -4,10 +4,109 @@

Before implementing features, review these critical sections:

1. **Project Structure** - Understand the models/ and crawlers/ submodule organization
1. **Project Structure** - Understand the domain-oriented architecture (tdocs/, meetings/, specs/)
1. **CLI Commands** - Review the command signatures in `src/tdoc_crawler/cli/app.py`
1. **Database Schema** - Review models in `src/tdoc_crawler/models/` and database operations

## Domain-Oriented Architecture

**IMPORTANT: The project uses a clean domain-driven structure. The legacy `crawlers/` folder has been completely removed.**

### Domain Package Structure

```
src/tdoc_crawler/
├── tdocs/              # TDoc domain (operations, sources, models)
│   ├── operations/     # TDoc operations (crawl, fetch, checkout)
│   ├── sources/        # TDoc data sources (portal, doclist, whatthespec)
│   └── models.py       # TDoc-specific models
├── meetings/           # Meeting domain (operations, crawl logic)
├── specs/              # Specification domain (operations, sources, database)
│   ├── operations/     # Spec operations (crawl, checkout, normalize)
│   └── sources/        # Spec data sources (3gpp, whatthespec)
├── clients/            # External API clients (Portal)
├── parsers/            # HTML/data parsers (portal, meetings, directory)
├── workers/            # Parallel processing workers
├── database/           # Database layer (base, connection)
├── models/             # Shared data models
├── constants/          # Patterns, URLs, registries
├── utils/              # Shared utilities
└── cli/                # Command-line interface (optional)
```

### Import Patterns

**Correct imports:**

```python
# TDoc operations
from tdoc_crawler.tdocs import TDocCrawler, HybridTDocCrawler
from tdoc_crawler.tdocs.operations.fetch import fetch_missing_tdocs
from tdoc_crawler.tdocs.operations.checkout import checkout_tdoc
from tdoc_crawler.tdocs.sources.whatthespec import resolve_via_whatthespec
from tdoc_crawler.tdocs.sources.doclist import fetch_meeting_document_list

# Meeting operations
from tdoc_crawler.meetings import MeetingCrawler, normalize_working_group_alias

# Spec operations
from tdoc_crawler.specs import SpecDatabase, SpecDownloads
from tdoc_crawler.specs.operations.checkout import checkout_spec
```

**NEVER use:** `from tdoc_crawler.crawlers import ...` (this package no longer exists)
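One way to enforce this rule mechanically is a small static check. The sketch below is only an illustration, not part of the project: `find_legacy_imports` and `LEGACY_PACKAGE` are hypothetical names, and the check only inspects absolute imports (relative imports are not resolved).

```python
import ast

# Name of the removed package, taken from the rule above.
LEGACY_PACKAGE = "tdoc_crawler.crawlers"


def find_legacy_imports(source: str) -> list[int]:
    """Return sorted line numbers of imports that reference the removed package.

    Hypothetical helper: parses the source without executing it.
    """
    hits: list[int] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            module = node.module or ""
            # Cover both `from tdoc_crawler.crawlers import X`
            # and `from tdoc_crawler import crawlers`.
            names = [module] + [f"{module}.{alias.name}" for alias in node.names]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        else:
            continue
        if any(n == LEGACY_PACKAGE or n.startswith(LEGACY_PACKAGE + ".") for n in names):
            hits.append(node.lineno)
    return sorted(hits)


sample = (
    "from tdoc_crawler.crawlers import TDocCrawler\n"
    "from tdoc_crawler.tdocs import TDocCrawler\n"
    "import tdoc_crawler.crawlers.whatthespec\n"
)
print(find_legacy_imports(sample))  # [1, 3]
```

A check like this could run over every file under `src/` and `tests/` in a test or pre-commit hook; how it is wired in is left open here.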

### Circular Import Prevention

**Rule:** If you encounter a circular import, refactor the code to eliminate it. Never use `TYPE_CHECKING` guards or lazy imports as a permanent solution.

**Strategy:**

1. Identify the circular dependency.
2. Extract the shared types into the `models/` layer.
3. Have both modules import from that neutral `models/` layer.
4. Use lazy imports (inside functions) only as a temporary fix while the refactoring is in progress.

**Example:**

```python
# WRONG: Using TYPE_CHECKING as permanent solution
from typing import TYPE_CHECKING
if TYPE_CHECKING:
    from tdoc_crawler.database import TDocDatabase

# RIGHT: Use lazy import inside function (temporary only)
from pathlib import Path

def _resolve_meeting_id(db_file: Path, name: str) -> int:
    # Lazy import: remove it once the cycle has been refactored away
    from tdoc_crawler.database import TDocDatabase
    with TDocDatabase(db_file) as db:
        return db.resolve_meeting_id(name)
```
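The permanent fix described in the strategy (shared types in a neutral layer) can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: `demo_pkg`, `MeetingRef`, `load_meeting`, and `describe` are all hypothetical and do not exist in the project. The script writes a throwaway package to a temporary directory and imports it, to show that no guard or lazy import is needed once both modules depend only on the models module.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical three-module layout: database.py and operations.py both
# depend on models.py, and models.py depends on neither of them.
FILES = {
    "demo_pkg/__init__.py": "",
    "demo_pkg/models.py": (
        "from dataclasses import dataclass\n"
        "\n"
        "@dataclass(frozen=True)\n"
        "class MeetingRef:\n"
        "    meeting_id: int\n"
        "    name: str\n"
    ),
    "demo_pkg/database.py": (
        "from demo_pkg.models import MeetingRef\n"
        "\n"
        "def load_meeting(meeting_id: int) -> MeetingRef:\n"
        "    return MeetingRef(meeting_id, f'meeting-{meeting_id}')\n"
    ),
    "demo_pkg/operations.py": (
        "from demo_pkg.database import load_meeting\n"
        "from demo_pkg.models import MeetingRef\n"
        "\n"
        "def describe(meeting_id: int) -> str:\n"
        "    ref: MeetingRef = load_meeting(meeting_id)\n"
        "    return f'{ref.meeting_id}:{ref.name}'\n"
    ),
}


def build_and_run() -> str:
    """Write the demo package to a temp dir, import it, and call describe()."""
    with tempfile.TemporaryDirectory() as tmp:
        for rel, body in FILES.items():
            path = Path(tmp) / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(body)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            operations = importlib.import_module("demo_pkg.operations")
            return operations.describe(7)
        finally:
            sys.path.remove(tmp)


print(build_and_run())  # 7:meeting-7
```

The import graph here is a straight line (`operations` → `database` → `models`), so every import can stay at module top level.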

## Anti-Duplication and DRY (Don't Repeat Yourself)

**CRITICAL: Code duplication was the primary driver for the major domain-oriented refactoring. Future assistants MUST NOT introduce duplicated logic.**

### Search Before Implement

Before implementing any new functionality:

1. Use `grepai search` to check if a similar implementation exists.
2. Check the relevant domain package (`tdocs/`, `meetings/`, `specs/`).
3. If logic exists but needs modification, REFACTOR the existing code rather than creating a second version.

### Logic Placement Rules

- **Domain Logic:** Must live in `src/tdoc_crawler/<domain>/`. NEVER re-implement this in `cli/`, `parsers/`, or `utils/`.
- **Parsing Logic:** Must live in `src/tdoc_crawler/parsers/`.
- **API Clients:** Must live in `src/tdoc_crawler/clients/`.
- **Shared Utilities:** Must live in `src/tdoc_crawler/utils/` only if they are truly generic.

### Prohibited Patterns

- **CLI Duplication:** Do not copy domain logic into `src/tdoc_crawler/cli/`. The CLI should only handle input/output and call core functions.
- **Test Duplication:** Do not copy library code into tests to mock behavior. Use proper mocking or test the actual imported code.
- **Helper Bloat:** Do not create `utils.py` files in subdirectories that duplicate functions already present in `src/tdoc_crawler/utils/`.
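
The CLI rule above can be sketched as a thin wrapper. This is an illustrative example only: `normalize_tdoc_id` is a hypothetical stand-in for a function that would live in a domain package, the sample ID is made up, and `argparse` is used just to keep the sketch dependency-free (the project's actual CLI framework may differ).

```python
from __future__ import annotations

import argparse


# --- core library (hypothetical stand-in for code in a domain package) ---
def normalize_tdoc_id(raw: str) -> str:
    """The domain rule is defined once, here: trim whitespace and upper-case."""
    return raw.strip().upper()


# --- cli layer: parse input, call the core function, format output ---
def main(argv: list[str] | None = None) -> str:
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("tdoc_id")
    args = parser.parse_args(argv)
    # No normalization logic is repeated here; the CLI only delegates.
    result = normalize_tdoc_id(args.tdoc_id)
    return f"Normalized: {result}"


print(main([" s4-251234 "]))  # Normalized: S4-251234
```

If the normalization rule changes, only the core function changes; the CLI and any other caller pick up the new behavior without edits.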

## grepai - Semantic Code Search

**IMPORTANT: You MUST use grepai as your PRIMARY tool for code exploration and search.**
+25 −0
@@ -6,6 +6,8 @@ This document provides guidelines for development in the `tdoc_crawler.cli` subm

The `cli/` submodule should contain **only CLI-related functionality**. The core `tdoc_crawler` package should be usable as a standalone library without depending on the CLI. Think of `cli/` as an optional extras package (installable as `tdoc_crawler[cli]`).

**STRICT RULE: NEVER duplicate logic from the core library in the CLI.** If you need functionality that is partially implemented in the CLI but belongs in the core, move it to the core and have the CLI import it.

## Classification Rules

When deciding whether code belongs in `cli/` or the core library, ask:
@@ -55,6 +57,29 @@ database/__init__.py → database/connection.py → specs/query.py → specs/__i

**Key Insight:** Circular imports always indicate a structural problem. Never use TYPE_CHECKING or local imports to work around them - refactor the module organization instead.

### Domain-Oriented Refactoring (Steps 1-14)

The project underwent a complete restructuring to eliminate the legacy `crawlers/` package:

**Before:** Mixed orchestration and domain logic in `crawlers/`

```python
from tdoc_crawler.crawlers import TDocCrawler, MeetingCrawler
from tdoc_crawler.crawlers.whatthespec import resolve_via_whatthespec
from tdoc_crawler.crawlers.meeting_doclist import fetch_meeting_document_list
```

**After:** Clean domain packages with clear responsibilities

```python
from tdoc_crawler.tdocs import TDocCrawler
from tdoc_crawler.tdocs.sources.whatthespec import resolve_via_whatthespec
from tdoc_crawler.tdocs.sources.doclist import fetch_meeting_document_list
from tdoc_crawler.meetings import MeetingCrawler
```

**Result:** The `crawlers/` folder no longer exists. All functionality lives in domain packages.

### Import Pattern

The correct import direction is:
+38 −0
@@ -72,6 +72,44 @@ Use package-level imports:
from pool_executors.pool_executors import SerialPoolExecutor, create_executor
```

### Domain-Specific Imports

**Since the refactoring (Steps 1-14), use domain-specific imports:**

```python
# TDoc functionality
from tdoc_crawler.tdocs import TDocCrawler, HybridTDocCrawler
from tdoc_crawler.tdocs.sources.whatthespec import resolve_via_whatthespec
from tdoc_crawler.tdocs.sources.doclist import fetch_meeting_document_list

# Meeting functionality
from tdoc_crawler.meetings import MeetingCrawler

# Spec functionality
from tdoc_crawler.specs import SpecDatabase

# Constants (patterns, URLs)
from tdoc_crawler.constants.patterns import EXCLUDED_DIRS, TDOC_PATTERN
from tdoc_crawler.constants.urls import TDOC_DOWNLOAD_URL

# Portal client
from tdoc_crawler.clients.portal import PortalClient, create_portal_client

# Parsers
from tdoc_crawler.parsers.portal import parse_tdoc_portal_page
```

**NEVER use:** `from tdoc_crawler.crawlers import ...` (this package was removed)

## Anti-Duplication in Tests

### Use Fixtures

Avoid duplicating setup code or data loading in tests. Use `conftest.py` and pytest fixtures.

### No Library Logic in Tests

Do not re-implement or copy logic from `src/` into `tests/` for the sake of mocking or testing. Tests should verify the actual implementation by importing it. If the code is hard to test, refactor the code to be more testable rather than duplicating logic.
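
A minimal sketch of "refactor for testability instead of copying logic", assuming hypothetical names throughout (`fetch_title` and `build_label` are not project functions). The hard-to-test collaborator becomes an injectable parameter, and the test exercises the real `build_label` with a `unittest.mock.Mock` in place of the network call.

```python
from typing import Callable
from unittest import mock


# --- stand-in for library code under src/ (hypothetical names) ---
def fetch_title(url: str) -> str:
    """Pretend network call; deliberately unusable in a test run."""
    raise RuntimeError("network access is not available in tests")


def build_label(url: str, fetch: Callable[[str], str] = fetch_title) -> str:
    """Real logic under test: formats whatever the fetcher returns."""
    return f"[{fetch(url).strip()}]"


# --- test: replace only the collaborator, never re-implement build_label ---
def test_build_label() -> None:
    fake_fetch = mock.Mock(return_value="  Agenda  ")
    assert build_label("https://example.invalid/doc", fetch=fake_fetch) == "[Agenda]"
    fake_fetch.assert_called_once_with("https://example.invalid/doc")


test_build_label()
```

Because the test imports and calls the actual function, a change to the formatting rule in `build_label` is caught by the test rather than silently diverging from a copied version.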

## Best Practices