Commit 5b8ef658 authored by Jan Reimes

feat(cache): add CacheManager and centralize cache & database paths

parent b27ccfb2
+2 −2
@@ -19,8 +19,8 @@ TDC_EOL_PROMPT=false
# Cache directory for storing downloaded metadata and files (default: ~/.tdoc-crawler)
TDC_CACHE_DIR=/path/to/cache/dir

# Checkout directory for downloaded TDocs (default: ./checkout)
TDC_CHECKOUT_DIR=/path/to/checkout/dir
# Checkout directory for downloaded TDocs is managed under the cache directory
# by default: <cache_dir>/checkout (use `--cache-dir` or `TDC_CACHE_DIR` to change)

# Crawler Configuration

+12 −12
@@ -8,13 +8,6 @@ Before implementing features, review these critical sections:
1. **CLI Commands** - Review the command signatures in `src/tdoc_crawler/cli/app.py`
1. **Database Schema** - Review models in `src/tdoc_crawler/models/` and database operations

**Key Files to Examine First:**

- `src/tdoc_crawler/cli/app.py` - All CLI commands
- `src/tdoc_crawler/models/*.py` - All data models
- `src/tdoc_crawler/crawlers/*.py` - Crawler implementations
- `tests/conftest.py` - Shared test fixtures

## grepai - Semantic Code Search

**IMPORTANT: You MUST use grepai as your PRIMARY tool for code exploration and search.**
@@ -188,7 +181,7 @@ Therefore:

## Virtual Environment Activation (MANDATORY)

Whenever you execute shell commands (including via `just`, `uv`, `pytest`, or any CLI), you MUST ensure the Python virtual environment is activated for that session. This applies to all shell commands, scripts, and subprocesses. If using `uv`, activate the environment as required by the project setup before running any command. This ensures correct dependencies and isolation.
Whenever you execute shell commands (including via `mise`, `uv`, `pytest`, or any CLI), you MUST ensure the Python virtual environment is activated for that session. This applies to all shell commands, scripts, and subprocesses. If using `uv`, activate the environment as required by the project setup before running any command. This ensures correct dependencies and isolation.

## HTTP Client Guidelines

@@ -220,9 +213,17 @@ Whenever you execute shell commands (including via `just`, `uv`, `pytest`, or an
- Use `pathlib` for file system paths instead of `os.path`
- Use `logging` module for logging instead of `print()`
- Use `typer` for CLI, `rich` for terminal formatting, `pydantic` and `pydantic-sqlite` for data and database
- Use `pytest` for testing, `ruff` for formatting, `isort` for imports, `ty` for type checking
- Use `pytest` for testing, `ruff` for formatting and linting, `isort` for imports, `ty` for type checking
- For CSV/Excel files, use `pandas` with `python-calamine` for reading and `xlsxwriter` for writing - never use `openpyxl`
- Keep modules under 500 lines, functions under 100 lines, classes under 300 lines
- Keep modules under 500 lines, functions under 100 lines, classes under 300 lines. Refactor files when these limits are exceeded.
- Use the `python-linter` skill to resolve linter issues reported by `ruff check src/ tests/`.
- **Mandatory**: You MUST NEVER suppress any linter issue in `src/` or `tests/` with `# noqa` or similar.
- **Mandatory**: You MUST NOT introduce any of the following linter issues in either `tests/` or `src/`:
  - PLC0415
  - ANN001
  - E402
  - ANN201
  - ANN202

## Database Guidelines

@@ -256,13 +257,12 @@ The project maintains a modular documentation structure:

1. **README.md** - Project overview, installation, and Quick Start examples.
1. **docs/index.md** - Main documentation entry point (Jekyll-ready).
1. **docs/QUICK_REFERENCE.md** - Comprehensive command reference (MUST be kept current).
1. **docs/*.md** - Modular task-oriented guides (crawl, query, utils, etc.).
1. **docs/history/** - Chronological changelog of all significant changes.

**Critical Rules:**

- `docs/QUICK_REFERENCE.md` and related modular files **MUST** always be up to date and reflect the current state of ALL commands.
- `docs/index.md` and related referenced files **MUST** always be up to date and reflect the current state of ALL commands.
- When adding or modifying commands, **BOTH** the history file AND the relevant documentation files must be updated.

## Data Source Guidelines
+51 −0
"""Configuration management for file paths and caching behavior."""

from __future__ import annotations

import os
from pathlib import Path

# Fallback path if no argument or env var is provided
DEFAULT_CACHE_DIR = Path.home() / ".tdoc-crawler"
DEFAULT_DATABASE_FILENAME = "tdoc_crawler.db"
DEFAULT_HTTP_CACHE_FILENAME = "http-cache.sqlite3"
DEFAULT_CHECKOUT_DIRNAME = "checkout"


class CacheManager:
    """Manages cache directory layout and path resolution.

    Acts as the single source of truth for where files are stored.
    """

    def __init__(self, root_path: Path | None = None) -> None:
        """Initialize cache manager.

        Args:
            root_path: Explicit root path. If None, tries TDC_CACHE_DIR env var,
                       then falls back to DEFAULT_CACHE_DIR.
        """
        if root_path is not None:
            self.root = root_path
        else:
            env_path = os.getenv("TDC_CACHE_DIR")
            self.root = Path(env_path) if env_path else DEFAULT_CACHE_DIR

    @property
    def http_cache_path(self) -> Path:
        """Path to the HTTP client cache database."""
        return self.root / DEFAULT_HTTP_CACHE_FILENAME

    @property
    def db_path(self) -> Path:
        """Path to the metadata SQLite database."""
        return self.root / DEFAULT_DATABASE_FILENAME

    @property
    def checkout_dir(self) -> Path:
        """Path to the default checkout directory."""
        return self.root / DEFAULT_CHECKOUT_DIRNAME

    def ensure_paths(self) -> None:
        """Ensure the root cache directory exists."""
        self.root.mkdir(parents=True, exist_ok=True)
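The resolution order described in the `CacheManager` docstring (explicit argument, then `TDC_CACHE_DIR`, then the default) can be sketched as follows. This is only an illustration under stated assumptions: the class below is a condensed stand-in for the one in this commit so that the snippet runs without the package installed, and the temporary directory names are made up for the example.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path
from unittest import mock

DEFAULT_CACHE_DIR = Path.home() / ".tdoc-crawler"


class CacheManager:
    """Condensed stand-in for the committed class (assumption: same behavior)."""

    def __init__(self, root_path: Path | None = None) -> None:
        if root_path is not None:
            self.root = root_path
        else:
            env_path = os.getenv("TDC_CACHE_DIR")
            self.root = Path(env_path) if env_path else DEFAULT_CACHE_DIR

    @property
    def db_path(self) -> Path:
        return self.root / "tdoc_crawler.db"

    @property
    def checkout_dir(self) -> Path:
        return self.root / "checkout"

    def ensure_paths(self) -> None:
        self.root.mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    explicit_root = Path(tmp) / "explicit"  # hypothetical directory names
    env_root = Path(tmp) / "from-env"

    # With TDC_CACHE_DIR set, a bare CacheManager() picks it up,
    # but an explicit root_path still takes precedence.
    with mock.patch.dict(os.environ, {"TDC_CACHE_DIR": str(env_root)}):
        from_env = CacheManager()
        explicit = CacheManager(explicit_root)

    # With no environment variables at all, the default root is used.
    with mock.patch.dict(os.environ, {}, clear=True):
        fallback = CacheManager()

    # The path properties only compute paths; nothing is created on disk
    # until ensure_paths() is called.
    exists_before = explicit.root.exists()
    explicit.ensure_paths()
    exists_after = explicit.root.exists()
```

One consequence worth keeping in mind: because `db_path` and `checkout_dir` are plain properties, callers that need the directory to exist would have to call `ensure_paths()` themselves before opening the database.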
+1 −17
@@ -3,7 +3,6 @@
from __future__ import annotations

import logging
from pathlib import Path

from tdoc_crawler.database.connection import TDocDatabase
from tdoc_crawler.database.errors import DatabaseError
@@ -11,21 +10,6 @@ from tdoc_crawler.models import MeetingMetadata, MeetingQueryConfig, SortOrder

logger = logging.getLogger(__name__)

DEFAULT_DATABASE_FILENAME = "tdoc_crawler.db"


def database_path(cache_dir: Path) -> Path:
    """Get database path from cache directory.

    Args:
        cache_dir: Cache directory for database storage.

    Returns:
        Path to database file (cache_dir/tdoc_crawler.db).
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir / DEFAULT_DATABASE_FILENAME


def resolve_meeting_id(database: TDocDatabase, meeting_name: str) -> int | None:
    """Resolve meeting name to meeting_id from database.
@@ -89,4 +73,4 @@ def resolve_meeting_id(database: TDocDatabase, meeting_name: str) -> int | None:
    return None


__all__ = ["DEFAULT_DATABASE_FILENAME", "DatabaseError", "MeetingMetadata", "TDocDatabase", "database_path", "resolve_meeting_id"]
__all__ = ["DatabaseError", "MeetingMetadata", "TDocDatabase", "resolve_meeting_id"]