Commit e40024b6 authored by Jan Reimes

🔥 chore: remove PLAN.md

parent 8c8b4679

PLAN.md

deleted 100644 → 0
+0 −751
# PLAN: Multi-Provider Office→PDF Conversion CLI

## Goal

Implement a production-ready CLI tool that converts MS Office documents (Word, Excel, PowerPoint) to PDF using multiple cloud API providers with automatic failover, HTTP caching via hishel, and quota management.

**User-visible outcome:**

- `pdf-remote-converter convert document.docx output.pdf` → auto-selects configured provider (single key) or best available (multiple keys)
- `pdf-remote-converter convert --provider cloudconvert file.xlsx` → uses specific provider
- Automatic failover when provider hits quota/rate-limit (tries next configured provider)
- Automatic caching prevents redundant API calls for identical files
- Clear error messages when all providers fail or conversion errors

______________________________________________________________________

## Context

### Project Structure (Current)

```
pdf-remote-converter/
├── src/pdf_remote_converter/
│   ├── __init__.py
│   ├── __about__.py
│   └── cli/
│       ├── app.py          # Typer CLI entry point
│       ├── args.py         # CLI argument definitions
│       └── commands.py     # Placeholder commands (greet, add)
├── tests/
├── pyproject.toml          # Dependencies: typer only
└── office2pdf-apis.md      # API provider research
```

### Target Providers (from office2pdf-apis.md)

| Provider | Free Tier | Formats | Notes |
|----------|-----------|---------|-------|
| CloudConvert | 10/day (~300/mo) | DOC/DOCX/XLS/XLSX/PPT/PPTX | Async job model, Python SDK |
| Adobe PDF Services | 500/mo | DOC/DOCX/XLS/XLSX/PPT/PPTX | Official Python SDK, best legacy support |
| Zamzar | 100/mo | DOC/DOCX/XLS/XLSX/PPT/PPTX | 1MB limit on free tier, Python SDK |

### Technical Constraints

- Python 3.13+
- Use `httpx` for HTTP (required by hishel)
- Use `hishel` for HTTP caching (RFC 9111 compliant)
- CLI via `typer`
- 100% test coverage requirement
- Ruff for linting/formatting

### Dependencies to Add

```toml
dependencies = [
    "typer>=0.12.0",
    "httpx>=0.27.0",
    "hishel>=0.1.0",
    "pydantic>=2.0.0",  # Config/settings validation
    "pydantic-settings>=2.0.0",  # For BaseSettings
]
```

______________________________________________________________________

## Phases

### Phase 1: Core Infrastructure

**Deliverable:** Provider protocol, base HTTP client with caching, configuration system

**Files to create:**

```
src/pdf_remote_converter/
├── config.py              # Pydantic settings, env var loading
├── exceptions.py          # Custom exceptions (QuotaExceeded, ConversionFailed, etc.)
├── http.py                # Hishel-cached httpx client factory
├── logging.py             # Logging configuration
├── providers/
│   ├── __init__.py
│   ├── base.py            # Provider protocol and base class
│   └── models.py          # Conversion result, job status models
tests/
├── __init__.py
├── conftest.py            # Pytest fixtures, mock providers
└── test_providers_base.py # Tests for protocol and base class
.env.example               # Template with all environment variables and defaults
```

**Files to modify:**

- `pyproject.toml` — add dependencies
- `src/pdf_remote_converter/cli/args.py` — add convert command args
- `src/pdf_remote_converter/cli/commands.py` — remove placeholder commands

**CLI argument definitions:**

```python
# cli/args.py
from pathlib import Path
from typing import Annotated

import typer

InputFile = Annotated[Path, typer.Argument(help="Input Office file to convert")]

OutputFile = Annotated[Path, typer.Argument(help="Output PDF file path")]

Provider = Annotated[
    str | None,
    typer.Option("--provider", "-p", help="Provider: cloudconvert, adobe, zamzar"),
]

ApiKey = Annotated[
    str | None,
    typer.Option("--api-key", envvar="PDF_REMOTE_CONVERTER_API_KEY", help="API key for selected provider"),
]

Force = Annotated[bool, typer.Option("--force/--no-force", help="Skip cache and force fresh conversion")]
```

**Key interfaces:**

```python
# providers/base.py
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class ConversionResult:
    """Result of a PDF conversion."""
    output_path: Path
    provider: str
    from_cache: bool
    credits_used: int

class ProviderBackend(Protocol):
    """Protocol for conversion providers."""
    
    @property
    def name(self) -> str: ...
    
    @property
    def supported_formats(self) -> set[str]: ...
    
    @property
    def monthly_quota(self) -> int: ...
    
    @property
    def quota_remaining(self) -> int: ...
    
    def convert(self, input_path: Path, output_path: Path) -> ConversionResult:
        """Convert input file to PDF."""
        ...
    
    def is_healthy(self) -> bool:
        """Check if provider is available and configured."""
        ...
```
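
**Sketch (mock provider for conftest.py):** Because `ProviderBackend` is a structural protocol, a test double only needs matching attributes and methods; it does not have to inherit from anything. The following is a minimal sketch under that assumption. `FakeProvider` is a hypothetical name, and the `@runtime_checkable` decorator is added here only so the test can use `isinstance`; the plan does not state whether the real protocol uses it.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class ConversionResult:
    output_path: Path
    provider: str
    from_cache: bool
    credits_used: int


@runtime_checkable  # assumption: enables isinstance() checks in tests
class ProviderBackend(Protocol):
    @property
    def name(self) -> str: ...
    @property
    def supported_formats(self) -> set[str]: ...
    @property
    def monthly_quota(self) -> int: ...
    @property
    def quota_remaining(self) -> int: ...
    def convert(self, input_path: Path, output_path: Path) -> ConversionResult: ...
    def is_healthy(self) -> bool: ...


class FakeProvider:
    """Hypothetical in-memory provider for tests; makes no network calls."""

    name = "fake"
    supported_formats = {"docx", "xlsx", "pptx"}
    monthly_quota = 100

    def __init__(self) -> None:
        self.quota_remaining = self.monthly_quota

    def convert(self, input_path: Path, output_path: Path) -> ConversionResult:
        # Write a placeholder payload instead of calling a remote API.
        output_path.write_bytes(b"%PDF-1.4\n% fake output\n")
        self.quota_remaining -= 1
        return ConversionResult(output_path, self.name, from_cache=False, credits_used=1)

    def is_healthy(self) -> bool:
        return self.quota_remaining > 0
```

Note that `isinstance` against a runtime-checkable protocol only verifies that the members exist, not that their signatures match, so a static type checker is still the stronger check.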

**Configuration (config.py):**

```python
# config.py
from pathlib import Path

from pydantic_settings import BaseSettings, SettingsConfigDict

class ProviderSettings(BaseSettings):
    """API credentials loaded from environment variables."""
    model_config = SettingsConfigDict(env_prefix="", case_sensitive=False)

    # Per-provider credentials
    cloudconvert_api_key: str | None = None
    adobe_client_id: str | None = None
    adobe_client_secret: str | None = None
    zamzar_api_key: str | None = None

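    # Generic fallback (used when provider-specific key not set)
    # NOTE: with env_prefix="" the three fields below are read from API_KEY,
    # CACHE_DIR and DEFAULT_PROVIDER. Reading the documented
    # PDF_REMOTE_CONVERTER_* names requires per-field aliases or a prefix.
    api_key: str | None = None

    # App settings
    cache_dir: Path = Path("~/.cache/pdf-remote-converter").expanduser()
    default_provider: str = "cloudconvert"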

    def get_api_key(self, provider: str) -> str | None:
        """Get API key for provider (provider-specific or fallback)."""
        if provider == "adobe":
            # Adobe uses client_id + client_secret pair, not a single key
            return None  # Adobe uses separate credentials
        key = getattr(self, f"{provider}_api_key", None)
        return key or self.api_key
    
    def get_adobe_credentials(self) -> tuple[str, str] | None:
        """Get Adobe client_id and client_secret as tuple."""
        if self.adobe_client_id and self.adobe_client_secret:
            return (self.adobe_client_id, self.adobe_client_secret)
        return None

# logging.py
import logging
import sys

def setup_logging(verbose: bool = False) -> None:
    """Configure logging for the application."""
    level = logging.DEBUG if verbose else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
        handlers=[logging.StreamHandler(sys.stderr)],
    )
```

**Validation:**

```bash
uv run pytest tests/test_providers_base.py -v
uv run ruff check src/
```

______________________________________________________________________

### Phase 2: CloudConvert Provider

**Deliverable:** Fully functional CloudConvert integration with async job handling

**Files to create:**

```
src/pdf_remote_converter/providers/
└── cloudconvert.py        # CloudConvert implementation

tests/
└── test_cloudconvert.py   # Mock-based tests
```

**Implementation notes (from office2pdf-apis.md):**

- API v2 uses job model: `import/upload` → `convert` → `export` tasks
- Async: create job → poll status → download result
- Auth via API key in header

**Key flow:**

```python
class CloudConvertProvider:
    def convert(self, input_path: Path, output_path: Path) -> ConversionResult:
        # 1. Create job with import/upload + convert + export tasks
        # 2. Upload file to import task URL
        # 3. Poll job status until complete
        # 4. Download PDF from export task URL
        # 5. Return result with credits_used=1
```
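
**Sketch (step 3, polling):** The polling step can be kept independent of the HTTP layer, which makes it easy to test without network access. This is a minimal sketch, not the CloudConvert client itself: `poll_until_done` and its parameters are hypothetical, `fetch_status` stands in for whatever call returns the job status, and the terminal status strings `"finished"` and `"error"` are assumptions to be checked against the API documentation.

```python
import time
from collections.abc import Callable

# Assumed terminal job states; verify against the provider's API docs.
TERMINAL_STATUSES = {"finished", "error"}


def poll_until_done(
    fetch_status: Callable[[], str],
    *,
    interval: float = 1.0,
    max_interval: float = 10.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status() with exponential backoff until a terminal status.

    Raises TimeoutError if the next wait would exceed the overall timeout.
    """
    deadline = time.monotonic() + timeout
    delay = interval
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"job still '{status}' after {timeout:.0f}s")
        sleep(delay)
        # Double the wait each round, capped at max_interval.
        delay = min(delay * 2, max_interval)
```

Injecting `sleep` lets the mock-based tests in `tests/test_cloudconvert.py` run instantly while still asserting the backoff schedule.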

**Validation:**

```bash
uv run pytest tests/test_cloudconvert.py -v
uv run pdf-remote-converter convert --provider cloudconvert test.docx out.pdf
```

______________________________________________________________________

### Phase 3: Adobe PDF Services Provider

**Deliverable:** Adobe integration using official SDK patterns

**Files to create:**

```
src/pdf_remote_converter/providers/
└── adobe.py               # Adobe PDF Services implementation

tests/
└── test_adobe.py          # Mock-based tests
```

**Implementation notes:**

- Uses JWT/OAuth authentication
- Official Python SDK available: `pdfservices-sdk` on PyPI (source repository: `adobe/pdfservices-python-sdk`)
- Flow: authenticate → POST job → poll → download

**Dependencies to add:**

```toml
"pdfservices-sdk>=4.0.0",  # Adobe PDF Services Python SDK
```

**Note:** The Adobe SDK authenticates with a `ServicePrincipalCredentials` object built from `client_id` and `client_secret`. Both values are loaded from their own env vars, but they are passed to the SDK together as one credentials object, not as a single API key. See [Adobe SDK docs](https://github.com/adobe/pdfservices-python-sdk).

**Validation:**

```bash
uv run pytest tests/test_adobe.py -v
```

______________________________________________________________________

### Phase 4: Zamzar Provider

**Deliverable:** Zamzar integration with file size validation

**Files to create:**

```
src/pdf_remote_converter/providers/
└── zamzar.py              # Zamzar implementation

tests/
└── test_zamzar.py         # Mock-based tests
```

**Implementation notes:**

- Simple REST API with dedicated format endpoints
- Free tier: 1MB max file size
- Python SDK on PyPI: `zamzar` (verify package exists before Phase 4)

**Note:** Verify `zamzar` package on PyPI before implementation. If unavailable or unmaintained, use httpx directly with REST API (similar to Adobe fallback).

**Key considerations:**

- Validate file size before upload (reject >1MB on free tier)
- Use format-specific endpoints (e.g., `/doc-to-pdf`, `/xlsx-to-pdf`)
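
**Sketch (file size validation):** The size check can run before any upload so that an oversized file never consumes a credit. A minimal sketch, assuming "1MB" means 1 MiB (1,048,576 bytes); `validate_file_size`, `FileTooLargeError`, and the constant name are hypothetical.

```python
from pathlib import Path

# Assumption: the free-tier limit of "1MB" is interpreted as 1 MiB.
ZAMZAR_FREE_TIER_MAX_BYTES = 1024 * 1024


class FileTooLargeError(Exception):
    """Raised when the input exceeds the provider's size limit (hypothetical)."""


def validate_file_size(path: Path, max_bytes: int = ZAMZAR_FREE_TIER_MAX_BYTES) -> int:
    """Return the file size in bytes, or raise if it exceeds max_bytes."""
    size = path.stat().st_size
    if size > max_bytes:
        raise FileTooLargeError(
            f"{path.name} is {size} bytes; limit is {max_bytes} bytes"
        )
    return size
```

Since "file too large" is listed as a non-failover condition in Phase 5, this error should propagate to the user rather than trigger the next provider.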

**Validation:**

```bash
uv run pytest tests/test_zamzar.py -v
```

______________________________________________________________________

### Phase 5: Provider Router & CLI Integration

**Deliverable:** Provider selection, failover logic, and complete CLI

**Files to create:**

```
src/pdf_remote_converter/
├── router.py              # Provider selection and failover
├── utils.py               # Format detection, file utilities
└── converter.py           # High-level Converter facade (public API)
```

**Files to modify:**

```
src/pdf_remote_converter/cli/
├── args.py                # Add InputFile, OutputFile, Provider, ApiKey, Force args
└── commands.py            # Implement convert command
```

**Provider selection logic:**

```python
# Default provider preference order (hardcoded, configurable later)
PROVIDER_PREFERENCE = ["cloudconvert", "adobe", "zamzar"]

def get_configured_providers(settings: ProviderSettings) -> list[ProviderBackend]:
    """Get list of providers that have API keys configured."""
    configured = []
    for name in PROVIDER_PREFERENCE:
        if settings.get_api_key(name) or (name == "adobe" and settings.get_adobe_credentials()):
            configured.append(create_provider(name, settings))
    return configured

def select_provider(
    input_path: Path,
    preferred: str | None,  # no default: a defaulted parameter cannot precede `settings`
    settings: ProviderSettings,
) -> ProviderBackend:
    """Select best available provider with automatic failover.
    
    Selection order:
    1. If --provider explicitly specified, use that (error if not configured)
    2. If only one provider has API key, use it automatically
    3. If multiple providers configured, use first by preference order
    """
    configured = get_configured_providers(settings)
    
    if not configured:
        raise NoProviderConfiguredError("No API keys configured. Set at least one provider API key.")
    
    if preferred:
        # Explicit selection - must be configured
        provider = next((p for p in configured if p.name == preferred), None)
        if not provider:
            raise ProviderNotConfiguredError(f"Provider '{preferred}' not configured. Set {preferred.upper()}_API_KEY.")
        return provider
    
    # Single provider: use it directly
    if len(configured) == 1:
        return configured[0]
    
    # Multiple providers: use first by preference (failover handled by router)
    return configured[0]

def convert_with_failover(
    input_path: Path,
    output_path: Path,
    providers: list[ProviderBackend],
) -> ConversionResult:
    """Convert with automatic failover to next provider on error.
    
    Failover triggers:
    - QuotaExceededError
    - RateLimitError
    - ProviderUnavailableError
    
    Does NOT failover for:
    - Invalid file format
    - File too large
    - Authentication errors (wrong API key)
    """
    errors = []
    for provider in providers:
        try:
            return provider.convert(input_path, output_path)
        except (QuotaExceededError, RateLimitError, ProviderUnavailableError) as e:
            errors.append((provider.name, e))
            continue  # Try next provider
    
    raise AllProvidersFailedError(f"All providers failed: {errors}")
```

**High-level Converter API (converter.py):**

```python
# Public facade for package usage
class Converter:
    """High-level converter API for programmatic usage."""
    
    def __init__(
        self,
        provider: str | None = None,
        api_key: str | None = None,
        settings: ProviderSettings | None = None,
    ):
        """Initialize converter with optional provider override."""
        self.settings = settings or ProviderSettings()
        self.provider_override = provider
        self.api_key_override = api_key
    
    def convert(self, input_path: str | Path, output_path: str | Path) -> ConversionResult:
        """Convert Office document to PDF with auto-selection and failover."""
        input_path = Path(input_path)
        output_path = Path(output_path)
        
        # Get configured providers
        providers = get_configured_providers(self.settings)
        
        # Select provider (or use override)
        if self.provider_override:
            selected = select_provider(input_path, self.provider_override, self.settings)
            providers = [selected]
        
        # Convert with failover
        return convert_with_failover(input_path, output_path, providers)
```

**CLI command:**

```python
@app.command()
def convert(
    input_file: args.InputFile,
    output_file: args.OutputFile,
    provider: args.Provider = None,
    api_key: args.ApiKey = None,   # Override env var for selected provider
    force: args.Force = False,     # Skip cache
) -> None:
    """Convert Office document to PDF."""
```

**Validation:**

```bash
uv run pdf-remote-converter convert document.docx output.pdf
uv run pdf-remote-converter convert --provider adobe large.xlsx out.pdf
uv run pdf-remote-converter convert --api-key YOUR_KEY --provider cloudconvert doc.docx out.pdf
uv run pdf-remote-converter convert --force file.pptx output.pdf
uv run pytest tests/ -v --cov
```

______________________________________________________________________

### Phase 6: Caching & Quota Management

**Deliverable:** Hishel HTTP caching with file-content-aware cache keys

**Files to modify:**

```
src/pdf_remote_converter/
└── providers/base.py      # Add cache key generation (file hash)
```

**Caching strategy:**

- Use hishel `SyncCacheClient` with `always_cache=True`
- Hishel caches POST requests using body-sensitive keys (file content hashed automatically)
- Cache key: provider + format + file content hash (hishel handles this via Vary header + body hashing)
- Default TTL: 7 days (configurable via `default_ttl`)
- Storage: SQLite (default) or file system
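
**Sketch (explicit cache key):** Independent of how hishel derives its own keys, the "provider + format + file content hash" key named above can be computed with the standard library, for example for the file hash helper planned in `providers/base.py`. A minimal sketch; `file_sha256`, `conversion_cache_key`, and the `:`-separated key layout are assumptions.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large inputs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def conversion_cache_key(provider: str, source_format: str, path: Path) -> str:
    """Build a key of the form '<provider>:<format>:<sha256 of content>'."""
    return f"{provider}:{source_format.lower()}:{file_sha256(path)}"
```

Two files with identical bytes but different names produce the same key, which is the property that makes "identical files" cacheable regardless of path.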

**Note on POST caching:** CloudConvert/Adobe/Zamzar all use POST for conversion. Hishel supports body-sensitive caching for POST by using the request body in the cache key. Configure with:

```python
controller = Controller(
    cacheable_methods=["GET", "POST"],
    cacheable_status_codes=[200],
)
```

**Implementation:**

```python
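# http.py
from pathlib import Path

from hishel import CacheOptions, SpecificationPolicy, Controller
from hishel.httpx import SyncCacheClient
from hishel.storages import SyncSqliteStorage

def create_cached_client(cache_dir: Path | None = None) -> SyncCacheClient:
    """Create HTTP client with RFC 9111 caching + POST body-sensitive caching."""
    # Fall back to the default cache location; `None / "..."` would raise TypeError
    cache_dir = cache_dir or Path("~/.cache/pdf-remote-converter").expanduser()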
    controller = Controller(
        cacheable_methods=["GET", "POST"],  # Enable POST caching for API calls
        cacheable_status_codes=[200, 201],
    )
    return SyncCacheClient(
        storage=SyncSqliteStorage(cache_dir / "http_cache.sqlite"),
        controller=controller,
        policy=SpecificationPolicy(
            cache_options=CacheOptions(always_cache=True)
        )
    )
```

**Validation:**

```bash
# First call hits API
uv run pdf-remote-converter convert test.docx out.pdf
# Second call should use cache (verify via logging or --verbose)
uv run pdf-remote-converter convert test.docx out.pdf
```

______________________________________________________________________

## Validation

### Commands (run from project root)

```bash
# Lint and format
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/

# Tests with coverage
uv run pytest tests/ -v --cov --cov-report=term-missing

# CLI smoke tests
uv run pdf-remote-converter --help
uv run pdf-remote-converter convert --help
uv run pdf-remote-converter convert test.docx output.pdf
uv run pdf-remote-converter convert --provider cloudconvert test.xlsx out.pdf

# Spell check
uv run codespell src/ tests/
```

### Acceptance Criteria

- [ ] All three providers (CloudConvert, Adobe, Zamzar) implement `ProviderBackend` protocol
- [ ] Single API key configured → auto-select that provider (no --provider needed)
- [ ] Multiple API keys → use preference order with automatic failover on quota/rate-limit errors
- [ ] Hishel caching reduces API calls for identical files
- [ ] API keys configurable via: CLI --api-key, function argument, or env var
- [ ] Clear error messages for: quota exceeded, unsupported format, file too large, missing API key
- [ ] 100% test coverage
- [ ] All linting passes (ruff, codespell)
- [ ] CLI `--help` is clear and complete
- [ ] Package usable as Python library (not just CLI)

______________________________________________________________________

## Progress

- [x] (2026-03-21) Created PLAN.md
- [x] (2026-03-22) Phase 1: Core Infrastructure (config, exceptions, http, logging, base protocols)
- [x] (2026-03-22) Phase 2: CloudConvert Provider
- [x] (2026-03-22) Phase 3: Adobe PDF Services Provider
- [x] (2026-03-22) Phase 4: Zamzar Provider
- [x] (2026-03-22) Phase 5: Router & CLI Integration
- [x] (2026-03-22) Phase 6: Caching (hishel HTTP caching, file hash utilities, --force support)

______________________________________________________________________

## Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| HTTP Client | httpx + hishel | Required for caching, modern async support |
| Caching | hishel with always_cache | APIs may not set proper cache headers |
| Config | Pydantic Settings | Type-safe env var loading |
| Provider protocol | Protocol class | Enables easy provider swapping |
| Initial providers | CloudConvert, Adobe, Zamzar | User requested all three from start |
| API key config | CLI > arg > env var | Flexible: ad-hoc CLI, explicit code, or global env |
| Auto-selection | Single key → use it; Multiple → preference order | Zero-config when one provider, smart failover when multiple |
| Failover triggers | QuotaExceeded, RateLimit, ProviderUnavailable | Transient errors only; auth/format errors don't failover |
| Provider preference | cloudconvert → adobe → zamzar | Based on free tier limits and format support |

______________________________________________________________________

## Notes

<!-- Add implementation notes during development -->

### API Key Configuration

API keys can be provided in three ways (priority order: CLI > argument > env var):

**1. CLI Option (--api-key)**

```bash
pdf-remote-converter convert --provider cloudconvert --api-key YOUR_KEY doc.docx out.pdf
```

**2. Function Argument (package usage)**

```python
from pdf_remote_converter import Converter

converter = Converter(
    provider="cloudconvert",
    api_key="YOUR_KEY"  # Explicit argument
)
converter.convert("doc.docx", "out.pdf")
```

**3. Environment Variables**

```bash
# Per-provider keys (used when no explicit key provided)
CLOUDCONVERT_API_KEY=xxx
ADOBE_CLIENT_ID=xxx
ADOBE_CLIENT_SECRET=xxx
ZAMZAR_API_KEY=xxx

# Or generic key for current default provider
PDF_REMOTE_CONVERTER_API_KEY=xxx
```

**Environment Variables (all)**

```
CLOUDCONVERT_API_KEY=xxx
ADOBE_CLIENT_ID=xxx
ADOBE_CLIENT_SECRET=xxx
ZAMZAR_API_KEY=xxx
PDF_REMOTE_CONVERTER_API_KEY=xxx        # Fallback for current provider
PDF_REMOTE_CONVERTER_CACHE_DIR=~/.cache/pdf-remote-converter
PDF_REMOTE_CONVERTER_DEFAULT_PROVIDER=cloudconvert
```

### .env.example Template

Create `.env.example` in project root with all configurable variables:

```bash
# PDF Remote Converter Configuration
# Copy this file to .env and fill in your values

# =============================================================================
# Provider API Keys (at least one required)
# =============================================================================

# CloudConvert (10 conversions/day free tier)
# Get your key at: https://cloudconvert.com/dashboard/api/v2/keys
CLOUDCONVERT_API_KEY=

# Adobe PDF Services (500 transactions/month free tier)
# Get credentials at: https://developer.adobe.com/console
# Note: Adobe uses client_id + client_secret pair (NOT a single API key)
ADOBE_CLIENT_ID=
ADOBE_CLIENT_SECRET=

# Zamzar (100 credits/month free tier, 1MB limit)
# Get your key at: https://developers.zamzar.com/
ZAMZAR_API_KEY=

# Generic fallback API key (used if provider-specific key not set)
PDF_REMOTE_CONVERTER_API_KEY=

# =============================================================================
# Application Settings
# =============================================================================

# Default provider: cloudconvert, adobe, or zamzar
# Default: cloudconvert
PDF_REMOTE_CONVERTER_DEFAULT_PROVIDER=cloudconvert

# Cache directory for HTTP responses and converted files
# Default: ~/.cache/pdf-remote-converter
PDF_REMOTE_CONVERTER_CACHE_DIR=
```

### File Format Detection

Use `filetype` or `python-magic` for format detection if the file extension is unreliable.

### SDK Verification Checklist

Before starting each provider phase, verify SDK availability:

| Provider | Package | Status | Fallback |
|----------|---------|--------|----------|
| CloudConvert | `cloudconvert` | Check PyPI | httpx + REST API |
| Adobe | `pdfservices-sdk>=4.0.0` | ✅ Verified | httpx + REST API |
| Zamzar | `zamzar` | **To verify** | httpx + REST API |

### Logging

Use standard `logging` module with structured output:

```
2026-03-21 10:30:15 [INFO] pdf_remote_converter.router: Using provider 'cloudconvert' (1 of 3 configured)
2026-03-21 10:30:16 [DEBUG] pdf_remote_converter.http: Cache MISS for POST https://api.cloudconvert.com/...
2026-03-21 10:30:20 [INFO] pdf_remote_converter.providers.cloudconvert: Job completed, downloaded 2.3MB PDF
```

### Package Usage (Public API)

The library should be usable as a Python package, not just CLI:

```python
from pathlib import Path

from pdf_remote_converter import Converter
from pdf_remote_converter.providers import CloudConvertProvider

# Simple usage with env var credentials
converter = Converter()
result = converter.convert("document.docx", "output.pdf")

# Explicit provider and credentials
converter = Converter(
    provider="cloudconvert",
    api_key="YOUR_KEY"
)
result = converter.convert("document.docx", "output.pdf")

# Direct provider instantiation
provider = CloudConvertProvider(api_key="YOUR_KEY")
result = provider.convert(Path("document.docx"), Path("output.pdf"))
print(f"Used {result.credits_used} credits, from_cache={result.from_cache}")
```