🔥 chore: remove PLAN.md (e40024b6) · Commits · Jan Reimes / pdf-remote-converter

PLAN.md

deleted 100644 → 0

+0 −751

		# PLAN: Multi-Provider Office→PDF Conversion CLI

		## Goal

		Implement a production-ready CLI tool that converts MS Office documents (Word, Excel, PowerPoint) to PDF using multiple cloud API providers with automatic failover, HTTP caching via hishel, and quota management.

		User-visible outcome:

		- `pdf-remote-converter convert document.docx output.pdf` → auto-selects configured provider (single key) or best available (multiple keys)
		- `pdf-remote-converter convert --provider cloudconvert file.xlsx` → uses specific provider
		- Automatic failover when provider hits quota/rate-limit (tries next configured provider)
		- Automatic caching prevents redundant API calls for identical files
		- Clear error messages when all providers fail or a conversion fails

		______________________________________________________________________

		## Context

		### Project Structure (Current)

		```
		pdf-remote-converter/
		├── src/pdf_remote_converter/
		│   ├── __init__.py
		│   ├── __about__.py
		│   └── cli/
		│       ├── app.py        # Typer CLI entry point
		│       ├── args.py       # CLI argument definitions
		│       └── commands.py   # Placeholder commands (greet, add)
		├── tests/
		├── pyproject.toml        # Dependencies: typer only
		└── office2pdf-apis.md    # API provider research
		```

		### Target Providers (from office2pdf-apis.md)

		| Provider | Free Tier | Formats | Notes |
		|----------|-----------|---------|-------|
		| CloudConvert | 10/day (~300/mo) | DOC/DOCX/XLS/XLSX/PPT/PPTX | Async job model, Python SDK |
		| Adobe PDF Services | 500/mo | DOC/DOCX/XLS/XLSX/PPT/PPTX | Official Python SDK, best legacy support |
		| Zamzar | 100/mo | DOC/DOCX/XLS/XLSX/PPT/PPTX | 1MB limit on free tier, Python SDK |

		### Technical Constraints

		- Python 3.13+
		- Use `httpx` for HTTP (required by hishel)
		- Use `hishel` for HTTP caching (RFC 9111 compliant)
		- CLI via `typer`
		- 100% test coverage requirement
		- Ruff for linting/formatting

		### Dependencies to Add

		```toml
		dependencies = [
		    "typer>=0.12.0",
		    "httpx>=0.27.0",
		    "hishel>=0.1.0",
		    "pydantic>=2.0.0",           # Config/settings validation
		    "pydantic-settings>=2.0.0",  # For BaseSettings
		]
		```

		______________________________________________________________________

		## Phases

		### Phase 1: Core Infrastructure

		Deliverable: Provider protocol, base HTTP client with caching, configuration system

		Files to create:

		```
		src/pdf_remote_converter/
		├── config.py        # Pydantic settings, env var loading
		├── exceptions.py    # Custom exceptions (QuotaExceeded, ConversionFailed, etc.)
		├── http.py          # Hishel-cached httpx client factory
		├── logging.py       # Logging configuration
		└── providers/
		    ├── __init__.py
		    ├── base.py      # Provider protocol and base class
		    └── models.py    # Conversion result, job status models
		tests/
		├── __init__.py
		├── conftest.py              # Pytest fixtures, mock providers
		└── test_providers_base.py   # Tests for protocol and base class
		.env.example         # Template with all environment variables and defaults
		```

		Files to modify:

		- `pyproject.toml` — add dependencies
		- `src/pdf_remote_converter/cli/args.py` — add convert command args
		- `src/pdf_remote_converter/cli/commands.py` — remove placeholder commands

		CLI argument definitions:

		```python
		# cli/args.py
		from pathlib import Path
		from typing import Annotated

		import typer

		InputFile = Annotated[Path, typer.Argument(help="Input Office file to convert")]

		OutputFile = Annotated[Path, typer.Argument(help="Output PDF file path")]

		Provider = Annotated[
		    str | None,
		    typer.Option("--provider", "-p", help="Provider: cloudconvert, adobe, zamzar"),
		]

		ApiKey = Annotated[
		    str | None,
		    typer.Option("--api-key", envvar="PDF_REMOTE_CONVERTER_API_KEY", help="API key for selected provider"),
		]

		Force = Annotated[bool, typer.Option("--force/--no-force", help="Skip cache and force fresh conversion")]
		```

		Key interfaces:

		```python
		# providers/base.py
		from dataclasses import dataclass
		from pathlib import Path
		from typing import Protocol


		@dataclass
		class ConversionResult:
		    """Result of a PDF conversion."""

		    output_path: Path
		    provider: str
		    from_cache: bool
		    credits_used: int


		class ProviderBackend(Protocol):
		    """Protocol for conversion providers."""

		    @property
		    def name(self) -> str: ...

		    @property
		    def supported_formats(self) -> set[str]: ...

		    @property
		    def monthly_quota(self) -> int: ...

		    @property
		    def quota_remaining(self) -> int: ...

		    def convert(self, input_path: Path, output_path: Path) -> ConversionResult:
		        """Convert input file to PDF."""
		        ...

		    def is_healthy(self) -> bool:
		        """Check if provider is available and configured."""
		        ...
		```
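To show how a test double could satisfy this protocol (the Phase 1 file list mentions mock providers in `conftest.py`), here is a minimal sketch. `FakeProvider` is a hypothetical name, not part of the plan, and the protocol is restated only so the snippet runs on its own. Marking the protocol `@runtime_checkable` is an assumption made here to allow the `isinstance` check.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class ConversionResult:
    output_path: Path
    provider: str
    from_cache: bool = False
    credits_used: int = 0


@runtime_checkable  # assumption: only needed for the isinstance() check below
class ProviderBackend(Protocol):
    @property
    def name(self) -> str: ...
    @property
    def supported_formats(self) -> set[str]: ...
    @property
    def monthly_quota(self) -> int: ...
    @property
    def quota_remaining(self) -> int: ...
    def convert(self, input_path: Path, output_path: Path) -> ConversionResult: ...
    def is_healthy(self) -> bool: ...


class FakeProvider:
    """Hypothetical in-memory provider for tests; performs no network or file I/O."""

    def __init__(self, name: str = "fake", monthly_quota: int = 10) -> None:
        self._name = name
        self._monthly_quota = monthly_quota
        self._used = 0

    @property
    def name(self) -> str:
        return self._name

    @property
    def supported_formats(self) -> set[str]:
        return {"docx", "xlsx", "pptx"}

    @property
    def monthly_quota(self) -> int:
        return self._monthly_quota

    @property
    def quota_remaining(self) -> int:
        return self._monthly_quota - self._used

    def convert(self, input_path: Path, output_path: Path) -> ConversionResult:
        # Count one credit per call, mirroring credits_used in the result.
        self._used += 1
        return ConversionResult(output_path=output_path, provider=self._name, credits_used=1)

    def is_healthy(self) -> bool:
        return self.quota_remaining > 0
```

Because `ProviderBackend` is a structural protocol, `FakeProvider` does not need to inherit from it; matching attribute names are enough.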

		Configuration (config.py):

		```python
		# config.py
		from pathlib import Path

		from pydantic import Field
		from pydantic_settings import BaseSettings, SettingsConfigDict


		class ProviderSettings(BaseSettings):
		    """API credentials and app settings loaded from environment variables."""

		    # populate_by_name lets callers also pass aliased fields by their field name.
		    model_config = SettingsConfigDict(env_prefix="", case_sensitive=False, populate_by_name=True)

		    # Per-provider credentials (CLOUDCONVERT_API_KEY, ADOBE_CLIENT_ID, ...)
		    cloudconvert_api_key: str | None = None
		    adobe_client_id: str | None = None
		    adobe_client_secret: str | None = None
		    zamzar_api_key: str | None = None

		    # Generic fallback (used when provider-specific key not set).
		    # The alias maps the field to the PDF_REMOTE_CONVERTER_* env var named in this plan.
		    api_key: str | None = Field(default=None, validation_alias="PDF_REMOTE_CONVERTER_API_KEY")

		    # App settings
		    cache_dir: Path = Field(
		        default=Path("~/.cache/pdf-remote-converter").expanduser(),
		        validation_alias="PDF_REMOTE_CONVERTER_CACHE_DIR",
		    )
		    default_provider: str = Field(
		        default="cloudconvert",
		        validation_alias="PDF_REMOTE_CONVERTER_DEFAULT_PROVIDER",
		    )

		    def get_api_key(self, provider: str) -> str | None:
		        """Get API key for provider (provider-specific or fallback)."""
		        if provider == "adobe":
		            # Adobe uses a client_id + client_secret pair, not a single key
		            return None
		        key = getattr(self, f"{provider}_api_key", None)
		        return key or self.api_key

		    def get_adobe_credentials(self) -> tuple[str, str] | None:
		        """Get Adobe client_id and client_secret as tuple."""
		        if self.adobe_client_id and self.adobe_client_secret:
		            return (self.adobe_client_id, self.adobe_client_secret)
		        return None


		# logging.py
		import logging
		import sys


		def setup_logging(verbose: bool = False) -> None:
		    """Configure logging for the application."""
		    level = logging.DEBUG if verbose else logging.INFO
		    logging.basicConfig(
		        level=level,
		        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
		        handlers=[logging.StreamHandler(sys.stderr)],
		    )
		```

		Validation:

		```bash
		uv run pytest tests/test_providers_base.py -v
		uv run ruff check src/
		```

		______________________________________________________________________

		### Phase 2: CloudConvert Provider

		Deliverable: Fully functional CloudConvert integration with async job handling

		Files to create:

		```
		src/pdf_remote_converter/providers/
		└── cloudconvert.py # CloudConvert implementation

		tests/
		└── test_cloudconvert.py # Mock-based tests
		```

		Implementation notes (from office2pdf-apis.md):

		- API v2 uses job model: `import/upload` → `convert` → `export` tasks
		- Async: create job → poll status → download result
		- Auth via API key in header

		Key flow:

		```python
		class CloudConvertProvider:
		    def convert(self, input_path: Path, output_path: Path) -> ConversionResult:
		        # 1. Create job with import/upload + convert + export tasks
		        # 2. Upload file to import task URL
		        # 3. Poll job status until complete
		        # 4. Download PDF from export task URL
		        # 5. Return result with credits_used=1
		        ...
		```
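Step 3 (polling) is the part of this flow most likely to hang or loop forever, so a sketch of a bounded polling helper may be useful. This is an illustration only: `poll_until_finished`, `JobFailedError`, and the status strings `"finished"` and `"error"` are assumptions to be checked against the CloudConvert API, and `fetch_status` stands in for whatever call reads the job status.

```python
import time
from collections.abc import Callable


class JobFailedError(RuntimeError):
    """Hypothetical error raised when the remote job reports a failure."""


def poll_until_finished(
    fetch_status: Callable[[], str],
    *,
    interval: float = 1.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status() until it reports completion, failure, or the timeout passes."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status == "finished":  # assumed terminal success status
            return status
        if status == "error":  # assumed terminal failure status
            raise JobFailedError("remote job reported an error")
        if clock() >= deadline:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting, which fits the mock-based tests planned for `test_cloudconvert.py`.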

		Validation:

		```bash
		uv run pytest tests/test_cloudconvert.py -v
		uv run pdf-remote-converter convert --provider cloudconvert test.docx out.pdf
		```

		______________________________________________________________________

		### Phase 3: Adobe PDF Services Provider

		Deliverable: Adobe integration using official SDK patterns

		Files to create:

		```
		src/pdf_remote_converter/providers/
		└── adobe.py # Adobe PDF Services implementation

		tests/
		└── test_adobe.py # Mock-based tests
		```

		Implementation notes:

		- Uses OAuth server-to-server credentials (client ID + client secret)
		- Official Python SDK available: `pdfservices-sdk` on PyPI (source repository: `adobe/pdfservices-python-sdk`)
		- Flow: authenticate → POST job → poll → download

		Dependencies to add:

		```toml
		"pdfservices-sdk>=4.0.0", # Adobe PDF Services Python SDK
		```

		Note: The Adobe SDK's `ServicePrincipalCredentials` takes `client_id` and `client_secret` together as a single credentials object, not as one API key. Both values are still read from the separate `ADOBE_CLIENT_ID` and `ADOBE_CLIENT_SECRET` env vars. See [Adobe SDK docs](https://github.com/adobe/pdfservices-python-sdk).

		Validation:

		```bash
		uv run pytest tests/test_adobe.py -v
		```

		______________________________________________________________________

		### Phase 4: Zamzar Provider

		Deliverable: Zamzar integration with file size validation

		Files to create:

		```
		src/pdf_remote_converter/providers/
		└── zamzar.py # Zamzar implementation

		tests/
		└── test_zamzar.py # Mock-based tests
		```

		Implementation notes:

		- Simple REST API with dedicated format endpoints
		- Free tier: 1MB max file size
		- Python SDK on PyPI: `zamzar` (verify package exists before Phase 4)

		Note: Verify `zamzar` package on PyPI before implementation. If unavailable or unmaintained, use httpx directly with REST API (similar to Adobe fallback).

		Key considerations:

		- Validate file size before upload (reject >1MB on free tier)
		- Use format-specific endpoints (e.g., `/doc-to-pdf`, `/xlsx-to-pdf`)
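
The size check can be sketched as a small pre-upload guard. The names `validate_file_size`, `FileTooLargeError`, and `ZAMZAR_FREE_TIER_MAX_BYTES` are hypothetical, and reading "1MB" as 1,048,576 bytes is an assumption to confirm against Zamzar's documentation.

```python
from pathlib import Path

# Assumption: the free-tier "1MB" limit is interpreted as 1 MiB.
ZAMZAR_FREE_TIER_MAX_BYTES = 1024 * 1024


class FileTooLargeError(ValueError):
    """Hypothetical error for files above the provider's size limit."""


def validate_file_size(path: Path, max_bytes: int = ZAMZAR_FREE_TIER_MAX_BYTES) -> int:
    """Return the file size in bytes, or raise if it exceeds max_bytes."""
    size = path.stat().st_size
    if size > max_bytes:
        raise FileTooLargeError(
            f"{path.name} is {size} bytes; limit is {max_bytes} bytes"
        )
    return size
```

Running this before the upload avoids spending a request (and possibly a credit) on a file the provider would reject anyway.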

		Validation:

		```bash
		uv run pytest tests/test_zamzar.py -v
		```

		______________________________________________________________________

		### Phase 5: Provider Router & CLI Integration

		Deliverable: Provider selection, failover logic, and complete CLI

		Files to create:

		```
		src/pdf_remote_converter/
		├── router.py # Provider selection and failover
		├── utils.py # Format detection, file utilities
		└── converter.py # High-level Converter facade (public API)
		```

		Files to modify:

		```
		src/pdf_remote_converter/cli/
		├── args.py # Add InputFile, OutputFile, Provider, ApiKey, Force args
		└── commands.py # Implement convert command
		```

		Provider selection logic:

		```python
		# router.py
		from pathlib import Path

		from pdf_remote_converter.config import ProviderSettings
		from pdf_remote_converter.exceptions import (
		    AllProvidersFailedError,
		    NoProviderConfiguredError,
		    ProviderNotConfiguredError,
		    ProviderUnavailableError,
		    QuotaExceededError,
		    RateLimitError,
		)
		from pdf_remote_converter.providers.base import ConversionResult, ProviderBackend

		# create_provider(name, settings) is a provider factory that this plan does not define yet.

		# Default provider preference order (hardcoded, configurable later)
		PROVIDER_PREFERENCE = ["cloudconvert", "adobe", "zamzar"]


		def get_configured_providers(settings: ProviderSettings) -> list[ProviderBackend]:
		    """Get list of providers that have API keys configured."""
		    configured = []
		    for name in PROVIDER_PREFERENCE:
		        if settings.get_api_key(name) or (name == "adobe" and settings.get_adobe_credentials()):
		            configured.append(create_provider(name, settings))
		    return configured


		def select_provider(
		    input_path: Path,
		    preferred: str | None,  # no default: a defaulted parameter cannot precede `settings`
		    settings: ProviderSettings,
		) -> ProviderBackend:
		    """Select best available provider with automatic failover.

		    Selection order:
		    1. If --provider explicitly specified, use that (error if not configured)
		    2. If only one provider has API key, use it automatically
		    3. If multiple providers configured, use first by preference order
		    """
		    configured = get_configured_providers(settings)

		    if not configured:
		        raise NoProviderConfiguredError("No API keys configured. Set at least one provider API key.")

		    if preferred:
		        # Explicit selection - must be configured
		        provider = next((p for p in configured if p.name == preferred), None)
		        if not provider:
		            raise ProviderNotConfiguredError(f"Provider '{preferred}' not configured. Set {preferred.upper()}_API_KEY.")
		        return provider

		    # Single provider: use it directly
		    if len(configured) == 1:
		        return configured[0]

		    # Multiple providers: use first by preference (failover handled by router)
		    return configured[0]


		def convert_with_failover(
		    input_path: Path,
		    output_path: Path,
		    providers: list[ProviderBackend],
		) -> ConversionResult:
		    """Convert with automatic failover to next provider on error.

		    Failover triggers:
		    - QuotaExceededError
		    - RateLimitError
		    - ProviderUnavailableError

		    Does NOT failover for:
		    - Invalid file format
		    - File too large
		    - Authentication errors (wrong API key)
		    """
		    errors = []
		    for provider in providers:
		        try:
		            return provider.convert(input_path, output_path)
		        except (QuotaExceededError, RateLimitError, ProviderUnavailableError) as e:
		            errors.append((provider.name, e))
		            continue  # Try next provider

		    raise AllProvidersFailedError(f"All providers failed: {errors}")
		```
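One way `exceptions.py` could encode the failover rule is with a shared base class for the errors that trigger failover. The sketch below is an assumption about layout, not the plan's actual module: `ConverterError`, `RetryableProviderError`, `AuthenticationError`, `UnsupportedFormatError`, and `should_failover` are hypothetical names, while the three failover errors and `AllProvidersFailedError` come from the router code above.

```python
class ConverterError(Exception):
    """Hypothetical base class for all package errors."""


class RetryableProviderError(ConverterError):
    """Hypothetical base for errors where trying the next provider makes sense."""


class QuotaExceededError(RetryableProviderError):
    """Provider quota is used up."""


class RateLimitError(RetryableProviderError):
    """Provider is rate-limiting requests."""


class ProviderUnavailableError(RetryableProviderError):
    """Provider cannot be reached or reports an outage."""


class UnsupportedFormatError(ConverterError):
    """Input format is not supported; another provider would not help."""


class AuthenticationError(ConverterError):
    """Credentials were rejected; surfaced to the user instead of failing over."""


class AllProvidersFailedError(ConverterError):
    """Every configured provider raised a retryable error."""


def should_failover(error: Exception) -> bool:
    """Return True only for errors that should trigger the next provider."""
    return isinstance(error, RetryableProviderError)
```

With this layout the router's `except` clause could catch `RetryableProviderError` alone, so adding a new failover trigger would not require touching the router.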

		High-level Converter API (converter.py):

		```python
		# Public facade for package usage
		class Converter:
		    """High-level converter API for programmatic usage."""

		    def __init__(
		        self,
		        provider: str | None = None,
		        api_key: str | None = None,
		        settings: ProviderSettings | None = None,
		    ):
		        """Initialize converter with optional provider override."""
		        self.settings = settings or ProviderSettings()
		        self.provider_override = provider
		        self.api_key_override = api_key

		    def convert(self, input_path: str | Path, output_path: str | Path) -> ConversionResult:
		        """Convert Office document to PDF with auto-selection and failover."""
		        input_path = Path(input_path)
		        output_path = Path(output_path)

		        # Get configured providers
		        providers = get_configured_providers(self.settings)

		        # Select provider (or use override)
		        if self.provider_override:
		            selected = select_provider(input_path, self.provider_override, self.settings)
		            providers = [selected]

		        # Convert with failover
		        return convert_with_failover(input_path, output_path, providers)
		```

		CLI command:

		```python
		@app.command()
		def convert(
		    input_file: args.InputFile,
		    output_file: args.OutputFile,
		    provider: args.Provider = None,
		    api_key: args.ApiKey = None,  # Override env var for selected provider
		    force: args.Force = False,  # Skip cache
		) -> None:
		    """Convert Office document to PDF."""
		```

		Validation:

		```bash
		uv run pdf-remote-converter convert document.docx output.pdf
		uv run pdf-remote-converter convert --provider adobe large.xlsx out.pdf
		uv run pdf-remote-converter convert --api-key YOUR_KEY --provider cloudconvert doc.docx out.pdf
		uv run pdf-remote-converter convert --force file.pptx output.pdf
		uv run pytest tests/ -v --cov
		```

		______________________________________________________________________

		### Phase 6: Caching & Quota Management

		Deliverable: Hishel HTTP caching with file-content-aware cache keys

		Files to modify:

		```
		src/pdf_remote_converter/
		└── providers/base.py # Add cache key generation (file hash)
		```

		Caching strategy:

		- Use hishel `SyncCacheClient` with `always_cache=True`
		- Hishel caches POST requests using body-sensitive keys (file content hashed automatically)
		- Cache key: provider + format + file content hash (hishel handles this via Vary header + body hashing)
		- Default TTL: 7 days (configurable via `default_ttl`)
		- Storage: SQLite (default) or file system
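
The cache key described above (provider + format + file content hash) can also be computed explicitly, for example for logging or for a converted-file cache next to the HTTP cache. This is a minimal sketch with the standard library only; `file_sha256`, `conversion_cache_key`, and the colon-separated key layout are hypothetical, and nothing here relies on hishel's internal key handling.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Hash the file contents in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def conversion_cache_key(provider: str, path: Path, target_format: str = "pdf") -> str:
    """Build a key from provider, source format, target format, and content hash."""
    source_format = path.suffix.lower().lstrip(".")
    return f"{provider}:{source_format}:{target_format}:{file_sha256(path)}"
```

Because the key depends on content rather than file name or modification time, two copies of the same document map to the same entry, while any edit to the file produces a new key.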

		Note on POST caching: CloudConvert/Adobe/Zamzar all use POST for conversion. Hishel supports body-sensitive caching for POST by using the request body in the cache key. Configure with:

		```python
		controller = Controller(
		    cacheable_methods=["GET", "POST"],
		    cacheable_status_codes=[200, 201],
		)
		```

		Implementation:

		```python
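		# http.py
		from pathlib import Path

		# NOTE: hishel import paths and class names are as drafted in this plan;
		# verify them against the pinned hishel release before implementing.
		from hishel import CacheOptions, SpecificationPolicy, Controller
		from hishel.httpx import SyncCacheClient
		from hishel.storages import SyncSqliteStorage

		DEFAULT_CACHE_DIR = Path("~/.cache/pdf-remote-converter").expanduser()


		def create_cached_client(cache_dir: Path | None = None) -> SyncCacheClient:
		    """Create HTTP client with RFC 9111 caching + POST body-sensitive caching."""
		    # Fall back to the default directory; `None / "file"` would raise TypeError.
		    cache_dir = cache_dir or DEFAULT_CACHE_DIR
		    cache_dir.mkdir(parents=True, exist_ok=True)
		    controller = Controller(
		        cacheable_methods=["GET", "POST"],  # Enable POST caching for API calls
		        cacheable_status_codes=[200, 201],
		    )
		    return SyncCacheClient(
		        storage=SyncSqliteStorage(cache_dir / "http_cache.sqlite"),
		        controller=controller,
		        policy=SpecificationPolicy(
		            cache_options=CacheOptions(always_cache=True)
		        ),
		    )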
		```

		Validation:

		```bash
		# First call hits API
		uv run pdf-remote-converter convert test.docx out.pdf
		# Second call should use cache (verify via logging or --verbose)
		uv run pdf-remote-converter convert test.docx out.pdf
		```

		______________________________________________________________________

		## Validation

		### Commands (run from project root)

		```bash
		# Lint and format
		uv run ruff check src/ tests/
		uv run ruff format --check src/ tests/

		# Tests with coverage
		uv run pytest tests/ -v --cov --cov-report=term-missing

		# CLI smoke tests
		uv run pdf-remote-converter --help
		uv run pdf-remote-converter convert --help
		uv run pdf-remote-converter convert test.docx output.pdf
		uv run pdf-remote-converter convert --provider cloudconvert test.xlsx out.pdf

		# Spell check
		uv run codespell src/ tests/
		```

		### Acceptance Criteria

		- [ ] All three providers (CloudConvert, Adobe, Zamzar) implement `ProviderBackend` protocol
		- [ ] Single API key configured → auto-select that provider (no --provider needed)
		- [ ] Multiple API keys → use preference order with automatic failover on quota/rate-limit errors
		- [ ] Hishel caching reduces API calls for identical files
		- [ ] API keys configurable via: CLI --api-key, function argument, or env var
		- [ ] Clear error messages for: quota exceeded, unsupported format, file too large, missing API key
		- [ ] 100% test coverage
		- [ ] All linting passes (ruff, codespell)
		- [ ] CLI `--help` is clear and complete
		- [ ] Package usable as Python library (not just CLI)

		______________________________________________________________________

		## Progress

		- [x] (2026-03-21) Created PLAN.md
		- [x] (2026-03-22) Phase 1: Core Infrastructure (config, exceptions, http, logging, base protocols)
		- [x] (2026-03-22) Phase 2: CloudConvert Provider
		- [x] (2026-03-22) Phase 3: Adobe PDF Services Provider
		- [x] (2026-03-22) Phase 4: Zamzar Provider
		- [x] (2026-03-22) Phase 5: Router & CLI Integration
		- [x] (2026-03-22) Phase 6: Caching (hishel HTTP caching, file hash utilities, --force support)

		______________________________________________________________________

		## Decisions

		| Decision | Choice | Rationale |
		|----------|--------|-----------|
		| HTTP Client | httpx + hishel | Required for caching, modern async support |
		| Caching | hishel with always_cache | APIs may not set proper cache headers |
		| Config | Pydantic Settings | Type-safe env var loading |
		| Provider protocol | Protocol class | Enables easy provider swapping |
		| Initial providers | CloudConvert, Adobe, Zamzar | User requested all three from start |
		| API key config | CLI > arg > env var | Flexible: ad-hoc CLI, explicit code, or global env |
		| Auto-selection | Single key → use it; Multiple → preference order | Zero-config when one provider, smart failover when multiple |
		| Failover triggers | QuotaExceeded, RateLimit, ProviderUnavailable | Transient errors only; auth/format errors don't failover |
		| Provider preference | cloudconvert → adobe → zamzar | Based on free tier limits and format support |

		______________________________________________________________________

		## Notes

		<!-- Add implementation notes during development -->

		### API Key Configuration

		API keys can be provided in three ways (priority order: CLI > argument > env var):

		1. CLI Option (--api-key)

		```bash
		pdf-remote-converter convert --provider cloudconvert --api-key YOUR_KEY doc.docx out.pdf
		```

		2. Function Argument (package usage)

		```python
		from pdf_remote_converter import Converter

		converter = Converter(
		    provider="cloudconvert",
		    api_key="YOUR_KEY",  # Explicit argument
		)
		converter.convert("doc.docx", "out.pdf")
		```

		3. Environment Variables

		```bash
		# Per-provider keys (used when no explicit key provided)
		CLOUDCONVERT_API_KEY=xxx
		ADOBE_CLIENT_ID=xxx
		ADOBE_CLIENT_SECRET=xxx
		ZAMZAR_API_KEY=xxx

		# Or generic key for current default provider
		PDF_REMOTE_CONVERTER_API_KEY=xxx
		```

		Environment Variables (all)

		```
		CLOUDCONVERT_API_KEY=xxx
		ADOBE_CLIENT_ID=xxx
		ADOBE_CLIENT_SECRET=xxx
		ZAMZAR_API_KEY=xxx
		PDF_REMOTE_CONVERTER_API_KEY=xxx # Fallback for current provider
		PDF_REMOTE_CONVERTER_CACHE_DIR=~/.cache/pdf-remote-converter
		PDF_REMOTE_CONVERTER_DEFAULT_PROVIDER=cloudconvert
		```

		### .env.example Template

		Create `.env.example` in project root with all configurable variables:

		```bash
		# PDF Remote Converter Configuration
		# Copy this file to .env and fill in your values

		# =============================================================================
		# Provider API Keys (at least one required)
		# =============================================================================

		# CloudConvert (10 conversions/day free tier)
		# Get your key at: https://cloudconvert.com/dashboard/api/v2/keys
		CLOUDCONVERT_API_KEY=

		# Adobe PDF Services (500 transactions/month free tier)
		# Get credentials at: https://developer.adobe.com/console
		# Note: Adobe uses client_id + client_secret pair (NOT a single API key)
		ADOBE_CLIENT_ID=
		ADOBE_CLIENT_SECRET=

		# Zamzar (100 credits/month free tier, 1MB limit)
		# Get your key at: https://developers.zamzar.com/
		ZAMZAR_API_KEY=

		# Generic fallback API key (used if provider-specific key not set)
		PDF_REMOTE_CONVERTER_API_KEY=

		# =============================================================================
		# Application Settings
		# =============================================================================

		# Default provider: cloudconvert, adobe, or zamzar
		# Default: cloudconvert
		PDF_REMOTE_CONVERTER_DEFAULT_PROVIDER=cloudconvert

		# Cache directory for HTTP responses and converted files
		# Default: ~/.cache/pdf-remote-converter
		PDF_REMOTE_CONVERTER_CACHE_DIR=
		```

		### File Format Detection

		Use `filetype` or `python-magic` for format detection if extension unreliable.
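
As a dependency-free alternative to `filetype` or `python-magic`, the sketch below checks the extension and then the file's leading bytes using only the standard library. It is a swapped-in technique, not the libraries named above. It relies on OOXML files (`docx`, `xlsx`, `pptx`) being ZIP archives and on legacy `doc`, `xls`, `ppt` files being OLE2 compound documents. `detect_format` and `UnsupportedFormatError` are hypothetical names.

```python
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # OLE2 compound file header

# Expected container signature per supported extension.
EXPECTED_MAGIC = {
    "docx": ZIP_MAGIC, "xlsx": ZIP_MAGIC, "pptx": ZIP_MAGIC,
    "doc": OLE2_MAGIC, "xls": OLE2_MAGIC, "ppt": OLE2_MAGIC,
}


class UnsupportedFormatError(ValueError):
    """Hypothetical error for unknown extensions or mismatched file contents."""


def detect_format(path: Path) -> str:
    """Return the lowercase extension if it is supported and matches the file header."""
    ext = path.suffix.lower().lstrip(".")
    expected = EXPECTED_MAGIC.get(ext)
    if expected is None:
        raise UnsupportedFormatError(f"unsupported extension: {path.suffix!r}")
    with path.open("rb") as fh:
        header = fh.read(len(OLE2_MAGIC))
    if not header.startswith(expected):
        raise UnsupportedFormatError(f"{path.name} does not look like a .{ext} file")
    return ext
```

This only confirms the container type; it cannot tell a `docx` from an `xlsx` by content alone, which is where a dedicated detection library would still help.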

		### SDK Verification Checklist

		Before starting each provider phase, verify SDK availability:

		| Provider | Package | Status | Fallback |
		|----------|---------|--------|----------|
		| CloudConvert | `cloudconvert` | Check PyPI | httpx + REST API |
		| Adobe | `pdfservices-sdk>=4.0.0` | ✅ Verified | httpx + REST API |
		| Zamzar | `zamzar` | To verify | httpx + REST API |

		### Logging

		Use standard `logging` module with structured output:

		```
		2026-03-21 10:30:15 [INFO] pdf_remote_converter.router: Using provider 'cloudconvert' (1 of 3 configured)
		2026-03-21 10:30:16 [DEBUG] pdf_remote_converter.http: Cache MISS for POST https://api.cloudconvert.com/...
		2026-03-21 10:30:20 [INFO] pdf_remote_converter.providers.cloudconvert: Job completed, downloaded 2.3MB PDF
		```

		### Package Usage (Public API)

		The library should be usable as a Python package, not just CLI:

		```python
		from pathlib import Path

		from pdf_remote_converter import Converter
		from pdf_remote_converter.providers import CloudConvertProvider

		# Simple usage with env var credentials
		converter = Converter()
		result = converter.convert("document.docx", "output.pdf")

		# Explicit provider and credentials
		converter = Converter(
		    provider="cloudconvert",
		    api_key="YOUR_KEY",
		)
		result = converter.convert("document.docx", "output.pdf")

		# Direct provider instantiation
		provider = CloudConvertProvider(api_key="YOUR_KEY")
		result = provider.convert(Path("document.docx"), Path("output.pdf"))
		print(f"Used {result.credits_used} credits, from_cache={result.from_cache}")
		```