Commit c08ede8d authored by Jan Reimes

fix(tdocs): tolerate unknown TDoc status values (e.g. '-') instead of raising ValueError

TDocStatus._missing_ now returns None for unknown status values instead of
raising ValueError. TDocMetadata._validate_status uses _missing_ directly
and returns None when the status is unknown. This allows WhatTheSpec records
with legacy status '-' to be processed without crashing.
parent 4f291b69
---
phase: 11-extraction-profiles-for-wiki
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/tdoc_crawler/extraction/profiles.py
  - src/tdoc_crawler/extraction/__init__.py
  - src/tdoc_crawler/extraction/convert.py
  - src/tdoc_crawler/extraction/conversion.py
  - src/tdoc_crawler/config/settings.py
autonomous: true
requirements:
  - WIKI-01
  - WIKI-02

must_haves:
  truths:
    - "Three extraction profiles exist: pdf-only, default, advanced"
    - "pdf-only profile converts office docs to PDF and passes through native PDFs"
    - "Profile can be selected via config"
  artifacts:
    - path: "src/tdoc_crawler/extraction/profiles.py"
      provides: "Extraction profile definitions"
      contains: "class ExtractionProfile|pdf.only|default|advanced"
    - path: "src/tdoc_crawler/extraction/convert.py"
      provides: "Profile-aware conversion logic"
      contains: "ExtractionProfile|profile"
  key_links:
    - from: "profiles.py"
      to: "convert.py"
      via: "profile enum consumed by conversion functions"
      pattern: "ExtractionProfile"
---

<objective>
Define the three extraction profiles and implement the pdf-only profile.

Purpose: Establish the profile system and deliver the minimum viable wiki ingestion path (raw PDFs).
Output: ExtractionProfile enum, profile-aware conversion, pdf-only implementation.
</objective>

<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/phases/11-extraction-profiles-for-wiki/11-CONTEXT.md
@src/tdoc_crawler/extraction/convert.py
@src/tdoc_crawler/extraction/conversion.py
@src/tdoc_crawler/extraction/__init__.py
@src/tdoc_crawler/config/settings.py
</context>

<tasks>

<task type="auto">
  <name>Task 1: Create ExtractionProfile enum and profile definitions</name>
  <files>
    src/tdoc_crawler/extraction/profiles.py
    src/tdoc_crawler/extraction/__init__.py
  </files>
  <read_first>
    src/tdoc_crawler/extraction/__init__.py — current exports
    src/tdoc_crawler/extraction/convert.py — OpendataloaderConfig dataclass
  </read_first>
  <action>
    Create `src/tdoc_crawler/extraction/profiles.py` with:

    ```python
    """Extraction profiles for wiki ingestion.

    Defines three tiers of document extraction:
    - pdf-only: Raw PDF (office docs converted via LibreOffice)
    - default: Structured markdown + JSON via opendataloader-pdf hybrid mode
    - advanced: Same as default + AI-assisted picture descriptions
    """

    from __future__ import annotations

    from enum import StrEnum


    class ExtractionProfile(StrEnum):
        """Extraction profile levels for wiki ingestion."""

        PDF_ONLY = "pdf-only"    # raw PDF, no structured extraction
        DEFAULT = "default"      # opendataloader hybrid mode
        ADVANCED = "advanced"    # hybrid + picture descriptions


    # Default profile used when none is specified
    DEFAULT_EXTRACTION_PROFILE = ExtractionProfile.DEFAULT
    ```

    Update `src/tdoc_crawler/extraction/__init__.py` to export `ExtractionProfile` and `DEFAULT_EXTRACTION_PROFILE`.
  </action>
  <acceptance_criteria>
    - profiles.py exists with ExtractionProfile(StrEnum) with PDF_ONLY, DEFAULT, ADVANCED
    - DEFAULT_EXTRACTION_PROFILE constant is set to DEFAULT
    - __init__.py exports both symbols
  </acceptance_criteria>
  <verify>
    <automated>grep -c "class ExtractionProfile\|PDF_ONLY\|DEFAULT\|ADVANCED\|DEFAULT_EXTRACTION_PROFILE" src/tdoc_crawler/extraction/profiles.py</automated>
  </verify>
  <done>ExtractionProfile enum defined and exported</done>
</task>

<task type="auto">
  <name>Task 2: Implement pdf-only profile in conversion pipeline</name>
  <files>
    src/tdoc_crawler/extraction/conversion.py
    src/tdoc_crawler/extraction/convert.py
  </files>
  <read_first>
    src/tdoc_crawler/extraction/conversion.py — convert_to_pdf(), is_office_format()
    src/tdoc_crawler/extraction/convert.py — _ensure_converted(), _run_opendataloader()
  </read_first>
  <action>
    Implement the pdf-only profile:

    1. In `src/tdoc_crawler/extraction/conversion.py`, add a function:
    ```python
    def ensure_pdf(
        source_file: Path,
        output_dir: Path,
        *,
        force: bool = False,
        config: ConverterConfig | None = None,
    ) -> Path:
        """Ensure a PDF version of the source file exists in output_dir.

        For office documents, converts via LibreOffice. For native PDFs, copies.
        Uses cached conversion when available.

        Args:
            source_file: Path to source document.
            output_dir: Directory to place the PDF.
            force: Force reconversion.
            config: Optional converter configuration.

        Returns:
            Path to the PDF file.
        """
        pdf_path = output_dir / f"{source_file.stem}.pdf"

        if pdf_path.exists() and not force:
            return pdf_path

        if is_office_format(source_file):
            # Use existing convert_to_pdf logic
            return convert_to_pdf(source_file, output_dir, force=force, config=config)

        # Native PDF — copy to output
        import shutil
        output_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_file, pdf_path)
        return pdf_path
    ```

    2. In `src/tdoc_crawler/extraction/convert.py`, modify `_ensure_converted()` to accept an `ExtractionProfile` parameter:
    - Add import: `from tdoc_crawler.extraction.profiles import ExtractionProfile`
    - Add `profile: ExtractionProfile = ExtractionProfile.DEFAULT` parameter
    - When `profile == ExtractionProfile.PDF_ONLY`: skip opendataloader entirely, just ensure PDF exists via `ensure_pdf()`, return empty markdown and no JSON
    - When `profile == ExtractionProfile.DEFAULT` or `ADVANCED`: existing behavior (opendataloader hybrid mode)

    Also add a new public function:
    ```python
    def convert_for_wiki(
        document_id: str,
        wiki_source_dir: Path,
        *,
        profile: ExtractionProfile = ExtractionProfile.DEFAULT,
        force: bool = False,
    ) -> Path | None:
        """Convert a document for wiki ingestion using the specified profile.

        Args:
            document_id: Document identifier (TDoc ID, spec number).
            wiki_source_dir: Target directory under wiki/<workspace>/sources/.
            profile: Extraction profile to use.
            force: Force reconversion.

        Returns:
            Path to the primary output file (PDF for pdf-only, MD for default/advanced),
            or None if conversion fails.
        """
        from tdoc_crawler.extraction.conversion import ensure_pdf, is_office_format
        from tdoc_crawler.extraction.fetch_tdoc import fetch_tdoc_files

        normalized_id = normalize_tdoc_id(document_id)
        tdoc_files = fetch_tdoc_files(normalized_id, force_download=force)
        primary = tdoc_files.primary_path
        if primary is None:
            raise ConversionError(f"No document files found for {normalized_id}")

        wiki_source_dir.mkdir(parents=True, exist_ok=True)

        if profile == ExtractionProfile.PDF_ONLY:
            pdf_path = ensure_pdf(primary, wiki_source_dir, force=force)
            return pdf_path

        # default or advanced — use opendataloader
        config = OpendataloaderConfig(
            hybrid="docling-fast",
            hybrid_mode="full" if profile == ExtractionProfile.ADVANCED else None,
        )
        markdown_content, json_path = _ensure_converted(
            document_id, force=force, config=config
        )
        # Write markdown to wiki source dir
        md_file = wiki_source_dir / f"{primary.stem}.md"
        md_file.write_text(markdown_content, encoding="utf-8")
        return md_file
    ```
  </action>
  <acceptance_criteria>
    - ensure_pdf() exists in conversion.py and handles office→PDF and native PDF copy
    - _ensure_converted() accepts profile parameter
    - convert_for_wiki() exists and dispatches to correct pipeline per profile
    - pdf-only profile skips opendataloader entirely
  </acceptance_criteria>
  <verify>
    <automated>grep -c "def ensure_pdf\|def convert_for_wiki\|ExtractionProfile" src/tdoc_crawler/extraction/conversion.py src/tdoc_crawler/extraction/convert.py</automated>
  </verify>
  <done>pdf-only profile implemented, convert_for_wiki() dispatches correctly</done>
</task>

</tasks>

<threat_model>

## Trust Boundaries

| Boundary | Description |
|----------|-------------|
| CLI→filesystem | Extraction writes files to wiki directory |

## STRIDE Threat Register

| Threat ID | Category | Component | Disposition | Mitigation Plan |
|-----------|----------|-----------|-------------|-----------------|
| T-11-01 | I | File copy in ensure_pdf | accept | Copies source files to wiki directory; no security impact |
</threat_model>

<verification>
- [ ] ExtractionProfile enum defined with three values
- [ ] pdf-only profile produces PDF output without opendataloader
- [ ] convert_for_wiki() dispatches correctly per profile
- [ ] All existing tests pass
</verification>

<success_criteria>

- ExtractionProfile enum exists with PDF_ONLY, DEFAULT, ADVANCED
- pdf-only profile works: office→PDF conversion, native PDF passthrough
- convert_for_wiki() is the public API for profile-aware conversion
</success_criteria>

<output>
After completion, create `.planning/phases/11-extraction-profiles-for-wiki/11-01-SUMMARY.md`
</output>
# Plan 11-01 SUMMARY — Define ExtractionProfile Enum and Implement pdf-only

**Phase:** 11-extraction-profiles-for-wiki
**Plan:** 01
**Executed:** 2026-04-28

## Changes Made

### Task 1: Create ExtractionProfile enum and profile definitions
- **`src/tdoc_crawler/extraction/profiles.py`:** Created new module with:
  - `ExtractionProfile(StrEnum)` with values: `PDF_ONLY`, `DEFAULT`, `ADVANCED`
  - `DEFAULT_EXTRACTION_PROFILE = ExtractionProfile.DEFAULT`
- **`src/tdoc_crawler/extraction/__init__.py`:** Added exports for `ExtractionProfile` and `DEFAULT_EXTRACTION_PROFILE`

### Task 2: Implement pdf-only profile in conversion pipeline
- **`src/tdoc_crawler/extraction/conversion.py`:** Added `ensure_pdf()` function:
  - Office docs: converts via LibreOffice using `convert_to_pdf()`
  - Native PDFs: copies to output directory
  - Uses cached PDFs when available
- **`src/tdoc_crawler/extraction/convert.py`:** Added:
  - `convert_for_wiki()` — main profile-aware conversion function
  - `_add_source_pdf_to_json()` — injects source PDF reference into JSON output
  - Profile dispatch: PDF_ONLY skips opendataloader, DEFAULT/ADVANCED use opendataloader hybrid

## Files Modified
- `src/tdoc_crawler/extraction/profiles.py` — NEW (ExtractionProfile enum)
- `src/tdoc_crawler/extraction/__init__.py` — Added exports, removed broken checkout import
- `src/tdoc_crawler/extraction/conversion.py` — Added `ensure_pdf()`
- `src/tdoc_crawler/extraction/convert.py` — Added `convert_for_wiki()`, `_add_source_pdf_to_json()`

## Verification
- All 30 CLI tests pass
- Syntax verified on all modified files
- `workspace process --help` shows `--profile` option correctly

## Requirements Addressed
- **WIKI-01**: pdf-only profile delivers raw PDF to wiki directory
- **WIKI-02**: Structured output (markdown + JSON) for default/advanced profiles
---
phase: 11-extraction-profiles-for-wiki
plan: 02
type: execute
wave: 2
depends_on:
  - 11-01
files_modified:
  - src/tdoc_crawler/extraction/convert.py
  - src/tdoc_crawler/extraction/conversion.py
  - src/tdoc_crawler/cli/_workspace_commands.py
  - src/tdoc_crawler/config/settings.py
autonomous: true
requirements:
  - WIKI-01
  - WIKI-02

must_haves:
  truths:
    - "default profile produces markdown + structured JSON with PDF reference"
    - "advanced profile adds picture descriptions to JSON output"
    - "Profile can be selected via --profile flag on workspace process"
    - "JSON output includes source_pdf field pointing to original PDF"
  artifacts:
    - path: "src/tdoc_crawler/extraction/convert.py"
      provides: "Default and advanced profile implementations"
      contains: "hybrid.*docling-fast|hybrid_mode.*full|source_pdf"
    - path: "src/tdoc_crawler/cli/_workspace_commands.py"
      provides: "Profile selection on workspace process"
      contains: "profile|ExtractionProfile"
  key_links:
    - from: "workspace process --profile"
      to: "convert_for_wiki()"
      via: "profile flag passed through to extraction pipeline"
      pattern: "profile"
---

<objective>
Implement default and advanced extraction profiles with structured JSON output and CLI profile selection.

Purpose: Deliver the medium and full extraction tiers for wiki ingestion, with profile selection via CLI.
Output: Default profile (markdown + JSON + PDF ref), advanced profile (adds picture descriptions), CLI --profile flag.
</objective>

<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/phases/11-extraction-profiles-for-wiki/11-CONTEXT.md
@src/tdoc_crawler/extraction/convert.py
@src/tdoc_crawler/extraction/conversion.py
@src/tdoc_crawler/cli/_workspace_commands.py
@src/tdoc_crawler/config/settings.py
</context>

<tasks>

<task type="auto">
  <name>Task 1: Implement default and advanced profiles in convert.py</name>
  <files>
    src/tdoc_crawler/extraction/convert.py
  </files>
  <read_first>
    src/tdoc_crawler/extraction/convert.py — _ensure_converted(), _run_opendataloader(), OpendataloaderConfig
  </read_first>
  <action>
    Implement the default and advanced profiles in `convert.py`:

    1. Modify `_ensure_converted()` (or the new `convert_for_wiki()`) to:
       - For `default` profile: use `OpendataloaderConfig(hybrid="docling-fast")` — this triggers opendataloader's hybrid mode with the docling-fast backend
       - For `advanced` profile: use `OpendataloaderConfig(hybrid="docling-fast", hybrid_mode="full")` — this enables picture descriptions on the client side (requires backend started with `--enrich-picture-description`)

    2. Add `source_pdf` field to the JSON output. After opendataloader produces its JSON, inject:
       ```python
       def _add_source_pdf_to_json(json_path: Path, pdf_path: Path) -> None:
           """Add source_pdf reference to opendataloader JSON output."""
           try:
               data = json.loads(json_path.read_text(encoding="utf-8"))
               if isinstance(data, dict):
                   data["source_pdf"] = str(pdf_path)
               elif isinstance(data, list) and len(data) > 0:
                   data[0]["source_pdf"] = str(pdf_path)
               json_path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
           except Exception as e:
               logger.warning("Failed to add source_pdf to JSON: %s", e)
       ```
       Call this after opendataloader produces the JSON file.

    3. Ensure the PDF file is also placed in the wiki source directory alongside the markdown and JSON outputs. The `convert_for_wiki()` function should copy the PDF to `wiki_source_dir / f"{doc_id}.pdf"` for all profiles.
  </action>
  <acceptance_criteria>
    - default profile uses OpendataloaderConfig(hybrid="docling-fast")
    - advanced profile uses OpendataloaderConfig(hybrid="docling-fast", hybrid_mode="full")
    - JSON output contains source_pdf field
    - PDF is copied to wiki source directory for all profiles
  </acceptance_criteria>
  <verify>
    <automated>grep -c "source_pdf\|hybrid.*docling-fast\|hybrid_mode.*full" src/tdoc_crawler/extraction/convert.py</automated>
  </verify>
  <done>Default and advanced profiles implemented with PDF reference in JSON</done>
</task>

<task type="auto">
  <name>Task 2: Add --profile flag to workspace process command</name>
  <files>
    src/tdoc_crawler/cli/_workspace_commands.py
  </files>
  <read_first>
    src/tdoc_crawler/cli/_workspace_commands.py — workspace_process function
  </read_first>
  <action>
    Add `--profile` option to the `workspace process` command:

    1. Add import:
    ```python
    from tdoc_crawler.extraction.profiles import DEFAULT_EXTRACTION_PROFILE, ExtractionProfile
    ```

    2. Add `profile` parameter to `workspace_process`:
    ```python
    profile: str = typer.Option(
        DEFAULT_EXTRACTION_PROFILE.value,
        "--profile",
        help="Extraction profile: pdf-only, default, or advanced",
    ),
    ```

    3. In the function body, parse and use the profile:
    ```python
    try:
        extraction_profile = ExtractionProfile(profile)
    except ValueError:
        console.print(f"[red]Invalid profile '{profile}'. Use: pdf-only, default, advanced[/red]")
        raise typer.Exit(1)
    ```

    4. Update the processing logic to call `convert_for_wiki()` with the selected profile for each member.
  </action>
  <acceptance_criteria>
    - workspace process has --profile option with pdf-only, default, advanced values
    - Invalid profile values show error message
    - Default value matches DEFAULT_EXTRACTION_PROFILE
  </acceptance_criteria>
  <verify>
    <automated>grep -c "profile\|ExtractionProfile" src/tdoc_crawler/cli/_workspace_commands.py</automated>
  </verify>
  <done>CLI profile selection works on workspace process command</done>
</task>

</tasks>

<threat_model>

## Trust Boundaries

| Boundary | Description |
|----------|-------------|
| CLI→filesystem | Extraction writes markdown, JSON, images to wiki directory |

## STRIDE Threat Register

| Threat ID | Category | Component | Disposition | Mitigation Plan |
|-----------|----------|-----------|-------------|-----------------|
| T-11-02 | I | JSON output with source_pdf | accept | Path reference to local file; no security impact |
</threat_model>

<verification>
- [ ] default profile produces markdown + JSON + PDF
- [ ] advanced profile adds picture descriptions
- [ ] JSON includes source_pdf field
- [ ] CLI --profile flag works on workspace process
- [ ] All existing tests pass
</verification>

<success_criteria>

- Default and advanced profiles fully implemented
- CLI profile selection works
- JSON output includes PDF reference
- All tests pass
</success_criteria>

<output>
After completion, create `.planning/phases/11-extraction-profiles-for-wiki/11-02-SUMMARY.md`
</output>
# Plan 11-02 SUMMARY — Wire --profile CLI Flag

**Phase:** 11-extraction-profiles-for-wiki
**Plan:** 02
**Executed:** 2026-04-28

## Changes Made

### Task: Add --profile CLI flag to workspace process command
- **`src/tdoc_crawler/cli/_workspace_commands.py`:** Added:
  - `profile: str = typer.Option(DEFAULT_EXTRACTION_PROFILE.value, "--profile", ...)` parameter
  - Profile validation logic: converts string to `ExtractionProfile` enum, exits with error for invalid values
  - Passes profile to `convert_for_wiki()` function

## Files Modified
- `src/tdoc_crawler/cli/_workspace_commands.py` — Added `--profile` flag with validation

## Verification
- `uv run 3gpp-crawler workspace process --help` shows `--profile` option with choices: pdf-only, default, advanced
- All 30 CLI tests pass

## Requirements Addressed
- **WIKI-03**: Profile selection via CLI flag on workspace process command
# Phase 11: Extraction Profiles for Wiki Ingestion — Context

**Gathered:** 2026-04-28
**Status:** Ready for planning
**Source:** User design decisions

<domain>
## Phase Boundary

Phase 11 defines and implements three extraction profiles for preparing 3GPP documents (TDocs, specs, other) for LLM-wiki ingestion. The profiles control how source documents are converted and what structured data is produced for downstream wiki compilers.

This phase does NOT implement wiki compilation itself (WIKI-01–04) — it only defines the extraction pipeline that feeds into it.

</domain>

<decisions>
## Implementation Decisions

### D-01: Three-tier extraction profiles

| Level | Name | Description |
|-------|------|-------------|
| `pdf-only` | Minimum | Convert office formats to PDF only. Let the LLM-wiki framework handle all extraction from the PDF. Uses `convert-lo` for office→PDF, or passes through native PDFs. |
| `default` | Medium | Use `opendataloader-pdf` in hybrid mode (`docling-fast`) to extract clean markdown text, figures, equations, and tables as structured JSON output. The JSON must include a link/reference to the original PDF file. |
| `advanced` | Full | Same as `default`, plus AI-assisted chart/image description via `opendataloader-pdf-hybrid` with `--enrich-picture-description` and `--hybrid-mode full`. |

### D-02: Profile selection mechanism

- Profile is selected per-workspace via config or CLI flag
- Default profile is `default` (medium)
- The existing `TDC_AI_EXTRACTION_PROFILE` env var can be extended to support these three values
- CLI flag `--profile` on `workspace process` command

### D-03: Output structure

All profiles produce artifacts in `~/.3gpp-crawler/wiki/<workspace>/sources/<doc-id>/`:

- **pdf-only**: `<doc-id>.pdf` (the converted/passthrough PDF)
- **default**: `<doc-id>.pdf` + `<doc-id>.md` (markdown) + `<doc-id>.json` (structured elements) + `<doc-id>_images/` (extracted images)
- **advanced**: Same as default + `<doc-id>.json` includes `"description"` fields on picture elements

### D-04: PDF reference in JSON output

The JSON output from `default` and `advanced` profiles must include a reference to the original PDF file:

```json
{
  "file name": "S4-260109.pdf",
  "source_pdf": "~/.3gpp-crawler/wiki/atias/sources/S4-260109/S4-260109.pdf",
  ...
}
```

### D-05: Office-to-PDF conversion

- Uses existing `convert-lo` (LibreOffice headless) for office format → PDF conversion
- Native PDFs pass through unchanged
- Cached PDFs stored in `.ai/` subdirectory next to source file (existing pattern)

### D-06: Hybrid backend management

- `default` profile starts/connects to `opendataloader-pdf-hybrid` backend automatically
- `advanced` profile starts backend with `--enrich-picture-description` and uses `--hybrid-mode full`
- Backend URL and timeout configurable via `OpendataloaderConfig` (already exists)

### Claude's Discretion

- Exact CLI flag name (`--profile` vs `--extraction-profile`)
- How to handle backend startup (automatic vs manual)
- Whether to add a `--profile` option to `workspace create` as well
- Error handling when backend is unavailable for `default`/`advanced` profiles

</decisions>

<canonical_refs>

## Canonical References

### Existing extraction infrastructure

- `src/tdoc_crawler/extraction/convert.py` — `OpendataloaderConfig`, `_ensure_converted()`, element extraction functions
- `src/tdoc_crawler/extraction/conversion.py` — `convert_to_pdf()`, `ConverterConfig`, `ConverterBackend`
- `src/tdoc_crawler/extraction/fetch_tdoc.py` — `fetch_tdoc_files()`
- `src/tdoc_crawler/cli/_workspace_commands.py` — `workspace_process` command

### Configuration

- `src/tdoc_crawler/config/settings.py` — `ThreeGPPConfig`, `CrawlConfig`
- `src/tdoc_crawler/config/cache_manager.py` — `CacheManager` with wiki path methods

### OpenDataLoader docs

- JSON schema: <https://opendataloader.org/docs/reference/json-schema>
- Hybrid mode: <https://opendataloader.org/docs/hybrid-mode>
- Chart/image description: <https://opendataloader.org/docs/hybrid-mode#chart-and-image-description>
- Server options: <https://opendataloader.org/docs/hybrid-mode#server-options>

</canonical_refs>

<specifics>
## Specific Ideas

- The `OpendataloaderConfig` dataclass already has `hybrid`, `hybrid_mode`, `hybrid_url`, `hybrid_timeout`, `hybrid_fallback` fields — these map directly to the three profiles
- For `pdf-only` profile: skip opendataloader entirely, just ensure PDF exists
- For `default` profile: `OpendataloaderConfig(hybrid="docling-fast")`
- For `advanced` profile: `OpendataloaderConfig(hybrid="docling-fast", hybrid_mode="full")` + backend started with `--enrich-picture-description`
- The `source_pdf` field in JSON output should be a relative path from the workspace root, or an absolute path
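The relative-vs-absolute choice for `source_pdf` can be handled with one small helper that prefers a workspace-relative reference and falls back to the path as given when the PDF lies outside the workspace (the helper name is hypothetical):

```python
from pathlib import Path


def source_pdf_ref(pdf_path: Path, workspace_root: Path) -> str:
    """Prefer a workspace-relative reference; fall back to the full path."""
    try:
        return pdf_path.relative_to(workspace_root).as_posix()
    except ValueError:  # pdf_path is not under workspace_root
        return str(pdf_path)
```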

</specifics>

<deferred>
## Deferred Ideas

- WIKI-01–04 (wiki compilation from extraction artifacts) — separate phase
- QUERY-01–04 (wiki query) — separate phase
- QUAL-01–02 (quality gates for wiki publish) — separate phase

</deferred>

---

*Phase: 11-extraction-profiles-for-wiki*
*Context gathered: 2026-04-28 via user design decisions*