Commit c08ede8d authored by Jan Reimes

fix(tdocs): tolerate unknown TDoc status values (e.g. '-') instead of raising ValueError

TDocStatus._missing_ now returns None for unknown status values instead of
raising ValueError. TDocMetadata._validate_status uses _missing_ directly
and returns None when the status is unknown. This allows WhatTheSpec records
with legacy status '-' to be processed without crashing.
parent 4f291b69
---
phase: 11-extraction-profiles-for-wiki
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/tdoc_crawler/extraction/profiles.py
  - src/tdoc_crawler/extraction/__init__.py
  - src/tdoc_crawler/extraction/convert.py
  - src/tdoc_crawler/extraction/conversion.py
  - src/tdoc_crawler/config/settings.py
autonomous: true
requirements:
  - WIKI-01
  - WIKI-02

must_haves:
  truths:
    - "Three extraction profiles exist: pdf-only, default, advanced"
    - "pdf-only profile converts office docs to PDF and passes through native PDFs"
    - "Profile can be selected via config"
  artifacts:
    - path: "src/tdoc_crawler/extraction/profiles.py"
      provides: "Extraction profile definitions"
      contains: "class ExtractionProfile|pdf.only|default|advanced"
    - path: "src/tdoc_crawler/extraction/convert.py"
      provides: "Profile-aware conversion logic"
      contains: "ExtractionProfile|profile"
  key_links:
    - from: "profiles.py"
      to: "convert.py"
      via: "profile enum consumed by conversion functions"
      pattern: "ExtractionProfile"
---

<objective>
Define the three extraction profiles and implement the pdf-only profile.

Purpose: Establish the profile system and deliver the minimum viable wiki ingestion path (raw PDFs).
Output: ExtractionProfile enum, profile-aware conversion, pdf-only implementation.
</objective>

<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/phases/11-extraction-profiles-for-wiki/11-CONTEXT.md
@src/tdoc_crawler/extraction/convert.py
@src/tdoc_crawler/extraction/conversion.py
@src/tdoc_crawler/extraction/__init__.py
@src/tdoc_crawler/config/settings.py
</context>

<tasks>

<task type="auto">
  <name>Task 1: Create ExtractionProfile enum and profile definitions</name>
  <files>
    src/tdoc_crawler/extraction/profiles.py
    src/tdoc_crawler/extraction/__init__.py
  </files>
  <read_first>
    src/tdoc_crawler/extraction/__init__.py — current exports
    src/tdoc_crawler/extraction/convert.py — OpendataloaderConfig dataclass
  </read_first>
  <action>
    Create `src/tdoc_crawler/extraction/profiles.py` with:

    ```python
    """Extraction profiles for wiki ingestion.

    Defines three tiers of document extraction:
    - pdf-only: Raw PDF (office docs converted via LibreOffice)
    - default: Structured markdown + JSON via opendataloader-pdf hybrid mode
    - advanced: Same as default + AI-assisted picture descriptions
    """

    from __future__ import annotations

    from enum import StrEnum


    class ExtractionProfile(StrEnum):
        """Extraction profile levels for wiki ingestion."""

        PDF_ONLY = "pdf-only"    # raw PDF, no structured extraction
        DEFAULT = "default"      # opendataloader hybrid mode
        ADVANCED = "advanced"    # hybrid + picture descriptions


    # Default profile used when none is specified
    DEFAULT_EXTRACTION_PROFILE = ExtractionProfile.DEFAULT
    ```

    Update `src/tdoc_crawler/extraction/__init__.py` to export `ExtractionProfile` and `DEFAULT_EXTRACTION_PROFILE`.
  </action>
  <acceptance_criteria>
    - profiles.py exists with ExtractionProfile(StrEnum) with PDF_ONLY, DEFAULT, ADVANCED
    - DEFAULT_EXTRACTION_PROFILE constant is set to DEFAULT
    - __init__.py exports both symbols
  </acceptance_criteria>
  <verify>
    <automated>grep -c "class ExtractionProfile\|PDF_ONLY\|DEFAULT\|ADVANCED\|DEFAULT_EXTRACTION_PROFILE" src/tdoc_crawler/extraction/profiles.py</automated>
  </verify>
  <done>ExtractionProfile enum defined and exported</done>
</task>

<task type="auto">
  <name>Task 2: Implement pdf-only profile in conversion pipeline</name>
  <files>
    src/tdoc_crawler/extraction/conversion.py
    src/tdoc_crawler/extraction/convert.py
  </files>
  <read_first>
    src/tdoc_crawler/extraction/conversion.py — convert_to_pdf(), is_office_format()
    src/tdoc_crawler/extraction/convert.py — _ensure_converted(), _run_opendataloader()
  </read_first>
  <action>
    Implement the pdf-only profile:

    1. In `src/tdoc_crawler/extraction/conversion.py`, add a function:
    ```python
    def ensure_pdf(
        source_file: Path,
        output_dir: Path,
        *,
        force: bool = False,
        config: ConverterConfig | None = None,
    ) -> Path:
        """Ensure a PDF version of the source file exists in output_dir.

        For office documents, converts via LibreOffice. For native PDFs, copies.
        Uses cached conversion when available.

        Args:
            source_file: Path to source document.
            output_dir: Directory to place the PDF.
            force: Force reconversion.
            config: Optional converter configuration.

        Returns:
            Path to the PDF file.
        """
        pdf_path = output_dir / f"{source_file.stem}.pdf"

        if pdf_path.exists() and not force:
            return pdf_path

        if is_office_format(source_file):
            # Use existing convert_to_pdf logic
            return convert_to_pdf(source_file, output_dir, force=force, config=config)

        # Native PDF — copy to output
        import shutil
        output_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source_file, pdf_path)
        return pdf_path
    ```

    2. In `src/tdoc_crawler/extraction/convert.py`, modify `_ensure_converted()` to accept an `ExtractionProfile` parameter:
    - Add import: `from tdoc_crawler.extraction.profiles import ExtractionProfile`
    - Add `profile: ExtractionProfile = ExtractionProfile.DEFAULT` parameter
    - When `profile == ExtractionProfile.PDF_ONLY`: skip opendataloader entirely, just ensure PDF exists via `ensure_pdf()`, return empty markdown and no JSON
    - When `profile == ExtractionProfile.DEFAULT` or `ADVANCED`: existing behavior (opendataloader hybrid mode)

    Also add a new public function:
    ```python
    def convert_for_wiki(
        document_id: str,
        wiki_source_dir: Path,
        *,
        profile: ExtractionProfile = ExtractionProfile.DEFAULT,
        force: bool = False,
    ) -> Path | None:
        """Convert a document for wiki ingestion using the specified profile.

        Args:
            document_id: Document identifier (TDoc ID, spec number).
            wiki_source_dir: Target directory under wiki/<workspace>/sources/.
            profile: Extraction profile to use.
            force: Force reconversion.

        Returns:
            Path to the primary output file (PDF for pdf-only, MD for default/advanced),
            or None if conversion fails.
        """
        from tdoc_crawler.extraction.conversion import ensure_pdf, is_office_format
        from tdoc_crawler.extraction.fetch_tdoc import fetch_tdoc_files

        normalized_id = normalize_tdoc_id(document_id)
        tdoc_files = fetch_tdoc_files(normalized_id, force_download=force)
        primary = tdoc_files.primary_path
        if primary is None:
            raise ConversionError(f"No document files found for {normalized_id}")

        wiki_source_dir.mkdir(parents=True, exist_ok=True)

        if profile == ExtractionProfile.PDF_ONLY:
            pdf_path = ensure_pdf(primary, wiki_source_dir, force=force)
            return pdf_path

        # default or advanced — use opendataloader
        config = OpendataloaderConfig(
            hybrid="docling-fast",
            hybrid_mode="full" if profile == ExtractionProfile.ADVANCED else None,
        )
        markdown_content, json_path = _ensure_converted(
            document_id, force=force, config=config
        )
        # Write markdown to wiki source dir
        md_file = wiki_source_dir / f"{primary.stem}.md"
        md_file.write_text(markdown_content, encoding="utf-8")
        return md_file
    ```
  </action>
  <acceptance_criteria>
    - ensure_pdf() exists in conversion.py and handles office→PDF and native PDF copy
    - _ensure_converted() accepts profile parameter
    - convert_for_wiki() exists and dispatches to correct pipeline per profile
    - pdf-only profile skips opendataloader entirely
  </acceptance_criteria>
  <verify>
    <automated>grep -c "def ensure_pdf\|def convert_for_wiki\|ExtractionProfile" src/tdoc_crawler/extraction/conversion.py src/tdoc_crawler/extraction/convert.py</automated>
  </verify>
  <done>pdf-only profile implemented, convert_for_wiki() dispatches correctly</done>
</task>

</tasks>

<threat_model>

## Trust Boundaries

| Boundary | Description |
|----------|-------------|
| CLI→filesystem | Extraction writes files to wiki directory |

## STRIDE Threat Register

| Threat ID | Category | Component | Disposition | Mitigation Plan |
|-----------|----------|-----------|-------------|-----------------|
| T-11-01 | I | File copy in ensure_pdf | accept | Copies source files to wiki directory; no security impact |
</threat_model>

<verification>
- [ ] ExtractionProfile enum defined with three values
- [ ] pdf-only profile produces PDF output without opendataloader
- [ ] convert_for_wiki() dispatches correctly per profile
- [ ] All existing tests pass
</verification>

<success_criteria>

- ExtractionProfile enum exists with PDF_ONLY, DEFAULT, ADVANCED
- pdf-only profile works: office→PDF conversion, native PDF passthrough
- convert_for_wiki() is the public API for profile-aware conversion
</success_criteria>

<output>
After completion, create `.planning/phases/11-extraction-profiles-for-wiki/11-01-SUMMARY.md`
</output>
# Plan 11-01 SUMMARY — Define ExtractionProfile Enum and Implement pdf-only

**Phase:** 11-extraction-profiles-for-wiki
**Plan:** 01
**Executed:** 2026-04-28

## Changes Made

### Task 1: Create ExtractionProfile enum and profile definitions
- **`src/tdoc_crawler/extraction/profiles.py`:** Created new module with:
  - `ExtractionProfile(StrEnum)` with values: `PDF_ONLY`, `DEFAULT`, `ADVANCED`
  - `DEFAULT_EXTRACTION_PROFILE = ExtractionProfile.DEFAULT`
- **`src/tdoc_crawler/extraction/__init__.py`:** Added exports for `ExtractionProfile` and `DEFAULT_EXTRACTION_PROFILE`

### Task 2: Implement pdf-only profile in conversion pipeline
- **`src/tdoc_crawler/extraction/conversion.py`:** Added `ensure_pdf()` function:
  - Office docs: converts via LibreOffice using `convert_to_pdf()`
  - Native PDFs: copies to output directory
  - Uses cached PDFs when available
- **`src/tdoc_crawler/extraction/convert.py`:** Added:
  - `convert_for_wiki()` — main profile-aware conversion function
  - `_add_source_pdf_to_json()` — injects source PDF reference into JSON output
  - Profile dispatch: PDF_ONLY skips opendataloader, DEFAULT/ADVANCED use opendataloader hybrid

## Files Modified
- `src/tdoc_crawler/extraction/profiles.py` — NEW (ExtractionProfile enum)
- `src/tdoc_crawler/extraction/__init__.py` — Added exports, removed broken checkout import
- `src/tdoc_crawler/extraction/conversion.py` — Added `ensure_pdf()`
- `src/tdoc_crawler/extraction/convert.py` — Added `convert_for_wiki()`, `_add_source_pdf_to_json()`

## Verification
- All 30 CLI tests pass
- Syntax verified on all modified files
- `workspace process --help` shows `--profile` option correctly

## Requirements Addressed
- **WIKI-01**: pdf-only profile delivers raw PDF to wiki directory
- **WIKI-02**: Structured output (markdown + JSON) for default/advanced profiles
---
phase: 11-extraction-profiles-for-wiki
plan: 02
type: execute
wave: 2
depends_on:
  - 11-01
files_modified:
  - src/tdoc_crawler/extraction/convert.py
  - src/tdoc_crawler/extraction/conversion.py
  - src/tdoc_crawler/cli/_workspace_commands.py
  - src/tdoc_crawler/config/settings.py
autonomous: true
requirements:
  - WIKI-01
  - WIKI-02

must_haves:
  truths:
    - "default profile produces markdown + structured JSON with PDF reference"
    - "advanced profile adds picture descriptions to JSON output"
    - "Profile can be selected via --profile flag on workspace process"
    - "JSON output includes source_pdf field pointing to original PDF"
  artifacts:
    - path: "src/tdoc_crawler/extraction/convert.py"
      provides: "Default and advanced profile implementations"
      contains: "hybrid.*docling-fast|hybrid_mode.*full|source_pdf"
    - path: "src/tdoc_crawler/cli/_workspace_commands.py"
      provides: "Profile selection on workspace process"
      contains: "profile|ExtractionProfile"
  key_links:
    - from: "workspace process --profile"
      to: "convert_for_wiki()"
      via: "profile flag passed through to extraction pipeline"
      pattern: "profile"
---

<objective>
Implement default and advanced extraction profiles with structured JSON output and CLI profile selection.

Purpose: Deliver the medium and full extraction tiers for wiki ingestion, with profile selection via CLI.
Output: Default profile (markdown + JSON + PDF ref), advanced profile (adds picture descriptions), CLI --profile flag.
</objective>

<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/phases/11-extraction-profiles-for-wiki/11-CONTEXT.md
@src/tdoc_crawler/extraction/convert.py
@src/tdoc_crawler/extraction/conversion.py
@src/tdoc_crawler/cli/_workspace_commands.py
@src/tdoc_crawler/config/settings.py
</context>

<tasks>

<task type="auto">
  <name>Task 1: Implement default and advanced profiles in convert.py</name>
  <files>
    src/tdoc_crawler/extraction/convert.py
  </files>
  <read_first>
    src/tdoc_crawler/extraction/convert.py — _ensure_converted(), _run_opendataloader(), OpendataloaderConfig
  </read_first>
  <action>
    Implement the default and advanced profiles in `convert.py`:

    1. Modify `_ensure_converted()` (or the new `convert_for_wiki()`) to:
       - For `default` profile: use `OpendataloaderConfig(hybrid="docling-fast")` — this triggers opendataloader's hybrid mode with the docling-fast backend
       - For `advanced` profile: use `OpendataloaderConfig(hybrid="docling-fast", hybrid_mode="full")` — this enables picture descriptions on the client side (requires backend started with `--enrich-picture-description`)

    2. Add `source_pdf` field to the JSON output. After opendataloader produces its JSON, inject:
       ```python
       def _add_source_pdf_to_json(json_path: Path, pdf_path: Path) -> None:
           """Add source_pdf reference to opendataloader JSON output."""
           try:
               data = json.loads(json_path.read_text(encoding="utf-8"))
               if isinstance(data, dict):
                   data["source_pdf"] = str(pdf_path)
               elif isinstance(data, list) and len(data) > 0:
                   data[0]["source_pdf"] = str(pdf_path)
               json_path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
           except Exception as e:
               logger.warning("Failed to add source_pdf to JSON: %s", e)
       ```
       Call this after opendataloader produces the JSON file.

    3. Ensure the PDF file is also placed in the wiki source directory alongside the markdown and JSON outputs. The `convert_for_wiki()` function should copy the PDF to `wiki_source_dir / f"{doc_id}.pdf"` for all profiles.
  </action>
  <acceptance_criteria>
    - default profile uses OpendataloaderConfig(hybrid="docling-fast")
    - advanced profile uses OpendataloaderConfig(hybrid="docling-fast", hybrid_mode="full")
    - JSON output contains source_pdf field
    - PDF is copied to wiki source directory for all profiles
  </acceptance_criteria>
  <verify>
    <automated>grep -c "source_pdf\|hybrid.*docling-fast\|hybrid_mode.*full" src/tdoc_crawler/extraction/convert.py</automated>
  </verify>
  <done>Default and advanced profiles implemented with PDF reference in JSON</done>
</task>

<task type="auto">
  <name>Task 2: Add --profile flag to workspace process command</name>
  <files>
    src/tdoc_crawler/cli/_workspace_commands.py
  </files>
  <read_first>
    src/tdoc_crawler/cli/_workspace_commands.py — workspace_process function
  </read_first>
  <action>
    Add `--profile` option to the `workspace process` command:

    1. Add import:
    ```python
    from tdoc_crawler.extraction.profiles import DEFAULT_EXTRACTION_PROFILE, ExtractionProfile
    ```

    2. Add `profile` parameter to `workspace_process`:
    ```python
    profile: str = typer.Option(
        DEFAULT_EXTRACTION_PROFILE.value,
        "--profile",
        help="Extraction profile: pdf-only, default, or advanced",
    ),
    ```

    3. In the function body, parse and use the profile:
    ```python
    try:
        extraction_profile = ExtractionProfile(profile)
    except ValueError:
        console.print(f"[red]Invalid profile '{profile}'. Use: pdf-only, default, advanced[/red]")
        raise typer.Exit(1)
    ```

    4. Update the processing logic to call `convert_for_wiki()` with the selected profile for each member.
  </action>
  <acceptance_criteria>
    - workspace process has --profile option with pdf-only, default, advanced values
    - Invalid profile values show error message
    - Default value matches DEFAULT_EXTRACTION_PROFILE
  </acceptance_criteria>
  <verify>
    <automated>grep -c "profile\|ExtractionProfile" src/tdoc_crawler/cli/_workspace_commands.py</automated>
  </verify>
  <done>CLI profile selection works on workspace process command</done>
</task>

</tasks>

<threat_model>

## Trust Boundaries

| Boundary | Description |
|----------|-------------|
| CLI→filesystem | Extraction writes markdown, JSON, images to wiki directory |

## STRIDE Threat Register

| Threat ID | Category | Component | Disposition | Mitigation Plan |
|-----------|----------|-----------|-------------|-----------------|
| T-11-02 | I | JSON output with source_pdf | accept | Path reference to local file; no security impact |
</threat_model>

<verification>
- [ ] default profile produces markdown + JSON + PDF
- [ ] advanced profile adds picture descriptions
- [ ] JSON includes source_pdf field
- [ ] CLI --profile flag works on workspace process
- [ ] All existing tests pass
</verification>

<success_criteria>

- Default and advanced profiles fully implemented
- CLI profile selection works
- JSON output includes PDF reference
- All tests pass
</success_criteria>

<output>
After completion, create `.planning/phases/11-extraction-profiles-for-wiki/11-02-SUMMARY.md`
</output>
# Plan 11-02 SUMMARY — Wire --profile CLI Flag

**Phase:** 11-extraction-profiles-for-wiki
**Plan:** 02
**Executed:** 2026-04-28

## Changes Made

### Task: Add --profile CLI flag to workspace process command
- **`src/tdoc_crawler/cli/_workspace_commands.py`:** Added:
  - `profile: str = typer.Option(DEFAULT_EXTRACTION_PROFILE.value, "--profile", ...)` parameter
  - Profile validation logic: converts string to `ExtractionProfile` enum, exits with error for invalid values
  - Passes profile to `convert_for_wiki()` function

## Files Modified
- `src/tdoc_crawler/cli/_workspace_commands.py` — Added `--profile` flag with validation

## Verification
- `uv run 3gpp-crawler workspace process --help` shows `--profile` option with choices: pdf-only, default, advanced
- All 30 CLI tests pass

## Requirements Addressed
- **WIKI-03**: Profile selection via CLI flag on workspace process command
# Phase 11: Extraction Profiles for Wiki Ingestion — Context

**Gathered:** 2026-04-28
**Status:** Ready for planning
**Source:** User design decisions

<domain>
## Phase Boundary

Phase 11 defines and implements three extraction profiles for preparing 3GPP documents (TDocs, specs, other) for LLM-wiki ingestion. The profiles control how source documents are converted and what structured data is produced for downstream wiki compilers.

This phase does NOT implement wiki compilation itself (WIKI-01–04) — it only defines the extraction pipeline that feeds into it.

</domain>

<decisions>
## Implementation Decisions

### D-01: Three-tier extraction profiles

| Level | Name | Description |
|-------|------|-------------|
| `pdf-only` | Minimum | Convert office formats to PDF only. Let the LLM-wiki framework handle all extraction from the PDF. Uses `convert-lo` for office→PDF, or passes through native PDFs. |
| `default` | Medium | Use `opendataloader-pdf` in hybrid mode (`docling-fast`) to extract clean markdown text, figures, equations, and tables as structured JSON output. The JSON must include a link/reference to the original PDF file. |
| `advanced` | Full | Same as `default`, plus AI-assisted chart/image description via `opendataloader-pdf-hybrid` with `--enrich-picture-description` and `--hybrid-mode full`. |

### D-02: Profile selection mechanism

- Profile is selected per-workspace via config or CLI flag
- Default profile is `default` (medium)
- The existing `TDC_AI_EXTRACTION_PROFILE` env var can be extended to support these three values
- CLI flag `--profile` on `workspace process` command

### D-03: Output structure

All profiles produce artifacts in `~/.3gpp-crawler/wiki/<workspace>/sources/<doc-id>/`:

- **pdf-only**: `<doc-id>.pdf` (the converted/passthrough PDF)
- **default**: `<doc-id>.pdf` + `<doc-id>.md` (markdown) + `<doc-id>.json` (structured elements) + `<doc-id>_images/` (extracted images)
- **advanced**: Same as default + `<doc-id>.json` includes `"description"` fields on picture elements

### D-04: PDF reference in JSON output

The JSON output from `default` and `advanced` profiles must include a reference to the original PDF file:

```json
{
  "file name": "S4-260109.pdf",
  "source_pdf": "~/.3gpp-crawler/wiki/atias/sources/S4-260109/S4-260109.pdf",
  ...
}
```

### D-05: Office-to-PDF conversion

- Uses existing `convert-lo` (LibreOffice headless) for office format → PDF conversion
- Native PDFs pass through unchanged
- Cached PDFs stored in `.ai/` subdirectory next to source file (existing pattern)

### D-06: Hybrid backend management

- `default` profile starts/connects to `opendataloader-pdf-hybrid` backend automatically
- `advanced` profile starts backend with `--enrich-picture-description` and uses `--hybrid-mode full`
- Backend URL and timeout configurable via `OpendataloaderConfig` (already exists)

### Claude's Discretion

- Exact CLI flag name (`--profile` vs `--extraction-profile`)
- How to handle backend startup (automatic vs manual)
- Whether to add a `--profile` option to `workspace create` as well
- Error handling when backend is unavailable for `default`/`advanced` profiles

</decisions>

<canonical_refs>

## Canonical References

### Existing extraction infrastructure

- `src/tdoc_crawler/extraction/convert.py` — `OpendataloaderConfig`, `_ensure_converted()`, element extraction functions
- `src/tdoc_crawler/extraction/conversion.py` — `convert_to_pdf()`, `ConverterConfig`, `ConverterBackend`
- `src/tdoc_crawler/extraction/fetch_tdoc.py` — `fetch_tdoc_files()`
- `src/tdoc_crawler/cli/_workspace_commands.py` — `workspace_process` command

### Configuration

- `src/tdoc_crawler/config/settings.py` — `ThreeGPPConfig`, `CrawlConfig`
- `src/tdoc_crawler/config/cache_manager.py` — `CacheManager` with wiki path methods

### OpenDataLoader docs

- JSON schema: <https://opendataloader.org/docs/reference/json-schema>
- Hybrid mode: <https://opendataloader.org/docs/hybrid-mode>
- Chart/image description: <https://opendataloader.org/docs/hybrid-mode#chart-and-image-description>
- Server options: <https://opendataloader.org/docs/hybrid-mode#server-options>

</canonical_refs>

<specifics>
## Specific Ideas

- The `OpendataloaderConfig` dataclass already has `hybrid`, `hybrid_mode`, `hybrid_url`, `hybrid_timeout`, `hybrid_fallback` fields — these map directly to the three profiles
- For `pdf-only` profile: skip opendataloader entirely, just ensure PDF exists
- For `default` profile: `OpendataloaderConfig(hybrid="docling-fast")`
- For `advanced` profile: `OpendataloaderConfig(hybrid="docling-fast", hybrid_mode="full")` + backend started with `--enrich-picture-description`
- The `source_pdf` field in JSON output should be a relative path from the workspace root, or an absolute path
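The relative-vs-absolute choice for `source_pdf` can be handled with one small helper that prefers a workspace-relative reference and falls back to the path as given when the PDF lies outside the workspace (the helper name is hypothetical):

```python
from pathlib import Path


def source_pdf_ref(pdf_path: Path, workspace_root: Path) -> str:
    """Prefer a workspace-relative reference; fall back to the full path."""
    try:
        return pdf_path.relative_to(workspace_root).as_posix()
    except ValueError:  # pdf_path is not under workspace_root
        return str(pdf_path)
```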

</specifics>

<deferred>
## Deferred Ideas

- WIKI-01–04 (wiki compilation from extraction artifacts) — separate phase
- QUERY-01–04 (wiki query) — separate phase
- QUAL-01–02 (quality gates for wiki publish) — separate phase

</deferred>

---

*Phase: 11-extraction-profiles-for-wiki*
*Context gathered: 2026-04-28 via user design decisions*