Commit ebf803a2 authored by Jan Reimes

feat(ai-document-processing): add research, spec, and tasks for AI pipeline

* Introduce research document outlining decisions for AI document processing.
* Create feature specification detailing user scenarios, requirements, and success criteria.
* Establish tasks for implementing the AI document processing pipeline, organized by user story phases.
* Define critical functionalities including DOCX-to-Markdown extraction, document classification, embeddings generation, LLM summarization, and temporal graph construction.
* Ensure all tasks are aligned with functional requirements and include independent tests for validation.
parent 6cddbf3c
# Specification Quality Checklist: AI Document Processing Pipeline

**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2026-02-24
**Feature**: [spec.md](../spec.md)

## Content Quality

- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed

## Requirement Completeness

- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified

## Feature Readiness

- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification

## Notes

- All items pass validation. Spec is ready for `/speckit.clarify` or `/speckit.plan`.
- FR-010 references `ai/` domain package path as an architectural constraint per constitution (Principle V: Domain-Oriented Architecture). This is a placement rule, not an implementation detail.
- Assumptions section documents that DOCX is the initial target format; Out of Scope explicitly defers PDF/PPT/XLS extraction.
# API Contracts: AI Document Processing Pipeline

**Date**: 2026-02-24
**Source**: [spec.md](../spec.md) FRs + [data-model.md](../data-model.md) entities

## Module: `tdoc_crawler.ai`

### Public API (exposed from `__init__.py`)

```python
from tdoc_crawler.ai.models import (
    AiConfig,
    DocumentChunk,
    DocumentClassification,
    DocumentSummary,
    GraphEdge,
    GraphNode,
    PipelineStage,
    ProcessingStatus,
)


def process_tdoc(
    tdoc_id: str,
    config: AiConfig | None = None,
    stages: list[PipelineStage] | None = None,
) -> ProcessingStatus:
    """Process a single TDoc through the AI pipeline.

    Args:
        tdoc_id: TDoc identifier (e.g., "SP-123456"). Normalized via .upper().
        config: Pipeline configuration. If None, loads from default config file.
        stages: Specific stages to run. If None, runs all applicable stages.

    Returns:
        Updated ProcessingStatus after pipeline execution.

    Raises:
        TDocNotFoundError: If the TDoc is not in the database or has no files.
        AiConfigError: If required configuration (e.g., LLM endpoint) is missing.
    """


def process_all(
    config: AiConfig | None = None,
    new_only: bool = False,
    stages: list[PipelineStage] | None = None,
    progress_callback: Callable[[str, PipelineStage], None] | None = None,
) -> list[ProcessingStatus]:
    """Batch process all (or new-only) TDocs through the AI pipeline.

    Args:
        config: Pipeline configuration. If None, loads from default config file.
        new_only: If True, only process TDocs not yet in processing_status.
        stages: Specific stages to run. If None, runs all applicable stages.
        progress_callback: Called with (tdoc_id, stage) for progress reporting.

    Returns:
        List of ProcessingStatus for all processed TDocs.
    """


def get_status(
    tdoc_id: str | None = None,
) -> ProcessingStatus | list[ProcessingStatus]:
    """Get processing status for one or all TDocs.

    Args:
        tdoc_id: If provided, return status for this TDoc. If None, return all.

    Returns:
        Single ProcessingStatus or list of all statuses.

    Raises:
        TDocNotFoundError: If tdoc_id is provided but not found.
    """


def query_embeddings(
    query: str,
    top_k: int = 5,
    tdoc_filter: list[str] | None = None,
    config: AiConfig | None = None,
) -> list[tuple[DocumentChunk, float]]:
    """Semantic search over embedded document chunks.

    Args:
        query: Natural language query text.
        top_k: Number of top results to return.
        tdoc_filter: Optional list of TDoc IDs to restrict search to.
        config: Pipeline configuration (for embedding model).

    Returns:
        List of (chunk, similarity_score) tuples, sorted by descending score.
    """


def query_graph(
    query: str,
    temporal_range: tuple[datetime, datetime] | None = None,
    node_types: list[str] | None = None,
    config: AiConfig | None = None,
) -> dict[str, Any]:
    """Query the temporal knowledge graph.

    Args:
        query: Natural language query about relationships/evolution.
        temporal_range: Optional (start, end) datetime filter.
        node_types: Optional filter for node types.
        config: Pipeline configuration.

    Returns:
        Dict with keys "nodes" (list[GraphNode]), "edges" (list[GraphEdge]),
        "answer" (str) — the synthesized response.
    """
```

## Module: `tdoc_crawler.ai.operations.extract`

```python
def extract_docx_to_markdown(
    docx_path: Path,
    output_dir: Path,
) -> Path:
    """Convert a DOCX file to Markdown using Docling.

    Args:
        docx_path: Path to the source DOCX file.
        output_dir: Directory where the Markdown file will be written.

    Returns:
        Path to the generated Markdown file.

    Raises:
        ExtractionError: If DOCX is corrupt, password-protected, or conversion fails.
    """
```

## Module: `tdoc_crawler.ai.operations.classify`

```python
def classify_tdoc_files(
    tdoc_id: str,
    file_paths: list[Path],
) -> list[DocumentClassification]:
    """Classify files in a TDoc folder as main or secondary.

    Args:
        tdoc_id: The TDoc identifier.
        file_paths: List of DOCX files found in the TDoc folder.

    Returns:
        Classification for each file, with exactly one marked as main.
    """
```
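
One way the "exactly one main" guarantee could be met is sketched below; the marker list and largest-file fallback are illustrative assumptions, not the contracted heuristics:

```python
def pick_main_document(files: dict[str, int]) -> str:
    """Illustrative heuristic only: exclude file names that suggest
    supporting material, then take the largest remaining file. The marker
    list and the size fallback are assumptions, not the contracted rules.

    `files` maps file name -> size in bytes.
    """
    secondary_markers = ("cover", "annex", "attachment", "template")
    candidates = {name: size for name, size in files.items()
                  if not any(m in name.lower() for m in secondary_markers)}
    pool = candidates or files  # fall back if every file looked secondary
    return max(pool, key=pool.get)  # exactly one main document
```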

## Module: `tdoc_crawler.ai.operations.embed`

```python
def chunk_and_embed(
    tdoc_id: str,
    markdown_path: Path,
    config: AiConfig,
) -> list[DocumentChunk]:
    """Split Markdown into chunks and generate embeddings.

    Args:
        tdoc_id: The TDoc identifier.
        markdown_path: Path to the extracted Markdown file.
        config: Configuration (embedding model, chunk size, overlap).

    Returns:
        List of DocumentChunk objects with vectors populated.
    """
```
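
The chunking step can be sketched as a sliding window. Note that the pipeline measures `max_chunk_size` and `chunk_overlap` in tokens; this dependency-free sketch uses characters instead:

```python
def chunk_text(text: str, max_size: int = 1000, overlap: int = 100) -> list[str]:
    """Sliding-window chunking sketch. The pipeline measures max_chunk_size
    and chunk_overlap in tokens; characters are used here only to keep the
    example dependency-free."""
    if overlap >= max_size:
        raise ValueError("overlap must be smaller than max_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_size])
        start += max_size - overlap
    return chunks
```

Consecutive chunks share `overlap` characters, so text split at a boundary still appears whole in at least one chunk.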

## Module: `tdoc_crawler.ai.operations.summarize`

```python
def summarize_tdoc(
    tdoc_id: str,
    markdown_path: Path,
    config: AiConfig,
) -> DocumentSummary:
    """Generate an abstract and structured summary using an LLM.

    Args:
        tdoc_id: The TDoc identifier.
        markdown_path: Path to the extracted Markdown file.
        config: Configuration (LLM model, prompt version, word count bounds).

    Returns:
        DocumentSummary with abstract, key_points, action_items, decisions.

    Raises:
        LlmConfigError: If no LLM endpoint is configured or reachable.
    """
```
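
The `abstract_min_words`/`abstract_max_words` bounds from `AiConfig` imply a simple validation step, sketched here (whitespace word-splitting is an assumption):

```python
def abstract_within_bounds(abstract: str, min_words: int = 150, max_words: int = 250) -> bool:
    """Check the abstract length against the configured word-count bounds.
    Whitespace splitting as the word definition is an assumption."""
    word_count = len(abstract.split())
    return min_words <= word_count <= max_words
```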

## Module: `tdoc_crawler.ai.operations.graph`

```python
def build_graph_for_tdoc(
    tdoc_id: str,
    markdown_path: Path,
    summary: DocumentSummary,
    config: AiConfig,
) -> tuple[list[GraphNode], list[GraphEdge]]:
    """Extract entities and relationships from a TDoc and add to the graph.

    Args:
        tdoc_id: The TDoc identifier.
        markdown_path: Path to the extracted Markdown file.
        summary: Previously generated summary (for entity hints).
        config: Configuration.

    Returns:
        Tuple of (new nodes, new edges) added to the graph.
    """
```
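
As an illustration of how the summary can seed graph edges (the `edges_from_summary` helper and the triple format are hypothetical; real code would emit `GraphEdge` models with a `provenance` value):

```python
def edges_from_summary(tdoc_node_id: str, affected_specs: list[str]) -> list[tuple[str, str, str]]:
    """Hypothetical sketch: derive "discusses" edges from
    DocumentSummary.affected_specs. Returns (source_id, edge_type, target_id)
    triples; real code would emit GraphEdge models with a provenance value
    such as "summary.affected_specs"."""
    return [(tdoc_node_id, "discusses", f"spec:{spec}") for spec in affected_specs]
```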

## Module: `tdoc_crawler.ai.operations.pipeline`

```python
def run_pipeline(
    tdoc_id: str,
    config: AiConfig,
    stages: list[PipelineStage] | None = None,
) -> ProcessingStatus:
    """Run the full pipeline for a single TDoc, respecting resume logic.

    Checks ProcessingStatus to determine where to resume. Executes stages
    in order: classify -> extract -> embed -> summarize -> graph.

    Args:
        tdoc_id: The TDoc identifier.
        config: Pipeline configuration.
        stages: If provided, only run these stages.

    Returns:
        Updated ProcessingStatus.
    """
```
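
The resume logic can be sketched as "run from the first incomplete stage onward", since each stage consumes the previous stage's output (the helper below is illustrative, not the contracted implementation):

```python
STAGE_ORDER = ["classifying", "extracting", "embedding", "summarizing", "graphing"]


def stages_to_run(completed: dict[str, bool]) -> list[str]:
    """Illustrative resume rule: run every stage from the first incomplete
    one onward, because each stage consumes the previous stage's output.
    `completed` maps stage name -> whether its *_at timestamp is set."""
    for i, stage in enumerate(STAGE_ORDER):
        if not completed.get(stage, False):
            return STAGE_ORDER[i:]
    return []  # everything done; nothing to re-run
```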

## Module: `tdoc_crawler.ai.storage`

```python
class AiStorage:
    """LanceDB storage layer for all AI-generated data."""

    def __init__(self, store_path: Path) -> None:
        """Initialize LanceDB connection at the given path."""

    def save_status(self, status: ProcessingStatus) -> None: ...
    def get_status(self, tdoc_id: str) -> ProcessingStatus | None: ...
    def list_statuses(self) -> list[ProcessingStatus]: ...

    def save_classifications(self, classifications: list[DocumentClassification]) -> None: ...
    def get_classifications(self, tdoc_id: str) -> list[DocumentClassification]: ...

    def save_chunks(self, chunks: list[DocumentChunk]) -> None: ...
    def search_chunks(self, query_vector: list[float], top_k: int) -> list[tuple[DocumentChunk, float]]: ...

    def save_summary(self, summary: DocumentSummary) -> None: ...
    def get_summary(self, tdoc_id: str) -> DocumentSummary | None: ...

    def save_nodes(self, nodes: list[GraphNode]) -> None: ...
    def save_edges(self, edges: list[GraphEdge]) -> None: ...
    def query_graph(self, filters: dict[str, Any]) -> tuple[list[GraphNode], list[GraphEdge]]: ...
```
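
What `search_chunks` promises can be sketched without LanceDB as a brute-force cosine-similarity scan (LanceDB would use its native vector index instead; this is illustrative only):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def search_brute_force(query: list[float], rows: dict[str, list[float]], top_k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the top_k
    (chunk_id, score) pairs, sorted by descending similarity."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in rows.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```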

## Error Types

```python
class AiError(Exception):
    """Base exception for AI processing errors."""

class TDocNotFoundError(AiError):
    """TDoc not found in database or has no files."""

class ExtractionError(AiError):
    """DOCX extraction failed (corrupt, password-protected, etc.)."""

class LlmConfigError(AiError):
    """LLM endpoint not configured or unreachable."""

class AiConfigError(AiError):
    """Invalid or missing AI configuration."""

class EmbeddingDimensionError(AiError):
    """Embedding model dimension mismatch with stored vectors."""
```
# CLI Contract: AI Document Processing

**Date**: 2026-02-24
**Module**: `tdoc_crawler.cli.ai`

## Command Group: `ai`

Registered as a Typer sub-app on the main `app` in `cli/app.py`:

```python
ai_app = typer.Typer(help="AI document processing commands")
app.add_typer(ai_app, name="ai", rich_help_panel="AI Commands")
```

## Commands

### `ai process`

Process TDocs through the AI pipeline.

```
tdoc-crawler ai process [--tdoc-id <ID>] [--all] [--new-only] [--json] [--cache-dir <PATH>]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--tdoc-id` | str | None | Process a single TDoc |
| `--all` | bool | False | Process all TDocs |
| `--new-only` | bool | False | Only process unprocessed TDocs (requires `--all`) |
| `--json` | bool | False | Output JSON instead of Rich text |
| `--cache-dir` | Path | None | Override cache directory |

**Text output**: Progress bar per TDoc, final summary line.
**JSON output**: `[{"tdoc_id": "...", "stages": {"classifying": "completed", ...}}]`
**Errors**: Stderr; exit code 1.

### `ai status`

Show processing status for TDocs.

```
tdoc-crawler ai status [--tdoc-id <ID>] [--json] [--cache-dir <PATH>]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--tdoc-id` | str | None | Status for a single TDoc (if omitted, shows all) |
| `--json` | bool | False | Output JSON |
| `--cache-dir` | Path | None | Override cache directory |

**Text output**: Rich table with columns: TDoc ID, Stage, Classified, Extracted,
Embedded, Summarized, Graphed, Error.
**JSON output**: `{"tdoc_id": "...", "current_stage": "...", "classified_at": "...", ...}`

### `ai query`

Semantic search over processed TDocs.

```
tdoc-crawler ai query "<question>" [--top-k <N>] [--json] [--cache-dir <PATH>]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `<question>` | str | required | Natural language query |
| `--top-k` | int | 5 | Number of results |
| `--json` | bool | False | Output JSON |
| `--cache-dir` | Path | None | Override cache directory |

**Text output**: Numbered list of matching chunks with TDoc ID, section, and snippet.
**JSON output**: `{"results": [{"tdoc_id": "...", "section": "...", "text": "...", "score": 0.85}]}`

### `ai graph`

Query the temporal knowledge graph.

```
tdoc-crawler ai graph --query "<question>" [--from <DATE>] [--to <DATE>] [--json] [--cache-dir <PATH>]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--query` | str | required | Natural language graph query |
| `--from` | str | None | Start date filter (ISO format) |
| `--to` | str | None | End date filter (ISO format) |
| `--json` | bool | False | Output JSON |
| `--cache-dir` | Path | None | Override cache directory |

**Text output**: Chronologically ordered answer with TDoc references and meeting context.
**JSON output**: `{"answer": "...", "nodes": [...], "edges": [...]}`

## Separation of Concerns

All CLI functions in `cli/ai.py` follow the delegation pattern:

1. Parse Typer arguments
2. Initialize `CacheManager` and `AiConfig`
3. Call the library function from `tdoc_crawler.ai`
4. Format output (Rich table or JSON) and print via `typer.echo`

No domain logic in the CLI module.
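
Step 4 of the pattern, sketched for the status command (`format_status` is a hypothetical helper; the field names mirror `ProcessingStatus`):

```python
import json


def format_status(row: dict, as_json: bool) -> str:
    """Hypothetical helper for step 4: the CLI layer only formats what the
    library returned. Field names mirror ProcessingStatus."""
    if as_json:
        return json.dumps(row)
    return f"{row['tdoc_id']}: {row['current_stage']}"
```
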
# Data Model: AI Document Processing Pipeline

**Date**: 2026-02-24
**Source**: [spec.md](spec.md) Key Entities + [research.md](research.md) storage decisions

## Storage Architecture

All AI data is stored in LanceDB (file-based) under `<cache_dir>/.ai/lancedb/`.
The existing tdoc-crawler SQLite database is read-only from the AI module.

```
<cache_dir>/
├── tdoc_crawler.db          # Existing SQLite DB (read-only from AI module)
├── checkout/                # Existing TDoc file checkouts
└── .ai/
    └── lancedb/
        ├── processing_status.lance
        ├── classifications.lance
        ├── chunks.lance         # Includes embedded vectors
        ├── summaries.lance
        ├── graph_nodes.lance
        └── graph_edges.lance
```
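
The layout above pins the LanceDB root to a fixed location under the cache directory; the derivation is trivial (`ai_store_path` is a hypothetical helper):

```python
from pathlib import Path


def ai_store_path(cache_dir: Path) -> Path:
    """Derive the LanceDB root from the cache directory, per the layout above."""
    return cache_dir / ".ai" / "lancedb"
```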

## Entity Definitions

### 1. AiConfig

Configuration model for the AI pipeline. Not stored in LanceDB — loaded from
a config file or environment variables.

```python
class AiConfig(BaseModel):
    """Configuration for the AI processing pipeline."""

    # Storage
    ai_store_path: Path  # Default: <cache_dir>/.ai/lancedb/

    # Extraction
    # (No configurable params for Docling in v1 — uses defaults)

    # Embeddings
    embedding_model: str = "BAAI/bge-small-en-v1.5"
    max_chunk_size: int = 1000    # tokens
    chunk_overlap: int = 100      # tokens

    # Summarization
    llm_model: str = "ollama/llama3.2"        # litellm model identifier
    llm_api_base: str | None = None           # Override for remote endpoints
    abstract_min_words: int = 150
    abstract_max_words: int = 250

    # Pipeline
    parallelism: int = 4         # Concurrent TDoc processing
```

### 2. ProcessingStatus

Tracks the current AI processing state for each TDoc.

```python
class PipelineStage(StrEnum):
    """Stages of the AI processing pipeline."""

    PENDING = "pending"
    CLASSIFYING = "classifying"
    EXTRACTING = "extracting"
    EMBEDDING = "embedding"
    SUMMARIZING = "summarizing"
    GRAPHING = "graphing"
    COMPLETED = "completed"
    FAILED = "failed"


class ProcessingStatus(BaseModel):
    """Processing state for a single TDoc."""

    tdoc_id: str                            # PK; normalized via .upper()
    current_stage: PipelineStage = PipelineStage.PENDING
    classified_at: datetime | None = None
    extracted_at: datetime | None = None
    embedded_at: datetime | None = None
    summarized_at: datetime | None = None
    graphed_at: datetime | None = None
    completed_at: datetime | None = None
    error_message: str | None = None
    source_hash: str | None = None          # Hash of source DOCX for change detection
```

**LanceDB table**: `processing_status`
**Primary key**: `tdoc_id`

### 3. DocumentClassification

Records main/secondary classification for each file in a TDoc folder.

```python
class DocumentClassification(BaseModel):
    """Classification of a file within a TDoc folder."""

    tdoc_id: str                 # FK to ProcessingStatus
    file_path: str               # Relative path within checkout folder
    is_main_document: bool
    confidence: float            # 0.0 to 1.0
    decisive_heuristic: str      # Which rule determined the classification
    file_size_bytes: int
    classified_at: datetime
```

**LanceDB table**: `classifications`
**Composite key**: `(tdoc_id, file_path)`

### 4. DocumentChunk

A segment of extracted Markdown with position metadata. Also stores the
embedding vector (combined in one table for efficient vector search).

```python
class DocumentChunk(BaseModel):
    """A chunk of extracted document text with its embedding."""

    chunk_id: str                  # "{tdoc_id}:{chunk_index}"
    tdoc_id: str                   # FK to ProcessingStatus
    section_heading: str | None    # Heading of the section this chunk belongs to
    chunk_index: int               # Position within the document
    text: str                      # The chunk text content
    char_offset_start: int         # Start offset in the full Markdown
    char_offset_end: int           # End offset in the full Markdown
    vector: list[float]            # Embedding vector (384-dim for bge-small)
    embedding_model: str           # Model used to generate the vector
    created_at: datetime
```

**LanceDB table**: `chunks`
**Primary key**: `chunk_id`
**Vector column**: `vector` (used for similarity search)

### 5. DocumentSummary

LLM-generated abstract and structured summary.

```python
class DocumentSummary(BaseModel):
    """AI-generated summary for a TDoc."""

    tdoc_id: str                     # PK; FK to ProcessingStatus
    abstract: str                    # 150-250 word abstract
    key_points: list[str]            # Bullet-point key findings
    action_items: list[str]          # Identified action items
    decisions: list[str]             # Decisions recorded in the TDoc
    affected_specs: list[str]        # Spec numbers mentioned (e.g., "TS 26.132")
    llm_model: str                   # Model used for generation
    prompt_version: str              # Version of the prompt template
    generated_at: datetime
```

**LanceDB table**: `summaries`
**Primary key**: `tdoc_id`

### 6. GraphNode

An entity in the temporal knowledge graph.

```python
class GraphNodeType(StrEnum):
    """Types of nodes in the knowledge graph."""

    TDOC = "tdoc"
    MEETING = "meeting"
    SPEC = "spec"
    WORK_ITEM = "work_item"
    CHANGE_REQUEST = "cr"
    COMPANY = "company"
    CONCEPT = "concept"


class GraphNode(BaseModel):
    """A node in the temporal knowledge graph."""

    node_id: str                     # Unique identifier (type-prefixed)
    node_type: GraphNodeType
    label: str                       # Human-readable label
    valid_from: datetime | None      # Temporal validity start
    valid_to: datetime | None        # Temporal validity end
    properties: dict[str, Any]       # Additional type-specific properties
    created_at: datetime
```

**LanceDB table**: `graph_nodes`
**Primary key**: `node_id`

### 7. GraphEdge

A typed relationship between two GraphNodes.

```python
class GraphEdgeType(StrEnum):
    """Types of edges in the knowledge graph."""

    DISCUSSES = "discusses"        # TDoc discusses a Spec/WorkItem/Concept
    REVISES = "revises"            # TDoc revises another TDoc
    REFERENCES = "references"      # TDoc references another TDoc
    SUPERSEDES = "supersedes"      # TDoc supersedes another TDoc
    AUTHORED_BY = "authored_by"    # TDoc authored by Company
    MERGED_INTO = "merged_into"    # TDoc merged into a Spec
    PRESENTED_AT = "presented_at"  # TDoc presented at Meeting


class GraphEdge(BaseModel):
    """An edge in the temporal knowledge graph."""

    edge_id: str                    # "{source_id}->{edge_type}->{target_id}"
    source_id: str                  # FK to GraphNode.node_id
    target_id: str                  # FK to GraphNode.node_id
    edge_type: GraphEdgeType
    weight: float = 1.0             # Relationship strength
    temporal_context: str | None    # Meeting or date context
    provenance: str                 # How this edge was derived
    created_at: datetime
```

**LanceDB table**: `graph_edges`
**Primary key**: `edge_id`

## Entity Relationships

```
ProcessingStatus (1) --- (0..N) DocumentClassification
ProcessingStatus (1) --- (0..N) DocumentChunk
ProcessingStatus (1) --- (0..1) DocumentSummary
GraphNode (1) --- (0..N) GraphEdge (as source)
GraphNode (1) --- (0..N) GraphEdge (as target)
```

## State Transitions

```
ProcessingStatus.current_stage:

  pending --> classifying --> extracting --> embedding --> summarizing --> graphing --> completed
     |             |              |              |              |             |
     +-------------+--------------+-------+------+--------------+-------------+
                                          |
                                          v
                                        failed
```

Each stage transition updates the corresponding `*_at` timestamp.
A `failed` status records the `error_message` and can be retried.

## Idempotency and Change Detection

- `ProcessingStatus.source_hash` stores SHA-256 of the source DOCX file
- Before processing, the current hash is compared to the stored hash
- If unchanged, the TDoc is skipped (FR-013: idempotent)
- If changed, all downstream stages are re-run from extraction

## Validation Rules

| Field | Rule |
|-------|------|
| `tdoc_id` | Normalized via `.upper()`; must match 3GPP TDoc pattern |
| `DocumentClassification.confidence` | Must be in range [0.0, 1.0] |
| `DocumentSummary.abstract` | Word count must be in [abstract_min_words, abstract_max_words] |
| `DocumentChunk.vector` | Length must match embedding model dimension |
| `GraphNode.node_id` | Must be prefixed with `node_type` (e.g., "tdoc:SP-123456") |
| `GraphEdge.edge_id` | Must follow "{source_id}->{edge_type}->{target_id}" format |
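
The two ID-format rules can be checked mechanically; the regexes below are an illustrative sketch (type prefixes taken from `GraphNodeType`, edge types from `GraphEdgeType`):

```python
import re

# Type prefixes from GraphNodeType; edge types from GraphEdgeType.
NODE_ID_RE = re.compile(r"^(tdoc|meeting|spec|work_item|cr|company|concept):.+$")
EDGE_ID_RE = re.compile(
    r"^.+->(discusses|revises|references|supersedes|authored_by|merged_into|presented_at)->.+$"
)


def valid_node_id(node_id: str) -> bool:
    """Check the type-prefixed node ID rule, e.g. "tdoc:SP-123456"."""
    return NODE_ID_RE.match(node_id) is not None


def valid_edge_id(edge_id: str) -> bool:
    """Check the "{source_id}->{edge_type}->{target_id}" rule."""
    return EDGE_ID_RE.match(edge_id) is not None
```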