Commit ebf803a2 authored by Jan Reimes

feat(ai-document-processing): add research, spec, and tasks for AI pipeline

* Introduce research document outlining decisions for AI document processing.
* Create feature specification detailing user scenarios, requirements, and success criteria.
* Establish tasks for implementing the AI document processing pipeline, organized by user story phases.
* Define critical functionalities including DOCX-to-Markdown extraction, document classification, embeddings generation, LLM summarization, and temporal graph construction.
* Ensure all tasks are aligned with functional requirements and include independent tests for validation.
parent 6cddbf3c
# Specification Quality Checklist: AI Document Processing Pipeline

**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2026-02-24
**Feature**: [spec.md](../spec.md)

## Content Quality

- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed

## Requirement Completeness

- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified

## Feature Readiness

- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification

## Notes

- All items pass validation. Spec is ready for `/speckit.clarify` or `/speckit.plan`.
- FR-010 references `ai/` domain package path as an architectural constraint per constitution (Principle V: Domain-Oriented Architecture). This is a placement rule, not an implementation detail.
- Assumptions section documents that DOCX is the initial target format; Out of Scope explicitly defers PDF/PPT/XLS extraction.
# API Contracts: AI Document Processing Pipeline

**Date**: 2026-02-24
**Source**: [spec.md](../spec.md) FRs + [data-model.md](../data-model.md) entities

## Module: `tdoc_crawler.ai`

### Public API (exposed from `__init__.py`)

```python
from tdoc_crawler.ai.models import (
    AiConfig,
    DocumentChunk,
    DocumentClassification,
    DocumentSummary,
    GraphEdge,
    GraphNode,
    PipelineStage,
    ProcessingStatus,
)


def process_tdoc(
    tdoc_id: str,
    config: AiConfig | None = None,
    stages: list[PipelineStage] | None = None,
) -> ProcessingStatus:
    """Process a single TDoc through the AI pipeline.

    Args:
        tdoc_id: TDoc identifier (e.g., "SP-123456"). Normalized via .upper().
        config: Pipeline configuration. If None, loads from default config file.
        stages: Specific stages to run. If None, runs all applicable stages.

    Returns:
        Updated ProcessingStatus after pipeline execution.

    Raises:
        TDocNotFoundError: If the TDoc is not in the database or has no files.
        AiConfigError: If required configuration (e.g., LLM endpoint) is missing.
    """


def process_all(
    config: AiConfig | None = None,
    new_only: bool = False,
    stages: list[PipelineStage] | None = None,
    progress_callback: Callable[[str, PipelineStage], None] | None = None,
) -> list[ProcessingStatus]:
    """Batch process all (or new-only) TDocs through the AI pipeline.

    Args:
        config: Pipeline configuration. If None, loads from default config file.
        new_only: If True, only process TDocs not yet in processing_status.
        stages: Specific stages to run. If None, runs all applicable stages.
        progress_callback: Called with (tdoc_id, stage) for progress reporting.

    Returns:
        List of ProcessingStatus for all processed TDocs.
    """


def get_status(
    tdoc_id: str | None = None,
) -> ProcessingStatus | list[ProcessingStatus]:
    """Get processing status for one or all TDocs.

    Args:
        tdoc_id: If provided, return status for this TDoc. If None, return all.

    Returns:
        Single ProcessingStatus or list of all statuses.

    Raises:
        TDocNotFoundError: If tdoc_id is provided but not found.
    """


def query_embeddings(
    query: str,
    top_k: int = 5,
    tdoc_filter: list[str] | None = None,
    config: AiConfig | None = None,
) -> list[tuple[DocumentChunk, float]]:
    """Semantic search over embedded document chunks.

    Args:
        query: Natural language query text.
        top_k: Number of top results to return.
        tdoc_filter: Optional list of TDoc IDs to restrict search to.
        config: Pipeline configuration (for embedding model).

    Returns:
        List of (chunk, similarity_score) tuples, sorted by descending score.
    """


def query_graph(
    query: str,
    temporal_range: tuple[datetime, datetime] | None = None,
    node_types: list[str] | None = None,
    config: AiConfig | None = None,
) -> dict[str, Any]:
    """Query the temporal knowledge graph.

    Args:
        query: Natural language query about relationships/evolution.
        temporal_range: Optional (start, end) datetime filter.
        node_types: Optional filter for node types.
        config: Pipeline configuration.

    Returns:
        Dict with keys "nodes" (list[GraphNode]), "edges" (list[GraphEdge]),
        "answer" (str) — the synthesized response.
    """
```

## Module: `tdoc_crawler.ai.operations.extract`

```python
def extract_docx_to_markdown(
    docx_path: Path,
    output_dir: Path,
) -> Path:
    """Convert a DOCX file to Markdown using Docling.

    Args:
        docx_path: Path to the source DOCX file.
        output_dir: Directory where the Markdown file will be written.

    Returns:
        Path to the generated Markdown file.

    Raises:
        ExtractionError: If DOCX is corrupt, password-protected, or conversion fails.
    """
```

## Module: `tdoc_crawler.ai.operations.classify`

```python
def classify_tdoc_files(
    tdoc_id: str,
    file_paths: list[Path],
) -> list[DocumentClassification]:
    """Classify files in a TDoc folder as main or secondary.

    Args:
        tdoc_id: The TDoc identifier.
        file_paths: List of DOCX files found in the TDoc folder.

    Returns:
        Classification for each file, with exactly one marked as main.
    """
```
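
One way the "exactly one main" guarantee could be met is sketched below; the marker list and largest-file fallback are illustrative assumptions, not the contracted heuristics:

```python
def pick_main_document(files: dict[str, int]) -> str:
    """Illustrative heuristic only: exclude file names that suggest
    supporting material, then take the largest remaining file. The marker
    list and the size fallback are assumptions, not the contracted rules.

    `files` maps file name -> size in bytes.
    """
    secondary_markers = ("cover", "annex", "attachment", "template")
    candidates = {name: size for name, size in files.items()
                  if not any(m in name.lower() for m in secondary_markers)}
    pool = candidates or files  # fall back if every file looked secondary
    return max(pool, key=pool.get)  # exactly one main document
```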

## Module: `tdoc_crawler.ai.operations.embed`

```python
def chunk_and_embed(
    tdoc_id: str,
    markdown_path: Path,
    config: AiConfig,
) -> list[DocumentChunk]:
    """Split Markdown into chunks and generate embeddings.

    Args:
        tdoc_id: The TDoc identifier.
        markdown_path: Path to the extracted Markdown file.
        config: Configuration (embedding model, chunk size, overlap).

    Returns:
        List of DocumentChunk objects with vectors populated.
    """
```
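
The chunking step can be sketched as a sliding window. Note that the pipeline measures `max_chunk_size` and `chunk_overlap` in tokens; this dependency-free sketch uses characters instead:

```python
def chunk_text(text: str, max_size: int = 1000, overlap: int = 100) -> list[str]:
    """Sliding-window chunking sketch. The pipeline measures max_chunk_size
    and chunk_overlap in tokens; characters are used here only to keep the
    example dependency-free."""
    if overlap >= max_size:
        raise ValueError("overlap must be smaller than max_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_size])
        start += max_size - overlap
    return chunks
```

Consecutive chunks share `overlap` characters, so text split at a boundary still appears whole in at least one chunk.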

## Module: `tdoc_crawler.ai.operations.summarize`

```python
def summarize_tdoc(
    tdoc_id: str,
    markdown_path: Path,
    config: AiConfig,
) -> DocumentSummary:
    """Generate an abstract and structured summary using an LLM.

    Args:
        tdoc_id: The TDoc identifier.
        markdown_path: Path to the extracted Markdown file.
        config: Configuration (LLM model, prompt version, word count bounds).

    Returns:
        DocumentSummary with abstract, key_points, action_items, decisions.

    Raises:
        LlmConfigError: If no LLM endpoint is configured or reachable.
    """
```
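
The `abstract_min_words`/`abstract_max_words` bounds from `AiConfig` imply a simple validation step, sketched here (whitespace word-splitting is an assumption):

```python
def abstract_within_bounds(abstract: str, min_words: int = 150, max_words: int = 250) -> bool:
    """Check the abstract length against the configured word-count bounds.
    Whitespace splitting as the word definition is an assumption."""
    word_count = len(abstract.split())
    return min_words <= word_count <= max_words
```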

## Module: `tdoc_crawler.ai.operations.graph`

```python
def build_graph_for_tdoc(
    tdoc_id: str,
    markdown_path: Path,
    summary: DocumentSummary,
    config: AiConfig,
) -> tuple[list[GraphNode], list[GraphEdge]]:
    """Extract entities and relationships from a TDoc and add to the graph.

    Args:
        tdoc_id: The TDoc identifier.
        markdown_path: Path to the extracted Markdown file.
        summary: Previously generated summary (for entity hints).
        config: Configuration.

    Returns:
        Tuple of (new nodes, new edges) added to the graph.
    """
```
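
As an illustration of how the summary can seed graph edges (the `edges_from_summary` helper and the triple format are hypothetical; real code would emit `GraphEdge` models with a `provenance` value):

```python
def edges_from_summary(tdoc_node_id: str, affected_specs: list[str]) -> list[tuple[str, str, str]]:
    """Hypothetical sketch: derive "discusses" edges from
    DocumentSummary.affected_specs. Returns (source_id, edge_type, target_id)
    triples; real code would emit GraphEdge models with a provenance value
    such as "summary.affected_specs"."""
    return [(tdoc_node_id, "discusses", f"spec:{spec}") for spec in affected_specs]
```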

## Module: `tdoc_crawler.ai.operations.pipeline`

```python
def run_pipeline(
    tdoc_id: str,
    config: AiConfig,
    stages: list[PipelineStage] | None = None,
) -> ProcessingStatus:
    """Run the full pipeline for a single TDoc, respecting resume logic.

    Checks ProcessingStatus to determine where to resume. Executes stages
    in order: classify -> extract -> embed -> summarize -> graph.

    Args:
        tdoc_id: The TDoc identifier.
        config: Pipeline configuration.
        stages: If provided, only run these stages.

    Returns:
        Updated ProcessingStatus.
    """
```
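
The resume logic can be sketched as "run from the first incomplete stage onward", since each stage consumes the previous stage's output (the helper below is illustrative, not the contracted implementation):

```python
STAGE_ORDER = ["classifying", "extracting", "embedding", "summarizing", "graphing"]


def stages_to_run(completed: dict[str, bool]) -> list[str]:
    """Illustrative resume rule: run every stage from the first incomplete
    one onward, because each stage consumes the previous stage's output.
    `completed` maps stage name -> whether its *_at timestamp is set."""
    for i, stage in enumerate(STAGE_ORDER):
        if not completed.get(stage, False):
            return STAGE_ORDER[i:]
    return []  # everything done; nothing to re-run
```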

## Module: `tdoc_crawler.ai.storage`

```python
class AiStorage:
    """LanceDB storage layer for all AI-generated data."""

    def __init__(self, store_path: Path) -> None:
        """Initialize LanceDB connection at the given path."""

    def save_status(self, status: ProcessingStatus) -> None: ...
    def get_status(self, tdoc_id: str) -> ProcessingStatus | None: ...
    def list_statuses(self) -> list[ProcessingStatus]: ...

    def save_classifications(self, classifications: list[DocumentClassification]) -> None: ...
    def get_classifications(self, tdoc_id: str) -> list[DocumentClassification]: ...

    def save_chunks(self, chunks: list[DocumentChunk]) -> None: ...
    def search_chunks(self, query_vector: list[float], top_k: int) -> list[tuple[DocumentChunk, float]]: ...

    def save_summary(self, summary: DocumentSummary) -> None: ...
    def get_summary(self, tdoc_id: str) -> DocumentSummary | None: ...

    def save_nodes(self, nodes: list[GraphNode]) -> None: ...
    def save_edges(self, edges: list[GraphEdge]) -> None: ...
    def query_graph(self, filters: dict[str, Any]) -> tuple[list[GraphNode], list[GraphEdge]]: ...
```
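
What `search_chunks` promises can be sketched without LanceDB as a brute-force cosine-similarity scan (LanceDB would use its native vector index instead; this is illustrative only):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def search_brute_force(query: list[float], rows: dict[str, list[float]], top_k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the top_k
    (chunk_id, score) pairs, sorted by descending similarity."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in rows.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```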

## Error Types

```python
class AiError(Exception):
    """Base exception for AI processing errors."""

class TDocNotFoundError(AiError):
    """TDoc not found in database or has no files."""

class ExtractionError(AiError):
    """DOCX extraction failed (corrupt, password-protected, etc.)."""

class LlmConfigError(AiError):
    """LLM endpoint not configured or unreachable."""

class AiConfigError(AiError):
    """Invalid or missing AI configuration."""

class EmbeddingDimensionError(AiError):
    """Embedding model dimension mismatch with stored vectors."""
```
# CLI Contract: AI Document Processing

**Date**: 2026-02-24
**Module**: `tdoc_crawler.cli.ai`

## Command Group: `ai`

Registered as a Typer sub-app on the main `app` in `cli/app.py`:

```python
ai_app = typer.Typer(help="AI document processing commands")
app.add_typer(ai_app, name="ai", rich_help_panel="AI Commands")
```

## Commands

### `ai process`

Process TDocs through the AI pipeline.

```
tdoc-crawler ai process [--tdoc-id <ID>] [--all] [--new-only] [--json] [--cache-dir <PATH>]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--tdoc-id` | str | None | Process a single TDoc |
| `--all` | bool | False | Process all TDocs |
| `--new-only` | bool | False | Only process unprocessed TDocs (requires `--all`) |
| `--json` | bool | False | Output JSON instead of Rich text |
| `--cache-dir` | Path | None | Override cache directory |

**Text output**: Progress bar per TDoc, final summary line.
**JSON output**: `[{"tdoc_id": "...", "stages": {"classifying": "completed", ...}}]`
**Errors**: Stderr; exit code 1.

### `ai status`

Show processing status for TDocs.

```
tdoc-crawler ai status [--tdoc-id <ID>] [--json] [--cache-dir <PATH>]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--tdoc-id` | str | None | Status for a single TDoc (if omitted, shows all) |
| `--json` | bool | False | Output JSON |
| `--cache-dir` | Path | None | Override cache directory |

**Text output**: Rich table with columns: TDoc ID, Stage, Classified, Extracted,
Embedded, Summarized, Graphed, Error.
**JSON output**: `{"tdoc_id": "...", "current_stage": "...", "classified_at": "...", ...}`

### `ai query`

Semantic search over processed TDocs.

```
tdoc-crawler ai query "<question>" [--top-k <N>] [--json] [--cache-dir <PATH>]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `<question>` | str | required | Natural language query |
| `--top-k` | int | 5 | Number of results |
| `--json` | bool | False | Output JSON |
| `--cache-dir` | Path | None | Override cache directory |

**Text output**: Numbered list of matching chunks with TDoc ID, section, and snippet.
**JSON output**: `{"results": [{"tdoc_id": "...", "section": "...", "text": "...", "score": 0.85}]}`

### `ai graph`

Query the temporal knowledge graph.

```
tdoc-crawler ai graph --query "<question>" [--from <DATE>] [--to <DATE>] [--json] [--cache-dir <PATH>]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--query` | str | required | Natural language graph query |
| `--from` | str | None | Start date filter (ISO format) |
| `--to` | str | None | End date filter (ISO format) |
| `--json` | bool | False | Output JSON |
| `--cache-dir` | Path | None | Override cache directory |

**Text output**: Chronologically ordered answer with TDoc references and meeting context.
**JSON output**: `{"answer": "...", "nodes": [...], "edges": [...]}`

## Separation of Concerns

All CLI functions in `cli/ai.py` follow the delegation pattern:

1. Parse Typer arguments
2. Initialize `CacheManager` and `AiConfig`
3. Call the library function from `tdoc_crawler.ai`
4. Format output (Rich table or JSON) and print via `typer.echo`

No domain logic in the CLI module.
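
Step 4 of the pattern, sketched for the status command (`format_status` is a hypothetical helper; the field names mirror `ProcessingStatus`):

```python
import json


def format_status(row: dict, as_json: bool) -> str:
    """Hypothetical helper for step 4: the CLI layer only formats what the
    library returned. Field names mirror ProcessingStatus."""
    if as_json:
        return json.dumps(row)
    return f"{row['tdoc_id']}: {row['current_stage']}"
```
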
# Data Model: AI Document Processing Pipeline

**Date**: 2026-02-24
**Source**: [spec.md](spec.md) Key Entities + [research.md](research.md) storage decisions

## Storage Architecture

All AI data is stored in LanceDB (file-based) under `<cache_dir>/.ai/lancedb/`.
The existing tdoc-crawler SQLite database is read-only from the AI module.

```
<cache_dir>/
├── tdoc_crawler.db          # Existing SQLite DB (read-only from AI module)
├── checkout/                # Existing TDoc file checkouts
└── .ai/
    └── lancedb/
        ├── processing_status.lance
        ├── classifications.lance
        ├── chunks.lance         # Includes embedded vectors
        ├── summaries.lance
        ├── graph_nodes.lance
        └── graph_edges.lance
```
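
The layout above pins the LanceDB root to a fixed location under the cache directory; the derivation is trivial (`ai_store_path` is a hypothetical helper):

```python
from pathlib import Path


def ai_store_path(cache_dir: Path) -> Path:
    """Derive the LanceDB root from the cache directory, per the layout above."""
    return cache_dir / ".ai" / "lancedb"
```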

## Entity Definitions

### 1. AiConfig

Configuration model for the AI pipeline. Not stored in LanceDB — loaded from
a config file or environment variables.

```python
class AiConfig(BaseModel):
    """Configuration for the AI processing pipeline."""

    # Storage
    ai_store_path: Path  # Default: <cache_dir>/.ai/lancedb/

    # Extraction
    # (No configurable params for Docling in v1 — uses defaults)

    # Embeddings
    embedding_model: str = "BAAI/bge-small-en-v1.5"
    max_chunk_size: int = 1000    # tokens
    chunk_overlap: int = 100      # tokens

    # Summarization
    llm_model: str = "ollama/llama3.2"        # litellm model identifier
    llm_api_base: str | None = None           # Override for remote endpoints
    abstract_min_words: int = 150
    abstract_max_words: int = 250

    # Pipeline
    parallelism: int = 4         # Concurrent TDoc processing
```

### 2. ProcessingStatus

Tracks the current AI processing state for each TDoc.

```python
class PipelineStage(StrEnum):
    """Stages of the AI processing pipeline."""

    PENDING = "pending"
    CLASSIFYING = "classifying"
    EXTRACTING = "extracting"
    EMBEDDING = "embedding"
    SUMMARIZING = "summarizing"
    GRAPHING = "graphing"
    COMPLETED = "completed"
    FAILED = "failed"


class ProcessingStatus(BaseModel):
    """Processing state for a single TDoc."""

    tdoc_id: str                            # PK; normalized via .upper()
    current_stage: PipelineStage = PipelineStage.PENDING
    classified_at: datetime | None = None
    extracted_at: datetime | None = None
    embedded_at: datetime | None = None
    summarized_at: datetime | None = None
    graphed_at: datetime | None = None
    completed_at: datetime | None = None
    error_message: str | None = None
    source_hash: str | None = None          # Hash of source DOCX for change detection
```

**LanceDB table**: `processing_status`
**Primary key**: `tdoc_id`

### 3. DocumentClassification

Records main/secondary classification for each file in a TDoc folder.

```python
class DocumentClassification(BaseModel):
    """Classification of a file within a TDoc folder."""

    tdoc_id: str                 # FK to ProcessingStatus
    file_path: str               # Relative path within checkout folder
    is_main_document: bool
    confidence: float            # 0.0 to 1.0
    decisive_heuristic: str      # Which rule determined the classification
    file_size_bytes: int
    classified_at: datetime
```

**LanceDB table**: `classifications`
**Composite key**: `(tdoc_id, file_path)`

### 4. DocumentChunk

A segment of extracted Markdown with position metadata. Also stores the
embedding vector (combined in one table for efficient vector search).

```python
class DocumentChunk(BaseModel):
    """A chunk of extracted document text with its embedding."""

    chunk_id: str                  # "{tdoc_id}:{chunk_index}"
    tdoc_id: str                   # FK to ProcessingStatus
    section_heading: str | None    # Heading of the section this chunk belongs to
    chunk_index: int               # Position within the document
    text: str                      # The chunk text content
    char_offset_start: int         # Start offset in the full Markdown
    char_offset_end: int           # End offset in the full Markdown
    vector: list[float]            # Embedding vector (384-dim for bge-small)
    embedding_model: str           # Model used to generate the vector
    created_at: datetime
```

**LanceDB table**: `chunks`
**Primary key**: `chunk_id`
**Vector column**: `vector` (used for similarity search)

### 5. DocumentSummary

LLM-generated abstract and structured summary.

```python
class DocumentSummary(BaseModel):
    """AI-generated summary for a TDoc."""

    tdoc_id: str                     # PK; FK to ProcessingStatus
    abstract: str                    # 150-250 word abstract
    key_points: list[str]            # Bullet-point key findings
    action_items: list[str]          # Identified action items
    decisions: list[str]             # Decisions recorded in the TDoc
    affected_specs: list[str]        # Spec numbers mentioned (e.g., "TS 26.132")
    llm_model: str                   # Model used for generation
    prompt_version: str              # Version of the prompt template
    generated_at: datetime
```

**LanceDB table**: `summaries`
**Primary key**: `tdoc_id`

### 6. GraphNode

An entity in the temporal knowledge graph.

```python
class GraphNodeType(StrEnum):
    """Types of nodes in the knowledge graph."""

    TDOC = "tdoc"
    MEETING = "meeting"
    SPEC = "spec"
    WORK_ITEM = "work_item"
    CHANGE_REQUEST = "cr"
    COMPANY = "company"
    CONCEPT = "concept"


class GraphNode(BaseModel):
    """A node in the temporal knowledge graph."""

    node_id: str                     # Unique identifier (type-prefixed)
    node_type: GraphNodeType
    label: str                       # Human-readable label
    valid_from: datetime | None      # Temporal validity start
    valid_to: datetime | None        # Temporal validity end
    properties: dict[str, Any]       # Additional type-specific properties
    created_at: datetime
```

**LanceDB table**: `graph_nodes`
**Primary key**: `node_id`

### 7. GraphEdge

A typed relationship between two GraphNodes.

```python
class GraphEdgeType(StrEnum):
    """Types of edges in the knowledge graph."""

    DISCUSSES = "discusses"        # TDoc discusses a Spec/WorkItem/Concept
    REVISES = "revises"            # TDoc revises another TDoc
    REFERENCES = "references"      # TDoc references another TDoc
    SUPERSEDES = "supersedes"      # TDoc supersedes another TDoc
    AUTHORED_BY = "authored_by"    # TDoc authored by Company
    MERGED_INTO = "merged_into"    # TDoc merged into a Spec
    PRESENTED_AT = "presented_at"  # TDoc presented at Meeting


class GraphEdge(BaseModel):
    """An edge in the temporal knowledge graph."""

    edge_id: str                    # "{source_id}->{edge_type}->{target_id}"
    source_id: str                  # FK to GraphNode.node_id
    target_id: str                  # FK to GraphNode.node_id
    edge_type: GraphEdgeType
    weight: float = 1.0             # Relationship strength
    temporal_context: str | None    # Meeting or date context
    provenance: str                 # How this edge was derived
    created_at: datetime
```

**LanceDB table**: `graph_edges`
**Primary key**: `edge_id`

## Entity Relationships

```
ProcessingStatus (1) --- (0..N) DocumentClassification
ProcessingStatus (1) --- (0..N) DocumentChunk
ProcessingStatus (1) --- (0..1) DocumentSummary
GraphNode (1) --- (0..N) GraphEdge (as source)
GraphNode (1) --- (0..N) GraphEdge (as target)
```

## State Transitions

```
ProcessingStatus.current_stage:

  pending --> classifying --> extracting --> embedding --> summarizing --> graphing --> completed
     |             |              |              |              |             |
     +-------------+--------------+-------+------+--------------+-------------+
                                          |
                                          v
                                        failed
```

Each stage transition updates the corresponding `*_at` timestamp.
A `failed` status records the `error_message` and can be retried.

## Idempotency and Change Detection

- `ProcessingStatus.source_hash` stores SHA-256 of the source DOCX file
- Before processing, the current hash is compared to the stored hash
- If unchanged, the TDoc is skipped (FR-013: idempotent)
- If changed, all downstream stages are re-run from extraction

## Validation Rules

| Field | Rule |
|-------|------|
| `tdoc_id` | Normalized via `.upper()`; must match 3GPP TDoc pattern |
| `DocumentClassification.confidence` | Must be in range [0.0, 1.0] |
| `DocumentSummary.abstract` | Word count must be in [abstract_min_words, abstract_max_words] |
| `DocumentChunk.vector` | Length must match embedding model dimension |
| `GraphNode.node_id` | Must be prefixed with `node_type` (e.g., "tdoc:SP-123456") |
| `GraphEdge.edge_id` | Must follow "{source_id}->{edge_type}->{target_id}" format |
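
The two ID-format rules can be checked mechanically; the regexes below are an illustrative sketch (type prefixes taken from `GraphNodeType`, edge types from `GraphEdgeType`):

```python
import re

# Type prefixes from GraphNodeType; edge types from GraphEdgeType.
NODE_ID_RE = re.compile(r"^(tdoc|meeting|spec|work_item|cr|company|concept):.+$")
EDGE_ID_RE = re.compile(
    r"^.+->(discusses|revises|references|supersedes|authored_by|merged_into|presented_at)->.+$"
)


def valid_node_id(node_id: str) -> bool:
    """Check the type-prefixed node ID rule, e.g. "tdoc:SP-123456"."""
    return NODE_ID_RE.match(node_id) is not None


def valid_edge_id(edge_id: str) -> bool:
    """Check the "{source_id}->{edge_type}->{target_id}" rule."""
    return EDGE_ID_RE.match(edge_id) is not None
```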