Commit c2b3731a authored by Jan Reimes

feat(cli): enhance query command with tdoc-id flag for targeted searches

- Added `--tdoc-id <ID>` option to the `ai query` command to limit searches to a specific TDoc.
- Updated command documentation to reflect the new flag.
- Adjusted command output structure to include TDoc ID in results.
parent ebf803a2
+416 −0

File added.


+2 −1
@@ -57,12 +57,13 @@ Embedded, Summarized, Graphed, Error.
Semantic search over processed TDocs.

```
-tdoc-crawler ai query "<question>" [--top-k <N>] [--json] [--cache-dir <PATH>]
+tdoc-crawler ai query "<question>" [--tdoc-id <ID>] [--top-k <N>] [--json] [--cache-dir <PATH>]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `<question>` | str | required | Natural language query |
+| `--tdoc-id` | str | None | Limit search to a single TDoc |
| `--top-k` | int | 5 | Number of results |
| `--json` | bool | False | Output JSON |
| `--cache-dir` | Path | None | Override cache directory |
+2 −2
@@ -43,7 +43,7 @@ AI dependencies are packaged as optional extras (`tdoc_crawler[ai]`).
    `logging` module. Optional extras in pyproject.toml. Ruff + Ty clean.
- [x] Domain placement confirmed (correct domain package; no logic in cli/ or utils/;
  no legacy crawlers/ reference).
-  - All AI logic in `src/tdoc_crawler/ai/` with `operations/`, `sources/`, `models.py`.
+  - All AI logic in `src/tdoc_crawler/ai/` with `operations/` and `models.py`.
    CLI file `cli/ai.py` delegates only. No crawlers/ references.
- [x] DRY validation done (existing code searched; no duplicated logic introduced).
  - Reuses: `CacheManager` for path resolution, `create_cached_session()` for HTTP,
@@ -91,7 +91,7 @@ CLI functions only parse arguments and delegate to the library API.
| `ai process --tdoc-id <id>` | TDoc ID | Progress + summary line | `{"tdoc_id": ..., "stages": {...}}` |
| `ai process --all [--new-only]` | Flags | Progress bar + counts | `[{"tdoc_id": ..., "stages": {...}}, ...]` |
| `ai status --tdoc-id <id>` | TDoc ID | Stage table | `{"tdoc_id": ..., "stages": {...}}` |
-| `ai query "<question>"` | Query text | Answer + source refs | `{"answer": ..., "sources": [...]}` |
+| `ai query "<question>" [--tdoc-id <id>]` | Query text | Answer + source refs | `{"answer": ..., "sources": [...]}` |
| `ai graph --query "<question>"` | Query text | Chronological chain | `{"nodes": [...], "edges": [...]}` |

All commands accept `--json` for structured output. Errors go to stderr.
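
The `--json` envelope for `ai query` can be sketched as a tiny renderer; the function name and the human-readable fallback format are illustrative, only the JSON keys come from the table above:

```python
import json


def render_query_result(answer: str, sources: list[dict], as_json: bool) -> str:
    """Render an ai-query result as the JSON envelope or a plain-text summary."""
    if as_json:
        return json.dumps({"answer": answer, "sources": sources})
    refs = ", ".join(str(s.get("tdoc_id", "?")) for s in sources)
    return f"{answer}\nSources: {refs}"
```

Keeping the envelope in one place makes it easy for scripts to rely on a stable `{"answer": ..., "sources": [...]}` shape.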
+10 −12
@@ -213,8 +213,7 @@ ______________________________________________________________________
### Edge Cases

- What happens when a TDoc download folder contains no DOCX files (only PDF or ZIP)?
-  The system should log a warning and skip extraction, marking the TDoc as
-  `extraction_skipped`.
+  The system should log a warning and skip extraction, marking the TDoc as `failed`.
- What happens when the configured LLM endpoint is unreachable? The summarization and
  graph stages should fail gracefully with a clear error; earlier stages (extraction,
  embeddings) should still succeed.
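
The "fail gracefully, keep earlier results" behavior can be sketched as stage isolation; `run_stages` and the stage names here are assumptions for illustration, not the project's pipeline code:

```python
def run_stages(stages):
    """Run (name, zero-arg callable) pairs in order, recording per-stage outcomes.

    A failure (e.g. an unreachable LLM endpoint) is recorded and stops the
    pipeline, but results of stages that already completed are preserved.
    """
    state = {}
    for name, fn in stages:
        try:
            state[name] = ("ok", str(fn()))
        except Exception as exc:
            # Record the failure and stop: later stages depend on this one.
            state[name] = ("error", str(exc))
            break
    return state
```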
@@ -259,9 +258,10 @@ ______________________________________________________________________
- **FR-011**: Before implementing any new functionality, existing code MUST be searched
  for equivalent implementations; CLI code MUST only delegate to core library functions
  and MUST NOT duplicate domain logic.
-- **FR-012**: All HTTP requests (to LLM endpoints, embedding services) MUST use
-  `create_cached_session()` from `tdoc_crawler.http_client` where applicable; direct
-  HTTP calls to external services are prohibited.
+- **FR-012**: All HTTP requests that are deterministic and cacheable MUST use
+  `create_cached_session()` from `tdoc_crawler.http_client`; direct HTTP calls to
+  external services are prohibited. LLM calls via litellm are exempt from caching, and
+  local embedding generation does not use HTTP.
- **FR-013**: All AI processing stages MUST be idempotent: re-running a stage on an
  already-processed TDoc with unchanged inputs MUST produce no side effects.
- **FR-014**: The pipeline MUST support incremental processing: only new or updated
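
The idempotency requirement (FR-013) amounts to checking a completion record before doing work. A sketch, with an in-memory ledger standing in for the real processing-status storage (an assumption for illustration):

```python
def run_stage_once(tdoc_id: str, stage: str, ledger: set, run) -> bool:
    """Run a stage only if (tdoc_id, stage) is not already recorded.

    Returns True if the stage ran, False if it was skipped as already done.
    """
    key = (tdoc_id, stage)
    if key in ledger:
        return False  # unchanged inputs: re-running has no side effects
    run()
    ledger.add(key)
    return True
```

The same check also gives incremental processing (FR-014) for free: `--new-only` is just "skip every TDoc whose stages are all recorded".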
@@ -286,17 +286,15 @@ ______________________________________________________________________
- **DocumentClassification**: Records main/secondary classification for each file in a
  TDoc folder. Includes confidence score, decisive heuristic, and file path.
- **DocumentChunk**: A segment of extracted Markdown text with position metadata (TDoc ID,
-  section heading, chunk index, character offsets). Input for embeddings.
-- **ChunkEmbedding**: A vector representation of a DocumentChunk. Includes the embedding
-  model version, vector dimension, and creation timestamp.
+  section heading, chunk index, character offsets) and its embedding vector metadata.
- **DocumentSummary**: The LLM-generated abstract and structured summary for a TDoc.
  Includes model ID, prompt version, and generation timestamp.
- **GraphNode**: An entity in the knowledge graph (TDoc, Meeting, Spec, WorkItem, CR,
-  Company, Person, or Concept). Has temporal validity fields (valid_from, valid_to) and
-  a type discriminator.
+  Company, or Concept). Has temporal validity fields (valid_from, valid_to) and a type
+  discriminator.
- **GraphEdge**: A typed relationship between two GraphNodes (discusses, revises,
-  references, supersedes, authored_by, merged_into). Has weight, temporal context, and
-  source provenance.
+  references, supersedes, authored_by, merged_into, presented_at). Has weight, temporal
+  context, and source provenance.

## Assumptions

+8 −7
@@ -19,9 +19,9 @@

**Purpose**: Create the ai/ package skeleton, declare AI dependencies, and prepare test fixtures

-- [ ] T001 [P] Create ai/ package structure with `__init__.py` in src/tdoc_crawler/ai/ and src/tdoc_crawler/ai/operations/
-- [ ] T002 [P] Add optional `[ai]` dependency group (docling, lancedb, litellm, sentence-transformers) to pyproject.toml
-- [ ] T003 [P] Create test fixture directory tests/data/ai/ with sample DOCX files (single-file TDoc, multi-file TDoc with cover note, corrupt/empty file)
+- [x] T001 [P] Create ai/ package structure with `__init__.py` in src/tdoc_crawler/ai/ and src/tdoc_crawler/ai/operations/
+- [x] T002 [P] Add optional `[ai]` dependency group (docling, lancedb, litellm, sentence-transformers) to pyproject.toml
+- [x] T003 [P] Create test fixture directory tests/data/ai/ with sample DOCX files (single-file TDoc, multi-file TDoc with cover note, corrupt/empty file)

______________________________________________________________________

@@ -31,10 +31,10 @@ ______________________________________________________________________

**CRITICAL**: No user story work can begin until this phase is complete

-- [ ] T004 Implement all pydantic models, enums (PipelineStage, GraphNodeType, GraphEdgeType), and error types (AiError, TDocNotFoundError, ExtractionError, LlmConfigError, AiConfigError, EmbeddingDimensionError) per data-model.md in src/tdoc_crawler/ai/models.py
-- [ ] T005 [P] Implement AiConfig configuration model with environment variable loading and defaults per data-model.md in src/tdoc_crawler/ai/config.py
-- [ ] T006 Implement AiStorage class with LanceDB connection, table initialization (processing_status, classifications, chunks, summaries, graph_nodes, graph_edges), and all CRUD methods per contracts/api.md in src/tdoc_crawler/ai/storage.py
-- [ ] T007 Create public API module with stub functions (process_tdoc, process_all, get_status, query_embeddings, query_graph) per contracts/api.md in src/tdoc_crawler/ai/__init__.py
+- [x] T004 Implement all pydantic models, enums (PipelineStage, GraphNodeType, GraphEdgeType), and error types (AiError, TDocNotFoundError, ExtractionError, LlmConfigError, AiConfigError, EmbeddingDimensionError) per data-model.md in src/tdoc_crawler/ai/models.py
+- [x] T005 [P] Implement AiConfig configuration model with environment variable loading and defaults per data-model.md in src/tdoc_crawler/ai/config.py
+- [x] T006 Implement AiStorage class with LanceDB connection, table initialization (processing_status, classifications, chunks, summaries, graph_nodes, graph_edges), and all CRUD methods per contracts/api.md in src/tdoc_crawler/ai/storage.py
+- [x] T007 Create public API module with stub functions (process_tdoc, process_all, get_status, query_embeddings, query_graph) per contracts/api.md in src/tdoc_crawler/ai/__init__.py

**Checkpoint**: Foundation ready — user story implementation can now begin

@@ -191,6 +191,7 @@ ______________________________________________________________________
- [ ] T026 [P] Run Ruff and Ty checks, fix all lint and type errors across src/tdoc_crawler/ai/ and tests/test_ai\_\*.py
- [ ] T027 [P] Update docs/index.md and relevant docs/ files with AI commands reference documentation
- [ ] T028 Run quickstart.md validation: execute all CLI examples from specs/002-ai-document-processing/quickstart.md and verify outputs
+- [ ] T028b Run success criteria validation against spec.md success criteria and record outcomes in docs/history/
- [ ] T029 [P] Add @pytest.mark.integration markers for tests requiring real AI models (Docling, sentence-transformers, litellm) in tests/
- [ ] T030 Run full test suite with uv run pytest -v and verify all tests pass