Commit c2b3731a authored by Jan Reimes

feat(cli): enhance query command with tdoc-id flag for targeted searches

- Added `--tdoc-id <ID>` option to the `ai query` command to limit searches to a specific TDoc.
- Updated command documentation to reflect the new flag.
- Adjusted command output structure to include TDoc ID in results.
parent ebf803a2
+416 −0

File added.


+2 −1
@@ -57,12 +57,13 @@ Embedded, Summarized, Graphed, Error.
Semantic search over processed TDocs.

```
-tdoc-crawler ai query "<question>" [--top-k <N>] [--json] [--cache-dir <PATH>]
+tdoc-crawler ai query "<question>" [--tdoc-id <ID>] [--top-k <N>] [--json] [--cache-dir <PATH>]
```

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `<question>` | str | required | Natural language query |
+| `--tdoc-id` | str | None | Limit search to a single TDoc |
| `--top-k` | int | 5 | Number of results |
| `--json` | bool | False | Output JSON |
| `--cache-dir` | Path | None | Override cache directory |
+2 −2
@@ -43,7 +43,7 @@ AI dependencies are packaged as optional extras (`tdoc_crawler[ai]`).
    `logging` module. Optional extras in pyproject.toml. Ruff + Ty clean.
- [x] Domain placement confirmed (correct domain package; no logic in cli/ or utils/;
  no legacy crawlers/ reference).
-  - All AI logic in `src/tdoc_crawler/ai/` with `operations/`, `sources/`, `models.py`.
+  - All AI logic in `src/tdoc_crawler/ai/` with `operations/` and `models.py`.
    CLI file `cli/ai.py` delegates only. No crawlers/ references.
- [x] DRY validation done (existing code searched; no duplicated logic introduced).
  - Reuses: `CacheManager` for path resolution, `create_cached_session()` for HTTP,
@@ -91,7 +91,7 @@ CLI functions only parse arguments and delegate to the library API.
| `ai process --tdoc-id <id>` | TDoc ID | Progress + summary line | `{"tdoc_id": ..., "stages": {...}}` |
| `ai process --all [--new-only]` | Flags | Progress bar + counts | `[{"tdoc_id": ..., "stages": {...}}, ...]` |
| `ai status --tdoc-id <id>` | TDoc ID | Stage table | `{"tdoc_id": ..., "stages": {...}}` |
-| `ai query "<question>"` | Query text | Answer + source refs | `{"answer": ..., "sources": [...]}` |
+| `ai query "<question>" [--tdoc-id <id>]` | Query text | Answer + source refs | `{"answer": ..., "sources": [...]}` |
| `ai graph --query "<question>"` | Query text | Chronological chain | `{"nodes": [...], "edges": [...]}` |

All commands accept `--json` for structured output. Errors go to stderr.
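
The `--json` envelope for `ai query` can be sketched as a tiny renderer; the function name and the human-readable fallback format are illustrative, only the JSON keys come from the table above:

```python
import json


def render_query_result(answer: str, sources: list[dict], as_json: bool) -> str:
    """Render an ai-query result as the JSON envelope or a plain-text summary."""
    if as_json:
        return json.dumps({"answer": answer, "sources": sources})
    refs = ", ".join(str(s.get("tdoc_id", "?")) for s in sources)
    return f"{answer}\nSources: {refs}"
```

Keeping the envelope in one place makes it easy for scripts to rely on a stable `{"answer": ..., "sources": [...]}` shape.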
+10 −12
@@ -213,8 +213,7 @@ ______________________________________________________________________
### Edge Cases

- What happens when a TDoc download folder contains no DOCX files (only PDF or ZIP)?
-  The system should log a warning and skip extraction, marking the TDoc as
-  `extraction_skipped`.
+  The system should log a warning and skip extraction, marking the TDoc as `failed`.
- What happens when the configured LLM endpoint is unreachable? The summarization and
  graph stages should fail gracefully with a clear error; earlier stages (extraction,
  embeddings) should still succeed.
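
The "fail gracefully, keep earlier results" behavior can be sketched as stage isolation; `run_stages` and the stage names here are assumptions for illustration, not the project's pipeline code:

```python
def run_stages(stages):
    """Run (name, zero-arg callable) pairs in order, recording per-stage outcomes.

    A failure (e.g. an unreachable LLM endpoint) is recorded and stops the
    pipeline, but results of stages that already completed are preserved.
    """
    state = {}
    for name, fn in stages:
        try:
            state[name] = ("ok", str(fn()))
        except Exception as exc:
            # Record the failure and stop: later stages depend on this one.
            state[name] = ("error", str(exc))
            break
    return state
```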
@@ -259,9 +258,10 @@ ______________________________________________________________________
- **FR-011**: Before implementing any new functionality, existing code MUST be searched
  for equivalent implementations; CLI code MUST only delegate to core library functions
  and MUST NOT duplicate domain logic.
-- **FR-012**: All HTTP requests (to LLM endpoints, embedding services) MUST use
-  `create_cached_session()` from `tdoc_crawler.http_client` where applicable; direct
-  HTTP calls to external services are prohibited.
+- **FR-012**: All HTTP requests that are deterministic and cacheable MUST use
+  `create_cached_session()` from `tdoc_crawler.http_client`; direct HTTP calls to
+  external services are prohibited. LLM calls via litellm are exempt from caching, and
+  local embedding generation does not use HTTP.
- **FR-013**: All AI processing stages MUST be idempotent: re-running a stage on an
  already-processed TDoc with unchanged inputs MUST produce no side effects.
- **FR-014**: The pipeline MUST support incremental processing: only new or updated
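
The idempotency requirement (FR-013) amounts to checking a completion record before doing work. A sketch, with an in-memory ledger standing in for the real processing-status storage (an assumption for illustration):

```python
def run_stage_once(tdoc_id: str, stage: str, ledger: set, run) -> bool:
    """Run a stage only if (tdoc_id, stage) is not already recorded.

    Returns True if the stage ran, False if it was skipped as already done.
    """
    key = (tdoc_id, stage)
    if key in ledger:
        return False  # unchanged inputs: re-running has no side effects
    run()
    ledger.add(key)
    return True
```

The same check also gives incremental processing (FR-014) for free: `--new-only` is just "skip every TDoc whose stages are all recorded".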
@@ -286,17 +286,15 @@ ______________________________________________________________________
- **DocumentClassification**: Records main/secondary classification for each file in a
  TDoc folder. Includes confidence score, decisive heuristic, and file path.
- **DocumentChunk**: A segment of extracted Markdown text with position metadata (TDoc ID,
-  section heading, chunk index, character offsets). Input for embeddings.
-- **ChunkEmbedding**: A vector representation of a DocumentChunk. Includes the embedding
-  model version, vector dimension, and creation timestamp.
+  section heading, chunk index, character offsets) and its embedding vector metadata.
- **DocumentSummary**: The LLM-generated abstract and structured summary for a TDoc.
  Includes model ID, prompt version, and generation timestamp.
- **GraphNode**: An entity in the knowledge graph (TDoc, Meeting, Spec, WorkItem, CR,
-  Company, Person, or Concept). Has temporal validity fields (valid_from, valid_to) and
-  a type discriminator.
+  Company, or Concept). Has temporal validity fields (valid_from, valid_to) and a type
+  discriminator.
- **GraphEdge**: A typed relationship between two GraphNodes (discusses, revises,
-  references, supersedes, authored_by, merged_into). Has weight, temporal context, and
-  source provenance.
+  references, supersedes, authored_by, merged_into, presented_at). Has weight, temporal
+  context, and source provenance.

## Assumptions

+8 −7
@@ -19,9 +19,9 @@

**Purpose**: Create the ai/ package skeleton, declare AI dependencies, and prepare test fixtures

-- [ ] T001 [P] Create ai/ package structure with `__init__.py` in src/tdoc_crawler/ai/ and src/tdoc_crawler/ai/operations/
-- [ ] T002 [P] Add optional `[ai]` dependency group (docling, lancedb, litellm, sentence-transformers) to pyproject.toml
-- [ ] T003 [P] Create test fixture directory tests/data/ai/ with sample DOCX files (single-file TDoc, multi-file TDoc with cover note, corrupt/empty file)
+- [x] T001 [P] Create ai/ package structure with `__init__.py` in src/tdoc_crawler/ai/ and src/tdoc_crawler/ai/operations/
+- [x] T002 [P] Add optional `[ai]` dependency group (docling, lancedb, litellm, sentence-transformers) to pyproject.toml
+- [x] T003 [P] Create test fixture directory tests/data/ai/ with sample DOCX files (single-file TDoc, multi-file TDoc with cover note, corrupt/empty file)

______________________________________________________________________

@@ -31,10 +31,10 @@ ______________________________________________________________________

**CRITICAL**: No user story work can begin until this phase is complete

-- [ ] T004 Implement all pydantic models, enums (PipelineStage, GraphNodeType, GraphEdgeType), and error types (AiError, TDocNotFoundError, ExtractionError, LlmConfigError, AiConfigError, EmbeddingDimensionError) per data-model.md in src/tdoc_crawler/ai/models.py
-- [ ] T005 [P] Implement AiConfig configuration model with environment variable loading and defaults per data-model.md in src/tdoc_crawler/ai/config.py
-- [ ] T006 Implement AiStorage class with LanceDB connection, table initialization (processing_status, classifications, chunks, summaries, graph_nodes, graph_edges), and all CRUD methods per contracts/api.md in src/tdoc_crawler/ai/storage.py
-- [ ] T007 Create public API module with stub functions (process_tdoc, process_all, get_status, query_embeddings, query_graph) per contracts/api.md in src/tdoc_crawler/ai/__init__.py
+- [x] T004 Implement all pydantic models, enums (PipelineStage, GraphNodeType, GraphEdgeType), and error types (AiError, TDocNotFoundError, ExtractionError, LlmConfigError, AiConfigError, EmbeddingDimensionError) per data-model.md in src/tdoc_crawler/ai/models.py
+- [x] T005 [P] Implement AiConfig configuration model with environment variable loading and defaults per data-model.md in src/tdoc_crawler/ai/config.py
+- [x] T006 Implement AiStorage class with LanceDB connection, table initialization (processing_status, classifications, chunks, summaries, graph_nodes, graph_edges), and all CRUD methods per contracts/api.md in src/tdoc_crawler/ai/storage.py
+- [x] T007 Create public API module with stub functions (process_tdoc, process_all, get_status, query_embeddings, query_graph) per contracts/api.md in src/tdoc_crawler/ai/__init__.py

**Checkpoint**: Foundation ready — user story implementation can now begin

@@ -191,6 +191,7 @@ ______________________________________________________________________
- [ ] T026 [P] Run Ruff and Ty checks, fix all lint and type errors across src/tdoc_crawler/ai/ and tests/test_ai\_\*.py
- [ ] T027 [P] Update docs/index.md and relevant docs/ files with AI commands reference documentation
- [ ] T028 Run quickstart.md validation: execute all CLI examples from specs/002-ai-document-processing/quickstart.md and verify outputs
+- [ ] T028b Run success criteria validation against spec.md success criteria and record outcomes in docs/history/
- [ ] T029 [P] Add @pytest.mark.integration markers for tests requiring real AI models (Docling, sentence-transformers, litellm) in tests/
- [ ] T030 Run full test suite with uv run pytest -v and verify all tests pass