Commit da21dcbf authored by Jan Reimes

feat(workspaces): update implementation plan and tasks for workspace scoping

- Refactor implementation plan to use a template format for feature branches.
- Add notes for planning workflow and technical context requirements.
- Update tasks to reflect completed user stories and implementation checkpoints.
- Enhance documentation for extraction migration strategy and graph construction.
- Revise quickstart commands for clarity and usability.
parent 6fbc6d4c
+77 −92
Original line number Diff line number Diff line
# Implementation Plan: GraphRAG Workspace Scoping
# Implementation Plan: [FEATURE]

**Branch**: `001-graphrag-workspaces` | **Date**: 2026-02-25 | **Spec**: `specs/001-graphrag-workspaces/spec.md`
**Input**: Feature specification from `specs/001-graphrag-workspaces/spec.md`
**Branch**: `[###-feature-name]` | **Date**: [DATE] | **Spec**: [link]
**Input**: Feature specification from `/specs/[###-feature-name]/spec.md`

**Note**: This template is filled in by the planning workflow.

## Summary

Add workspace/project scoping to AI/GraphRAG processing with mandatory `default` fallback.
The implementation extends existing AI pipeline/storage/CLI code to support isolated
workspace corpora and workspace-scoped artifact generation, while reusing current
pipeline stages and removing/adjusting single-workspace assumptions already present in
`src/tdoc_crawler/ai` and `src/tdoc_crawler/cli/ai.py`.
[Extract from feature spec: primary requirement + technical approach from research]

## Technical Context

**Language/Version**: Python 3.14
**Primary Dependencies**: typer, rich, pydantic, lancedb, pyarrow, sentence-transformers
**Storage**: LanceDB tables under `.ai/lancedb` + workspace file references
**Testing**: pytest (`uv run pytest`), focused AI tests under `tests/ai/`
**Target Platform**: Cross-platform CLI (Windows/Linux/macOS)
**Project Type**: Single Python repository (library + CLI)
**Performance Goals**: Workspace resolution and membership filtering add negligible overhead (target \<10% vs current single-scope flow)
**Constraints**: Maintain backward-compatible default behavior via `default` workspace; avoid breaking completed AI pipeline stages
**Scale/Scope**: Multiple workspaces per repository, each with selected subsets of TDocs/specs/other files
<!--
  ACTION REQUIRED: Replace the content in this section with the technical
  details for the project. The structure here is presented in an advisory
  capacity to guide the iteration process.
-->

**Language/Version**: [e.g., Python 3.11, Swift 5.9, Rust 1.75 or NEEDS CLARIFICATION]
**Primary Dependencies**: [e.g., FastAPI, UIKit, LLVM or NEEDS CLARIFICATION]
**Storage**: [if applicable, e.g., PostgreSQL, CoreData, files or N/A]
**Testing**: [e.g., pytest, XCTest, cargo test or NEEDS CLARIFICATION]
**Target Platform**: [e.g., Linux server, iOS 15+, WASM or NEEDS CLARIFICATION]
**Project Type**: [single/web/mobile - determines source structure]
**Performance Goals**: [domain-specific, e.g., 1000 req/s, 10k lines/sec, 60 fps or NEEDS CLARIFICATION]
**Constraints**: [domain-specific, e.g., \<200ms p95, \<100MB memory, offline-capable or NEEDS CLARIFICATION]
**Scale/Scope**: [domain-specific, e.g., 10k users, 1M LOC, 50 screens or NEEDS CLARIFICATION]

## Constitution Check

*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*

- [x] Library-first boundary documented (standalone library + integration points).
- [x] CLI contract defined (text input/output + JSON mode).
- [x] TDD evidence planned (tests written/approved + red phase before implementation).
- [x] Python standards planned (type hints, logging, uv/pyproject, Ruff, Ty, pathlib,
- [ ] Library-first boundary documented (standalone library + integration points).
- [ ] CLI contract defined (text input/output + JSON mode).
- [ ] TDD evidence planned (tests written/approved + red phase before implementation).
- [ ] Python standards planned (type hints, logging, uv/pyproject, Ruff, Ty, pathlib,
  dataclasses where appropriate, Typer CLI).
- [x] Domain placement confirmed (correct domain package; no logic in cli/ or utils/;
- [ ] Domain placement confirmed (correct domain package; no logic in cli/ or utils/;
  no legacy crawlers/ reference).
- [x] DRY validation done (existing code searched; no duplicated logic introduced).
- [x] Network access plan documented (core crawler source traffic via
- [ ] DRY validation done (existing code searched; no duplicated logic introduced).
- [ ] Network access plan documented (core crawler source traffic via
  create_cached_session(); any AI model-provider traffic explicitly documented as
  exempt with approved provider/client integration).

## Library Boundary & CLI Contract

**Library boundary (new/extended in `src/tdoc_crawler/ai/`)**

- Extend AI domain APIs to include explicit `workspace` context with default fallback.
- Reuse existing orchestration (`process_tdoc`, `process_all`, `run_pipeline`) and
  extend signatures to accept workspace/project scoping.
- Add workspace management library operations (create/list/get/add-members/remove-members)
  in AI domain package.
- Keep domain logic in `src/tdoc_crawler/ai/`; CLI remains delegation-only.
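Extending the existing orchestration signatures can be sketched with a keyword-only workspace argument that defaults to the fallback workspace; this is an illustrative sketch under that assumption, not the shipped `process_tdoc` API, whose parameters differ:

```python
# Illustrative sketch only: the real process_tdoc in src/tdoc_crawler/ai
# takes more parameters. The point is the keyword-only workspace argument
# with a default, which keeps existing single-scope call sites unchanged.
DEFAULT_WORKSPACE = "default"

def process_tdoc(tdoc_id: str, *, workspace: str = DEFAULT_WORKSPACE) -> dict:
    """Process one TDoc within the given workspace scope (stub)."""
    # Existing callers like process_tdoc("SP-123456") resolve to the
    # default workspace and keep their current behavior.
    return {"tdoc_id": tdoc_id, "workspace": workspace}
```

Making the parameter keyword-only prevents positional call sites from silently binding an unrelated argument to the workspace slot.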

**CLI contract (`src/tdoc_crawler/cli/ai.py`)**

- Workspace-aware options on relevant commands: `--workspace` (optional, default resolves to `default`).
- Add workspace management command group (text output via rich; JSON via `--json`).
- Preserve existing text output for human use and JSON mode for automation.
- The OpenAPI document in `contracts/` is a logical payload contract for CLI JSON
  request/response semantics in this feature; it does NOT require implementing an
  HTTP server in this increment.

**Existing-code consideration/removal scope**

- Keep existing single-scope flows functionally equivalent by mapping them to `default`.
- Remove implicit assumptions that a global `.ai/lancedb` is unscoped by introducing
  workspace-aware filtering/keys.
- Rename newly introduced generic interfaces to `source_item`/`workspace member` where
  semantics include non-TDoc files; keep legacy `tdoc_id` fields only where tied to existing
  pipeline identity and backward compatibility.
\[Describe the standalone library module, its public API, and the CLI entrypoints.
Include text input/output channels and JSON output mode.\]
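Workspace-aware filtering of the LanceDB tables can be done with a SQL-style predicate on a workspace column; the column name `workspace` and the helper below are assumptions for illustration, since LanceDB accepts SQL-like `where` predicates on table scans:

```python
def workspace_filter(workspace: str) -> str:
    """Build a SQL-style predicate scoping a table scan to one workspace.

    Assumes artifact tables carry a 'workspace' column; the actual
    schema in src/tdoc_crawler/ai/storage.py may differ.
    """
    # Escape single quotes so workspace names cannot break the predicate.
    escaped = workspace.replace("'", "''")
    return f"workspace = '{escaped}'"
```

Such a predicate would then be passed to the table query (e.g. a `.where(...)` clause) so that every retrieval path is scoped rather than relying on callers to filter results afterwards.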

## TDD Evidence

For each user story, tests are written and approved first, then run red before code changes:

- **US1 (workspace isolation)**: tests for workspace creation, membership registration,
  and isolation across two workspaces.
- **US2 (`default` fallback)**: tests for omitted workspace argument resolving to
  auto-created `default` in library + CLI paths.
- **US3 (workspace-scoped KB build)**: tests that process/build operations only use
  selected members from target workspace and never cross-contaminate artifacts.

Red evidence is captured by running focused tests (e.g., `tests/ai/test_ai_workspaces*.py`,
updated `tests/ai/test_ai_cli.py`, `tests/ai/test_ai_pipeline.py`) before implementation,
then green after implementation.
\[Describe the unit tests to be written first and how failing tests (red phase) will be
validated before implementation begins.\]
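The isolation property that the US1/US3 red tests assert can be modeled in a few lines; this in-memory registry is a minimal sketch of the invariant (members registered in one workspace never appear in another), not the LanceDB-backed implementation the real tests exercise:

```python
# Minimal in-memory model of workspace membership isolation.
class WorkspaceRegistry:
    def __init__(self) -> None:
        self._members: dict[str, set[str]] = {}

    def add_member(self, workspace: str, item_id: str) -> None:
        """Register a source item in exactly one workspace's corpus."""
        self._members.setdefault(workspace, set()).add(item_id)

    def members(self, workspace: str) -> set[str]:
        """Return only this workspace's members; unknown workspaces are empty."""
        return set(self._members.get(workspace, set()))
```

A red test would register items in two workspaces and assert that neither workspace's member set (and, downstream, neither workspace's artifacts) contains the other's items.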

## Project Structure

### Documentation (this feature)

```text
specs/001-graphrag-workspaces/
specs/[###-feature]/
├── plan.md              # This file (/speckit.plan command output)
├── research.md          # Phase 0 output (/speckit.plan command)
├── data-model.md        # Phase 1 output (/speckit.plan command)
```

@@ -99,42 +69,57 @@ specs/001-graphrag-workspaces/

### Source Code (repository root)

<!--
  ACTION REQUIRED: Replace the placeholder tree below with the concrete layout
  for this feature. Delete unused options and expand the chosen structure with
  real paths (e.g., apps/admin, packages/something). The delivered plan must
  not include Option labels.
-->

```text
# [REMOVE IF UNUSED] Option 1: Single project (DEFAULT)
src/
├── tdoc_crawler/
│   ├── ai/
│   │   ├── __init__.py
│   │   ├── models.py
│   │   ├── storage.py
│   │   └── operations/
│   │       ├── pipeline.py
│   │       ├── extract.py
│   │       └── ...
│   └── cli/
│       ├── ai.py
│       └── app.py
├── models/
├── services/
├── cli/
└── lib/

tests/
└── ai/
    ├── test_ai_cli.py
    ├── test_ai_pipeline.py
    ├── test_ai_storage.py
    └── (new) test_ai_workspaces.py
├── contract/
├── integration/
└── unit/

# [REMOVE IF UNUSED] Option 2: Web application (when "frontend" + "backend" detected)
backend/
├── src/
│   ├── models/
│   ├── services/
│   └── api/
└── tests/

frontend/
├── src/
│   ├── components/
│   ├── pages/
│   └── services/
└── tests/

# [REMOVE IF UNUSED] Option 3: Mobile + API (when "iOS/Android" detected)
api/
└── [same as backend above]

ios/ or android/
└── [platform-specific structure: feature modules, UI flows, platform tests]
```

**Structure Decision**: Extend existing AI domain and AI CLI modules in-place; add
workspace-specific models/operations without creating parallel subsystems.
**Structure Decision**: \[Document the selected structure and reference the real
directories captured above\]

## Complexity Tracking

No constitution violations identified.

## Post-Design Constitution Re-Check
> **Fill ONLY if Constitution Check has violations that must be justified**

- [x] Library-first boundary remains intact after design artifacts.
- [x] CLI text/JSON contract is preserved in contract and quickstart artifacts.
- [x] TDD-first flow remains mandatory in quickstart/test plan.
- [x] Python standards/tooling expectations remain unchanged.
- [x] Domain placement keeps logic in AI domain and CLI delegation-only.
- [x] DRY strategy is explicit (reuse existing pipeline/storage/CLI; avoid parallel systems).
- [x] Network policy remains compliant (no new core-crawler HTTP sources introduced).
| Violation | Why Needed | Simpler Alternative Rejected Because |
|-----------|------------|-------------------------------------|
| [e.g., 4th project] | [current need] | [why 3 projects insufficient] |
| [e.g., Repository pattern] | [specific problem] | [why direct DB access insufficient] |
+26 −23
@@ -69,14 +69,20 @@ ______________________________________________________________________

### Tests for User Story 2 (REQUIRED) ⚠️

- [ ] T022 [P] [US2] Add red library tests for omitted/blank workspace fallback in `tests/ai/test_ai_workspaces.py`
- [x] T022 [P] [US2] Add red library tests for omitted/blank workspace fallback in `tests/ai/test_ai_workspaces.py`
- [x] T023 [P] [US2] Add red CLI tests for implicit `default` resolution in `tests/ai/test_ai_cli.py`
- [x] T024 [US2] Obtain user approval for failing red-test evidence before starting implementation in `tests/ai/test_ai_workspaces.py` and `tests/ai/test_ai_cli.py`
- [x] T025 [US2] Re-run existing-implementation search and record reuse/refactor decisions in `specs/001-graphrag-workspaces/research.md` for `src/tdoc_crawler/ai/__init__.py`, `src/tdoc_crawler/ai/storage.py`, and `src/tdoc_crawler/cli/ai.py`
- [ ] T023 [P] [US2] Add red CLI tests for implicit `default` resolution in `tests/ai/test_ai_cli.py`
- [ ] T024 [US2] Obtain user approval for failing red-test evidence before starting implementation in `tests/ai/test_ai_workspaces.py` and `tests/ai/test_ai_cli.py`
- [ ] T025 [US2] Re-run existing-implementation search and record reuse/refactor decisions in `specs/001-graphrag-workspaces/research.md` for `src/tdoc_crawler/ai/__init__.py`, `src/tdoc_crawler/ai/storage.py`, and `src/tdoc_crawler/cli/ai.py`

### Implementation for User Story 2

- [ ] T026 [US2] Implement `default` auto-create and normalization logic in `src/tdoc_crawler/ai/operations/workspaces.py`
- [x] T026 [US2] Implement `default` auto-create and normalization logic in `src/tdoc_crawler/ai/operations/workspaces.py`
- [x] T027 [US2] Apply fallback behavior in public AI APIs (`process_tdoc`, `process_all`, `get_status`) in `src/tdoc_crawler/ai/__init__.py`
- [x] T028 [US2] Apply fallback behavior in CLI entrypoints (`process`, `status`, `workspace` commands) in `src/tdoc_crawler/cli/ai.py`
- [x] T029 [US2] Enforce default fallback for blank/unset workspace values in `src/tdoc_crawler/ai/storage.py`
- [ ] T027 [US2] Apply fallback behavior in public AI APIs (`process_tdoc`, `process_all`, `get_status`) in `src/tdoc_crawler/ai/__init__.py`
- [ ] T028 [US2] Apply fallback behavior in CLI entrypoints (`process`, `status`, `workspace` commands) in `src/tdoc_crawler/cli/ai.py`
- [ ] T029 [US2] Enforce default fallback for blank/unset workspace values in `src/tdoc_crawler/ai/storage.py`
@@ -93,35 +99,32 @@ ______________________________________________________________________

### Tests for User Story 3 (REQUIRED) ⚠️

- [ ] T030 [P] [US3] Add red pipeline isolation tests for workspace-scoped processing in `tests/ai/test_ai_pipeline.py`
- [ ] T031 [P] [US3] Add red contract tests for workspace-scoped process/status behaviors in `tests/ai/test_ai_workspace_contract.py`
- [ ] T032 [P] [US3] Add red storage scope tests for status/chunk/summary retrieval in `tests/ai/test_ai_storage.py`
- [ ] T033 [US3] Obtain user approval for failing red-test evidence before starting implementation in `tests/ai/test_ai_pipeline.py`, `tests/ai/test_ai_workspace_contract.py`, and `tests/ai/test_ai_storage.py`
- [ ] T034 [US3] Re-run existing-implementation search and record reuse/refactor decisions in `specs/001-graphrag-workspaces/research.md` for `src/tdoc_crawler/ai/operations/pipeline.py`, `src/tdoc_crawler/ai/storage.py`, and `src/tdoc_crawler/cli/ai.py`
- [x] T030 [P] [US3] Add red pipeline isolation tests for workspace-scoped processing in `tests/ai/test_ai_pipeline.py`
- [x] T031 [P] [US3] Add red contract tests for workspace-scoped process/status behaviors in `tests/ai/test_ai_workspace_contract.py`
- [x] T032 [P] [US3] Add red storage scope tests for status/chunk/summary retrieval in `tests/ai/test_ai_storage.py`
- [x] T033 [US3] Obtain user approval for failing red-test evidence before starting implementation in `tests/ai/test_ai_pipeline.py`, `tests/ai/test_ai_workspace_contract.py`, and `tests/ai/test_ai_storage.py`
- [x] T034 [US3] Re-run existing-implementation search and record reuse/refactor decisions in `specs/001-graphrag-workspaces/research.md` for `src/tdoc_crawler/ai/operations/pipeline.py`, `src/tdoc_crawler/ai/storage.py`, and `src/tdoc_crawler/cli/ai.py`

### Implementation for User Story 3

- [ ] T035 [US3] Add workspace parameter support to orchestration APIs in `src/tdoc_crawler/ai/operations/pipeline.py`
- [ ] T036 [US3] Scope process-all input resolution to workspace member corpus in `src/tdoc_crawler/ai/operations/pipeline.py` and `src/tdoc_crawler/ai/operations/workspaces.py`
- [ ] T037 [US3] Add workspace association fields/validation to artifact models in `src/tdoc_crawler/ai/models.py`
- [ ] T038 [US3] Persist and filter artifact tables by workspace in `src/tdoc_crawler/ai/storage.py`
- [ ] T039 [US3] Pass workspace scope from CLI process/status/query commands to library in `src/tdoc_crawler/cli/ai.py`
- [ ] T040 [US3] Remove or retire legacy unscoped artifact access paths in `src/tdoc_crawler/ai/storage.py` and `src/tdoc_crawler/ai/operations/pipeline.py`

- [x] T035 [US3] Add workspace parameter support to orchestration APIs in `src/tdoc_crawler/ai/operations/pipeline.py`
- [x] T036 [US3] Scope process-all input resolution to workspace member corpus in `src/tdoc_crawler/ai/operations/pipeline.py` and `src/tdoc_crawler/ai/operations/workspaces.py`
- [x] T037 [US3] Add workspace association fields/validation to artifact models in `src/tdoc_crawler/ai/models.py`
- [x] T038 [US3] Persist and filter artifact tables by workspace in `src/tdoc_crawler/ai/storage.py`
- [x] T039 [US3] Pass workspace scope from CLI process/status/query commands to library in `src/tdoc_crawler/cli/ai.py`
- [x] T040 [US3] Remove or retire legacy unscoped artifact access paths in `src/tdoc_crawler/ai/storage.py` and `src/tdoc_crawler/ai/operations/pipeline.py`

**Checkpoint**: All user stories are independently functional.

______________________________________________________________________

## Phase 6: Polish & Cross-Cutting Concerns

**Purpose**: Final consistency checks, documentation sync, and quality gates.

- [ ] T041 [P] Update workspace quickstart and command examples in `specs/001-graphrag-workspaces/quickstart.md`
- [ ] T042 [P] Update design notes for final naming and compatibility decisions in `specs/001-graphrag-workspaces/research.md`
- [ ] T043 [P] Sync finalized contract examples and non-functional descriptions in `specs/001-graphrag-workspaces/contracts/workspace-api.openapi.yaml`
- [ ] T044 Run Ruff/Ty fixes for touched modules in `src/tdoc_crawler/ai/models.py`, `src/tdoc_crawler/ai/storage.py`, `src/tdoc_crawler/ai/operations/pipeline.py`, `src/tdoc_crawler/ai/operations/workspaces.py`, and `src/tdoc_crawler/cli/ai.py`
- [ ] T045 [P] Execute and stabilize focused AI tests in `tests/ai/test_ai_workspaces.py`, `tests/ai/test_ai_workspace_contract.py`, `tests/ai/test_ai_pipeline.py`, `tests/ai/test_ai_storage.py`, and `tests/ai/test_ai_cli.py`
- [ ] T046 [P] Validate SC-003 performance and scale: generate test dataset (30+ source items across 3+ workspaces, 8+ items per workspace, mixed docx/pdf/md/txt), measure workspace creation + corpus registration time, verify completion under 2 minutes per SC-003
- [x] T041 [P] Update workspace quickstart and command examples in `specs/001-graphrag-workspaces/quickstart.md`
- [x] T042 [P] Update design notes for final naming and compatibility decisions in `specs/001-graphrag-workspaces/research.md`
- [x] T043 [P] Sync finalized contract examples and non-functional descriptions in `specs/001-graphrag-workspaces/contracts/workspace-api.openapi.yaml`
- [x] T044 Run Ruff/Ty fixes for touched modules in `src/tdoc_crawler/ai/models.py`, `src/tdoc_crawler/ai/storage.py`, `src/tdoc_crawler/ai/operations/pipeline.py`, `src/tdoc_crawler/ai/operations/workspaces.py`, and `src/tdoc_crawler/cli/ai.py`
- [x] T045 [P] Execute and stabilize focused AI tests in `tests/ai/test_ai_workspaces.py`, `tests/ai/test_ai_workspace_contract.py`, `tests/ai/test_ai_pipeline.py`, `tests/ai/test_ai_storage.py`, and `tests/ai/test_ai_cli.py`
- [x] T046 [P] Validate SC-003 performance and scale

______________________________________________________________________

+4 −1
@@ -122,7 +122,10 @@ def extract_docx_to_markdown(
    docx_path: Path,
    output_dir: Path,
) -> Path:
    """Convert a DOCX file to Markdown using Docling.
    """Convert a DOCX file to Markdown using extraction library.

    Note: Uses Docling in Phases 1-8. Replaced by Kreuzberg in Phase 9 with full refactoring (no backward compatibility).

    Args:
        docx_path: Path to the source DOCX file.
+2 −1
@@ -37,7 +37,8 @@ class AiConfig(BaseModel):
    ai_store_path: Path  # Default: <cache_dir>/.ai/lancedb/

    # Extraction
    # (No configurable params for Docling in v1 — uses defaults)
    # Extraction
    # (No configurable params for extraction in v1 — uses defaults. Note: Docling used in Phases 1-8, replaced by Kreuzberg in Phase 9 with full refactoring)

    # Embeddings
    embedding_model: str = "BAAI/bge-small-en-v1.5"
+4 −4
@@ -19,7 +19,7 @@

```bash
# Extract DOCX to Markdown and classify files
tdoc-crawler ai process --tdoc-id SP-123456
tdoc-crawler ai process --tdoc-id SP-123456 --checkout-path /path/to/SP-123456
```

Output:
@@ -38,10 +38,10 @@ Completed SP-123456 in 8.4s

```bash
# Process all downloaded TDocs
tdoc-crawler ai process --all
tdoc-crawler ai process --all --checkout-base /path/to/checkout

# Process only new (unprocessed) TDocs
tdoc-crawler ai process --all --new-only
tdoc-crawler ai process --all --checkout-base /path/to/checkout --new-only
```

## Check Status
@@ -58,7 +58,7 @@ tdoc-crawler ai status

```bash
# Search across all processed TDocs
tdoc-crawler ai query "uplink power control enhancements"
tdoc-crawler ai query --query "uplink power control enhancements"

# Get results as JSON
tdoc-crawler ai query "uplink power control" --json --top-k 10