Commit 709c047a authored by Jan Reimes

docs(02-checkout-graph-deprecation-config): create phase plans for checkout, graph, deprecation, and config fixes

parent eaebc5ba
.planning/ROADMAP.md

# Roadmap — 3GPP Crawler Codebase Improvement

## Phases

- [x] **Phase 01: Normalization & Progress Bars** - Consolidate duplicate normalization logic and fix progress bar document counts (completed 2026-04-12)
- [ ] **Phase 02: Checkout, Graph, Deprecation & Config** - Fix checkout paths, datetime errors, deprecated imports, and config drift

---

## Phase Details

### Phase 01: Normalization & Progress Bars

**Goal**: Eliminate duplicate normalization code and fix progress bar UX issues so users see document counts during long operations

**Depends on**: Nothing

**Requirements**: NORM-01, NORM-02, PROGRESS-01, PROGRESS-02

**Success Criteria** (what must be TRUE):
1. All normalization functions exist in single location (`src/tdoc_crawler/utils/normalization.py`)
2. `meetings/utils.py` re-exports from normalization.py, no duplicate functions
3. 6+ files importing normalization use single source (verified via grep)
4. Unit tests exist for all normalization functions covering edge cases
5. Progress bar shows "N/N" format (e.g., "5/69") during `workspace process`
6. Progress bar shows "N/N" format during `add_members` command

**Plans**: 2 plans

**Plan list:**
- [x] 01-normalization-PLAN.md — Consolidate normalization logic and add unit tests
- [x] 02-progress-PLAN.md — Fix progress bar display to show document counts

---

### Phase 02: Checkout, Graph, Deprecation & Config

**Goal**: Fix checkout path issues, datetime scope errors in graph building, remove deprecated imports, and align config defaults

**Depends on**: Phase 01

**Requirements**: CHECKOUT-01, GRAPH-01, DEPRECATED-01, CONFIG-01

**Success Criteria** (what must be TRUE):
1. Empty folder detection triggers re-download correctly
2. "No document files found" warning no longer appears for valid TDocs with empty folders
3. Graph building completes without datetime scope errors
4. Errors in graph building are caught and reported with meaningful messages
5. No import errors from deprecated modules (AiStorage, EmbeddingsManager, tdoc_ai.operations.pipeline, lancedb)
6. `.env.example` embedding model matches code defaults in `threegpp_ai/config.py`

**Plans**: 3 plans

**Plan list:**
- [ ] 02-01-PLAN.md — Fix checkout empty folder detection and re-download triggers
- [ ] 02-02-PLAN.md — Add graph error handling and verify deprecated imports removed
- [ ] 02-03-PLAN.md — Align embedding model defaults between .env.example and code

---

## Progress

| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 01. Normalization & Progress Bars | 2/2 | Complete | 2026-04-12 |
| 02. Checkout, Graph, Deprecation & Config | 0/3 | Not started | - |

---

*Last updated: 2026-04-12*
---
phase: 02-checkout-graph-deprecation-config
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - packages/3gpp-ai/threegpp_ai/operations/workspaces.py
  - packages/3gpp-ai/threegpp_ai/operations/classify.py
  - tests/test_checkout.py
autonomous: true
requirements:
  - CHECKOUT-01

must_haves:
  truths:
    - "Empty checkout folders trigger re-download automatically"
    - "No 'No document files found' warnings for valid TDocs"
    - "Checkout path matches 3GPP FTP hierarchy structure"
  artifacts:
    - path: "packages/3gpp-ai/threegpp_ai/operations/workspaces.py"
      provides: "Checkout logic with empty folder detection"
      contains: "_checkout_tdoc_if_needed"
    - path: "packages/3gpp-ai/threegpp_ai/operations/classify.py"
      provides: "Classification with proper empty folder handling"
      contains: "classify_document_files"
  key_links:
    - from: "packages/3gpp-ai/threegpp_ai/operations/workspaces.py"
      to: "packages/3gpp-ai/threegpp_ai/operations/classify.py"
      via: "checkout then classify flow"
      pattern: "classify_document_files.*folder_path"
---

<objective>
Fix checkout path issues and empty folder detection to prevent "No document files found" errors during workspace processing.

Purpose: When documents are deleted but workspace metadata remains, the system must re-checkout documents automatically instead of failing with warnings.

Output: Robust checkout logic with proper empty folder detection and re-download triggers.
</objective>

<execution_context>
@$HOME/.config/opencode/get-shit-done/workflows/execute-plan.md
@$HOME/.config/opencode/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/REQUIREMENTS.md
@.planning/codebase/CONCERNS.md

# Key interfaces from workspaces.py
```python
async def _resolve_tdoc_metadata(tdoc_id: str, db_file: Path | None = None) -> TDocMetadata | None:
    """Resolve TDoc metadata via fallback chain: Database → WhatTheSpec → 3GPP Portal."""

def _checkout_tdoc_if_needed(tdoc_id: str, metadata: TDocMetadata, checkout_base: Path) -> Path | None:
    """Checkout a TDoc if not already checked out."""

def resolve_tdoc_checkout_path(tdoc_id: str, checkout_base: Path) -> Path | None:
    """Find existing TDoc checkout path by scanning for 'Docs' directories."""
```

# Current issue (CONCERNS.md line 96-119)
When running "ai workspace process" after deleting crawled documents, checkout path construction may be incorrect.
Error pattern: "WARNING No document files found in C:\Users\...\checkout\...\S4-250638 for document S4-250638"

# CacheManager pattern (from AGENTS.md)
All paths MUST use CacheManager - NEVER hardcode ~/.3gpp-crawler
</context>

<tasks>

<task type="auto" tdd="true">
  <name>Task 1: Add empty folder detection tests</name>
  <files>tests/test_checkout.py</files>
  <behavior>
    - Test: Empty folder (only .ai subfolder) triggers re-download
    - Test: Folder with actual files does NOT trigger re-download
    - Test: Non-existent folder triggers checkout
    - Test: resolve_tdoc_checkout_path finds correct path in nested Docs structure
  </behavior>
  <action>
    Create test file with pytest fixtures for checkout scenarios:
    1. Mock checkout_base with rglob("Docs") structure
    2. Test _checkout_tdoc_if_needed with empty folder (has .ai/ but no docs)
    3. Test _checkout_tdoc_if_needed with populated folder
    4. Verify resolve_tdoc_checkout_path handles 3GPP FTP hierarchy:
       - TSG_SA/WG4_CODEC/TSGS4_131-bis-e/Docs/S4-250638
       - TSG_RAN/WG1_RH/TSGR1_115/Docs/R1-2300001
    5. Test classification flow when folder is empty vs populated
    
    Use tmp_path for isolated test directories. Mock TDocMetadata with realistic FTP URLs.
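
    A minimal sketch of the fixture helpers this task could use (names like `make_checkout_tree` and `find_tdoc_folder` are illustrative stand-ins, not existing code):

    ```python
    from pathlib import Path

    def make_checkout_tree(base: Path, tdoc_id: str, files: list[str]) -> Path:
        """Build a 3GPP FTP-style checkout tree under base; return the TDoc folder.

        Mirrors the hierarchy, e.g. TSG_SA/WG4_CODEC/TSGS4_131-bis-e/Docs/S4-250638.
        """
        folder = base / "TSG_SA" / "WG4_CODEC" / "TSGS4_131-bis-e" / "Docs" / tdoc_id
        folder.mkdir(parents=True)
        for name in files:
            target = folder / name  # name may include a subfolder, e.g. ".ai/meta.json"
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
        return folder

    def find_tdoc_folder(checkout_base: Path, tdoc_id: str) -> Path | None:
        """Scan for 'Docs' directories the way resolve_tdoc_checkout_path is described to."""
        for docs_dir in checkout_base.rglob("Docs"):
            candidate = docs_dir / tdoc_id
            if candidate.is_dir():
                return candidate
        return None
    ```

    Tests would point these helpers at `tmp_path` and assert behaviour for empty, populated, and missing folders.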
  </action>
  <verify>
    <automated>uv run pytest tests/test_checkout.py -v</automated>
  </verify>
  <done>
    All checkout tests pass, covering empty folder detection, re-download triggers, and path resolution edge cases.
  </done>
</task>

<task type="auto" tdd="true">
  <name>Task 2: Fix empty folder detection in workspaces.py</name>
  <files>packages/3gpp-ai/threegpp_ai/operations/workspaces.py</files>
  <behavior>
    - Test: Empty folder (only .ai subfolder) returns False for has_files check
    - Test: Folder with .pdf/.docx/.xlsx files returns True for has_files check
    - Test: _checkout_tdoc_if_needed re-downloads when folder empty
    - Test: Warning message includes full path for debugging
  </behavior>
  <action>
    Update _checkout_tdoc_if_needed() at line 447-483:
    
    1. Improve empty folder detection (line 460-466):
       - Current: any(f.is_file() for f in existing_path.iterdir())
       - Problem: only checks the top level, so documents in nested subfolders are missed, and a folder holding only .ai/ metadata is handled inconsistently
       - Fix: check recursively, excluding everything under .ai/ from the "has files" determination
       - Pattern: any(f.is_file() and ".ai" not in f.relative_to(existing_path).parts for f in existing_path.rglob("*"))
         (comparing f.parent.name to ".ai" would only skip files directly inside .ai/, not files nested deeper under it)
    
    2. Add better logging (line 466):
       - Current: "folder exists but is empty"
       - Add: List what was found (e.g., "found: .ai/ subfolder only")
    
    3. Ensure re-download triggers (line 467-479):
       - Verify checkout_tdoc() is called when folder empty
       - Add error handling for FileNotFoundError with path context
    
    4. Fix resolve_tdoc_checkout_path (line 390-394):
       - Verify rglob("Docs") matches 3GPP FTP structure
       - Add debug logging for path resolution failures
    
    Per DRY principle: Use CacheManager for all path operations if not already.
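
    A sketch of how the recursive check and re-download trigger could fit together (function names here are illustrative; `_checkout_tdoc_if_needed` would call something like this instead of the top-level `iterdir()` check):

    ```python
    import logging
    from pathlib import Path

    logger = logging.getLogger(__name__)

    def _folder_has_documents(path: Path) -> bool:
        """Recursive file check that ignores everything under the .ai/ metadata subfolder."""
        return any(
            f.is_file() and ".ai" not in f.relative_to(path).parts
            for f in path.rglob("*")
        )

    def _describe_contents(path: Path) -> str:
        """Short listing of top-level entries for the warning message."""
        entries = sorted(p.name + ("/" if p.is_dir() else "") for p in path.iterdir())
        return ", ".join(entries) if entries else "nothing"

    def needs_redownload(existing_path: Path | None) -> bool:
        """Decide whether a checkout must be (re)done for this TDoc folder."""
        if existing_path is None or not existing_path.exists():
            return True  # never checked out
        if _folder_has_documents(existing_path):
            return False  # real files present, reuse them
        logger.warning(
            "Checkout folder %s exists but has no document files (found: %s); re-downloading",
            existing_path, _describe_contents(existing_path),
        )
        return True
    ```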
  </action>
  <verify>
    <automated>uv run pytest tests/test_checkout.py::test_empty_folder_triggers_redownload -v</automated>
  </verify>
  <done>
    Empty folder detection works correctly, re-download triggers automatically, no false "No document files found" warnings.
  </done>
</task>

<task type="auto">
  <name>Task 3: Improve classification error messages</name>
  <files>packages/3gpp-ai/threegpp_ai/operations/classify.py</files>
  <action>
    Update classify_document_files() at line 216-233:
    
    1. Enhance warning message (line 232):
       - Current: "No document files found in {folder_path} for document {document_id}"
       - Add: List expected file types (.pdf, .docx, .xlsx, .pptx)
       - Add: Suggest re-download command if folder empty
       - Example: "No document files found in {path}. Expected: .pdf/.docx/.xlsx. Run 'tdoc-crawler checkout {id}' to re-download."
    
    2. Add folder structure debug (line 230-232):
       - Before returning [], log what files exist (if any)
       - Pattern: logger.debug("Folder contents: %s", list(folder_path.iterdir()))
    
    3. Handle edge case: folder_path is symlink or mount point
       - Add: folder_path.resolve() before checking existence
    
    This improves debugging when checkout issues occur without changing core logic.
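
    One possible shape for the improved warning (the `warn_no_documents` helper, the `DOCUMENT_EXTENSIONS` constant, and the exact wording are a sketch, not the current code):

    ```python
    import logging
    from pathlib import Path

    logger = logging.getLogger(__name__)

    DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx"}

    def warn_no_documents(folder_path: Path, document_id: str) -> None:
        """Emit an actionable warning when a TDoc folder has no classifiable files."""
        folder_path = folder_path.resolve()  # follow symlinks/mounts before reporting
        contents = sorted(p.name for p in folder_path.iterdir()) if folder_path.exists() else []
        logger.debug("Folder contents of %s: %s", folder_path, contents)
        logger.warning(
            "No document files found in %s for document %s. Expected one of: %s. "
            "Run 'tdoc-crawler checkout %s' to re-download.",
            folder_path, document_id, "/".join(sorted(DOCUMENT_EXTENSIONS)), document_id,
        )
    ```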
  </action>
  <verify>
    <automated>uv run pytest tests/test_checkout.py::test_classification_empty_folder_message -v</automated>
  </verify>
  <done>
    Error messages include actionable debugging information, expected file types, and re-download suggestions.
  </done>
</task>

</tasks>

<verification>
- [ ] All test_checkout.py tests pass
- [ ] Empty folder triggers re-download in manual test
- [ ] No "No document files found" warnings for valid TDocs with files
- [ ] Checkout path matches 3GPP FTP hierarchy (verified with rg --files)
</verification>

<success_criteria>
1. CHECKOUT-01 requirement satisfied: Empty folder detection triggers re-download correctly
2. "No document files found" warning no longer appears for valid TDocs with actual document files
3. Path construction matches 3GPP FTP hierarchy (TSG_SA/WG4_*/TSGS4_*/Docs/TDocID)
4. Tests cover edge cases: empty folder, populated folder, non-existent folder, symlink paths
</success_criteria>

<output>
After completion, create `.planning/phases/02-checkout-graph-deprecation-config/02-01-SUMMARY.md` with:
- Checkout fixes implemented
- Test coverage added
- Before/after error message examples
- Manual verification results
</output>
---
phase: 02-checkout-graph-deprecation-config
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
  - packages/3gpp-ai/threegpp_ai/cli.py
  - packages/3gpp-ai/threegpp_ai/operations/workspaces.py
  - tests/test_graph_errors.py
autonomous: true
requirements:
  - GRAPH-01
  - DEPRECATED-01

must_haves:
  truths:
    - "Graph building errors are caught with meaningful error messages"
    - "No import errors from deprecated modules (AiStorage, EmbeddingsManager, lancedb)"
    - "Datetime variables properly scoped in all graph operations"
  artifacts:
    - path: "packages/3gpp-ai/threegpp_ai/cli.py"
      provides: "Workspace process with graph error handling"
      contains: "_process_workspace_members"
    - path: "packages/3gpp-ai/threegpp_ai/operations/workspaces.py"
      provides: "Graph building with try/catch error handling"
      exports: ["checkout_tdoc_to_workspace"]
  key_links:
    - from: "packages/3gpp-ai/threegpp_ai/cli.py"
      to: "packages/3gpp-ai/threegpp_ai/operations/workspaces.py"
      via: "process workflow"
      pattern: "_process_workspace_members.*checkout_tdoc"
---

<objective>
Add error handling for graph building failures and verify no deprecated imports remain in the codebase.

Purpose: Prevent silent graph building failures and ensure codebase is clean of deprecated module references.

Output: Robust error handling for graph operations, verified clean import tree.
</objective>

<execution_context>
@$HOME/.config/opencode/get-shit-done/workflows/execute-plan.md
@$HOME/.config/opencode/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/REQUIREMENTS.md
@.planning/codebase/CONCERNS.md

# GRAPH-01 Issue (CONCERNS.md line 122-146)
"Graph building failed" errors with datetime scope issues. May be from LightRAG library internally.
Current code shows no direct "Graph building failed" matches - error may be:
1. Generated by LightRAG library internally
2. From previous version now fixed
3. Dynamic based on runtime conditions

# DEPRECATED-01 Issue (CONCERNS.md line 198-223)
AGENTS.md lists deprecated components:
- AiStorage, EmbeddingsManager, create_embeddings_manager()
- tdoc_ai.operations.pipeline (legacy flow)
- tdoc_ai.storage.lancedb, sentence-transformers, tokenizers, lancedb

Grep search found NO imports of these deprecated modules - may already be clean.

# LightRAG Integration (from AGENTS.md)
```python
from tdoc_ai import LightRAGConfig, TDocRAG, TDocProcessor
config = LightRAGConfig.from_env()
rag = TDocRAG(config)
await rag.start("my-workspace")
```

Graph building happens inside LightRAG's insert() operation.
</context>

<tasks>

<task type="auto">
  <name>Task 1: Verify deprecated imports are removed</name>
  <files>packages/3gpp-ai/**/*.py</files>
  <action>
    Search entire 3gpp-ai package for deprecated imports:
    
    1. Run comprehensive grep searches:
       - `from.*AiStorage|import.*AiStorage`
       - `from.*EmbeddingsManager|import.*EmbeddingsManager`
       - `from.*lancedb|import.*lancedb`
       - `from.*sentence_transformers|import.*sentence_transformers` (the package imports with an underscore, not a hyphen)
       - `tdoc_ai.operations.pipeline`
       - `tdoc_ai.storage`
    
    2. Check imports in key files:
       - packages/3gpp-ai/threegpp_ai/__init__.py
       - packages/3gpp-ai/threegpp_ai/operations/*.py
       - packages/3gpp-ai/threegpp_ai/cli.py
    
    3. If ANY deprecated imports found:
       - Document file and line number
       - Identify replacement (should be LightRAG-based)
       - Remove/update import
    
    4. If NO deprecated imports found (expected):
       - Document verification in test file
       - Update AGENTS.md deprecated list if accurate
       - Mark DEPRECATED-01 as complete
    
    Expected result: No deprecated imports remain (already clean based on initial grep).
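
    The searches above can be done in one scripted pass (a throwaway sketch; the deprecated-name list comes from AGENTS.md as quoted in the context section):

    ```python
    import re
    from pathlib import Path

    # Names from the AGENTS.md deprecated-components list
    DEPRECATED = re.compile(
        r"\b(AiStorage|EmbeddingsManager|create_embeddings_manager|lancedb"
        r"|sentence_transformers|tdoc_ai\.operations\.pipeline|tdoc_ai\.storage)\b"
    )

    def find_deprecated_imports(root: Path) -> list[tuple[Path, int, str]]:
        """Return (file, line_number, line) for every hit in .py files under root."""
        hits = []
        for py_file in sorted(root.rglob("*.py")):
            for lineno, line in enumerate(py_file.read_text(errors="ignore").splitlines(), 1):
                if DEPRECATED.search(line):
                    hits.append((py_file, lineno, line.strip()))
        return hits
    ```

    Running this over packages/3gpp-ai/ and requiring an empty result is one way to make the verification repeatable.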
  </action>
  <verify>
    <automated>! rg -t py "AiStorage|EmbeddingsManager|lancedb" packages/3gpp-ai/</automated>
  </verify>
  <done>
    Comprehensive search confirms zero deprecated imports, or all found imports are updated to LightRAG equivalents.
  </done>
</task>

<task type="auto" tdd="true">
  <name>Task 2: Add graph error handling tests</name>
  <files>tests/test_graph_errors.py</files>
  <behavior>
    - Test: Graph building error caught and logged with meaningful message
    - Test: Datetime variables properly scoped before graph operations
    - Test: Error does not crash entire workspace process
    - Test: Failed graph building still allows embedding to complete
  </behavior>
  <action>
    Create test file for graph error scenarios:
    
    1. Mock LightRAG insert() to raise various errors:
       - ValueError with datetime message
       - RuntimeError from graph building
       - Timeout during graph insertion
    
    2. Test _process_workspace_members() error handling:
       - Verify error caught and logged
       - Verify processing continues for other documents
       - Verify results array includes error status
    
    3. Test datetime scoping:
       - Verify datetime imported at module level (line 11 in cli.py)
       - Verify datetime.combine() calls have proper scope (line 205-206)
       - Test with various date formats (partial dates, timezone-aware)
    
    4. Document actual error patterns from LightRAG:
       - Run workspace process with debug logging
       - Capture any "Graph building failed" messages
       - Trace error source (library vs. our code)
    
    Use pytest fixtures to mock LightRAG components without actual LLM/embedding calls.
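
    A sketch of the mock and the error-handling behaviour under test (`FakeRAG` and `process_documents` are illustrative stand-ins, not the real LightRAG API):

    ```python
    import asyncio
    import logging

    logger = logging.getLogger(__name__)

    class FakeRAG:
        """Stand-in for a LightRAG instance whose insert() fails on demand."""
        def __init__(self, error: Exception | None = None):
            self.error = error
            self.inserted: list[str] = []

        async def insert(self, chunks: str) -> None:
            if self.error is not None:
                raise self.error
            self.inserted.append(chunks)

    async def process_documents(rag: FakeRAG, docs: dict[str, str]) -> dict[str, str]:
        """Insert each document, recording per-document status instead of crashing."""
        results = {}
        for doc_id, chunks in docs.items():
            try:
                await rag.insert(chunks)
                results[doc_id] = "ok"
            except Exception as e:
                logger.error("Graph building failed for %s: %s: %s", doc_id, type(e).__name__, e)
                results[doc_id] = f"error: {type(e).__name__}"
        return results
    ```

    The real tests would swap `FakeRAG` in via a pytest fixture so no LLM or embedding calls happen.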
  </action>
  <verify>
    <automated>uv run pytest tests/test_graph_errors.py -v</automated>
  </verify>
  <done>
    Tests cover graph error scenarios, datetime scoping verified, error handling prevents crashes.
  </done>
</task>

<task type="auto">
  <name>Task 3: Add graph error handling in workspaces.py</name>
  <files>packages/3gpp-ai/threegpp_ai/operations/workspaces.py</files>
  <action>
    Investigate where graph building occurs in workspace processing:
    
    1. Find graph building calls:
       - Search for rag.insert(), graph.build(), or similar
       - Check extraction.py, chunking.py, summarize.py operations
       - Trace _process_workspace_members() call chain
    
    2. Add try/catch around graph operations:
       ```python
       try:
           await rag.insert(document_chunks)
       except ValueError as e:
           if "datetime" in str(e).lower():
               logger.error(f"Graph building failed for {doc_id}: datetime scope error. Check datetime imports and variable scoping.")
           else:
               logger.error(f"Graph building failed for {doc_id}: {e}")
           # Continue processing - embeddings still work
       except Exception as e:
           logger.error(f"Graph building failed for {doc_id}: {type(e).__name__}: {e}")
       ```
    
    3. Add debug logging before graph operations:
       - Log document ID, chunk count, datetime variables in scope
       - Log LightRAG configuration (model, backend)
    
    4. If graph building is inside LightRAG library (not our code):
       - Document this finding
       - Add error handling at our integration point
       - Consider upgrading LightRAG version if bug is known
    
    Per KARPATHY-guidelines: Make surgical changes, surface assumptions, define verifiable success.
  </action>
  <verify>
    <automated>uv run pytest tests/test_graph_errors.py::test_graph_error_caught -v</automated>
  </verify>
  <done>
    Graph building errors caught with meaningful messages, processing continues for other documents, datetime scoping verified.
  </done>
</task>

</tasks>

<verification>
- [ ] No deprecated imports found (or all updated)
- [ ] Graph error tests pass
- [ ] Manual test: workspace process completes even with graph errors
- [ ] Error messages include document ID and actionable debugging info
</verification>

<success_criteria>
1. GRAPH-01 requirement satisfied: Graph building errors caught and reported with meaningful messages
2. DEPRECATED-01 requirement satisfied: Zero imports of deprecated modules (AiStorage, EmbeddingsManager, lancedb, etc.)
3. Datetime variables properly scoped in all graph-related operations
4. Workspace process continues even when graph building fails for individual documents
</success_criteria>

<output>
After completion, create `.planning/phases/02-checkout-graph-deprecation-config/02-02-SUMMARY.md` with:
- Deprecated import verification results
- Graph error handling implemented (or documented as LightRAG-internal)
- Test coverage for error scenarios
- Manual verification results with sample error messages
</output>
File added (+231 lines); preview size limit exceeded, changes collapsed.