Commit 05bb6889 authored by Jan Reimes

feat(workspace): restructure 3gpp-ai to extraction-only package

- Eliminate workspace management code from 3gpp-ai, moving it to main app.
- Directly write extraction artifacts to `.llm-wiki/<workspace>/sources/`.
- Update CLI commands to reflect new structure and deprecate old commands.
- Ensure all tests pass and documentation reflects the new extraction-only scope.
parent f65e215f
+70 −20
@@ -2,37 +2,87 @@

## What This Is

A brownfield extension of `3gpp-ai` focused on high-fidelity extraction of complex 3GPP PDFs into deterministic structured artifacts, summarize workflows, and future-compatible metadata for wiki/retrieval systems.
A brownfield extension of 3gpp-ai that compiles deterministic extraction artifacts from TDocs/specs into a provenance-grounded llm-wiki, then answers queries from curated wiki pages instead of traditional vector-first RAG paths.

## Core Value

Extract technically accurate, traceable document structure and meaning from complex PDFs into deterministic Markdown and canonical JSON artifacts.

## Current State
## Current Milestone: v1.1 llm-wiki

- Latest shipped milestone: **v1.0 Advanced PDF Extraction Pipeline**
- Milestone archive: `.planning/milestones/v1.0-ROADMAP.md`
- Requirements archive: `.planning/milestones/v1.0-REQUIREMENTS.md`
- Delivery status: 6 phases complete, 12 plans complete
**Goal:** Replace the traditional RAG pipeline in packages/3gpp-ai with a Karpathy-style llm-wiki architecture that ingests existing extracted artifacts and serves citation-grounded answers.

### Shipped in v1.0
**Target features:**

- Deterministic extraction profile policy and persisted effective config snapshots.
- Canonical JSON + Markdown dual-output contracts with stable element IDs/provenance.
- Deterministic extraction quality gates and reason-coded quality reporting.
- Structured table/figure/equation fidelity improvements with additive compatibility.
- Structured-first summarize path with markdown fallback and wiki output mode.
- Embedding-surface decommission to establish extraction-only baseline for next milestone.
- Deterministic wiki-page compiler and publish contracts from canonical extraction artifacts.
- Citation-grounded wiki query mode with release-aware provenance constraints.
- Framework decision and implementation baseline for llm-wiki (primary recommendation: llama-index-core profile).
- Migration/deprecation path for remaining LightRAG-first surfaces.

## Next Milestone Goals
## Requirements

- Define next milestone scope and requirements with `/gsd-new-milestone`.
- Run milestone audit early to avoid close-time audit debt.
- Build on extraction-only baseline for retrieval/information-access architecture.
### Validated

## Known Follow-ups
- [x] Deterministic extraction profile policy and persisted effective config snapshots. - v1.0
- [x] Canonical JSON + Markdown dual-output contracts with stable provenance IDs. - v1.0
- [x] Deterministic extraction quality gate lifecycle and reason-coded reporting. - v1.0
- [x] Structured-first summarize context assembly with markdown fallback compatibility. - v1.0
- [x] Embedding module surface decommission for extraction-only baseline. - v1.0
- [x] Framework selection and contract freeze for wiki-first v1.1 baseline. - phase 07

- No dedicated `v1.0` audit artifact was generated before close; run `/gsd-audit-milestone` at next-cycle start.
### Active

- [ ] WIKI-01: Compile deterministic wiki pages from canonical artifacts with stable IDs/slugs.
- [ ] WIKI-02: Persist release-aware provenance mappings for every wiki section.
- [ ] QUERY-01: Serve citation-grounded answers from wiki-first retrieval.
- [ ] MIGR-01: Deprecate legacy LightRAG-first command surfaces safely.

### Out of Scope

- Reintroducing embedding/vector infrastructure as default retrieval path in v1.1 - conflicts with milestone intent and recent decommission.
- Multi-backend retrieval abstraction (pgvector/opensearch/others) in v1.1 - defer until wiki baseline quality and scaling thresholds are measured.
- Autonomous wiki page rewrites without provenance lock - unacceptable for standards-traceability requirements.

## Context

- Latest shipped milestone: v1.0 Advanced PDF Extraction Pipeline, archived in .planning/milestones/v1.0-ROADMAP.md and .planning/milestones/v1.0-REQUIREMENTS.md.
- v1.1 research outputs generated under .planning/research/ with stack/features/architecture/pitfalls synthesis.
- Research recommendation for primary framework selection: llama-index-core for core orchestration, with lexical wiki retrieval baseline and optional semantic add-ons deferred.
- User direction: traditional RAG pipeline in packages/3gpp-ai will be removed/replaced by llm-wiki approach.

## Constraints

- **Architecture**: Keep extraction outputs as source-of-truth inputs to wiki compile - avoid reintroducing extract-time embedding branches.
- **Traceability**: All generated wiki claims must map to source coordinates (doc/page/element IDs) - needed for standards-grade trust.
- **Compatibility**: Preserve stable CLI ergonomics where practical during migration - reduce operational breakage for existing users.
- **Determinism**: Wiki build output must be reproducible for unchanged inputs - required for auditability and diff-based workflows.

## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Use llm-wiki as v1.1 target architecture | Aligns with extraction-first baseline and user direction to replace traditional RAG | ✓ Good |
| Primary framework baseline: llama-index-core profile | Best fit from milestone research for migration risk, orchestration needs, and minimal lock-in | — Pending |
| Start with lexical wiki retrieval and citations, defer semantic default | Preserves determinism and limits complexity while quality baselines are established | — Pending |

## Evolution

This document evolves at phase transitions and milestone boundaries.

**After each phase transition** (via /gsd-transition):

1. Requirements invalidated? -> Move to Out of Scope with reason
2. Requirements validated? -> Move to Validated with phase reference
3. New requirements emerged? -> Add to Active
4. Decisions to log? -> Add to Key Decisions
5. What This Is still accurate? -> Update if drifted

**After each milestone** (via /gsd-complete-milestone):

1. Full review of all sections
2. Core Value check - still the right priority?
3. Audit Out of Scope - reasons still valid?
4. Update Context with current state

---
*Last updated: 2026-04-18 after v1.0 milestone completion*
*Last updated: 2026-04-27 after new milestone initialization (v1.1 llm-wiki)*
+85 −0
# Requirements: 3GPP AI Document Intelligence

**Defined:** 2026-04-27
**Core Value:** Extract technically accurate, traceable document structure and meaning from complex PDFs into deterministic Markdown and canonical JSON artifacts.

## v1.1 Requirements

### Framework and Baseline

- [x] **FRMW-01**: The milestone documents and ratifies one primary llm-wiki framework with explicit rationale and tradeoff matrix.
- [x] **FRMW-02**: The selected framework integration profile is implemented without reintroducing vector-first default behavior.

### Wiki Compilation Contracts

- [ ] **WIKI-01**: The system compiles deterministic wiki pages from canonical extraction outputs for TDocs/specs/extra documents.
- [ ] **WIKI-02**: Every wiki page section includes stable provenance fields linking to source document/page/element IDs.
- [ ] **WIKI-03**: Re-running wiki compilation on unchanged inputs yields byte-stable page outputs and unchanged compile manifests.
- [ ] **WIKI-04**: Incremental rebuild mode recompiles only pages impacted by changed source artifacts.
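The byte-stability and incremental-rebuild requirements above (WIKI-03, WIKI-04) can be illustrated with a minimal sketch. It assumes pages and source artifacts are identified by a slug and compared by SHA-256 content hash. The function names `content_hash`, `build_manifest`, and `pages_to_rebuild` are hypothetical and are not part of the repository.

```python
from __future__ import annotations

import hashlib
import json


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(pages: dict[str, str]) -> str:
    """Serialize a slug -> content-hash mapping deterministically.

    Sorted keys and fixed separators keep the serialized manifest
    byte-stable for unchanged inputs, regardless of insertion order.
    (Hypothetical manifest layout, for illustration only.)
    """
    entries = {slug: content_hash(text.encode("utf-8")) for slug, text in pages.items()}
    return json.dumps(entries, sort_keys=True, separators=(",", ":"))


def pages_to_rebuild(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return the sorted slugs whose hash is new or differs from the previous manifest."""
    return sorted(slug for slug, digest in current.items() if previous.get(slug) != digest)
```

Comparing two loaded manifests with `pages_to_rebuild` yields only the slugs that need recompilation; everything else can be skipped.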

### Wiki Query and Trust

- [ ] **QUERY-01**: Query mode retrieves from wiki pages/topics as primary substrate rather than raw chunk/vector retrieval.
- [ ] **QUERY-02**: Answer payloads include explicit citations to wiki sections and source coordinates.
- [ ] **QUERY-03**: Strict mode blocks uncited claims in final answer payloads.
- [ ] **QUERY-04**: Release/version-aware routing prevents mixing incompatible spec revisions without explicit conflict reporting.
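As a rough illustration of QUERY-02 and QUERY-03, the sketch below models an answer as a list of claims, each carrying citations to a wiki section and source coordinates, and rejects the payload in strict mode if any claim has no citation. All type and function names here (`Citation`, `Claim`, `UncitedClaimError`, `enforce_strict_mode`) are assumptions made for the example, not the project's actual contracts.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record: wiki section plus source coordinates."""

    wiki_section: str
    doc_id: str
    page: int
    element_id: str


@dataclass
class Claim:
    """One statement in an answer payload, with its supporting citations."""

    text: str
    citations: list[Citation] = field(default_factory=list)


class UncitedClaimError(ValueError):
    """Raised when strict mode finds a claim without any citation."""


def enforce_strict_mode(claims: list[Claim]) -> list[Claim]:
    """Return the claims unchanged, or raise if any claim lacks a citation."""
    uncited = [claim.text for claim in claims if not claim.citations]
    if uncited:
        raise UncitedClaimError(f"{len(uncited)} uncited claim(s): {uncited}")
    return claims
```

Raising an error rather than silently dropping uncited claims is one possible design choice; it keeps the failure visible to the caller.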

### Migration and Decommission

- [ ] **MIGR-01**: Legacy LightRAG-first command paths are deprecated with clear migration guidance and compatibility warnings.
- [ ] **MIGR-02**: Documentation is updated to remove default RAG/embedding-first language and describe the wiki-first architecture.
- [ ] **MIGR-03**: Any retained fallback mode requires explicit opt-in and emits telemetry/metadata indicating non-wiki path usage.

### Quality and Operations

- [ ] **QUAL-01**: Existing extraction quality reports gate wiki publish; failed extraction artifacts cannot publish wiki pages.
- [ ] **QUAL-02**: Regression tests cover deterministic compile behavior, provenance mapping, and citation completeness.
- [ ] **QUAL-03**: CLI status output reports wiki build health (counts, failed pages, last compile hash).

## v2+ Requirements

### Advanced Retrieval Extensions

- **EXT-01**: Optional semantic retrieval add-on is introduced only after lexical wiki baseline shows measurable recall gaps.
- **EXT-02**: Multi-backend retrieval abstraction is added once scale/performance thresholds justify backend diversity.
- **EXT-03**: Advanced cross-release conflict synthesis is added with evaluator-backed acceptance criteria.

## Out of Scope

| Feature | Reason |
|---------|--------|
| Reintroduce embeddings/vector retrieval as default in v1.1 | Directly conflicts with llm-wiki migration objective |
| Query-time graph extraction/generation in v1.1 | Adds latency and nondeterminism before wiki baseline is proven |
| Autonomous wiki page rewriting without provenance lock | Violates standards-traceability and auditability expectations |
| Broad cloud-service productization changes | Milestone scope is architecture migration within existing CLI package |

## Traceability

| Requirement | Phase | Status |
|-------------|-------|--------|
| FRMW-01 | Phase 07 | Complete |
| FRMW-02 | Phase 07 | Complete |
| WIKI-01 | Phase 08 | Pending |
| WIKI-02 | Phase 08 | Pending |
| WIKI-03 | Phase 08 | Pending |
| WIKI-04 | Phase 08 | Pending |
| QUERY-01 | Phase 09 | Pending |
| QUERY-02 | Phase 09 | Pending |
| QUERY-03 | Phase 09 | Pending |
| QUERY-04 | Phase 09 | Pending |
| MIGR-01 | Phase 10 | Pending |
| MIGR-02 | Phase 10 | Pending |
| MIGR-03 | Phase 10 | Pending |
| QUAL-01 | Phase 08 | Pending |
| QUAL-02 | Phase 09 | Pending |
| QUAL-03 | Phase 10 | Pending |

**Coverage:**

- v1.1 requirements: 16 total
- Mapped to phases: 16
- Unmapped: 0

---
*Requirements defined: 2026-04-27*
*Last updated: 2026-04-27 after milestone initialization (v1.1 llm-wiki)*
+76 −41
# Roadmap — 3GPP Crawler Codebase Improvement
# Roadmap: v1.1 llm-wiki

## Milestone

**Version:** v1.1
**Name:** llm-wiki
**Status:** Active
**Started:** 2026-04-27
**Goal:** Generate external-app-compatible workspace exports from extraction artifacts, re-integrate 3gpp-ai into main app, and remove in-house wiki-compiler dependencies.

## Phases

- [x] **Phase 01: Normalization & Progress Bars** - Consolidate duplicate normalization logic and fix progress bar document counts (completed 2026-04-12)
- [x] **Phase 02: Checkout, Graph, Deprecation & Config** - Fix checkout paths, datetime errors, deprecated imports, and config drift (completed 2026-04-19)
- [x] **Phase 07: Framework Selection and Contract Freeze** (superseded — see phase 08)
- [x] **Phase 08: External Workspace Export and Reintegration**
- [x] **Phase 09: Workspace Infrastructure Merge and Export Elimination**

---

## Phase Details

### Phase 01: Normalization & Progress Bars
### Phase 07: Framework Selection and Contract Freeze

**Goal:** Finalize framework decision and lock milestone-level technical contracts before implementation.

**Goal**: Eliminate duplicate normalization code and fix progress bar UX issues so users see document counts during long operations
**Depends on:** v1.0 extraction baseline and v1.1 research outputs

**Depends on**: Nothing
**Requirements:** FRMW-01, FRMW-02

**Requirements**: NORM-01, NORM-02, PROGRESS-01, PROGRESS-02
**Success Criteria**:

**Success Criteria** (what must be TRUE):
1. All normalization functions exist in single location (`src/tdoc_crawler/utils/normalization.py`)
2. `meetings/utils.py` re-exports from normalization.py, no duplicate functions
3. 6+ files importing normalization use single source (verified via grep)
4. Unit tests exist for all normalization functions covering edge cases
5. Progress bar shows "N/N" format (e.g., "5/69") during `workspace process`
6. Progress bar shows "N/N" format during `add_members` command
1. Framework recommendation is ratified with explicit acceptance criteria and fallback policy.
2. Wiki compile/query contract document is frozen for v1.1 scope.
3. CLI/config compatibility boundaries are documented for migration.

**Plans**: 2 plans
**Plans:** 2 plans

**Plan list:**
- [x] 01-normalization-PLAN.md — Consolidate normalization logic and add unit tests
- [x] 02-progress-PLAN.md — Fix progress bar display to show document counts

- [x] 07-01-PLAN.md - Framework selection rubric and decision record
- [x] 07-02-PLAN.md - v1.1 interface/contract freeze (schema, CLI, config)

**Note:** Phase 07 framework additions (llama-index-core, rank-bm25, rapidfuzz, wiki contracts) are reverted in phase 08-01. The strategy shifted to external-app workspace export instead of in-house wiki compilation.

---

### Phase 02: Checkout, Graph, Deprecation & Config
### Phase 08: External Workspace Export and Reintegration

**Goal**: Fix checkout path issues, datetime scope errors in graph building, remove deprecated imports, and align config defaults
**Goal:** Strip 3gpp-ai to extraction-only, add workspace export commands for external wiki apps, and re-integrate into main CLI.

**Depends on**: Phase 1
**Depends on:** Phase 07 (revert framework additions)

**Requirements**: CHECKOUT-01, GRAPH-01, DEPRECATED-01, CONFIG-01
**Requirements:** REVERT-01, EXPORT-01, REINT-01

**Success Criteria** (what must be TRUE):
1. Empty folder detection triggers re-download correctly
2. "No document files found" warning no longer appears for valid TDocs with empty folders
3. Graph building completes without datetime scope errors
4. Errors in graph building are caught and reported with meaningful messages
5. No import errors from deprecated modules (AiStorage, EmbeddingsManager, tdoc_ai.operations.pipeline, lancedb)
6. `.env.example` embedding model matches code defaults in `threegpp_ai/config.py`
**Success Criteria**:

**Plans**: 3 plans
1. No llama-index-core, rank-bm25, or rapidfuzz dependencies remain.
2. Workspace export produces valid file structures for atomicmemory/llm-wiki-compiler and lucasastorian/llmwiki.
3. Workspace init/process/export commands available from main `tdoc-crawler` CLI.
4. Standalone `3gpp-ai` CLI prints deprecation message.
5. All tests pass.

**Plans:** 3 plans

**Plan list:**
- [x] 02-01-PLAN.md — Fix checkout empty folder detection and re-download triggers
- [x] 02-02-PLAN.md — Add graph error handling and verify deprecated imports removed
- [x] 02-03-PLAN.md — Align embedding model defaults between .env.example and code

- [ ] 08-01-PLAN.md - Revert phase 07 framework additions, strip 3gpp-ai to extraction-only
- [ ] 08-02-PLAN.md - Add workspace export module and CLI commands for external wiki app formats
- [ ] 08-03-PLAN.md - Re-integrate workspace commands into main CLI, deprecate standalone entrypoint

---

## Progress
### Phase 09: Workspace Infrastructure Merge and Export Elimination

**Goal:** Merge workspace management into the main app as a core concern, eliminate the redundant export step by writing extraction artifacts directly to `.llm-wiki/<workspace>/`, and restructure the 3gpp-ai package to contain only AI extraction pipeline code.

**Depends on:** Phase 08

| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 01. Normalization & Progress Bars | 2/2 | Complete    | 2026-04-12 |
| 02. Checkout, Graph, Deprecation & Config | 3/3 | Complete    | 2026-04-19 |
**Requirements:** MERGE-01, DIRECT-01, CLEANUP-01

**Success Criteria**:

1. Workspace CRUD, registry, and member management live in `src/tdoc_crawler/` (not `packages/3gpp-ai`).
2. Extraction artifacts are written directly to `~/.3gpp-crawler/.llm-wiki/<workspace>/sources/` during `workspace process`.
3. `workspace_export.py` and the `export` CLI command are removed.
4. `CacheManager` exposes `.llm_wiki_dir(workspace_name)` property.
5. `packages/3gpp-ai` contains only: extraction pipeline, LLM client, `AiConfig`.
6. All tests pass.
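The directory layout named in criteria 2 and 4 can be sketched as a pair of path helpers. This is only an illustration under assumptions: the real `CacheManager` API is not shown in this diff, so the standalone functions `llm_wiki_dir` and `sources_dir` and the `cache_root` parameter are hypothetical; only the `~/.3gpp-crawler/.llm-wiki/<workspace>/sources/` layout comes from the text above.

```python
from __future__ import annotations

from pathlib import Path


def llm_wiki_dir(workspace_name: str, cache_root: Path | None = None) -> Path:
    """Return the wiki directory for a workspace.

    Defaults to ~/.3gpp-crawler as the cache root, matching the path
    quoted in the success criteria (the default is an assumption).
    """
    root = cache_root if cache_root is not None else Path.home() / ".3gpp-crawler"
    return root / ".llm-wiki" / workspace_name


def sources_dir(workspace_name: str, cache_root: Path | None = None) -> Path:
    """Return the directory that receives extraction artifacts."""
    return llm_wiki_dir(workspace_name, cache_root) / "sources"
```

Passing an explicit `cache_root` keeps the helpers testable without touching the user's home directory.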

**Plans:** 3 plans

**Plan list:**

- [ ] 09-01-PLAN.md - Merge workspace infrastructure into main app
- [ ] 09-02-PLAN.md - Eliminate export step, write artifacts directly to .llm-wiki/
- [ ] 09-03-PLAN.md - Restructure 3gpp-ai to extraction-only package

---

*Last updated: 2026-04-19*
## Progress

| Phase | Plans Complete | Status |
|-------|----------------|--------|
| 07. Framework Selection and Contract Freeze | 2/2 | Complete (superseded) |
| 08. External Workspace Export and Reintegration | 3/3 | Complete |
| 09. Workspace Infrastructure Merge and Export Elimination | 3/3 | Complete |

---

## Archive

- [x] Milestone v1.0 Advanced PDF Extraction Pipeline (2026-04-17 to 2026-04-18) - 6 phases complete, 12 plans complete, archived at `.planning/milestones/v1.0-ROADMAP.md`
- [x] Milestone v1.0 Advanced PDF Extraction Pipeline (2026-04-17 to 2026-04-18), archived at .planning/milestones/v1.0-ROADMAP.md

## Next Milestone Placeholder
---

No active milestone roadmap defined yet. Use `/gsd-new-milestone` to initialize the next cycle.
*Last updated: 2026-04-27 after phase 08 execution*
+17 −17
@@ -2,35 +2,35 @@

## Current Position

Phase: milestone close
Plan: v1.0 archived
Status: v1.0 complete
Last activity: 2026-04-18 - v1.0 milestone archived and roadmap reset for next cycle
Phase: 09 complete
Plan: 09-01, 09-02, 09-03 complete
Status: Phase 09 complete, all plans executed
Last activity: 2026-04-27 - Completed phase 09 workspace infrastructure merge and export elimination

## Project Reference

See: .planning/PROJECT.md (updated 2026-04-17)
See: .planning/PROJECT.md (updated 2026-04-27)

**Core value:** Extract technically accurate, traceable document structure and meaning from complex PDFs into deterministic Markdown and JSON artifacts.
**Current focus:** Milestone v1.0 Advanced PDF Extraction Pipeline
**Current focus:** Phase 09 complete — workspace management is now a core concern of the main app, extraction artifacts written directly to `.llm-wiki/`

## Accumulated Context

- Current extractor stack: Docling + optional VLM + artifact persistence under `.ai`
- Current extractor stack: Docling + optional VLM + artifact persistence under `.llm-wiki/`
- Milestone scope excludes embedding/RAG changes
- Summarize command must consume/benefit from structured extraction artifacts
- Future direction includes LLM-wiki approach; extraction metadata should support that model
- Strategy: no in-house wiki compiler. Instead, generate file structures for external apps.
- 3gpp-ai re-integrated into main app as extraction-only module.
- Standalone `3gpp-ai` CLI deprecated.
- Workspace management (CRUD, registry, members) is now a core concern in `src/tdoc_crawler/`.
- Extraction artifacts written directly to `~/.3gpp-crawler/.llm-wiki/<workspace>/sources/`.

### Roadmap Evolution

- Phase 6 added: As a preparation for the next milestone (the RAG/information retrieval system), remove all existing/following modules that perform embedding.
- Phase 6 planned: 06-01-PLAN.md and 06-02-PLAN.md added for embedding module decommission and clean baseline handoff.
- Phase 6 executed: surface removal, runtime/dependency cleanup, and baseline handoff docs completed.
- Phase 1 executed: extraction profile policy surface, deterministic policy routing, and profile metadata persistence delivered.
- Phase 2 executed: canonical document/page contracts, dual markdown+canonical outputs, and manifest inventory persistence delivered.
- Phase 3 executed: deterministic quality gates, persisted quality reports, and status-aware downstream policy enforcement delivered.
- Phase 4 executed: additive table/figure/equation fidelity contracts implemented with deterministic provenance normalization and regression coverage.
- Phase 5 executed: summarize now uses structured-first prompt context with deterministic fallback and CLI output-mode compatibility for wiki-ready rendering.
- Phase 6 executed: embedding module decommission and clean baseline handoff.
- Phase 7 executed: framework decision ratified via ADR, dependency/docs baseline aligned, and wiki-first query contract interfaces frozen with regression tests.
- Phase 7 superseded: framework additions reverted in phase 08-01. Strategy changed to external-app workspace export.
- Phase 8 executed: reverted framework additions, added workspace export commands, re-integrated into main CLI.
- Phase 9 executed: merged workspace infrastructure into main app, eliminated export step, restructured 3gpp-ai to extraction-only.

## Deferred Items

+138 −0

File added.

Preview size limit exceeded, changes collapsed.
