Commit b3f3068a authored by Jan Reimes

docs: complete project research

parent c0377cbc
# Architecture Patterns

**Domain:** 3gpp-ai wiki-first knowledge pipeline
**Researched:** 2026-04-27

## Recommended Architecture

Replace retrieval-first (traditional RAG) behavior with a wiki-first pipeline that treats extracted artifacts as the source of truth and generates deterministic wiki pages per document and per topic index.

Core principle:

1. Extract once into stable artifact records.
2. Compile those records into wiki nodes (document pages + topic pages + relation edges).
3. Query by traversing wiki graph/page links first, then selectively calling the LLM for synthesis over retrieved wiki sections.

This fits the current codebase because extraction and workspace orchestration already exist and are callable from CLI without introducing a new storage root.
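The three-stage principle can be sketched as pure functions over simple records. `WikiArtifactBundle` and `WikiPage` here are illustrative stand-ins for the schemas this document proposes, not existing APIs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WikiArtifactBundle:
    """Stage 1 output: stable extraction-derived records (illustrative)."""
    doc_id: str
    sections: tuple  # (section_id, text) pairs

@dataclass(frozen=True)
class WikiPage:
    """Stage 2 output: a compiled projection of one bundle."""
    page_id: str
    source_doc_id: str
    body: str

def compile_page(bundle: WikiArtifactBundle) -> WikiPage:
    # Deterministic compilation: same bundle in, same page out.
    body = "\n".join(text for _, text in bundle.sections)
    return WikiPage(page_id=f"doc/{bundle.doc_id}",
                    source_doc_id=bundle.doc_id, body=body)

def query(pages: list, term: str) -> list:
    # Stage 3: traverse/filter compiled pages first; LLM synthesis
    # would only run over the pages this step returns.
    return [p.page_id for p in pages if term.lower() in p.body.lower()]
```

The key design point the sketch illustrates: stage 3 never touches raw extraction output directly, only compiled pages.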

## Existing Integration Points (Current Code)

### CLI command surfaces to extend

- `threegpp_ai.cli._workspace_commands.workspace_process`
- `threegpp_ai.cli._commands.ai_summarize` (already supports `--output-mode wiki`)
- `threegpp_ai.cli.__init__` command tree composition
- `threegpp_ai.cli._commands.clear_artifacts` for cleanup/rebuild workflows

### Pipeline operations to reuse

- `threegpp_ai.operations.convert.extract_document_structured_from_tdoc`
- `threegpp_ai.operations.convert.convert_document_to_markdown`
- `threegpp_ai.operations.summarize.summarize_tdoc`
- `threegpp_ai.cli._workspace.process_workspace_members` (current workspace orchestration loop)

### Workspace and state management to keep as SSOT

- `threegpp_ai.operations.workspace_registry.WorkspaceRegistry`
- `threegpp_ai.operations.workspaces.list_workspace_members`
- existing `.ai/` artifact placement under each checked-out document

## Proposed Components

### Component 1: Artifact Normalizer

**What:** Canonicalizes extraction output into a stable schema used by the wiki compiler.
**New module:** `threegpp_ai.operations.wiki_artifacts`
**Input:** `extract_document_structured_from_tdoc()` result
**Output:** `WikiArtifactBundle` (doc metadata, sections, equations, tables, figures, anchors)

Key requirement: every extracted element gets deterministic IDs (`doc_id`, `section_id`, `artifact_id`) so links remain stable across rebuilds.
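One way to meet the stable-ID requirement is to derive IDs from positional coordinates rather than run order or extracted text; a hash over `(doc_id, section path, element kind, ordinal)` then stays identical across rebuilds. A sketch of the idea, not the project's actual scheme:

```python
import hashlib

def artifact_id(doc_id: str, section_path: str, kind: str, ordinal: int) -> str:
    """Derive a rebuild-stable artifact ID from positional coordinates.

    Hashing coordinates (not extracted text) keeps the ID stable even
    when extraction output changes slightly between runs.
    """
    key = f"{doc_id}|{section_path}|{kind}|{ordinal}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Whether to hash coordinates or content is a real trade-off: coordinate hashes survive text fixes, content hashes survive reordering; either choice must be made once and kept.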

### Component 2: Wiki Compiler

**What:** Converts normalized bundles into wiki pages and relation edges.
**New module:** `threegpp_ai.operations.wiki_compiler`
**Output targets:**

- document page: one page per TDoc/spec
- topic pages: synthesized pages grouped by tags/spec numbers/meeting topics
- edge index: references between pages and source artifacts

Storage recommendation: write under each member's `.ai/wiki/`, plus a workspace-level index file keyed by the workspace name.

### Component 3: Wiki Index Store

**What:** Lightweight local index for page lookup, backlinks, and provenance.
**New module:** `threegpp_ai.operations.wiki_store`
**Format:** JSONL or SQLite-backed records (prefer SQLite if cross-document joins become hot)

Required access patterns:

- list pages by workspace
- resolve page by slug/id
- fetch backlinks and cited source artifacts
- fetch topic neighborhood (for multi-doc synthesis)
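A minimal SQLite schema covering two of the access patterns above (workspace listing and backlinks) could look like this; table and column names are placeholders, not an existing `wiki_store` API:

```python
import sqlite3

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a wiki index store with pages and link edges (sketch)."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS pages(
            page_id   TEXT PRIMARY KEY,
            workspace TEXT NOT NULL,
            slug      TEXT NOT NULL,
            body      TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS edges(
            src_page    TEXT NOT NULL,
            dst_page    TEXT NOT NULL,
            artifact_id TEXT
        );
        CREATE INDEX IF NOT EXISTS idx_pages_ws ON pages(workspace);
        CREATE INDEX IF NOT EXISTS idx_edges_dst ON edges(dst_page);
    """)
    return db

def pages_by_workspace(db, workspace: str) -> list:
    rows = db.execute(
        "SELECT page_id FROM pages WHERE workspace = ? ORDER BY page_id",
        (workspace,))
    return [r[0] for r in rows]

def backlinks(db, page_id: str) -> list:
    rows = db.execute(
        "SELECT src_page FROM edges WHERE dst_page = ?", (page_id,))
    return [r[0] for r in rows]
```

The `edges` table doubles as the provenance record by carrying the cited `artifact_id`, which keeps backlink and citation lookups on one index.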

### Component 4: Wiki Query Engine

**What:** Query orchestrator that replaces vector-first retrieval with wiki traversal + bounded LLM synthesis.
**New module:** `threegpp_ai.operations.wiki_query`

Query flow:

1. parse user intent
2. resolve candidate wiki pages/topics
3. collect cited source snippets from artifact bundles
4. synthesize answer with citations to page IDs and artifact IDs

### Component 5: CLI adapters

**What:** Add wiki lifecycle commands without breaking current users.
**New command group proposal:** `3gpp-ai wiki ...`

Suggested commands:

- `3gpp-ai wiki build [-w workspace] [--force]`
- `3gpp-ai wiki query "..." [-w workspace]`
- `3gpp-ai wiki status [-w workspace]`
- `3gpp-ai wiki rebuild-topic <topic-slug>`
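The shape of the proposed command group can be sketched with `argparse` subparsers; the real implementation would of course reuse whatever CLI framework `threegpp_ai.cli` already composes its command tree with:

```python
import argparse

def build_wiki_cli() -> argparse.ArgumentParser:
    """Sketch of the proposed `3gpp-ai wiki ...` surface (argparse stand-in)."""
    parser = argparse.ArgumentParser(prog="3gpp-ai")
    sub = parser.add_subparsers(dest="command", required=True)
    wiki = sub.add_parser("wiki").add_subparsers(
        dest="wiki_command", required=True)

    build = wiki.add_parser("build")
    build.add_argument("-w", "--workspace")
    build.add_argument("--force", action="store_true")

    query = wiki.add_parser("query")
    query.add_argument("text")
    query.add_argument("-w", "--workspace")

    status = wiki.add_parser("status")
    status.add_argument("-w", "--workspace")

    rebuild = wiki.add_parser("rebuild-topic")
    rebuild.add_argument("topic_slug")
    return parser
```

Keeping `-w/--workspace` optional everywhere preserves the active-workspace fallback behavior the rest of the CLI relies on.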

## Component Boundaries

| Component | Responsibility | Communicates With |
|-----------|---------------|-------------------|
| Extraction (`convert.py`) | Produce markdown + raw structured artifacts | Artifact Normalizer |
| Artifact Normalizer | Stable canonical schema with IDs | Wiki Compiler, Wiki Query Engine |
| Wiki Compiler | Generate wiki pages and edge records | Wiki Index Store |
| Wiki Index Store | Persist/retrieve pages, links, provenance | Wiki Query Engine, CLI status |
| Wiki Query Engine | Retrieve wiki context and call LLM for synthesis | LiteLLM client, Wiki Store |
| CLI adapters | Trigger build/query/status flows | Workspace APIs + wiki modules |

## Data Flow

### Build flow (workspace-level)

1. `workspace process` continues to drive extraction readiness per member.
2. For each member, call `extract_document_structured_from_tdoc()`.
3. Normalize extraction payload into `WikiArtifactBundle`.
4. Compile bundle into document page + topic candidates + edge records.
5. Persist:
   - document-level wiki artifacts in member `.ai/wiki/`
   - workspace-level page/edge index in workspace registry-adjacent storage.
6. Mark build status/metadata for `wiki status` visibility.

### Query flow (wiki-first)

1. Resolve workspace (active workspace fallback remains unchanged).
2. Query engine reads page/topic index (not vector search).
3. Retrieve top relevant pages via lexical + link-neighborhood ranking.
4. Pull cited artifact snippets.
5. Call LLM once for final synthesis with explicit citation constraints.
6. Return answer + citation table (page IDs, source artifact IDs, paths).
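Step 3 (lexical + link-neighborhood ranking) can be illustrated with a toy scorer: count query-term hits per page, then boost pages linked from pages that scored. This is a stand-in for the real ranking, whose weights and link model are still open:

```python
def rank_pages(pages: dict, links: dict, query_terms: list,
               neighbor_boost: float = 0.5) -> list:
    """Rank page IDs by lexical hits plus a link-neighborhood boost.

    pages: {page_id: text}; links: {page_id: [linked page_ids]}.
    """
    terms = [t.lower() for t in query_terms]
    lexical = {pid: sum(text.lower().count(t) for t in terms)
               for pid, text in pages.items()}
    scores = dict(lexical)
    for pid, hits in lexical.items():
        if hits == 0:
            continue
        # Pages linked from a lexical hit inherit part of its score.
        for neighbor in links.get(pid, []):
            if neighbor in scores:
                scores[neighbor] += neighbor_boost * hits
    return sorted(scores, key=scores.get, reverse=True)
```

The neighborhood boost is what lets a topic page surface even when the query terms only appear on the document pages that link to it.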

## Integration With Existing Modules

### Minimal-change call sites

- Keep extraction in `threegpp_ai.operations.convert` unchanged initially.
- Extend `threegpp_ai.cli._workspace.process_workspace_members` to optionally trigger wiki compilation per member (feature-flagged).
- Keep `summarize_tdoc` for single-document summaries; optionally route `--output-mode wiki` through wiki page rendering once wiki compiler is stable.
- Reuse `WorkspaceRegistry` for workspace selection and lifecycle; do not create a parallel workspace abstraction.

### Config integration

Extend `AiConfig` with wiki-specific flags while preserving defaults:

- `wiki_enabled: bool` (default false during migration)
- `wiki_build_on_process: bool` (default false initially)
- `wiki_query_max_pages: int`
- `wiki_topic_merge_threshold: float`
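As a dataclass sketch (the real `AiConfig` definition is not shown here, and the numeric defaults below are placeholders to be tuned, not recommendations):

```python
from dataclasses import dataclass

@dataclass
class AiConfig:
    # ... existing AiConfig fields omitted ...
    wiki_enabled: bool = False            # off during migration
    wiki_build_on_process: bool = False   # opt-in post-process build
    wiki_query_max_pages: int = 20        # placeholder bound on context size
    wiki_topic_merge_threshold: float = 0.8  # placeholder similarity cutoff
```

Defaulting both booleans to `False` is what makes Phase 1 and 2 genuinely no-behavior-change for existing users.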

## Migration Sequencing

### Phase 1: Foundation (no behavior change)

1. Add `wiki_artifacts` schema + normalizer.
2. Add `wiki_store` read/write APIs.
3. Persist normalized artifacts beside existing `.ai` outputs.

Exit criteria: extraction output can be normalized and stored for existing workspace members with no CLI regressions.

### Phase 2: Wiki build pipeline (opt-in)

1. Add `wiki_compiler` document-page generation.
2. Add `3gpp-ai wiki build` and `wiki status` commands.
3. Wire optional post-process build path from `workspace process`.

Exit criteria: workspace can produce deterministic wiki pages and index successfully.

### Phase 3: Wiki query path (parallel run)

1. Add `wiki_query` engine and `3gpp-ai wiki query` command.
2. Keep old retrieval/query path available for A/B validation.
3. Compare answer quality/citation coverage on representative TDoc sets.

Exit criteria: wiki query reaches agreed quality threshold and stable latency.

### Phase 4: Default switch and RAG deprecation

1. Make wiki query default for knowledge Q&A in CLI.
2. Keep compatibility shim and clear deprecation warnings for old RAG commands.
3. Remove obsolete vector-first code paths after one release window.

Exit criteria: all primary workflows use wiki-first architecture without regressions.

## Anti-Patterns to Avoid

### Anti-Pattern 1: Dual source of truth

**What:** Maintaining wiki content separately from extracted artifacts without stable references.
**Why bad:** drift, stale citations, difficult rebuilds.
**Instead:** extracted artifacts are canonical; wiki pages are compiled projections.

### Anti-Pattern 2: Big-bang replacement

**What:** Removing old query path before parity testing.
**Why bad:** high outage and quality regression risk.
**Instead:** run wiki query in parallel until measurable parity is reached.

### Anti-Pattern 3: Workspace bypass

**What:** Introducing wiki storage not keyed by existing workspace model.
**Why bad:** inconsistent CLI behavior and broken user expectations.
**Instead:** keep workspace as top-level namespace in all wiki components.

## Scalability Considerations

| Concern | At 100 docs | At 10K docs | At 1M docs |
|---------|-------------|-------------|------------|
| Wiki index storage | JSONL acceptable | Prefer SQLite indices | Partitioned store + sharded query service |
| Build throughput | single-process CLI | batched async workers | distributed build pipeline |
| Query latency | direct page scan | indexed retrieval + cached neighborhoods | precomputed topic graphs + service tier |
| Rebuild strategy | full rebuild acceptable | incremental by changed doc | event-driven incremental compiler |

## Sources

- Local codebase integration points in `packages/3gpp-ai/threegpp_ai/cli` and `packages/3gpp-ai/threegpp_ai/operations`
- Local pipeline notes in `packages/3gpp-ai/docs/PIPELINE.md`
# Feature Landscape

**Domain:** 3GPP AI document intelligence (v1.1 llm-wiki milestone)
**Researched:** 2026-04-27

## Scope and Candidate Set

Milestone focus is replacing traditional retrieval-first RAG with a Karpathy-style llm-wiki approach.

Interpreted user candidate list for this milestone:

- Candidate A: Traditional RAG stack (chunk/embedding/vector retrieval + answer synthesis)
- Candidate B: Karpathy llm-wiki stack (deterministic wiki pages as primary knowledge substrate; LLM reasons over curated pages)

## Table Stakes

Features users should expect in v1.1. Missing any of these makes the llm-wiki direction non-credible.

| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Deterministic wiki-page compiler from canonical extraction JSON | llm-wiki requires stable, replayable knowledge pages rather than ad hoc retrieval snippets | High | Must preserve provenance (`doc_id`, page, element IDs) and produce stable page IDs/slugs |
| Source-grounded answer mode with explicit citations | Replacing RAG must not reduce trust or traceability | Medium | Every answer must reference wiki page sections and original source coordinates |
| Incremental workspace rebuilds | Users need fast updates when new TDocs/spec revisions arrive | Medium | Recompile only affected pages; preserve unchanged page hashes |
| Entity/topic index over wiki corpus | Retrieval still exists, but over curated pages not raw chunks | Medium | Search target shifts from vector chunks to canonical wiki topics/entities |
| Quality-gated publishing pipeline | Prevent low-fidelity extraction from contaminating wiki knowledge | Medium | Block publish on failed extraction quality reasons already available in v1.0 |

## Differentiators

Features that create clear value beyond baseline replacement.

| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Dual-view pages (human narrative + machine facts panel) | Serves engineers and downstream automation from one artifact | Medium | Reuse canonical JSON fields for facts panel, no duplicate parsing |
| Conflict-aware synthesis across releases | Helps users understand deltas and contradictions between versions | High | Surface conflicting claims with per-source evidence blocks |
| Auto-generated concept graph from wiki pages | Retains graph navigation benefits without full GraphRAG runtime dependency | High | Build lightweight graph from page links/entities, not from online query-time extraction |
| Task-oriented answer templates (compliance/checklist/spec delta) | Improves consistency for recurring telecom workflows | Medium | Template outputs grounded in wiki pages and deterministic metadata |

## Anti-Features

Explicitly avoid in v1.1 to preserve milestone focus and reduce architecture churn.

| Anti-Feature | Why Avoid | What to Do Instead |
|--------------|-----------|-------------------|
| Reintroducing embedding/vector infrastructure as primary path | Conflicts with llm-wiki direction and reopens recently decommissioned surface | Keep retrieval over compiled wiki pages and deterministic indices |
| Query-time graph extraction/generation | Adds latency and non-determinism, weakens reproducibility | Precompute links/entities during wiki compile stage |
| Multi-backend retrieval abstraction (pgvector/OpenSearch/etc.) in v1.1 | Premature abstraction before proving llm-wiki fitness | Ship one local-first compile + index path, add adapters later if required |
| Autonomous page rewriting without provenance lock | Risks hallucinated drift from source specs | Enforce provenance-anchored page sections and diff-based updates |

## Requirements Implications (from Candidate List)

Implications derived from Candidate A vs Candidate B evaluation:

1. New requirement class: wiki compilation contracts.
	- Define stable page schema, deterministic slug rules, and provenance mapping guarantees.
2. New requirement class: reproducibility and diffability.
	- Re-running compile on unchanged artifacts must produce byte-stable wiki outputs.
3. New requirement class: governance guardrails.
	- Any synthesized statement must be linkable to specific source spans.
4. Requirement reduction: retrieval-backend configurability is no longer a v1.1 must-have.
	- Move vector-backend flexibility to post-v1.1 unless evidence shows hard blockers.
5. Requirement carryover from v1.0: extraction quality gates become publish gates.
	- Failed extraction quality must block wiki publish for affected pages.

## Framework-Fit Implications (from Candidate List)

| Candidate | Fit With Current 3gpp-ai Baseline | Implication |
|----------|-----------------------------------|-------------|
| Traditional RAG (Candidate A) | Low to medium fit. v1.0 explicitly decommissioned embedding-first modules and reset to extraction-first baseline. | Re-adoption would reintroduce removed surface area and dependency complexity. |
| Karpathy llm-wiki (Candidate B) | High fit. v1.0 already delivers deterministic extraction artifacts, stable IDs, and quality reports needed for wiki compilation. | Build compiler/index/publish layers on top of existing canonical outputs with minimal rollback risk. |

Framework recommendation for v1.1: prioritize Candidate B as the primary architecture; treat Candidate A only as an optional fallback research track if wiki answer quality cannot meet acceptance thresholds.

## MVP Recommendation

Prioritize:

1. Deterministic wiki-page compiler + provenance-preserving schema
2. Citation-grounded query/answer API over wiki pages
3. Incremental rebuild + publish gates tied to extraction quality

Defer:

- Advanced cross-release conflict synthesis (high value, but not required for first viable llm-wiki cut)
- Broad retrieval backend abstraction (not required until scale constraints are validated)

## Sources

- Workspace planning context: `.planning/PROJECT.md`, `.planning/STATE.md`, `.planning/milestones/v1.0-REQUIREMENTS.md`, `.planning/milestones/v1.0-ROADMAP.md`
- Package context: `packages/3gpp-ai/AGENTS.md`, `packages/3gpp-ai/README.md`, `packages/3gpp-ai/pyproject.toml`
# Domain Pitfalls

Domain: 3GPP standards-document corpora migration from traditional RAG to Karpathy-style LLM Wiki compilation
Researched: 2026-04-27
Overall confidence: MEDIUM

## Critical Pitfalls

### Pitfall 1: Citation Drift During Wiki Compilation
What goes wrong:
LLM-generated wiki pages summarize correctly at paragraph level but lose deterministic source anchors at clause/table/figure granularity.

Why it happens:
Traditional RAG systems often cite chunk IDs, while wiki synthesis introduces cross-document abstraction that dissolves exact provenance unless the schema forces it to be preserved.

Consequences:
- unverifiable statements in standards-sensitive workflows
- weak auditability for "why this answer" in compliance or engineering review
- inability to trace claims back to exact TS/TR versions

Prevention:
- require every generated claim to carry document_id, version, page, and span references
- compile wiki entries from deterministic extraction JSON, not markdown-only text
- enforce a no-citation/no-publish rule in the wiki build step
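The no-citation/no-publish rule is mechanically checkable. A minimal gate, assuming claims are dicts carrying a `citation` mapping (field names here follow the prevention list above, not an existing schema):

```python
REQUIRED_CITATION_FIELDS = ("document_id", "version", "page", "span")

def publishable(claims: list) -> tuple:
    """Return (ok, missing): ok is False if any claim lacks a complete
    source anchor; missing lists (claim_index, field) pairs."""
    missing = [
        (i, field)
        for i, claim in enumerate(claims)
        for field in REQUIRED_CITATION_FIELDS
        if field not in claim.get("citation", {})
    ]
    return (len(missing) == 0, missing)
```

Running this at wiki build time, and refusing to publish on a non-empty `missing` list, turns citation drift from a review finding into a build failure.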

Detection:
- citation coverage below threshold (for example under 98%)
- references that resolve to document but not to clause/span
- frequent human reviewer notes such as "cannot verify"

### Pitfall 2: Version Mixing Across Releases and CR States
What goes wrong:
The wiki blends content from multiple 3GPP releases or pre/post-CR states into one canonical statement.

Why it happens:
Standards corpora are highly versioned and parallelized; naive document aggregation ignores release tags, meeting rounds, and change-request lineage.

Consequences:
- contradictory requirements in the same wiki topic
- incorrect downstream guidance for implementation teams
- hidden regressions when newer text silently overrides frozen behavior

Prevention:
- include release, spec version, and meeting/CR lineage in extraction metadata and index keys
- generate release-scoped wiki snapshots first, then explicitly merge with conflict markers
- add deterministic tie-break rules for recency versus normative precedence

Detection:
- same concept mapped to conflicting normative verbs (shall/should/may)
- multiple answers for identical query under same declared release scope
- unexplained jumps in policy text after corpus refresh

### Pitfall 3: Normative Language Flattening
What goes wrong:
The wiki normalizes tone and accidentally weakens normative intent, converting strict requirements into descriptive prose.

Why it happens:
Generative summarization rewards readability and compression, but standards interpretation depends on preserving modality and conditions exactly.

Consequences:
- false compliance assumptions
- implementation bugs from softened requirements
- legal and certification risk in downstream usage

Prevention:
- extract and persist normative markers (shall/shall not/should/may) as structured fields
- render wiki sections with explicit "Normative" versus "Informative" partitions
- add automatic checks that forbid modality downgrades during synthesis

Detection:
- mismatch rate between source modality and wiki modality
- increase in unresolved "is this mandatory" reviewer questions
- high edit churn in normative sections after manual review

### Pitfall 4: Determinism Loss in Build Pipeline
What goes wrong:
Two runs over the same corpus produce materially different wiki pages, embeddings, or graph edges.

Why it happens:
Model sampling, non-pinned dependencies, non-stable chunk ordering, and asynchronous ingestion race conditions create non-reproducible artifacts.

Consequences:
- impossible regression debugging
- flaky evaluation baselines
- low trust in migration quality signals

Prevention:
- pin model versions and extraction dependencies
- disable stochastic generation in compile stages that define canonical wiki text
- enforce stable ordering keys and content hashing for each intermediate artifact
- store deterministic build manifests with input/output checksums
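The last prevention step is cheap to implement: a manifest keyed by sorted names and content hashes is byte-identical across same-input runs, so a plain string diff detects nondeterminism. A sketch:

```python
import hashlib
import json

def build_manifest(inputs: dict, outputs: dict) -> str:
    """Serialize a deterministic build manifest.

    inputs/outputs map artifact names to raw bytes. Sorted keys plus
    content hashes guarantee two same-input runs produce identical
    manifests, making hash diffs a direct nondeterminism detector.
    """
    def digest(blobs):
        return {name: hashlib.sha256(data).hexdigest()
                for name, data in sorted(blobs.items())}
    record = {"inputs": digest(inputs), "outputs": digest(outputs)}
    return json.dumps(record, sort_keys=True, indent=2)
```

Storing one manifest per build makes the first detection signal ("hash diffs between same-input runs") a one-line comparison in CI.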

Detection:
- hash diffs between same-input runs
- large semantic delta with no corpus changes
- non-repeatable benchmark outcomes

## Moderate Pitfalls

### Pitfall 5: Table and Formula Semantics Collapse
What goes wrong:
Critical constraints from tabular cells, footnotes, and formulas are dropped or rewritten ambiguously in wiki prose.

Prevention:
- preserve table cell coordinates, header lineage, and footnote references in extraction output
- include formula normalization plus source rendering references
- require targeted table/formula fidelity tests before promoting a wiki build

### Pitfall 6: Over-Aggressive Deduplication of Near-Duplicate Clauses
What goes wrong:
Dedup logic merges clauses that appear similar but differ by release, scenario, or conditional scope.

Prevention:
- deduplicate only within strict scope keys (spec_id + version + clause lineage)
- use semantic similarity as a hint, never as an automatic merge decision
- flag high-similarity cross-version items for explicit adjudication
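The similarity-as-hint rule can be sketched as a classifier that only proposes merges inside a strict scope key and flags everything else for human adjudication (`difflib` similarity is a stand-in for whatever semantic measure the pipeline uses):

```python
from difflib import SequenceMatcher

def classify_pairs(clauses: list, threshold: float = 0.9) -> tuple:
    """Split high-similarity clause pairs into merge candidates and
    cross-scope review flags.

    clauses: dicts with spec_id, version, lineage, text.
    Pairs sharing the full scope key (spec_id, version, lineage) become
    merge candidates; similar pairs across scopes are flagged, never merged.
    """
    merges, flags = [], []
    for i in range(len(clauses)):
        for j in range(i + 1, len(clauses)):
            a, b = clauses[i], clauses[j]
            ratio = SequenceMatcher(None, a["text"], b["text"]).ratio()
            if ratio < threshold:
                continue
            same_scope = ((a["spec_id"], a["version"], a["lineage"]) ==
                          (b["spec_id"], b["version"], b["lineage"]))
            (merges if same_scope else flags).append((i, j))
    return merges, flags
```

The quadratic pair loop is fine for a sketch; at corpus scale the same policy would sit behind a blocking/index step.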

### Pitfall 7: Query-Time Determinism Broken by Hybrid Retrieval Fallbacks
What goes wrong:
The system claims deterministic extraction-backed answers, but query path silently falls back to legacy vector chunks with weaker provenance.

Prevention:
- separate legacy and wiki retrieval modes at API boundary
- expose provenance mode in every response payload
- block fallback in strict compliance mode

## Minor Pitfalls

### Pitfall 8: Evaluation Set Not Representative of Standards Edge Cases
What goes wrong:
Migration appears successful on generic Q&A tests but fails on annexes, exception clauses, and conditional procedures.

Prevention:
- curate benchmark sets for normative conflicts, cross-reference chains, and release deltas
- include deterministic extraction quality metrics alongside answer quality metrics

### Pitfall 9: Human Review Workflow Too Late in Pipeline
What goes wrong:
Experts only review final wiki pages, when fixing errors is costly and root-cause attribution is hard.

Prevention:
- add review checkpoints at extraction schema validation and wiki pre-publish diff stages
- provide reviewer tools that jump directly from wiki sentence to source span

## Phase-Specific Warnings

| Phase Topic | Likely Pitfall | Mitigation |
|-------------|---------------|------------|
| Corpus normalization | Version mixing | Release-scoped canonical IDs and lineage fields |
| Deterministic extraction | Build nondeterminism | Stable ordering, pinned models, checksums |
| Wiki compilation | Citation drift | Mandatory span-level citations and publish gates |
| Query integration | Hidden legacy fallback | Explicit retrieval mode contracts |
| Validation | False confidence from weak tests | Standards-specific benchmark suite |

## Sources

- 3GPP series and versioning context (official): https://www.3gpp.org/specifications-technologies/specifications-by-series
- Workspace guidance for 3gpp-ai extraction/storage architecture: packages/3gpp-ai/AGENTS.md
- Workspace global guidance on configuration/path determinism: AGENTS.md