Commit 90da54db authored by Jan Reimes's avatar Jan Reimes

refactor(ai): remove summarization from pipeline and CLI option; use EmbeddingsManager factory for storage access
parent 1a6e73bc
+2 −2
@@ -21,7 +21,7 @@ dependencies = [
    "brotli>=1.2.0",
    "hishel>=1.1.8",
    "lxml>=6.0.2",
-    "pandas<3.0.0",
+    "pandas>=3.0.0",
    "pydantic>=2.12.2",
    "pydantic-sqlite>=0.4.0",
    "python-calamine>=0.5.3",
@@ -113,4 +113,4 @@ style = "semver"

[tool.uv.sources]
specify-cli = { git = "https://github.com/github/spec-kit.git" }
-tdoc-ai = { path = "tdoc-ai", editable = true }
+tdoc-ai = { path = "src/tdoc-ai", editable = true }

src/tdoc-ai/AGENTS.md

0 → 100644
+136 −0
# Assistant Rules for tdoc-ai Package

## Overview

The `tdoc-ai` package provides AI-powered document processing for 3GPP TDocs. It handles embeddings, knowledge graphs, summarization, and semantic search. This package is integrated into the main `tdoc-crawler` CLI under the `ai` command group.

## Package Structure

```
src/tdoc-ai/tdoc_ai/
├── __init__.py           # Public API exports, factory functions
├── config.py             # AiConfig (environment-based configuration)
├── models.py             # Pydantic models (ProcessingStatus, DocumentSummary, etc.)
├── storage.py            # AiStorage (LanceDB-based vector storage)
├── operations/
│   ├── pipeline.py       # Main processing pipeline (CLASSIFY → EXTRACT → EMBED → GRAPH)
│   ├── embeddings.py     # EmbeddingsManager (local embedding generation)
│   ├── classify.py       # Document classification
│   ├── extract.py        # DOCX to Markdown extraction
│   ├── summarize.py      # LLM-based summarization
│   ├── graph.py          # Knowledge graph operations
│   ├── convert.py        # Document conversion
│   ├── workspaces.py     # Workspace member management
│   └── workspace_registry.py  # Workspace CRUD
```

## Key Design Patterns

### Factory Pattern for EmbeddingsManager

The `EmbeddingsManager` uses a factory pattern to break the circular dependency between config, storage, and embeddings:

```python
# CORRECT: Use factory method
from tdoc_ai import create_embeddings_manager
manager = create_embeddings_manager()  # or with explicit config
storage = manager.storage  # Access storage via property

# DISCOURAGED: importing the internal module directly; prefer the public factory above
from tdoc_ai.operations.embeddings import EmbeddingsManager
manager = EmbeddingsManager.from_config(config)
```

### Pipeline Stages

The processing pipeline runs in order:

1. **CLASSIFY** - Identify main document among multiple files
2. **EXTRACT** - Convert DOCX to Markdown
3. **EMBED** - Generate vector embeddings (local, no LLM required)
4. **GRAPH** - Build knowledge graph

**Note:** Summarization is NOT part of the pipeline. Use `ai summarize <doc_id>` command for on-demand LLM-based summarization.
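The stage ordering and skip-completed behaviour can be sketched as follows. `PipelineStage` and `ProcessingStatus` mirror the model names listed above, but the bodies here are illustrative, not the actual implementation:

```python
from dataclasses import dataclass, field
from enum import Enum


class PipelineStage(Enum):
    # Enum definition order matches processing order
    CLASSIFY = "classify"
    EXTRACT = "extract"
    EMBED = "embed"
    GRAPH = "graph"


@dataclass
class ProcessingStatus:
    doc_id: str
    completed: set[PipelineStage] = field(default_factory=set)


def run_pipeline(status: ProcessingStatus) -> ProcessingStatus:
    """Run stages in order, skipping any stage already completed (resume)."""
    for stage in PipelineStage:
        if stage in status.completed:
            continue
        # ... per-stage work would happen here; no LLM calls in any stage ...
        status.completed.add(stage)
    return status
```

Because completed stages are skipped, re-running the pipeline on a partially processed document only executes the remaining stages.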

### Separation: Pipeline vs CLI Summarize

| Command | Purpose | LLM Required |
|---------|---------|--------------|
| `ai workspace process` | Embed documents for semantic search | No |
| `ai summarize <doc>` | Generate LLM summary | Yes |

## Configuration

All configuration is environment-based via `AiConfig.from_env()`:

- `EMBEDDING_MODEL` - Sentence transformer model (default: `sentence-transformers/all-MiniLM-L6-v2`)
- `EMBEDDING_DIMENSION` - Vector dimension (default: 384)
- `LLM_MODEL` - LLM model for summarization (default: `openai/gpt-4o-mini`)
- LanceDB path - storage location for the vector database
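A minimal sketch of environment-based loading in the style of `AiConfig.from_env()`; the variable names and defaults follow the list above, while the class layout is illustrative:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AiConfig:
    embedding_model: str
    embedding_dimension: int
    llm_model: str

    @classmethod
    def from_env(cls) -> "AiConfig":
        """Read configuration from the environment, falling back to defaults."""
        return cls(
            embedding_model=os.environ.get(
                "EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"
            ),
            embedding_dimension=int(os.environ.get("EMBEDDING_DIMENSION", "384")),
            llm_model=os.environ.get("LLM_MODEL", "openai/gpt-4o-mini"),
        )
```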

## Storage Layer

AiStorage uses LanceDB for vector storage:
- Embeddings are stored with document metadata
- Supports workspace-scoped storage
- Provides status tracking (classified, extracted, embedded, graphed)
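The per-stage status tracking can be sketched with an in-memory stand-in; the real `AiStorage` is LanceDB-backed, and `InMemoryAiStorage`, `DocStatus`, and `mark()` are hypothetical names used only for this sketch:

```python
from dataclasses import dataclass


@dataclass
class DocStatus:
    classified: bool = False
    extracted: bool = False
    embedded: bool = False
    graphed: bool = False


class InMemoryAiStorage:
    """Toy stand-in for the LanceDB-backed storage, keyed by (workspace, doc_id)."""

    def __init__(self) -> None:
        self._status: dict[tuple[str, str], DocStatus] = {}

    def mark(self, doc_id: str, stage: str, workspace: str = "default") -> None:
        # Flip the flag for one completed stage in the given workspace
        status = self._status.setdefault((workspace, doc_id), DocStatus())
        setattr(status, stage, True)

    def get_status(self, doc_id: str, workspace: str = "default") -> DocStatus:
        return self._status.get((workspace, doc_id), DocStatus())
```

Keying by `(workspace, doc_id)` illustrates the workspace-scoped storage mentioned above: the same document can have independent status in different workspaces.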

## CLI Integration

The `tdoc-ai` package is exposed via `tdoc-crawler ai` commands:
- `ai summarize <doc>` - LLM summarization
- `ai query <text>` - Semantic search
- `ai workspace process` - Batch embedding
- `ai workspace list-members` - List workspace contents

## Import Guidelines

```python
# Public API (preferred)
from tdoc_ai import (
    create_embeddings_manager,
    process_document,
    process_all,
    get_status,
    query_graph,
    summarize_document,
)

# Internal operations when needed
from tdoc_ai.operations.embeddings import EmbeddingsManager
from tdoc_ai.operations.pipeline import run_pipeline

# Models
from tdoc_ai.models import ProcessingStatus, PipelineStage
```

## Common Tasks

### Processing Documents
```python
from pathlib import Path

from tdoc_ai import process_document
status = process_document("SP-123456", Path("./checkouts/SP-123456"))
```

### Querying
```python
from tdoc_ai import query_graph
results = query_graph("What is the status of 5G NR?", workspace="my_ws")
```

### Creating Embeddings
```python
from tdoc_ai import create_embeddings_manager
manager = create_embeddings_manager()
manager.generate_embeddings(doc_id, artifact_path)
```

## Lessons Learned

1. **No LLM in Pipeline**: The process pipeline runs completely locally using sentence transformers. LLM access is only needed for summarization, which is a separate command.

2. **Factory Pattern**: EmbeddingsManager uses `from_config()` factory to load the embedding model once, extract the dimension, create storage, then return the manager.

3. **Workspace Isolation**: All operations support optional `workspace` parameter for multi-tenant isolation.

4. **Status Tracking**: Each document has a ProcessingStatus tracking completed stages for resume capability.
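The factory ordering from lesson 2 can be sketched like this. `FakeModel` stands in for a sentence-transformers model (whose real API does expose `get_sentence_embedding_dimension()`); the class bodies are illustrative, not the actual implementation:

```python
class FakeModel:
    """Stand-in for a loaded sentence-transformers model."""

    def get_sentence_embedding_dimension(self) -> int:
        return 384


class Storage:
    """Stand-in for AiStorage; must be created with the model's dimension."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension


class EmbeddingsManager:
    def __init__(self, model: FakeModel, storage: Storage) -> None:
        self.model = model
        self.storage = storage

    @classmethod
    def from_config(cls, model_name: str) -> "EmbeddingsManager":
        model = FakeModel()                             # 1. load the model once
        dim = model.get_sentence_embedding_dimension()  # 2. read its dimension
        storage = Storage(dimension=dim)                # 3. create storage with it
        return cls(model, storage)                      # 4. return the wired manager
```

This ordering is what breaks the config/storage/embeddings circular dependency: storage never needs to know the model name, only the dimension the loaded model reports.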

src/tdoc-ai/README.md

0 → 100644
+17 −0
# tdoc-ai

Optional AI extension package for `tdoc-crawler`.

This package contains AI-focused capabilities including:

- Document extraction and conversion
- Summarization
- Embeddings and semantic search
- GraphRAG querying
- AI workspace management

Install via `tdoc-crawler` extras:

```bash
uv add "tdoc-crawler[ai]"
```
+39 −0
[project]
name = "tdoc-ai"
version = "0.1.0"
description = "Optional AI/RAG extension package for tdoc-crawler"
authors = [{ name = "Jan Reimes", email = "jan.reimes@head-acoustics.com" }]
readme = "README.md"
keywords = ["python", "3gpp", "rag", "ai"]
requires-python = ">=3.14,<4.0"
classifiers = [
    "Intended Audience :: Developers",
    "Programming Language :: Python",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.14",
    "Topic :: Software Development :: Libraries :: Python Modules",
]
dependencies = [
    "doc2txt>=1.0.8",
    #"doc2txt>=1.0.8 @ git+https://github.com/Quantatirsk/doc2txt-pypi.git"
    "kreuzberg[all]>=4.0.0",
    "lancedb>=0.29.2",
    "litellm>=1.81.15",
    "sentence-transformers[openvino]>=2.7.0",
    "tokenizers>=0.22.2",
]

[project.urls]
Repository = "https://forge.3gpp.org/rep/reimes/tdoc-crawler"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.uv.sources]
# The doc2txt repository contains pyproject.toml AND setup.py/setup.cfg,
# which causes installation of unnecessary additional dependencies.
# If compiler issues arise because of this, consider switching to:
# - the git+https installation method (commented out above), or
# - a dedicated local workspace package (copied/improved from doc2txt) with a
#   simplified pyproject.toml that only includes the dependencies tdoc-ai needs.
doc2txt = { git = "https://github.com/Quantatirsk/doc2txt-pypi.git" }
\ No newline at end of file
+136 −0
"""AI document processing domain package."""

from __future__ import annotations

import litellm

from tdoc_ai.config import AiConfig
from tdoc_ai.models import (
    DocumentChunk,
    DocumentClassification,
    DocumentSummary,
    GraphEdge,
    GraphNode,
    PipelineStage,
    ProcessingStatus,
)
from tdoc_ai.operations.convert import convert_tdoc as convert_document
from tdoc_ai.operations.embeddings import EmbeddingsManager
from tdoc_ai.operations.graph import query_graph
from tdoc_ai.operations.pipeline import get_status, process_all
from tdoc_ai.operations.pipeline import process_tdoc as process_document
from tdoc_ai.operations.summarize import SummarizeResult
from tdoc_ai.operations.summarize import summarize_tdoc as summarize_document
from tdoc_ai.operations.workspace_registry import (
    DEFAULT_WORKSPACE,
    WorkspaceDisplayInfo,
    WorkspaceRegistry,
    get_active_workspace,
    set_active_workspace,
)
from tdoc_ai.operations.workspaces import (
    add_workspace_members,
    checkout_spec_to_workspace,
    checkout_tdoc_to_workspace,
    create_workspace,
    delete_workspace,
    ensure_ai_subfolder,
    ensure_default_workspace,
    get_workspace,
    get_workspace_member_counts,
    is_default_workspace,
    list_workspace_members,
    list_workspaces,
    make_workspace_member,
    normalize_workspace_name,
    remove_invalid_members,
    resolve_tdoc_checkout_path,
    resolve_workspace,
)
from tdoc_ai.storage import AiStorage
from tdoc_crawler.config import CacheManager

litellm.suppress_debug_info = True  # Suppress provider/model info logs from litellm

process_tdoc = process_document


def create_embeddings_manager(config: AiConfig | None = None) -> EmbeddingsManager:
    """Create an EmbeddingsManager with proper initialization.

    This is the primary entry point for creating AI services.
    Loads model once, creates storage with correct dimension.

    Args:
        config: Optional config. If None, loads from environment.

    Returns:
        EmbeddingsManager with .storage and .config properties.
    """
    if config is None:
        config = AiConfig.from_env()
    return EmbeddingsManager.from_config(config)


# Backward compatibility alias
def get_embeddings_manager() -> EmbeddingsManager:
    """Get embeddings manager singleton (deprecated).

    Use create_embeddings_manager() instead.
    """
    return create_embeddings_manager()


def get_ai_storage(config: AiConfig | None = None) -> AiStorage:
    """Get storage instance (deprecated).

    Use create_embeddings_manager().storage instead.
    """
    return create_embeddings_manager(config).storage


__all__ = [
    "DEFAULT_WORKSPACE",
    "AiConfig",
    "AiStorage",
    "CacheManager",
    "DocumentChunk",
    "DocumentClassification",
    "DocumentSummary",
    "GraphEdge",
    "GraphNode",
    "PipelineStage",
    "ProcessingStatus",
    "SummarizeResult",
    "WorkspaceDisplayInfo",
    "WorkspaceRegistry",
    "add_workspace_members",
    "checkout_spec_to_workspace",
    "checkout_tdoc_to_workspace",
    "convert_document",
    "create_embeddings_manager",
    "create_workspace",
    "delete_workspace",
    "ensure_ai_subfolder",
    "ensure_default_workspace",
    "get_active_workspace",
    "get_ai_storage",
    "get_embeddings_manager",
    "get_status",
    "get_workspace",
    "get_workspace_member_counts",
    "is_default_workspace",
    "list_workspace_members",
    "list_workspaces",
    "make_workspace_member",
    "normalize_workspace_name",
    "process_all",
    "process_document",
    "process_tdoc",
    "query_graph",
    "remove_invalid_members",
    "resolve_tdoc_checkout_path",
    "resolve_workspace",
    "set_active_workspace",
    "summarize_document",
]