Commit 90da54db authored by Jan Reimes's avatar Jan Reimes

refactor(ai): remove summarization from pipeline and CLI option; use EmbeddingsManager factory for storage access
parent 1a6e73bc
+2 −2
@@ -21,7 +21,7 @@ dependencies = [
    "brotli>=1.2.0",
    "hishel>=1.1.8",
    "lxml>=6.0.2",
-    "pandas<3.0.0",
+    "pandas>=3.0.0",
    "pydantic>=2.12.2",
    "pydantic-sqlite>=0.4.0",
    "python-calamine>=0.5.3",
@@ -113,4 +113,4 @@ style = "semver"

[tool.uv.sources]
specify-cli = { git = "https://github.com/github/spec-kit.git" }
-tdoc-ai = { path = "tdoc-ai", editable = true }
+tdoc-ai = { path = "src/tdoc-ai", editable = true }

src/tdoc-ai/AGENTS.md

0 → 100644
+136 −0
# Assistant Rules for tdoc-ai Package

## Overview

The `tdoc-ai` package provides AI-powered document processing for 3GPP TDocs. It handles embeddings, knowledge graphs, summarization, and semantic search. This package is integrated into the main `tdoc-crawler` CLI under the `ai` command group.

## Package Structure

```
src/tdoc-ai/tdoc_ai/
├── __init__.py           # Public API exports, factory functions
├── config.py             # AiConfig (environment-based configuration)
├── models.py             # Pydantic models (ProcessingStatus, DocumentSummary, etc.)
├── storage.py            # AiStorage (LanceDB-based vector storage)
├── operations/
│   ├── pipeline.py       # Main processing pipeline (CLASSIFY → EXTRACT → EMBED → GRAPH)
│   ├── embeddings.py     # EmbeddingsManager (local embedding generation)
│   ├── classify.py       # Document classification
│   ├── extract.py        # DOCX to Markdown extraction
│   ├── summarize.py      # LLM-based summarization
│   ├── graph.py          # Knowledge graph operations
│   ├── convert.py        # Document conversion
│   ├── workspaces.py     # Workspace member management
│   └── workspace_registry.py  # Workspace CRUD
```

## Key Design Patterns

### Factory Pattern for EmbeddingsManager

The `EmbeddingsManager` uses a factory pattern to break the circular dependency between config, storage, and embeddings:

```python
# CORRECT: Use factory method
from tdoc_ai import create_embeddings_manager
manager = create_embeddings_manager()  # or with explicit config
storage = manager.storage  # Access storage via property

# DISCOURAGED: importing the internal module directly; prefer the public factory above
from tdoc_ai.operations.embeddings import EmbeddingsManager
manager = EmbeddingsManager.from_config(config)
```

### Pipeline Stages

The processing pipeline runs in order:

1. **CLASSIFY** - Identify main document among multiple files
2. **EXTRACT** - Convert DOCX to Markdown
3. **EMBED** - Generate vector embeddings (local, no LLM required)
4. **GRAPH** - Build knowledge graph

**Note:** Summarization is NOT part of the pipeline. Use `ai summarize <doc_id>` command for on-demand LLM-based summarization.
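The stage ordering and skip-completed behaviour can be sketched as follows. `PipelineStage` and `ProcessingStatus` mirror the model names listed above, but the bodies here are illustrative, not the actual implementation:

```python
from dataclasses import dataclass, field
from enum import Enum


class PipelineStage(Enum):
    # Enum definition order matches processing order
    CLASSIFY = "classify"
    EXTRACT = "extract"
    EMBED = "embed"
    GRAPH = "graph"


@dataclass
class ProcessingStatus:
    doc_id: str
    completed: set[PipelineStage] = field(default_factory=set)


def run_pipeline(status: ProcessingStatus) -> ProcessingStatus:
    """Run stages in order, skipping any stage already completed (resume)."""
    for stage in PipelineStage:
        if stage in status.completed:
            continue
        # ... per-stage work would happen here; no LLM calls in any stage ...
        status.completed.add(stage)
    return status
```

Because completed stages are skipped, re-running the pipeline on a partially processed document only executes the remaining stages.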

### Separation: Pipeline vs CLI Summarize

| Command | Purpose | LLM Required |
|---------|---------|--------------|
| `ai workspace process` | Embed documents for semantic search | No |
| `ai summarize <doc>` | Generate LLM summary | Yes |

## Configuration

All configuration is environment-based via `AiConfig.from_env()`:

- `EMBEDDING_MODEL` - Sentence transformer model (default: `sentence-transformers/all-MiniLM-L6-v2`)
- `EMBEDDING_DIMENSION` - Vector dimension (default: 384)
- `LLM_MODEL` - LLM model for summarization (default: `openai/gpt-4o-mini`)
- LanceDB path - storage location for the vector database
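A minimal sketch of environment-based loading in the style of `AiConfig.from_env()`; the variable names and defaults follow the list above, while the class layout is illustrative:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AiConfig:
    embedding_model: str
    embedding_dimension: int
    llm_model: str

    @classmethod
    def from_env(cls) -> "AiConfig":
        """Read configuration from the environment, falling back to defaults."""
        return cls(
            embedding_model=os.environ.get(
                "EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"
            ),
            embedding_dimension=int(os.environ.get("EMBEDDING_DIMENSION", "384")),
            llm_model=os.environ.get("LLM_MODEL", "openai/gpt-4o-mini"),
        )
```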

## Storage Layer

AiStorage uses LanceDB for vector storage:
- Embeddings are stored with document metadata
- Supports workspace-scoped storage
- Provides status tracking (classified, extracted, embedded, graphed)
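The per-stage status tracking can be sketched with an in-memory stand-in; the real `AiStorage` is LanceDB-backed, and `InMemoryAiStorage`, `DocStatus`, and `mark()` are hypothetical names used only for this sketch:

```python
from dataclasses import dataclass


@dataclass
class DocStatus:
    classified: bool = False
    extracted: bool = False
    embedded: bool = False
    graphed: bool = False


class InMemoryAiStorage:
    """Toy stand-in for the LanceDB-backed storage, keyed by (workspace, doc_id)."""

    def __init__(self) -> None:
        self._status: dict[tuple[str, str], DocStatus] = {}

    def mark(self, doc_id: str, stage: str, workspace: str = "default") -> None:
        # Flip the flag for one completed stage in the given workspace
        status = self._status.setdefault((workspace, doc_id), DocStatus())
        setattr(status, stage, True)

    def get_status(self, doc_id: str, workspace: str = "default") -> DocStatus:
        return self._status.get((workspace, doc_id), DocStatus())
```

Keying by `(workspace, doc_id)` illustrates the workspace-scoped storage mentioned above: the same document can have independent status in different workspaces.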

## CLI Integration

The `tdoc-ai` package is exposed via `tdoc-crawler ai` commands:
- `ai summarize <doc>` - LLM summarization
- `ai query <text>` - Semantic search
- `ai workspace process` - Batch embedding
- `ai workspace list-members` - List workspace contents

## Import Guidelines

```python
# Public API (preferred)
from tdoc_ai import (
    create_embeddings_manager,
    process_document,
    process_all,
    get_status,
    query_graph,
    summarize_document,
)

# Internal operations when needed
from tdoc_ai.operations.embeddings import EmbeddingsManager
from tdoc_ai.operations.pipeline import run_pipeline

# Models
from tdoc_ai.models import ProcessingStatus, PipelineStage
```

## Common Tasks

### Processing Documents
```python
from pathlib import Path

from tdoc_ai import process_document
status = process_document("SP-123456", Path("./checkouts/SP-123456"))
```

### Querying
```python
from tdoc_ai import query_graph
results = query_graph("What is the status of 5G NR?", workspace="my_ws")
```

### Creating Embeddings
```python
from tdoc_ai import create_embeddings_manager
manager = create_embeddings_manager()
manager.generate_embeddings(doc_id, artifact_path)
```

## Lessons Learned

1. **No LLM in Pipeline**: The process pipeline runs completely locally using sentence transformers. LLM access is only needed for summarization, which is a separate command.

2. **Factory Pattern**: EmbeddingsManager uses `from_config()` factory to load the embedding model once, extract the dimension, create storage, then return the manager.

3. **Workspace Isolation**: All operations support optional `workspace` parameter for multi-tenant isolation.

4. **Status Tracking**: Each document has a ProcessingStatus tracking completed stages for resume capability.
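The factory ordering from lesson 2 can be sketched like this. `FakeModel` stands in for a sentence-transformers model (whose real API does expose `get_sentence_embedding_dimension()`); the class bodies are illustrative, not the actual implementation:

```python
class FakeModel:
    """Stand-in for a loaded sentence-transformers model."""

    def get_sentence_embedding_dimension(self) -> int:
        return 384


class Storage:
    """Stand-in for AiStorage; must be created with the model's dimension."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension


class EmbeddingsManager:
    def __init__(self, model: FakeModel, storage: Storage) -> None:
        self.model = model
        self.storage = storage

    @classmethod
    def from_config(cls, model_name: str) -> "EmbeddingsManager":
        model = FakeModel()                             # 1. load the model once
        dim = model.get_sentence_embedding_dimension()  # 2. read its dimension
        storage = Storage(dimension=dim)                # 3. create storage with it
        return cls(model, storage)                      # 4. return the wired manager
```

This ordering is what breaks the config/storage/embeddings circular dependency: storage never needs to know the model name, only the dimension the loaded model reports.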

src/tdoc-ai/README.md

0 → 100644
+17 −0
# tdoc-ai

Optional AI extension package for `tdoc-crawler`.

This package contains AI-focused capabilities including:

- Document extraction and conversion
- Summarization
- Embeddings and semantic search
- GraphRAG querying
- AI workspace management

Install via `tdoc-crawler` extras:

```bash
uv add "tdoc-crawler[ai]"
```
+39 −0
[project]
name = "tdoc-ai"
version = "0.1.0"
description = "Optional AI/RAG extension package for tdoc-crawler"
authors = [{ name = "Jan Reimes", email = "jan.reimes@head-acoustics.com" }]
readme = "README.md"
keywords = ["python", "3gpp", "rag", "ai"]
requires-python = ">=3.14,<4.0"
classifiers = [
    "Intended Audience :: Developers",
    "Programming Language :: Python",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.14",
    "Topic :: Software Development :: Libraries :: Python Modules",
]
dependencies = [
    "doc2txt>=1.0.8",
    #"doc2txt>=1.0.8 @ git+https://github.com/Quantatirsk/doc2txt-pypi.git"
    "kreuzberg[all]>=4.0.0",
    "lancedb>=0.29.2",
    "litellm>=1.81.15",
    "sentence-transformers[openvino]>=2.7.0",
    "tokenizers>=0.22.2",
]

[project.urls]
Repository = "https://forge.3gpp.org/rep/reimes/tdoc-crawler"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.uv.sources]
# The doc2txt repository contains pyproject.toml AND setup.py/setup.cfg,
# which causes installation of unnecessary additional dependencies.
# If compiler issues arise because of this, consider switching to:
# - the git+https installation method (commented out above), or
# - a dedicated local workspace package (copied/improved from doc2txt) with a
#   simplified pyproject.toml that only includes the dependencies tdoc-ai needs.
doc2txt = { git = "https://github.com/Quantatirsk/doc2txt-pypi.git" }
\ No newline at end of file
+136 −0
"""AI document processing domain package."""

from __future__ import annotations

import litellm

from tdoc_ai.config import AiConfig
from tdoc_ai.models import (
    DocumentChunk,
    DocumentClassification,
    DocumentSummary,
    GraphEdge,
    GraphNode,
    PipelineStage,
    ProcessingStatus,
)
from tdoc_ai.operations.convert import convert_tdoc as convert_document
from tdoc_ai.operations.embeddings import EmbeddingsManager
from tdoc_ai.operations.graph import query_graph
from tdoc_ai.operations.pipeline import get_status, process_all
from tdoc_ai.operations.pipeline import process_tdoc as process_document
from tdoc_ai.operations.summarize import SummarizeResult
from tdoc_ai.operations.summarize import summarize_tdoc as summarize_document
from tdoc_ai.operations.workspace_registry import (
    DEFAULT_WORKSPACE,
    WorkspaceDisplayInfo,
    WorkspaceRegistry,
    get_active_workspace,
    set_active_workspace,
)
from tdoc_ai.operations.workspaces import (
    add_workspace_members,
    checkout_spec_to_workspace,
    checkout_tdoc_to_workspace,
    create_workspace,
    delete_workspace,
    ensure_ai_subfolder,
    ensure_default_workspace,
    get_workspace,
    get_workspace_member_counts,
    is_default_workspace,
    list_workspace_members,
    list_workspaces,
    make_workspace_member,
    normalize_workspace_name,
    remove_invalid_members,
    resolve_tdoc_checkout_path,
    resolve_workspace,
)
from tdoc_ai.storage import AiStorage
from tdoc_crawler.config import CacheManager

litellm.suppress_debug_info = True  # Suppress provider/model info logs from litellm

process_tdoc = process_document


def create_embeddings_manager(config: AiConfig | None = None) -> EmbeddingsManager:
    """Create an EmbeddingsManager with proper initialization.

    This is the primary entry point for creating AI services.
    Loads model once, creates storage with correct dimension.

    Args:
        config: Optional config. If None, loads from environment.

    Returns:
        EmbeddingsManager with .storage and .config properties.
    """
    if config is None:
        config = AiConfig.from_env()
    return EmbeddingsManager.from_config(config)


# Backward compatibility alias
def get_embeddings_manager() -> EmbeddingsManager:
    """Get embeddings manager singleton (deprecated).

    Use create_embeddings_manager() instead.
    """
    return create_embeddings_manager()


def get_ai_storage(config: AiConfig | None = None) -> AiStorage:
    """Get storage instance (deprecated).

    Use create_embeddings_manager().storage instead.
    """
    return create_embeddings_manager(config).storage


__all__ = [
    "DEFAULT_WORKSPACE",
    "AiConfig",
    "AiStorage",
    "CacheManager",
    "DocumentChunk",
    "DocumentClassification",
    "DocumentSummary",
    "GraphEdge",
    "GraphNode",
    "PipelineStage",
    "ProcessingStatus",
    "SummarizeResult",
    "WorkspaceDisplayInfo",
    "WorkspaceRegistry",
    "add_workspace_members",
    "checkout_spec_to_workspace",
    "checkout_tdoc_to_workspace",
    "convert_document",
    "create_embeddings_manager",
    "create_workspace",
    "delete_workspace",
    "ensure_ai_subfolder",
    "ensure_default_workspace",
    "get_active_workspace",
    "get_ai_storage",
    "get_embeddings_manager",
    "get_status",
    "get_workspace",
    "get_workspace_member_counts",
    "is_default_workspace",
    "list_workspace_members",
    "list_workspaces",
    "make_workspace_member",
    "normalize_workspace_name",
    "process_all",
    "process_document",
    "process_tdoc",
    "query_graph",
    "remove_invalid_members",
    "resolve_tdoc_checkout_path",
    "resolve_workspace",
    "set_active_workspace",
    "summarize_document",
]