Commit 7c036fb2 authored by Jan Reimes's avatar Jan Reimes

feat(ai): enhance LightRAG integration and configuration

* Refactor summarize.py to use lazy imports for EmbeddingsManager.
* Update workspaces.py to allow optional database file for spec checkout.
* Add integration tests for LightRAG pipeline in test_integration.py.
* Create unit tests for LightRAG configuration in test_lightrag_config.py.
* Implement unit tests for metadata enrichment in test_metadata.py.
* Modify ai_app.py to streamline AI loading and remove unused commands.
* Change default cache directory and database filename in config.
parent 56f17344
+94 −36
# tdoc-ai

AI-powered document processing for 3GPP TDocs (embeddings, knowledge graphs, semantic search).
AI-powered document processing for 3GPP TDocs (knowledge graphs, semantic search, summarization).

## Package Structure

See `src/tdoc-ai/tdoc_ai/` for full module layout.
```
packages/tdoc-ai/tdoc_ai/
├── __init__.py           # Public API exports
├── config.py             # AiConfig (legacy, for summarize only)
├── context.py            # DocumentContext
├── models.py             # Pydantic models (ProcessingStatus, etc.)
├── storage.py            # AiStorage (legacy, LanceDB-based)
├── lightrag/             # NEW: LightRAG integration
│   ├── __init__.py
│   ├── cli.py            # rag query/status commands
│   ├── config.py         # LightRAGConfig + sub-configs
│   ├── metadata.py       # RAGMetadata, enrich_text()
│   ├── pg0_manager.py    # Pg0Manager
│   ├── processor.py      # TDocProcessor
│   ├── rag.py            # TDocRAG wrapper
│   └── seeder.py         # EntitySeeder
└── operations/
    ├── classify.py       # Document classification
    ├── convert.py        # Document conversion
    ├── embeddings.py     # EmbeddingsManager (legacy)
    ├── extract.py        # Text extraction
    ├── summarize.py      # LLM summarization (LiteLLM)
    └── workspace_*.py    # Workspace management
```

## Key Design Patterns

### Factory Pattern for EmbeddingsManager

Breaks the circular dependency between config, storage, and embeddings:

```python
from tdoc_ai import create_embeddings_manager
manager = create_embeddings_manager()
storage = manager.storage  # Access via property
```

### Pipeline Stages

Order: **CLASSIFY** → **EXTRACT** → **EMBED** → **GRAPH**

**Note:** Summarization is NOT in the pipeline. Use `ai summarize <doc_id>` for on-demand LLM summaries.

### Separation: Pipeline vs CLI Summarize

| Command | Purpose | LLM Required |
|---------|---------|--------------|
| `ai workspace process` | Embed documents | No |
| `ai summarize <doc>` | Generate LLM summary | Yes |

### LightRAG Integration (New)

The new pipeline uses LightRAG for knowledge graph construction:

```python
from tdoc_ai import LightRAGConfig, TDocRAG, TDocProcessor

config = LightRAGConfig()
rag = TDocRAG(config)
await rag.start("my-workspace")
```

### Summarization (Legacy, Still Active)

Uses LiteLLM directly for on-demand summaries:

```python
from tdoc_ai import summarize_document

summary = summarize_document("S4-250001", markdown_content)
```
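
As a minimal sketch of the factory idea described above, with stand-in classes (`EmbeddingsManagerSketch` and `StorageSketch` are hypothetical; the real manager loads a sentence-transformers model and derives the dimension from it):

```python
from dataclasses import dataclass


@dataclass
class StorageSketch:
    dimension: int  # vector dimension derived from the loaded model


class EmbeddingsManagerSketch:
    """Loads the embedding model once, then builds storage from its dimension."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2") -> None:
        # Stand-in for the actual model load; only the dimension matters here.
        self._dimension = {"all-MiniLM-L6-v2": 384}[model_name]
        self._storage = StorageSketch(dimension=self._dimension)

    @property
    def storage(self) -> StorageSketch:
        return self._storage


def create_embeddings_manager_sketch() -> EmbeddingsManagerSketch:
    """Factory: config -> manager -> storage, with no circular imports."""
    return EmbeddingsManagerSketch()


manager = create_embeddings_manager_sketch()
print(manager.storage.dimension)  # 384
```

Because storage is created *inside* the manager from the loaded model's dimension, neither the config nor the storage module ever needs to import the embeddings module at top level.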

## Configuration

Environment-based via `AiConfig.from_env()`:
- `EMBEDDING_MODEL` - Sentence transformer (default: `all-MiniLM-L6-v2`)
- `EMBEDDING_DIMENSION` - Vector dimension (default: 384)
- `LLM_MODEL` - LLM for summarization (default: `openai/gpt-4o-mini`)
### LightRAG (New)

Environment-based via `LightRAGConfig`:
- `LIGHTRAG_LLM_MODEL` - LLM model (default: `qwen3:8b`)
- `LIGHTRAG_EMBEDDING_MODEL` - Embedding model (default: `qwen3-embedding:0.6b`)
- `LIGHTRAG_WORKING_DIR` - Working directory (default: `~/.3gpp-crawler/lightrag`)
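
For illustration, the env-override-with-default behaviour can be sketched with a plain dataclass (the real `LightRAGConfig` is presumably built on pydantic-settings, which this commit adds as a dependency; `LightRAGConfigSketch` is a hypothetical stand-in):

```python
import os
from dataclasses import dataclass, field


@dataclass
class LightRAGConfigSketch:
    """Hypothetical stand-in for LightRAGConfig: env vars override defaults."""

    llm_model: str = field(
        default_factory=lambda: os.environ.get("LIGHTRAG_LLM_MODEL", "qwen3:8b")
    )
    embedding_model: str = field(
        default_factory=lambda: os.environ.get(
            "LIGHTRAG_EMBEDDING_MODEL", "qwen3-embedding:0.6b"
        )
    )
    working_dir: str = field(
        default_factory=lambda: os.environ.get(
            "LIGHTRAG_WORKING_DIR",
            os.path.expanduser("~/.3gpp-crawler/lightrag"),
        )
    )


config = LightRAGConfigSketch()
print(config.llm_model)  # "qwen3:8b" unless LIGHTRAG_LLM_MODEL is set
```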

### Summarization (Legacy)

Uses `AiConfig.from_env()`:
- `TDC_AI_LLM_MODEL` - LLM model (default: `openrouter/openrouter/free`)
- `TDC_AI_LLM_API_KEY` - API key for LLM

## Storage Layer

AiStorage uses LanceDB:
- Embeddings with document metadata
- Workspace-scoped storage
- Status tracking (classified, extracted, embedded, graphed)
### LightRAG (New)

File-based storage by default:
- NanoVectorDB for embeddings
- JsonKVStorage for cache
- NetworkX for knowledge graph

Optionally use pg0 for PostgreSQL-backed storage.
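
A rough sketch of how such a backend switch can be modeled, using hypothetical names (`StorageBackendSketch`, `storage_kwargs`); the PostgreSQL storage identifiers below are assumptions, not the package's actual wiring:

```python
from enum import Enum


class StorageBackendSketch(Enum):
    FILE = "file"  # NanoVectorDB + JsonKVStorage + NetworkX on disk
    PG0 = "pg0"    # embedded PostgreSQL via pg0


def storage_kwargs(backend: StorageBackendSketch, working_dir: str) -> dict:
    """Map a backend choice to hypothetical LightRAG storage settings."""
    if backend is StorageBackendSketch.FILE:
        # File-based default: everything lives under the working directory.
        return {"working_dir": working_dir}
    # Assumed PostgreSQL-backed storage names for illustration only.
    return {
        "kv_storage": "PGKVStorage",
        "vector_storage": "PGVectorStorage",
        "graph_storage": "PGGraphStorage",
    }


print(storage_kwargs(StorageBackendSketch.FILE, "/tmp/rag"))
```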

### Legacy (Still Active)

AiStorage uses LanceDB for status tracking in extract/summarize pipelines.

## CLI Integration

Exposed via `tdoc-crawler ai` commands. See `src/tdoc_crawler/cli/ai.py`.
Exposed via `3gpp-ai` commands:

```bash
3gpp-ai rag query "your query"
3gpp-ai rag status
3gpp-ai summarize S4-250001
```
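
The subcommand shape above can be mirrored with stdlib `argparse` for illustration (the actual CLI lives in `lightrag/cli.py` and may use a different framework; `build_parser` is a hypothetical sketch):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical mirror of the `3gpp-ai rag ...` subcommand layout."""
    parser = argparse.ArgumentParser(prog="3gpp-ai")
    sub = parser.add_subparsers(dest="command", required=True)

    rag = sub.add_parser("rag", help="LightRAG knowledge-graph commands")
    rag_sub = rag.add_subparsers(dest="rag_command", required=True)
    query = rag_sub.add_parser("query", help="query the knowledge graph")
    query.add_argument("text")
    rag_sub.add_parser("status", help="show pipeline status")

    summ = sub.add_parser("summarize", help="on-demand LLM summary")
    summ.add_argument("doc_id")
    return parser


args = build_parser().parse_args(["rag", "query", "your query"])
print(args.command, args.text)  # rag your query
```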

## Import Guidelines

```python
# Public API (preferred)
from tdoc_ai import create_embeddings_manager, process_document, query_graph

# Internal operations when needed
from tdoc_ai.operations.embeddings import EmbeddingsManager
from tdoc_ai.operations.pipeline import run_pipeline
```

```python
# LightRAG integration (preferred)
from tdoc_ai import (
    LightRAGConfig,
    TDocRAG,
    TDocProcessor,
    RAGMetadata,
    enrich_text,
)

# Document operations (still used)
from tdoc_ai import convert_document, summarize_document

# Workspace management (still used)
from tdoc_ai import create_workspace, list_workspaces
```
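
The commit message also mentions refactoring `summarize.py` to use lazy imports for `EmbeddingsManager`. A generic sketch of that pattern, with a stdlib module standing in for the heavy dependency:

```python
def summarize_with_embeddings(doc_id: str, text: str) -> str:
    """Sketch of a lazy import: the heavy module is only loaded on first call.

    In summarize.py this would defer importing EmbeddingsManager (and its
    sentence-transformers dependency) so that `import tdoc_ai` stays cheap
    and the config/storage/embeddings import cycle is broken.
    """
    # Imported here, not at module top level (json stands in for the
    # hypothetical heavy EmbeddingsManager import).
    from json import dumps

    return dumps({"doc_id": doc_id, "chars": len(text)})


print(summarize_with_embeddings("S4-250001", "some markdown"))
```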

## Lessons Learned

1. **No LLM in Pipeline**: Runs completely locally with sentence transformers
2. **Factory Pattern**: `from_config()` loads model once, extracts dimension, creates storage
3. **Workspace Isolation**: All operations support optional `workspace` parameter
4. **Status Tracking**: `ProcessingStatus` tracks completed stages for resume capability

## Pipeline Stages (Legacy)

Order: **CLASSIFY** → **EXTRACT** → **EMBED** → **GRAPH**

LightRAG handles embedding and graph construction automatically.

## Deprecated/Removed

- `create_embeddings_manager()` - Removed from public API
- `AiStorage` - Legacy, still used internally by extract/summarize
- `EmbeddingsManager` - Legacy, still used internally
- LanceDB-based storage - Replaced by LightRAG native storage
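
The legacy stage order and resume tracking described above can be sketched with an enum and a set of completed stages (names are hypothetical; the real bookkeeping lives in `ProcessingStatus`):

```python
from enum import Enum


class StageSketch(Enum):
    CLASSIFY = 1
    EXTRACT = 2
    EMBED = 3
    GRAPH = 4


def remaining_stages(completed: set) -> list:
    """Resume capability: skip stages already recorded as completed."""
    # Enum iteration preserves definition order, i.e. the pipeline order.
    return [s for s in StageSketch if s not in completed]


done = {StageSketch.CLASSIFY, StageSketch.EXTRACT}
print([s.name for s in remaining_stages(done)])  # ['EMBED', 'GRAPH']
```
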
+4 −4
# tdoc-ai

Optional AI extension package for `tdoc-crawler`.
Optional AI extension package for `3gpp-crawler`.

This package contains AI-focused capabilities including:

- Document extraction and conversion
- Summarization
- Embeddings and semantic search
- GraphRAG querying
- GraphRAG querying (via LightRAG)
- AI workspace management

Install via `tdoc-crawler` extras:
Install via `3gpp-crawler` extras:

```bash
uv add "tdoc-crawler[ai]"
uv add "3gpp-crawler[ai]"
```
+7 −8
[project]
name = "tdoc-ai"
name = "3gpp-ai"
version = "0.1.0"
description = "Optional AI/RAG extension package for tdoc-crawler"
description = "Optional AI/RAG extension package for 3gpp-crawler"
authors = [{ name = "Jan Reimes", email = "jan.reimes@head-acoustics.com" }]
readme = "README.md"
keywords = ["python", "3gpp", "rag", "ai"]
@@ -18,16 +18,15 @@ dependencies = [
    "doc2txt>=1.0.8",
    #"doc2txt>=1.0.8 @ git+https://github.com/Quantatirsk/doc2txt-pypi.git"
    "kreuzberg[all]>=4.0.0",
    "lancedb>=0.29.2",
    "litellm>=1.81.15",
    "sentence-transformers[openvino] @ git+https://github.com/huggingface/sentence-transformers.git",
    "tokenizers>=0.22.2",
    "optimum-intel[openvino]",
    "hf_xet"
    "lightrag-hku[api]>=1.4.9.3",
    "pg0-embedded>=0.12.0",
    "pydantic-settings>=2.13.1",
]

[project.urls]
Repository = "https://forge.3gpp.org/rep/reimes/tdoc-crawler"
Repository = "https://forge.3gpp.org/rep/reimes/3gpp-crawler"

[build-system]
requires = ["hatchling"]
+44 −65
"""AI document processing domain package."""
"""AI document processing domain package.

This package provides AI-powered document processing for 3GPP TDocs.
Supports both legacy LiteLLM summarization and modern LightRAG knowledge graph.
"""

from __future__ import annotations

import litellm

from tdoc_ai.config import AiConfig
from tdoc_ai.context import DocumentContext

# Import LightRAG integration
from tdoc_ai.lightrag import (
    DatabaseConfig,
    EmbeddingConfig,
    LightRAGConfig,
    LLMConfig,
    Pg0Error,
    Pg0Manager,
    ProcessingResult,
    ProcessingResultStatus,
    QueryMode,
    RAGMetadata,
    StorageBackend,
    TDocProcessor,
    TDocRAG,
    create_metadata_from_dict,
    enrich_text,
)
from tdoc_ai.models import (
    DocumentChunk,
    DocumentClassification,
    DocumentSummary,
    GraphEdge,
    GraphNode,
    PipelineStage,
    ProcessingStatus,
    SourceKind,
    SummarizeResult,
)
from tdoc_ai.operations.convert import convert_tdoc as convert_document
from tdoc_ai.operations.embeddings import EmbeddingsManager
from tdoc_ai.operations.graph import query_graph
from tdoc_ai.operations.pipeline import get_status, list_statuses, process_all
from tdoc_ai.operations.pipeline import process_tdoc as process_document
from tdoc_ai.operations.summarize import SummarizeResult
from tdoc_ai.operations.summarize import summarize_tdoc as summarize_document
from tdoc_ai.operations.workspace_registry import (
    DEFAULT_WORKSPACE,
@@ -49,90 +62,56 @@ from tdoc_ai.operations.workspaces import (
    resolve_tdoc_checkout_path,
    resolve_workspace,
)
from tdoc_ai.storage import AiStorage
from tdoc_crawler.config import CacheManager

litellm.suppress_debug_info = True  # Suppress provider/model info logs from litellm

process_tdoc = process_document


def create_embeddings_manager(config: AiConfig | None = None) -> EmbeddingsManager:
    """Create an EmbeddingsManager with proper initialization.

    This is the primary entry point for creating AI services.
    Loads model once, creates storage with correct dimension.

    Args:
        config: Optional config. If None, loads from environment.

    Returns:
        EmbeddingsManager with .storage and .config properties.
    """
    if config is None:
        config = AiConfig.from_env()
    return EmbeddingsManager(config)


# Backward compatibility alias
def get_embeddings_manager() -> EmbeddingsManager:
    """Get embeddings manager singleton (deprecated).

    Use create_embeddings_manager() instead.
    """
    return create_embeddings_manager()


def get_ai_storage(config: AiConfig | None = None) -> AiStorage:
    """Get storage instance (deprecated).

    Use create_embeddings_manager().storage instead.
    """
    return create_embeddings_manager(config).storage


__all__ = [
    # Workspace management
    "DEFAULT_WORKSPACE",
    "AiConfig",
    "AiStorage",
    # Shared types
    "CacheManager",
    "DocumentChunk",
    "DocumentClassification",
    # LightRAG integration
    "DatabaseConfig",
    "DocumentContext",
    "DocumentSummary",
    "GraphEdge",
    "GraphNode",
    "EmbeddingConfig",
    "LLMConfig",
    "LightRAGConfig",
    "Pg0Error",
    "Pg0Manager",
    "PipelineStage",
    "ProcessingResult",
    "ProcessingResultStatus",
    "ProcessingStatus",
    "QueryMode",
    "RAGMetadata",
    "SourceKind",
    "StorageBackend",
    "SummarizeResult",
    "TDocProcessor",
    "TDocRAG",
    "WorkspaceDisplayInfo",
    "WorkspaceRegistry",
    "add_workspace_members",
    "checkout_spec_to_workspace",
    "checkout_tdoc_to_workspace",
    # Document operations
    "convert_document",
    "create_embeddings_manager",
    "create_metadata_from_dict",
    "create_workspace",
    "delete_workspace",
    "enrich_text",
    "ensure_ai_subfolder",
    "ensure_default_workspace",
    "get_active_workspace",
    "get_ai_storage",
    "get_embeddings_manager",
    "get_status",
    "get_workspace",
    "get_workspace_member_counts",
    "is_default_workspace",
    "list_statuses",
    "list_workspace_members",
    "list_workspaces",
    "make_workspace_member",
    "normalize_workspace_name",
    "process_all",
    "process_document",
    "process_tdoc",
    "query_graph",
    "remove_invalid_members",
    "resolve_tdoc_checkout_path",
    "resolve_workspace",
+52 −0
"""LightRAG integration for tdoc-ai.

This package provides a thin wrapper around LightRAG with:
- Ollama LLM and embedding support (qwen3-embedding:0.6b)
- File-based or pg0-backed storage
- Async context manager pattern
- TDoc document processing with kreuzberg extraction

Example:
    >>> import asyncio
    >>> async def main():
    ...     async with TDocRAG() as rag:
    ...         await rag.insert("TDoc S4-250001 about TS 26.444")
    ...         result = await rag.query("What TDocs mention TS 26.444?")
    ...         print(result)
    >>> asyncio.run(main())
"""

from .config import (
    DatabaseConfig,
    EmbeddingConfig,
    LightRAGConfig,
    LLMConfig,
    QueryMode,
    StorageBackend,
)
from .metadata import RAGMetadata, create_metadata_from_dict, enrich_text
from .pg0_manager import Pg0Error, Pg0Manager
from .processor import ProcessingResult, ProcessingResultStatus, TDocProcessor
from .rag import TDocRAG
from .seeder import EntitySeed, EntitySeeder, EntityType

__all__ = [
    "DatabaseConfig",
    "EmbeddingConfig",
    "EntitySeed",
    "EntitySeeder",
    "EntityType",
    "LLMConfig",
    "LightRAGConfig",
    "Pg0Error",
    "Pg0Manager",
    "ProcessingResult",
    "ProcessingResultStatus",
    "QueryMode",
    "RAGMetadata",
    "StorageBackend",
    "TDocProcessor",
    "TDocRAG",
    "create_metadata_from_dict",
    "enrich_text",
]