Commit 3e71bdb0 authored by Jan Reimes's avatar Jan Reimes

refactor(di): standardize documentation and improve formatting

* Update documentation across multiple files to ensure consistency.
* Change date formatting to use backslashes for line breaks.
* Adjust list numbering for design principles and testing strategies.
* Enhance clarity in performance considerations and risk assessments.
* Refactor agent rules and import patterns for better readability.
* Improve testability approach with clearer guidelines and examples.
* Consolidate and clarify database schema and TDoc data sources.
* Remove redundant comments and ensure all sections are concise.
parent bb1df3b7
---
# TDoc-Crawler

CLI tool for querying structured 3GPP TDoc data.

---

## Commands

```bash
uv run pytest -v           # Run tests
ruff check src/ tests/     # Lint
uv add <package>           # Add dependency
uv build                   # Package application
```

For package-specific commands, see the respective `AGENTS.md` in each package.

---

## Critical Constraints

- Use `uv run <command>` for all Python commands; the virtual environment must be active before running pytest, the CLI, or any project scripts
- **NEVER** suppress linter issues with `# noqa` in `src/` or `tests/`
- **MUST NOT introduce:** `PLC0415`, `ANN001`, `E402`, `ANN201`, `ANN202`
- Run `ruff check src/ tests/` after changes
- Use `git` with `main` as the main branch
- Use `git add` sparingly, only for files likely to be committed
- **Never** run `git commit` or `git push` autonomously
- `.env` files **MUST NOT** be committed

---

## Code Style

### Python Standards

Use skill `python-standards` for all Python coding tasks.

**Project-Specific Rules:**

- Type hints mandatory everywhere (use `T | None`, not `Optional[T]`)
- Use `is`/`is not` for `None` comparisons
- Keep modules < 250 lines, functions < 75 lines, classes < 200 lines
- Use `logging` instead of `print()`

For package-specific libraries and patterns, see the respective `AGENTS.md` in each package.

### Comments

- Explain **WHY**, not WHAT (code is self-documenting)
- DO NOT use numbered steps ("Step 3: ...") — hard to maintain
- DO NOT use decorative headings ("===== TOOLS =====")
- DO NOT use emojis/Unicode (①, •, –, —) in comments
- Emojis in user-facing output only when they enhance clarity (✔︎, ✘, ∆, ‼︎)

### Testing

See `tests/AGENTS.md` for detailed test patterns and fixtures.

- Use `pytest` with fixtures and parameterized tests
- Use mocking to isolate external systems
- Aim for 70%+ coverage

For package-specific test locations, see the respective `AGENTS.md` in each package.
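The mocking guideline above can be sketched like this; the downloader and its `client` parameter are hypothetical, used only to show the isolation pattern:

```python
from unittest.mock import Mock


def fetch_tdoc_title(client, tdoc_id: str) -> str:
    # External FTP/HTTP access is isolated behind `client`,
    # so tests can replace it with a Mock.
    raw = client.get(f"/tdocs/{tdoc_id}")
    return raw.strip().title()


def test_fetch_tdoc_title() -> None:
    client = Mock()
    client.get.return_value = "  draft agenda  "
    assert fetch_tdoc_title(client, "SP-123456") == "Draft Agenda"
    client.get.assert_called_once_with("/tdocs/SP-123456")
```

The same function body works unchanged as a pytest test; fixtures and `pytest.mark.parametrize` layer on top of this pattern.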

---

## Documentation

### Code Documentation

- Clear, concise docstrings using Google style
- Include type hints in docstrings
- Use examples for non-obvious usage
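A docstring sketch in the Google style described above, using a hypothetical helper:

```python
def chunk_text(text: str, size: int) -> list[str]:
    """Split text into fixed-size chunks.

    Args:
        text: Input string to split.
        size: Maximum chunk length in characters.

    Returns:
        List of substrings, each at most `size` characters long.

    Example:
        >>> chunk_text("abcdef", 4)
        ['abcd', 'ef']
    """
    return [text[i : i + size] for i in range(0, len(text), size)]
```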

### User Documentation

**Files:**

1. `README.md` — Project overview, installation, Quick Start
2. `docs/index.md` — Main documentation entry (Jekyll-ready)
3. `docs/*.md` — Modular task-oriented guides (crawl, query, utils)
4. `docs/history/` — Chronological changelog

---

## References

- **Skills:** `docs/skills-reference.md`
- **Testing:** `tests/AGENTS.md`
- **Packages:** See respective `AGENTS.md` files

---

## Packages

This workspace contains multiple packages with their own AGENTS.md:

| Location | Purpose |
|----------|---------|
| `src/tdoc_crawler/AGENTS.md` | Core crawler library (TDocs, meetings, specs) |
| `src/tdoc_crawler/cli/AGENTS.md` | CLI patterns and constraints |
| `src/tdoc-ai/AGENTS.md` | AI document processing (embeddings, graphs) |
| `src/teddi-mcp/AGENTS.md` | TEDDI MCP server patterns |
| `tests/AGENTS.md` | Test organization and fixtures |

---

## AGENTS.md Maintenance

This file serves as long-term memory for coding assistants. Principles:

**What to Include:**

- Project structure and module responsibilities
- Coding conventions and style guidelines
- Import patterns and dependency rules
- Tool preferences and usage patterns
- Lessons learned from refactoring

**What NOT to Include:**

- Checklists of completed items (belongs in git history)
- Active TODO lists (use issue tracker)
- Step-by-step implementation plans
- Temporary debugging notes
- File directory trees (changes too often)

**Updates:**

- Update after refactoring sessions with architectural insights
- Document patterns and anti-patterns
---
# Skills Reference

Skills are located in `.agents/skills/`. Load based on context.

## 3GPP Domain Skills

| Skill | When to Use |
|-------|-------------|
| `3gpp-basics` | 3GPP organization, hierarchy, releases, TDocs overview |
| `3gpp-working-groups` | WG codes, tbid/SubTB identifiers, subgroup hierarchy |
| `3gpp-meetings` | Meeting structure, naming conventions, quarterly plenaries |
| `3gpp-tdocs` | TDoc patterns, metadata, FTP server access |
| `3gpp-specifications` | TS/TR numbering, spec file formats, FTP directories |
| `3gpp-releases` | Release structure, versioning, TSG rounds |
| `3gpp-change-request` | CR procedure, workflow, status tracking |
| `3gpp-portal-authentication` | EOL authentication, portal data fetching |

## Programming Skills

| Skill | When to Use |
|-------|-------------|
| `python-standards` | Writing/reviewing Python code, type hints, linting |
| `test-driven-development` | TDD with pytest, fixtures, mocking, coverage |
| `code-deduplication` | Preventing semantic duplication, capability index |
| `documentation-workflow` | Updating docs, structure, best practices |
| `visual-explainer` | Creating diagrams, architecture overviews |

## Package-Specific Skills

See respective `AGENTS.md` files in each package for domain-specific skills.
---
# convert-lo

## Scope

LibreOffice document conversion with CLI and server modes.

## Architecture

### Conversion Modes

The `Converter` class supports two conversion backends:

| Mode | Description | Performance |
|------|-------------|-------------|
| **CLI** (default fallback) | `soffice --headless --convert-to`; loads/unloads LibreOffice per conversion; always available | Baseline |
| **Server** (unoserver) | Persistent LibreOffice listener; requires `unoserver` running | 2-4x faster for batches, 50-75% lower CPU load |

### Hybrid Detection

By default (`server_mode="auto"`), the converter:

1. Checks for a running unoserver at the configured host:port
2. Uses the server if available
3. Falls back to CLI mode silently
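The detection flow can be sketched as follows; this is an illustrative simplification with an injected probe callable, not the actual `converter.py` logic:

```python
from typing import Callable


def pick_backend(
    server_mode: str,
    is_server_running: Callable[[str, int], bool],
    host: str = "127.0.0.1",
    port: int = 2003,
) -> str:
    # "cli" and "server" force a backend; "auto" probes first.
    if server_mode == "cli":
        return "cli"
    if is_server_running(host, port):
        return "server"
    if server_mode == "server":
        raise RuntimeError("server required but unavailable")
    return "cli"  # silent fallback in auto mode
```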

### Key Components

| Module | Purpose |
|--------|---------|
| `converter.py` | `Converter` class with hybrid mode support |
| `server.py` | `ServerManager` for lifecycle management |
| `locator.py` | LibreOffice executable discovery |
| `formats.py` | `LibreOfficeFormat` enum and validation |
| `benchmark.py` | Performance comparison script |

## Usage Patterns

### Basic Conversion (Auto-Detect)

```python
from pathlib import Path

from convert_lo import Converter, LibreOfficeFormat

# Auto-detect server, fallback to CLI
converter = Converter()
result = converter.convert(
    input_file=Path("document.docx"),
    output_format=LibreOfficeFormat.PDF,
    output_dir=Path("./output"),
)
```

### Force Server Mode

```python
from convert_lo import Converter

# Require server (raises error if unavailable)
converter = Converter(server_mode="server")

# Or auto-start server if needed
converter = Converter(
    server_mode="auto",
    auto_start_server=True,
)
```

### Force CLI Mode

```python
from convert_lo import Converter

# Always use CLI (skip server detection)
converter = Converter(server_mode="cli")
```

### Context Manager (Auto Start/Stop)

```python
from convert_lo import Converter

with Converter(auto_start_server=True) as converter:
    # Server started on entry, stopped on exit
    results = converter.convert_batch(files, LibreOfficeFormat.PDF, output_dir)
```

### Manual Server Management

```python
from convert_lo import ServerManager, Converter

# Start server explicitly
manager = ServerManager(port=2003)
manager.start()

# Use converter with specific server
converter = Converter(
    server_mode="server",
    server_port=2003,
)

# ... do conversions ...

# Stop server
manager.stop()
```

### Check Server Status

```python
from convert_lo import is_server_running, ServerManager

# Check if server is running
if is_server_running("127.0.0.1", 2003):
    print("Server is available")

# Get detailed status
manager = ServerManager()
status = manager.status()
print(f"Running: {status.is_running}, PID: {status.pid}")
```

## Configuration

### Converter Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `soffice_path` | `Path \| None` | `None` | Auto-detect LibreOffice path |
| `server_host` | `str` | `"127.0.0.1"` | Unoserver host |
| `server_port` | `int` | `2003` | Unoserver XML-RPC port |
| `server_mode` | `"auto" \| "server" \| "cli"` | `"auto"` | Conversion mode |
| `auto_start_server` | `bool` | `False` | Auto-start server if needed |

### ServerManager Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `host` | `str` | `"127.0.0.1"` | Server interface |
| `port` | `int` | `2003` | XML-RPC port |
| `uno_port` | `int` | `2002` | UNO port |
| `soffice_path` | `Path \| None` | `None` | LibreOffice path |
| `timeout` | `int` | `30` | Startup timeout (seconds) |

### Environment Variables

| Variable | Description |
|----------|-------------|
| `LIBREOFFICE_PATH` | Override auto-detection for the soffice executable |

## Supported Formats

See the `LibreOfficeFormat` enum in `formats.py` for the complete list:

- **Text**: ODT, DOC, DOCX, RTF, PDF, HTML, TXT, MD, EPUB
- **Spreadsheet**: ODS, XLS, XLSX, CSV
- **Presentation**: ODP, PPT, PPTX
- **Graphics**: ODG, SVG

## Performance

### Benchmark Script

Run performance comparison:

```bash
uv run python -m convert_lo.benchmark --file-count 20 --size medium
```

Options:
- `-n, --file-count`: Number of test files (default: 10)
- `-s, --size`: Document size: small/medium/large (default: medium)
- `-f, --format`: Output format: pdf/docx/odt/txt/html (default: pdf)

### Expected Performance

| Scenario | CLI Mode | Server Mode | Speedup |
|----------|----------|-------------|---------|
| Single file | ~2-3s | ~2-3s | 1.0x |
| 10 files | ~20-30s | ~8-12s | 2-3x |
| 50 files | ~100-150s | ~30-50s | 3-4x |

Server mode shows greater benefits with:
- Larger file counts
- Complex documents
- Repeated conversions

## Error Handling

### Exceptions

| Exception | When Raised |
|-----------|-------------|
| `SofficeNotFoundError` | LibreOffice executable not found |
| `UnsupportedConversionError` | Invalid format or unsupported conversion pair |
| `ConversionError` | Conversion process fails |

### Fallback Behavior

When `server_mode="auto"`:
- Server unavailable → CLI mode (silent)
- Server conversion fails → CLI retry (logged warning)
- Auto-start fails → CLI mode (logged warning)

When `server_mode="server"`:
- Server unavailable → `ConversionError`
- Server conversion fails → CLI retry (logged warning)
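A sketch of this retry behavior, with a stand-in exception class and conversion callables rather than the real implementation:

```python
import logging

logger = logging.getLogger(__name__)


class ConversionError(Exception):
    """Stand-in for the package's ConversionError."""


def convert_with_fallback(server_convert, cli_convert, path: str) -> str:
    # Server failures are logged as warnings, then retried once via CLI.
    try:
        return server_convert(path)
    except ConversionError:
        logger.warning("server conversion failed for %s, retrying via CLI", path)
        return cli_convert(path)
```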

## Testing

### Run Tests

```bash
uv run pytest tests/convert_lo/ -v
uv run pytest tests/convert_lo/ --cov=convert_lo --cov-report=term-missing
```

### Test Files

| File | Purpose |
|------|---------|
| `test_converter.py` | Core `Converter` tests (CLI mode) |
| `test_server.py` | `ServerManager` tests |
| `test_hybrid_converter.py` | Hybrid mode detection and fallback |
| `test_integration.py` | Real LibreOffice integration tests |
| `test_locator.py` | Executable discovery tests |
| `test_formats.py` | Format enum and validation tests |

### Mocking Server in Tests

```python
from unittest.mock import patch

from convert_lo import Converter

# Mock server detection
with patch("convert_lo.converter.is_server_running", return_value=True):
    with patch("convert_lo.converter.UnoClient") as mock_client:
        converter = Converter(server_mode="auto")
        # Test server-based conversion
```

## Implementation Notes

1. **Thread Safety**: LibreOffice is NOT thread-safe. All conversions run sequentially.

2. **Server Lifecycle**: 
   - Server persists after conversion
   - Use context manager or explicit `stop()` for cleanup
   - Auto-start creates temporary `ServerManager`

3. **Fallback Strategy**: Server conversion failures trigger CLI retry, ensuring reliability.

4. **LSP Warnings**: Import of `unoserver.client` may show LSP errors (optional import pattern). This is expected and handled at runtime.

## Dependencies

- `unoserver>=3.6` (optional, provides server mode)
- LibreOffice (required, any recent version)

## Lessons Learned

1. **Auto-detect is key**: Users shouldn't need to know about server vs CLI. Default to `server_mode="auto"`.
2. **Graceful degradation**: Always provide CLI fallback. Server is an optimization, not a requirement.
3. **Context manager pattern**: Makes server lifecycle management trivial for users.
4. **Benchmark early**: Performance differences are significant. Include benchmarks in CI for regression detection.
---
# tdoc-ai

## Overview

The `tdoc-ai` package provides AI-powered document processing for 3GPP TDocs. It handles embeddings, knowledge graphs, summarization, and semantic search. This package is integrated into the main `tdoc-crawler` CLI under the `ai` command group.

## Package Structure

```
src/tdoc-ai/tdoc_ai/
├── __init__.py           # Public API exports, factory functions
├── config.py             # AiConfig (environment-based configuration)
├── models.py             # Pydantic models (ProcessingStatus, DocumentSummary, etc.)
├── storage.py            # AiStorage (LanceDB-based vector storage)
├── operations/
│   ├── pipeline.py       # Main processing pipeline (CLASSIFY → EXTRACT → EMBED → GRAPH)
│   ├── embeddings.py     # EmbeddingsManager (local embedding generation)
│   ├── classify.py       # Document classification
│   ├── extract.py        # DOCX to Markdown extraction
│   ├── summarize.py      # LLM-based summarization
│   ├── graph.py          # Knowledge graph operations
│   ├── convert.py        # Document conversion
│   ├── workspaces.py     # Workspace member management
│   └── workspace_registry.py  # Workspace CRUD
```

## Key Design Patterns

### Factory Pattern for EmbeddingsManager

The `EmbeddingsManager` uses a factory pattern to break the circular dependency between config, storage, and embeddings:

```python
# CORRECT: Use factory method
from tdoc_ai import create_embeddings_manager
manager = create_embeddings_manager()  # or with explicit config
storage = manager.storage  # Access storage via property

# DEPRECATED: Direct instantiation requires careful ordering
from tdoc_ai.operations.embeddings import EmbeddingsManager
manager = EmbeddingsManager.from_config(config)
```

### Pipeline Stages

The processing pipeline runs in order:

1. **CLASSIFY** - Identify main document among multiple files
2. **EXTRACT** - Convert DOCX to Markdown
3. **EMBED** - Generate vector embeddings (local, no LLM required)
4. **GRAPH** - Build knowledge graph

**Note:** Summarization is NOT part of the pipeline. Use the `ai summarize <doc_id>` command for on-demand LLM-based summarization.
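The staged, resumable flow can be sketched like this; the stage names come from the list above, while the runner function and its parameters are hypothetical (the real entry point lives in `tdoc_ai.operations.pipeline`):

```python
from enum import Enum


class Stage(str, Enum):
    # Enum definition order matches pipeline order.
    CLASSIFY = "classify"
    EXTRACT = "extract"
    EMBED = "embed"
    GRAPH = "graph"


def run_remaining_stages(doc_id: str, completed: set[Stage], handlers: dict) -> list[Stage]:
    """Run outstanding stages in order, skipping already-completed ones."""
    ran: list[Stage] = []
    for stage in Stage:
        if stage in completed:
            continue  # resume capability: skip finished work
        handlers[stage](doc_id)
        completed.add(stage)  # persist status after each stage
        ran.append(stage)
    return ran
```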

### Separation: Pipeline vs CLI Summarize

| Command | Purpose | LLM Required |
|---------|---------|--------------|
| `ai workspace process` | Embed documents for semantic search | No |
| `ai summarize <doc>` | Generate LLM summary | Yes |

## Configuration

All configuration is environment-based via `AiConfig.from_env()`:

- `EMBEDDING_MODEL` - Sentence transformer model (default: `sentence-transformers/all-MiniLM-L6-v2`)
- `EMBEDDING_DIMENSION` - Vector dimension (default: 384)
- `LLM_MODEL` - LLM model for summarization (default: `openai/gpt-4o-mini`)
- LanceDB path - Storage location

## Storage Layer

AiStorage uses LanceDB for vector storage:
- Embeddings are stored with document metadata
- Supports workspace-scoped storage
- Provides status tracking (classified, extracted, embedded, graphed)

## CLI Integration

The `tdoc-ai` package is exposed via `tdoc-crawler ai` commands (see `src/tdoc_crawler/cli/ai.py`):
- `ai summarize <doc>` - LLM summarization
- `ai query <text>` - Semantic search
- `ai workspace process` - Batch embedding
- `ai workspace list-members` - List workspace contents

## Import Guidelines

```python
# Public API (preferred)
from tdoc_ai import (
    create_embeddings_manager,
    process_document,
    process_all,
    get_status,
    query_graph,
    summarize_document,
)

# Internal operations when needed
from tdoc_ai.operations.embeddings import EmbeddingsManager
from tdoc_ai.operations.pipeline import run_pipeline

# Models
from tdoc_ai.models import ProcessingStatus, PipelineStage
```

## Common Tasks

### Processing Documents
```python
from pathlib import Path

from tdoc_ai import process_document

status = process_document("SP-123456", Path("./checkouts/SP-123456"))
```

### Querying
```python
from tdoc_ai import query_graph
results = query_graph("What is the status of 5G NR?", workspace="my_ws")
```

### Creating Embeddings
```python
from tdoc_ai import create_embeddings_manager
manager = create_embeddings_manager()
manager.generate_embeddings(doc_id, artifact_path)
```

## Lessons Learned

1. **No LLM in Pipeline**: The process pipeline runs completely locally using sentence transformers. LLM access is only needed for summarization, which is a separate command.

2. **Factory Pattern**: EmbeddingsManager uses `from_config()` factory to load the embedding model once, extract the dimension, create storage, then return the manager.

3. **Workspace Isolation**: All operations support optional `workspace` parameter for multi-tenant isolation.

4. **Status Tracking**: Each document has a ProcessingStatus tracking completed stages for resume capability.