Commit 08a29848 authored by Jan Reimes

Initial commit: Migrate teddi-mcp from tdoc-crawler

- FastMCP 3.0 server for ETSI TEDDI
- CLI and MCP interfaces
- HTTP caching with hishel
- Protocol-based abstraction
- Full test suite (23 tests)
- Documentation (index, usage, development, api)

.gitignore

0 → 100644
+35 −0
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
build/
dist/
*.egg-info/

# Virtual environments
.venv/
venv/

# IDE
.idea/
.vscode/
*.swp
*.swo

# Testing
.pytest_cache/
.coverage
htmlcov/

# Cache
.ruff_cache/
.mypy_cache/

# OS
.DS_Store
Thumbs.db

# Local cache
.cache/

AGENTS.md

0 → 100644
+92 −0
# teddi-mcp

FastMCP 3.0 server for ETSI TEDDI (TErms and Definitions Database Interactive).

## Quick Start

```bash
uv pip install -e .
teddi-mcp search-term --term "QoS" --search-pattern exactmatch
teddi-mcp serve  # Start MCP server
```

**Key Files:**
- `models.py` — Pydantic models
- `client.py` — `TeddiClient` with Protocol abstraction
- `server.py` — FastMCP 3.0 server (stdio)
- `parser.py` — HTML parsing with TB grouping
- `cli.py` — Typer CLI (non-MCP)

## Key Design Patterns

### Protocol-Based Abstraction

`TeddiSource` protocol enables mocking/swapping TEDDI sources:

```python
class TeddiSource(Protocol):
    async def search_terms(self, request: SearchRequest) -> SearchResponse: ...
    async def get_available_technical_bodies(self) -> list[TechnicalBody]: ...
    async def fetch_document(self, url: str) -> bytes: ...
```
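
A minimal sketch of how the protocol can be satisfied by a test double. The model classes below are simplified stand-ins for those in `models.py`, and `FakeTeddiSource` and `count_hits` are hypothetical names used only for illustration:

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum
from typing import Protocol


# Simplified stand-ins for the real models (assumed shapes, not the package's).
class TechnicalBody(str, Enum):
    ALL = "all"


@dataclass
class SearchRequest:
    term: str


@dataclass
class SearchResponse:
    query: SearchRequest
    results: list[str] = field(default_factory=list)
    total_count: int = 0


class TeddiSource(Protocol):
    async def search_terms(self, request: SearchRequest) -> SearchResponse: ...
    async def get_available_technical_bodies(self) -> list[TechnicalBody]: ...
    async def fetch_document(self, url: str) -> bytes: ...


class FakeTeddiSource:
    """Hypothetical in-memory source; it never touches the network."""

    def __init__(self, canned: dict[str, list[str]]) -> None:
        self._canned = canned

    async def search_terms(self, request: SearchRequest) -> SearchResponse:
        hits = self._canned.get(request.term, [])
        return SearchResponse(query=request, results=hits, total_count=len(hits))

    async def get_available_technical_bodies(self) -> list[TechnicalBody]:
        return [TechnicalBody.ALL]

    async def fetch_document(self, url: str) -> bytes:
        return b"stub document for " + url.encode()


async def count_hits(source: TeddiSource, term: str) -> int:
    # Depends only on the protocol, so any conforming source can be passed in.
    response = await source.search_terms(SearchRequest(term=term))
    return response.total_count


fake = FakeTeddiSource({"QoS": ["Quality of Service"]})
```

Because `Protocol` uses structural typing, `FakeTeddiSource` does not need to inherit from `TeddiSource`; matching method signatures are enough for a type checker to accept it.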

### HTTP Caching

All requests use `create_cached_teddi_session()`:
- SQLite: `.cache/teddi_http.sqlite3`
- TTL: 2 hours (refresh on access)
- Auto-retries: 429, 500, 502, 503, 504
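
The "refresh on access" behaviour can be pictured with a toy in-memory cache. This is only a sketch under the assumption that "refresh" means each successful read extends the entry's lifetime by a full TTL; it is not how hishel or the SQLite storage is implemented, and `RefreshingTTLCache` is a hypothetical name:

```python
import time
from typing import Callable, Optional


class RefreshingTTLCache:
    """Toy cache: an entry expires ttl_seconds after it was last read or written."""

    def __init__(
        self,
        ttl_seconds: float = 7200,  # 2 hours, matching the documented TTL
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[str, tuple[float, bytes]] = {}

    def put(self, key: str, value: bytes) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Stale entry: drop it and report a miss.
            del self._entries[key]
            return None
        # Fresh hit: push the expiry forward by a full TTL.
        self._entries[key] = (self._clock() + self._ttl, value)
        return value
```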

### TB Grouping in Parser

TEDDI results have nested tables. Parser handles TB inheritance:
- Empty TB cell inherits previous TB value
- Results grouped by technical body
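
The inheritance rule can be sketched as follows, assuming the rows have already been extracted from the HTML as `(tb_cell, specification)` pairs. The function name and the input shape are illustrative, not the actual `parser.py` API:

```python
from collections import defaultdict
from typing import Iterable


def group_by_technical_body(rows: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Group specifications by TB; an empty TB cell reuses the previous TB."""
    grouped: defaultdict[str, list[str]] = defaultdict(list)
    current_tb = ""  # rows seen before any TB value land under the "" key
    for tb_cell, specification in rows:
        tb = tb_cell.strip()
        if tb:
            current_tb = tb  # a non-empty cell starts a new group
        grouped[current_tb].append(specification)
    return dict(grouped)


# Made-up sample rows; only 'TS 24.008' appears in the project docs.
rows = [("3GPP", "TS 24.008"), ("", "spec-b"), ("ETSI", "spec-c"), ("  ", "spec-d")]
grouped = group_by_technical_body(rows)
```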

### Dual API: MCP + CLI

Both expose identical tools:
- `search_term()` — query with filters
- `list_technical_bodies()` — list TBs
- `fetch_document()` — retrieve content (MCP-only)

## Data Models

See `models.py` for complete definitions:

- **Enums:** `SearchIn`, `SearchPattern`, `TechnicalBody`
- **Data Classes:** `DocumentRef`, `TermResult`

## TEDDI Endpoint

**POST** `https://webapp.etsi.org/Teddi/search`

**Parameters:** `term`, `searchin`, `searchpattern`, `technicalbody`

**Response:** HTML table (parsed by `parser.py`)

## Testing

```bash
uv run pytest tests/teddi_mcp/ -v
uv run pytest tests/teddi_mcp/ --cov=teddi_mcp
```

**Test structure:** `test_models.py`, `test_parser.py`, `test_client.py`, `test_http_client.py`, `test_cli.py`, `test_server.py`

## Implementation Notes

1. **TEDDI HTML-Driven:** Endpoint reverse-engineered. Parser may need updates if TEDDI UI changes.
2. **Async-First:** All core methods async. Use `asyncio.run()` for sync wrappers.
3. **Cache Validation:** Tests auto-cache in `tests/.cache/teddi_http.sqlite3`.
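
For note 2, a sync wrapper can be as small as the sketch below. `search_terms_async` is a hypothetical stand-in for an async core method; the real client performs network I/O:

```python
import asyncio


async def search_terms_async(term: str) -> list[str]:
    """Hypothetical async core function standing in for a real client call."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [term.upper()]


def search_terms_sync(term: str) -> list[str]:
    """Blocking wrapper: starts an event loop, runs the coroutine, closes the loop."""
    return asyncio.run(search_terms_async(term))
```

`asyncio.run()` raises `RuntimeError` when called from a thread that already has a running event loop, so a wrapper like this suits CLI entry points rather than code that is itself async.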

## Adding Features

### New Search Filter

1. Define enum in `models.py`
2. Update `SearchRequest` dataclass
3. Update HTTP call in `client.py`
4. Update CLI in `cli.py`
5. Update MCP server in `server.py`
6. Add tests

README.md

0 → 100644
+98 −0
# TEDDI-MCP: FastMCP Server for ETSI TEDDI

A FastMCP 3.0 server that wraps ETSI's TErms and Definitions Database Interactive (TEDDI) for AI agent integration and command-line usage.

## Features

- **FastMCP 3.0 Server**: Expose TEDDI search as an MCP tool for AI agents (Claude, etc.)
- **CLI Interface**: Search TEDDI from the command line with table, JSON, ISON, and TOON output
- **HTTP Caching**: Automatic hishel-based caching of TEDDI responses (2-hour TTL)
- **TB Grouping**: Smart parsing of sub-table results with technical body grouping logic
- **Type-Safe**: Full Pydantic models and type hints throughout
- **Async-First**: Built on asyncio for performance

## Quick Start

### Installation

```bash
cd src/teddi-mcp
uv pip install -e .
```

### CLI Usage

```bash
# Search for a term
teddi-mcp search term "QoS" --search-pattern exactmatch

# List available technical bodies
teddi-mcp search list-bodies

# JSON output
teddi-mcp search term "QoS" --output json

# ISON output (token-optimized)
teddi-mcp search term "QoS" --output ison

# TOON output (token-optimized)
teddi-mcp search term "QoS" --output toon

# Filter by technical bodies
teddi-mcp search term "test" --technical-bodies "3gpp,etsi"
```

### MCP Server

```bash
# Start the MCP server (stdio)
teddi-mcp server
```

Then configure your AI agent client (e.g., Claude) to use this server:

```json
{
  "mcpServers": {
    "teddi": {
      "command": "teddi-mcp",
      "args": ["server"]
    }
  }
}
```

## Architecture

- **models.py**: Pydantic data models (SearchIn, SearchPattern, TechnicalBody enums)
- **client.py**: Core TeddiClient with Protocol-based abstraction
- **parser.py**: HTML parsing with TB grouping logic
- **http_client.py**: HTTP session manager with hishel caching
- **cli.py**: Typer CLI interface
- **server.py**: FastMCP 3.0 server implementation

## Testing

```bash
# Run all tests
uv run pytest tests/teddi_mcp/ -v

# Run with coverage
uv run pytest tests/teddi_mcp/ --cov=teddi_mcp --cov-report=term-missing

# Run specific test
uv run pytest tests/teddi_mcp/test_parser.py -v
```

## Development

See [AGENTS.md](AGENTS.md) for detailed development guidelines including:
- Protocol-based abstraction patterns
- HTTP caching with hishel
- Sub-table parsing with TB grouping
- Dual API design (MCP + CLI)
- Adding new features

## License

MIT

docs/api.md

0 → 100644
+176 −0
# API Reference

## Enums

### `OutputFormat`

Output format for CLI results.

```python
class OutputFormat(StrEnum):
    JSON = "json"
    ISON = "ison"
    TOON = "toon"
    TABLE = "table"
```

### `SearchIn`

Search scope: where to search for the term.

```python
class SearchIn(StrEnum):
    ABBREVIATIONS = "abbreviations"
    DEFINITIONS = "definitions"
    BOTH = "both"
```

### `SearchPattern`

Search pattern: how to match the term.

```python
class SearchPattern(StrEnum):
    ALL_OCCURRENCES = "alloccurrences"
    EXACT_MATCH = "exactmatch"
    STARTING_WITH = "startingwith"
    ENDING_WITH = "endingwith"
```

### `TechnicalBody`

ETSI and standardization body identifiers.

```python
class TechnicalBody(StrEnum):
    ALL = "all"
```

## Data Classes

### `DocumentRef`

Reference to a specification document from TEDDI results.

```python
@dataclass
class DocumentRef:
    technical_body: str      # e.g., '3GPP'
    specification: str       # e.g., 'TS 24.008'
    url: str                # Full HTTP URL to the specification
```

### `TermResult`

A single term found in TEDDI with its definition and document references.

```python
@dataclass
class TermResult:
    term: str                              # The term/abbreviation found
    description: str                        # Definition or description
    documents: list[DocumentRef]           # Documents grouped by technical body
```

### `SearchRequest`

Request parameters for TEDDI search.

```python
@dataclass
class SearchRequest:
    term: str                                           # Term to search for
    search_in: SearchIn = SearchIn.BOTH                 # Scope
    search_pattern: SearchPattern = SearchPattern.ALL_OCCURRENCES  # Pattern
    technical_bodies: list[TechnicalBody] | None = None  # Filter by TBs
```

### `SearchResponse`

Response from TEDDI search containing all matching terms.

```python
@dataclass
class SearchResponse:
    query: SearchRequest        # Original search request
    results: list[TermResult]   # Terms found matching the criteria
    total_count: int           # Total number of matching results
```
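
Since these are plain dataclasses, one way to produce JSON output is `dataclasses.asdict` followed by `json.dumps`. The sketch below redefines two of the classes so that it runs on its own; `to_json` is a hypothetical helper, and the URL and description are placeholders, not real TEDDI data:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DocumentRef:
    technical_body: str
    specification: str
    url: str


@dataclass
class TermResult:
    term: str
    description: str
    documents: list[DocumentRef] = field(default_factory=list)


def to_json(result: TermResult) -> str:
    # asdict() recurses into nested dataclasses and lists of dataclasses.
    return json.dumps(asdict(result), indent=2)


result = TermResult(
    term="QoS",
    description="placeholder description",
    documents=[DocumentRef("3GPP", "TS 24.008", "https://example.invalid/doc")],
)
payload = to_json(result)
```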

## Client

### `TeddiClient`

Main client for TEDDI search operations.

```python
class TeddiClient:
    def __init__(self, client: httpx.AsyncClient | None = None) -> None:
        """Initialize client with optional custom HTTP client."""

    async def __aenter__(self) -> "TeddiClient":
        """Async context manager entry."""

    async def __aexit__(self, *args: object) -> None:
        """Async context manager exit."""

    async def search_terms(self, request: SearchRequest) -> SearchResponse:
        """Search for terms in TEDDI."""

    async def get_available_technical_bodies(self) -> list[str]:
        """Get list of available technical bodies."""

    async def fetch_document(self, url: str) -> bytes:
        """Fetch document content from URL."""
```

## HTTP Client

### `create_cached_teddi_async_client`

Creates an async HTTP client with caching.

```python
def create_cached_teddi_async_client(
    cache_dir: Path | None = None,
    ttl_seconds: int = 7200,
) -> httpx.AsyncClient:
    """
    Create async HTTP client with hishel caching.
    
    - SQLite cache in `.cache/teddi_http.sqlite3`
    - TTL: 2 hours (7200 seconds) by default
    - Auto-retries on 429, 500, 502, 503, 504
    """
```

## Parser

### `parse_teddi_response`

Parse TEDDI HTML response into structured results.

```python
def parse_teddi_response(html: str) -> list[TermResult]:
    """
    Parse HTML table response from TEDDI search.
    
    Handles:
    - Nested tables for document references
    - TB grouping (empty cells inherit from previous)
    - Relative URL resolution
    """
```
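
Relative URL resolution can be done with the standard library's `urljoin`. The sketch below assumes links are resolved against the search endpoint URL; the base the real parser uses may differ, and the example paths are hypothetical:

```python
from urllib.parse import urljoin

# Assumed base: the documented search endpoint.
TEDDI_BASE_URL = "https://webapp.etsi.org/Teddi/search"


def resolve_document_url(href: str, base_url: str = TEDDI_BASE_URL) -> str:
    """Turn an href taken from the results table into an absolute URL."""
    return urljoin(base_url, href.strip())


relative = resolve_document_url("docs/item.pdf")  # relative to /Teddi/
rooted = resolve_document_url("/files/item.pdf")  # relative to the host root
absolute = resolve_document_url("https://example.invalid/a.pdf")  # left unchanged
```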

## TEDDI Endpoint

**POST** `https://webapp.etsi.org/Teddi/search`

**Parameters:**
- `qWhatToSearch`: Term to search
- `qWhereOption`: Search scope (1=abbreviations, 2=both, 3=definitions)
- `qWhatOption`: Match pattern (1=alloccurrences, 2=startingwith, 3=endingwith, 4=exactmatch)
- `qShowTBs`: Technical body filter (default: all)
- `btnSearch.x`, `btnSearch.y`: Search button coordinates

**Response:** HTML table parsed by `parse_teddi_response()`
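
The parameter list above could be turned into form data roughly as follows. This is a sketch: the numeric codes come from the list, while sending them as strings, passing `"all"` for `qShowTBs`, and using `0` for the button coordinates are assumptions. `build_search_form` is a hypothetical helper, not part of the package:

```python
# Code tables taken from the documented parameter list.
WHERE_CODES = {"abbreviations": "1", "both": "2", "definitions": "3"}
WHAT_CODES = {
    "alloccurrences": "1",
    "startingwith": "2",
    "endingwith": "3",
    "exactmatch": "4",
}


def build_search_form(
    term: str,
    search_in: str = "both",
    search_pattern: str = "alloccurrences",
    technical_bodies: str = "all",
) -> dict[str, str]:
    """Build the POST form fields for a TEDDI search request."""
    return {
        "qWhatToSearch": term,
        "qWhereOption": WHERE_CODES[search_in],  # KeyError on an unknown scope
        "qWhatOption": WHAT_CODES[search_pattern],  # KeyError on an unknown pattern
        "qShowTBs": technical_bodies,
        # Image-button click coordinates; 0 is an assumed placeholder value.
        "btnSearch.x": "0",
        "btnSearch.y": "0",
    }


form = build_search_form("QoS", search_pattern="exactmatch")
```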

docs/development.md

0 → 100644
+111 −0
# Development Guide

## Setup

1. Clone the repository:
   ```bash
   git clone https://forge.3gpp.org/rep/reimes/teddi-mcp.git
   cd teddi-mcp
   ```

2. Sync dependencies:
   ```bash
   uv sync --all-extras
   ```

3. Install pre-commit hooks:
   ```bash
   uv run pre-commit install
   ```

## Running Tests

```bash
# Run all tests
uv run pytest tests/ -v

# Run with coverage
uv run pytest tests/ --cov=src.teddi_mcp --cov-report=term-missing

# Run specific test file
uv run pytest tests/test_parser.py -v
```

## Code Quality

```bash
# Format and lint
uv run ruff format
uv run ruff check --fix

# Type checking
uv run pyright
```

## Project Structure

```
teddi-mcp/
├── src/
│   └── teddi_mcp/     # Main package
│       ├── models.py  # Pydantic models
│       ├── client.py  # TeddiClient with Protocol abstraction
│       ├── parser.py   # HTML parsing with TB grouping
│       ├── http_client.py  # HTTP session with hishel caching
│       ├── cli.py     # Typer CLI
│       └── server.py  # FastMCP 3.0 server
├── tests/             # Test suite
├── docs/              # Documentation
├── pyproject.toml     # Project config
└── README.md          # Overview
```

## Key Design Patterns

### Protocol-Based Abstraction

`TeddiSource` protocol enables mocking/swapping TEDDI sources:

```python
class TeddiSource(Protocol):
    async def search_terms(self, request: SearchRequest) -> SearchResponse: ...
    async def get_available_technical_bodies(self) -> list[TechnicalBody]: ...
    async def fetch_document(self, url: str) -> bytes: ...
```

### HTTP Caching

All requests use `create_cached_teddi_session()`:
- SQLite: `.cache/teddi_http.sqlite3`
- TTL: 2 hours (refresh on access)
- Auto-retries: 429, 500, 502, 503, 504
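
The retry rule can be illustrated independently of any HTTP library. In this sketch `send` is a hypothetical callable returning `(status, body)`; the attempt limit and exponential backoff are assumptions, since the docs only list the status codes:

```python
import time
from typing import Callable

RETRY_STATUSES = frozenset({429, 500, 502, 503, 504})


def send_with_retries(
    send: Callable[[], tuple[int, bytes]],
    max_attempts: int = 3,  # assumed limit
    backoff_seconds: float = 0.5,  # assumed base delay
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, bytes]:
    """Call send() again while it returns a retryable status, up to max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = send()
    for attempt in range(1, max_attempts):
        if status not in RETRY_STATUSES:
            break
        sleep(backoff_seconds * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
        status, body = send()
    # After the final attempt the last response is returned, even if retryable.
    return status, body
```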

### TB Grouping in Parser

TEDDI results have nested tables. Parser handles TB inheritance:
- Empty TB cell inherits previous TB value
- Results grouped by technical body

### Dual API: MCP + CLI

Both expose identical tools:
- `search_term()` — query with filters
- `list_technical_bodies()` — list TBs
- `fetch_document()` — retrieve content (MCP-only)

## Adding Features

### New Search Filter

1. Define enum in `models.py`
2. Update `SearchRequest` dataclass
3. Update HTTP call in `client.py`
4. Update CLI in `cli.py`
5. Update MCP server in `server.py`
6. Add tests
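
Steps 1 to 3 might look like the sketch below. `SortOrder`, the `sort_order` field, and the `sortorder` form key are hypothetical names invented for illustration; TEDDI is not known to support such a filter:

```python
from dataclasses import dataclass
from enum import Enum


# Step 1: define the enum (the project uses StrEnum; str + Enum also works on older Pythons).
class SortOrder(str, Enum):
    RELEVANCE = "relevance"
    ALPHABETICAL = "alphabetical"


# Step 2: add the field with a default so existing callers keep working.
@dataclass
class SearchRequest:
    term: str
    sort_order: SortOrder = SortOrder.RELEVANCE


# Step 3: include the new value in the data sent with the HTTP call.
def build_form_data(request: SearchRequest) -> dict[str, str]:
    return {"term": request.term, "sortorder": request.sort_order.value}


default_form = build_form_data(SearchRequest(term="QoS"))
sorted_form = build_form_data(SearchRequest("QoS", SortOrder.ALPHABETICAL))
```

Steps 4 and 5 would then expose the same enum as a CLI option and as an MCP tool parameter, so both interfaces stay in sync.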

## Implementation Notes

1. **TEDDI HTML-Driven**: Endpoint reverse-engineered. Parser may need updates if TEDDI UI changes.
2. **Async-First**: All core methods async. Use `asyncio.run()` for sync wrappers.
3. **Cache Validation**: Tests auto-cache in `tests/.cache/teddi_http.sqlite3`.