Commit 1af60ce2 authored by Jan Reimes

docs: update documentation for workspace architecture and environment variables

* Added workspace directory name to .env.example
* Updated README.md to reflect workspace architecture
* Revised config.md to clarify workspace directory settings
* Removed deprecated AI RAG query section from query.md
parent ff0a9919
.env.example  +9 −0
@@ -19,6 +19,9 @@
# Checkout directory name (default: checkout)
# TDC_CHECKOUT_DIRNAME=checkout

# Workspaces directory name (default: workspaces)
# TDC_WORKSPACES_DIRNAME=workspaces

# ============================================================================
# ETSI ONLINE (EOL) CREDENTIALS
# ============================================================================
@@ -84,6 +87,12 @@ TDC_MAX_RETRIES=3
# Maximum number of documents to crawl (default: 1000)
# TDC_LIMIT_TDOCS=1000

# Maximum meetings to crawl overall (negative = newest N)
# TDC_LIMIT_MEETINGS=

# Per sub-WG meeting limit (negative = newest N)
# TDC_LIMIT_MEETINGS_PER_SUBWG=

# Number of parallel subinterpreter workers (default: 4)
TDC_WORKERS=4
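
The two meeting limits added above accept a negative value meaning "only the newest N". A quick shell example using them (variable names from the hunk above; `crawl-meetings` and `crawl` are the commands documented in the README):

```bash
# Crawl only the 3 newest meetings per sub-working-group,
# and cap the TDoc crawl at 500 documents
export TDC_LIMIT_MEETINGS_PER_SUBWG=-3
export TDC_LIMIT_TDOCS=500
tdoc-crawler crawl-meetings
tdoc-crawler crawl
```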

README.md  +23 −20
@@ -15,7 +15,7 @@ A command-line tool for crawling the 3GPP FTP server, caching 3GPP document meta
- **Case-Insensitive Queries**: Search for TDocs regardless of case
- **Multiple Output Formats**: Export results as table, JSON, or YAML
- **Incremental Updates**: Only fetch new data on subsequent crawls
- **Wiki-First Architecture**: Extraction artifacts organized in ~/.3gpp-crawler/wiki/ for external tool consumption
- **Workspace Architecture**: Extraction artifacts organized in `~/.3gpp-crawler/workspaces/<ws>/sources/<doc>/`
- **Rich CLI**: Beautiful terminal output with progress indicators

## Installation
@@ -38,6 +38,7 @@ uv sync
### Using pip (not recommended)

```bash
# Note: package name may differ from repository name
pip install 3gpp-crawler
```

@@ -57,8 +58,8 @@ cp .env.example .env
# Acts as a "3GPP-compatible fallback". While whatthespec.net is the primary
# data source, it is community-maintained. Credentials allow falling back to
# official 3GPP portal endpoints if the primary source is unavailable.
EOL_USERNAME=your_username
EOL_PASSWORD=your_password
TDC_EOL_USERNAME=your_username
TDC_EOL_PASSWORD=your_password

# HTTP Cache Configuration (optional - uses defaults if not set)
HTTP_CACHE_TTL=7200                      # Cache TTL in seconds (default: 7200 = 2 hours)
@@ -71,11 +72,11 @@ Alternatively, you can:
uvx tdoc-crawler crawl-meetings --eol-username your_username --eol-password your_password

# Configure HTTP caching via CLI:
uvx tdoc-crawler crawl-tdocs --cache-ttl 3600 --cache-refresh
uvx tdoc-crawler crawl --cache-ttl 3600 --cache-refresh

# Or set environment variables directly:
export EOL_USERNAME=your_username
export EOL_PASSWORD=your_password
export TDC_EOL_USERNAME=your_username
export TDC_EOL_PASSWORD=your_password
export HTTP_CACHE_TTL=3600
```

@@ -87,20 +88,22 @@ NOTE: If no credentials are provided, the tool will prompt you interactively

| Command | Alias | Purpose |
|---------|-------|---------|
| **Crawling** | | |
| **tdoc-crawler** | | |
| `crawl` | `ct` | Crawl TDoc metadata from FTP |
| `crawl-meetings` | `cm` | Populate meeting database (Run this first!) |
| `crawl-tdocs` | `ct` | Crawl TDoc metadata from FTP |
| `crawl-specs` | `cs` | Crawl technical specification metadata |
| **Querying** | | |
| `query` | `qt` | Search for TDocs (auto-fetches if missing) |
| `query-meetings` | `qm` | Search and display meeting metadata |
| `query-tdocs` | `qt` | Search for TDocs (auto-fetches if missing) |
| `query-specs` | `qs` | Search technical specifications |
| **Utilities** | | |
| `open` | | Download and open a TDoc |
| `checkout` | | Batch download TDocs to local folder |
| `open-spec` | `os` | Download and open latest spec document |
| `checkout-spec` | `cos` | Batch download technical specifications |
| `stats` | | View database statistics |
| **spec-crawler** | | |
| `crawl` | | Crawl technical specification metadata |
| `query` | | Search technical specifications |
| `open` | | Download and open latest spec document |
| `checkout` | | Batch download technical specifications |
| **3gpp-crawler** | | |
| `config {init,show,validate,docs}` | | Manage configuration |
| `workspace {create,list,...}` | | Manage workspaces and processing |

### 1. Crawl Metadata

@@ -111,10 +114,10 @@ Gather metadata from 3GPP and WhatTheSpec:
tdoc-crawler crawl-meetings

# Crawl TDoc metadata (RAN, SA, CT)
tdoc-crawler crawl-tdocs
tdoc-crawler crawl

# Populate spec catalog
spec-crawler crawl-specs
spec-crawler crawl
```

### 2. Query Metadata
@@ -126,7 +129,7 @@ Search and filter stored information:
tdoc-crawler query R1-2400001

# Query specifications
spec-crawler query-specs 23.501
spec-crawler query 23.501

# List recent meetings
tdoc-crawler query-meetings --limit 10
@@ -141,13 +144,13 @@ Open documents, batch download (checkout), and check database status:
tdoc-crawler open R1-2400001

# Download and open latest version of a spec
spec-crawler open-spec 23.501
spec-crawler open 23.501

# Batch download (checkout) TDocs to local folder
tdoc-crawler checkout R1-2400001 S2-2400567

# Batch checkout specifications
spec-crawler checkout-spec 26130-26140
spec-crawler checkout 26130-26140

# View database statistics
tdoc-crawler stats
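
The `3gpp-crawler` entry point from the command table deserves a quick example as well. A sketch assuming only the subcommands listed there (`config {init,show,validate,docs}` and `workspace {create,list,...}`); the workspace name argument is illustrative:

```bash
# Initialize and inspect configuration
3gpp-crawler config init
3gpp-crawler config show

# Create a workspace and list existing ones
3gpp-crawler workspace create my-analysis
3gpp-crawler workspace list
```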
config.md  +3 −2
@@ -29,14 +29,14 @@ Config files are discovered in this order (later files override earlier):

### Path Settings

*File system paths for cache, database, checkout, and AI storage*
*File system paths for cache, database, checkout, and workspaces*

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `cache_dir` | Path | ~/.3gpp-crawler | Root cache directory for storing downloaded files and metadata |
| `db_filename` | str | "3gpp_crawler.db" | SQLite database filename for storing crawl metadata |
| `checkout_dirname` | str | "checkout" | Subdirectory name for checked-out documents |
| `ai_cache_dirname` | str | "lightrag" | Subdirectory name for AI-related cache (embeddings, graphs) |
| `workspaces_dirname` | str | "workspaces" | Subdirectory name for workspace data (sources, wiki) |
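
Taken together, the defaults above produce the following paths; a minimal sketch in Python, assuming every field is left at its default:

```python
from pathlib import Path

# Defaults from the Path Settings table
cache_dir = Path.home() / ".3gpp-crawler"    # cache_dir
db_file = cache_dir / "3gpp_crawler.db"      # db_filename
checkout_dir = cache_dir / "checkout"        # checkout_dirname
workspaces_dir = cache_dir / "workspaces"    # workspaces_dirname
```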

### HTTP Settings

@@ -248,6 +248,7 @@ For backward compatibility, environment variables are still supported:
| Variable | Description |
|----------|-------------|
| `TDC_CACHE_DIR` | Cache directory path |
| `TDC_WORKSPACES_DIRNAME` | Workspaces subdirectory name |
| `TDC_EOL_USERNAME` | ETSI Online username |
| `TDC_EOL_PASSWORD` | ETSI Online password |
| `TDC_TIMEOUT` | HTTP timeout in seconds |
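
For example, to relocate the cache root and rename the workspaces subdirectory via the environment (variables from the table above):

```bash
export TDC_CACHE_DIR=/data/3gpp-cache
export TDC_WORKSPACES_DIRNAME=ws
export TDC_TIMEOUT=60
```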
query.md  +0 −4
@@ -139,7 +139,3 @@ converter.convert(
)
"
```

## Relationship to AI Conversion Artifacts

`convert-lo` handles format conversion only. Structured AI extraction artifacts were previously produced by the `3gpp-ai` package (now removed from this repository).
+45 −1
@@ -127,7 +127,8 @@ manager = resolve_cache_manager()
manager.root              # cache root directory
manager.db_file           # SQLite database
manager.http_cache_file   # HTTP cache
manager.checkout_dir      # Spec checkout directory
manager.checkout_dir      # Document checkout directory
manager.workspaces_dir    # Workspace data directory (sources, wiki)
```

The `CacheManager` is instantiated by the CLI wrapper. Library users must register their own instance at program start.
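
A minimal sketch of what that registration might look like. Only `resolve_cache_manager()` appears in this document; the import path, the `CacheManager` constructor arguments, and the `register_cache_manager()` hook are assumptions for illustration, not the package's confirmed API:

```python
from pathlib import Path

# Hypothetical import path and registration hook -- assumptions, see above
from crawler.cache import CacheManager, register_cache_manager

manager = CacheManager(root=Path.home() / ".3gpp-crawler")
register_cache_manager(manager)  # must run at program start

# Library code then resolves the shared instance (name from the docs above)
# manager = resolve_cache_manager()
```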
@@ -182,6 +183,49 @@ manager = resolve_cache_manager()

Avoid `typing.TYPE_CHECKING` as a workaround for circular imports. It indicates structural problems. Refactor instead — extract shared types to a neutral `models/` layer.

## Design Principles

### Directory Separation: Checkout vs. Workspaces

Document checkouts and workspace artifacts are stored in **separate directory trees** under the cache root:

```
~/.3gpp-crawler/
├── checkout/          # Raw downloaded/extracted documents (TDocs, Specs)
├── workspaces/        # Workspace data (sources, wiki) — separate from checkouts
├── 3gpp_crawler.db
└── http-cache.sqlite3
```

**Why separate:** Checkouts are raw crawled data managed by `tdocs/` and `specs/` packages. Workspaces are user-curated processing artifacts (converted PDFs, extracted markdown, wiki output). Mixing them causes naming collisions (a workspace named "Specs" would collide with the spec checkout directory) and makes cleanup error-prone.

**Rule:** `manager.checkout_dir` is for raw documents only. `manager.workspaces_dir` is for workspace artifacts only. Never use `checkout_dir` for workspace paths or vice versa.
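
The rule in code, mirroring the `resolve_cache_manager()` snippet earlier (workspace and document names are illustrative):

```python
manager = resolve_cache_manager()

# Correct: workspace artifacts live only under workspaces_dir
artifact_dir = manager.workspaces_dir / "my-ws" / "sources" / "R1-2400001"

# Wrong: never build workspace paths under checkout_dir; a workspace
# named "Specs" would collide with the raw spec checkout tree
# bad_dir = manager.checkout_dir / "my-ws" / "sources" / "R1-2400001"
```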

### SourceKind-Aware Processing

The extraction pipeline must branch on `SourceKind` (TDoc vs. Spec). TDocs and Specs have fundamentally different fetch and file-resolution logic:

- **TDocs**: Checked out via FTP/meeting folder structure, resolved through `resolve_tdoc_checkout_path()`
- **Specs**: Checked out from `Specs/archive/{series}/{spec_number}/`, resolved via spec-specific path logic

Functions like `convert_for_wiki()` and `_ensure_converted()` accept a `source_kind` parameter and dispatch to the appropriate handler. Never assume all sources are TDocs.
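
A dispatch sketch, assuming `SourceKind` is an enum with `TDOC` and `SPEC` members; `resolve_tdoc_checkout_path()` is named above, while `resolve_spec_checkout_path()` stands in for the spec-specific path logic:

```python
from pathlib import Path

def resolve_checkout_path(source_kind: SourceKind, doc_id: str) -> Path:
    # Branch explicitly on the source kind; never assume a TDoc
    if source_kind is SourceKind.TDOC:
        return resolve_tdoc_checkout_path(doc_id)
    if source_kind is SourceKind.SPEC:
        return resolve_spec_checkout_path(doc_id)  # assumed name
    raise ValueError(f"unhandled SourceKind: {source_kind!r}")
```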

### No Artifact Directories in Checkout Folders

Processing artifacts (converted PDFs, extracted markdown, figures, tables) must be stored in the workspace `sources/` directory, never as subdirectories of checkout folders.

**Pattern:** `workspaces/<workspace>/sources/<doc-id>/` — not `checkout/<doc-path>/.ai/`

Checkout folders contain only the raw downloaded/extracted source files. This keeps checkouts clean and ensures artifacts can be deleted independently without affecting source data.
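
Because the two trees are disjoint, artifact cleanup never touches raw sources. A sketch, assuming the path pattern above; the function name echoes `delete_artifact_folder()` mentioned below, but the real signature is not shown in this document:

```python
import shutil

def delete_artifact_folder(manager, workspace: str, doc_id: str) -> None:
    target = manager.workspaces_dir / workspace / "sources" / doc_id
    shutil.rmtree(target, ignore_errors=True)  # checkout/ stays untouched
```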

### Workspace Directory Creation is Lazy

Workspace directories are created on demand by `crud.create_workspace()` (which effectively calls `(manager.workspaces_dir / name).mkdir()`). The `ensure_paths()` method creates the top-level `workspaces/` directory at startup; individual workspace subdirectories appear only when a workspace is actually created.
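
A sketch of that split, with assumed signatures (the real `create_workspace()` lives in `crud`, and `ensure_paths()` is a method, per the paragraph above):

```python
from pathlib import Path

def ensure_paths(manager) -> None:
    # Startup: create only the top-level workspaces/ directory
    manager.workspaces_dir.mkdir(parents=True, exist_ok=True)

def create_workspace(manager, name: str) -> Path:
    # On demand: the individual workspace dir is made here, not at startup
    ws_dir = manager.workspaces_dir / name
    ws_dir.mkdir(parents=False, exist_ok=False)
    return ws_dir
```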

### Function Renaming Follows Scope

When a function's scope changes, rename it to match. Example: `delete_ai_folder()` became `delete_artifact_folder()` when it stopped being `.ai`-specific and started cleaning generic workspace artifact directories. Names should describe current behavior, not historical behavior.

## Project Structure

Generate on the fly (never hardcode listings):