Commit 1af60ce2 authored by Jan Reimes

docs: update documentation for workspace architecture and environment variables

* Added workspace directory name to .env.example
* Updated README.md to reflect workspace architecture
* Revised config.md to clarify workspace directory settings
* Removed deprecated AI RAG query section from query.md
parent ff0a9919
.env.example  +9 −0
@@ -19,6 +19,9 @@
# Checkout directory name (default: checkout)
# TDC_CHECKOUT_DIRNAME=checkout

# Workspaces directory name (default: workspaces)
# TDC_WORKSPACES_DIRNAME=workspaces

# ============================================================================
# ETSI ONLINE (EOL) CREDENTIALS
# ============================================================================
@@ -84,6 +87,12 @@ TDC_MAX_RETRIES=3
# Maximum number of documents to crawl (default: 1000)
# TDC_LIMIT_TDOCS=1000

# Maximum meetings to crawl overall (negative = newest N)
# TDC_LIMIT_MEETINGS=

# Per sub-WG meeting limit (negative = newest N)
# TDC_LIMIT_MEETINGS_PER_SUBWG=

# Number of parallel subinterpreter workers (default: 4)
TDC_WORKERS=4
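
The two meeting limits added above accept a negative value meaning "only the newest N". A quick shell example using them (variable names from the hunk above; `crawl-meetings` and `crawl` are the commands documented in the README):

```bash
# Crawl only the 3 newest meetings per sub-working-group,
# and cap the TDoc crawl at 500 documents
export TDC_LIMIT_MEETINGS_PER_SUBWG=-3
export TDC_LIMIT_TDOCS=500
tdoc-crawler crawl-meetings
tdoc-crawler crawl
```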

README.md  +23 −20
@@ -15,7 +15,7 @@ A command-line tool for crawling the 3GPP FTP server, caching 3GPP document meta
- **Case-Insensitive Queries**: Search for TDocs regardless of case
- **Multiple Output Formats**: Export results as table, JSON, or YAML
- **Incremental Updates**: Only fetch new data on subsequent crawls
- **Wiki-First Architecture**: Extraction artifacts organized in ~/.3gpp-crawler/wiki/ for external tool consumption
- **Workspace Architecture**: Extraction artifacts organized in `~/.3gpp-crawler/workspaces/<ws>/sources/<doc>/`
- **Rich CLI**: Beautiful terminal output with progress indicators

## Installation
@@ -38,6 +38,7 @@ uv sync
### Using pip (not recommended)

```bash
# Note: package name may differ from repository name
pip install 3gpp-crawler
```

@@ -57,8 +58,8 @@ cp .env.example .env
# Acts as a "3GPP-compatible fallback". While whatthespec.net is the primary
# data source, it is community-maintained. Credentials allow falling back to
# official 3GPP portal endpoints if the primary source is unavailable.
EOL_USERNAME=your_username
EOL_PASSWORD=your_password
TDC_EOL_USERNAME=your_username
TDC_EOL_PASSWORD=your_password

# HTTP Cache Configuration (optional - uses defaults if not set)
HTTP_CACHE_TTL=7200                      # Cache TTL in seconds (default: 7200 = 2 hours)
@@ -71,11 +72,11 @@ Alternatively, you can:
uvx tdoc-crawler crawl-meetings --eol-username your_username --eol-password your_password

# Configure HTTP caching via CLI:
uvx tdoc-crawler crawl-tdocs --cache-ttl 3600 --cache-refresh
uvx tdoc-crawler crawl --cache-ttl 3600 --cache-refresh

# Or set environment variables directly:
export EOL_USERNAME=your_username
export EOL_PASSWORD=your_password
export TDC_EOL_USERNAME=your_username
export TDC_EOL_PASSWORD=your_password
export HTTP_CACHE_TTL=3600
```

@@ -87,20 +88,22 @@ NOTE: If no credentials are provided, the tool will prompt you interactively

| Command | Alias | Purpose |
|---------|-------|---------|
| **Crawling** | | |
| **tdoc-crawler** | | |
| `crawl` | `ct` | Crawl TDoc metadata from FTP |
| `crawl-meetings` | `cm` | Populate meeting database (Run this first!) |
| `crawl-tdocs` | `ct` | Crawl TDoc metadata from FTP |
| `crawl-specs` | `cs` | Crawl technical specification metadata |
| **Querying** | | |
| `query` | `qt` | Search for TDocs (auto-fetches if missing) |
| `query-meetings` | `qm` | Search and display meeting metadata |
| `query-tdocs` | `qt` | Search for TDocs (auto-fetches if missing) |
| `query-specs` | `qs` | Search technical specifications |
| **Utilities** | | |
| `open` | | Download and open a TDoc |
| `checkout` | | Batch download TDocs to local folder |
| `open-spec` | `os` | Download and open latest spec document |
| `checkout-spec` | `cos` | Batch download technical specifications |
| `stats` | | View database statistics |
| **spec-crawler** | | |
| `crawl` | | Crawl technical specification metadata |
| `query` | | Search technical specifications |
| `open` | | Download and open latest spec document |
| `checkout` | | Batch download technical specifications |
| **3gpp-crawler** | | |
| `config {init,show,validate,docs}` | | Manage configuration |
| `workspace {create,list,...}` | | Manage workspaces and processing |

### 1. Crawl Metadata

@@ -111,10 +114,10 @@ Gather metadata from 3GPP and WhatTheSpec:
tdoc-crawler crawl-meetings

# Crawl TDoc metadata (RAN, SA, CT)
tdoc-crawler crawl-tdocs
tdoc-crawler crawl

# Populate spec catalog
spec-crawler crawl-specs
spec-crawler crawl
```

### 2. Query Metadata
@@ -126,7 +129,7 @@ Search and filter stored information:
tdoc-crawler query R1-2400001

# Query specifications
spec-crawler query-specs 23.501
spec-crawler query 23.501

# List recent meetings
tdoc-crawler query-meetings --limit 10
@@ -141,13 +144,13 @@ Open documents, batch download (checkout), and check database status:
tdoc-crawler open R1-2400001

# Download and open latest version of a spec
spec-crawler open-spec 23.501
spec-crawler open 23.501

# Batch download (checkout) TDocs to local folder
tdoc-crawler checkout R1-2400001 S2-2400567

# Batch checkout specifications
spec-crawler checkout-spec 26130-26140
spec-crawler checkout 26130-26140

# View database statistics
tdoc-crawler stats
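
The `3gpp-crawler` entry point from the command table deserves a quick example as well. A sketch assuming only the subcommands listed there (`config {init,show,validate,docs}` and `workspace {create,list,...}`); the workspace name argument is illustrative:

```bash
# Initialize and inspect configuration
3gpp-crawler config init
3gpp-crawler config show

# Create a workspace and list existing ones
3gpp-crawler workspace create my-analysis
3gpp-crawler workspace list
```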
config.md  +3 −2
@@ -29,14 +29,14 @@ Config files are discovered in this order (later files override earlier):

### Path Settings

*File system paths for cache, database, checkout, and AI storage*
*File system paths for cache, database, checkout, and workspaces*

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `cache_dir` | Path | ~/.3gpp-crawler | Root cache directory for storing downloaded files and metadata |
| `db_filename` | str | "3gpp_crawler.db" | SQLite database filename for storing crawl metadata |
| `checkout_dirname` | str | "checkout" | Subdirectory name for checked-out documents |
| `ai_cache_dirname` | str | "lightrag" | Subdirectory name for AI-related cache (embeddings, graphs) |
| `workspaces_dirname` | str | "workspaces" | Subdirectory name for workspace data (sources, wiki) |
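
Taken together, the defaults above produce the following paths; a minimal sketch in Python, assuming every field is left at its default:

```python
from pathlib import Path

# Defaults from the Path Settings table
cache_dir = Path.home() / ".3gpp-crawler"    # cache_dir
db_file = cache_dir / "3gpp_crawler.db"      # db_filename
checkout_dir = cache_dir / "checkout"        # checkout_dirname
workspaces_dir = cache_dir / "workspaces"    # workspaces_dirname
```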

### HTTP Settings

@@ -248,6 +248,7 @@ For backward compatibility, environment variables are still supported:
| Variable | Description |
|----------|-------------|
| `TDC_CACHE_DIR` | Cache directory path |
| `TDC_WORKSPACES_DIRNAME` | Workspaces subdirectory name |
| `TDC_EOL_USERNAME` | ETSI Online username |
| `TDC_EOL_PASSWORD` | ETSI Online password |
| `TDC_TIMEOUT` | HTTP timeout in seconds |
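
For example, to relocate the cache root and rename the workspaces subdirectory via the environment (variables from the table above):

```bash
export TDC_CACHE_DIR=/data/3gpp-cache
export TDC_WORKSPACES_DIRNAME=ws
export TDC_TIMEOUT=60
```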
query.md  +0 −4
@@ -139,7 +139,3 @@ converter.convert(
)
"
```

## Relationship to AI Conversion Artifacts

`convert-lo` handles format conversion only. Structured AI extraction artifacts were previously produced by the `3gpp-ai` package (now removed from this repository).
+45 −1
@@ -127,7 +127,8 @@ manager = resolve_cache_manager()
manager.root              # cache root directory
manager.db_file           # SQLite database
manager.http_cache_file   # HTTP cache
manager.checkout_dir      # Spec checkout directory
manager.checkout_dir      # Document checkout directory
manager.workspaces_dir    # Workspace data directory (sources, wiki)
```

The `CacheManager` is instantiated by the CLI wrapper. Library users must register their own instance at program start.
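
A minimal sketch of what that registration might look like. Only `resolve_cache_manager()` appears in this document; the import path, the `CacheManager` constructor arguments, and the `register_cache_manager()` hook are assumptions for illustration, not the package's confirmed API:

```python
from pathlib import Path

# Hypothetical import path and registration hook -- assumptions, see above
from crawler.cache import CacheManager, register_cache_manager

manager = CacheManager(root=Path.home() / ".3gpp-crawler")
register_cache_manager(manager)  # must run at program start

# Library code then resolves the shared instance (name from the docs above)
# manager = resolve_cache_manager()
```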
@@ -182,6 +183,49 @@ manager = resolve_cache_manager()

Avoid `typing.TYPE_CHECKING` as a workaround for circular imports. It indicates structural problems. Refactor instead — extract shared types to a neutral `models/` layer.

## Design Principles

### Directory Separation: Checkout vs. Workspaces

Document checkouts and workspace artifacts are stored in **separate directory trees** under the cache root:

```
~/.3gpp-crawler/
├── checkout/          # Raw downloaded/extracted documents (TDocs, Specs)
├── workspaces/        # Workspace data (sources, wiki) — separate from checkouts
├── 3gpp_crawler.db
└── http-cache.sqlite3
```

**Why separate:** Checkouts are raw crawled data managed by `tdocs/` and `specs/` packages. Workspaces are user-curated processing artifacts (converted PDFs, extracted markdown, wiki output). Mixing them causes naming collisions (a workspace named "Specs" would collide with the spec checkout directory) and makes cleanup error-prone.

**Rule:** `manager.checkout_dir` is for raw documents only. `manager.workspaces_dir` is for workspace artifacts only. Never use `checkout_dir` for workspace paths or vice versa.
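
The rule in code, mirroring the `resolve_cache_manager()` snippet earlier (workspace and document names are illustrative):

```python
manager = resolve_cache_manager()

# Correct: workspace artifacts live only under workspaces_dir
artifact_dir = manager.workspaces_dir / "my-ws" / "sources" / "R1-2400001"

# Wrong: never build workspace paths under checkout_dir; a workspace
# named "Specs" would collide with the raw spec checkout tree
# bad_dir = manager.checkout_dir / "my-ws" / "sources" / "R1-2400001"
```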

### SourceKind-Aware Processing

The extraction pipeline must branch on `SourceKind` (TDoc vs. Spec). TDocs and Specs have fundamentally different fetch and file-resolution logic:

- **TDocs**: Checked out via FTP/meeting folder structure, resolved through `resolve_tdoc_checkout_path()`
- **Specs**: Checked out from `Specs/archive/{series}/{spec_number}/`, resolved via spec-specific path logic

Functions like `convert_for_wiki()` and `_ensure_converted()` accept a `source_kind` parameter and dispatch to the appropriate handler. Never assume all sources are TDocs.
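
A dispatch sketch, assuming `SourceKind` is an enum with `TDOC` and `SPEC` members; `resolve_tdoc_checkout_path()` is named above, while `resolve_spec_checkout_path()` stands in for the spec-specific path logic:

```python
from pathlib import Path

def resolve_checkout_path(source_kind: SourceKind, doc_id: str) -> Path:
    # Branch explicitly on the source kind; never assume a TDoc
    if source_kind is SourceKind.TDOC:
        return resolve_tdoc_checkout_path(doc_id)
    if source_kind is SourceKind.SPEC:
        return resolve_spec_checkout_path(doc_id)  # assumed name
    raise ValueError(f"unhandled SourceKind: {source_kind!r}")
```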

### No Artifact Directories in Checkout Folders

Processing artifacts (converted PDFs, extracted markdown, figures, tables) must be stored in the workspace `sources/` directory, never as subdirectories of checkout folders.

**Pattern:** `workspaces/<workspace>/sources/<doc-id>/` — not `checkout/<doc-path>/.ai/`

Checkout folders contain only the raw downloaded/extracted source files. This keeps checkouts clean and ensures artifacts can be deleted independently without affecting source data.
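
Because the two trees are disjoint, artifact cleanup never touches raw sources. A sketch, assuming the path pattern above; the function name echoes `delete_artifact_folder()` mentioned below, but the real signature is not shown in this document:

```python
import shutil

def delete_artifact_folder(manager, workspace: str, doc_id: str) -> None:
    target = manager.workspaces_dir / workspace / "sources" / doc_id
    shutil.rmtree(target, ignore_errors=True)  # checkout/ stays untouched
```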

### Workspace Directory Creation is Lazy

Workspace directories are created on demand by `crud.create_workspace()` (which effectively calls `(manager.workspaces_dir / name).mkdir()`). The `ensure_paths()` method creates the top-level `workspaces/` directory at startup; individual workspace subdirectories appear only when a workspace is actually created.
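
A sketch of that split, with assumed signatures (the real `create_workspace()` lives in `crud`, and `ensure_paths()` is a method, per the paragraph above):

```python
from pathlib import Path

def ensure_paths(manager) -> None:
    # Startup: create only the top-level workspaces/ directory
    manager.workspaces_dir.mkdir(parents=True, exist_ok=True)

def create_workspace(manager, name: str) -> Path:
    # On demand: the individual workspace dir is made here, not at startup
    ws_dir = manager.workspaces_dir / name
    ws_dir.mkdir(parents=False, exist_ok=False)
    return ws_dir
```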

### Function Renaming Follows Scope

When a function's scope changes, rename it to match. Example: `delete_ai_folder()` became `delete_artifact_folder()` when it stopped being `.ai`-specific and started cleaning generic workspace artifact directories. Names should describe current behavior, not historical behavior.

## Project Structure

Generate on the fly (never hardcode listings):