Commit e6129944 authored by Jan Reimes's avatar Jan Reimes

docs: map existing codebase

parent dc158794
# Architecture

**Analysis Date:** 2026-03-25

## Pattern Overview

**Overall:** Domain-oriented modular architecture with clear separation of concerns

**Key Characteristics:**
- **Domain-oriented packages**: `tdocs/`, `meetings/`, `specs/` contain related business logic
- **Source abstraction**: Each domain uses sources/ subpackage for data fetching strategies
- **Operations layer**: Business operations separated from sources and models
- **Pydantic models**: All data structures use Pydantic for validation
- **Single source of truth**: `CacheManager` for all file paths and configuration

## Layers

### CLI Layer
- Purpose: Command-line interface using Typer framework
- Location: `src/tdoc_crawler/cli/`
- Contains: Command definitions, argument types, output formatting
- Depends on: Business logic layers (database, operations, sources)
- Used by: End users via `tdoc-crawler` and `spec-crawler` commands

### Domain Layer (TDocs)
- Purpose: TDoc crawling and metadata management
- Location: `src/tdoc_crawler/tdocs/`
- Contains: Sources (WhatTheSpec, Portal, DocList), operations (fetch, checkout, crawl)
- Depends on: `models/`, `database/`, `config/`, `http_client/`
- Used by: CLI commands (`crawl-tdocs`, `query-tdocs`, `open`, `checkout`)

### Domain Layer (Meetings)
- Purpose: Meeting metadata and TDoc list retrieval
- Location: `src/tdoc_crawler/meetings/`
- Contains: Sources (FTP, Portal), operations, MeetingCrawler facade
- Depends on: `models/`, `database/`, `config/`
- Used by: CLI commands (`crawl-meetings`, `query-meetings`)

### Domain Layer (Specs)
- Purpose: 3GPP specification management
- Location: `src/tdoc_crawler/specs/`
- Contains: Sources (FTP, Portal), operations (checkout, download)
- Depends on: `models/`, `database/`, `config/`
- Used by: CLI commands via `spec-crawler` entry point

### Data Access Layer
- Purpose: SQLite database abstraction using pydantic-sqlite
- Location: `src/tdoc_crawler/database/`
- Contains: `TDocDatabase`, `MeetingDatabase`, `SpecDatabase`, `WorkingGroupDatabase`
- Depends on: `models/`, `config/` (for path resolution)
- Used by: All domain operations

### Models Layer
- Purpose: Pydantic data models for validation and serialization
- Location: `src/tdoc_crawler/models/`
- Contains: Base models, domain-specific models (TDocMetadata, MeetingMetadata, SpecMetadata)
- Depends on: Pydantic library only
- Used by: All layers

### HTTP Client Layer
- Purpose: HTTP caching and session management using hishel
- Location: `src/tdoc_crawler/http_client/`
- Contains: `create_cached_session()`, `PoolConfig`, `download_to_file()`
- Depends on: `config/` (for cache file path)
- Used by: All sources that make HTTP requests

### Configuration Layer
- Purpose: Centralized path and configuration management
- Location: `src/tdoc_crawler/config/`
- Contains: `CacheManager`, `resolve_cache_manager()`, path constants
- Depends on: Environment variables (`TDC_CACHE_DIR`, `TDC_AI_STORE_PATH`)
- Used by: All layers (MUST be registered at startup)

### Parsers Layer
- Purpose: HTML, Excel, and portal page parsing
- Location: `src/tdoc_crawler/parsers/`
- Contains: `MeetingParser`, `PortalParser`, `parse_tdoc_portal_page()`
- Depends on: BeautifulSoup, lxml, python-calamine
- Used by: Sources layer

### AI/Processing Layer (packages/3gpp-ai)
- Purpose: AI-powered document processing, LightRAG integration
- Location: `packages/3gpp-ai/threegpp_ai/`
- Contains: LightRAG config, TDocProcessor, workspace management, RAG operations
- Depends on: `tdoc_crawler` (for TDoc metadata), LightRAG library, litellm
- Used by: CLI via `tdoc-crawler ai` commands

### Conversion Layer (packages/convert-lo)
- Purpose: LibreOffice-based document conversion
- Location: `packages/convert-lo/convert_lo/`
- Contains: `convert_to_pdf()`, `convert_to_markdown()`, format detection
- Depends on: LibreOffice (external)
- Used by: AI processing layer

### Pool Executors (packages/pool_executors)
- Purpose: Serial and parallel execution utilities
- Location: `packages/pool_executors/pool_executors/`
- Contains: `SerialPoolExecutor`, `ParallelPoolExecutor`, factory functions
- Depends on: concurrent.futures
- Used by: Crawlers and batch operations

## Data Flow

### TDoc Crawl Flow

1. CLI command `crawl-tdocs` invoked with meeting ID
2. `MeetingDatabase` queried for meeting info
3. `DocumentListSource.fetch()` retrieves TDoc list from 3GPP FTP
4. `TDocDatabase.upsert_tdocs()` stores metadata
5. `crawl_log` table updated with status
6. Duplicate detection prevents re-crawling
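
The duplicate-detection step can be sketched as follows; `CrawlLog` and the `fetch` callable are illustrative stand-ins for the real `crawl_log` table and `DocumentListSource.fetch()`, not the project's actual API:

```python
# Minimal sketch of the crawl flow's duplicate detection; names are
# illustrative stand-ins, not the project's real classes.
class CrawlLog:
    """Tracks which meetings have already been crawled."""

    def __init__(self) -> None:
        self._crawled: set[str] = set()

    def is_crawled(self, meeting_id: str) -> bool:
        return meeting_id in self._crawled

    def mark_crawled(self, meeting_id: str) -> None:
        self._crawled.add(meeting_id)


def crawl_tdocs(meeting_id: str, log: CrawlLog, fetch) -> list[str]:
    """Fetch the TDoc list unless the meeting was already crawled."""
    if log.is_crawled(meeting_id):
        return []  # duplicate detection: skip re-crawling
    tdocs = fetch(meeting_id)
    log.mark_crawled(meeting_id)
    return tdocs
```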

**State Management:**
- SQLite database via pydantic-sqlite for persistent storage
- File-based HTTP cache (hishel) for response caching
- Workspace state in `workspaces.json` for AI processing

### TDoc Fetch Flow

1. `TDocDatabase.query_tdocs()` retrieves known TDocs
2. `fetch_missing_tdocs()` identifies missing metadata
3. Source strategy selection:
   - **WhatTheSpec**: Fast, unauthenticated API
   - **Portal**: Authenticated, full metadata
4. `create_source()` returns appropriate source instance
5. Source fetches and stores metadata in database
6. Results returned to caller

### Spec Checkout Flow

1. `SpecDatabase.query()` retrieves spec info
2. `checkout_specs()` initiates download process
3. For each spec:
   - FTP server queried for latest version
   - `SpecDownloads` table checked for existing files
   - Download to `checkout_dir` if needed
4. ZIP extraction and file organization
5. Results returned with checkout paths
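
Step 3 can be sketched as a skip-if-present loop; `existing` stands in for the `SpecDownloads` table and the `download` callable for the FTP fetch:

```python
# Sketch of the per-spec download decision; names and the on-disk
# layout are illustrative, not the real schema.
from pathlib import Path


def checkout_specs(spec_ids, existing: set[str], download, checkout_dir: Path):
    """Download each spec unless a prior download is recorded."""
    paths = {}
    for spec_id in spec_ids:
        target = checkout_dir / f"{spec_id}.zip"
        if spec_id in existing:        # SpecDownloads table hit: reuse
            paths[spec_id] = target
            continue
        download(spec_id, target)      # fetch latest version from FTP
        existing.add(spec_id)
        paths[spec_id] = target
    return paths
```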

### AI Processing Flow

1. `WorkspaceRegistry` loads/creates workspace
2. `TDocProcessor.process()` iterates through documents
3. For each TDoc:
   - Checkout from database (metadata)
   - Download document if missing
   - Extract text via `kreuzberg`
   - Chunk text
   - Generate embeddings
   - Build knowledge graph (LightRAG)
4. Results stored in workspace storage
5. Query via `TDocRAG.query()` performs semantic search
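
The chunking step in the per-TDoc pipeline can be illustrated with a minimal overlapping-window chunker; the size and overlap values are illustrative, not the pipeline's real configuration:

```python
# Minimal overlapping-window chunker; sizes are illustrative defaults,
# not the values the AI pipeline actually uses.
def chunk_text(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into character windows that overlap by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```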

## Key Abstractions

### TDocSource (Abstract Base)
- Purpose: Common interface for TDoc data sources
- Examples: `src/tdoc_crawler/tdocs/sources/base.py`
- Pattern: Strategy pattern with `fetch_by_id()` and `fetch_batch()` methods

```python
from abc import ABC, abstractmethod

class TDocSource(ABC):
    @abstractmethod
    def fetch_by_id(self, tdoc_id: str, ...) -> TDocMetadata | None:
        ...

    @abstractmethod
    def fetch_batch(self, tdoc_ids: list[str], ...) -> list[TDocMetadata | None]:
        ...
```

- Subclasses: `WhatTheSpecSource`, `PortalSource`, `DocumentListSource`

### CacheManager
- Purpose: Single source of truth for all file paths
- Examples: `src/tdoc_crawler/config/__init__.py`
- Pattern: Singleton with global registration
```python
manager = resolve_cache_manager()
manager.db_file           # Database path
manager.http_cache_file   # HTTP cache path
manager.checkout_dir      # Document checkout directory
manager.ai_cache_dir      # AI storage directory
```

### Database (Pydantic-SQLite)
- Purpose: Type-safe database access with Pydantic models
- Examples: `src/tdoc_crawler/database/__init__.py`
- Pattern: Each domain has dedicated database class
```python
# TDoc operations
db = TDocDatabase(manager.db_file)
tdocs = db.query_tdocs(config)
db.upsert_tdocs(metadata)

# Meeting operations
db = MeetingDatabase(manager.db_file)
meetings = db.query_meetings(filters)

# Spec operations  
db = SpecDatabase(manager.db_file)
specs = db.query_specs(filters)
```

## Entry Points

### Main CLI Entry Point
- Location: `src/tdoc_crawler/cli/tdoc_app.py`
- Triggers: `tdoc-crawler` command
- Responsibilities: Register all commands, initialize CacheManager, handle global options

### Spec CLI Entry Point
- Location: `src/tdoc_crawler/cli/spec_app.py`
- Triggers: `spec-crawler` command
- Responsibilities: Spec-specific commands (checkout, download)

### Library Entry Point
- Location: `src/tdoc_crawler/__init__.py`
- Triggers: `import tdoc_crawler` in external code
- Responsibilities: Export public API components

## Error Handling

**Strategy:** Fail fast with clear error messages

**Patterns:**
- Type validation via Pydantic (`ValidationError`)
- Custom exceptions for domain errors (`PortalParsingError`, `PortalAuthenticationError`)
- Database errors wrapped via pydantic-sqlite
- HTTP errors via requests library

```python
# Source pattern - raise clear errors
if metadata is None:
    raise ValueError(f"TDoc {tdoc_id} not found")

# Database pattern - let pydantic handle validation
try:
    db.upsert_tdocs([metadata])
except ValidationError as exc:
    raise ValueError(f"Invalid metadata: {exc}") from exc
```

## Cross-Cutting Concerns

**Logging:** Python `logging` module via `src/tdoc_crawler/logging/`
- `get_logger(__name__)` for module-specific loggers
- `set_verbosity()` for CLI control
- Environment variable `TDC_VERBOSITY` for default level

**Validation:** Pydantic models with strict validation
- All data models inherit from `BaseModel`
- Type hints mandatory (`T | None` not `Optional[T]`)

**Authentication:** Portal credentials via `src/tdoc_crawler/credentials/`
- Environment variables: `EOL_USERNAME`, `EOL_PASSWORD`
- `PromptCredentialsOption` for interactive credential input

**HTTP Caching:** Hishel with file-based SQLite cache
- `create_cached_session()` must be used for ALL 3gpp.org requests
- 50-90% faster incremental crawls
- Prevents rate-limiting

---

*Architecture analysis: 2026-03-25*
# Codebase Concerns

**Analysis Date:** 2026-03-25

## Tech Debt

### SQL Injection Potential via f-string Table Names
- **Issue:** Dynamic SQL queries using f-strings with table names bypass parameterized query protection
- **Files:** `src/tdoc_crawler/database/base.py`
  - Line 153: `cursor = self.connection._db.execute(f"SELECT * FROM {table}")`
  - Line 207: `cursor = self.connection._db.execute(f"SELECT COUNT(*) FROM {table}")`
  - Line 209: `self.connection._db.execute(f"DELETE FROM {table}")`
- **Impact:** While tables are whitelisted via `model_map`, the pattern is fragile and could be copied elsewhere without proper safeguards
- **Current mitigation:** Whitelist check in `model_map` dictionary
- **Fix approach:** Use parameterized queries or a dedicated query builder that enforces whitelist at compile time
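
The suggested fix can be sketched as a single helper that enforces the whitelist before any f-string SQL is built; the table names below are illustrative, not the real `model_map` contents:

```python
# Sketch of centralized whitelist enforcement so f-string SQL can
# never receive an unchecked table name. Table names are illustrative.
import sqlite3

ALLOWED_TABLES = frozenset({"tdocs", "meetings", "specs", "working_groups"})


def select_all(conn: sqlite3.Connection, table: str) -> list[tuple]:
    """Run SELECT * only against whitelisted table names."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"Table not in whitelist: {table!r}")
    # Safe: the name was validated against a closed set above.
    return conn.execute(f"SELECT * FROM {table}").fetchall()
```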

### Broad Exception Handling
- **Issue:** Multiple `except Exception` blocks swallow specific error information
- **Files:** 
  - `src/tdoc_crawler/database/base.py` (lines 55, 187)
  - `src/tdoc_crawler/database/specs.py` (lines 187, 316, 478)
  - `src/tdoc_crawler/clients/portal.py` (lines 219, 244)
  - `src/tdoc_crawler/tdocs/sources/doclist.py` (lines 78, 133, 140, 214, 323, 361, 434)
  - `src/tdoc_crawler/tdocs/operations/fetch.py` (lines 239, 300)
  - `src/tdoc_crawler/tdocs/sources/portal.py` (lines 73, 83)
  - `src/tdoc_crawler/tdocs/sources/whatthespec.py` (line 152)
  - `src/tdoc_crawler/workers/tdoc_worker.py` (line 52)
  - `src/tdoc_crawler/meetings/operations/crawl.py` (line 168)
  - `src/tdoc_crawler/specs/downloads.py` (lines 114, 252, 265)
  - `src/tdoc_crawler/cli/specs.py` (line 87)
  - `src/tdoc_crawler/tdocs/operations/crawl.py` (line 265)
- **Impact:** Debugging difficulty, potential silent failures, loss of error context
- **Fix approach:** Catch specific exception types, log with full context, re-raise with `from exc`
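
The fix approach in code, using a hypothetical `parse_tdoc_row` helper: catch the narrow exception, log with context, and re-raise with the original as the cause.

```python
# Sketch of narrow exception handling; parse_tdoc_row is hypothetical.
import logging

logger = logging.getLogger(__name__)


def parse_tdoc_row(row: dict) -> str:
    try:
        return row["tdoc_id"]
    except KeyError as exc:
        # Log full context, then re-raise with the cause chained.
        logger.error("Row missing tdoc_id: %r", row)
        raise ValueError(f"Malformed TDoc row: {row!r}") from exc
```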

### Unused WorkspaceStatus Enum
- **Issue:** `WorkspaceStatus` enum has only `ACTIVE` value, with TODO comment questioning its necessity
- **Files:** `packages/3gpp-ai/threegpp_ai/models.py` (lines 72-77)
- **Impact:** Dead code, confusion about intended functionality
- **Fix approach:** Either remove if truly unnecessary, or implement full lifecycle (ACTIVE, ARCHIVED, DELETED)

## Known Bugs

### Duplicate Login Check in Portal Client
- **Symptoms:** Redundant authentication failure check
- **Files:** `src/tdoc_crawler/clients/portal.py` (lines 152-158)
- **Trigger:** Authentication flow always
- **Details:** The same check `response_text.lower() == "failed"` is performed twice consecutively
- **Workaround:** None needed - second check is unreachable
- **Fix approach:** Remove duplicate check at lines 157-158

## Security Considerations

### Subprocess Security (noqa: S603, S606)
- **Risk:** Shell command execution without full validation
- **Files:** `src/tdoc_crawler/cli/utils.py` (lines 25, 29, 36)
- **Current mitigation:** Commands use hardcoded paths (`/usr/bin/open`, `/usr/bin/xdg-open`)
- **Recommendations:** Consider using `shutil.which()` to verify command existence before execution
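
The recommendation as a sketch; `resolve_opener` is a hypothetical helper, not the project's `cli/utils.py` API:

```python
# Sketch of verifying the opener command before any subprocess call.
import shutil
import subprocess
import sys


def resolve_opener(name: str) -> str:
    """Return the absolute path of a command, or fail loudly."""
    resolved = shutil.which(name)
    if resolved is None:
        raise FileNotFoundError(f"Opener not found on PATH: {name}")
    return resolved


def open_path(path: str) -> None:
    opener = "open" if sys.platform == "darwin" else "xdg-open"
    # Verified absolute path, argument list, no shell involved.
    subprocess.run([resolve_opener(opener), path], check=True)
```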

### Environment Variable Credential Handling
- **Risk:** Credentials stored in environment variables may leak via process listing
- **Files:** `src/tdoc_crawler/credentials.py`
- **Current mitigation:** Uses `hide_input=True` for password prompts
- **Recommendations:** Document that credentials should be set via secure credential managers in production

### HTTP Session Cookie Handling
- **Risk:** Authentication session uses non-cached session for cookies
- **Files:** `src/tdoc_crawler/clients/portal.py` (lines 121-122, 298-307)
- **Current mitigation:** Separate session for auth vs. cached requests
- **Recommendations:** Good pattern - document the rationale in code comments

## Performance Bottlenecks

### In-Memory Table Scanning for Queries
- **Problem:** Spec queries load all records then filter in Python
- **Files:** `src/tdoc_crawler/database/specs.py` (lines 190-250)
- **Cause:** `_table_rows()` fetches entire table, then filters with list comprehensions
- **Improvement path:** Add parameterized SQL WHERE clauses for filtering before loading data
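
The improvement path as a sketch against an illustrative schema: a parameterized `WHERE` clause lets SQLite do the filtering instead of loading every row into Python.

```python
# Sketch of pushing filtering into SQL; the schema is illustrative,
# not the project's real specs table.
import sqlite3


def query_specs(conn: sqlite3.Connection, series: str) -> list[tuple]:
    """Fetch only matching rows via a parameterized WHERE clause."""
    return conn.execute(
        "SELECT number, title FROM specs WHERE series = ?", (series,)
    ).fetchall()
```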

### Document List Parsing
- **Problem:** Large Excel files parsed entirely in memory
- **Files:** `src/tdoc_crawler/tdocs/sources/doclist.py`
- **Cause:** Pandas loads entire Excel file before processing
- **Improvement path:** Consider streaming parsers for very large meeting document lists

### Console.print() Usage Over Logging
- **Problem:** 71 `console.print()` calls throughout CLI instead of structured logging
- **Files:** 
  - `src/tdoc_crawler/cli/query.py` (26 occurrences)
  - `src/tdoc_crawler/cli/printing.py` (5 occurrences)
  - `src/tdoc_crawler/cli/_shared.py` (6 occurrences)
  - `src/tdoc_crawler/cli/crawl.py` (10 occurrences)
  - `src/tdoc_crawler/cli/tdoc_app.py` (14 occurrences)
  - `src/tdoc_crawler/cli/specs.py` (4 occurrences)
- **Cause:** Direct console output bypasses logging infrastructure
- **Improvement path:** Acceptable for CLI output, but ensure error cases use `logger` for debuggability

## Fragile Areas

### NotImplementedError in Source Classes
- **Files:** 
  - `src/tdoc_crawler/specs/sources/base.py` (lines 16, 20)
  - `src/tdoc_crawler/tdocs/sources/doclist.py` (line 404)
  - `src/tdoc_crawler/tdocs/sources/portal.py` (line 105)
  - `src/tdoc_crawler/tdocs/sources/whatthespec.py` (line 174)
- **Why fragile:** Abstract methods raise `NotImplementedError` - indicates incomplete interface implementation
- **Safe modification:** Check all subclasses implement required interface methods before modifying base class
- **Test coverage:** Limited - these error paths may not be covered by tests

### Multi-Package Workspace Dependencies
- **Files:** 
  - `packages/3gpp-ai/` depends on `src/tdoc_crawler/`
  - `packages/convert-lo/` is a workspace dependency
  - `packages/pool_executors/` is a workspace dependency
- **Why fragile:** Changes to core `tdoc_crawler` may break AI package without clear boundaries
- **Safe modification:** Run full test suite including AI tests when modifying core package
- **Test coverage:** AI tests exist in `tests/ai/`

## Scaling Limits

### SQLite Concurrency
- **Current capacity:** Single-file database with pydantic-sqlite
- **Limit:** Write concurrency limited; concurrent crawls may conflict
- **Scaling path:** Consider PostgreSQL for high-volume crawling scenarios

### HTTP Cache Size
- **Current capacity:** Single SQLite file for HTTP cache (`~/.3gpp-crawler/http-cache.sqlite3`)
- **Limit:** Large crawl sessions may create multi-GB cache files
- **Scaling path:** Implement cache rotation/compaction, or use Redis for distributed caching

### Subinterpreter Worker Pool
- **Current capacity:** Configurable via `TDC_WORKERS` env var (default: 4)
- **Limit:** Each worker has full Python runtime overhead
- **Scaling path:** Consider async/await patterns with httpx for better I/O-bound scaling

## Dependencies at Risk

### doc2txt Git Dependency
- **Risk:** Git dependency for doc2txt may fail in restricted environments
- **Files:** `packages/3gpp-ai/pyproject.toml` (line 49)
- **Impact:** Installation failures in air-gapped or proxy-restricted environments
- **Migration plan:** Consider vendoring a simplified local copy as suggested in the comment

### LightRAG Version Pinning
- **Risk:** `lightrag-hku[offline]>=1.4.9.3` - loose version constraint
- **Impact:** Breaking changes in LightRAG updates may break the pipeline
- **Migration plan:** Pin to specific version, test before upgrading

### charset_normalizer and chardet Redundancy
- **Risk:** Both included explicitly due to requests v2.32.5 compatibility
- **Files:** `pyproject.toml` (lines 30-31, with comment at line 17)
- **Impact:** Increased dependency footprint
- **Migration plan:** Monitor requests releases for when these can be removed

## Missing Critical Features

### Workspace Lifecycle Management
- **Problem:** No workspace archival/deletion functionality
- **Files:** `packages/3gpp-ai/threegpp_ai/models.py`
- **Blocks:** Long-term workspace management, storage cleanup

### TDoc Version History
- **Problem:** No version tracking for TDoc content changes
- **Files:** `src/tdoc_crawler/tdocs/models.py`
- **Blocks:** Diff viewing, change history, revision tracking

## Test Coverage Gaps

### Integration Tests Marked but Not Separated
- **What's not tested:** Integration tests run with full suite, no dedicated CI separation
- **Files:** `tests/` - uses `pytest.mark.integration` marker
- **Risk:** Slow integration tests may slow down CI feedback loop
- **Priority:** Medium

### AI Module Edge Cases
- **What's not tested:** LLM provider failures, embedding dimension mismatches
- **Files:** `packages/3gpp-ai/threegpp_ai/`
- **Risk:** Production failures may be silent or poorly diagnosed
- **Priority:** High

### Error Path Coverage
- **What's not tested:** Many `except Exception` blocks have no corresponding test cases
- **Files:** Throughout `src/tdoc_crawler/database/`, `src/tdoc_crawler/tdocs/sources/`
- **Risk:** Silent failures may go undetected
- **Priority:** Medium

## Code Quality Issues

### Print Statements in Production Code
- **Files:** `src/tdoc_crawler/cli/query.py`, `src/tdoc_crawler/cli/crawl.py`
- **Pattern:** 71 `console.print()` calls throughout CLI
- **Impact:** Not a bug but indicates CLI uses Rich for output rather than logging
- **Note:** This is intentional for CLI output, but error handling should prefer logging

### # noqa Comments Present
- **Files:** 
  - `src/tdoc_crawler/database/base.py` (S608 - SQL injection)
  - `src/tdoc_crawler/database/specs.py` (BLE001 - blind exception)
  - `src/tdoc_crawler/cli/utils.py` (S603, S606 - subprocess security)
- **Impact:** Suppressed linter warnings indicate known issues that need addressing
- **Priority:** Low (current suppressions are justified, but should be documented)

---

*Concerns audit: 2026-03-25*