Commit e6129944 authored by Jan Reimes's avatar Jan Reimes

docs: map existing codebase

parent dc158794
# Architecture

**Analysis Date:** 2026-03-25

## Pattern Overview

**Overall:** Domain-oriented modular architecture with clear separation of concerns

**Key Characteristics:**
- **Domain-oriented packages**: `tdocs/`, `meetings/`, `specs/` contain related business logic
- **Source abstraction**: Each domain uses sources/ subpackage for data fetching strategies
- **Operations layer**: Business operations separated from sources and models
- **Pydantic models**: All data structures use Pydantic for validation
- **Single source of truth**: `CacheManager` for all file paths and configuration

## Layers

### CLI Layer
- Purpose: Command-line interface using Typer framework
- Location: `src/tdoc_crawler/cli/`
- Contains: Command definitions, argument types, output formatting
- Depends on: Business logic layers (database, operations, sources)
- Used by: End users via `tdoc-crawler` and `spec-crawler` commands

### Domain Layer (TDocs)
- Purpose: TDoc crawling and metadata management
- Location: `src/tdoc_crawler/tdocs/`
- Contains: Sources (WhatTheSpec, Portal, DocList), operations (fetch, checkout, crawl)
- Depends on: `models/`, `database/`, `config/`, `http_client/`
- Used by: CLI commands (`crawl-tdocs`, `query-tdocs`, `open`, `checkout`)

### Domain Layer (Meetings)
- Purpose: Meeting metadata and TDoc list retrieval
- Location: `src/tdoc_crawler/meetings/`
- Contains: Sources (FTP, Portal), operations, MeetingCrawler facade
- Depends on: `models/`, `database/`, `config/`
- Used by: CLI commands (`crawl-meetings`, `query-meetings`)

### Domain Layer (Specs)
- Purpose: 3GPP specification management
- Location: `src/tdoc_crawler/specs/`
- Contains: Sources (FTP, Portal), operations (checkout, download)
- Depends on: `models/`, `database/`, `config/`
- Used by: CLI commands via `spec-crawler` entry point

### Data Access Layer
- Purpose: SQLite database abstraction using pydantic-sqlite
- Location: `src/tdoc_crawler/database/`
- Contains: `TDocDatabase`, `MeetingDatabase`, `SpecDatabase`, `WorkingGroupDatabase`
- Depends on: `models/`, `config/` (for path resolution)
- Used by: All domain operations

### Models Layer
- Purpose: Pydantic data models for validation and serialization
- Location: `src/tdoc_crawler/models/`
- Contains: Base models, domain-specific models (TDocMetadata, MeetingMetadata, SpecMetadata)
- Depends on: Pydantic library only
- Used by: All layers

### HTTP Client Layer
- Purpose: HTTP caching and session management using hishel
- Location: `src/tdoc_crawler/http_client/`
- Contains: `create_cached_session()`, `PoolConfig`, `download_to_file()`
- Depends on: `config/` (for cache file path)
- Used by: All sources that make HTTP requests

### Configuration Layer
- Purpose: Centralized path and configuration management
- Location: `src/tdoc_crawler/config/`
- Contains: `CacheManager`, `resolve_cache_manager()`, path constants
- Depends on: Environment variables (`TDC_CACHE_DIR`, `TDC_AI_STORE_PATH`)
- Used by: All layers (MUST be registered at startup)

### Parsers Layer
- Purpose: HTML, Excel, and portal page parsing
- Location: `src/tdoc_crawler/parsers/`
- Contains: `MeetingParser`, `PortalParser`, `parse_tdoc_portal_page()`
- Depends on: BeautifulSoup, lxml, python-calamine
- Used by: Sources layer

### AI/Processing Layer (packages/3gpp-ai)
- Purpose: AI-powered document processing, LightRAG integration
- Location: `packages/3gpp-ai/threegpp_ai/`
- Contains: LightRAG config, TDocProcessor, workspace management, RAG operations
- Depends on: `tdoc_crawler` (for TDoc metadata), LightRAG library, litellm
- Used by: CLI via `tdoc-crawler ai` commands

### Conversion Layer (packages/convert-lo)
- Purpose: LibreOffice-based document conversion
- Location: `packages/convert-lo/convert_lo/`
- Contains: `convert_to_pdf()`, `convert_to_markdown()`, format detection
- Depends on: LibreOffice (external)
- Used by: AI processing layer

### Pool Executors (packages/pool_executors)
- Purpose: Serial and parallel execution utilities
- Location: `packages/pool_executors/pool_executors/`
- Contains: `SerialPoolExecutor`, `ParallelPoolExecutor`, factory functions
- Depends on: concurrent.futures
- Used by: Crawlers and batch operations

## Data Flow

### TDoc Crawl Flow

1. CLI command `crawl-tdocs` invoked with meeting ID
2. `MeetingDatabase` queried for meeting info
3. `DocumentListSource.fetch()` retrieves TDoc list from 3GPP FTP
4. `TDocDatabase.upsert_tdocs()` stores metadata
5. `crawl_log` table updated with status
6. Duplicate detection prevents re-crawling
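
The duplicate-detection step can be sketched as follows; `CrawlLog` and the `fetch` callable are illustrative stand-ins for the real `crawl_log` table and `DocumentListSource.fetch()`, not the project's actual API:

```python
# Minimal sketch of the crawl flow's duplicate detection; names are
# illustrative stand-ins, not the project's real classes.
class CrawlLog:
    """Tracks which meetings have already been crawled."""

    def __init__(self) -> None:
        self._crawled: set[str] = set()

    def is_crawled(self, meeting_id: str) -> bool:
        return meeting_id in self._crawled

    def mark_crawled(self, meeting_id: str) -> None:
        self._crawled.add(meeting_id)


def crawl_tdocs(meeting_id: str, log: CrawlLog, fetch) -> list[str]:
    """Fetch the TDoc list unless the meeting was already crawled."""
    if log.is_crawled(meeting_id):
        return []  # duplicate detection: skip re-crawling
    tdocs = fetch(meeting_id)
    log.mark_crawled(meeting_id)
    return tdocs
```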

**State Management:**
- SQLite database via pydantic-sqlite for persistent storage
- File-based HTTP cache (hishel) for response caching
- Workspace state in `workspaces.json` for AI processing

### TDoc Fetch Flow

1. `TDocDatabase.query_tdocs()` retrieves known TDocs
2. `fetch_missing_tdocs()` identifies missing metadata
3. Source strategy selection:
   - **WhatTheSpec**: Fast, unauthenticated API
   - **Portal**: Authenticated, full metadata
4. `create_source()` returns appropriate source instance
5. Source fetches and stores metadata in database
6. Results returned to caller

### Spec Checkout Flow

1. `SpecDatabase.query()` retrieves spec info
2. `checkout_specs()` initiates download process
3. For each spec:
   - FTP server queried for latest version
   - `SpecDownloads` table checked for existing files
   - Download to `checkout_dir` if needed
4. ZIP extraction and file organization
5. Results returned with checkout paths
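
Step 3 can be sketched as a skip-if-present loop; `existing` stands in for the `SpecDownloads` table and the `download` callable for the FTP fetch:

```python
# Sketch of the per-spec download decision; names and the on-disk
# layout are illustrative, not the real schema.
from pathlib import Path


def checkout_specs(spec_ids, existing: set[str], download, checkout_dir: Path):
    """Download each spec unless a prior download is recorded."""
    paths = {}
    for spec_id in spec_ids:
        target = checkout_dir / f"{spec_id}.zip"
        if spec_id in existing:        # SpecDownloads table hit: reuse
            paths[spec_id] = target
            continue
        download(spec_id, target)      # fetch latest version from FTP
        existing.add(spec_id)
        paths[spec_id] = target
    return paths
```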

### AI Processing Flow

1. `WorkspaceRegistry` loads/creates workspace
2. `TDocProcessor.process()` iterates through documents
3. For each TDoc:
   - Checkout from database (metadata)
   - Download document if missing
   - Extract text via `kreuzberg`
   - Chunk text
   - Generate embeddings
   - Build knowledge graph (LightRAG)
4. Results stored in workspace storage
5. Query via `TDocRAG.query()` performs semantic search
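
The chunking step in the per-TDoc pipeline can be illustrated with a minimal overlapping-window chunker; the size and overlap values are illustrative, not the pipeline's real configuration:

```python
# Minimal overlapping-window chunker; sizes are illustrative defaults,
# not the values the AI pipeline actually uses.
def chunk_text(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into character windows that overlap by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```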

## Key Abstractions

### TDocSource (Abstract Base)
- Purpose: Common interface for TDoc data sources
- Examples: `src/tdoc_crawler/tdocs/sources/base.py`
- Pattern: Strategy pattern with `fetch_by_id()` and `fetch_batch()` methods

```python
from abc import ABC, abstractmethod

class TDocSource(ABC):
    @abstractmethod
    def fetch_by_id(self, tdoc_id: str, ...) -> TDocMetadata | None:
        ...

    @abstractmethod
    def fetch_batch(self, tdoc_ids: list[str], ...) -> list[TDocMetadata | None]:
        ...
```

- Subclasses: `WhatTheSpecSource`, `PortalSource`, `DocumentListSource`

### CacheManager
- Purpose: Single source of truth for all file paths
- Examples: `src/tdoc_crawler/config/__init__.py`
- Pattern: Singleton with global registration
```python
manager = resolve_cache_manager()
manager.db_file           # Database path
manager.http_cache_file   # HTTP cache path
manager.checkout_dir      # Document checkout directory
manager.ai_cache_dir      # AI storage directory
```

### Database (Pydantic-SQLite)
- Purpose: Type-safe database access with Pydantic models
- Examples: `src/tdoc_crawler/database/__init__.py`
- Pattern: Each domain has dedicated database class
```python
# TDoc operations
db = TDocDatabase(manager.db_file)
tdocs = db.query_tdocs(config)
db.upsert_tdocs(metadata)

# Meeting operations
db = MeetingDatabase(manager.db_file)
meetings = db.query_meetings(filters)

# Spec operations  
db = SpecDatabase(manager.db_file)
specs = db.query_specs(filters)
```

## Entry Points

### Main CLI Entry Point
- Location: `src/tdoc_crawler/cli/tdoc_app.py`
- Triggers: `tdoc-crawler` command
- Responsibilities: Register all commands, initialize CacheManager, handle global options

### Spec CLI Entry Point
- Location: `src/tdoc_crawler/cli/spec_app.py`
- Triggers: `spec-crawler` command
- Responsibilities: Spec-specific commands (checkout, download)

### Library Entry Point
- Location: `src/tdoc_crawler/__init__.py`
- Triggers: `import tdoc_crawler` in external code
- Responsibilities: Export public API components

## Error Handling

**Strategy:** Fail fast with clear error messages

**Patterns:**
- Type validation via Pydantic (`ValidationError`)
- Custom exceptions for domain errors (`PortalParsingError`, `PortalAuthenticationError`)
- Database errors wrapped via pydantic-sqlite
- HTTP errors via requests library

```python
# Source pattern - raise clear errors
if metadata is None:
    raise ValueError(f"TDoc {tdoc_id} not found")

# Database pattern - let pydantic handle validation
try:
    db.upsert_tdocs([metadata])
except ValidationError as exc:
    raise ValueError(f"Invalid metadata: {exc}") from exc
```

## Cross-Cutting Concerns

**Logging:** Python `logging` module via `src/tdoc_crawler/logging/`
- `get_logger(__name__)` for module-specific loggers
- `set_verbosity()` for CLI control
- Environment variable `TDC_VERBOSITY` for default level

**Validation:** Pydantic models with strict validation
- All data models inherit from `BaseModel`
- Type hints mandatory (`T | None` not `Optional[T]`)

**Authentication:** Portal credentials via `src/tdoc_crawler/credentials/`
- Environment variables: `EOL_USERNAME`, `EOL_PASSWORD`
- `PromptCredentialsOption` for interactive credential input

**HTTP Caching:** Hishel with file-based SQLite cache
- `create_cached_session()` must be used for ALL 3gpp.org requests
- 50-90% faster incremental crawls
- Prevents rate-limiting

---

*Architecture analysis: 2026-03-25*
# Codebase Concerns

**Analysis Date:** 2026-03-25

## Tech Debt

### SQL Injection Potential via f-string Table Names
- **Issue:** Dynamic SQL queries using f-strings with table names bypass parameterized query protection
- **Files:** `src/tdoc_crawler/database/base.py`
  - Line 153: `cursor = self.connection._db.execute(f"SELECT * FROM {table}")`
  - Line 207: `cursor = self.connection._db.execute(f"SELECT COUNT(*) FROM {table}")`
  - Line 209: `self.connection._db.execute(f"DELETE FROM {table}")`
- **Impact:** While tables are whitelisted via `model_map`, the pattern is fragile and could be copied elsewhere without proper safeguards
- **Current mitigation:** Whitelist check in `model_map` dictionary
- **Fix approach:** Use parameterized queries or a dedicated query builder that enforces whitelist at compile time
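
The suggested fix can be sketched as a single helper that enforces the whitelist before any f-string SQL is built; the table names below are illustrative, not the real `model_map` contents:

```python
# Sketch of centralized whitelist enforcement so f-string SQL can
# never receive an unchecked table name. Table names are illustrative.
import sqlite3

ALLOWED_TABLES = frozenset({"tdocs", "meetings", "specs", "working_groups"})


def select_all(conn: sqlite3.Connection, table: str) -> list[tuple]:
    """Run SELECT * only against whitelisted table names."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"Table not in whitelist: {table!r}")
    # Safe: the name was validated against a closed set above.
    return conn.execute(f"SELECT * FROM {table}").fetchall()
```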

### Broad Exception Handling
- **Issue:** Multiple `except Exception` blocks swallow specific error information
- **Files:** 
  - `src/tdoc_crawler/database/base.py` (lines 55, 187)
  - `src/tdoc_crawler/database/specs.py` (lines 187, 316, 478)
  - `src/tdoc_crawler/clients/portal.py` (lines 219, 244)
  - `src/tdoc_crawler/tdocs/sources/doclist.py` (lines 78, 133, 140, 214, 323, 361, 434)
  - `src/tdoc_crawler/tdocs/operations/fetch.py` (lines 239, 300)
  - `src/tdoc_crawler/tdocs/sources/portal.py` (lines 73, 83)
  - `src/tdoc_crawler/tdocs/sources/whatthespec.py` (line 152)
  - `src/tdoc_crawler/workers/tdoc_worker.py` (line 52)
  - `src/tdoc_crawler/meetings/operations/crawl.py` (line 168)
  - `src/tdoc_crawler/specs/downloads.py` (lines 114, 252, 265)
  - `src/tdoc_crawler/cli/specs.py` (line 87)
  - `src/tdoc_crawler/tdocs/operations/crawl.py` (line 265)
- **Impact:** Debugging difficulty, potential silent failures, loss of error context
- **Fix approach:** Catch specific exception types, log with full context, re-raise with `from exc`
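
The fix approach in code, using a hypothetical `parse_tdoc_row` helper: catch the narrow exception, log with context, and re-raise with the original as the cause.

```python
# Sketch of narrow exception handling; parse_tdoc_row is hypothetical.
import logging

logger = logging.getLogger(__name__)


def parse_tdoc_row(row: dict) -> str:
    try:
        return row["tdoc_id"]
    except KeyError as exc:
        # Log full context, then re-raise with the cause chained.
        logger.error("Row missing tdoc_id: %r", row)
        raise ValueError(f"Malformed TDoc row: {row!r}") from exc
```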

### Unused WorkspaceStatus Enum
- **Issue:** `WorkspaceStatus` enum has only `ACTIVE` value, with TODO comment questioning its necessity
- **Files:** `packages/3gpp-ai/threegpp_ai/models.py` (lines 72-77)
- **Impact:** Dead code, confusion about intended functionality
- **Fix approach:** Either remove if truly unnecessary, or implement full lifecycle (ACTIVE, ARCHIVED, DELETED)

## Known Bugs

### Duplicate Login Check in Portal Client
- **Symptoms:** Redundant authentication failure check
- **Files:** `src/tdoc_crawler/clients/portal.py` (lines 152-158)
- **Trigger:** Authentication flow always
- **Details:** The same check `response_text.lower() == "failed"` is performed twice consecutively
- **Workaround:** None needed - second check is unreachable
- **Fix approach:** Remove duplicate check at lines 157-158

## Security Considerations

### Subprocess Security (noqa: S603, S606)
- **Risk:** Shell command execution without full validation
- **Files:** `src/tdoc_crawler/cli/utils.py` (lines 25, 29, 36)
- **Current mitigation:** Commands use hardcoded paths (`/usr/bin/open`, `/usr/bin/xdg-open`)
- **Recommendations:** Consider using `shutil.which()` to verify command existence before execution
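
The recommendation as a sketch; `resolve_opener` is a hypothetical helper, not the project's `cli/utils.py` API:

```python
# Sketch of verifying the opener command before any subprocess call.
import shutil
import subprocess
import sys


def resolve_opener(name: str) -> str:
    """Return the absolute path of a command, or fail loudly."""
    resolved = shutil.which(name)
    if resolved is None:
        raise FileNotFoundError(f"Opener not found on PATH: {name}")
    return resolved


def open_path(path: str) -> None:
    opener = "open" if sys.platform == "darwin" else "xdg-open"
    # Verified absolute path, argument list, no shell involved.
    subprocess.run([resolve_opener(opener), path], check=True)
```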

### Environment Variable Credential Handling
- **Risk:** Credentials stored in environment variables may leak via process listing
- **Files:** `src/tdoc_crawler/credentials.py`
- **Current mitigation:** Uses `hide_input=True` for password prompts
- **Recommendations:** Document that credentials should be set via secure credential managers in production

### HTTP Session Cookie Handling
- **Risk:** Authentication session uses non-cached session for cookies
- **Files:** `src/tdoc_crawler/clients/portal.py` (lines 121-122, 298-307)
- **Current mitigation:** Separate session for auth vs. cached requests
- **Recommendations:** Good pattern - document the rationale in code comments

## Performance Bottlenecks

### In-Memory Table Scanning for Queries
- **Problem:** Spec queries load all records then filter in Python
- **Files:** `src/tdoc_crawler/database/specs.py` (lines 190-250)
- **Cause:** `_table_rows()` fetches entire table, then filters with list comprehensions
- **Improvement path:** Add parameterized SQL WHERE clauses for filtering before loading data
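
The improvement path as a sketch against an illustrative schema: a parameterized `WHERE` clause lets SQLite do the filtering instead of loading every row into Python.

```python
# Sketch of pushing filtering into SQL; the schema is illustrative,
# not the project's real specs table.
import sqlite3


def query_specs(conn: sqlite3.Connection, series: str) -> list[tuple]:
    """Fetch only matching rows via a parameterized WHERE clause."""
    return conn.execute(
        "SELECT number, title FROM specs WHERE series = ?", (series,)
    ).fetchall()
```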

### Document List Parsing
- **Problem:** Large Excel files parsed entirely in memory
- **Files:** `src/tdoc_crawler/tdocs/sources/doclist.py`
- **Cause:** Pandas loads entire Excel file before processing
- **Improvement path:** Consider streaming parsers for very large meeting document lists

### Console.print() Usage Over Logging
- **Problem:** 71 `console.print()` calls throughout CLI instead of structured logging
- **Files:** 
  - `src/tdoc_crawler/cli/query.py` (26 occurrences)
  - `src/tdoc_crawler/cli/printing.py` (5 occurrences)
  - `src/tdoc_crawler/cli/_shared.py` (6 occurrences)
  - `src/tdoc_crawler/cli/crawl.py` (10 occurrences)
  - `src/tdoc_crawler/cli/tdoc_app.py` (14 occurrences)
  - `src/tdoc_crawler/cli/specs.py` (4 occurrences)
- **Cause:** Direct console output bypasses logging infrastructure
- **Improvement path:** Acceptable for CLI output, but ensure error cases use `logger` for debuggability

## Fragile Areas

### NotImplementedError in Source Classes
- **Files:** 
  - `src/tdoc_crawler/specs/sources/base.py` (lines 16, 20)
  - `src/tdoc_crawler/tdocs/sources/doclist.py` (line 404)
  - `src/tdoc_crawler/tdocs/sources/portal.py` (line 105)
  - `src/tdoc_crawler/tdocs/sources/whatthespec.py` (line 174)
- **Why fragile:** Abstract methods raise `NotImplementedError` - indicates incomplete interface implementation
- **Safe modification:** Check all subclasses implement required interface methods before modifying base class
- **Test coverage:** Limited - these error paths may not be covered by tests

### Multi-Package Workspace Dependencies
- **Files:** 
  - `packages/3gpp-ai/` depends on `src/tdoc_crawler/`
  - `packages/convert-lo/` is a workspace dependency
  - `packages/pool_executors/` is a workspace dependency
- **Why fragile:** Changes to core `tdoc_crawler` may break AI package without clear boundaries
- **Safe modification:** Run full test suite including AI tests when modifying core package
- **Test coverage:** AI tests exist in `tests/ai/`

## Scaling Limits

### SQLite Concurrency
- **Current capacity:** Single-file database with pydantic-sqlite
- **Limit:** Write concurrency limited; concurrent crawls may conflict
- **Scaling path:** Consider PostgreSQL for high-volume crawling scenarios

### HTTP Cache Size
- **Current capacity:** Single SQLite file for HTTP cache (`~/.3gpp-crawler/http-cache.sqlite3`)
- **Limit:** Large crawl sessions may create multi-GB cache files
- **Scaling path:** Implement cache rotation/compaction, or use Redis for distributed caching

### Subinterpreter Worker Pool
- **Current capacity:** Configurable via `TDC_WORKERS` env var (default: 4)
- **Limit:** Each worker has full Python runtime overhead
- **Scaling path:** Consider async/await patterns with httpx for better I/O-bound scaling

## Dependencies at Risk

### doc2txt Git Dependency
- **Risk:** Git dependency for doc2txt may fail in restricted environments
- **Files:** `packages/3gpp-ai/pyproject.toml` (line 49)
- **Impact:** Installation failures in air-gapped or proxy-restricted environments
- **Migration plan:** Consider vendoring a simplified local copy as suggested in the comment

### LightRAG Version Pinning
- **Risk:** `lightrag-hku[offline]>=1.4.9.3` - loose version constraint
- **Impact:** Breaking changes in LightRAG updates may break the pipeline
- **Migration plan:** Pin to specific version, test before upgrading

### charset_normalizer and chardet Redundancy
- **Risk:** Both included explicitly due to requests v2.32.5 compatibility
- **Files:** `pyproject.toml` (lines 30-31, with comment at line 17)
- **Impact:** Increased dependency footprint
- **Migration plan:** Monitor requests releases for when these can be removed

## Missing Critical Features

### Workspace Lifecycle Management
- **Problem:** No workspace archival/deletion functionality
- **Files:** `packages/3gpp-ai/threegpp_ai/models.py`
- **Blocks:** Long-term workspace management, storage cleanup

### TDoc Version History
- **Problem:** No version tracking for TDoc content changes
- **Files:** `src/tdoc_crawler/tdocs/models.py`
- **Blocks:** Diff viewing, change history, revision tracking

## Test Coverage Gaps

### Integration Tests Marked but Not Separated
- **What's not tested:** Integration tests run with full suite, no dedicated CI separation
- **Files:** `tests/` - uses `pytest.mark.integration` marker
- **Risk:** Slow integration tests may slow down CI feedback loop
- **Priority:** Medium

### AI Module Edge Cases
- **What's not tested:** LLM provider failures, embedding dimension mismatches
- **Files:** `packages/3gpp-ai/threegpp_ai/`
- **Risk:** Production failures may be silent or poorly diagnosed
- **Priority:** High

### Error Path Coverage
- **What's not tested:** Many `except Exception` blocks have no corresponding test cases
- **Files:** Throughout `src/tdoc_crawler/database/`, `src/tdoc_crawler/tdocs/sources/`
- **Risk:** Silent failures may go undetected
- **Priority:** Medium

## Code Quality Issues

### Print Statements in Production Code
- **Files:** `src/tdoc_crawler/cli/query.py`, `src/tdoc_crawler/cli/crawl.py`
- **Pattern:** 71 `console.print()` calls throughout CLI
- **Impact:** Not a bug but indicates CLI uses Rich for output rather than logging
- **Note:** This is intentional for CLI output, but error handling should prefer logging

### # noqa Comments Present
- **Files:** 
  - `src/tdoc_crawler/database/base.py` (S608 - SQL injection)
  - `src/tdoc_crawler/database/specs.py` (BLE001 - blind exception)
  - `src/tdoc_crawler/cli/utils.py` (S603, S606 - subprocess security)
- **Impact:** Suppressed linter warnings indicate known issues that need addressing
- **Priority:** Low (current suppressions are justified, but should be documented)

---

*Concerns audit: 2026-03-25*