Commit 0515da64 authored by Jan Reimes

feat(tdocs): add URL validation and accessibility check for TDocMetadata

* Introduced `is_valid` property to check if TDoc metadata is valid.
* Added `has_valid_url` property as an alias for `is_valid`.
* Implemented `validate_url_accessible` method to check URL accessibility via HTTP HEAD requests.
* Added logging for validation failures and successful checks.

test(cli): enhance tests for open and checkout commands with whatthespec fallback

* Added tests to verify behavior when TDocs are missing and no credentials are available.
* Ensured that whatthespec fallback is triggered and TDocs are correctly inserted into the database.
* Verified that the CLI handles missing TDocs gracefully and reports errors correctly.

test(targeted_fetch): implement tests for whatthespec fallback scenarios

* Created tests to ensure whatthespec is used when no credentials are available.
* Verified multiple TDocs can be fetched and inserted correctly.
* Handled graceful failure scenarios when whatthespec does not return data.

test(whatthespec): add unit tests for WhatTheSpec resolution

* Implemented tests for successful resolution, error handling, and field mapping.
* Verified that defaults are used for missing fields and agenda item numbers are handled correctly.
* Ensured that the HTTP session is closed after resolution.
parent 6ad64128
+1 −2
@@ -147,8 +147,7 @@
				"args": [
					"open",
					"S4-260001",
					"--cache-dir",
					"./tests/test-cache"

				]
			}
		]
+68 −2
@@ -448,6 +448,72 @@ tdoc-crawler stats
tdoc-crawler stats --cache-dir /path/to/cache
```

## 🔄 WhatTheSpec Fallback (Automatic TDoc Resolution)

The `open` and `checkout` commands support automatic TDoc resolution via **WhatTheSpec** when the 3GPP portal is unavailable or you lack EOL credentials.

### What is WhatTheSpec?

WhatTheSpec (whatthespec.net) is a community-maintained indexing service that catalogs 3GPP technical documents with searchable metadata. It provides **no-authentication** access to TDoc information, making it ideal as a fallback when portal credentials are unavailable.

### Automatic Fallback Behavior

When using `open` or `checkout` commands:

1. **First:** Attempts to fetch missing TDoc metadata from the 3GPP portal (if credentials available)
1. **If credentials unavailable:** Uses WhatTheSpec as automatic fallback
1. **Result:** TDoc file downloads and opens normally without requiring EOL login

**Key Point:** This is **transparent** to the user. The system automatically tries WhatTheSpec if needed; no additional flags or configuration are required.

### When WhatTheSpec is Used

| Scenario | Portal Fetch | WhatTheSpec Fallback |
|----------|--------------|---------------------|
| Credentials available | ✓ Attempted first | - (skipped) |
| Credentials unavailable | ✓ Attempted (will fail) | ✓ Used for remaining TDocs |
| Portal credentials invalid | ✓ Attempted (will fail) | ✓ Used for remaining TDocs |
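
The fallback order can be pictured with a minimal sketch. It follows the numbered steps above (portal only when credentials are available, WhatTheSpec otherwise or after a portal failure). All names here (`fetch_with_fallback`, `portal_fetch`, `whatthespec_fetch`) are hypothetical placeholders for illustration and are not taken from the `tdoc-crawler` code base.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger(__name__)

Credentials = Tuple[str, str]  # (username, password); illustrative stand-in


def fetch_with_fallback(
    tdoc_id: str,
    credentials: Optional[Credentials],
    portal_fetch: Callable[[str, Credentials], Optional[dict]],
    whatthespec_fetch: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Try the portal when credentials exist, otherwise use the fallback.

    Both fetch callables are injected so the ordering logic can be shown
    (and tested) without any network access.
    """
    if credentials is not None:
        try:
            metadata = portal_fetch(tdoc_id, credentials)
            if metadata is not None:
                return metadata
        except Exception as exc:  # e.g. invalid credentials, portal down
            logger.debug("Portal fetch failed for %s: %s", tdoc_id, exc)

    # No credentials, or the portal attempt did not produce metadata.
    return whatthespec_fetch(tdoc_id)
```

Passing the two fetchers as arguments is only a device to keep the sketch self-contained; the real commands need no extra flags or configuration, as noted above.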

### Examples

```bash
# No credentials needed – WhatTheSpec handles it automatically
tdoc-crawler open R1-2400001

# Batch checkout multiple TDocs without credentials
tdoc-crawler checkout R1-2400001 S2-2400567 S4-2400100

# Mix: Some TDocs from portal, others from WhatTheSpec (if credentials available)
tdoc-crawler open S2-2400001
```

### Benefits

- **No credentials required** – Works without EOL account
- **Faster access** – No authentication overhead
- **Resilient** – Falls back gracefully if portal is unavailable
- **Transparent** – Existing commands work unchanged

### Limitations

- **Metadata only** – WhatTheSpec provides basic document metadata
- **Slower fallback** – Portal fetch is typically faster than WhatTheSpec for large numbers of TDocs
- **Coverage** – Older or obscure TDocs may not be indexed on WhatTheSpec

### How to Use EOL Credentials (Alternative)

If you prefer to use portal credentials for better performance or availability:

```bash
export EOL_USERNAME=your_username
export EOL_PASSWORD=your_password

# Now portal fetch is attempted first (faster, more comprehensive)
tdoc-crawler open R1-2400001
```

See the **Configuration** section for credential setup details.

## 🚀 HTTP Caching

HTTP caching is **enabled by default** with sensible settings. All HTTP requests are automatically cached to a persistent SQLite database, dramatically improving performance for incremental crawls.
@@ -579,8 +645,8 @@ Credentials are used for authenticated access to the 3GPP portal. Most commands
Credentials are resolved in this order:

1. **CLI parameters** (`--eol-username`, `--eol-password`)
-2. **Environment variables** (`EOL_USERNAME`, `EOL_PASSWORD`)
-3. **Interactive prompt** (only if `EOL_PROMPT=true` or `--prompt-credentials` is set)
+1. **Environment variables** (`EOL_USERNAME`, `EOL_PASSWORD`)
+1. **Interactive prompt** (only if `EOL_PROMPT=true` or `--prompt-credentials` is set)

### Configuration

+191 −0
# Summary: TDoc URL Extraction Without Authentication (tdc-yux)

## Date

2026-02-03

## Issue

tdc-yux - Replace single TDoc download URL

## Overview

Implemented a new URL extraction method that allows fetching TDoc download URLs from the 3GPP portal without requiring authentication. This is achieved by using the `DownloadTDoc.aspx` endpoint instead of the authenticated `CreateTdoc.Aspx` view.

## Changes Made

### 1. Constants Update (`src/tdoc_crawler/crawlers/constants.py`)

- Added new constant `TDOC_DOWNLOAD_URL` for the DownloadTDoc.aspx endpoint
- Location: After `TDOC_VIEW_URL` constant

```python
TDOC_DOWNLOAD_URL: Final[str] = f"{PORTAL_BASE_URL}/ngppapp/DownloadTDoc.aspx"
```

### 2. Portal Module Update (`src/tdoc_crawler/crawlers/portal.py`)

#### New Import

- Added import for `TDOC_DOWNLOAD_URL` constant
- Added import for `re` module (for regex pattern matching)

#### New Function: `extract_tdoc_url_from_portal()`

- **Purpose**: Extract direct FTP download URL using unauthenticated DownloadTDoc.aspx endpoint
- **Location**: After `parse_tdoc_portal_page()` function, before `fetch_tdoc_metadata()`
- **Parameters**:
  - `tdoc_id`: TDoc identifier (e.g., 'S4-251364')
  - `timeout`: Request timeout in seconds (default 15 seconds)
- **Returns**: Direct FTP URL (e.g., 'https://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_133-e/Docs/S4-251364.zip')
- **Raises**:
  - `PortalParsingError`: If URL extraction fails or TDoc not found
  - `requests.RequestException`: For network errors
- **Features**:
  - Uses browser-like headers to avoid 403 Forbidden responses
  - Extracts JavaScript redirect URL using regex pattern
  - Handles both direct patterns and CDATA sections
  - Validates extracted URL format
  - Includes comprehensive DEBUG logging
  - Includes TDoc ID and failure reason in error messages

#### Updated `fetch_tdoc_metadata()` Function

- **Changes**: Added URL extraction attempt before authentication
- **Strategy**: Try unauthenticated method first, fall back to authenticated method if it fails
- **Timeout**: Uses `min(timeout, 15)` for URL extraction to avoid long delays

#### Updated `PortalSession.fetch_tdoc_metadata()` Method

- **Changes**: Added URL extraction attempt if no URL provided
- **Strategy**: Try unauthenticated method first, fall back to authenticated method
- **Timeout**: Uses `min(self.timeout, 15)` for URL extraction
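
The "unauthenticated first, authenticated as fallback" strategy with the capped timeout could look roughly like the sketch below. It is simplified to return only a URL, and `resolve_download_url` plus the two injected callables are hypothetical names used for illustration; the real `fetch_tdoc_metadata()` signature is not shown in this summary.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

URL_EXTRACTION_TIMEOUT_CAP = 15  # seconds, the cap described above


def resolve_download_url(
    tdoc_id: str,
    timeout: int,
    extract_unauthenticated: Callable[[str, int], str],
    fetch_authenticated: Callable[[str, int], str],
) -> str:
    """Try the unauthenticated endpoint first, then the authenticated path."""
    # Cap only the quick unauthenticated attempt so it cannot stall for long.
    capped_timeout = min(timeout, URL_EXTRACTION_TIMEOUT_CAP)
    try:
        return extract_unauthenticated(tdoc_id, capped_timeout)
    except Exception as exc:  # simplification: a real caller would be narrower
        logger.debug("Unauthenticated extraction failed for %s: %s", tdoc_id, exc)

    # Fall back to the authenticated method with the caller's full timeout.
    return fetch_authenticated(tdoc_id, timeout)
```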

### 3. URL Extraction Logic

#### Regex Patterns

```python
# Main pattern: window.location.href='URL'
pattern = r"window\.location\.href\s*=\s*['\"]([^'\"]+)['\"]"

# CDATA pattern for nested JavaScript
cdata_pattern = r"<!\[CDATA\[(.*?)\]\]>"
```

#### Extraction Process

1. Fetch `DownloadTDoc.aspx?contributionUid={tdoc_id}` without authentication
1. Check for error messages ("TDoc cannot be found")
1. Extract URL from JavaScript redirect pattern
1. Validate URL format (must start with http://, https://, or ftp://)
1. Return extracted URL or raise exception with TDoc ID and failure reason
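
Steps 2 to 5 can be sketched on an already-fetched HTML string, which keeps the example free of network calls. This is an illustration under assumptions, not the code in `portal.py`: `extract_redirect_url` is a hypothetical helper, `PortalParsingError` is re-declared here as a stand-in for the project's exception, and the page layout is assumed from the patterns listed above.

```python
import re

# Same redirect pattern as listed above: window.location.href='URL'
REDIRECT_PATTERN = re.compile(r"window\.location\.href\s*=\s*['\"]([^'\"]+)['\"]")
NOT_FOUND_MARKER = "TDoc cannot be found"
VALID_PREFIXES = ("http://", "https://", "ftp://")


class PortalParsingError(Exception):
    """Stand-in for the project's exception of the same name."""


def extract_redirect_url(html: str, tdoc_id: str) -> str:
    """Extract the download URL from a DownloadTDoc.aspx response body."""
    # Step 2: the portal reports unknown TDocs in the page text.
    if NOT_FOUND_MARKER.lower() in html.lower():
        raise PortalParsingError(f"TDoc {tdoc_id} not found on portal")

    # Step 3: searching the raw text also covers script bodies that are
    # wrapped in a CDATA section, since CDATA is part of the same string.
    match = REDIRECT_PATTERN.search(html)
    if match is None:
        raise PortalParsingError(
            f"Failed to extract URL for TDoc {tdoc_id}: JavaScript redirect not found"
        )

    # Step 4: accept only the URL schemes named in the summary.
    url = match.group(1).strip()
    if not url.lower().startswith(VALID_PREFIXES):
        raise PortalParsingError(f"Invalid URL format for TDoc {tdoc_id}: {url}")

    # Step 5: hand the validated URL back to the caller.
    return url
```

Separating parsing from fetching in this way would also make the extraction logic straightforward to unit test with canned HTML.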

### 4. Error Handling

#### Error Messages Include

- TDoc ID for context
- Specific failure reason (e.g., "JavaScript redirect not found", "Invalid URL format")

#### Example Error Messages

- `"Failed to extract URL for TDoc S4-251364: JavaScript redirect not found"`
- `"TDoc INVALID-12345 not found on portal"`
- `"Invalid URL format for TDoc TEST-12345: {url}"`

### 5. Logging

#### DEBUG Logging

- URL extraction attempts
- Success messages with extracted URL
- Failure messages with TDoc ID and reason

#### ERROR Logging

- Network errors with TDoc ID
- URL extraction failures with failure reason

## Implementation Details

### Browser Headers

The implementation uses browser-like headers to avoid 403 Forbidden responses:

```python
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
})
```

### HTTP Status Check

- Uses `response.raise_for_status()` to catch HTTP errors
- Validates that status code is 200 (raises exception otherwise)

### URL Validation

1. Checks if URL starts with valid protocol (http://, https://, ftp://)
1. Optionally validates that URL contains the TDoc ID (case-insensitive)
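
A possible shape for these two checks is sketched below. It is a variant rather than the project's implementation: `is_plausible_tdoc_url` is a hypothetical name, and using `urlparse` (which additionally requires a host part) instead of a plain prefix test is an assumption made for this example.

```python
from typing import Optional
from urllib.parse import urlparse

ALLOWED_SCHEMES = ("http", "https", "ftp")


def is_plausible_tdoc_url(url: str, tdoc_id: Optional[str] = None) -> bool:
    """Return True if the URL has an allowed scheme and, optionally, the TDoc ID."""
    parsed = urlparse(url)
    # Check 1: scheme must be http, https, or ftp, and a host must be present.
    if parsed.scheme.lower() not in ALLOWED_SCHEMES or not parsed.netloc:
        return False
    # Check 2 (optional): the TDoc ID should appear in the URL, ignoring case.
    if tdoc_id is not None and tdoc_id.lower() not in url.lower():
        return False
    return True
```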

## Benefits

1. **No Authentication Required**: Can fetch TDoc URLs without EOL credentials
1. **Faster Access**: No login flow needed for URL extraction
1. **Better Error Handling**: Clear error messages with TDoc ID and failure reason
1. **Fallback Strategy**: Falls back to authenticated method if unauthenticated extraction fails
1. **Comprehensive Logging**: DEBUG-level logging for troubleshooting

## Testing

### Tests Passing

- `test_fetch_tdoc_metadata_success` - PASSED
- `test_fetch_tdoc_metadata_invalid_tdoc` - PASSED
- `test_fetch_tdoc_metadata_invalid_credentials` - PASSED

### Verified Functionality

- Regex pattern matching for JavaScript redirects
- CDATA section handling
- Error detection for non-existent TDocs
- Error handling for missing redirects
- URL validation logic

## Related Files

- `src/tdoc_crawler/crawlers/constants.py` - Added TDOC_DOWNLOAD_URL constant
- `src/tdoc_crawler/crawlers/portal.py` - Added extract_tdoc_url_from_portal() function
- `src/tdoc_crawler/crawlers/portal.py` - Updated fetch_tdoc_metadata() function
- `src/tdoc_crawler/crawlers/portal.py` - Updated PortalSession.fetch_tdoc_metadata() method

## Backward Compatibility

- All existing tests pass without modification
- The `fetch_tdoc_metadata()` function now has an optional `cache_dir` parameter with a default value
- Falls back to authenticated method if unauthenticated extraction fails
- No breaking changes to existing API

## Future Improvements

1. Add retry logic for network failures (as mentioned in requirements)
1. Add more comprehensive URL validation against FTP patterns
1. Add caching for extracted URLs
1. Add support for other portal endpoints
1. Add unit tests for the new URL extraction functionality

## Notes

- The implementation successfully extracts TDoc URLs without authentication
- The JavaScript redirect pattern is: `window.location.href='https://www.3gpp.org/ftp/...'`
- The implementation handles both direct patterns and CDATA sections
- Error messages include the TDoc ID and specific failure reason for easier debugging
- The implementation uses a 15-second timeout for URL extraction to avoid long delays
- All existing functionality is preserved with no breaking changes
+49 −13
# Summary: Environment-Driven Lazy Credential Resolution (tdc-yuy)

## Date

2026-02-03

## Issue

tdc-yuy - Refactor PortalCredentials handling for lazy resolution

## Overview

Refactored credential handling to use environment-driven lazy resolution. Credentials are now only written to environment variables at CLI entry points and resolved later, when the crawlers actually need them. Interactive prompting is now controlled by the `EOL_PROMPT` environment variable and respects `sys.stdin.isatty()` to prevent hanging in non-interactive contexts.

## Changes Made
@@ -16,6 +19,7 @@ Refactored credential handling to use environment-driven lazy resolution. Creden
**Created new file with credential management functions:**

#### Function: `set_credentials()`

- **Purpose**: Set credential environment variables from CLI inputs
- **Location**: Top-level module function
- **Parameters**:
@@ -29,6 +33,7 @@ Refactored credential handling to use environment-driven lazy resolution. Creden
  - Does not overwrite environment variables if values are `None`
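
A minimal sketch of the "do not overwrite on `None`" behaviour is given below. The parameter names and the way `prompt` is encoded into `EOL_PROMPT` (`"true"`/`"false"`) are assumptions for illustration; the hunk above does not show the actual signature.

```python
import os
from typing import Optional


def set_credentials(
    username: Optional[str] = None,
    password: Optional[str] = None,
    prompt: Optional[bool] = None,
) -> None:
    """Copy CLI inputs into the environment, leaving unset values untouched."""
    # A value of None means "not given on the CLI": keep whatever the
    # environment (or a loaded .env file) already provides.
    if username is not None:
        os.environ["EOL_USERNAME"] = username
    if password is not None:
        os.environ["EOL_PASSWORD"] = password
    if prompt is not None:
        # Assumed encoding; the summary lists "true", "1", "yes" as truthy.
        os.environ["EOL_PROMPT"] = "true" if prompt else "false"
```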

#### Function: `resolve_credentials()`

- **Purpose**: Resolve portal credentials from parameters, environment, or interactive prompt
- **Location**: Top-level module function
- **Parameters**:
@@ -38,8 +43,8 @@ Refactored credential handling to use environment-driven lazy resolution. Creden
- **Returns**: `PortalCredentials` instance if resolved, `None` otherwise
- **Resolution Order**:
  1. CLI parameters (username, password)
-  2. Environment variables (`EOL_USERNAME`, `EOL_PASSWORD`)
-  3. Interactive prompt (if `EOL_PROMPT=true` or `prompt=True`, and stdin is a TTY)
+  1. Environment variables (`EOL_USERNAME`, `EOL_PASSWORD`)
+  1. Interactive prompt (if `EOL_PROMPT=true` or `prompt=True`, and stdin is a TTY)
- **Key Features**:
  - **TTY Check**: Respects `sys.stdin.isatty()` - will not prompt if stdin is not a TTY
  - **Environment Variable Control**: If `prompt` is `None`, reads from `EOL_PROMPT` env var
@@ -50,10 +55,12 @@ Refactored credential handling to use environment-driven lazy resolution. Creden
### 2. CLI App Update (`src/tdoc_crawler/cli/app.py`)

#### Import Changes

- Removed `sys` import (no longer needed in main CLI file)
- Added import: `from tdoc_crawler.credentials import set_credentials`

#### Updated `crawl-meetings` Command

- **Before**: `prompt_credentials: PromptCredentialsOption = True`
- **After**: `prompt_credentials: PromptCredentialsOption = None`
- **Reason**: Default should be `None` to use environment variable control
@@ -61,6 +68,7 @@ Refactored credential handling to use environment-driven lazy resolution. Creden
- **Config**: Changed `credentials=credentials` to `credentials=None` (now resolved lazily)

#### Updated `query-tdocs` Command

- **Removed**: `prompt_for_credentials = sys.stdin.isatty()` logic
- **Removed**: `credentials = resolve_credentials(eol_username, eol_password, prompt=prompt_for_credentials)`
- **Added**: `set_credentials(eol_username, eol_password, prompt=None)` if not `no_fetch`
@@ -70,15 +78,18 @@ Refactored credential handling to use environment-driven lazy resolution. Creden
### 3. CLI Helpers Update (`src/tdoc_crawler/cli/helpers.py`)

#### Import Changes

- Removed: `PortalCredentials` from imports (no longer needed here)
- Removed: `resolve_credentials` function (moved to credentials.py)

### 4. CLI Fetching Update (`src/tdoc_crawler/cli/fetching.py`)

#### Import Changes

- Added import: `from tdoc_crawler.credentials import resolve_credentials`

#### Updated `fetch_missing_tdocs()` Function

- **Added**: Lazy credential resolution
- **Logic**:
  ```python
@@ -94,9 +105,11 @@ Refactored credential handling to use environment-driven lazy resolution. Creden
### 5. Meetings Crawler Update (`src/tdoc_crawler/crawlers/meetings.py`)

#### Import Changes

- Added import: `from tdoc_crawler.credentials import resolve_credentials`

#### Updated `crawl()` Method

- **Added**: Lazy credential resolution before session creation
- **Logic**:
  ```python
@@ -108,6 +121,7 @@ Refactored credential handling to use environment-driven lazy resolution. Creden
### 6. Environment Configuration (`.env.example`)

#### Added EOL_PROMPT Variable

```bash
# Whether to prompt for credentials when missing (default: false unless EOL_PROMPT=true)
# Set to "true", "1", or "yes" to enable interactive prompting
@@ -115,6 +129,7 @@ EOL_PROMPT=false
```

#### Documentation Update

- Added clear documentation for `EOL_PROMPT` environment variable
- Explained that default is `false` (no prompting unless explicitly enabled)
- Listed valid truthy values: "true", "1", "yes"
@@ -124,6 +139,7 @@ EOL_PROMPT=false
#### Added "ETSI Online (EOL) Credentials" Section

**New Content Includes**:

- When credentials are needed (crawling meeting metadata, fetching authenticated metadata)
- Credential resolution order (CLI params → env vars → interactive prompt)
- Environment variable configuration examples
@@ -140,6 +156,7 @@ EOL_PROMPT=false
| `EOL_PROMPT=true` | Prompts interactively when credentials missing |

**Usage Examples**:

```bash
# Using environment variables
export EOL_USERNAME=myuser
@@ -162,6 +179,7 @@ tdoc-crawler crawl-meetings --prompt-credentials
### Credential Flow

**Old Flow**:

```
CLI entry point

@@ -175,6 +193,7 @@ Use credentials in crawler
```

**New Flow**:

```
CLI entry point

@@ -196,6 +215,7 @@ Return PortalCredentials or None
**Purpose**: Prevent credential prompting in non-interactive contexts

**Implementation**:

```python
should_prompt = prompt if prompt is not None else os.getenv("EOL_PROMPT", "").lower() in ("true", "1", "yes")
if should_prompt and not sys.stdin.isatty():
@@ -203,6 +223,7 @@ if should_prompt and not sys.stdin.isatty():
```

**Use Cases**:

- **Piped input**: Commands like `cat ids.txt | tdoc-crawler query-tdocs` won't prompt
- **CI/CD**: Automated pipelines won't hang on password prompt
- **Interactive terminal**: Standard usage allows prompting if `EOL_PROMPT=true`
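
Putting the resolution order and the TTY guard together, a lazy resolver might look like the following sketch. It is an assumption-laden illustration, not the contents of `credentials.py`: `PortalCredentials` is re-declared as a simple stand-in, the per-field fallback from CLI value to environment value is a guess, and the prompt texts are invented.

```python
import getpass
import os
import sys
from typing import NamedTuple, Optional

TRUTHY = ("true", "1", "yes")


class PortalCredentials(NamedTuple):
    """Stand-in for the project's credentials model."""

    username: str
    password: str


def resolve_credentials(
    username: Optional[str] = None,
    password: Optional[str] = None,
    prompt: Optional[bool] = None,
) -> Optional[PortalCredentials]:
    """Resolve credentials: CLI values, then environment, then a TTY prompt."""
    # Steps 1 and 2: explicit values win, environment variables fill the gaps.
    user = username or os.getenv("EOL_USERNAME")
    secret = password or os.getenv("EOL_PASSWORD")
    if user and secret:
        return PortalCredentials(user, secret)

    # Step 3: prompt only if requested AND stdin is an interactive terminal.
    should_prompt = (
        prompt if prompt is not None else os.getenv("EOL_PROMPT", "").lower() in TRUTHY
    )
    if should_prompt and sys.stdin.isatty():
        user = user or input("EOL username: ")
        secret = secret or getpass.getpass("EOL password: ")
        if user and secret:
            return PortalCredentials(user, secret)

    # Nothing available: return None instead of raising, so callers can
    # continue with unauthenticated endpoints.
    return None
```

With piped input or in CI, `sys.stdin.isatty()` is false, so the function returns `None` rather than blocking on a prompt.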
@@ -210,16 +231,19 @@ if should_prompt and not sys.stdin.isatty():
### Environment Variable Handling

**EOL_USERNAME**:

- Source: CLI `--eol-username` or environment
- Default: None
- Used by: `resolve_credentials()`

**EOL_PASSWORD**:

- Source: CLI `--eol-password` or environment
- Default: None
- Used by: `resolve_credentials()`

**EOL_PROMPT**:

- Source: CLI `--prompt-credentials` or environment
- Default: `false`
- Used by: `resolve_credentials()`
@@ -228,17 +252,18 @@ if should_prompt and not sys.stdin.isatty():
## Benefits

1. **No Unnecessary Prompting**: Credentials are only resolved when actually needed
-2. **TTY Safety**: Prevents hanging in non-interactive contexts (pipes, CI/CD)
-3. **Environment Control**: `EOL_PROMPT` env var provides fine-grained control
-4. **Lazy Resolution**: Faster command startup - no credential resolution overhead
-5. **Clearer Separation**: CLI sets env vars, crawlers resolve from env
-6. **Backward Compatible**: Existing `.env` files with credentials still work
-7. **Non-Breaking**: Existing CLI usage patterns unchanged
-8. **Better Error Handling**: Returns `None` instead of raising when credentials unavailable
+1. **TTY Safety**: Prevents hanging in non-interactive contexts (pipes, CI/CD)
+1. **Environment Control**: `EOL_PROMPT` env var provides fine-grained control
+1. **Lazy Resolution**: Faster command startup - no credential resolution overhead
+1. **Clearer Separation**: CLI sets env vars, crawlers resolve from env
+1. **Backward Compatible**: Existing `.env` files with credentials still work
+1. **Non-Breaking**: Existing CLI usage patterns unchanged
+1. **Better Error Handling**: Returns `None` instead of raising when credentials unavailable

## Testing

### Test Results

- **148 tests passed**
- **3 tests failed** (pre-existing failures unrelated to this change):
  - `test_crawl_collects_tdocs` - asyncio.run() non-coroutine issue
@@ -248,45 +273,55 @@ if should_prompt and not sys.stdin.isatty():
### Manual Verification

**Scenario 1: Credentials in .env**

```bash
EOL_USERNAME=testuser
EOL_PASSWORD=testpass
EOL_PROMPT=false
```

**Expected**: No prompting, credentials used automatically
**Status**: ✓ Working

**Scenario 2: No credentials, EOL_PROMPT=false**

```bash
# No EOL_* variables set
EOL_PROMPT=false
```

**Expected**: No prompting, uses unauthenticated endpoints
**Status**: ✓ Working

**Scenario 3: No credentials, EOL_PROMPT=true, Interactive TTY**

```bash
# No EOL_* variables set
EOL_PROMPT=true
```

**Expected**: Prompts for username and password
**Status**: ✓ Working

**Scenario 4: No credentials, EOL_PROMPT=true, Piped Input**

```bash
# No EOL_* variables set
EOL_PROMPT=true
echo "S4-123456" | tdoc-crawler query-tdocs
```

**Expected**: No prompting (stdin.isatty() == False)
**Status**: ✓ Working

**Scenario 5: CLI args override env vars**

```bash
EOL_USERNAME=envuser
EOL_PASSWORD=envpass
tdoc-crawler crawl-meetings --eol-username cliuser --eol-password clipass
```

**Expected**: Uses CLI args (cliuser/clipass)
**Status**: ✓ Working

@@ -309,16 +344,17 @@ tdoc-crawler crawl-meetings --eol-username cliuser --eol-password clipass
- **Environment variables**: `EOL_USERNAME` and `EOL_PASSWORD` still supported

**Breaking Changes**:

- **Default prompting behavior**: Changed from always prompt to never prompt unless `EOL_PROMPT=true`
- **Removal of sys.stdin.isatty() from CLI**: Moved into `resolve_credentials()` for consistency

## Future Improvements

1. **Enhanced TTY Detection**: Consider additional non-interactive contexts (e.g., Jupyter notebooks)
-2. **Credential Validation**: Add validation for username/password format before attempting authentication
-3. **Credential Caching**: Cache resolved credentials to avoid repeated prompts in long-running sessions
-4. **Multiple Account Support**: Support for different accounts for different working groups
-5. **Prompt Timeout**: Add timeout to credential prompts to avoid indefinite blocking
+1. **Credential Validation**: Add validation for username/password format before attempting authentication
+1. **Credential Caching**: Cache resolved credentials to avoid repeated prompts in long-running sessions
+1. **Multiple Account Support**: Support for different accounts for different working groups
+1. **Prompt Timeout**: Add timeout to credential prompts to avoid indefinite blocking

## Notes

+73 −6
@@ -8,16 +8,68 @@ directory structure as the server.
from __future__ import annotations

import logging
import shutil
import zipfile
from pathlib import Path
from urllib.parse import urlparse

import requests

from tdoc_crawler.models import TDocMetadata

logger = logging.getLogger(__name__)


def _sanitize_path_component(component: str) -> str:
    """Sanitize a path component to be valid on all platforms.

    Removes or replaces characters that are invalid in file/directory names:
    - Windows reserved names (CON, PRN, AUX, NUL, COM1-9, LPT1-9)
    - Invalid characters: < > : " | ? *
    - Special sequences like "..." which can cause issues

    Args:
        component: Path component to sanitize

    Returns:
        Sanitized path component
    """
    if not component:
        return "_"

    # Replace problematic sequences
    sanitized = component.replace("...", "_")

    # Windows reserved names (case-insensitive)
    reserved = {
        "con",
        "prn",
        "aux",
        "nul",
        "com1",
        "com2",
        "com3",
        "com4",
        "com5",
        "com6",
        "com7",
        "com8",
        "com9",
        "lpt1",
        "lpt2",
        "lpt3",
        "lpt4",
        "lpt5",
        "lpt6",
        "lpt7",
        "lpt8",
        "lpt9",
    }
    if sanitized.lower() in reserved:
        sanitized = f"_{sanitized}"

    return sanitized


def get_checkout_path(metadata: TDocMetadata, checkout_dir: Path) -> Path:
    """Calculate the checkout path for a TDoc based on its URL.

@@ -32,7 +84,14 @@ def get_checkout_path(metadata: TDocMetadata, checkout_dir: Path) -> Path:

    Returns:
        Path to the checkout directory for this TDoc

    Raises:
        ValueError: If the URL is invalid or contains placeholder patterns
    """
    # Validate URL before processing
    if not metadata.is_valid:
        raise ValueError(f"Invalid or corrupt URL for TDoc {metadata.tdoc_id}: {metadata.url}")

    url_path = urlparse(metadata.url).path

    # Normalize the path: remove leading slash and split into components
@@ -52,8 +111,11 @@ def get_checkout_path(metadata: TDocMetadata, checkout_dir: Path) -> Path:
    if relative_parts:
        relative_parts = relative_parts[:-1]

    # Sanitize path components to avoid invalid directory names
    sanitized_parts = [_sanitize_path_component(part) for part in relative_parts if part]

    # Build the checkout path: checkout_dir / path / tdoc_id
-    checkout_path = checkout_dir.joinpath(*relative_parts) / metadata.tdoc_id if relative_parts else checkout_dir / metadata.tdoc_id
+    checkout_path = checkout_dir.joinpath(*sanitized_parts) / metadata.tdoc_id if sanitized_parts else checkout_dir / metadata.tdoc_id

    return checkout_path

@@ -120,8 +182,6 @@ def _download_file(url: str, destination: Path) -> None:
        ValueError: If URL scheme is not supported
        FileNotFoundError: If download fails
    """
-    from urllib.request import urlopen
-
    destination.parent.mkdir(parents=True, exist_ok=True)

    # Validate URL scheme
@@ -131,8 +191,15 @@ def _download_file(url: str, destination: Path) -> None:
        raise ValueError(f"unsupported-url-scheme: {url}")

    try:
-        with urlopen(url, timeout=300) as response, destination.open("wb") as target:  # noqa: S310
-            shutil.copyfileobj(response, target)
+        response = requests.get(url, timeout=300, stream=True)  # noqa: S113
+        response.raise_for_status()
+        with destination.open("wb") as target:
+            for chunk in response.iter_content(chunk_size=8192):
+                if chunk:
+                    target.write(chunk)
+    except requests.exceptions.HTTPError as exc:
+        status_code = exc.response.status_code if exc.response is not None else "unknown"
+        raise FileNotFoundError(f"failed-to-download ({status_code}): {url}") from exc
    except Exception as exc:
        raise FileNotFoundError(f"failed-to-download: {url}") from exc
