Commit 6f021326 authored by Jan Reimes

chore(docs): remove outdated documentation files and update links

* Deleted 200_Project_Domain_3GPP_TDocs_Meetings.md and 400_Crawling_and_Portal_Patterns.md
* Updated references in 300_Repo_Architecture_and_Key_Files.md and 800_Documentation.md
* Adjusted documentation focus to emphasize 3GPP domain knowledge
* Updated pyproject.toml to use hatchling and dynamic versioning
parent d11f9a60
+0 −94
# Project Domain: 3GPP, Meetings, and TDocs (Agent Notes)

This document provides the minimum domain context required to safely modify crawling, parsing, and storage logic.

## High-level concepts

- **3GPP** produces standards via technical groups. Technical administration and infrastructure are provided by **ETSI**.
- **Meetings** are the unit that groups documents. A meeting has a 3GPP portal meeting ID (integer) and often a “Files” directory URL.
- **TDocs** (Temporary Documents) are meeting contributions that live in a directory tree on the 3GPP web server.

## Where TDocs live (HTTP “FTP” tree)

TDocs are stored on the 3GPP web server under a directory structure like:

- `https://www.3gpp.org/ftp/tsg_<working_group_identifier>/<sub-working_group_identifier>/<meeting_identifier>/Docs/<tdoc_id>.zip`

Notes:

- The server is HTTP(S) accessible; “FTP server” is historical terminology.
- The `<sub-working_group_identifier>` and `<meeting_identifier>` parts are directory names with no stable semantics; do not parse meaning from them.
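
As a minimal sketch of the URL layout above, the helper below joins the path parts without interpreting them. The function name `build_tdoc_url` and the example directory names are illustrative assumptions, not part of the crawler.

```python
from urllib.parse import quote

BASE_URL = "https://www.3gpp.org/ftp"


def build_tdoc_url(working_group: str, subgroup_dir: str, meeting_dir: str, tdoc_id: str) -> str:
    """Join the directory parts into a TDoc URL (hypothetical helper).

    The subgroup and meeting directory names are treated as opaque strings:
    they are percent-encoded and joined, never parsed for meaning.
    """
    parts = [f"tsg_{working_group}", subgroup_dir, meeting_dir, "Docs", f"{tdoc_id}.zip"]
    return "/".join([BASE_URL, *(quote(part) for part in parts)])


# Directory names below are example values only.
url = build_tdoc_url("ran", "WG1_RL1", "TSGR1_112", "R1-2301234")
```

In practice the directory names would come from a meeting's "Files" link rather than being assembled by hand.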

## TDoc identifiers

A TDoc identifier (the filename stem) is the database primary identifier.

### Filename pattern

The crawler uses this simplified regex:

```python
TDOC_PATTERN = re.compile(r"([RSC][1-6P].{4,10})\.(zip|txt|pdf)", re.IGNORECASE)
```

Implications:

- First character: `R` (RAN), `S` (SA), `C` (CT)
- Second character: subgroup `1` to `6` or plenary `P`
- Most TDocs are `.zip`; `.txt` and `.pdf` are rare but supported
- Treat IDs as case-insensitive in the CLI and database (normalize to uppercase when creating IDs)

### Examples

- Matches: `R1-2301234.zip`, `S4aA220001.zip`, `CP-123456.zip`
- Does not match: `R7-123456.zip` (invalid subgroup), `R1-12.zip` (too short), `README.txt` (not a TDoc)
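
The examples above can be checked with a short script. This is a sketch only: the helper name `extract_tdoc_id` is hypothetical, and the use of `fullmatch` (rather than `search` or `match`) is an assumption about how the pattern is applied.

```python
from __future__ import annotations

import re

TDOC_PATTERN = re.compile(r"([RSC][1-6P].{4,10})\.(zip|txt|pdf)", re.IGNORECASE)


def extract_tdoc_id(filename: str) -> str | None:
    """Return the uppercase TDoc ID for a matching filename, else None."""
    match = TDOC_PATTERN.fullmatch(filename)  # assumption: whole filename must match
    if match is None:
        return None
    return match.group(1).upper()  # normalize IDs to uppercase


for name in ["R1-2301234.zip", "S4aA220001.zip", "CP-123456.zip",
             "R7-123456.zip", "R1-12.zip", "README.txt"]:
    print(name, "->", extract_tdoc_id(name))
```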

## Working groups and meeting code pages

Meeting lists come from the “dynareport” pages:

- `https://www.3gpp.org/dynareport?code=Meetings-<ID>.htm`

Where `<ID>` is one of the codes below:

| ID | Title | tbid | SubTB |
|----|-------------|------|-------|
| SP | SA Plenary | 375 | 375 |
| S1 | SA1 | 375 | 384 |
| S2 | SA2 | 375 | 385 |
| S3 | SA3 | 375 | 386 |
| S4 | SA4 | 375 | 387 |
| S5 | SA5 | 375 | 388 |
| S6 | SA6 | 375 | 825 |
| CP | CT Plenary | 649 | 649 |
| C1 | CT1 | 649 | 651 |
| C2 | CT2 | 649 | 652 |
| C3 | CT3 | 649 | 653 |
| C4 | CT4 | 649 | 654 |
| C5 | CT5 | 649 | 655 |
| C6 | CT6 | 649 | 656 |
| RP | RAN Plenary | 373 | 373 |
| R1 | RAN1 | 373 | 379 |
| R2 | RAN2 | 373 | 380 |
| R3 | RAN3 | 373 | 381 |
| R4 | RAN4 | 373 | 382 |
| R5 | RAN5 | 373 | 657 |
| R6 | RAN6 | 373 | 843 |

Operational notes:

- The `tbid`/`SubTB` values are used as primary keys for reference tables.
- A meeting row may have an empty “Files” link; those meetings should be skipped by TDoc crawling.
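
A minimal sketch of how the meeting-list URL and the skip rule might look in code. The names `meetings_page_url` and `crawlable_meetings`, and the representation of a meeting row as a dict with a `files_url` key, are assumptions for illustration.

```python
# The 21 codes from the table: S/C/R combined with P (plenary) or 1..6.
MEETING_CODES = {f"{group}{sub}" for group in "SCR" for sub in "P123456"}


def meetings_page_url(code: str) -> str:
    """Build the dynareport URL for one working-group code (hypothetical helper)."""
    code = code.upper()
    if code not in MEETING_CODES:
        raise ValueError(f"unknown meeting code: {code!r}")
    return f"https://www.3gpp.org/dynareport?code=Meetings-{code}.htm"


def crawlable_meetings(meetings: list[dict]) -> list[dict]:
    """Keep only meetings that have a non-empty "Files" link."""
    return [meeting for meeting in meetings if meeting.get("files_url")]
```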

## Portal metadata lookup (authenticated)

When a TDoc ID is known, the portal page can be queried for metadata:

- `https://portal.3gpp.org/ngppapp/CreateTdoc.Aspx?mode=view&contributionUid=<tdoc_id>`

Some metadata requires an ETSI Online Account (EOL). The crawler should treat authentication failures as non-fatal for discovery (log and continue).
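
One way to keep authentication failures non-fatal is to catch them per TDoc, log, and move on. The sketch below assumes a hypothetical `PortalAuthError` exception and a caller-supplied `fetch` function; neither name comes from the crawler itself.

```python
from __future__ import annotations

import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


class PortalAuthError(Exception):
    """Hypothetical error raised when the portal rejects or lacks credentials."""


def collect_metadata(
    tdoc_ids: Iterable[str],
    fetch: Callable[[str], dict],
) -> dict[str, dict | None]:
    """Fetch metadata per TDoc; authentication failures are logged, not raised."""
    results: dict[str, dict | None] = {}
    for tdoc_id in tdoc_ids:
        try:
            results[tdoc_id] = fetch(tdoc_id)
        except PortalAuthError as exc:
            logger.warning("Portal auth failed for %s: %s", tdoc_id, exc)
            results[tdoc_id] = None  # discovery continues without metadata
    return results
```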

## Links

- Human-facing CLI behavior: [docs/QUICK_REFERENCE.md](../QUICK_REFERENCE.md)
- Agent implementation patterns: [docs/agents-md/400_Crawling_and_Portal_Patterns.md](400_Crawling_and_Portal_Patterns.md)
+1 −2
@@ -50,6 +50,5 @@ When changing behavior, update code first, then update documentation:

## Links

- Domain context: [docs/agents-md/200_Project_Domain_3GPP_TDocs_Meetings.md](200_Project_Domain_3GPP_TDocs_Meetings.md)
- DB schema invariants: [docs/agents-md/320_Database_Schema_and_Invariants.md](320_Database_Schema_and_Invariants.md)
- Crawling patterns: [docs/agents-md/400_Crawling_and_Portal_Patterns.md](400_Crawling_and_Portal_Patterns.md)
- 3GPP domain knowledge: Use skills in `.config/skills/3gpp/` (see AGENTS.md for skill list)
+0 −69
# Crawling and Portal Patterns (Agent Notes)

This document captures implementation patterns that must remain consistent across refactors.

## Meeting crawling (portal dynareport)

- Meeting metadata is scraped from `https://www.3gpp.org/dynareport?code=Meetings-<ID>.htm` pages.
- Each meeting row may provide:
  - A meeting detail link containing an integer `MtgId` (portal meeting ID)
  - A “Files” link to the meeting’s HTTP directory (used for TDoc discovery)
- Meetings with no “Files” link must be skipped for TDoc crawling.

## TDoc discovery from meeting files directories

### Directory traversal

- Start from `meeting.files_url` and ensure it ends with `/`.
- Prefer scanning known document subdirectories first (e.g., `Docs/`, `Documents/`).
- If no subdirectories match, scan the base directory.
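
The traversal order above could be sketched as follows. The function name `directories_to_scan` and the case-insensitive comparison are assumptions; the subdirectory names are the two examples given in the list.

```python
from __future__ import annotations

from typing import Iterable

KNOWN_SUBDIRS = ("Docs/", "Documents/")


def directories_to_scan(files_url: str, listed_subdirs: Iterable[str]) -> list[str]:
    """Return the URLs to scan: known document subdirectories first, else the base."""
    base = files_url if files_url.endswith("/") else files_url + "/"
    # Assumption: directory names are compared case-insensitively.
    listed = {name.rstrip("/").lower() for name in listed_subdirs}
    matches = [base + sub for sub in KNOWN_SUBDIRS if sub.rstrip("/").lower() in listed]
    return matches or [base]
```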

### Candidate filtering

- Parse HTML directory listings and identify candidates via `TDOC_PATTERN`.
- Ignore obvious non-content directories (drafts, inbox, agenda, etc.) using the crawler’s exclusion set.
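
A rough sketch of candidate filtering using only the standard library. The contents of `EXCLUDED_DIRS`, the helper name `find_candidates`, and the use of `fullmatch` on the last path segment are assumptions; the crawler's actual exclusion set may differ.

```python
from __future__ import annotations

import re
from html.parser import HTMLParser

TDOC_PATTERN = re.compile(r"([RSC][1-6P].{4,10})\.(zip|txt|pdf)", re.IGNORECASE)
EXCLUDED_DIRS = {"drafts", "inbox", "agenda"}  # assumed subset of the exclusion set


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a directory listing."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for key, value in attrs if key == "href" and value)


def find_candidates(listing_html: str) -> list[str]:
    """Return sorted, uppercase TDoc IDs found in an HTML directory listing."""
    collector = HrefCollector()
    collector.feed(listing_html)
    found = set()
    for href in collector.hrefs:
        segments = [segment for segment in href.split("/") if segment]
        if not segments:
            continue
        # Skip links that sit below an excluded directory.
        if any(segment.lower() in EXCLUDED_DIRS for segment in segments[:-1]):
            continue
        match = TDOC_PATTERN.fullmatch(segments[-1])
        if match:
            found.add(match.group(1).upper())
    return sorted(found)
```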

## Portal validation and metadata

Once a candidate TDoc ID is discovered, metadata is fetched from the portal page:

- `https://portal.3gpp.org/ngppapp/CreateTdoc.Aspx?mode=view&contributionUid=<tdoc_id>`

Agent-facing rules:

- Authentication may be required; lack of credentials should not crash discovery.
- Use negative caching:
  - If a portal lookup fails for a TDoc, record `validation_failed=True` to avoid repeating expensive failures.
  - Provide a force mode to override negative cache.
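
The negative-cache rule can be illustrated with an in-memory dictionary standing in for the database. Everything here except the `validation_failed` field name is an assumption, including the function name `validate_tdoc` and the broad exception handling.

```python
from __future__ import annotations

from typing import Callable


def validate_tdoc(
    tdoc_id: str,
    records: dict[str, dict],
    lookup: Callable[[str], dict],
    force: bool = False,
) -> dict | None:
    """Look up portal metadata, remembering failures to avoid repeating them."""
    key = tdoc_id.upper()
    record = records.setdefault(key, {"validation_failed": False})
    if record["validation_failed"] and not force:
        return None  # negative cache hit: skip the expensive lookup
    try:
        metadata = lookup(key)
    except Exception:  # a real crawler would catch narrower error types
        record["validation_failed"] = True
        return None
    record["validation_failed"] = False
    record["metadata"] = metadata
    return metadata
```

Passing `force=True` ignores a previously recorded failure and retries the lookup.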

## Meeting name resolution

The portal meeting label may differ from the meeting name stored in the database.

Resolution strategy should be multi-stage (in order):

1. Exact match (case-insensitive)
1. Normalized match (e.g., replace `#` with `-`, normalize `SA4` → `S4`)
1. Prefix/suffix matching for common portal vs stored naming variants
1. Edit-distance fallback (use carefully to avoid false positives)

Avoid substring “contains” matching that can create false positives.
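
The staged strategy might be sketched as below. The normalization rules beyond the two examples given (for instance mapping `RAN1` to `R1` and `CT3` to `C3`), the uniqueness requirement in stages 3 and 4, and the `max_distance` default are all assumptions.

```python
from __future__ import annotations

import re

_GROUP_RE = re.compile(r"\b(SA|CT|RAN)(?=\d)")


def normalize(name: str) -> str:
    """Uppercase, replace '#' with '-', and shorten SA4 -> S4 (likewise CT, RAN)."""
    text = name.strip().upper().replace("#", "-")
    return _GROUP_RE.sub(lambda m: m.group(1)[0], text)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def resolve_meeting(label: str, stored_names: list[str], max_distance: int = 1) -> str | None:
    """Resolve a portal label to a stored meeting name, or None if ambiguous."""
    # Stage 1: exact match, case-insensitive.
    for name in stored_names:
        if name.lower() == label.lower():
            return name
    target = normalize(label)
    normalized = {name: normalize(name) for name in stored_names}
    # Stage 2: normalized match.
    for name, norm in normalized.items():
        if norm == target:
            return name
    # Stage 3: prefix/suffix match, accepted only when exactly one name qualifies.
    hits = [
        name for name, norm in normalized.items()
        if target and norm and (
            norm.startswith(target) or norm.endswith(target)
            or target.startswith(norm) or target.endswith(norm)
        )
    ]
    if len(hits) == 1:
        return hits[0]
    # Stage 4: edit-distance fallback with a tight threshold and a unique winner.
    close = [name for name, norm in normalized.items()
             if edit_distance(norm, target) <= max_distance]
    return close[0] if len(close) == 1 else None
```

Note that meeting numbers one digit apart are also one edit apart, so the fallback threshold should stay small and ambiguous results should resolve to no match.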

## Progress reporting contract

Database bulk upsert methods should accept an optional progress callback:

- Signature: `Callable[[float, float], None]` with `(completed, total)`
- Callback is invoked after each successful upsert
- CLI layer provides UI (progress bar); crawler layer remains UI-agnostic
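
A minimal sketch of the callback contract, with a plain dictionary standing in for the database. The function name `bulk_upsert` and the `tdoc_id` row key are illustrative assumptions.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional

ProgressCallback = Callable[[float, float], None]


def bulk_upsert(
    rows: Iterable[dict],
    store: dict[str, dict],
    progress: Optional[ProgressCallback] = None,
) -> int:
    """Upsert rows into `store`, reporting (completed, total) after each one."""
    rows = list(rows)
    total = float(len(rows))
    for completed, row in enumerate(rows, start=1):
        store[row["tdoc_id"]] = row  # insert or overwrite by primary identifier
        if progress is not None:
            progress(float(completed), total)
    return len(rows)
```

The CLI layer would pass a callback that advances a progress bar, while the crawler and database layers only ever see the plain callable.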

## HTTP session and caching

- Use the shared cached session factory in [src/tdoc_crawler/http_client.py](../../src/tdoc_crawler/http_client.py).
- Caching behavior (TTL, refresh) is configured via CLI/env/defaults.

## Where to look in code

- Meeting crawler: [src/tdoc_crawler/crawlers/meetings.py](../../src/tdoc_crawler/crawlers/meetings.py)
- TDoc crawler: [src/tdoc_crawler/crawlers/tdocs.py](../../src/tdoc_crawler/crawlers/tdocs.py)
- Portal session: [src/tdoc_crawler/crawlers/portal.py](../../src/tdoc_crawler/crawlers/portal.py)
+1 −1
@@ -62,7 +62,7 @@ This repository maintains separate documentation for humans and for coding agent

Focus updates on these key areas:

- **Domain/crawling changes** → `200_*` and `400_*`
- **3GPP domain/crawling changes** → Use skills in `.config/skills/3gpp/`
- **Architecture changes** → `300_*`
- **Database contract changes** → `320_*`
- **Engineering standards** → `600_*`
+1 −3
@@ -25,10 +25,8 @@ The documentation uses a numbered system for easy extension and logical organiza
- `110_Tool_beads.md` - Tool usage: beads issue tracking
- `120_Tool_memorygraph.md` - Tool usage: memory-graph
- `130_Tool_sequential_thinking.md` - Tool usage: sequential-thinking
- `200_Project_Domain_3GPP_TDocs_Meetings.md` - Domain context for 3GPP meetings and TDocs
- `300_Repo_Architecture_and_Key_Files.md` - Repository architecture and “source of truth” map
- `300_Repo_Architecture_and_Key_Files.md` - Repository architecture and "source of truth" map
- `320_Database_Schema_and_Invariants.md` - Database schema contract and invariants
- `400_Crawling_and_Portal_Patterns.md` - Crawling, portal validation, and resolution patterns
- `600_Engineering_Standards.md` - Engineering standards for agents (workflow, style, constraints)
- `700_Testing_and_Mocking_Patterns.md` - Testing patterns and mocking boundaries
- `800_Documentation.md` - Documentation requirements and guidelines