Commit 368d2b59 authored by Jan Reimes

Add workspace docs with extraction engine table

docs/workspace.md covers:
- All workspace CLI commands with examples
- Extraction engine table per profile (pdf-only, markdown-only,
  default, advanced) showing which engine does what
- Engine boundaries (Docling not in markdown-only/pdf-only)
- --docx-direct flag and its scope
- Image handling per engine
- Spec source directory naming convention
- Idempotent processing guarantees
- Full workflow example
parent f3a3ff1f
+1 −1
@@ -103,7 +103,7 @@ NOTE: If no credentials are provided, the tool will prompt you interactively
| `checkout` | | Batch download technical specifications |
| **3gpp-crawler** | | |
| `config {init,show,validate,docs}` | | Manage configuration |
| `workspace {create,list,...}` | | Manage workspaces and processing |
| `workspace {create,list,...}` | | Manage workspaces and processing (see [Workspace Docs](docs/workspace.md)) |

### 1. Crawl Metadata

+2 −0
@@ -8,6 +8,7 @@ Welcome to the documentation for **3gpp-crawler**, a command-line tool for query

- [**Crawl Documentation**](crawl.md) – How to fetch metadata from 3GPP servers and portal.
- [**Query Documentation**](query.md) – How to search and display stored metadata.
- [**Workspace Documentation**](workspace.md) – Workspace management, extraction profiles, and engine architecture.
- [**Utility Documentation**](utils.md) – File access, spec handling, and database inspection.
- [**WhatIsWhatTheSpec**](whatthespec.md) – Understanding the primary WhatTheSpec data source and 3GPP fallback.
- [**Development Guide**](development.md) – Setup, testing, and contribution guidelines.
@@ -22,6 +23,7 @@ Welcome to the documentation for **3gpp-crawler**, a command-line tool for query
- [**Query-TDocs**](query.md#query-tdocs-alias-qt) (`qt`)
- [**Query-Meetings**](query.md#query-meetings-alias-qm) (`qm`)
- [**Query-Specs**](query.md#query-specs) (`qs`)
- [**Workspace**](workspace.md#extraction-engine-by-profile)
- [**Open TDoc**](utils.md#open)
- [**Checkout Specs**](utils.md#checkout-spec)

docs/workspace.md

0 → 100644
+170 −0
# Workspace Management

Workspaces organize document extraction artifacts under `~/.3gpp-crawler/wiki/<workspace>/sources/<doc-id>/`. Each workspace member (TDoc, Spec, or other document) gets a dedicated source directory with extracted Markdown, JSON, and PDF files.

## Commands

### `workspace create`

Create a new workspace.

```bash
3gpp-crawler workspace create my-project
```

### `workspace list`

Display all existing workspaces.

```bash
3gpp-crawler workspace list
```

### `workspace activate`

Set a workspace as the default target for subsequent commands.

```bash
3gpp-crawler workspace activate my-project
```

### `workspace add`

Add documents to a workspace. Accepts TDoc IDs and spec numbers.

```bash
# Add a TDoc
3gpp-crawler workspace add S4-250638

# Add a spec (resolves latest release by default)
3gpp-crawler workspace add 26260 --kind spec

# Add a spec with explicit release
3gpp-crawler workspace add 26260 --kind spec --release 18.0

# Add multiple items
3gpp-crawler workspace add 26260 26261 --kind spec --release 18.0
```

Spec members added without `--release` resolve to the latest available version from the database. If the database has no version information, the spec metadata is auto-crawled from 3GPP first.

### `workspace members`

List all members of a workspace.

```bash
3gpp-crawler workspace members
```

### `workspace process`

Extract structured data from all workspace members. This is the core pipeline: each member's source document is downloaded, converted to PDF where the profile requires it, and then extracted to Markdown (plus JSON for the Docling-based `default` and `advanced` profiles).

```bash
# Process with markdown-only profile (pymupdf4llm, fastest)
3gpp-crawler workspace process --profile markdown-only

# Process with Docling (default, includes JSON output)
3gpp-crawler workspace process --profile default

# Force re-extract even if output exists
3gpp-crawler workspace process --force
```

By default (`--skip-existing`), existing output is preserved. Use `--force` to overwrite.

### `workspace clear-invalid`

Remove members whose source path no longer exists on disk.

```bash
3gpp-crawler workspace clear-invalid
```

### `workspace delete`

Permanently delete a workspace and all associated files.

```bash
3gpp-crawler workspace delete my-project --force
```

---

## Extraction Engine by Profile

The extraction pipeline selects the appropriate engine based on the chosen profile:

| Profile | PDF step | Extraction engine | Output files | Use case |
|---------|----------|-------------------|--------------|----------|
| **pdf-only** | LibreOffice (if not already PDF) | — | `.pdf` | Raw document, no extraction |
| **markdown-only** | LibreOffice → PDF (always) | **pymupdf4llm** `to_markdown()` | `.md` | Fast layout-aware Markdown, no ML, good for most docs |
| **default** | LibreOffice → PDF (unless `--docx-direct`) | **Docling** | `.md` + `.json` | Structured extraction with tables, figures, metadata |
| **advanced** | LibreOffice → PDF (unless `--docx-direct`) | **Docling** | `.md` + `.json` | Same as default + picture descriptions, code/formula enrichment |

### Engine boundaries

- **Docling is NEVER involved** in `markdown-only` or `pdf-only` profiles.
- **pymupdf4llm runs only** in `markdown-only`.
- **LibreOffice** is the universal PDF converter, used by every profile when the source file is an Office format (`.docx`, `.doc`, `.pptx`), except where `--docx-direct` skips it for Docling profiles.

### `--docx-direct` flag

Skips the LibreOffice → PDF conversion step for **default** and **advanced** profiles when the source is a `.docx`/`.doc` file. The document is fed straight to Docling.

> **Note:** `--docx-direct` has **no effect** on `markdown-only`: pymupdf4llm requires PDF input, so Office sources are always converted by LibreOffice first.
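
The profile table, engine boundaries, and `--docx-direct` scope above can be summarized as a small decision function. This is a minimal sketch, not the tool's actual code: `ENGINE_BY_PROFILE` and `plan_steps` are hypothetical names, and the step labels are plain strings chosen for illustration.

```python
from pathlib import Path

# Office formats that need LibreOffice before any PDF-based step.
OFFICE_SUFFIXES = {".docx", ".doc", ".pptx"}
# Formats that --docx-direct may hand straight to Docling.
DOCX_DIRECT_SUFFIXES = {".docx", ".doc"}

# Hypothetical lookup table mirroring the profile table above.
ENGINE_BY_PROFILE = {
    "pdf-only": None,
    "markdown-only": "pymupdf4llm",
    "default": "docling",
    "advanced": "docling",
}


def plan_steps(profile: str, source: str, docx_direct: bool = False) -> list:
    """Return the ordered processing steps for one source file."""
    if profile not in ENGINE_BY_PROFILE:
        raise ValueError(f"unknown profile: {profile!r}")
    engine = ENGINE_BY_PROFILE[profile]
    suffix = Path(source).suffix.lower()

    needs_pdf = suffix in OFFICE_SUFFIXES
    # --docx-direct only applies to Docling profiles with .docx/.doc input.
    if docx_direct and engine == "docling" and suffix in DOCX_DIRECT_SUFFIXES:
        needs_pdf = False

    steps = []
    if needs_pdf:
        steps.append("libreoffice")
    if engine is not None:
        steps.append(engine)
    return steps
```

For example, `plan_steps("markdown-only", "S4-250638.docx", docx_direct=True)` still includes the LibreOffice step, while the same call with `"default"` goes straight to Docling.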

---

## Image handling

Controlled by the `--figures` option:

| `--figures` value | markdown-only (pymupdf4llm) | default/advanced (Docling) |
|-------------------|-----------------------------|----------------------------|
| `embed` (default) | Images embedded as base64 in `.md` | `ImageRefMode.PLACEHOLDER` — images in JSON |
| `reference` | Images extracted to `./media/` next to `.md` | `ImageRefMode.REFERENCED` — images referenced from JSON |
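
For the `markdown-only` column, one plausible way to map `--figures` onto pymupdf4llm is through the `embed_images`, `write_images`, and `image_path` keyword arguments of `to_markdown()`. Whether 3gpp-crawler does it this way is an assumption; the helper name `pymupdf4llm_image_kwargs` is hypothetical, and the sketch only builds the keyword arguments without calling the library.

```python
from pathlib import PurePosixPath


def pymupdf4llm_image_kwargs(figures: str, md_path: str) -> dict:
    """Translate a --figures value into keyword arguments for to_markdown().

    Assumed mapping, not confirmed by the source:
      embed     -> images inlined as base64 in the Markdown
      reference -> images written to a media/ folder next to the .md file
    """
    if figures == "embed":
        return {"embed_images": True}
    if figures == "reference":
        media_dir = PurePosixPath(md_path).parent / "media"
        return {"write_images": True, "image_path": str(media_dir)}
    raise ValueError(f"unsupported --figures value: {figures!r}")
```

The returned dictionary could then be passed along as `pymupdf4llm.to_markdown(pdf_path, **kwargs)`.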

---

## Spec source directory naming

Spec source directories include the release version suffix:

```
~/.3gpp-crawler/wiki/my-project/sources/26260-REL18.0.0/
~/.3gpp-crawler/wiki/my-project/sources/26131-REL19.0.0/
```

Members added with `--release 18.0` use that release. Members added without `--release` resolve to the latest available version; if the database has no version data, the spec metadata is auto-crawled from 3GPP before extraction.
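
The naming convention can be sketched as a path builder. This is an illustration under assumptions: `source_dir` is a hypothetical helper, and it expects an already resolved three-part version such as `18.0.0`; how `--release 18.0` is expanded to that version is not covered here.

```python
from pathlib import PurePosixPath
from typing import Optional


def source_dir(root: str, workspace: str, doc_id: str,
               version: Optional[str] = None) -> PurePosixPath:
    """Build the source directory for a workspace member.

    Specs carry a -REL<version> suffix; TDocs use the bare document ID.
    """
    name = f"{doc_id}-REL{version}" if version else doc_id
    return PurePosixPath(root) / "wiki" / workspace / "sources" / name


# A spec pinned to version 18.0.0 and a TDoc in the same workspace.
print(source_dir("~/.3gpp-crawler", "my-project", "26260", "18.0.0"))
print(source_dir("~/.3gpp-crawler", "my-project", "S4-250638"))
```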

---

## Idempotent processing

`workspace process` is idempotent by default (`--skip-existing`):

1. **Fast-path skip**: Before entering `convert_for_wiki`, `_should_skip_member()` checks if output artifacts already exist on disk.
2. **Per-profile guard**: Inside `convert_for_wiki`, each profile checks for existing output (`.md` for markdown-only, `.md` + `.json` for Docling) and returns early.
3. **Override**: `--force` re-extracts regardless of existing output.

---

## Example workflow

```bash
# 1. Create and activate a workspace
3gpp-crawler workspace create my-project
3gpp-crawler workspace activate my-project

# 2. Add documents
3gpp-crawler workspace add S4-250638
3gpp-crawler workspace add 26260 --kind spec --release 18.0

# 3. Extract
3gpp-crawler workspace process --profile markdown-only

# Output lands in:
#   ~/.3gpp-crawler/wiki/my-project/sources/S4-250638/
#   ~/.3gpp-crawler/wiki/my-project/sources/26260-REL18.0.0/
```