Add workspace docs with extraction engine table (368d2b59) · Commits · Jan Reimes / 3gpp-crawler

README.md

+1 −1

Original line number	Diff line number	Diff line
		@@ -103,7 +103,7 @@ NOTE: If no credentials are provided, the tool will prompt you interactively
		| `checkout` | | Batch download technical specifications |
		| 3gpp-crawler | | |
		| `config {init,show,validate,docs}` | | Manage configuration |
		| `workspace {create,list,...}` | | Manage workspaces and processing |
		| `workspace {create,list,...}` | | Manage workspaces and processing (see [Workspace Docs](docs/workspace.md)) |

		### 1. Crawl Metadata

docs/index.md

+2 −0

		@@ -8,6 +8,7 @@ Welcome to the documentation for 3gpp-crawler, a command-line tool for query

		- [Crawl Documentation](crawl.md) – How to fetch metadata from 3GPP servers and portal.
		- [Query Documentation](query.md) – How to search and display stored metadata.
		- [Workspace Documentation](workspace.md) – Workspace management, extraction profiles, and engine architecture.
		- [Utility Documentation](utils.md) – File access, spec handling, and database inspection.
		- [WhatIsWhatTheSpec](whatthespec.md) – Understanding the primary WhatTheSpec data source and 3GPP fallback.
		- [Development Guide](development.md) – Setup, testing, and contribution guidelines.
		@@ -22,6 +23,7 @@ Welcome to the documentation for 3gpp-crawler, a command-line tool for query
		- [Query-TDocs](query.md#query-tdocs-alias-qt) (`qt`)
		- [Query-Meetings](query.md#query-meetings-alias-qm) (`qm`)
		- [Query-Specs](query.md#query-specs) (`qs`)
		- [Workspace](workspace.md#extraction-engine-by-profile)
		- [Open TDoc](utils.md#open)
		- [Checkout Specs](utils.md#checkout-spec)

docs/workspace.md

0 → 100644

+170 −0

		# Workspace Management

		Workspaces organize document extraction artifacts under `~/.3gpp-crawler/wiki/<workspace>/sources/<doc-id>/`. Each workspace member (TDoc, spec, or other document) gets a dedicated source directory holding its extracted Markdown, JSON, and PDF files; which of these exist depends on the processing profile.

		## Commands

		### `workspace create`

		Create a new workspace.

		```bash
		3gpp-crawler workspace create my-project
		```

		### `workspace list`

		Display all existing workspaces.

		```bash
		3gpp-crawler workspace list
		```

		### `workspace activate`

		Set a workspace as the default target for subsequent commands.

		```bash
		3gpp-crawler workspace activate my-project
		```

		### `workspace add`

		Add documents to a workspace. Accepts TDoc IDs and spec numbers.

		```bash
		# Add a TDoc
		3gpp-crawler workspace add S4-250638

		# Add a spec (resolves latest release by default)
		3gpp-crawler workspace add 26260 --kind spec

		# Add a spec with explicit release
		3gpp-crawler workspace add 26260 --kind spec --release 18.0

		# Add multiple items
		3gpp-crawler workspace add 26260 26261 --kind spec --release 18.0
		```

		Spec members added without `--release` resolve to the latest available version in the database. If the database has no version information for the spec, its metadata is auto-crawled from 3GPP first.
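
The resolution order described above can be sketched roughly as follows. This is an illustration only, not the tool's actual code: `resolve_release`, `parse_version`, and the `crawl` callback are hypothetical names, and the sketch assumes versions are dotted numeric strings such as `18.0.0`.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '18.0' or '18.0.0' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve_release(
    spec: str,
    requested: Optional[str],
    known_versions: Sequence[str],
    crawl: Callable[[str], Sequence[str]],
) -> str:
    """Pick the release for a spec member (hypothetical helper).

    1. An explicit --release value wins.
    2. Otherwise use the highest version already in the database.
    3. If the database has none, crawl for versions and use the highest.
    """
    if requested is not None:
        return requested
    versions = list(known_versions) or list(crawl(spec))
    if not versions:
        raise LookupError(f"no versions found for spec {spec}")
    return max(versions, key=parse_version)
```

Comparing parsed tuples rather than raw strings keeps `9.1.0` from sorting above `18.0.0`.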

		### `workspace members`

		List all members of a workspace.

		```bash
		3gpp-crawler workspace members
		```

		### `workspace process`

		Extract structured data from all workspace members. This is the core pipeline: each member's source document is downloaded, converted to PDF where the profile requires it, and then extracted to Markdown (plus JSON for the Docling-based `default` and `advanced` profiles). The `pdf-only` profile stops after the PDF step.

		```bash
		# Process with markdown-only profile (pymupdf4llm, fastest)
		3gpp-crawler workspace process --profile markdown-only

		# Process with Docling (default, includes JSON output)
		3gpp-crawler workspace process --profile default

		# Force re-extract even if output exists
		3gpp-crawler workspace process --force
		```

		By default (`--skip-existing`), existing output is preserved. Use `--force` to overwrite.

		### `workspace clear-invalid`

		Remove members whose source path no longer exists on disk.

		```bash
		3gpp-crawler workspace clear-invalid
		```

		### `workspace delete`

		Permanently delete a workspace and all associated files.

		```bash
		3gpp-crawler workspace delete my-project --force
		```

		---

		## Extraction Engine by Profile

		The extraction pipeline selects the appropriate engine based on the chosen profile:

		| Profile | PDF step | Extraction engine | Output files | Use case |
		|---------|----------|-------------------|--------------|----------|
		| `pdf-only` | LibreOffice (if not already PDF) | none | `.pdf` | Raw document, no extraction |
		| `markdown-only` | LibreOffice → PDF (always) | pymupdf4llm `to_markdown()` | `.md` | Fast layout-aware Markdown, no ML, good for most docs |
		| `default` | LibreOffice → PDF (unless `--docx-direct`) | Docling | `.md` + `.json` | Structured extraction with tables, figures, metadata |
		| `advanced` | LibreOffice → PDF (unless `--docx-direct`) | Docling | `.md` + `.json` | Same as `default`, plus picture descriptions and code/formula enrichment |

		### Engine boundaries

		- Docling is NEVER involved in `markdown-only` or `pdf-only` profiles.
		- pymupdf4llm only runs in `markdown-only`.
		- LibreOffice is the PDF converter for every profile when the source file is an Office format (`.docx`, `.doc`, `.pptx`). The only exception is `--docx-direct`, which bypasses it for the `default` and `advanced` profiles.

		### `--docx-direct` flag

		Skips the LibreOffice → PDF conversion step for the `default` and `advanced` profiles when the source is a `.docx`/`.doc` file. The document is fed straight to Docling.

		> Note: `--docx-direct` has no effect on `markdown-only`. pymupdf4llm always requires PDF input, so the LibreOffice conversion is mandatory for that profile.
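
As a rough illustration of how the profile table and the `--docx-direct` rule combine, the selection logic might look like the sketch below. `Plan` and `plan_extraction` are hypothetical names, not the tool's API, and the sketch assumes that conversion is only needed when the source is one of the listed Office formats.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple

OFFICE_SUFFIXES = {".docx", ".doc", ".pptx"}
DOCX_DIRECT_SUFFIXES = {".docx", ".doc"}
DOCLING_PROFILES = {"default", "advanced"}


@dataclass(frozen=True)
class Plan:
    convert_to_pdf: bool        # run LibreOffice before extraction?
    engine: Optional[str]       # "pymupdf4llm", "docling", or None
    outputs: Tuple[str, ...]    # file suffixes the profile produces


def plan_extraction(profile: str, source: str, docx_direct: bool = False) -> Plan:
    """Map (profile, source file, --docx-direct) to a processing plan (hypothetical)."""
    suffix = Path(source).suffix.lower()
    is_office = suffix in OFFICE_SUFFIXES  # assumption: only these need conversion

    if profile == "pdf-only":
        return Plan(is_office, None, (".pdf",))
    if profile == "markdown-only":
        # pymupdf4llm needs PDF input, so docx_direct is deliberately ignored.
        return Plan(is_office, "pymupdf4llm", (".md",))
    if profile in DOCLING_PROFILES:
        direct = docx_direct and suffix in DOCX_DIRECT_SUFFIXES
        return Plan(is_office and not direct, "docling", (".md", ".json"))
    raise ValueError(f"unknown profile: {profile}")
```

For example, `plan_extraction("default", "report.docx", docx_direct=True)` yields a plan with no PDF conversion, while the same call with `"markdown-only"` still converts.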

		---

		## Image handling

		Controlled by the `--figures` option:

		| `--figures` value | `markdown-only` (pymupdf4llm) | `default`/`advanced` (Docling) |
		|-------------------|-----------------------------|----------------------------|
		| `embed` (default) | Images embedded as base64 in `.md` | `ImageRefMode.PLACEHOLDER`: images in JSON |
		| `reference` | Images extracted to `./media/` next to `.md` | `ImageRefMode.REFERENCED`: images referenced from JSON |

		---

		## Spec source directory naming

		Spec source directories include the release version suffix:

		```
		~/.3gpp-crawler/wiki/my-project/sources/26260-REL18.0.0/
		~/.3gpp-crawler/wiki/my-project/sources/26131-REL19.0.0/
		```

		Members added with `--release 18.0` use that release, which appears in the directory name as `REL18.0.0`. Members added without `--release` resolve to the latest available version; if the database has no version data, the spec metadata is auto-crawled from 3GPP before extraction.
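
The naming scheme can be sketched as below. `source_dir` is a hypothetical helper, and padding a short release such as `18.0` to three components is an assumption inferred from the examples in this document rather than confirmed behavior.

```python
from pathlib import Path
from typing import Optional

DEFAULT_ROOT = Path.home() / ".3gpp-crawler" / "wiki"


def source_dir(
    workspace: str,
    doc_id: str,
    release: Optional[str] = None,
    root: Path = DEFAULT_ROOT,
) -> Path:
    """Build the source directory for a workspace member (hypothetical helper).

    TDocs use the bare document ID; specs get a '-REL<version>' suffix.
    """
    name = doc_id
    if release is not None:
        parts = release.split(".")
        # Assumption: pad '18.0' to '18.0.0' so the suffix has three components.
        parts += ["0"] * (3 - len(parts))
        name = f"{doc_id}-REL{'.'.join(parts)}"
    return root / workspace / "sources" / name
```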

		---

		## Idempotent processing

		`workspace process` is idempotent by default (`--skip-existing`):

		1. Fast-path skip: Before entering `convert_for_wiki`, `_should_skip_member()` checks if output artifacts already exist on disk.
		2. Per-profile guard: Inside `convert_for_wiki`, each profile checks for existing output (`.md` for markdown-only, `.md` + `.json` for Docling) and returns early.
		3. Override: `--force` re-extracts regardless of existing output.
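
A minimal sketch of the skip decision, under stated assumptions: the function names below are hypothetical (the signature of the real `_should_skip_member()` is not shown here), and the sketch assumes output files are named `<doc-id><suffix>` inside the member's source directory.

```python
from pathlib import Path

# Output suffixes each profile is expected to leave behind.
REQUIRED_OUTPUTS = {
    "pdf-only": (".pdf",),
    "markdown-only": (".md",),
    "default": (".md", ".json"),
    "advanced": (".md", ".json"),
}


def outputs_exist(source_dir: Path, doc_id: str, profile: str) -> bool:
    """True only if every output file the profile produces is already on disk."""
    return all(
        (source_dir / f"{doc_id}{suffix}").is_file()
        for suffix in REQUIRED_OUTPUTS[profile]
    )


def should_skip(source_dir: Path, doc_id: str, profile: str, force: bool = False) -> bool:
    """Skip a member when its outputs exist, unless force is set."""
    if force:
        return False
    return outputs_exist(source_dir, doc_id, profile)
```

Requiring all of a profile's outputs means a Docling run with only the `.md` present is treated as incomplete and re-processed.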

		---

		## Example workflow

		```bash
		# 1. Create and activate a workspace
		3gpp-crawler workspace create my-project
		3gpp-crawler workspace activate my-project

		# 2. Add documents
		3gpp-crawler workspace add S4-250638
		3gpp-crawler workspace add 26260 --kind spec --release 18.0

		# 3. Extract
		3gpp-crawler workspace process --profile markdown-only

		# Output lands in:
		# ~/.3gpp-crawler/wiki/my-project/sources/S4-250638/
		# ~/.3gpp-crawler/wiki/my-project/sources/26260-REL18.0.0/
		```
