refactor(scripts): Clean up and optimize PDF processing scripts (c48b579c) · Commits · Jan Reimes / 3gpp-crawler

.config/skills/documents/docx/SKILL.md

+44 −22

Original line number	Diff line number	Diff line
		---
		name: docx
		description: "Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. When Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks"
		license: Proprietary. LICENSE.txt has complete terms
		---
		______________________________________________________________________

		## name: docx description: "Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. When Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks" license: Proprietary. LICENSE.txt has complete terms

		# DOCX creation, editing, and analysis

		@@ -13,12 +11,15 @@ A user may ask you to create, edit, or analyze the contents of a .docx file. A .
		## Workflow Decision Tree

		### Reading/Analyzing Content

		Use "Text extraction" or "Raw XML access" sections below

		### Creating New Document

		Use "Creating a new Word document" workflow

		### Editing Existing Document

		- Your own document + simple changes
		Use "Basic OOXML editing" workflow

		@@ -31,6 +32,7 @@ Use "Creating a new Word document" workflow
		## Reading and analyzing content

		### Text extraction

		If you just need to read the text contents of a document, you should convert the document to markdown using pandoc. Pandoc provides excellent support for preserving document structure and can show tracked changes:

		```bash
		@@ -40,35 +42,40 @@ pandoc --track-changes=all path-to-file.docx -o output.md
		```

		### Raw XML access

		You need raw XML access for: comments, complex formatting, document structure, embedded media, and metadata. For any of these features, you'll need to unpack a document and read its raw XML contents.

		#### Unpacking a file

		`python ooxml/scripts/unpack.py <office_file> <output_directory>`
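
As a rough illustration of what unpacking involves, the sketch below extracts an Office package (which is a ZIP archive) and pretty-prints its XML parts so they are easier to read and grep. It is not the skill's `unpack.py`: the function name `unpack_office_file` is hypothetical, and the standard library's `xml.dom.minidom` stands in for `defusedxml.minidom`, so it should only be run on trusted files.

```python
import tempfile
import zipfile
from pathlib import Path
from xml.dom import minidom  # stdlib stand-in for defusedxml.minidom (assumption)


def unpack_office_file(office_file, output_dir):
    """Extract an OOXML package and pretty-print its XML parts."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(office_file) as zf:
        zf.extractall(out)
    # Re-serialize XML and relationship parts with indentation for readability
    for part in list(out.rglob("*.xml")) + list(out.rglob("*.rels")):
        pretty = minidom.parse(str(part)).toprettyxml(indent="  ", encoding="utf-8")
        part.write_bytes(pretty)
    return sorted(p.relative_to(out).as_posix() for p in out.rglob("*") if p.is_file())


# Usage with a tiny stand-in package (not a valid Word document)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.docx"
    with zipfile.ZipFile(sample, "w") as zf:
        zf.writestr("word/document.xml", "<doc><p>Hello</p></doc>")
    parts = unpack_office_file(sample, Path(tmp) / "unpacked")
    print(parts)
```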

		#### Key file structures
		* `word/document.xml` - Main document contents
		* `word/comments.xml` - Comments referenced in document.xml
		* `word/media/` - Embedded images and media files
		* Tracked changes use `<w:ins>` (insertions) and `<w:del>` (deletions) tags

		- `word/document.xml` - Main document contents
		- `word/comments.xml` - Comments referenced in document.xml
		- `word/media/` - Embedded images and media files
		- Tracked changes use `<w:ins>` (insertions) and `<w:del>` (deletions) tags
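
To make the last point concrete, here is a minimal sketch that lists the tracked changes found in a `word/document.xml` string. The helper name `summarize_tracked_changes` and the sample XML are illustrative assumptions; the sketch uses the standard library parser rather than the skill's Document library.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def summarize_tracked_changes(document_xml):
    """Return (kind, author, text) for every <w:ins> and <w:del> element."""
    root = ET.fromstring(document_xml)
    changes = []
    # Inserted text lives in <w:t>, deleted text in <w:delText>
    for kind, text_tag in (("ins", "t"), ("del", "delText")):
        for node in root.iter(f"{{{W_NS}}}{kind}"):
            author = node.get(f"{{{W_NS}}}author", "")
            text = "".join(t.text or "" for t in node.iter(f"{{{W_NS}}}{text_tag}"))
            changes.append((kind, author, text))
    return changes


sample = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">The term is </w:t></w:r>'
    '<w:del w:id="1" w:author="Claude"><w:r><w:delText>30</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="Claude"><w:r><w:t>60</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> days.</w:t></w:r>'
    "</w:p></w:body></w:document>"
)
print(summarize_tracked_changes(sample))
```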

		## Creating a new Word document

		When creating a new Word document from scratch, use docx-js, which allows you to create Word documents using JavaScript/TypeScript.

		### Workflow

		1. MANDATORY - READ ENTIRE FILE: Read [`docx-js.md`](docx-js.md) (~500 lines) completely from start to finish. NEVER set any range limits when reading this file. Read the full file content for detailed syntax, critical formatting rules, and best practices before proceeding with document creation.
		2. Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components (You can assume all dependencies are installed, but if not, refer to the dependencies section below)
		3. Export as .docx using Packer.toBuffer()
		1. Create a JavaScript/TypeScript file using Document, Paragraph, TextRun components (You can assume all dependencies are installed, but if not, refer to the dependencies section below)
		1. Export as .docx using Packer.toBuffer()

		## Editing an existing Word document

		When editing an existing Word document, use the Document library (a Python library for OOXML manipulation). The library automatically handles infrastructure setup and provides methods for document manipulation. For complex scenarios, you can access the underlying DOM directly through the library.

		### Workflow

		1. MANDATORY - READ ENTIRE FILE: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. NEVER set any range limits when reading this file. Read the full file content for the Document library API and XML patterns for directly editing document files.
		2. Unpack the document: `python ooxml/scripts/unpack.py <office_file> <output_directory>`
		3. Create and run a Python script using the Document library (see "Document Library" section in ooxml.md)
		4. Pack the final document: `python ooxml/scripts/pack.py <input_directory> <office_file>`
		1. Unpack the document: `python ooxml/scripts/unpack.py <office_file> <output_directory>`
		1. Create and run a Python script using the Document library (see "Document Library" section in ooxml.md)
		1. Pack the final document: `python ooxml/scripts/pack.py <input_directory> <office_file>`

		The Document library provides both high-level methods for common operations and direct DOM access for complex scenarios.

		@@ -82,6 +89,7 @@ This workflow allows you to plan comprehensive tracked changes using markdown be
		When implementing tracked changes, only mark text that actually changes. Repeating unchanged text makes edits harder to review and appears unprofessional. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]. Preserve the original run's RSID for unchanged text by extracting the `<w:r>` element from the original and reusing it.
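
The split into [unchanged text] + [deletion] + [insertion] + [unchanged text] can be sketched as a small helper that keeps the common leading and trailing words and marks only the differing middle. This is an illustration under stated assumptions, not part of the Document library: `minimal_tracked_replacement` is a hypothetical name, it builds fresh `<w:r>` elements instead of reusing the original run (so it does not preserve the original RSID or run formatting), and it omits the `w:id`, `w:author`, and date attributes.

```python
import re
from xml.sax.saxutils import escape


def _run(text, tag="w:t"):
    """Wrap text in a run, preserving leading/trailing whitespace."""
    space = ' xml:space="preserve"' if text != text.strip() else ""
    return f"<w:r><{tag}{space}>{escape(text)}</{tag}></w:r>"


def minimal_tracked_replacement(old, new):
    """Build run XML that marks only the words that differ between old and new."""
    a = re.findall(r"\s+|\S+", old)  # alternating word and whitespace tokens
    b = re.findall(r"\s+|\S+", new)
    i = 0  # length of the common prefix, in tokens
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    j = 0  # length of the common suffix, never overlapping the prefix
    while j < min(len(a), len(b)) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    prefix, suffix = "".join(a[:i]), "".join(a[len(a) - j:])
    deleted, inserted = "".join(a[i:len(a) - j]), "".join(b[i:len(b) - j])
    parts = []
    if prefix:
        parts.append(_run(prefix))
    if deleted:
        parts.append(f"<w:del>{_run(deleted, 'w:delText')}</w:del>")
    if inserted:
        parts.append(f"<w:ins>{_run(inserted)}</w:ins>")
    if suffix:
        parts.append(_run(suffix))
    return "".join(parts)


print(minimal_tracked_replacement("The term is 30 days.", "The term is 60 days."))
```

Working at word level rather than character level keeps the marked span readable for a reviewer: the change shows up as "30" replaced by "60", not "3" replaced by "6".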

		Example - Changing "30 days" to "60 days" in a sentence:

		```python
		# BAD - Replaces entire sentence
		'<w:del><w:r><w:delText>The term is 30 days.</w:delText></w:r></w:del><w:ins><w:r><w:t>The term is 60 days.</w:t></w:r></w:ins>'
		@@ -93,13 +101,15 @@ Example - Changing "30 days" to "60 days" in a sentence:
		### Tracked changes workflow

		1. Get markdown representation: Convert document to markdown with tracked changes preserved:

		```bash
		pandoc --track-changes=all path-to-file.docx -o current.md
		```

		2. Identify and group changes: Review the document and identify ALL changes needed, organizing them into logical batches:
		1. Identify and group changes: Review the document and identify ALL changes needed, organizing them into logical batches:

		Location methods (for finding changes in XML):

		- Section/heading numbers (e.g., "Section 3.2", "Article IV")
		- Paragraph identifiers if numbered
		- Grep patterns with unique surrounding text
		@@ -107,22 +117,26 @@ Example - Changing "30 days" to "60 days" in a sentence:
		- DO NOT use markdown line numbers - they don't map to XML structure

		Batch organization (group 3-10 related changes per batch):

		- By section: "Batch 1: Section 2 amendments", "Batch 2: Section 5 updates"
		- By type: "Batch 1: Date corrections", "Batch 2: Party name changes"
		- By complexity: Start with simple text replacements, then tackle complex structural changes
		- Sequential: "Batch 1: Pages 1-3", "Batch 2: Pages 4-6"

		3. Read documentation and unpack:
		1. Read documentation and unpack:

		- MANDATORY - READ ENTIRE FILE: Read [`ooxml.md`](ooxml.md) (~600 lines) completely from start to finish. NEVER set any range limits when reading this file. Pay special attention to the "Document Library" and "Tracked Change Patterns" sections.
		- Unpack the document: `python ooxml/scripts/unpack.py <file.docx> <dir>`
		- Note the suggested RSID: The unpack script will suggest an RSID to use for your tracked changes. Copy this RSID for use in step 4b.

		4. Implement changes in batches: Group changes logically (by section, by type, or by proximity) and implement them together in a single script. This approach:
		1. Implement changes in batches: Group changes logically (by section, by type, or by proximity) and implement them together in a single script. This approach:

		- Makes debugging easier (smaller batch = easier to isolate errors)
		- Allows incremental progress
		- Maintains efficiency (batch size of 3-10 changes works well)

		Suggested batch groupings:

		- By document section (e.g., "Section 3 changes", "Definitions", "Termination clause")
		- By change type (e.g., "Date changes", "Party name updates", "Legal term replacements")
		- By proximity (e.g., "Changes on pages 1-3", "Changes in first half of document")
		@@ -135,12 +149,14 @@ Example - Changing "30 days" to "60 days" in a sentence:

		Note: Always grep `word/document.xml` immediately before writing a script to get current line numbers and verify text content. Line numbers change after each script run.

		5. Pack the document: After all batches are complete, convert the unpacked directory back to .docx:
		1. Pack the document: After all batches are complete, convert the unpacked directory back to .docx:

		```bash
		python ooxml/scripts/pack.py unpacked reviewed-document.docx
		```

		6. Final verification: Do a comprehensive check of the complete document:
		1. Final verification: Do a comprehensive check of the complete document:

		- Convert final document to markdown:
		```bash
		pandoc --track-changes=all reviewed-document.docx -o verification.md
		@@ -152,23 +168,26 @@ Example - Changing "30 days" to "60 days" in a sentence:
		```
		- Check that no unintended changes were introduced


		## Converting Documents to Images

		To visually analyze Word documents, convert them to images using a two-step process:

		1. Convert DOCX to PDF:

		```bash
		soffice --headless --convert-to pdf document.docx
		```

		2. Convert PDF pages to JPEG images:
		1. Convert PDF pages to JPEG images:

		```bash
		pdftoppm -jpeg -r 150 document.pdf page
		```

		This creates files like `page-1.jpg`, `page-2.jpg`, etc.

		Options:

		- `-r 150`: Sets resolution to 150 DPI (adjust for quality/size balance)
		- `-jpeg`: Output JPEG format (use `-png` for PNG if preferred)
		- `-f N`: First page to convert (e.g., `-f 2` starts from page 2)
		@@ -176,12 +195,15 @@ Options:
		- `page`: Prefix for output files

		Example for specific range:

		```bash
		pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page # Converts only pages 2-5
		```
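
If the conversion is driven from a script, the two commands above can be assembled as argument lists. The sketch below only builds the commands and uses the flags shown in this section; the function names are hypothetical, and running them (for example with `subprocess.run(cmd, check=True)`) assumes `soffice` and `pdftoppm` are on the PATH.

```python
import shlex


def docx_to_pdf_cmd(docx_path):
    """Command for step 1: DOCX to PDF via headless LibreOffice."""
    return ["soffice", "--headless", "--convert-to", "pdf", docx_path]


def pdf_to_images_cmd(pdf_path, prefix="page", dpi=150, fmt="jpeg", first=None, last=None):
    """Command for step 2: PDF pages to images via pdftoppm."""
    if fmt not in ("jpeg", "png"):
        raise ValueError("fmt must be 'jpeg' or 'png'")
    cmd = ["pdftoppm", f"-{fmt}", "-r", str(dpi)]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]  # last page to convert
    return cmd + [pdf_path, prefix]


print(shlex.join(docx_to_pdf_cmd("document.docx")))
print(shlex.join(pdf_to_images_cmd("document.pdf", first=2, last=5)))
```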

		## Code Style Guidelines

		IMPORTANT: When generating code for DOCX operations:

		- Write concise code
		- Avoid verbose variable names and redundant operations
		- Avoid unnecessary print statements

.config/skills/documents/docx/docx-js.md

+18 −4

		@@ -5,6 +5,7 @@ Generate .docx files with JavaScript/TypeScript.
		Important: Read this entire document before starting. Critical formatting rules and common pitfalls are covered throughout - skipping sections may result in corrupted files or rendering issues.

		## Setup

		Assumes docx is already installed globally
		If not installed: `npm install -g docx`

		@@ -22,6 +23,7 @@ Packer.toBlob(doc).then(blob => { /* download logic */ }); // Browser
		```

		## Text & Formatting

		```javascript
		// IMPORTANT: Never use \n for line breaks - always use separate Paragraph elements
		// ❌ WRONG: new TextRun("Line 1\nLine 2")
		@@ -90,11 +92,13 @@ const doc = new Document({
		```

		Professional Font Combinations:

		- Arial (Headers) + Arial (Body) - Most universally supported, clean and professional
		- Times New Roman (Headers) + Arial (Body) - Classic serif headers with modern sans-serif body
		- Georgia (Headers) + Verdana (Body) - Optimized for screen reading, elegant contrast

		Key Styling Principles:

		- Override built-in styles: Use exact IDs like "Heading1", "Heading2", "Heading3" to override Word's built-in heading styles
		- HeadingLevel constants: `HeadingLevel.HEADING_1` uses "Heading1" style, `HeadingLevel.HEADING_2` uses "Heading2" style, etc.
		- Include outlineLevel: Set `outlineLevel: 0` for H1, `outlineLevel: 1` for H2, etc. to ensure TOC works correctly
		@@ -105,8 +109,8 @@ const doc = new Document({
		- Use colors sparingly: Default to black (000000) and shades of gray for titles and headings (heading 1, heading 2, etc.)
		- Set consistent margins (1440 = 1 inch is standard)


		## Lists (ALWAYS USE PROPER LISTS - NEVER USE UNICODE BULLETS)

		```javascript
		// Bullets - ALWAYS use the numbering config, NOT unicode symbols
		// CRITICAL: Use LevelFormat.BULLET constant, NOT the string "bullet"
		@@ -156,6 +160,7 @@ const doc = new Document({
		```

		## Tables

		```javascript
		// Complete table with margins, borders, headers, and bullet points
		const tableBorder = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
		@@ -218,15 +223,18 @@ new Table({
		```

		IMPORTANT: Table Width & Borders

		- Use BOTH `columnWidths: [width1, width2, ...]` array AND `width: { size: X, type: WidthType.DXA }` on each cell
		- Values in DXA (twentieths of a point): 1440 = 1 inch, Letter usable width = 9360 DXA (with 1" margins)
		- Apply borders to individual `TableCell` elements, NOT the `Table` itself

		Precomputed Column Widths (Letter size with 1" margins = 9360 DXA total):

		- 2 columns: `columnWidths: [4680, 4680]` (equal width)
		- 3 columns: `columnWidths: [3120, 3120, 3120]` (equal width)
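
The precomputed widths follow from simple arithmetic: 1 inch is 1440 DXA, so a Letter page (8.5 inches wide) with 1-inch margins leaves 6.5 × 1440 = 9360 DXA. The sketch below reproduces that calculation for any column count. It is written in Python purely to show the arithmetic (the resulting numbers are what you would pass to `columnWidths`), and the helper names are hypothetical.

```python
DXA_PER_INCH = 1440  # 1 DXA is one twentieth of a point; 72 points per inch


def inches_to_dxa(inches):
    return round(inches * DXA_PER_INCH)


def equal_column_widths(columns, page_width_in=8.5, margin_in=1.0):
    """Split the usable page width into equal integer DXA column widths."""
    usable = inches_to_dxa(page_width_in - 2 * margin_in)
    base, extra = divmod(usable, columns)
    # Give any remainder to the first columns so the widths sum exactly to the usable width
    return [base + (1 if i < extra else 0) for i in range(columns)]


print(equal_column_widths(2))
print(equal_column_widths(3))
```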

		## Links & Navigation

		```javascript
		// TOC (requires headings) - CRITICAL: Use HeadingLevel only, NOT custom styles
		// ❌ WRONG: new Paragraph({ heading: HeadingLevel.HEADING_1, style: "customHeader", children: [new TextRun("Title")] })
		@@ -255,6 +263,7 @@ new Paragraph({
		```

		## Images & Media

		```javascript
		// Basic image with sizing & positioning
		// CRITICAL: Always specify 'type' parameter - it's REQUIRED for ImageRun
		@@ -270,6 +279,7 @@ new Paragraph({
		```

		## Page Breaks

		```javascript
		// Manual page break
		new Paragraph({ children: [new PageBreak()] }),
		@@ -286,6 +296,7 @@ new Paragraph({
		```

		## Headers/Footers & Page Setup

		```javascript
		const doc = new Document({
		sections: [{
		@@ -314,6 +325,7 @@ const doc = new Document({
		```

		## Tabs

		```javascript
		new Paragraph({
		tabStops: [
		@@ -326,6 +338,7 @@ new Paragraph({
		```

		## Constants & Quick Reference

		- Underlines: `SINGLE`, `DOUBLE`, `WAVY`, `DASH`
		- Borders: `SINGLE`, `DOUBLE`, `DASHED`, `DOTTED`
		- Numbering: `DECIMAL` (1,2,3), `UPPER_ROMAN` (I,II,III), `LOWER_LETTER` (a,b,c)
		@@ -333,6 +346,7 @@ new Paragraph({
		- Symbols: `"2022"` (•), `"00A9"` (©), `"00AE"` (®), `"2122"` (™), `"00B0"` (°), `"F070"` (✓), `"F0FC"` (✗)

		## Critical Issues & Common Mistakes

		- CRITICAL: PageBreak must ALWAYS be inside a Paragraph - standalone PageBreak creates invalid XML that Word cannot open
		- ALWAYS use ShadingType.CLEAR for table cell shading - Never use ShadingType.SOLID (causes black background).
		- Measurements in DXA (1440 = 1 inch) | Each table cell needs ≥1 Paragraph | TOC requires HeadingLevel styles only
		@@ -340,7 +354,7 @@ new Paragraph({
		- ALWAYS set a default font using `styles.default.document.run.font` - Arial recommended
		- ALWAYS use columnWidths array for tables + individual cell widths for compatibility
		- NEVER use unicode symbols for bullets - always use proper numbering configuration with `LevelFormat.BULLET` constant (NOT the string "bullet")
		- NEVER use \n for line breaks anywhere - always use separate Paragraph elements for each line
		- NEVER use \\n for line breaks anywhere - always use separate Paragraph elements for each line
		- ALWAYS use TextRun objects within Paragraph children - never use text property directly on Paragraph
		- CRITICAL for images: ImageRun REQUIRES `type` parameter - always specify "png", "jpg", "jpeg", "gif", "bmp", or "svg"
		- CRITICAL for bullets: Must use `LevelFormat.BULLET` constant, not string "bullet", and include `text: "•"` for the bullet character

.config/skills/documents/docx/ooxml.md

+27 −2

		# Office Open XML Technical Reference

		Important: Read this entire document before starting. This document covers:

		- [Technical Guidelines](#technical-guidelines) - Schema compliance rules and validation requirements
		- [Document Content Patterns](#document-content-patterns) - XML patterns for headings, lists, tables, formatting, etc.
		- [Document Library (Python)](#document-library-python) - Recommended approach for OOXML manipulation with automatic infrastructure setup
		@@ -9,6 +10,7 @@
		## Technical Guidelines

		### Schema Compliance

		- Element ordering in `<w:pPr>`: `<w:pStyle>`, `<w:numPr>`, `<w:spacing>`, `<w:ind>`, `<w:jc>`
		- Whitespace: Add `xml:space='preserve'` to `<w:t>` elements with leading/trailing spaces
		- Unicode: Escape characters in ASCII content: `“` becomes `&#8220;`
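
As a small illustration of the Unicode rule, Python's `xmlcharrefreplace` error handler turns any non-ASCII character into a numeric character reference, which is the form an ASCII-encoded XML part needs. The helper name `to_ascii_xml_text` is hypothetical, and the sketch does not escape `&`, `<`, or `>`, which must still be handled separately.

```python
import html


def to_ascii_xml_text(text):
    """Replace non-ASCII characters with numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


escaped = to_ascii_xml_text("\u201cCompany\u201d")
print(escaped)  # &#8220;Company&#8221;

# Decoding the references gives back the original curly quotes
assert html.unescape(escaped) == "\u201cCompany\u201d"
```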
		@@ -22,6 +24,7 @@
		## Document Content Patterns

		### Basic Structure

		```xml
		<w:p>
		<w:r><w:t>Text content</w:t></w:r>
		@@ -29,6 +32,7 @@
		```

		### Headings and Styles

		```xml
		<w:p>
		<w:pPr>
		@@ -45,6 +49,7 @@
		```

		### Text Formatting

		```xml
		<!-- Bold -->
		<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>Bold</w:t></w:r>
		@@ -57,6 +62,7 @@
		```

		### Lists

		```xml
		<!-- Numbered list -->
		<w:p>
		@@ -91,6 +97,7 @@
		```

		### Tables

		```xml
		<w:tbl>
		<w:tblPr>
		@@ -114,6 +121,7 @@
		```

		### Layout

		```xml
		<!-- Page break before new section (common pattern) -->
		<w:p>
		@@ -162,18 +170,21 @@
		When adding content, update these files:

		`word/_rels/document.xml.rels`:

		```xml
		<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/numbering" Target="numbering.xml"/>
		<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
		```

		`[Content_Types].xml`:

		```xml
		<Default Extension="png" ContentType="image/png"/>
		<Override PartName="/word/numbering.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/>
		```

		### Images

		CRITICAL: Calculate dimensions to prevent page overflow and maintain aspect ratio.

		```xml
		@@ -218,6 +229,7 @@ When adding content, update these files:
		IMPORTANT: All hyperlinks (both internal and external) require the Hyperlink style to be defined in styles.xml. Without this style, links will look like regular text instead of blue underlined clickable links.

		External Links:

		```xml
		<!-- In document.xml -->
		<w:hyperlink r:id="rId5">
		@@ -250,6 +262,7 @@ When adding content, update these files:
		```

		Hyperlink Style (required in styles.xml):

		```xml
		<w:style w:type="character" w:styleId="Hyperlink">
		<w:name w:val="Hyperlink"/>
		@@ -268,12 +281,14 @@ When adding content, update these files:
		Use the Document class from `scripts/document.py` for all tracked changes and comments. It automatically handles infrastructure setup (people.xml, RSIDs, settings.xml, comment files, relationships, content types). Only use direct XML manipulation for complex scenarios not supported by the library.

		Working with Unicode and Entities:

		- Searching: Both entity notation and Unicode characters work - `contains="&#8220;Company"` and `contains="\u201cCompany"` find the same text
		- Replacing: Use either entities (`&#8220;`) or Unicode (`\u201c`) - both work and will be converted appropriately based on the file's encoding (ascii → entities, utf-8 → Unicode)

		### Initialization

		Find the docx skill root (directory containing `scripts/` and `ooxml/`):

		```bash
		# Search for document.py to locate the skill root
		# Note: /mnt/skills is used here as an example; check your context for the actual location
		@@ -283,11 +298,13 @@ find /mnt/skills -name "document.py" -path "*/docx/scripts/*" 2>/dev/null | head
		```

		Run your script with PYTHONPATH set to the docx skill root:

		```bash
		PYTHONPATH=/mnt/skills/docx python your_script.py
		```

		In your script, import from the skill root:

		```python
		from scripts.document import Document, DocxXMLEditor

		@@ -311,6 +328,7 @@ doc = Document('unpacked', rsid="07DC5ECB")
		Attribute Handling: The Document class auto-injects attributes (w:id, w:date, w:rsidR, w:rsidDel, w16du:dateUtc, xml:space) into new elements. When preserving unchanged text from the original document, copy the original `<w:r>` element with its existing attributes to maintain document integrity.

		Method Selection Guide:

		- Adding your own changes to regular text: Use `replace_node()` with `<w:del>`/`<w:ins>` tags, or `suggest_deletion()` for removing entire `<w:r>` or `<w:p>` elements
		- Partially modifying another author's tracked change: Use `replace_node()` to nest your changes inside their `<w:ins>`/`<w:del>`
		- Completely rejecting another author's insertion: Use `revert_insertion()` on the `<w:ins>` element (NOT `suggest_deletion()`)
		@@ -556,7 +574,9 @@ nodes = doc["word/document.xml"].insert_after(nodes[-1], "<w:r><w:t>C</w:t></w:r
		Use the Document class above for all tracked changes. The patterns below are for reference when constructing replacement XML strings.

		### Validation Rules

		The validator checks that the document text matches the original after reverting Claude's changes. This means:

		- NEVER modify text inside another author's `<w:ins>` or `<w:del>` tags
		- ALWAYS use nested deletions to remove another author's insertions
		- Every edit must be properly tracked with `<w:ins>` or `<w:del>` tags
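
The idea behind this check can be sketched as follows: undo every tracked change by one author (drop their insertions, restore their deletions) and compare the remaining text with the original. This is only an illustration of the principle, not the skill's validator; `text_after_reverting` and the sample XML are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + NS + "}"


def text_after_reverting(xml_text, author):
    """Return the visible text once every tracked change by `author` is undone."""
    parts = []

    def walk(node, restoring):
        if node.tag == W + "ins" and node.get(W + "author") == author:
            return  # the author's insertion disappears when reverted
        if node.tag == W + "del":
            # Only the author's own deletions are restored; others stay deleted
            restoring = node.get(W + "author") == author
        if node.tag == W + "t" or (node.tag == W + "delText" and restoring):
            parts.append(node.text or "")
        for child in node:
            walk(child, restoring)

    walk(ET.fromstring(xml_text), False)
    return "".join(parts)


original = f'<w:p xmlns:w="{NS}"><w:r><w:t>The term is 30 days.</w:t></w:r></w:p>'
tracked = (
    f'<w:p xmlns:w="{NS}"><w:r><w:t xml:space="preserve">The term is </w:t></w:r>'
    '<w:del w:id="1" w:author="Claude"><w:r><w:delText>30</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="Claude"><w:r><w:t>60</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> days.</w:t></w:r></w:p>'
)
untracked = f'<w:p xmlns:w="{NS}"><w:r><w:t>The term is 60 days.</w:t></w:r></w:p>'

baseline = text_after_reverting(original, "Claude")
print(text_after_reverting(tracked, "Claude") == baseline)    # properly tracked edit
print(text_after_reverting(untracked, "Claude") == baseline)  # untracked edit
```

A properly tracked edit reverts cleanly to the original text, while an edit made without `<w:ins>`/`<w:del>` leaves a difference that such a comparison would flag.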
		@@ -564,10 +584,12 @@ The validator checks that the document text matches the original after reverting
		### Tracked Change Patterns

		CRITICAL RULES:

		1. Never modify the content inside another author's tracked changes. Always use nested deletions.
		2. XML Structure: Always place `<w:del>` and `<w:ins>` at paragraph level containing complete `<w:r>` elements. Never nest inside `<w:r>` elements - this creates invalid XML that breaks document processing.
		1. XML Structure: Always place `<w:del>` and `<w:ins>` at paragraph level containing complete `<w:r>` elements. Never nest inside `<w:r>` elements - this creates invalid XML that breaks document processing.

		Text Insertion:

		```xml
		<w:ins w:id="1" w:author="Claude" w:date="2025-07-30T23:05:00Z" w16du:dateUtc="2025-07-31T06:05:00Z">
		<w:r w:rsidR="00792858">
		@@ -577,6 +599,7 @@ The validator checks that the document text matches the original after reverting
		```

		Text Deletion:

		```xml
		<w:del w:id="2" w:author="Claude" w:date="2025-07-30T23:05:00Z" w16du:dateUtc="2025-07-31T06:05:00Z">
		<w:r w:rsidDel="00792858">
		@@ -586,6 +609,7 @@ The validator checks that the document text matches the original after reverting
		```

		Deleting Another Author's Insertion (MUST use nested structure):

		```xml
		<!-- Nest deletion inside the original insertion -->
		<w:ins w:author="Jane Smith" w:id="16">
		@@ -599,6 +623,7 @@ The validator checks that the document text matches the original after reverting
		```

		Restoring Another Author's Deletion:

		```xml
		<!-- Leave their deletion unchanged, add new insertion after it -->
		<w:del w:author="Jane Smith" w:id="50">

.config/skills/documents/docx/ooxml/scripts/pack.py

+4 −9

		@@ -11,10 +11,11 @@ import shutil
		import subprocess
		import sys
		import tempfile
		import defusedxml.minidom
		import zipfile
		from pathlib import Path

		import defusedxml.minidom


		def main():
		parser = argparse.ArgumentParser(description="Pack a directory into an Office file")
		@@ -24,9 +25,7 @@ def main():
		args = parser.parse_args()

		try:
		success = pack_document(
		args.input_directory, args.output_file, validate=not args.force
		)
		success = pack_document(args.input_directory, args.output_file, validate=not args.force)

		# Show warning if validation was skipped
		if args.force:
		@@ -143,11 +142,7 @@ def condense_xml(xml_file):

		# Remove whitespace-only text nodes and comment nodes
		for child in list(element.childNodes):
		if (
		child.nodeType == child.TEXT_NODE
		and child.nodeValue
		and child.nodeValue.strip() == ""
		) or child.nodeType == child.COMMENT_NODE:
		if (child.nodeType == child.TEXT_NODE and child.nodeValue and child.nodeValue.strip() == "") or child.nodeType == child.COMMENT_NODE:
		element.removeChild(child)

		# Write back the condensed XML

.config/skills/documents/docx/ooxml/scripts/unpack.py

+2 −1

		@@ -3,10 +3,11 @@

		import random
		import sys
		import defusedxml.minidom
		import zipfile
		from pathlib import Path

		import defusedxml.minidom

		# Get command line arguments
		assert len(sys.argv) == 3, "Usage: python unpack.py <office_file> <output_dir>"
		input_file, output_dir = sys.argv[1], sys.argv[2]