Added extraction support for legacy Office 2003 .doc files using the doc2txt package (which wraps the antiword CLI tool).

Changes:
- Added extract_doc_to_markdown() function for .doc file extraction
- Updated extract_from_folder() to detect and route both .doc and .docx files
- .doc files use doc2txt/antiword, .docx files use kreuzberg
- Both extraction paths support change detection and artifact caching
- Added doc2txt import (package already in dependencies)
- Updated module docstring and exports

The implementation follows the same pattern as DOCX extraction:
- Hash-based idempotency (skip if content unchanged)
- Artifact caching in .ai folder
- Status tracking in LanceDB
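
For reviewers, the routing and hash-based skip described above can be sketched roughly as follows. This is a minimal illustration, not the code in this change: the names `extract_with_cache` and `extract_from_folder_sketch`, the sidecar `.sha256` file, and the artifact naming are all assumptions. The extractors are passed in as plain callables so the sketch does not depend on the doc2txt or kreuzberg APIs, and LanceDB status tracking is left out.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict

# Hypothetical type: any function that turns a file path into Markdown text.
Extractor = Callable[[Path], str]


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def extract_with_cache(path: Path, extractor: Extractor) -> bool:
    """Extract `path` unless an artifact for the same content already exists.

    Returns True if extraction ran, False if it was skipped.
    The `.ai` folder layout here is an assumption made for illustration.
    """
    cache_dir = path.parent / ".ai"
    cache_dir.mkdir(exist_ok=True)
    artifact = cache_dir / f"{path.name}.md"
    hash_file = cache_dir / f"{path.name}.sha256"

    digest = file_sha256(path)
    unchanged = (
        artifact.exists()
        and hash_file.exists()
        and hash_file.read_text() == digest
    )
    if unchanged:
        return False  # content hash matches the cached artifact: skip

    artifact.write_text(extractor(path), encoding="utf-8")
    hash_file.write_text(digest)
    return True


def extract_from_folder_sketch(
    folder: Path, extractors: Dict[str, Extractor]
) -> Dict[str, bool]:
    """Route each file to an extractor by its lowercase suffix."""
    results: Dict[str, bool] = {}
    for path in sorted(folder.iterdir()):
        extractor = extractors.get(path.suffix.lower())
        if path.is_file() and extractor is not None:
            results[path.name] = extract_with_cache(path, extractor)
    return results
```

In this sketch a caller would pass something like `{".doc": doc_extractor, ".docx": docx_extractor}`, where both values are hypothetical wrappers around the real extraction calls. Running it twice on an unchanged folder does no extraction work the second time.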