ZLD_POC/docs/superpowers/specs/2026-04-14-mineru-ai-word-parse-design.md
2026-04-15 17:18:49 +08:00

MinerU AI Word Parse Design

Goal

Replace the current Illustrator (.ai) text extraction pipeline with a MinerU-backed flow:

  1. Accept an Illustrator .ai file and a Word .docx file.
  2. Convert or normalize the .ai file into a PDF preview artifact.
  3. Upload the PDF artifact to MinerU for document parsing.
  4. Read MinerU JSON output blocks and their bounding boxes.
  5. Compare MinerU text output against the Word document text.
  6. Return field results that the existing React preview can highlight on the right side.

Non-Goals

  • Do not keep the old parse_ai_document(...).fields text extraction as the source of validation fields.
  • Do not expose the MinerU API key to the frontend.
  • Do not require a public callback URL; use polling because this is a local backend.
  • Do not add a new manual annotation workflow.

Backend Flow

The /api/process endpoint keeps its current two-file upload contract: ai_file and word_file.

For each job, the backend creates the existing runtime upload/output directories. The .ai file is converted into a PDF preview artifact using the existing backend.app.ai_parser.parse_ai_document conversion behavior. The resulting preview.pdf is copied into the job output directory and returned as the preview URL.

The backend then submits that PDF preview artifact to MinerU using the documented local-file upload flow:

  1. POST https://mineru.net/api/v4/file-urls/batch with one file entry and model_version: "vlm".
  2. PUT the generated upload URL with the PDF bytes.
  3. Poll GET https://mineru.net/api/v4/extract-results/batch/{batch_id} until the single file reaches done or failed, or until the polling timeout elapses.
  4. When done, download full_zip_url into the job output directory.
  5. Extract the zip into the job output directory and locate the structured JSON output.
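
The polling step (step 3) can be sketched as below. This is a minimal sketch, not the final client: the response layout data.extract_result[0].state is an assumption about the MinerU payload, and poll_until_done, MinerUError, and fetch_status are hypothetical names. The HTTP call is injected as a callable so the control flow can be tested without network access.

```python
import time
from typing import Any, Callable


class MinerUError(RuntimeError):
    """Raised when the MinerU task fails or does not finish in time."""


def poll_until_done(
    fetch_status: Callable[[], dict[str, Any]],
    timeout_s: float = 300.0,
    interval_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status() until the single file is done; return full_zip_url."""
    deadline = clock() + timeout_s
    while True:
        payload = fetch_status()
        # Assumed layout: {"data": {"extract_result": [{"state": ..., ...}]}}
        results = (payload.get("data") or {}).get("extract_result") or []
        entry = results[0] if results else {}
        state = entry.get("state")
        if state == "done":
            return entry["full_zip_url"]
        if state == "failed":
            detail = entry.get("err_msg") or "no err_msg provided"
            raise MinerUError(f"MinerU task failed: {detail}")
        if clock() >= deadline:
            raise MinerUError("MinerU polling timed out")
        sleep(interval_s)
```

In the real client, fetch_status would wrap the GET request for the batch and return the decoded JSON body. Any state other than done or failed is treated as "still in progress", so the sketch does not depend on the exact names of intermediate states.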

The API token is read from MINERU_API_KEY. If it is missing, the backend returns a clear configuration error instead of attempting the request.
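
A minimal sketch of the fail-fast configuration check, assuming the key is read from the process environment. ConfigurationError and load_mineru_api_key are hypothetical names, and the error text is illustrative.

```python
import os
from typing import Mapping, Optional


class ConfigurationError(RuntimeError):
    """Raised when required backend configuration is missing."""


def load_mineru_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the MinerU API key, or raise before any request is attempted."""
    source = os.environ if env is None else env
    key = (source.get("MINERU_API_KEY") or "").strip()
    if not key:
        # The message names the variable but never echoes a value.
        raise ConfigurationError(
            "MINERU_API_KEY is not set; configure it on the backend before processing files."
        )
    return key
```

Accepting an optional mapping keeps the check testable without mutating the real environment.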

MinerU JSON Mapping

The primary parser reads MinerU middle.json-style output because the sample JSON contains:

  • pdf_info[]
  • page_idx
  • page_size
  • para_blocks[]
  • discarded_blocks[]
  • block-level bbox: [x0, y0, x1, y1]
  • nested lines[].spans[] with content, html, and span-level bbox
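
Because the exact file name inside the MinerU zip is not fixed by this design, one option is to detect the structured JSON by content: pick the first .json member whose top level is an object with a pdf_info list. The sketch below works on the zip bytes in memory for brevity, whereas the flow above extracts to the job output directory. find_structured_json is a hypothetical helper name.

```python
import io
import json
import zipfile
from typing import Any


def find_structured_json(zip_bytes: bytes) -> dict[str, Any]:
    """Return the first JSON document in the zip that has a top-level pdf_info list."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".json"):
                continue
            try:
                data = json.loads(archive.read(name).decode("utf-8"))
            except (UnicodeDecodeError, json.JSONDecodeError):
                continue  # Not valid UTF-8 JSON; try the next member.
            if isinstance(data, dict) and isinstance(data.get("pdf_info"), list):
                return data
    raise LookupError("no JSON file with a top-level pdf_info list found in the MinerU zip")
```

The LookupError maps onto the "missing structured JSON in the downloaded zip" error case.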

Each top-level para_blocks item becomes one validation result. For blocks with nested line/span content, the backend concatenates text-like span content. Table spans with html are converted to readable text by stripping tags and decoding HTML entities. If a block has no readable text, it can still be returned as empty_or_garbled when useful, but empty decorative blocks should be skipped.
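
A minimal sketch of the text extraction for one block, assuming spans carry either content or html as listed above. block_text and html_to_text are hypothetical helper names, and joining spans with a single space is an assumption that may need adjusting for CJK text.

```python
import html
import re
from typing import Any

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def html_to_text(fragment: str) -> str:
    """Strip tags, decode entities, and collapse whitespace."""
    without_tags = _TAG_RE.sub(" ", fragment)
    return _WS_RE.sub(" ", html.unescape(without_tags)).strip()


def block_text(block: dict[str, Any]) -> str:
    """Concatenate readable text from a block's lines[].spans[]."""
    parts: list[str] = []
    for line in block.get("lines", []):
        for span in line.get("spans", []):
            if span.get("content"):
                parts.append(str(span["content"]).strip())
            elif span.get("html"):
                parts.append(html_to_text(str(span["html"])))
    return _WS_RE.sub(" ", " ".join(p for p in parts if p)).strip()
```

A block for which block_text returns an empty string is a candidate for either empty_or_garbled or skipping, depending on the rule chosen for decorative blocks.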

Coordinate mapping:

  • MinerU uses pixel-like page coordinates with origin at the top-left.
  • The frontend preview expects top-left coordinates named x0_pt, top_pt, x1_pt, and bottom_pt.
  • The backend returns MinerU coordinates directly as field coordinates and sets preview pageWidthPt/pageHeightPt from page_size, because the frontend scales both preview and overlay from the same coordinate system.

For multi-page output, page is page_idx + 1.
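
The mapping is a direct pass-through, which the following sketch illustrates. block_to_field_coords is a hypothetical helper name. The output keys x0_pt, top_pt, x1_pt, and bottom_pt come from the frontend expectation above.

```python
from typing import Any


def block_to_field_coords(page: dict[str, Any], block: dict[str, Any]) -> dict[str, Any]:
    """Map a MinerU bbox [x0, y0, x1, y1] to the frontend's top-left coordinates."""
    x0, y0, x1, y1 = block["bbox"]
    return {
        "page": page["page_idx"] + 1,  # MinerU pages are zero-based.
        "x0_pt": float(x0),
        "top_pt": float(y0),
        "x1_pt": float(x1),
        "bottom_pt": float(y1),
    }
```

No axis flip or unit conversion is applied, since both systems use a top-left origin and the frontend scales the overlay from the page size it is given.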

Word Comparison

The Word document remains the validation baseline. The backend uses the existing extract_word_text and validate_field_against_word behavior:

  • MinerU block text is normalized and compared against the Word full text.
  • The result status remains matched, unmatched, or empty_or_garbled.
  • The response keeps a fields array compatible with the current React UI.

This preserves the existing sidebar and highlighter behavior while changing the field source from the old Illustrator PDF text extraction to MinerU OCR/layout extraction.
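
For illustration only, the comparison can be pictured as a normalized substring check. The sketch below is a hypothetical stand-in, not the actual validate_field_against_word logic, whose normalization rules are defined in the existing backend. NFKC folding and whitespace removal are assumptions.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold width/case variants and drop all whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return "".join(ch for ch in folded if not ch.isspace())


def classify_block(block_text: str, word_full_text: str) -> str:
    """Return matched, unmatched, or empty_or_garbled for one MinerU block."""
    needle = normalize(block_text)
    if not needle:
        return "empty_or_garbled"
    return "matched" if needle in normalize(word_full_text) else "unmatched"
```

Removing whitespace before comparing makes the check tolerant of line breaks that MinerU and Word place differently, at the cost of ignoring spacing differences.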

Frontend Contract

The current ProcessResponse shape should remain mostly compatible:

  • preview.type: pdf
  • preview.url: generated PDF preview URL
  • preview.pageWidthPt: MinerU page width
  • preview.pageHeightPt: MinerU page height
  • fields[]: validation blocks with text, status, reason, matched excerpt, page, and coordinates

Small frontend changes may be needed to make optional typography metadata safe because MinerU blocks do not provide Illustrator font names or font sizes.

The right preview continues to render preview.pdf and draw overlay rectangles from fields[].
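
A minimal sketch of building the preview object from MinerU output, assuming page_size is ordered [width, height] and that the first page's size is representative for the whole preview. Preview and build_preview are hypothetical names.

```python
from typing import Any, TypedDict


class Preview(TypedDict):
    type: str
    url: str
    pageWidthPt: float
    pageHeightPt: float


def build_preview(preview_url: str, first_page: dict[str, Any]) -> Preview:
    """Build the preview payload using MinerU's page size as the coordinate space."""
    width, height = first_page["page_size"]  # Assumed order: [width, height].
    return {
        "type": "pdf",
        "url": preview_url,
        "pageWidthPt": float(width),
        "pageHeightPt": float(height),
    }
```

If pages can differ in size, a per-page width and height would be needed instead. That is outside the current contract.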

Error Handling

Return actionable API errors for:

  • Unsupported upload types.
  • .ai to PDF conversion failure.
  • Missing MINERU_API_KEY.
  • MinerU upload URL request failure.
  • MinerU upload PUT failure.
  • MinerU polling timeout.
  • MinerU task failure, including err_msg when present.
  • Missing structured JSON in the downloaded zip.

The API key must not be logged or included in response payloads.
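
One way to enforce this is to redact the key from every error message before it is logged or returned. The sketch below assumes a single error type with a machine-readable code. PipelineError, redact, to_error_payload, and the code value used in the example are hypothetical.

```python
from typing import Optional


class PipelineError(Exception):
    """An actionable processing error with a stable code."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code


def redact(message: str, secret: Optional[str]) -> str:
    """Replace any occurrence of the secret so it cannot leak into logs or responses."""
    return message.replace(secret, "[redacted]") if secret else message


def to_error_payload(exc: PipelineError, secret: Optional[str]) -> dict[str, str]:
    """Build the response body for an error, with the API key scrubbed."""
    return {"code": exc.code, "message": redact(str(exc), secret)}
```

For example, an upstream error string that happens to echo request headers would have the key replaced before it reaches the client.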

Testing

Backend tests should cover:

  • MinerU JSON block extraction from a sample local JSON file.
  • HTML table text conversion.
  • Coordinate mapping from MinerU bbox into field coordinates.
  • Word comparison integration using mocked MinerU results.
  • MinerU client control flow with mocked HTTP responses.

Manual verification should run the backend and frontend locally with MINERU_API_KEY set, upload the sample .ai and .docx, and confirm that result cards appear and corresponding boxes highlight on the right preview.