# MinerU AI Word Parse Design

## Goal

Replace the current AI text extraction pipeline with a MinerU-backed flow:

1. Accept an Illustrator `.ai` file and a Word `.docx` file.
2. Convert or normalize the `.ai` file into a PDF preview artifact.
3. Upload the PDF artifact to MinerU for document parsing.
4. Read MinerU JSON output blocks and their bounding boxes.
5. Compare MinerU text output against the Word document text.
6. Return field results that the existing React preview can highlight on the right side.
## Non-Goals

- Do not keep the old `parse_ai_document(...).fields` text extraction as the source of validation fields.
- Do not expose the MinerU API key to the frontend.
- Do not require a public callback URL; use polling because this is a local backend.
- Do not add a new manual annotation workflow.
## Backend Flow

The `/api/process` endpoint keeps its current two-file upload contract: `ai_file` and `word_file`.

For each job, the backend creates the existing runtime upload/output directories. The `.ai` file is converted into a PDF preview artifact using the existing `backend.app.ai_parser.parse_ai_document` conversion behavior. The resulting `preview.pdf` is copied into the job output directory and returned as the preview URL.

The backend then submits that PDF preview artifact to MinerU using the documented local-file upload flow:

1. `POST https://mineru.net/api/v4/file-urls/batch` with one file entry and `model_version: "vlm"`.
2. `PUT` the generated upload URL with the PDF bytes.
3. Poll `GET https://mineru.net/api/v4/extract-results/batch/{batch_id}` until the single file reaches `done`, `failed`, or a timeout.
4. When `done`, download `full_zip_url` into the job output directory.
5. Extract the zip into the job output directory and locate the structured JSON output.

The API token is read from `MINERU_API_KEY`. If it is missing, the backend returns a clear configuration error instead of attempting the request.
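The token check and the polling step in the flow above could be sketched roughly as follows. This is a minimal sketch, not the backend's actual client: `MineruError`, `load_api_key`, and `poll_batch` are hypothetical names, the HTTP call is injected as `fetch_json` so the loop can run without network access, and the assumed response shape (`data.extract_result[0]` carrying `state`, `full_zip_url`, and `err_msg`) should be checked against the MinerU API documentation. The timeout and interval defaults are placeholders.

```python
import os
import time
from typing import Callable, Mapping, Optional

RESULT_URL = "https://mineru.net/api/v4/extract-results/batch/{batch_id}"


class MineruError(RuntimeError):
    """Hypothetical error type for configuration, task, and timeout failures."""


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Read MINERU_API_KEY and fail before any request is attempted."""
    source = os.environ if env is None else env
    key = source.get("MINERU_API_KEY", "").strip()
    if not key:
        raise MineruError("MINERU_API_KEY is not set; configure it on the backend.")
    return key


def poll_batch(
    batch_id: str,
    fetch_json: Callable[[str], dict],
    timeout_s: float = 300.0,
    interval_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the single file is done or failed; return its full_zip_url."""
    url = RESULT_URL.format(batch_id=batch_id)
    deadline = clock() + timeout_s
    while True:
        payload = fetch_json(url)
        # Assumed response shape: {"data": {"extract_result": [{"state": ...}]}}
        results = (payload.get("data") or {}).get("extract_result") or [{}]
        item = results[0]
        state = item.get("state")
        if state == "done":
            zip_url = item.get("full_zip_url")
            if not zip_url:
                raise MineruError("MinerU reported done without full_zip_url.")
            return zip_url
        if state == "failed":
            detail = item.get("err_msg") or "no err_msg provided"
            raise MineruError(f"MinerU task failed: {detail}")
        if clock() >= deadline:
            raise MineruError(f"MinerU polling timed out after {timeout_s:.0f}s.")
        sleep(interval_s)
```

Injecting `fetch_json`, `sleep`, and `clock` keeps the control flow testable with mocked HTTP responses. In the real backend, `fetch_json` would presumably wrap an authenticated `GET` that carries the token in a request header.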
## MinerU JSON Mapping

The primary parser reads MinerU `middle.json`-style output because the sample JSON contains:

- `pdf_info[]`
- `page_idx`
- `page_size`
- `para_blocks[]`
- `discarded_blocks[]`
- block-level `bbox: [x0, y0, x1, y1]`
- nested `lines[].spans[]` with `content`, `html`, and span-level `bbox`

Each top-level `para_blocks` item becomes one validation result. For blocks with nested line/span content, the backend concatenates text-like span content. Table spans with `html` are converted to readable text by stripping tags and decoding HTML entities. If a block has no readable text, it can still be returned as `empty_or_garbled` when useful, but empty decorative blocks should be skipped.

Coordinate mapping:

- MinerU uses pixel-like page coordinates with origin at the top-left.
- The frontend preview expects top-left coordinates named `x0_pt`, `top_pt`, `x1_pt`, and `bottom_pt`.
- The backend returns MinerU coordinates directly as field coordinates and sets preview `pageWidthPt/pageHeightPt` from `page_size`, because the frontend scales both preview and overlay from the same coordinate system.

For multi-page output, `page` is `page_idx + 1`.
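The block-to-field mapping described in this section could look roughly like the sketch below. It is a minimal illustration under stated assumptions rather than the backend's parser: the function names are hypothetical, the `blocks` key used for nested child blocks is an assumption about the sample JSON, span texts are joined with a single space, and blocks without readable text are simply skipped instead of being reported as `empty_or_garbled`.

```python
import html
import re
from typing import Any, Dict, Iterator, List

_TAG_RE = re.compile(r"<[^>]+>")


def html_to_text(markup: str) -> str:
    """Strip tags first, then decode entities and collapse whitespace."""
    without_tags = _TAG_RE.sub(" ", markup)
    return " ".join(html.unescape(without_tags).split())


def iter_spans(block: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield spans from lines[].spans[], descending into child blocks."""
    for line in block.get("lines", []):
        yield from line.get("spans", [])
    for child in block.get("blocks", []):  # assumed key for nested blocks
        yield from iter_spans(child)


def block_text(block: Dict[str, Any]) -> str:
    """Concatenate text-like span content; table spans use their html."""
    parts = []
    for span in iter_spans(block):
        if span.get("html"):
            parts.append(html_to_text(span["html"]))
        elif span.get("content"):
            parts.append(str(span["content"]).strip())
    return " ".join(part for part in parts if part)


def blocks_to_fields(middle: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn each top-level para_blocks item into one field dict."""
    fields = []
    for page in middle.get("pdf_info", []):
        page_number = page.get("page_idx", 0) + 1  # pages are 1-based
        for block in page.get("para_blocks", []):
            text = block_text(block)
            if not text:
                continue  # this sketch skips blocks with no readable text
            x0, y0, x1, y1 = block["bbox"]
            fields.append({
                "text": text,
                "page": page_number,
                # MinerU coordinates are passed through unchanged.
                "x0_pt": x0, "top_pt": y0, "x1_pt": x1, "bottom_pt": y1,
            })
    return fields
```

Tags are removed before entities are decoded so that an escaped `&lt;b&gt;` in cell text is kept as literal characters instead of being mistaken for markup.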
## Word Comparison

The Word document remains the validation baseline. The backend uses the existing `extract_word_text` and `validate_field_against_word` behavior:

- MinerU block text is normalized and compared against the Word full text.
- The result status remains `matched`, `unmatched`, or `empty_or_garbled`.
- The response keeps a `fields` array compatible with the current React UI.

This preserves the existing sidebar and highlighter behavior while changing the field source from old AI PDF text extraction to MinerU OCR/layout extraction.
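To make the three statuses concrete, here is a hedged illustration of a normalized containment check. It does not reproduce `validate_field_against_word`, whose actual rules are not described here: `normalize` and `classify` are hypothetical names, and the normalization (NFKC folding, case folding, dropping all whitespace) and the "no alphanumeric characters" test for `empty_or_garbled` are assumptions chosen only for the example.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFKC-fold, casefold, and drop all whitespace (assumed normalization)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return "".join(folded.split())


def classify(block_text: str, word_full_text: str) -> str:
    """Return matched, unmatched, or empty_or_garbled for one block."""
    needle = normalize(block_text)
    # Assumed heuristic: text without any letter or digit is not comparable.
    if not any(ch.isalnum() for ch in needle):
        return "empty_or_garbled"
    return "matched" if needle in normalize(word_full_text) else "unmatched"


word_text = "Product sheet\nNet weight: 500g\n"
print(classify("Net  Weight: 500g", word_text))
print(classify("Net Weight: 750g", word_text))
print(classify("-- ##", word_text))
```

Dropping whitespace entirely makes the check tolerant of line breaks that OCR and Word place differently, at the cost of occasionally matching across word boundaries.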
## Frontend Contract

The current `ProcessResponse` shape should remain mostly compatible:

- `preview.type`: `pdf`
- `preview.url`: generated PDF preview URL
- `preview.pageWidthPt`: MinerU page width
- `preview.pageHeightPt`: MinerU page height
- `fields[]`: validation blocks with text, status, reason, matched excerpt, page, and coordinates

Small frontend changes may be needed to make optional typography metadata safe because MinerU blocks do not provide Illustrator font names or font sizes.

The right preview continues to render `preview.pdf` and draw overlay rectangles from `fields[]`.
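As an illustration of the contract above, the backend could assemble the payload along these lines. This is a sketch only: `build_process_response` is a hypothetical helper, and the typography keys `font_name` and `font_size` are assumed names standing in for whatever the current `ProcessResponse` uses. Emitting them as explicit nulls is one possible way to keep optional typography metadata safe, not a confirmed design decision.

```python
from typing import Any, Dict, List, Sequence


def build_process_response(
    preview_url: str,
    page_size: Sequence[float],
    fields: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Assemble a ProcessResponse-like dict from MinerU-derived values."""
    width, height = page_size  # MinerU page_size is [width, height]
    return {
        "preview": {
            "type": "pdf",
            "url": preview_url,
            "pageWidthPt": width,
            "pageHeightPt": height,
        },
        "fields": [
            {
                **field,
                # Hypothetical typography keys; MinerU has no font data.
                "font_name": field.get("font_name"),
                "font_size": field.get("font_size"),
            }
            for field in fields
        ],
    }
```

Because the page size and the field coordinates come from the same MinerU coordinate system, the frontend can keep scaling the preview and the overlay rectangles together.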
## Error Handling

Return actionable API errors for:

- Unsupported upload types.
- `.ai` to PDF conversion failure.
- Missing `MINERU_API_KEY`.
- MinerU upload URL request failure.
- MinerU upload PUT failure.
- MinerU polling timeout.
- MinerU task failure, including `err_msg` when present.
- Missing structured JSON in the downloaded zip.

The API key must not be logged or included in response payloads.
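Two of these cases lend themselves to a short sketch: detecting a zip without structured JSON, and keeping the key out of error text. The names `MissingStructuredJson`, `find_structured_json`, and `redact` are hypothetical. The sketch identifies the structured file by a top-level `pdf_info` key instead of by file name, which is an assumption based on the sample JSON described earlier.

```python
import json
import zipfile
from pathlib import Path


class MissingStructuredJson(RuntimeError):
    """Hypothetical error for a result zip without usable structured JSON."""


def find_structured_json(zip_path: Path, out_dir: Path) -> Path:
    """Extract the zip and return the first JSON file that has pdf_info."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    for candidate in sorted(out_dir.rglob("*.json")):
        try:
            data = json.loads(candidate.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or invalid JSON is not a candidate
        if isinstance(data, dict) and "pdf_info" in data:
            return candidate
    raise MissingStructuredJson(
        f"No JSON with a top-level 'pdf_info' key found in {zip_path.name}."
    )


def redact(message: str, api_key: str) -> str:
    """Mask the API key if it ever appears in a log line or error message."""
    return message.replace(api_key, "***") if api_key else message
```

Matching on content instead of a fixed file name keeps the lookup working if the archive layout differs from the sample, while still failing with a specific message when nothing usable is present.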
## Testing

Backend tests should cover:

- MinerU JSON block extraction from a sample local JSON file.
- HTML table text conversion.
- Coordinate mapping from MinerU bbox into field coordinates.
- Word comparison integration using mocked MinerU results.
- MinerU client control flow with mocked HTTP responses.

Manual verification should run the backend and frontend locally with `MINERU_API_KEY` set, upload the sample `.ai` and `.docx`, and confirm that result cards appear and corresponding boxes highlight on the right preview.