# MinerU AI Word Parse Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace the old Illustrator text extraction source with MinerU JSON blocks, then compare those blocks against the uploaded Word document and highlight MinerU bounding boxes in the existing preview.

**Architecture:** Keep `/api/process` and the current React response shape. Add a focused MinerU JSON mapper and a small MinerU HTTP client, then update `backend/app/pipeline.py` so it converts `.ai` to `preview.pdf`, sends that PDF to MinerU, maps returned blocks to fields, and validates each field against Word text. Frontend changes are limited to making font metadata optional and copy text match the new MinerU-backed flow.

**Tech Stack:** Python 3, FastAPI, stdlib `urllib`, stdlib `zipfile`, `python-docx`, `pypdf`, React, TypeScript, Vite, pytest.

---

## File Structure

- Create `backend/app/mineru_parser.py`: Convert MinerU `middle.json`-style data into normalized field dictionaries with text and bbox coordinates.
- Create `backend/app/mineru_client.py`: Submit a local PDF to MinerU, poll for completion, download and extract the result zip, and load structured JSON.
- Modify `backend/app/pipeline.py`: Use AI-to-PDF preview conversion, MinerU parsing, and Word validation instead of old AI text fields.
- Modify `frontend/src/types.ts`: Make font fields optional because MinerU output does not provide Illustrator font metadata.
- Modify `frontend/src/App.tsx`: Keep UI behavior, adjust product copy/status copy to MinerU-backed OCR/layout results, and avoid unsafe numeric formatting on optional font metadata.
- Create `tests/backend/test_mineru_parser.py`: Unit tests for sample JSON extraction, HTML table conversion, bbox mapping, and empty block handling.
- Create `tests/backend/test_mineru_client.py`: Unit tests for MinerU HTTP client success/failure control flow with mocked `urllib`.
- Modify `tests/backend/test_pipeline.py`: Mock MinerU calls and assert Word validation plus preview/highlight payload.
- Modify `tests/backend/test_api.py`: Mock MinerU calls for endpoint tests and add missing-token failure coverage.

## Task 1: MinerU JSON Mapper

**Files:**
- Create: `backend/app/mineru_parser.py`
- Test: `tests/backend/test_mineru_parser.py`

- [ ] **Step 1: Write failing parser tests**

```python
# tests/backend/test_mineru_parser.py
from __future__ import annotations

from backend.app.mineru_parser import parse_mineru_fields


def test_parse_mineru_fields_extracts_text_and_bbox() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [2772, 1961],
                "para_blocks": [
                    {
                        "bbox": [704, 134, 2106, 229],
                        "type": "title",
                        "lines": [
                            {
                                "spans": [
                                    {
                                        "type": "text",
                                        "content": "食品名称:天问礼品粽",
                                        "bbox": [704, 134, 2106, 229],
                                    }
                                ]
                            }
                        ],
                    }
                ],
            }
        ]
    }

    parsed = parse_mineru_fields(payload)

    assert parsed.page_width == 2772
    assert parsed.page_height == 1961
    assert parsed.fields == [
        {
            "page": 1,
            "text": "食品名称:天问礼品粽",
            "font_name": "",
            "font_size_pt": None,
            "font_height_mm": None,
            "x0_pt": 704.0,
            "top_pt": 134.0,
            "x1_pt": 2106.0,
            "bottom_pt": 229.0,
        }
    ]


def test_parse_mineru_fields_turns_table_html_into_text() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [1000, 800],
                "para_blocks": [
                    {
                        "bbox": [10, 20, 300, 200],
                        "type": "table",
                        "lines": [
                            {
                                "spans": [
                                    {
                                        "type": "table",
                                        # Any table markup works here: the parser strips tags and collapses whitespace.
                                        "html": "<table><tr><td>品种</td><td>规格</td></tr><tr><td>黑猪肉粽</td><td>130克×1</td></tr></table>",
                                    }
                                ]
                            }
                        ],
                    }
                ],
            }
        ]
    }

    parsed = parse_mineru_fields(payload)

    assert parsed.fields[0]["text"] == "品种 规格 黑猪肉粽 130克×1"


def test_parse_mineru_fields_skips_empty_decorative_blocks() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [1000, 800],
                "para_blocks": [
                    {"bbox": [1, 2, 3, 4], "type": "image", "lines": [{"spans": [{"type": "image"}]}]},
                    {"bbox": [5, 6, 7, 8], "type": "text", "lines": [{"spans": [{"content": " "}]}]},
                ],
            }
        ]
    }

    parsed = parse_mineru_fields(payload)

    assert parsed.fields == []
```

- [ ] **Step 2: Run parser tests and verify they fail**

Run: `pytest tests/backend/test_mineru_parser.py -v`

Expected: FAIL with `ModuleNotFoundError: No module named 'backend.app.mineru_parser'`.

- [ ] **Step 3: Implement the MinerU parser**

```python
# backend/app/mineru_parser.py
from __future__ import annotations

import html
import re
from dataclasses import dataclass
from typing import Any


@dataclass(slots=True)
class ParsedMineruDocument:
    page_width: float
    page_height: float
    fields: list[dict]


TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def _clean_text(value: str) -> str:
    without_tags = TAG_RE.sub(" ", html.unescape(value))
    return WHITESPACE_RE.sub(" ", without_tags).strip()


def _span_text(span: dict[str, Any]) -> str:
    if isinstance(span.get("content"), str):
        return _clean_text(span["content"])
    if isinstance(span.get("html"), str):
        return _clean_text(span["html"])
    return ""


def _block_text(block: dict[str, Any]) -> str:
    pieces: list[str] = []
    for line in block.get("lines") or []:
        for span in line.get("spans") or []:
            text = _span_text(span)
            if text:
                pieces.append(text)
    if not pieces and isinstance(block.get("text"), str):
        pieces.append(_clean_text(block["text"]))
    return WHITESPACE_RE.sub(" ", " ".join(pieces)).strip()


def _bbox(block: dict[str, Any]) -> tuple[float, float, float, float] | None:
    raw_bbox = block.get("bbox")
    if not isinstance(raw_bbox, list) or len(raw_bbox) != 4:
        return None
    try:
        x0, y0, x1, y1 = [float(value) for value in raw_bbox]
    except (TypeError, ValueError):
        return None
    if x1 <= x0 or y1 <= y0:
        return None
    return x0, y0, x1, y1


def _page_size(page: dict[str, Any]) -> tuple[float, float]:
    raw_size = page.get("page_size")
    if isinstance(raw_size, list) and len(raw_size) >= 2:
        return float(raw_size[0]), float(raw_size[1])
    return 1.0, 1.0


def parse_mineru_fields(payload: dict[str, Any]) -> ParsedMineruDocument:
    pages = payload.get("pdf_info")
    if not isinstance(pages, list) or not pages:
        raise ValueError("MinerU JSON does not contain pdf_info pages")

    first_width, first_height = _page_size(pages[0])
    fields: list[dict] = []
    for page in pages:
        page_number = int(page.get("page_idx", 0)) + 1
        for block in page.get("para_blocks") or []:
            text = _block_text(block)
            box = _bbox(block)
            if not text or box is None:
                continue
            x0, y0, x1, y1 = box
            fields.append(
                {
                    "page": page_number,
                    "text": text,
                    "font_name": "",
                    "font_size_pt": None,
                    "font_height_mm": None,
                    "x0_pt": x0,
                    "top_pt": y0,
                    "x1_pt": x1,
                    "bottom_pt": y1,
                }
            )

    return ParsedMineruDocument(page_width=first_width, page_height=first_height, fields=fields)
```

- [ ] **Step 4: Run parser tests and verify they pass**

Run: `pytest tests/backend/test_mineru_parser.py -v`

Expected: PASS.
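
The parser passes each bbox through in MinerU's own page units and reports the first page's `page_size` alongside it, so a highlight overlay only needs the ratio between the two. A minimal sketch of that conversion follows, assuming the preview positions highlights with percentages of the page. `bbox_to_percent` is a hypothetical helper for illustration and is not one of the planned modules.

```python
from __future__ import annotations


def bbox_to_percent(field: dict, page_width: float, page_height: float) -> dict[str, float]:
    """Express a field's bbox as percentages of the page (hypothetical helper)."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page size must be positive")
    return {
        "left": field["x0_pt"] / page_width * 100,
        "top": field["top_pt"] / page_height * 100,
        "width": (field["x1_pt"] - field["x0_pt"]) / page_width * 100,
        "height": (field["bottom_pt"] - field["top_pt"]) / page_height * 100,
    }


# Values taken from the first parser test above.
field = {"x0_pt": 704.0, "top_pt": 134.0, "x1_pt": 2106.0, "bottom_pt": 229.0}
box = bbox_to_percent(field, page_width=2772.0, page_height=1961.0)
print({name: round(value, 2) for name, value in box.items()})
```

Because the result is relative to the page, it stays valid at any zoom level, provided the preview renders the same page that MinerU measured.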

## Task 2: MinerU HTTP Client

**Files:**
- Create: `backend/app/mineru_client.py`
- Test: `tests/backend/test_mineru_client.py`

- [ ] **Step 1: Write failing MinerU client tests**

```python
# tests/backend/test_mineru_client.py
from __future__ import annotations

import io
import json
import zipfile
from pathlib import Path
from urllib.error import HTTPError

import pytest

from backend.app import mineru_client
from backend.app.mineru_client import MineruClient, MineruClientError


class FakeResponse:
    def __init__(self, status: int, body: bytes):
        self.status = status
        self._body = body

    def read(self) -> bytes:
        return self._body

    def __enter__(self) -> "FakeResponse":
        return self

    def __exit__(self, *_args: object) -> None:
        return None


def _zip_with_json() -> bytes:
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo_middle.json", json.dumps({"pdf_info": [{"page_idx": 0, "page_size": [1, 1], "para_blocks": []}]}))
    return buffer.getvalue()


def test_submit_pdf_downloads_and_loads_structured_json(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
    calls: list[str] = []

    def fake_urlopen(request, timeout=0):
        url = request.full_url if hasattr(request, "full_url") else request
        calls.append(str(url))
        if str(url).endswith("/api/v4/file-urls/batch"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"batch_id": "batch-1", "file_urls": ["https://upload.example/file"]}}).encode())
        if str(url) == "https://upload.example/file":
            return FakeResponse(200, b"")
        if str(url).endswith("/api/v4/extract-results/batch/batch-1"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"extract_result": [{"state": "done", "full_zip_url": "https://download.example/result.zip"}]}}).encode())
        if str(url) == "https://download.example/result.zip":
            return FakeResponse(200, _zip_with_json())
        raise AssertionError(f"unexpected URL {url}")

    monkeypatch.setattr(mineru_client.request, "urlopen", fake_urlopen)
    pdf_path = tmp_path / "preview.pdf"
    pdf_path.write_bytes(b"%PDF-1.7")

    payload = MineruClient(api_key="secret", poll_interval_seconds=0, max_polls=1).parse_pdf(pdf_path, tmp_path)

    assert payload["pdf_info"][0]["page_size"] == [1, 1]
    assert calls == [
        "https://mineru.net/api/v4/file-urls/batch",
        "https://upload.example/file",
        "https://mineru.net/api/v4/extract-results/batch/batch-1",
        "https://download.example/result.zip",
    ]
    assert (tmp_path / "mineru_result.zip").exists()


def test_submit_pdf_raises_on_failed_task(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
    def fake_urlopen(request, timeout=0):
        url = request.full_url if hasattr(request, "full_url") else request
        if str(url).endswith("/api/v4/file-urls/batch"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"batch_id": "batch-1", "file_urls": ["https://upload.example/file"]}}).encode())
        if str(url) == "https://upload.example/file":
            return FakeResponse(200, b"")
        if str(url).endswith("/api/v4/extract-results/batch/batch-1"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"extract_result": [{"state": "failed", "err_msg": "bad pdf"}]}}).encode())
        raise AssertionError(f"unexpected URL {url}")

    monkeypatch.setattr(mineru_client.request, "urlopen", fake_urlopen)
    pdf_path = tmp_path / "preview.pdf"
    pdf_path.write_bytes(b"%PDF-1.7")

    with pytest.raises(MineruClientError, match="bad pdf"):
        MineruClient(api_key="secret", poll_interval_seconds=0, max_polls=1).parse_pdf(pdf_path, tmp_path)
```

- [ ] **Step 2: Run client tests and verify they fail**

Run: `pytest tests/backend/test_mineru_client.py -v`

Expected: FAIL with `ModuleNotFoundError: No module named 'backend.app.mineru_client'`.

- [ ] **Step 3: Implement the MinerU client**

```python
# backend/app/mineru_client.py
from __future__ import annotations

import json
import time
import zipfile
from pathlib import Path
from urllib import request
from urllib.error import HTTPError, URLError


class MineruClientError(RuntimeError):
    pass


class MineruClient:
    def __init__(self, api_key: str, poll_interval_seconds: float = 2.0, max_polls: int = 90) -> None:
        self.api_key = api_key
        self.poll_interval_seconds = poll_interval_seconds
        self.max_polls = max_polls

    def parse_pdf(self, pdf_path: Path, output_dir: Path) -> dict:
        output_dir.mkdir(parents=True, exist_ok=True)
        batch_id, upload_url = self._request_upload_url(pdf_path.name)
        self._upload_file(upload_url, pdf_path)
        zip_url = self._poll_result(batch_id)
        zip_path = self._download_zip(zip_url, output_dir)
        extract_dir = output_dir / "mineru_result"
        self._extract_zip(zip_path, extract_dir)
        return self._load_structured_json(extract_dir)

    def _headers(self) -> dict[str, str]:
        return {"Authorization": f"Bearer {self.api_key}", "Accept": "*/*"}

    def _json_request(self, url: str, method: str = "GET", payload: dict | None = None) -> dict:
        body = None if payload is None else json.dumps(payload).encode("utf-8")
        headers = self._headers()
        if payload is not None:
            headers["Content-Type"] = "application/json"
        req = request.Request(url, data=body, headers=headers, method=method)
        try:
            with request.urlopen(req, timeout=30) as response:
                data = json.loads(response.read().decode("utf-8"))
        except (HTTPError, URLError, TimeoutError, json.JSONDecodeError) as exc:
            raise MineruClientError(f"MinerU request failed: {exc}") from exc
        if data.get("code") != 0:
            raise MineruClientError(str(data.get("msg") or "MinerU API returned an error"))
        return data

    def _request_upload_url(self, filename: str) -> tuple[str, str]:
        data = self._json_request(
            "https://mineru.net/api/v4/file-urls/batch",
            method="POST",
            payload={"files": [{"name": filename, "data_id": filename}], "model_version": "vlm"},
        )
        batch_id = data["data"]["batch_id"]
        file_urls = data["data"]["file_urls"]
        if not file_urls:
            raise MineruClientError("MinerU did not return an upload URL")
        return batch_id, file_urls[0]

    def _upload_file(self, upload_url: str, pdf_path: Path) -> None:
        req = request.Request(upload_url, data=pdf_path.read_bytes(), method="PUT")
        try:
            with request.urlopen(req, timeout=120) as response:
                if response.status >= 400:
                    raise MineruClientError(f"MinerU upload failed with HTTP {response.status}")
        except (HTTPError, URLError, TimeoutError) as exc:
            raise MineruClientError(f"MinerU upload failed: {exc}") from exc

    def _poll_result(self, batch_id: str) -> str:
        url = f"https://mineru.net/api/v4/extract-results/batch/{batch_id}"
        for _attempt in range(self.max_polls):
            data = self._json_request(url)
            extract_result = data.get("data", {}).get("extract_result")
            if isinstance(extract_result, list):
                result = extract_result[0] if extract_result else {}
            else:
                result = extract_result or {}
            state = result.get("state")
            if state == "done":
                zip_url = result.get("full_zip_url")
                if not zip_url:
                    raise MineruClientError("MinerU finished without full_zip_url")
                return zip_url
            if state == "failed":
                raise MineruClientError(str(result.get("err_msg") or "MinerU parsing failed"))
            time.sleep(self.poll_interval_seconds)
        raise MineruClientError("MinerU polling timed out")

    def _download_zip(self, zip_url: str, output_dir: Path) -> Path:
        target = output_dir / "mineru_result.zip"
        req = request.Request(zip_url, headers={"Accept": "*/*"}, method="GET")
        try:
            with request.urlopen(req, timeout=120) as response:
                target.write_bytes(response.read())
        except (HTTPError, URLError, TimeoutError) as exc:
            raise MineruClientError(f"MinerU zip download failed: {exc}") from exc
        return target

    def _extract_zip(self, zip_path: Path, extract_dir: Path) -> None:
        extract_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)

    def _load_structured_json(self, extract_dir: Path) -> dict:
        candidates = sorted(extract_dir.rglob("*middle.json")) + sorted(extract_dir.rglob("*_model.json"))
        if not candidates:
            raise MineruClientError("MinerU result zip did not contain structured JSON")
        return json.loads(candidates[0].read_text(encoding="utf-8"))
```

- [ ] **Step 4: Run client tests and verify they pass**

Run: `pytest tests/backend/test_mineru_client.py -v`

Expected: PASS.

## Task 3: Pipeline Integration

**Files:**
- Modify: `backend/app/pipeline.py`
- Modify: `tests/backend/test_pipeline.py`
- Modify: `tests/backend/test_api.py`

- [ ] **Step 1: Write failing pipeline tests with a mocked MinerU document**

```python
# tests/backend/test_pipeline.py
from pathlib import Path

import pytest

from backend.app import pipeline
from backend.app.pipeline import process_files


WORKDIR = Path("/Users/icemilk/Workspace/zld_POC")
AI_FILE = WORKDIR / "【2026-04-09】端午 - 背标 - 天问.ai"
DOCX_FILE = WORKDIR / "天问礼品粽【260331】.docx"
OUTPUT_DIR = WORKDIR / ".tmp_test_output"


def test_process_files_builds_preview_and_mineru_field_results(monkeypatch: pytest.MonkeyPatch) -> None:
    def fake_parse_with_mineru(_preview_path: Path, _output_dir: Path):
        return {
            "pdf_info": [
                {
                    "page_idx": 0,
                    "page_size": [2772, 1961],
                    "para_blocks": [
                        {
                            "bbox": [704, 134, 2106, 229],
                            "lines": [{"spans": [{"content": "食品名称:天问礼品粽"}]}],
                        },
                        {
                            "bbox": [10, 20, 40, 60],
                            "lines": [{"spans": [{"content": "Word中不存在的内容"}]}],
                        },
                    ],
                }
            ]
        }

    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", fake_parse_with_mineru)

    result = process_files(AI_FILE, DOCX_FILE, OUTPUT_DIR, job_id="test-job")

    assert result["preview"]["type"] == "pdf"
    assert result["preview"]["url"] == "/api/files/test-job/preview.pdf"
    assert result["preview"]["pageWidthPt"] == 2772
    assert result["preview"]["pageHeightPt"] == 1961
    assert result["fields"][0]["text"] == "食品名称:天问礼品粽"
    assert result["fields"][0]["validation_status"] == "matched"
    assert result["fields"][0]["x0_pt"] == 704.0
    assert any(field["validation_status"] == "unmatched" for field
               in result["fields"])
    assert (OUTPUT_DIR / "preview.pdf").exists()
```

- [ ] **Step 2: Replace API tests with mocked MinerU coverage**

```python
# tests/backend/test_api.py
from pathlib import Path

import pytest
from fastapi.testclient import TestClient

from backend.app import pipeline
from backend.app.main import app


WORKDIR = Path("/Users/icemilk/Workspace/zld_POC")
AI_FILE = WORKDIR / "【2026-04-09】端午 - 背标 - 天问.ai"
DOCX_FILE = WORKDIR / "天问礼品粽【260331】.docx"

client = TestClient(app)


def fake_mineru_payload() -> dict:
    return {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [2772, 1961],
                "para_blocks": [
                    {
                        "bbox": [704, 134, 2106, 229],
                        "lines": [{"spans": [{"content": "食品名称:天问礼品粽"}]}],
                    }
                ],
            }
        ]
    }


def test_process_endpoint_returns_preview_and_fields(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", lambda _preview_path, _output_dir: fake_mineru_payload())

    with AI_FILE.open("rb") as ai_fp, DOCX_FILE.open("rb") as docx_fp:
        response = client.post(
            "/api/process",
            files={
                "ai_file": (AI_FILE.name, ai_fp, "application/postscript"),
                "word_file": (
                    DOCX_FILE.name,
                    docx_fp,
                    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                ),
            },
        )

    assert response.status_code == 200
    payload = response.json()
    assert payload["preview"]["type"] == "pdf"
    assert payload["preview"]["pageWidthPt"] == 2772
    assert payload["fields"]
    assert payload["fields"][0]["text"] == "食品名称:天问礼品粽"


def test_process_endpoint_uses_default_sample_files_when_uploads_are_missing(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", lambda _preview_path, _output_dir: fake_mineru_payload())

    response = client.post("/api/process")

    assert response.status_code == 200
    payload = response.json()
    assert payload["preview"]["type"] == "pdf"
    assert payload["fields"]
    assert any(field["text"] for field in payload["fields"])


def test_process_endpoint_surfaces_missing_mineru_key(monkeypatch: pytest.MonkeyPatch) -> None:
    def fake_parse_with_mineru(_preview_path, _output_dir):
        raise RuntimeError("MINERU_API_KEY is required")

    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", fake_parse_with_mineru)

    response = client.post("/api/process")

    assert response.status_code == 500
    assert response.json()["detail"] == "MINERU_API_KEY is required"
```

- [ ] **Step 3: Run integration tests and verify they fail**

Run: `pytest tests/backend/test_pipeline.py tests/backend/test_api.py -v`

Expected: FAIL because `pipeline._parse_preview_with_mineru` does not exist and `process_files` still uses `ai_document.fields`.

- [ ] **Step 4: Update `backend/app/pipeline.py`**

```python
# backend/app/pipeline.py
from __future__ import annotations

import os
import shutil
from pathlib import Path

from backend.app.ai_parser import parse_ai_document
from backend.app.mineru_client import MineruClient
from backend.app.mineru_parser import parse_mineru_fields
from backend.app.text_validation import validate_field_against_word
from backend.app.word_parser import extract_word_text


def _sort_key(field: dict) -> tuple[int, int, float, float]:
    status_rank = {"matched": 0, "unmatched": 1, "empty_or_garbled": 2}
    return (
        status_rank.get(field["validation_status"], 9),
        field["page"],
        field["top_pt"],
        field["x0_pt"],
    )


def _parse_preview_with_mineru(preview_path: Path, output_dir: Path) -> dict:
    api_key = os.environ.get("MINERU_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("MINERU_API_KEY is required")
    return MineruClient(api_key=api_key).parse_pdf(preview_path, output_dir / "mineru")


def process_files(ai_path: Path, word_path: Path, output_dir: Path, job_id: str | None = None) -> dict:
    output_dir.mkdir(parents=True, exist_ok=True)
    ai_document = parse_ai_document(ai_path, output_dir / "parsed")
    word_text = extract_word_text(word_path)

    preview_filename = "preview.pdf"
    preview_target = output_dir / preview_filename
    if ai_document.preview_path != preview_target:
        shutil.copy2(ai_document.preview_path,
                     preview_target)

    mineru_payload = _parse_preview_with_mineru(preview_target, output_dir)
    mineru_document = parse_mineru_fields(mineru_payload)

    fields: list[dict] = []
    for index, field in enumerate(mineru_document.fields, start=1):
        validation = validate_field_against_word(field["text"], word_text)
        fields.append(
            {
                "id": f"field-{index}",
                **field,
                "normalized_text": validation.normalized_text,
                "validation_status": validation.status,
                "validation_reason": validation.reason,
                "matched_excerpt": validation.matched_excerpt,
            }
        )

    fields.sort(key=_sort_key)
    preview_url = f"/api/files/{job_id}/{preview_filename}" if job_id else preview_filename
    return {
        "preview": {
            "type": "pdf",
            "url": preview_url,
            "pageWidthPt": mineru_document.page_width,
            "pageHeightPt": mineru_document.page_height,
        },
        "fields": fields,
    }
```

- [ ] **Step 5: Run integration tests and verify they pass**

Run: `pytest tests/backend/test_pipeline.py tests/backend/test_api.py -v`

Expected: PASS.

## Task 4: Frontend Type and Copy Compatibility

**Files:**
- Modify: `frontend/src/types.ts`
- Modify: `frontend/src/App.tsx`

- [ ] **Step 1: Update TypeScript types**

```ts
// frontend/src/types.ts
export type FieldResult = {
  id: string
  page: number
  text: string
  font_name?: string | null
  font_size_pt?: number | null
  font_height_mm?: number | null
  x0_pt: number
  top_pt: number
  x1_pt: number
  bottom_pt: number
  normalized_text: string
  validation_status: ValidationStatus
  validation_reason: string
  matched_excerpt: string | null
}
```

- [ ] **Step 2: Update `App.tsx` display guards and copy**

```tsx
// Element names below are placeholders; keep the wrappers and class names already used in App.tsx.

// Replace the hero copy with:
<p>
  上传 Illustrator 源文件与 Word 校对稿,系统会将设计文件转换为 PDF 后交给 MinerU 解析,
  再把识别出的版面文字与 Word 内容逐块比对。
</p>

// Replace the font metadata rendering with:
<div>
  <span>第 {field.page} 页</span>
  {field.font_name ? <span>{field.font_name}</span> : null}
  {typeof field.font_size_pt === 'number' ? <span>{field.font_size_pt} pt</span> : null}
  {typeof field.font_height_mm === 'number' ? <span>{field.font_height_mm.toFixed(1)} mm</span> : null}
</div>
```

- [ ] **Step 3: Run frontend type check**

Run: `cd frontend && npm run build`

Expected: PASS.

## Task 5: Full Verification

**Files:**
- No new files.

- [ ] **Step 1: Run backend tests**

Run: `pytest tests/backend/test_mineru_parser.py tests/backend/test_mineru_client.py tests/backend/test_pipeline.py tests/backend/test_api.py -v`

Expected: PASS.

- [ ] **Step 2: Run frontend build**

Run: `cd frontend && npm run build`

Expected: PASS.

- [ ] **Step 3: Run local manual verification with the real MinerU API**

Set `MINERU_API_KEY` in the shell environment, then run the backend:

```bash
./scripts/start_backend.sh
```

Run frontend in another terminal:

```bash
./scripts/start_frontend.sh
```

Open the frontend, upload the sample `.ai` and `.docx`, click `开始解析`, and verify:

- The request completes without leaking the token to browser requests.
- The right preview shows the PDF.
- The left result list contains MinerU-derived text blocks.
- Clicking a result card highlights the corresponding MinerU bbox on the right preview.
- Blocks found in the Word document show `校验成功`; missing blocks show `校验失败`.

- [ ] **Step 4: Skip commit in this workspace**

This project directory is not a git repository, so do not run `git commit`. Report the changed file list in the final response instead.