ZYYC12330 bbb4dd43b3 Initial commit: packaging review POC, Docker, and frontend/backend

Made-with: Cursor

2026-04-15 17:18:49 +08:00

MinerU AI Word Parse Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Replace the old Illustrator text extraction source with MinerU JSON blocks, then compare those blocks against the uploaded Word document and highlight MinerU bounding boxes in the existing preview.

Architecture: Keep /api/process and the current React response shape. Add a focused MinerU JSON mapper and a small MinerU HTTP client, then update backend/app/pipeline.py so it converts .ai to preview.pdf, sends that PDF to MinerU, maps returned blocks to fields, and validates each field against Word text. Frontend changes are limited to making font metadata optional and updating the copy text to match the new MinerU-backed flow.

Tech Stack: Python 3, FastAPI, stdlib urllib, stdlib zipfile, python-docx, pypdf, React, TypeScript, Vite, pytest.
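
The architecture above ends with validating each MinerU block against the Word text. The project's real validator (backend/app/text_validation.py) is not shown in this plan, so the following is only a minimal sketch of one way such a check could work. It assumes NFKC normalization plus whitespace removal followed by a substring test. The names ValidationSketch, normalize, and validate_block are hypothetical; only the result field names and the three status values are taken from the pipeline code later in this plan.

```python
from __future__ import annotations

import re
import unicodedata
from dataclasses import dataclass


@dataclass
class ValidationSketch:
    # Field names mirror the attributes that pipeline.py reads from the validator.
    normalized_text: str
    status: str  # "matched", "unmatched", or "empty_or_garbled"
    reason: str
    matched_excerpt: str | None


def normalize(text: str) -> str:
    # Assumption: fold full-width/half-width variants (NFKC) and drop all whitespace.
    return re.sub(r"\s+", "", unicodedata.normalize("NFKC", text))


def validate_block(block_text: str, word_text: str) -> ValidationSketch:
    needle = normalize(block_text)
    if not needle:
        return ValidationSketch("", "empty_or_garbled", "no usable text", None)
    if needle in normalize(word_text):
        return ValidationSketch(needle, "matched", "found in Word text", needle)
    return ValidationSketch(needle, "unmatched", "not found in Word text", None)


# A half-width colon plus a stray space still matches a full-width colon in Word.
result = validate_block("食品名称: 天问礼品粽", "食品名称：天问礼品粽\n净含量：780克")
```

A plain substring test like this is strict: a single OCR misread turns a block into "unmatched", which may or may not be the behavior the real validator aims for.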

File Structure

Create backend/app/mineru_parser.py: Convert MinerU middle.json-style data into normalized field dictionaries with text and bbox coordinates.
Create backend/app/mineru_client.py: Submit a local PDF to MinerU, poll for completion, download and extract the result zip, and load structured JSON.
Modify backend/app/pipeline.py: Use AI-to-PDF preview conversion, MinerU parsing, and Word validation instead of old AI text fields.
Modify frontend/src/types.ts: Make font fields optional because MinerU output does not provide Illustrator font metadata.
Modify frontend/src/App.tsx: Keep UI behavior, adjust the product and status copy to describe MinerU-backed OCR/layout results, and avoid unsafe numeric formatting on optional font metadata.
Create tests/backend/test_mineru_parser.py: Unit tests for sample JSON extraction, HTML table conversion, bbox mapping, and empty block handling.
Create tests/backend/test_mineru_client.py: Unit tests for MinerU HTTP client success/failure control flow with mocked urllib.
Modify tests/backend/test_pipeline.py: Mock MinerU calls and assert Word validation plus preview/highlight payload.
Modify tests/backend/test_api.py: Mock MinerU calls for endpoint tests and add missing-token failure coverage.
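
The "normalized field dictionaries" that mineru_parser.py is meant to produce can be written down as a type. The sketch below is illustrative only: the key names come from the parser tests in Task 1, while the MineruField type and the make_field helper are hypothetical and are not part of the planned files.

```python
from __future__ import annotations

from typing import Optional, TypedDict


class MineruField(TypedDict):
    page: int                        # 1-based page number (page_idx + 1)
    text: str                        # flattened block text
    font_name: str                   # always "" because MinerU has no font metadata
    font_size_pt: Optional[float]    # always None for MinerU-derived fields
    font_height_mm: Optional[float]  # always None for MinerU-derived fields
    x0_pt: float
    top_pt: float
    x1_pt: float
    bottom_pt: float


def make_field(page: int, text: str, bbox: tuple[float, float, float, float]) -> MineruField:
    # Hypothetical helper: build one field from a MinerU [x0, y0, x1, y1] bbox.
    x0, top, x1, bottom = bbox
    return MineruField(
        page=page,
        text=text,
        font_name="",
        font_size_pt=None,
        font_height_mm=None,
        x0_pt=float(x0),
        top_pt=float(top),
        x1_pt=float(x1),
        bottom_pt=float(bottom),
    )


field = make_field(1, "食品名称:天问礼品粽", (704, 134, 2106, 229))
```

Keeping the three font keys present but empty lets the existing response shape stay stable while the frontend types mark them optional.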

Task 1: MinerU JSON Mapper

Files:

Create: backend/app/mineru_parser.py
Test: tests/backend/test_mineru_parser.py

Step 1: Write failing parser tests

# tests/backend/test_mineru_parser.py
from __future__ import annotations

from backend.app.mineru_parser import parse_mineru_fields


def test_parse_mineru_fields_extracts_text_and_bbox() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [2772, 1961],
                "para_blocks": [
                    {
                        "bbox": [704, 134, 2106, 229],
                        "type": "title",
                        "lines": [
                            {
                                "spans": [
                                    {
                                        "type": "text",
                                        "content": "食品名称:天问礼品粽",
                                        "bbox": [704, 134, 2106, 229],
                                    }
                                ]
                            }
                        ],
                    }
                ],
            }
        ]
    }

    parsed = parse_mineru_fields(payload)

    assert parsed.page_width == 2772
    assert parsed.page_height == 1961
    assert parsed.fields == [
        {
            "page": 1,
            "text": "食品名称:天问礼品粽",
            "font_name": "",
            "font_size_pt": None,
            "font_height_mm": None,
            "x0_pt": 704.0,
            "top_pt": 134.0,
            "x1_pt": 2106.0,
            "bottom_pt": 229.0,
        }
    ]


def test_parse_mineru_fields_turns_table_html_into_text() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [1000, 800],
                "para_blocks": [
                    {
                        "bbox": [10, 20, 300, 200],
                        "type": "table",
                        "lines": [
                            {
                                "spans": [
                                    {
                                        "type": "table",
                                        "html": "<table><tr><td>品种</td><td>规格</td></tr><tr><td>黑猪肉粽</td><td>130克×1</td></tr></table>",
                                    }
                                ]
                            }
                        ],
                    }
                ],
            }
        ]
    }

    parsed = parse_mineru_fields(payload)

    assert parsed.fields[0]["text"] == "品种 规格 黑猪肉粽 130克×1"


def test_parse_mineru_fields_skips_empty_decorative_blocks() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [1000, 800],
                "para_blocks": [
                    {"bbox": [1, 2, 3, 4], "type": "image", "lines": [{"spans": [{"type": "image"}]}]},
                    {"bbox": [5, 6, 7, 8], "type": "text", "lines": [{"spans": [{"content": "  "}]}]},
                ],
            }
        ]
    }

    parsed = parse_mineru_fields(payload)

    assert parsed.fields == []

Step 2: Run parser tests and verify they fail

Run: pytest tests/backend/test_mineru_parser.py -v

Expected: FAIL with ModuleNotFoundError: No module named 'backend.app.mineru_parser'.

Step 3: Implement the MinerU parser

# backend/app/mineru_parser.py
from __future__ import annotations

import html
import re
from dataclasses import dataclass
from typing import Any


@dataclass(slots=True)
class ParsedMineruDocument:
    page_width: float
    page_height: float
    fields: list[dict]


TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def _clean_text(value: str) -> str:
    without_tags = TAG_RE.sub(" ", html.unescape(value))
    return WHITESPACE_RE.sub(" ", without_tags).strip()


def _span_text(span: dict[str, Any]) -> str:
    if isinstance(span.get("content"), str):
        return _clean_text(span["content"])
    if isinstance(span.get("html"), str):
        return _clean_text(span["html"])
    return ""


def _block_text(block: dict[str, Any]) -> str:
    pieces: list[str] = []
    for line in block.get("lines") or []:
        for span in line.get("spans") or []:
            text = _span_text(span)
            if text:
                pieces.append(text)
    if not pieces and isinstance(block.get("text"), str):
        pieces.append(_clean_text(block["text"]))
    return WHITESPACE_RE.sub(" ", " ".join(pieces)).strip()


def _bbox(block: dict[str, Any]) -> tuple[float, float, float, float] | None:
    raw_bbox = block.get("bbox")
    if not isinstance(raw_bbox, list) or len(raw_bbox) != 4:
        return None
    try:
        x0, y0, x1, y1 = [float(value) for value in raw_bbox]
    except (TypeError, ValueError):
        return None
    if x1 <= x0 or y1 <= y0:
        return None
    return x0, y0, x1, y1


def _page_size(page: dict[str, Any]) -> tuple[float, float]:
    raw_size = page.get("page_size")
    if isinstance(raw_size, list) and len(raw_size) >= 2:
        return float(raw_size[0]), float(raw_size[1])
    return 1.0, 1.0


def parse_mineru_fields(payload: dict[str, Any]) -> ParsedMineruDocument:
    pages = payload.get("pdf_info")
    if not isinstance(pages, list) or not pages:
        raise ValueError("MinerU JSON does not contain pdf_info pages")

    first_width, first_height = _page_size(pages[0])
    fields: list[dict] = []

    for page in pages:
        page_number = int(page.get("page_idx", 0)) + 1
        for block in page.get("para_blocks") or []:
            text = _block_text(block)
            box = _bbox(block)
            if not text or box is None:
                continue
            x0, y0, x1, y1 = box
            fields.append(
                {
                    "page": page_number,
                    "text": text,
                    "font_name": "",
                    "font_size_pt": None,
                    "font_height_mm": None,
                    "x0_pt": x0,
                    "top_pt": y0,
                    "x1_pt": x1,
                    "bottom_pt": y1,
                }
            )

    return ParsedMineruDocument(page_width=first_width, page_height=first_height, fields=fields)

Step 4: Run parser tests and verify they pass

Run: pytest tests/backend/test_mineru_parser.py -v

Expected: PASS.
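
To see why the table test expects a flat, space-separated string, the cleaning step from the parser can be run in isolation. The sketch below copies the same three operations (unescape, strip tags, collapse whitespace); the sample strings are illustrative and not taken from real MinerU output.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(value: str) -> str:
    # Same steps as _clean_text: unescape entities, replace tags with spaces,
    # then collapse runs of whitespace.
    without_tags = TAG_RE.sub(" ", html.unescape(value))
    return WHITESPACE_RE.sub(" ", without_tags).strip()


# Table HTML collapses into cell texts separated by single spaces.
flattened = clean_text("<table><tr><td>品种</td><td>规格</td></tr></table>")

# Caveat: plain text that contains both "<" and ">" looks like a tag to the regex,
# so everything between the two characters is dropped.
lossy = clean_text("钠<5mg 脂肪>3g")
```

Here flattened is "品种 规格", while lossy comes out as "钠 3g". Because the parser also runs this cleaning on plain content spans, label text containing comparison signs could lose characters. If that turns out to matter for real packaging copy, one option is to strip tags only for html spans.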

Task 2: MinerU HTTP Client

Files:

Create: backend/app/mineru_client.py
Test: tests/backend/test_mineru_client.py

Step 1: Write failing MinerU client tests

# tests/backend/test_mineru_client.py
from __future__ import annotations

import io
import json
import zipfile
from pathlib import Path

import pytest

from backend.app import mineru_client
from backend.app.mineru_client import MineruClient, MineruClientError


class FakeResponse:
    def __init__(self, status: int, body: bytes):
        self.status = status
        self._body = body

    def read(self) -> bytes:
        return self._body

    def __enter__(self) -> "FakeResponse":
        return self

    def __exit__(self, *_args: object) -> None:
        return None


def _zip_with_json() -> bytes:
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo_middle.json", json.dumps({"pdf_info": [{"page_idx": 0, "page_size": [1, 1], "para_blocks": []}]}))
    return buffer.getvalue()


def test_submit_pdf_downloads_and_loads_structured_json(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
    calls: list[str] = []

    def fake_urlopen(request, timeout=0):
        url = request.full_url if hasattr(request, "full_url") else request
        calls.append(str(url))
        if str(url).endswith("/api/v4/file-urls/batch"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"batch_id": "batch-1", "file_urls": ["https://upload.example/file"]}}).encode())
        if str(url) == "https://upload.example/file":
            return FakeResponse(200, b"")
        if str(url).endswith("/api/v4/extract-results/batch/batch-1"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"extract_result": [{"state": "done", "full_zip_url": "https://download.example/result.zip"}]}}).encode())
        if str(url) == "https://download.example/result.zip":
            return FakeResponse(200, _zip_with_json())
        raise AssertionError(f"unexpected URL {url}")

    monkeypatch.setattr(mineru_client.request, "urlopen", fake_urlopen)
    pdf_path = tmp_path / "preview.pdf"
    pdf_path.write_bytes(b"%PDF-1.7")

    payload = MineruClient(api_key="secret", poll_interval_seconds=0, max_polls=1).parse_pdf(pdf_path, tmp_path)

    assert payload["pdf_info"][0]["page_size"] == [1, 1]
    assert calls == [
        "https://mineru.net/api/v4/file-urls/batch",
        "https://upload.example/file",
        "https://mineru.net/api/v4/extract-results/batch/batch-1",
        "https://download.example/result.zip",
    ]
    assert (tmp_path / "mineru_result.zip").exists()


def test_submit_pdf_raises_on_failed_task(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
    def fake_urlopen(request, timeout=0):
        url = request.full_url if hasattr(request, "full_url") else request
        if str(url).endswith("/api/v4/file-urls/batch"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"batch_id": "batch-1", "file_urls": ["https://upload.example/file"]}}).encode())
        if str(url) == "https://upload.example/file":
            return FakeResponse(200, b"")
        if str(url).endswith("/api/v4/extract-results/batch/batch-1"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"extract_result": [{"state": "failed", "err_msg": "bad pdf"}]}}).encode())
        raise AssertionError(f"unexpected URL {url}")

    monkeypatch.setattr(mineru_client.request, "urlopen", fake_urlopen)
    pdf_path = tmp_path / "preview.pdf"
    pdf_path.write_bytes(b"%PDF-1.7")

    with pytest.raises(MineruClientError, match="bad pdf"):
        MineruClient(api_key="secret", poll_interval_seconds=0, max_polls=1).parse_pdf(pdf_path, tmp_path)

Step 2: Run client tests and verify they fail

Run: pytest tests/backend/test_mineru_client.py -v

Expected: FAIL with ModuleNotFoundError: No module named 'backend.app.mineru_client'.
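
The test fixture builds its result archive in memory with zipfile and io.BytesIO. The same stdlib calls can read the structured JSON back without extracting anything to disk, which is handy when experimenting with archive layouts. This is a sketch only: load_middle_json is a hypothetical helper, and the planned client extracts to a directory instead.

```python
import io
import json
import zipfile


def load_middle_json(zip_bytes: bytes) -> dict:
    """Hypothetical helper: parse the first *middle.json member of an in-memory zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = sorted(name for name in archive.namelist() if name.endswith("middle.json"))
        if not names:
            raise ValueError("archive has no *middle.json member")
        return json.loads(archive.read(names[0]).decode("utf-8"))


# Build a small archive the same way the test fixture does.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo_middle.json", json.dumps({"pdf_info": []}))
    archive.writestr("demo_model.json", "[]")

payload = load_middle_json(buffer.getvalue())
```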

Step 3: Implement the MinerU client

# backend/app/mineru_client.py
from __future__ import annotations

import json
import time
import zipfile
from pathlib import Path
from urllib import request
from urllib.error import HTTPError, URLError


class MineruClientError(RuntimeError):
    pass


class MineruClient:
    def __init__(self, api_key: str, poll_interval_seconds: float = 2.0, max_polls: int = 90) -> None:
        self.api_key = api_key
        self.poll_interval_seconds = poll_interval_seconds
        self.max_polls = max_polls

    def parse_pdf(self, pdf_path: Path, output_dir: Path) -> dict:
        output_dir.mkdir(parents=True, exist_ok=True)
        batch_id, upload_url = self._request_upload_url(pdf_path.name)
        self._upload_file(upload_url, pdf_path)
        zip_url = self._poll_result(batch_id)
        zip_path = self._download_zip(zip_url, output_dir)
        extract_dir = output_dir / "mineru_result"
        self._extract_zip(zip_path, extract_dir)
        return self._load_structured_json(extract_dir)

    def _headers(self) -> dict[str, str]:
        return {"Authorization": f"Bearer {self.api_key}", "Accept": "*/*"}

    def _json_request(self, url: str, method: str = "GET", payload: dict | None = None) -> dict:
        body = None if payload is None else json.dumps(payload).encode("utf-8")
        headers = self._headers()
        if payload is not None:
            headers["Content-Type"] = "application/json"
        req = request.Request(url, data=body, headers=headers, method=method)
        try:
            with request.urlopen(req, timeout=30) as response:
                data = json.loads(response.read().decode("utf-8"))
        except (HTTPError, URLError, TimeoutError, json.JSONDecodeError) as exc:
            raise MineruClientError(f"MinerU request failed: {exc}") from exc
        if data.get("code") != 0:
            raise MineruClientError(str(data.get("msg") or "MinerU API returned an error"))
        return data

    def _request_upload_url(self, filename: str) -> tuple[str, str]:
        data = self._json_request(
            "https://mineru.net/api/v4/file-urls/batch",
            method="POST",
            payload={"files": [{"name": filename, "data_id": filename}], "model_version": "vlm"},
        )
        batch_id = data["data"]["batch_id"]
        file_urls = data["data"]["file_urls"]
        if not file_urls:
            raise MineruClientError("MinerU did not return an upload URL")
        return batch_id, file_urls[0]

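    def _upload_file(self, upload_url: str, pdf_path: Path) -> None:
        # Note: urllib adds "Content-Type: application/x-www-form-urlencoded" to any
        # request with a body unless a Content-Type header is set explicitly; confirm
        # that the pre-signed upload URL accepts it.
        req = request.Request(upload_url, data=pdf_path.read_bytes(), method="PUT")
        try:
            with request.urlopen(req, timeout=120) as response:
                # urlopen raises HTTPError for 4xx/5xx, so this check is only a safeguard.
                if response.status >= 400:
                    raise MineruClientError(f"MinerU upload failed with HTTP {response.status}")
        except (HTTPError, URLError, TimeoutError) as exc:
            raise MineruClientError(f"MinerU upload failed: {exc}") from exc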

    def _poll_result(self, batch_id: str) -> str:
        url = f"https://mineru.net/api/v4/extract-results/batch/{batch_id}"
        for _attempt in range(self.max_polls):
            data = self._json_request(url)
            extract_result = data.get("data", {}).get("extract_result")
            if isinstance(extract_result, list):
                result = extract_result[0] if extract_result else {}
            else:
                result = extract_result or {}
            state = result.get("state")
            if state == "done":
                zip_url = result.get("full_zip_url")
                if not zip_url:
                    raise MineruClientError("MinerU finished without full_zip_url")
                return zip_url
            if state == "failed":
                raise MineruClientError(str(result.get("err_msg") or "MinerU parsing failed"))
            time.sleep(self.poll_interval_seconds)
        raise MineruClientError("MinerU polling timed out")

    def _download_zip(self, zip_url: str, output_dir: Path) -> Path:
        target = output_dir / "mineru_result.zip"
        req = request.Request(zip_url, headers={"Accept": "*/*"}, method="GET")
        try:
            with request.urlopen(req, timeout=120) as response:
                target.write_bytes(response.read())
        except (HTTPError, URLError, TimeoutError) as exc:
            raise MineruClientError(f"MinerU zip download failed: {exc}") from exc
        return target

    def _extract_zip(self, zip_path: Path, extract_dir: Path) -> None:
        extract_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)

    def _load_structured_json(self, extract_dir: Path) -> dict:
        candidates = sorted(extract_dir.rglob("*middle.json")) + sorted(extract_dir.rglob("*_model.json"))
        if not candidates:
            raise MineruClientError("MinerU result zip did not contain structured JSON")
        return json.loads(candidates[0].read_text(encoding="utf-8"))

Step 4: Run client tests and verify they pass

Run: pytest tests/backend/test_mineru_client.py -v

Expected: PASS.
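
The polling logic is the part of the client with the most control flow, so it may help to look at it in isolation. The sketch below is a stand-alone variant with the HTTP call replaced by an injected callable. The poll_until_done name, the fetch_state parameter, and the "running" state are placeholders for illustration; only "done", "failed", full_zip_url, and err_msg come from the client code above.

```python
from __future__ import annotations

import time
from typing import Callable


def poll_until_done(fetch_state: Callable[[], dict], max_polls: int, interval_seconds: float) -> str:
    """Hypothetical stand-alone version of the client's polling loop."""
    for attempt in range(max_polls):
        result = fetch_state()
        state = result.get("state")
        if state == "done":
            return result["full_zip_url"]
        if state == "failed":
            raise RuntimeError(result.get("err_msg") or "parsing failed")
        if attempt < max_polls - 1:
            # Unlike the client above, skip the pointless sleep after the last attempt.
            time.sleep(interval_seconds)
    raise TimeoutError("polling timed out")


# Simulate two in-progress responses followed by a finished one.
responses = iter([
    {"state": "running"},
    {"state": "running"},
    {"state": "done", "full_zip_url": "https://download.example/result.zip"},
])
zip_url = poll_until_done(lambda: next(responses), max_polls=5, interval_seconds=0)
```

With the client's defaults (max_polls=90, poll_interval_seconds=2.0), the loop waits roughly three minutes, plus request time, before reporting a timeout.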

Task 3: Pipeline Integration

Files:

Modify: backend/app/pipeline.py
Modify: tests/backend/test_pipeline.py
Modify: tests/backend/test_api.py

Step 1: Write failing pipeline tests with a mocked MinerU document

# tests/backend/test_pipeline.py
from pathlib import Path

import pytest

from backend.app import pipeline
from backend.app.pipeline import process_files


WORKDIR = Path("/Users/icemilk/Workspace/zld_POC")
AI_FILE = WORKDIR / "【2026-04-09】端午 - 背标 - 天问.ai"
DOCX_FILE = WORKDIR / "天问礼品粽【260331】.docx"
OUTPUT_DIR = WORKDIR / ".tmp_test_output"


def test_process_files_builds_preview_and_mineru_field_results(monkeypatch: pytest.MonkeyPatch) -> None:
    def fake_parse_with_mineru(_preview_path: Path, _output_dir: Path):
        return {
            "pdf_info": [
                {
                    "page_idx": 0,
                    "page_size": [2772, 1961],
                    "para_blocks": [
                        {
                            "bbox": [704, 134, 2106, 229],
                            "lines": [{"spans": [{"content": "食品名称:天问礼品粽"}]}],
                        },
                        {
                            "bbox": [10, 20, 40, 60],
                            "lines": [{"spans": [{"content": "Word中不存在的内容"}]}],
                        },
                    ],
                }
            ]
        }

    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", fake_parse_with_mineru)

    result = process_files(AI_FILE, DOCX_FILE, OUTPUT_DIR, job_id="test-job")

    assert result["preview"]["type"] == "pdf"
    assert result["preview"]["url"] == "/api/files/test-job/preview.pdf"
    assert result["preview"]["pageWidthPt"] == 2772
    assert result["preview"]["pageHeightPt"] == 1961
    assert result["fields"][0]["text"] == "食品名称:天问礼品粽"
    assert result["fields"][0]["validation_status"] == "matched"
    assert result["fields"][0]["x0_pt"] == 704.0
    assert any(field["validation_status"] == "unmatched" for field in result["fields"])
    assert (OUTPUT_DIR / "preview.pdf").exists()

Step 2: Replace API tests with mocked MinerU coverage

# tests/backend/test_api.py
from pathlib import Path

import pytest
from fastapi.testclient import TestClient

from backend.app import pipeline
from backend.app.main import app


WORKDIR = Path("/Users/icemilk/Workspace/zld_POC")
AI_FILE = WORKDIR / "【2026-04-09】端午 - 背标 - 天问.ai"
DOCX_FILE = WORKDIR / "天问礼品粽【260331】.docx"

client = TestClient(app)


def fake_mineru_payload() -> dict:
    return {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [2772, 1961],
                "para_blocks": [
                    {
                        "bbox": [704, 134, 2106, 229],
                        "lines": [{"spans": [{"content": "食品名称:天问礼品粽"}]}],
                    }
                ],
            }
        ]
    }


def test_process_endpoint_returns_preview_and_fields(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", lambda _preview_path, _output_dir: fake_mineru_payload())

    with AI_FILE.open("rb") as ai_fp, DOCX_FILE.open("rb") as docx_fp:
        response = client.post(
            "/api/process",
            files={
                "ai_file": (AI_FILE.name, ai_fp, "application/postscript"),
                "word_file": (
                    DOCX_FILE.name,
                    docx_fp,
                    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                ),
            },
        )

    assert response.status_code == 200

    payload = response.json()
    assert payload["preview"]["type"] == "pdf"
    assert payload["preview"]["pageWidthPt"] == 2772
    assert payload["fields"]
    assert payload["fields"][0]["text"] == "食品名称:天问礼品粽"


def test_process_endpoint_uses_default_sample_files_when_uploads_are_missing(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", lambda _preview_path, _output_dir: fake_mineru_payload())

    response = client.post("/api/process")

    assert response.status_code == 200

    payload = response.json()
    assert payload["preview"]["type"] == "pdf"
    assert payload["fields"]
    assert any(field["text"] for field in payload["fields"])


def test_process_endpoint_surfaces_missing_mineru_key(monkeypatch: pytest.MonkeyPatch) -> None:
    def fake_parse_with_mineru(_preview_path, _output_dir):
        raise RuntimeError("MINERU_API_KEY is required")

    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", fake_parse_with_mineru)

    response = client.post("/api/process")

    assert response.status_code == 500
    assert response.json()["detail"] == "MINERU_API_KEY is required"

Step 3: Run integration tests and verify they fail

Run: pytest tests/backend/test_pipeline.py tests/backend/test_api.py -v

Expected: FAIL because pipeline._parse_preview_with_mineru does not exist and process_files still uses ai_document.fields.

Step 4: Update backend/app/pipeline.py

# backend/app/pipeline.py
from __future__ import annotations

import os
import shutil
from pathlib import Path

from backend.app.ai_parser import parse_ai_document
from backend.app.mineru_client import MineruClient
from backend.app.mineru_parser import parse_mineru_fields
from backend.app.text_validation import validate_field_against_word
from backend.app.word_parser import extract_word_text


def _sort_key(field: dict) -> tuple[int, int, float, float]:
    status_rank = {"matched": 0, "unmatched": 1, "empty_or_garbled": 2}
    return (
        status_rank.get(field["validation_status"], 9),
        field["page"],
        field["top_pt"],
        field["x0_pt"],
    )


def _parse_preview_with_mineru(preview_path: Path, output_dir: Path) -> dict:
    api_key = os.environ.get("MINERU_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("MINERU_API_KEY is required")
    return MineruClient(api_key=api_key).parse_pdf(preview_path, output_dir / "mineru")


def process_files(ai_path: Path, word_path: Path, output_dir: Path, job_id: str | None = None) -> dict:
    output_dir.mkdir(parents=True, exist_ok=True)
    ai_document = parse_ai_document(ai_path, output_dir / "parsed")
    word_text = extract_word_text(word_path)

    preview_filename = "preview.pdf"
    preview_target = output_dir / preview_filename
    if ai_document.preview_path != preview_target:
        shutil.copy2(ai_document.preview_path, preview_target)

    mineru_payload = _parse_preview_with_mineru(preview_target, output_dir)
    mineru_document = parse_mineru_fields(mineru_payload)

    fields: list[dict] = []
    for index, field in enumerate(mineru_document.fields, start=1):
        validation = validate_field_against_word(field["text"], word_text)
        fields.append(
            {
                "id": f"field-{index}",
                **field,
                "normalized_text": validation.normalized_text,
                "validation_status": validation.status,
                "validation_reason": validation.reason,
                "matched_excerpt": validation.matched_excerpt,
            }
        )

    fields.sort(key=_sort_key)

    preview_url = f"/api/files/{job_id}/{preview_filename}" if job_id else preview_filename
    return {
        "preview": {
            "type": "pdf",
            "url": preview_url,
            "pageWidthPt": mineru_document.page_width,
            "pageHeightPt": mineru_document.page_height,
        },
        "fields": fields,
    }

Step 5: Run integration tests and verify they pass

Run: pytest tests/backend/test_pipeline.py tests/backend/test_api.py -v

Expected: PASS.
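
The response pairs each field's bbox with the page dimensions reported by MinerU. One way a consumer could position a highlight independently of the preview's rendered size is to convert the bbox into fractions of the page. This is a minimal sketch, assuming the bbox and page_size share the same units and a top-left origin; bbox_to_fractions is a hypothetical helper and not part of the planned pipeline.

```python
from __future__ import annotations


def bbox_to_fractions(field: dict, page_width: float, page_height: float) -> dict[str, float]:
    """Hypothetical helper: express a field bbox as fractions of the page size."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    return {
        "left": field["x0_pt"] / page_width,
        "top": field["top_pt"] / page_height,
        "width": (field["x1_pt"] - field["x0_pt"]) / page_width,
        "height": (field["bottom_pt"] - field["top_pt"]) / page_height,
    }


# Values taken from the mocked MinerU payload used in the tests.
field = {"x0_pt": 704.0, "top_pt": 134.0, "x1_pt": 2106.0, "bottom_pt": 229.0}
box = bbox_to_fractions(field, page_width=2772, page_height=1961)
```

Working in fractions also sidesteps the question of whether MinerU's page_size is in PDF points or pixels, since only the ratio between bbox and page size matters.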

Task 4: Frontend Type and Copy Compatibility

Files:

Modify: frontend/src/types.ts
Modify: frontend/src/App.tsx

Step 1: Update TypeScript types

// frontend/src/types.ts
export type FieldResult = {
  id: string
  page: number
  text: string
  font_name?: string | null
  font_size_pt?: number | null
  font_height_mm?: number | null
  x0_pt: number
  top_pt: number
  x1_pt: number
  bottom_pt: number
  normalized_text: string
  validation_status: ValidationStatus
  validation_reason: string
  matched_excerpt: string | null
}

Step 2: Update App.tsx display guards and copy

// Replace the hero copy with:
<p className="hero-copy">
  上传 Illustrator 源文件与 Word 校对稿，系统会将设计文件转换为 PDF 后交给 MinerU 解析，
  再把识别出的版面文字与 Word 内容逐块比对。
</p>

// Replace the font metadata rendering with:
<div className="field-meta">
  <span>第 {field.page} 页</span>
  {field.font_name ? <span>{field.font_name}</span> : null}
  {typeof field.font_size_pt === 'number' ? <span>{field.font_size_pt} pt</span> : null}
  {typeof field.font_height_mm === 'number' ? <span>{field.font_height_mm.toFixed(1)} mm</span> : null}
</div>

Step 3: Run frontend type check

Run: cd frontend && npm run build

Expected: PASS.

Task 5: Full Verification

Files:

No new files.

Step 1: Run backend tests

Run: pytest tests/backend/test_mineru_parser.py tests/backend/test_mineru_client.py tests/backend/test_pipeline.py tests/backend/test_api.py -v

Expected: PASS.

Step 2: Run frontend build

Run: cd frontend && npm run build

Expected: PASS.

Step 3: Run local manual verification with the real MinerU API

Set MINERU_API_KEY in the shell environment, then run the backend:

./scripts/start_backend.sh

Run frontend in another terminal:

./scripts/start_frontend.sh

Open the frontend, upload the sample .ai and .docx, click 开始解析 (Start parsing), and verify:

The request completes without leaking the token to browser requests.
The right preview shows the PDF.
The left result list contains MinerU-derived text blocks.
Clicking a result card highlights the corresponding MinerU bbox on the right preview.
Blocks found in the Word document show 校验成功 (validation passed); missing blocks show 校验失败 (validation failed).

Step 4: Skip commit in this workspace

This project directory is not a git repository, so do not run git commit. Report the changed file list in the final response instead.
