ZLD_POC/docs/superpowers/plans/2026-04-14-mineru-ai-word-parse.md
2026-04-15 17:18:49 +08:00

# MinerU AI Word Parse Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Replace the old Illustrator text extraction source with MinerU JSON blocks, then compare those blocks against the uploaded Word document and highlight MinerU bounding boxes in the existing preview.
**Architecture:** Keep `/api/process` and the current React response shape. Add a focused MinerU JSON mapper and a small MinerU HTTP client, then update `backend/app/pipeline.py` so that it converts `.ai` to `preview.pdf`, sends that PDF to MinerU, maps the returned blocks to fields, and validates each field against the Word text. Frontend changes are limited to making font metadata optional and updating the copy text to match the new MinerU-backed flow.
**Tech Stack:** Python 3.10+ (required by `@dataclass(slots=True)` in the parser), FastAPI, stdlib `urllib`, stdlib `zipfile`, `python-docx`, `pypdf`, React, TypeScript, Vite, pytest.
---
## File Structure
- Create `backend/app/mineru_parser.py`: Convert MinerU `middle.json`-style data into normalized field dictionaries with text and bbox coordinates.
- Create `backend/app/mineru_client.py`: Submit a local PDF to MinerU, poll for completion, download and extract the result zip, and load structured JSON.
- Modify `backend/app/pipeline.py`: Use AI-to-PDF preview conversion, MinerU parsing, and Word validation instead of old AI text fields.
- Modify `frontend/src/types.ts`: Make font fields optional because MinerU output does not provide Illustrator font metadata.
- Modify `frontend/src/App.tsx`: Keep UI behavior, adjust the product and status copy to describe MinerU-backed OCR/layout results, and avoid unsafe numeric formatting on optional font metadata.
- Create `tests/backend/test_mineru_parser.py`: Unit tests for sample JSON extraction, HTML table conversion, bbox mapping, and empty block handling.
- Create `tests/backend/test_mineru_client.py`: Unit tests for MinerU HTTP client success/failure control flow with mocked `urllib`.
- Modify `tests/backend/test_pipeline.py`: Mock MinerU calls and assert Word validation plus preview/highlight payload.
- Modify `tests/backend/test_api.py`: Mock MinerU calls for endpoint tests and add missing-token failure coverage.
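
The normalized field dictionary that the mapper produces can be pictured as a typed record. The following is a minimal sketch based on the keys asserted in the Task 1 tests; the `MineruField` name and the use of `TypedDict` are illustrative assumptions, since the plan itself passes plain `dict` values.

```python
from typing import Optional, TypedDict


class MineruField(TypedDict):
    """Hypothetical name; the plan passes plain dicts with these keys."""

    page: int                        # 1-based page number (page_idx + 1)
    text: str                        # cleaned block text
    font_name: str                   # always "" for MinerU-derived fields
    font_size_pt: Optional[float]    # None: MinerU output has no font metadata
    font_height_mm: Optional[float]  # None for the same reason
    x0_pt: float                     # bbox left edge
    top_pt: float                    # bbox top edge
    x1_pt: float                     # bbox right edge
    bottom_pt: float                 # bbox bottom edge


# Example record matching the first parser test in Task 1.
example: MineruField = {
    "page": 1,
    "text": "食品名称:天问礼品粽",
    "font_name": "",
    "font_size_pt": None,
    "font_height_mm": None,
    "x0_pt": 704.0,
    "top_pt": 134.0,
    "x1_pt": 2106.0,
    "bottom_pt": 229.0,
}
```

Keeping the font keys present but empty is what lets the existing React response shape stay unchanged while the frontend treats them as optional.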
## Task 1: MinerU JSON Mapper
**Files:**
- Create: `backend/app/mineru_parser.py`
- Test: `tests/backend/test_mineru_parser.py`
- [ ] **Step 1: Write failing parser tests**
```python
# tests/backend/test_mineru_parser.py
from __future__ import annotations

from backend.app.mineru_parser import parse_mineru_fields


def test_parse_mineru_fields_extracts_text_and_bbox() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [2772, 1961],
                "para_blocks": [
                    {
                        "bbox": [704, 134, 2106, 229],
                        "type": "title",
                        "lines": [
                            {
                                "spans": [
                                    {
                                        "type": "text",
                                        "content": "食品名称:天问礼品粽",
                                        "bbox": [704, 134, 2106, 229],
                                    }
                                ]
                            }
                        ],
                    }
                ],
            }
        ]
    }
    parsed = parse_mineru_fields(payload)
    assert parsed.page_width == 2772
    assert parsed.page_height == 1961
    assert parsed.fields == [
        {
            "page": 1,
            "text": "食品名称:天问礼品粽",
            "font_name": "",
            "font_size_pt": None,
            "font_height_mm": None,
            "x0_pt": 704.0,
            "top_pt": 134.0,
            "x1_pt": 2106.0,
            "bottom_pt": 229.0,
        }
    ]


def test_parse_mineru_fields_turns_table_html_into_text() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [1000, 800],
                "para_blocks": [
                    {
                        "bbox": [10, 20, 300, 200],
                        "type": "table",
                        "lines": [
                            {
                                "spans": [
                                    {
                                        "type": "table",
                                        "html": "<table><tr><td>品种</td><td>规格</td></tr><tr><td>黑猪肉粽</td><td>130克×1</td></tr></table>",
                                    }
                                ]
                            }
                        ],
                    }
                ],
            }
        ]
    }
    parsed = parse_mineru_fields(payload)
    assert parsed.fields[0]["text"] == "品种 规格 黑猪肉粽 130克×1"


def test_parse_mineru_fields_skips_empty_decorative_blocks() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [1000, 800],
                "para_blocks": [
                    {"bbox": [1, 2, 3, 4], "type": "image", "lines": [{"spans": [{"type": "image"}]}]},
                    {"bbox": [5, 6, 7, 8], "type": "text", "lines": [{"spans": [{"content": " "}]}]},
                ],
            }
        ]
    }
    parsed = parse_mineru_fields(payload)
    assert parsed.fields == []
```
- [ ] **Step 2: Run parser tests and verify they fail**
Run: `pytest tests/backend/test_mineru_parser.py -v`
Expected: FAIL with `ModuleNotFoundError: No module named 'backend.app.mineru_parser'`.
- [ ] **Step 3: Implement the MinerU parser**
```python
# backend/app/mineru_parser.py
from __future__ import annotations

import html
import re
from dataclasses import dataclass
from typing import Any


@dataclass(slots=True)
class ParsedMineruDocument:
    page_width: float
    page_height: float
    fields: list[dict]


TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def _clean_text(value: str) -> str:
    without_tags = TAG_RE.sub(" ", html.unescape(value))
    return WHITESPACE_RE.sub(" ", without_tags).strip()


def _span_text(span: dict[str, Any]) -> str:
    if isinstance(span.get("content"), str):
        return _clean_text(span["content"])
    if isinstance(span.get("html"), str):
        return _clean_text(span["html"])
    return ""


def _block_text(block: dict[str, Any]) -> str:
    pieces: list[str] = []
    for line in block.get("lines") or []:
        for span in line.get("spans") or []:
            text = _span_text(span)
            if text:
                pieces.append(text)
    if not pieces and isinstance(block.get("text"), str):
        pieces.append(_clean_text(block["text"]))
    return WHITESPACE_RE.sub(" ", " ".join(pieces)).strip()


def _bbox(block: dict[str, Any]) -> tuple[float, float, float, float] | None:
    raw_bbox = block.get("bbox")
    if not isinstance(raw_bbox, list) or len(raw_bbox) != 4:
        return None
    try:
        x0, y0, x1, y1 = [float(value) for value in raw_bbox]
    except (TypeError, ValueError):
        return None
    if x1 <= x0 or y1 <= y0:
        return None
    return x0, y0, x1, y1


def _page_size(page: dict[str, Any]) -> tuple[float, float]:
    raw_size = page.get("page_size")
    if isinstance(raw_size, list) and len(raw_size) >= 2:
        return float(raw_size[0]), float(raw_size[1])
    return 1.0, 1.0


def parse_mineru_fields(payload: dict[str, Any]) -> ParsedMineruDocument:
    pages = payload.get("pdf_info")
    if not isinstance(pages, list) or not pages:
        raise ValueError("MinerU JSON does not contain pdf_info pages")
    first_width, first_height = _page_size(pages[0])
    fields: list[dict] = []
    for page in pages:
        page_number = int(page.get("page_idx", 0)) + 1
        for block in page.get("para_blocks") or []:
            text = _block_text(block)
            box = _bbox(block)
            if not text or box is None:
                continue
            x0, y0, x1, y1 = box
            fields.append(
                {
                    "page": page_number,
                    "text": text,
                    "font_name": "",
                    "font_size_pt": None,
                    "font_height_mm": None,
                    "x0_pt": x0,
                    "top_pt": y0,
                    "x1_pt": x1,
                    "bottom_pt": y1,
                }
            )
    return ParsedMineruDocument(page_width=first_width, page_height=first_height, fields=fields)
```
- [ ] **Step 4: Run parser tests and verify they pass**
Run: `pytest tests/backend/test_mineru_parser.py -v`
Expected: PASS.
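
The table-flattening behavior that the parser tests rely on can be tried in isolation. The sketch below copies the two regular expressions from `_clean_text` into a standalone `clean_text` function (the public name is only for illustration) and runs it on a small HTML fragment.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(value: str) -> str:
    """Replace every tag with a space, then collapse runs of whitespace."""
    without_tags = TAG_RE.sub(" ", html.unescape(value))
    return WHITESPACE_RE.sub(" ", without_tags).strip()


# Each table cell boundary becomes a single space in the flattened text.
table_html = "<table><tr><td>品种</td><td>规格</td></tr></table>"
flattened = clean_text(table_html)

# Entities are decoded before tags are stripped.
decoded = clean_text("  a&amp;b  ")
```

One consequence of this approach is that row and column structure is lost, so Word validation compares against a flat string. Because `html.unescape` runs before tag stripping, an escaped sequence such as `&lt;b&gt;` is also treated as a tag and dropped, which may or may not matter for real label text.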
## Task 2: MinerU HTTP Client
**Files:**
- Create: `backend/app/mineru_client.py`
- Test: `tests/backend/test_mineru_client.py`
- [ ] **Step 1: Write failing MinerU client tests**
```python
# tests/backend/test_mineru_client.py
from __future__ import annotations

import io
import json
import zipfile
from pathlib import Path
from urllib.error import HTTPError

import pytest

from backend.app import mineru_client
from backend.app.mineru_client import MineruClient, MineruClientError


class FakeResponse:
    def __init__(self, status: int, body: bytes):
        self.status = status
        self._body = body

    def read(self) -> bytes:
        return self._body

    def __enter__(self) -> "FakeResponse":
        return self

    def __exit__(self, *_args: object) -> None:
        return None


def _zip_with_json() -> bytes:
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo_middle.json", json.dumps({"pdf_info": [{"page_idx": 0, "page_size": [1, 1], "para_blocks": []}]}))
    return buffer.getvalue()


def test_submit_pdf_downloads_and_loads_structured_json(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
    calls: list[str] = []

    def fake_urlopen(request, timeout=0):
        url = request.full_url if hasattr(request, "full_url") else request
        calls.append(str(url))
        if str(url).endswith("/api/v4/file-urls/batch"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"batch_id": "batch-1", "file_urls": ["https://upload.example/file"]}}).encode())
        if str(url) == "https://upload.example/file":
            return FakeResponse(200, b"")
        if str(url).endswith("/api/v4/extract-results/batch/batch-1"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"extract_result": [{"state": "done", "full_zip_url": "https://download.example/result.zip"}]}}).encode())
        if str(url) == "https://download.example/result.zip":
            return FakeResponse(200, _zip_with_json())
        raise AssertionError(f"unexpected URL {url}")

    monkeypatch.setattr(mineru_client.request, "urlopen", fake_urlopen)
    pdf_path = tmp_path / "preview.pdf"
    pdf_path.write_bytes(b"%PDF-1.7")
    payload = MineruClient(api_key="secret", poll_interval_seconds=0, max_polls=1).parse_pdf(pdf_path, tmp_path)
    assert payload["pdf_info"][0]["page_size"] == [1, 1]
    assert calls == [
        "https://mineru.net/api/v4/file-urls/batch",
        "https://upload.example/file",
        "https://mineru.net/api/v4/extract-results/batch/batch-1",
        "https://download.example/result.zip",
    ]
    assert (tmp_path / "mineru_result.zip").exists()


def test_submit_pdf_raises_on_failed_task(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
    def fake_urlopen(request, timeout=0):
        url = request.full_url if hasattr(request, "full_url") else request
        if str(url).endswith("/api/v4/file-urls/batch"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"batch_id": "batch-1", "file_urls": ["https://upload.example/file"]}}).encode())
        if str(url) == "https://upload.example/file":
            return FakeResponse(200, b"")
        if str(url).endswith("/api/v4/extract-results/batch/batch-1"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"extract_result": [{"state": "failed", "err_msg": "bad pdf"}]}}).encode())
        raise AssertionError(f"unexpected URL {url}")

    monkeypatch.setattr(mineru_client.request, "urlopen", fake_urlopen)
    pdf_path = tmp_path / "preview.pdf"
    pdf_path.write_bytes(b"%PDF-1.7")
    with pytest.raises(MineruClientError, match="bad pdf"):
        MineruClient(api_key="secret", poll_interval_seconds=0, max_polls=1).parse_pdf(pdf_path, tmp_path)
```
- [ ] **Step 2: Run client tests and verify they fail**
Run: `pytest tests/backend/test_mineru_client.py -v`
Expected: FAIL with `ModuleNotFoundError: No module named 'backend.app.mineru_client'`.
- [ ] **Step 3: Implement the MinerU client**
```python
# backend/app/mineru_client.py
from __future__ import annotations

import json
import time
import zipfile
from pathlib import Path
from urllib import request
from urllib.error import HTTPError, URLError


class MineruClientError(RuntimeError):
    pass


class MineruClient:
    def __init__(self, api_key: str, poll_interval_seconds: float = 2.0, max_polls: int = 90) -> None:
        self.api_key = api_key
        self.poll_interval_seconds = poll_interval_seconds
        self.max_polls = max_polls

    def parse_pdf(self, pdf_path: Path, output_dir: Path) -> dict:
        output_dir.mkdir(parents=True, exist_ok=True)
        batch_id, upload_url = self._request_upload_url(pdf_path.name)
        self._upload_file(upload_url, pdf_path)
        zip_url = self._poll_result(batch_id)
        zip_path = self._download_zip(zip_url, output_dir)
        extract_dir = output_dir / "mineru_result"
        self._extract_zip(zip_path, extract_dir)
        return self._load_structured_json(extract_dir)

    def _headers(self) -> dict[str, str]:
        return {"Authorization": f"Bearer {self.api_key}", "Accept": "*/*"}

    def _json_request(self, url: str, method: str = "GET", payload: dict | None = None) -> dict:
        body = None if payload is None else json.dumps(payload).encode("utf-8")
        headers = self._headers()
        if payload is not None:
            headers["Content-Type"] = "application/json"
        req = request.Request(url, data=body, headers=headers, method=method)
        try:
            with request.urlopen(req, timeout=30) as response:
                data = json.loads(response.read().decode("utf-8"))
        except (HTTPError, URLError, TimeoutError, json.JSONDecodeError) as exc:
            raise MineruClientError(f"MinerU request failed: {exc}") from exc
        if data.get("code") != 0:
            raise MineruClientError(str(data.get("msg") or "MinerU API returned an error"))
        return data

    def _request_upload_url(self, filename: str) -> tuple[str, str]:
        data = self._json_request(
            "https://mineru.net/api/v4/file-urls/batch",
            method="POST",
            payload={"files": [{"name": filename, "data_id": filename}], "model_version": "vlm"},
        )
        batch_id = data["data"]["batch_id"]
        file_urls = data["data"]["file_urls"]
        if not file_urls:
            raise MineruClientError("MinerU did not return an upload URL")
        return batch_id, file_urls[0]

    def _upload_file(self, upload_url: str, pdf_path: Path) -> None:
        req = request.Request(upload_url, data=pdf_path.read_bytes(), method="PUT")
        try:
            with request.urlopen(req, timeout=120) as response:
                if response.status >= 400:
                    raise MineruClientError(f"MinerU upload failed with HTTP {response.status}")
        except (HTTPError, URLError, TimeoutError) as exc:
            raise MineruClientError(f"MinerU upload failed: {exc}") from exc

    def _poll_result(self, batch_id: str) -> str:
        url = f"https://mineru.net/api/v4/extract-results/batch/{batch_id}"
        for _attempt in range(self.max_polls):
            data = self._json_request(url)
            extract_result = data.get("data", {}).get("extract_result")
            if isinstance(extract_result, list):
                result = extract_result[0] if extract_result else {}
            else:
                result = extract_result or {}
            state = result.get("state")
            if state == "done":
                zip_url = result.get("full_zip_url")
                if not zip_url:
                    raise MineruClientError("MinerU finished without full_zip_url")
                return zip_url
            if state == "failed":
                raise MineruClientError(str(result.get("err_msg") or "MinerU parsing failed"))
            time.sleep(self.poll_interval_seconds)
        raise MineruClientError("MinerU polling timed out")

    def _download_zip(self, zip_url: str, output_dir: Path) -> Path:
        target = output_dir / "mineru_result.zip"
        req = request.Request(zip_url, headers={"Accept": "*/*"}, method="GET")
        try:
            with request.urlopen(req, timeout=120) as response:
                target.write_bytes(response.read())
        except (HTTPError, URLError, TimeoutError) as exc:
            raise MineruClientError(f"MinerU zip download failed: {exc}") from exc
        return target

    def _extract_zip(self, zip_path: Path, extract_dir: Path) -> None:
        extract_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)

    def _load_structured_json(self, extract_dir: Path) -> dict:
        candidates = sorted(extract_dir.rglob("*middle.json")) + sorted(extract_dir.rglob("*_model.json"))
        if not candidates:
            raise MineruClientError("MinerU result zip did not contain structured JSON")
        return json.loads(candidates[0].read_text(encoding="utf-8"))
```
- [ ] **Step 4: Run client tests and verify they pass**
Run: `pytest tests/backend/test_mineru_client.py -v`
Expected: PASS.
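
The control flow of `_poll_result` can be exercised without any HTTP by injecting the state source. The following is a minimal sketch under that assumption: `poll_until_done`, `PollError`, and the injectable `sleep` parameter are hypothetical names, and `"running"` is only a placeholder for any state other than `done` or `failed`.

```python
import time
from typing import Callable


class PollError(RuntimeError):
    """Hypothetical error type standing in for MineruClientError."""


def poll_until_done(
    fetch_state: Callable[[], dict],
    max_polls: int,
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_state up to max_polls times and return the result zip URL."""
    for _ in range(max_polls):
        result = fetch_state()
        state = result.get("state")
        if state == "done":
            zip_url = result.get("full_zip_url")
            if not zip_url:
                raise PollError("finished without full_zip_url")
            return zip_url
        if state == "failed":
            raise PollError(str(result.get("err_msg") or "parsing failed"))
        # Any other state is treated as still in progress.
        sleep(interval_seconds)
    raise PollError("polling timed out")


# Scripted sequence of states instead of real HTTP calls.
states = iter(
    [
        {"state": "running"},
        {"state": "done", "full_zip_url": "https://download.example/result.zip"},
    ]
)
url = poll_until_done(lambda: next(states), max_polls=5, interval_seconds=0, sleep=lambda _s: None)
```

With the client defaults of `max_polls=90` and `poll_interval_seconds=2.0`, the loop sleeps for at most about 180 seconds in total before raising the timeout error, not counting the time spent in each request.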
## Task 3: Pipeline Integration
**Files:**
- Modify: `backend/app/pipeline.py`
- Modify: `tests/backend/test_pipeline.py`
- Modify: `tests/backend/test_api.py`
- [ ] **Step 1: Write failing pipeline tests with a mocked MinerU document**
```python
# tests/backend/test_pipeline.py
from pathlib import Path

import pytest

from backend.app import pipeline
from backend.app.pipeline import process_files

WORKDIR = Path("/Users/icemilk/Workspace/zld_POC")
AI_FILE = WORKDIR / "【2026-04-09】端午 - 背标 - 天问.ai"
DOCX_FILE = WORKDIR / "天问礼品粽【260331】.docx"
OUTPUT_DIR = WORKDIR / ".tmp_test_output"


def test_process_files_builds_preview_and_mineru_field_results(monkeypatch: pytest.MonkeyPatch) -> None:
    def fake_parse_with_mineru(_preview_path: Path, _output_dir: Path):
        return {
            "pdf_info": [
                {
                    "page_idx": 0,
                    "page_size": [2772, 1961],
                    "para_blocks": [
                        {
                            "bbox": [704, 134, 2106, 229],
                            "lines": [{"spans": [{"content": "食品名称:天问礼品粽"}]}],
                        },
                        {
                            "bbox": [10, 20, 40, 60],
                            "lines": [{"spans": [{"content": "Word中不存在的内容"}]}],
                        },
                    ],
                }
            ]
        }

    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", fake_parse_with_mineru)
    result = process_files(AI_FILE, DOCX_FILE, OUTPUT_DIR, job_id="test-job")
    assert result["preview"]["type"] == "pdf"
    assert result["preview"]["url"] == "/api/files/test-job/preview.pdf"
    assert result["preview"]["pageWidthPt"] == 2772
    assert result["preview"]["pageHeightPt"] == 1961
    assert result["fields"][0]["text"] == "食品名称:天问礼品粽"
    assert result["fields"][0]["validation_status"] == "matched"
    assert result["fields"][0]["x0_pt"] == 704.0
    assert any(field["validation_status"] == "unmatched" for field in result["fields"])
    assert (OUTPUT_DIR / "preview.pdf").exists()
```
- [ ] **Step 2: Replace API tests with mocked MinerU coverage**
```python
# tests/backend/test_api.py
from pathlib import Path

import pytest
from fastapi.testclient import TestClient

from backend.app import pipeline
from backend.app.main import app

WORKDIR = Path("/Users/icemilk/Workspace/zld_POC")
AI_FILE = WORKDIR / "【2026-04-09】端午 - 背标 - 天问.ai"
DOCX_FILE = WORKDIR / "天问礼品粽【260331】.docx"
client = TestClient(app)


def fake_mineru_payload() -> dict:
    return {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [2772, 1961],
                "para_blocks": [
                    {
                        "bbox": [704, 134, 2106, 229],
                        "lines": [{"spans": [{"content": "食品名称:天问礼品粽"}]}],
                    }
                ],
            }
        ]
    }


def test_process_endpoint_returns_preview_and_fields(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", lambda _preview_path, _output_dir: fake_mineru_payload())
    with AI_FILE.open("rb") as ai_fp, DOCX_FILE.open("rb") as docx_fp:
        response = client.post(
            "/api/process",
            files={
                "ai_file": (AI_FILE.name, ai_fp, "application/postscript"),
                "word_file": (
                    DOCX_FILE.name,
                    docx_fp,
                    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                ),
            },
        )
    assert response.status_code == 200
    payload = response.json()
    assert payload["preview"]["type"] == "pdf"
    assert payload["preview"]["pageWidthPt"] == 2772
    assert payload["fields"]
    assert payload["fields"][0]["text"] == "食品名称:天问礼品粽"


def test_process_endpoint_uses_default_sample_files_when_uploads_are_missing(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", lambda _preview_path, _output_dir: fake_mineru_payload())
    response = client.post("/api/process")
    assert response.status_code == 200
    payload = response.json()
    assert payload["preview"]["type"] == "pdf"
    assert payload["fields"]
    assert any(field["text"] for field in payload["fields"])


def test_process_endpoint_surfaces_missing_mineru_key(monkeypatch: pytest.MonkeyPatch) -> None:
    def fake_parse_with_mineru(_preview_path, _output_dir):
        raise RuntimeError("MINERU_API_KEY is required")

    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", fake_parse_with_mineru)
    response = client.post("/api/process")
    assert response.status_code == 500
    assert response.json()["detail"] == "MINERU_API_KEY is required"
```
- [ ] **Step 3: Run integration tests and verify they fail**
Run: `pytest tests/backend/test_pipeline.py tests/backend/test_api.py -v`
Expected: FAIL because `pipeline._parse_preview_with_mineru` does not exist and `process_files` still uses `ai_document.fields`.
- [ ] **Step 4: Update `backend/app/pipeline.py`**
```python
# backend/app/pipeline.py
from __future__ import annotations

import os
import shutil
from pathlib import Path

from backend.app.ai_parser import parse_ai_document
from backend.app.mineru_client import MineruClient
from backend.app.mineru_parser import parse_mineru_fields
from backend.app.text_validation import validate_field_against_word
from backend.app.word_parser import extract_word_text


def _sort_key(field: dict) -> tuple[int, int, float, float]:
    status_rank = {"matched": 0, "unmatched": 1, "empty_or_garbled": 2}
    return (
        status_rank.get(field["validation_status"], 9),
        field["page"],
        field["top_pt"],
        field["x0_pt"],
    )


def _parse_preview_with_mineru(preview_path: Path, output_dir: Path) -> dict:
    api_key = os.environ.get("MINERU_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("MINERU_API_KEY is required")
    return MineruClient(api_key=api_key).parse_pdf(preview_path, output_dir / "mineru")


def process_files(ai_path: Path, word_path: Path, output_dir: Path, job_id: str | None = None) -> dict:
    output_dir.mkdir(parents=True, exist_ok=True)
    ai_document = parse_ai_document(ai_path, output_dir / "parsed")
    word_text = extract_word_text(word_path)
    preview_filename = "preview.pdf"
    preview_target = output_dir / preview_filename
    if ai_document.preview_path != preview_target:
        shutil.copy2(ai_document.preview_path, preview_target)
    mineru_payload = _parse_preview_with_mineru(preview_target, output_dir)
    mineru_document = parse_mineru_fields(mineru_payload)
    fields: list[dict] = []
    for index, field in enumerate(mineru_document.fields, start=1):
        validation = validate_field_against_word(field["text"], word_text)
        fields.append(
            {
                "id": f"field-{index}",
                **field,
                "normalized_text": validation.normalized_text,
                "validation_status": validation.status,
                "validation_reason": validation.reason,
                "matched_excerpt": validation.matched_excerpt,
            }
        )
    fields.sort(key=_sort_key)
    preview_url = f"/api/files/{job_id}/{preview_filename}" if job_id else preview_filename
    return {
        "preview": {
            "type": "pdf",
            "url": preview_url,
            "pageWidthPt": mineru_document.page_width,
            "pageHeightPt": mineru_document.page_height,
        },
        "fields": fields,
    }
```
- [ ] **Step 5: Run integration tests and verify they pass**
Run: `pytest tests/backend/test_pipeline.py tests/backend/test_api.py -v`
Expected: PASS.
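
The ordering produced by `_sort_key` explains why the pipeline test can expect the matched field at index 0. The sketch below reproduces the key on hand-written sample fields (the field values are made up for illustration) to show the resulting order.

```python
STATUS_RANK = {"matched": 0, "unmatched": 1, "empty_or_garbled": 2}


def sort_key(field: dict) -> tuple[int, int, float, float]:
    """Same ordering as pipeline._sort_key: status, then page, top, left."""
    return (
        STATUS_RANK.get(field["validation_status"], 9),  # unknown statuses sort last
        field["page"],
        field["top_pt"],
        field["x0_pt"],
    )


# Illustrative fields only; ids mirror the "field-N" scheme used in process_files.
fields = [
    {"id": "field-1", "validation_status": "unmatched", "page": 1, "top_pt": 20.0, "x0_pt": 10.0},
    {"id": "field-2", "validation_status": "matched", "page": 2, "top_pt": 5.0, "x0_pt": 0.0},
    {"id": "field-3", "validation_status": "matched", "page": 1, "top_pt": 134.0, "x0_pt": 704.0},
]
ordered_ids = [field["id"] for field in sorted(fields, key=sort_key)]
```

Because ids are assigned with `enumerate` before sorting, they reflect MinerU block order, not display order, so `field-1` is not necessarily the first item in the response.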
## Task 4: Frontend Type and Copy Compatibility
**Files:**
- Modify: `frontend/src/types.ts`
- Modify: `frontend/src/App.tsx`
- [ ] **Step 1: Update TypeScript types**
```ts
// frontend/src/types.ts
export type FieldResult = {
  id: string
  page: number
  text: string
  font_name?: string | null
  font_size_pt?: number | null
  font_height_mm?: number | null
  x0_pt: number
  top_pt: number
  x1_pt: number
  bottom_pt: number
  normalized_text: string
  validation_status: ValidationStatus
  validation_reason: string
  matched_excerpt: string | null
}
```
- [ ] **Step 2: Update `App.tsx` display guards and copy**
```tsx
// Replace the hero copy with:
<p className="hero-copy">
  Illustrator Word 稿 PDF MinerU
  Word
</p>

// Replace the font metadata rendering with:
<div className="field-meta">
  <span> {field.page} </span>
  {field.font_name ? <span>{field.font_name}</span> : null}
  {typeof field.font_size_pt === 'number' ? <span>{field.font_size_pt} pt</span> : null}
  {typeof field.font_height_mm === 'number' ? <span>{field.font_height_mm.toFixed(1)} mm</span> : null}
</div>
```
- [ ] **Step 3: Run frontend type check**
Run: `cd frontend && npm run build`
Expected: PASS.
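
The highlight overlay code is not shown in this plan. As a rough sketch of the arithmetic such an overlay might rely on, the example below expresses a field's bbox as fractions of the page size. It assumes the bbox and `page_size` share one coordinate space with a top-left origin; `bbox_to_fractions` is a hypothetical helper, written in Python to match the backend code in this plan.

```python
def bbox_to_fractions(field: dict, page_width: float, page_height: float) -> dict:
    """Express a field's bbox as fractions of the page size (hypothetical helper)."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page size must be positive")
    return {
        "left": field["x0_pt"] / page_width,
        "top": field["top_pt"] / page_height,
        "width": (field["x1_pt"] - field["x0_pt"]) / page_width,
        "height": (field["bottom_pt"] - field["top_pt"]) / page_height,
    }


# Sample values taken from the table block in the Task 1 parser tests.
sample_field = {"x0_pt": 10.0, "top_pt": 20.0, "x1_pt": 300.0, "bottom_pt": 200.0}
fractions = bbox_to_fractions(sample_field, page_width=1000.0, page_height=800.0)
```

Working in fractions keeps the highlight aligned however the preview is scaled, provided `pageWidthPt` and `pageHeightPt` in the response come from the same MinerU page that produced the bbox.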
## Task 5: Full Verification
**Files:**
- No new files.
- [ ] **Step 1: Run backend tests**
Run: `pytest tests/backend/test_mineru_parser.py tests/backend/test_mineru_client.py tests/backend/test_pipeline.py tests/backend/test_api.py -v`
Expected: PASS.
- [ ] **Step 2: Run frontend build**
Run: `cd frontend && npm run build`
Expected: PASS.
- [ ] **Step 3: Run local manual verification with the real MinerU API**
Set `MINERU_API_KEY` in the shell environment, then run the backend:
```bash
./scripts/start_backend.sh
```
Run frontend in another terminal:
```bash
./scripts/start_frontend.sh
```
Open the frontend, upload the sample `.ai` and `.docx`, click `开始解析` ("Start parsing"), and verify:
- The request completes without leaking the token to browser requests.
- The right preview shows the PDF.
- The left result list contains MinerU-derived text blocks.
- Clicking a result card highlights the corresponding MinerU bbox on the right preview.
- Blocks found in the Word document show `校验成功` ("validation passed"); missing blocks show `校验失败` ("validation failed").
- [ ] **Step 4: Skip commit in this workspace**
This project directory is not a git repository, so do not run `git commit`. Report the changed file list in the final response instead.