Initial commit: packaging review POC, Docker, and frontend/backend

File: docs/superpowers/plans/2026-04-14-mineru-ai-word-parse.md (new file, 801 lines)
# MinerU AI Word Parse Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace the old Illustrator text extraction source with MinerU JSON blocks, then compare those blocks against the uploaded Word document and highlight MinerU bounding boxes in the existing preview.

**Architecture:** Keep `/api/process` and the current React response shape. Add a focused MinerU JSON mapper and a small MinerU HTTP client, then update `backend/app/pipeline.py` so it converts `.ai` to `preview.pdf`, sends that PDF to MinerU, maps returned blocks to fields, and validates each field against Word text. Frontend changes are limited to making font metadata optional and updating copy to match the new MinerU-backed flow.

**Tech Stack:** Python 3, FastAPI, stdlib `urllib`, stdlib `zipfile`, `python-docx`, `pypdf`, React, TypeScript, Vite, pytest.

---

## File Structure

- Create `backend/app/mineru_parser.py`: Convert MinerU `middle.json`-style data into normalized field dictionaries with text and bbox coordinates.
- Create `backend/app/mineru_client.py`: Submit a local PDF to MinerU, poll for completion, download and extract the result zip, and load structured JSON.
- Modify `backend/app/pipeline.py`: Use AI-to-PDF preview conversion, MinerU parsing, and Word validation instead of old AI text fields.
- Modify `frontend/src/types.ts`: Make font fields optional because MinerU output does not provide Illustrator font metadata.
- Modify `frontend/src/App.tsx`: Keep UI behavior, adjust product copy and status copy to describe MinerU-backed OCR/layout results, and avoid unsafe numeric formatting on optional font metadata.
- Create `tests/backend/test_mineru_parser.py`: Unit tests for sample JSON extraction, HTML table conversion, bbox mapping, and empty block handling.
- Create `tests/backend/test_mineru_client.py`: Unit tests for MinerU HTTP client success/failure control flow with mocked `urllib`.
- Modify `tests/backend/test_pipeline.py`: Mock MinerU calls and assert Word validation plus preview/highlight payload.
- Modify `tests/backend/test_api.py`: Mock MinerU calls for endpoint tests and add missing-token failure coverage.
## Task 1: MinerU JSON Mapper

**Files:**
- Create: `backend/app/mineru_parser.py`
- Test: `tests/backend/test_mineru_parser.py`

- [ ] **Step 1: Write failing parser tests**

```python
# tests/backend/test_mineru_parser.py
from __future__ import annotations

from backend.app.mineru_parser import parse_mineru_fields


def test_parse_mineru_fields_extracts_text_and_bbox() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [2772, 1961],
                "para_blocks": [
                    {
                        "bbox": [704, 134, 2106, 229],
                        "type": "title",
                        "lines": [
                            {
                                "spans": [
                                    {
                                        "type": "text",
                                        "content": "食品名称:天问礼品粽",
                                        "bbox": [704, 134, 2106, 229],
                                    }
                                ]
                            }
                        ],
                    }
                ],
            }
        ]
    }

    parsed = parse_mineru_fields(payload)

    assert parsed.page_width == 2772
    assert parsed.page_height == 1961
    assert parsed.fields == [
        {
            "page": 1,
            "text": "食品名称:天问礼品粽",
            "font_name": "",
            "font_size_pt": None,
            "font_height_mm": None,
            "x0_pt": 704.0,
            "top_pt": 134.0,
            "x1_pt": 2106.0,
            "bottom_pt": 229.0,
        }
    ]


def test_parse_mineru_fields_turns_table_html_into_text() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [1000, 800],
                "para_blocks": [
                    {
                        "bbox": [10, 20, 300, 200],
                        "type": "table",
                        "lines": [
                            {
                                "spans": [
                                    {
                                        "type": "table",
                                        "html": "<table><tr><td>品种</td><td>规格</td></tr><tr><td>黑猪肉粽</td><td>130克×1</td></tr></table>",
                                    }
                                ]
                            }
                        ],
                    }
                ],
            }
        ]
    }

    parsed = parse_mineru_fields(payload)

    assert parsed.fields[0]["text"] == "品种 规格 黑猪肉粽 130克×1"


def test_parse_mineru_fields_skips_empty_decorative_blocks() -> None:
    payload = {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [1000, 800],
                "para_blocks": [
                    {"bbox": [1, 2, 3, 4], "type": "image", "lines": [{"spans": [{"type": "image"}]}]},
                    {"bbox": [5, 6, 7, 8], "type": "text", "lines": [{"spans": [{"content": " "}]}]},
                ],
            }
        ]
    }

    parsed = parse_mineru_fields(payload)

    assert parsed.fields == []
```

- [ ] **Step 2: Run parser tests and verify they fail**

Run: `pytest tests/backend/test_mineru_parser.py -v`

Expected: FAIL with `ModuleNotFoundError: No module named 'backend.app.mineru_parser'`.

- [ ] **Step 3: Implement the MinerU parser**

```python
# backend/app/mineru_parser.py
from __future__ import annotations

import html
import re
from dataclasses import dataclass
from typing import Any


@dataclass(slots=True)
class ParsedMineruDocument:
    page_width: float
    page_height: float
    fields: list[dict]


TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def _clean_text(value: str) -> str:
    without_tags = TAG_RE.sub(" ", html.unescape(value))
    return WHITESPACE_RE.sub(" ", without_tags).strip()


def _span_text(span: dict[str, Any]) -> str:
    if isinstance(span.get("content"), str):
        return _clean_text(span["content"])
    if isinstance(span.get("html"), str):
        return _clean_text(span["html"])
    return ""


def _block_text(block: dict[str, Any]) -> str:
    pieces: list[str] = []
    for line in block.get("lines") or []:
        for span in line.get("spans") or []:
            text = _span_text(span)
            if text:
                pieces.append(text)
    if not pieces and isinstance(block.get("text"), str):
        pieces.append(_clean_text(block["text"]))
    return WHITESPACE_RE.sub(" ", " ".join(pieces)).strip()


def _bbox(block: dict[str, Any]) -> tuple[float, float, float, float] | None:
    raw_bbox = block.get("bbox")
    if not isinstance(raw_bbox, list) or len(raw_bbox) != 4:
        return None
    try:
        x0, y0, x1, y1 = [float(value) for value in raw_bbox]
    except (TypeError, ValueError):
        return None
    if x1 <= x0 or y1 <= y0:
        return None
    return x0, y0, x1, y1


def _page_size(page: dict[str, Any]) -> tuple[float, float]:
    raw_size = page.get("page_size")
    if isinstance(raw_size, list) and len(raw_size) >= 2:
        return float(raw_size[0]), float(raw_size[1])
    return 1.0, 1.0


def parse_mineru_fields(payload: dict[str, Any]) -> ParsedMineruDocument:
    pages = payload.get("pdf_info")
    if not isinstance(pages, list) or not pages:
        raise ValueError("MinerU JSON does not contain pdf_info pages")

    first_width, first_height = _page_size(pages[0])
    fields: list[dict] = []

    for page in pages:
        page_number = int(page.get("page_idx", 0)) + 1
        for block in page.get("para_blocks") or []:
            text = _block_text(block)
            box = _bbox(block)
            if not text or box is None:
                continue
            x0, y0, x1, y1 = box
            fields.append(
                {
                    "page": page_number,
                    "text": text,
                    "font_name": "",
                    "font_size_pt": None,
                    "font_height_mm": None,
                    "x0_pt": x0,
                    "top_pt": y0,
                    "x1_pt": x1,
                    "bottom_pt": y1,
                }
            )

    return ParsedMineruDocument(page_width=first_width, page_height=first_height, fields=fields)
```

- [ ] **Step 4: Run parser tests and verify they pass**

Run: `pytest tests/backend/test_mineru_parser.py -v`

Expected: PASS.
## Task 2: MinerU HTTP Client

**Files:**
- Create: `backend/app/mineru_client.py`
- Test: `tests/backend/test_mineru_client.py`

- [ ] **Step 1: Write failing MinerU client tests**

```python
# tests/backend/test_mineru_client.py
from __future__ import annotations

import io
import json
import zipfile
from pathlib import Path

import pytest

from backend.app import mineru_client
from backend.app.mineru_client import MineruClient, MineruClientError


class FakeResponse:
    def __init__(self, status: int, body: bytes):
        self.status = status
        self._body = body

    def read(self) -> bytes:
        return self._body

    def __enter__(self) -> "FakeResponse":
        return self

    def __exit__(self, *_args: object) -> None:
        return None


def _zip_with_json() -> bytes:
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo_middle.json", json.dumps({"pdf_info": [{"page_idx": 0, "page_size": [1, 1], "para_blocks": []}]}))
    return buffer.getvalue()


def test_submit_pdf_downloads_and_loads_structured_json(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
    calls: list[str] = []

    def fake_urlopen(request, timeout=0):
        url = request.full_url if hasattr(request, "full_url") else request
        calls.append(str(url))
        if str(url).endswith("/api/v4/file-urls/batch"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"batch_id": "batch-1", "file_urls": ["https://upload.example/file"]}}).encode())
        if str(url) == "https://upload.example/file":
            return FakeResponse(200, b"")
        if str(url).endswith("/api/v4/extract-results/batch/batch-1"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"extract_result": [{"state": "done", "full_zip_url": "https://download.example/result.zip"}]}}).encode())
        if str(url) == "https://download.example/result.zip":
            return FakeResponse(200, _zip_with_json())
        raise AssertionError(f"unexpected URL {url}")

    monkeypatch.setattr(mineru_client.request, "urlopen", fake_urlopen)
    pdf_path = tmp_path / "preview.pdf"
    pdf_path.write_bytes(b"%PDF-1.7")

    payload = MineruClient(api_key="secret", poll_interval_seconds=0, max_polls=1).parse_pdf(pdf_path, tmp_path)

    assert payload["pdf_info"][0]["page_size"] == [1, 1]
    assert calls == [
        "https://mineru.net/api/v4/file-urls/batch",
        "https://upload.example/file",
        "https://mineru.net/api/v4/extract-results/batch/batch-1",
        "https://download.example/result.zip",
    ]
    assert (tmp_path / "mineru_result.zip").exists()


def test_submit_pdf_raises_on_failed_task(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
    def fake_urlopen(request, timeout=0):
        url = request.full_url if hasattr(request, "full_url") else request
        if str(url).endswith("/api/v4/file-urls/batch"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"batch_id": "batch-1", "file_urls": ["https://upload.example/file"]}}).encode())
        if str(url) == "https://upload.example/file":
            return FakeResponse(200, b"")
        if str(url).endswith("/api/v4/extract-results/batch/batch-1"):
            return FakeResponse(200, json.dumps({"code": 0, "data": {"extract_result": [{"state": "failed", "err_msg": "bad pdf"}]}}).encode())
        raise AssertionError(f"unexpected URL {url}")

    monkeypatch.setattr(mineru_client.request, "urlopen", fake_urlopen)
    pdf_path = tmp_path / "preview.pdf"
    pdf_path.write_bytes(b"%PDF-1.7")

    with pytest.raises(MineruClientError, match="bad pdf"):
        MineruClient(api_key="secret", poll_interval_seconds=0, max_polls=1).parse_pdf(pdf_path, tmp_path)
```

- [ ] **Step 2: Run client tests and verify they fail**

Run: `pytest tests/backend/test_mineru_client.py -v`

Expected: FAIL with `ModuleNotFoundError: No module named 'backend.app.mineru_client'`.

- [ ] **Step 3: Implement the MinerU client**

```python
# backend/app/mineru_client.py
from __future__ import annotations

import json
import time
import zipfile
from pathlib import Path
from urllib import request
from urllib.error import HTTPError, URLError


class MineruClientError(RuntimeError):
    pass


class MineruClient:
    def __init__(self, api_key: str, poll_interval_seconds: float = 2.0, max_polls: int = 90) -> None:
        self.api_key = api_key
        self.poll_interval_seconds = poll_interval_seconds
        self.max_polls = max_polls

    def parse_pdf(self, pdf_path: Path, output_dir: Path) -> dict:
        output_dir.mkdir(parents=True, exist_ok=True)
        batch_id, upload_url = self._request_upload_url(pdf_path.name)
        self._upload_file(upload_url, pdf_path)
        zip_url = self._poll_result(batch_id)
        zip_path = self._download_zip(zip_url, output_dir)
        extract_dir = output_dir / "mineru_result"
        self._extract_zip(zip_path, extract_dir)
        return self._load_structured_json(extract_dir)

    def _headers(self) -> dict[str, str]:
        return {"Authorization": f"Bearer {self.api_key}", "Accept": "*/*"}

    def _json_request(self, url: str, method: str = "GET", payload: dict | None = None) -> dict:
        body = None if payload is None else json.dumps(payload).encode("utf-8")
        headers = self._headers()
        if payload is not None:
            headers["Content-Type"] = "application/json"
        req = request.Request(url, data=body, headers=headers, method=method)
        try:
            with request.urlopen(req, timeout=30) as response:
                data = json.loads(response.read().decode("utf-8"))
        except (HTTPError, URLError, TimeoutError, json.JSONDecodeError) as exc:
            raise MineruClientError(f"MinerU request failed: {exc}") from exc
        if data.get("code") != 0:
            raise MineruClientError(str(data.get("msg") or "MinerU API returned an error"))
        return data

    def _request_upload_url(self, filename: str) -> tuple[str, str]:
        data = self._json_request(
            "https://mineru.net/api/v4/file-urls/batch",
            method="POST",
            payload={"files": [{"name": filename, "data_id": filename}], "model_version": "vlm"},
        )
        batch_id = data["data"]["batch_id"]
        file_urls = data["data"]["file_urls"]
        if not file_urls:
            raise MineruClientError("MinerU did not return an upload URL")
        return batch_id, file_urls[0]

    def _upload_file(self, upload_url: str, pdf_path: Path) -> None:
        req = request.Request(upload_url, data=pdf_path.read_bytes(), method="PUT")
        try:
            with request.urlopen(req, timeout=120) as response:
                if response.status >= 400:
                    raise MineruClientError(f"MinerU upload failed with HTTP {response.status}")
        except (HTTPError, URLError, TimeoutError) as exc:
            raise MineruClientError(f"MinerU upload failed: {exc}") from exc

    def _poll_result(self, batch_id: str) -> str:
        url = f"https://mineru.net/api/v4/extract-results/batch/{batch_id}"
        for _attempt in range(self.max_polls):
            data = self._json_request(url)
            extract_result = data.get("data", {}).get("extract_result")
            if isinstance(extract_result, list):
                result = extract_result[0] if extract_result else {}
            else:
                result = extract_result or {}
            state = result.get("state")
            if state == "done":
                zip_url = result.get("full_zip_url")
                if not zip_url:
                    raise MineruClientError("MinerU finished without full_zip_url")
                return zip_url
            if state == "failed":
                raise MineruClientError(str(result.get("err_msg") or "MinerU parsing failed"))
            time.sleep(self.poll_interval_seconds)
        raise MineruClientError("MinerU polling timed out")

    def _download_zip(self, zip_url: str, output_dir: Path) -> Path:
        target = output_dir / "mineru_result.zip"
        req = request.Request(zip_url, headers={"Accept": "*/*"}, method="GET")
        try:
            with request.urlopen(req, timeout=120) as response:
                target.write_bytes(response.read())
        except (HTTPError, URLError, TimeoutError) as exc:
            raise MineruClientError(f"MinerU zip download failed: {exc}") from exc
        return target

    def _extract_zip(self, zip_path: Path, extract_dir: Path) -> None:
        extract_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)

    def _load_structured_json(self, extract_dir: Path) -> dict:
        candidates = sorted(extract_dir.rglob("*middle.json")) + sorted(extract_dir.rglob("*_model.json"))
        if not candidates:
            raise MineruClientError("MinerU result zip did not contain structured JSON")
        return json.loads(candidates[0].read_text(encoding="utf-8"))
```

- [ ] **Step 4: Run client tests and verify they pass**

Run: `pytest tests/backend/test_mineru_client.py -v`

Expected: PASS.
## Task 3: Pipeline Integration

**Files:**
- Modify: `backend/app/pipeline.py`
- Modify: `tests/backend/test_pipeline.py`
- Modify: `tests/backend/test_api.py`

- [ ] **Step 1: Write failing pipeline tests with a mocked MinerU document**

```python
# tests/backend/test_pipeline.py
from pathlib import Path

import pytest

from backend.app import pipeline
from backend.app.pipeline import process_files


WORKDIR = Path("/Users/icemilk/Workspace/zld_POC")
AI_FILE = WORKDIR / "【2026-04-09】端午 - 背标 - 天问.ai"
DOCX_FILE = WORKDIR / "天问礼品粽【260331】.docx"
OUTPUT_DIR = WORKDIR / ".tmp_test_output"


def test_process_files_builds_preview_and_mineru_field_results(monkeypatch: pytest.MonkeyPatch) -> None:
    def fake_parse_with_mineru(_preview_path: Path, _output_dir: Path):
        return {
            "pdf_info": [
                {
                    "page_idx": 0,
                    "page_size": [2772, 1961],
                    "para_blocks": [
                        {
                            "bbox": [704, 134, 2106, 229],
                            "lines": [{"spans": [{"content": "食品名称:天问礼品粽"}]}],
                        },
                        {
                            "bbox": [10, 20, 40, 60],
                            "lines": [{"spans": [{"content": "Word中不存在的内容"}]}],
                        },
                    ],
                }
            ]
        }

    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", fake_parse_with_mineru)

    result = process_files(AI_FILE, DOCX_FILE, OUTPUT_DIR, job_id="test-job")

    assert result["preview"]["type"] == "pdf"
    assert result["preview"]["url"] == "/api/files/test-job/preview.pdf"
    assert result["preview"]["pageWidthPt"] == 2772
    assert result["preview"]["pageHeightPt"] == 1961
    assert result["fields"][0]["text"] == "食品名称:天问礼品粽"
    assert result["fields"][0]["validation_status"] == "matched"
    assert result["fields"][0]["x0_pt"] == 704.0
    assert any(field["validation_status"] == "unmatched" for field in result["fields"])
    assert (OUTPUT_DIR / "preview.pdf").exists()
```

- [ ] **Step 2: Replace API tests with mocked MinerU coverage**

```python
# tests/backend/test_api.py
from pathlib import Path

import pytest
from fastapi.testclient import TestClient

from backend.app import pipeline
from backend.app.main import app


WORKDIR = Path("/Users/icemilk/Workspace/zld_POC")
AI_FILE = WORKDIR / "【2026-04-09】端午 - 背标 - 天问.ai"
DOCX_FILE = WORKDIR / "天问礼品粽【260331】.docx"

client = TestClient(app)


def fake_mineru_payload() -> dict:
    return {
        "pdf_info": [
            {
                "page_idx": 0,
                "page_size": [2772, 1961],
                "para_blocks": [
                    {
                        "bbox": [704, 134, 2106, 229],
                        "lines": [{"spans": [{"content": "食品名称:天问礼品粽"}]}],
                    }
                ],
            }
        ]
    }


def test_process_endpoint_returns_preview_and_fields(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", lambda _preview_path, _output_dir: fake_mineru_payload())

    with AI_FILE.open("rb") as ai_fp, DOCX_FILE.open("rb") as docx_fp:
        response = client.post(
            "/api/process",
            files={
                "ai_file": (AI_FILE.name, ai_fp, "application/postscript"),
                "word_file": (
                    DOCX_FILE.name,
                    docx_fp,
                    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                ),
            },
        )

    assert response.status_code == 200

    payload = response.json()
    assert payload["preview"]["type"] == "pdf"
    assert payload["preview"]["pageWidthPt"] == 2772
    assert payload["fields"]
    assert payload["fields"][0]["text"] == "食品名称:天问礼品粽"


def test_process_endpoint_uses_default_sample_files_when_uploads_are_missing(monkeypatch: pytest.MonkeyPatch) -> None:
    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", lambda _preview_path, _output_dir: fake_mineru_payload())

    response = client.post("/api/process")

    assert response.status_code == 200

    payload = response.json()
    assert payload["preview"]["type"] == "pdf"
    assert payload["fields"]
    assert any(field["text"] for field in payload["fields"])


def test_process_endpoint_surfaces_missing_mineru_key(monkeypatch: pytest.MonkeyPatch) -> None:
    def fake_parse_with_mineru(_preview_path, _output_dir):
        raise RuntimeError("MINERU_API_KEY is required")

    monkeypatch.setattr(pipeline, "_parse_preview_with_mineru", fake_parse_with_mineru)

    response = client.post("/api/process")

    assert response.status_code == 500
    assert response.json()["detail"] == "MINERU_API_KEY is required"
```

- [ ] **Step 3: Run integration tests and verify they fail**

Run: `pytest tests/backend/test_pipeline.py tests/backend/test_api.py -v`

Expected: FAIL because `pipeline._parse_preview_with_mineru` does not exist and `process_files` still uses `ai_document.fields`.

- [ ] **Step 4: Update `backend/app/pipeline.py`**

```python
# backend/app/pipeline.py
from __future__ import annotations

import os
import shutil
from pathlib import Path

from backend.app.ai_parser import parse_ai_document
from backend.app.mineru_client import MineruClient
from backend.app.mineru_parser import parse_mineru_fields
from backend.app.text_validation import validate_field_against_word
from backend.app.word_parser import extract_word_text


def _sort_key(field: dict) -> tuple[int, int, float, float]:
    status_rank = {"matched": 0, "unmatched": 1, "empty_or_garbled": 2}
    return (
        status_rank.get(field["validation_status"], 9),
        field["page"],
        field["top_pt"],
        field["x0_pt"],
    )


def _parse_preview_with_mineru(preview_path: Path, output_dir: Path) -> dict:
    api_key = os.environ.get("MINERU_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("MINERU_API_KEY is required")
    return MineruClient(api_key=api_key).parse_pdf(preview_path, output_dir / "mineru")


def process_files(ai_path: Path, word_path: Path, output_dir: Path, job_id: str | None = None) -> dict:
    output_dir.mkdir(parents=True, exist_ok=True)
    ai_document = parse_ai_document(ai_path, output_dir / "parsed")
    word_text = extract_word_text(word_path)

    preview_filename = "preview.pdf"
    preview_target = output_dir / preview_filename
    if ai_document.preview_path != preview_target:
        shutil.copy2(ai_document.preview_path, preview_target)

    mineru_payload = _parse_preview_with_mineru(preview_target, output_dir)
    mineru_document = parse_mineru_fields(mineru_payload)

    fields: list[dict] = []
    for index, field in enumerate(mineru_document.fields, start=1):
        validation = validate_field_against_word(field["text"], word_text)
        fields.append(
            {
                "id": f"field-{index}",
                **field,
                "normalized_text": validation.normalized_text,
                "validation_status": validation.status,
                "validation_reason": validation.reason,
                "matched_excerpt": validation.matched_excerpt,
            }
        )

    fields.sort(key=_sort_key)

    preview_url = f"/api/files/{job_id}/{preview_filename}" if job_id else preview_filename
    return {
        "preview": {
            "type": "pdf",
            "url": preview_url,
            "pageWidthPt": mineru_document.page_width,
            "pageHeightPt": mineru_document.page_height,
        },
        "fields": fields,
    }
```

- [ ] **Step 5: Run integration tests and verify they pass**

Run: `pytest tests/backend/test_pipeline.py tests/backend/test_api.py -v`

Expected: PASS.
## Task 4: Frontend Type and Copy Compatibility

**Files:**
- Modify: `frontend/src/types.ts`
- Modify: `frontend/src/App.tsx`

- [ ] **Step 1: Update TypeScript types**

```ts
// frontend/src/types.ts
export type FieldResult = {
  id: string
  page: number
  text: string
  font_name?: string | null
  font_size_pt?: number | null
  font_height_mm?: number | null
  x0_pt: number
  top_pt: number
  x1_pt: number
  bottom_pt: number
  normalized_text: string
  validation_status: ValidationStatus
  validation_reason: string
  matched_excerpt: string | null
}
```

- [ ] **Step 2: Update `App.tsx` display guards and copy**

```tsx
// Replace the hero copy with:
<p className="hero-copy">
  上传 Illustrator 源文件与 Word 校对稿,系统会将设计文件转换为 PDF 后交给 MinerU 解析,
  再把识别出的版面文字与 Word 内容逐块比对。
</p>

// Replace the font metadata rendering with:
<div className="field-meta">
  <span>第 {field.page} 页</span>
  {field.font_name ? <span>{field.font_name}</span> : null}
  {typeof field.font_size_pt === 'number' ? <span>{field.font_size_pt} pt</span> : null}
  {typeof field.font_height_mm === 'number' ? <span>{field.font_height_mm.toFixed(1)} mm</span> : null}
</div>
```

- [ ] **Step 3: Run frontend type check**

Run: `cd frontend && npm run build`

Expected: PASS.
## Task 5: Full Verification

**Files:**
- No new files.

- [ ] **Step 1: Run backend tests**

Run: `pytest tests/backend/test_mineru_parser.py tests/backend/test_mineru_client.py tests/backend/test_pipeline.py tests/backend/test_api.py -v`

Expected: PASS.

- [ ] **Step 2: Run frontend build**

Run: `cd frontend && npm run build`

Expected: PASS.

- [ ] **Step 3: Run local manual verification with the real MinerU API**

Set `MINERU_API_KEY` in the shell environment, then run the backend:

```bash
./scripts/start_backend.sh
```

Run the frontend in another terminal:

```bash
./scripts/start_frontend.sh
```

Open the frontend, upload the sample `.ai` and `.docx`, click `开始解析`, and verify:

- The request completes without leaking the token to browser requests.
- The right preview shows the PDF.
- The left result list contains MinerU-derived text blocks.
- Clicking a result card highlights the corresponding MinerU bbox on the right preview.
- Blocks found in the Word document show `校验成功`; missing blocks show `校验失败`.

- [ ] **Step 4: Skip commit in this workspace**

This project directory is not a git repository, so do not run `git commit`. Report the changed file list in the final response instead.
File: docs/superpowers/specs/2026-04-14-mineru-ai-word-parse-design.md (new file, 108 lines)
# MinerU AI Word Parse Design

## Goal

Replace the current AI text extraction pipeline with a MinerU-backed flow:

1. Accept an Illustrator `.ai` file and a Word `.docx` file.
2. Convert or normalize the `.ai` file into a PDF preview artifact.
3. Upload the PDF artifact to MinerU for document parsing.
4. Read MinerU JSON output blocks and their bounding boxes.
5. Compare MinerU text output against the Word document text.
6. Return field results that the existing React preview can highlight on the right side.

## Non-Goals

- Do not keep the old `parse_ai_document(...).fields` text extraction as the source of validation fields.
- Do not expose the MinerU API key to the frontend.
- Do not require a public callback URL; use polling because this is a local backend.
- Do not add a new manual annotation workflow.

## Backend Flow

The `/api/process` endpoint keeps its current two-file upload contract: `ai_file` and `word_file`.

For each job, the backend creates the existing runtime upload/output directories. The `.ai` file is converted into a PDF preview artifact using the existing `backend.app.ai_parser.parse_ai_document` conversion behavior. The resulting `preview.pdf` is copied into the job output directory and returned as the preview URL.

The backend then submits that PDF preview artifact to MinerU using the documented local-file upload flow:

1. `POST https://mineru.net/api/v4/file-urls/batch` with one file entry and `model_version: "vlm"`.
2. `PUT` the generated upload URL with the PDF bytes.
3. Poll `GET https://mineru.net/api/v4/extract-results/batch/{batch_id}` until the single file reaches `done`, `failed`, or a timeout.
4. When `done`, download `full_zip_url` into the job output directory.
5. Extract the zip into the job output directory and locate the structured JSON output.
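The request-building side of this flow can be sketched with the stdlib `urllib` the backend already uses. This is a sketch only: the helper names are illustrative, and uploading, polling, and error handling are left to the real client.

```python
import json
from urllib import request

API_BASE = "https://mineru.net/api/v4"


def build_batch_request(filename: str, api_key: str) -> request.Request:
    """Step 1: ask MinerU for a batch id plus one pre-signed upload URL."""
    payload = {"files": [{"name": filename, "data_id": filename}], "model_version": "vlm"}
    return request.Request(
        f"{API_BASE}/file-urls/batch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def build_poll_url(batch_id: str) -> str:
    """Step 3: poll this URL until the file state is `done` or `failed`."""
    return f"{API_BASE}/extract-results/batch/{batch_id}"
```

The pre-signed upload URL returned by step 1 is then used for a plain `PUT` of the PDF bytes, with no Authorization header required.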
The API token is read from `MINERU_API_KEY`. If it is missing, the backend returns a clear configuration error instead of attempting the request.

## MinerU JSON Mapping

The primary parser reads MinerU `middle.json`-style output because the sample JSON contains:

- `pdf_info[]`
- `page_idx`
- `page_size`
- `para_blocks[]`
- `discarded_blocks[]`
- block-level `bbox: [x0, y0, x1, y1]`
- nested `lines[].spans[]` with `content`, `html`, and span-level `bbox`

Each top-level `para_blocks` item becomes one validation result. For blocks with nested line/span content, the backend concatenates text-like span content. Table spans with `html` are converted to readable text by stripping tags and HTML entities. If a block has no readable text, it can still be returned as `empty_or_garbled` when useful, but empty decorative blocks should be skipped.
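The table-to-text conversion needs nothing beyond the stdlib. A minimal sketch (the function name here is illustrative):

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def table_html_to_text(value: str) -> str:
    # Replace each tag with a space, decode entities, then collapse whitespace
    # so table cells come out as a single space-separated string.
    without_tags = TAG_RE.sub(" ", html.unescape(value))
    return re.sub(r"\s+", " ", without_tags).strip()
```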
Coordinate mapping:

- MinerU uses pixel-like page coordinates with the origin at the top-left.
- The frontend preview expects top-left coordinates named `x0_pt`, `top_pt`, `x1_pt`, and `bottom_pt`.
- The backend returns MinerU coordinates directly as field coordinates and sets preview `pageWidthPt`/`pageHeightPt` from `page_size`, because the frontend scales both the preview and the overlay from the same coordinate system.

For multi-page output, `page` is `page_idx + 1`.
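Because the frontend scales preview and overlay from the same coordinate system, a highlight rectangle reduces to fractions of the page size. A sketch of that conversion (the helper is hypothetical, not part of the planned modules):

```python
def bbox_to_relative(bbox: list[float], page_width: float, page_height: float) -> dict[str, float]:
    """Convert a top-left MinerU bbox into 0..1 fractions of the page.
    The frontend can multiply these by the rendered preview's pixel size."""
    x0, y0, x1, y1 = bbox
    return {
        "left": x0 / page_width,
        "top": y0 / page_height,
        "width": (x1 - x0) / page_width,
        "height": (y1 - y0) / page_height,
    }
```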
## Word Comparison

The Word document remains the validation baseline. The backend uses the existing `extract_word_text` and `validate_field_against_word` behavior:

- MinerU block text is normalized and compared against the full Word text.
- The result status remains `matched`, `unmatched`, or `empty_or_garbled`.
- The response keeps a `fields` array compatible with the current React UI.

This preserves the existing sidebar and highlighter behavior while changing the field source from the old AI PDF text extraction to MinerU OCR/layout extraction.
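The matching idea can be sketched as whitespace-insensitive substring lookup. This is an assumption about the shape of the check; the real `validate_field_against_word` may normalize differently and also return a reason and matched excerpt.

```python
import re


def match_against_word(block_text: str, word_text: str) -> str:
    """Return one of the three validation statuses for a MinerU block (sketch)."""
    normalized = re.sub(r"\s+", "", block_text)
    if not normalized:
        return "empty_or_garbled"
    haystack = re.sub(r"\s+", "", word_text)
    return "matched" if normalized in haystack else "unmatched"
```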
## Frontend Contract

The current `ProcessResponse` shape should remain mostly compatible:

- `preview.type`: `pdf`
- `preview.url`: generated PDF preview URL
- `preview.pageWidthPt`: MinerU page width
- `preview.pageHeightPt`: MinerU page height
- `fields[]`: validation blocks with text, status, reason, matched excerpt, page, and coordinates
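Put together, a response for a single matched block would look roughly like the following (illustrative values only; the exact `validation_reason` and `matched_excerpt` strings depend on the validator):

```python
# Illustrative /api/process response payload, shown as a Python literal.
sample_response = {
    "preview": {
        "type": "pdf",
        "url": "/api/files/test-job/preview.pdf",
        "pageWidthPt": 2772,
        "pageHeightPt": 1961,
    },
    "fields": [
        {
            "id": "field-1",
            "page": 1,
            "text": "食品名称:天问礼品粽",
            "font_name": "",            # MinerU provides no Illustrator font metadata
            "font_size_pt": None,
            "font_height_mm": None,
            "x0_pt": 704.0,
            "top_pt": 134.0,
            "x1_pt": 2106.0,
            "bottom_pt": 229.0,
            "normalized_text": "食品名称:天问礼品粽",
            "validation_status": "matched",
            "validation_reason": "",
            "matched_excerpt": "食品名称:天问礼品粽",
        }
    ],
}
```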
Small frontend changes may be needed to make optional typography metadata safe, because MinerU blocks do not provide Illustrator font names or font sizes.

The right preview continues to render `preview.pdf` and draw overlay rectangles from `fields[]`.

## Error Handling

Return actionable API errors for:

- Unsupported upload types.
- `.ai` to PDF conversion failure.
- Missing `MINERU_API_KEY`.
- MinerU upload URL request failure.
- MinerU upload PUT failure.
- MinerU polling timeout.
- MinerU task failure, including `err_msg` when present.
- Missing structured JSON in the downloaded zip.

The API key must not be logged or included in response payloads.
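One small guard that helps enforce the no-leak rule: redact the token from any error detail before it reaches logs or a response body. This helper is a suggestion, not part of the planned modules.

```python
import os


def safe_detail(message: str) -> str:
    """Redact the MinerU token from an outgoing error message, so HTTP
    error text that happens to echo the Authorization header never
    exposes MINERU_API_KEY in logs or response payloads."""
    api_key = os.environ.get("MINERU_API_KEY", "").strip()
    if api_key and api_key in message:
        return message.replace(api_key, "***")
    return message
```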
## Testing

Backend tests should cover:

- MinerU JSON block extraction from a sample local JSON file.
- HTML table text conversion.
- Coordinate mapping from MinerU bbox into field coordinates.
- Word comparison integration using mocked MinerU results.
- MinerU client control flow with mocked HTTP responses.

Manual verification should run the backend and frontend locally with `MINERU_API_KEY` set, upload the sample `.ai` and `.docx`, and confirm that result cards appear and the corresponding boxes highlight on the right preview.