xunbu
2026-01-01 01:20:54 +08:00
parent 1de50cb232
commit ed8461efa2
4 changed files with 626 additions and 942 deletions

485
README.md

@@ -46,11 +46,10 @@
## Integration Packages
For users who want to get started quickly, we provide integration packages on [GitHub Releases](https://github.com/xunbu/docutranslate/releases). simply download, unzip, and enter your AI platform API-Key to start using it.
For users who want to get started quickly, we provide integration packages on [GitHub Releases](https://github.com/xunbu/docutranslate/releases). Simply download, unzip, and enter your AI platform API-Key to start using it.
- **DocuTranslate**: Standard version. Uses the online `minerU` engine to parse PDF documents. Choose this version if you do not need local PDF parsing (Recommended).
- **DocuTranslate_full**: Full version. Includes the built-in `docling` local PDF parsing engine. Choose this version if you need to parse PDFs locally.
> Version 1.5.1 and later support calling a locally deployed mineru service.
- **DocuTranslate**: Standard version. Uses `minerU` (the online service or a locally deployed instance) for PDF parsing. (Recommended)
- **DocuTranslate_full**: Full version. Includes the built-in `docling` local PDF parsing engine. Choose this version if you need offline PDF parsing without minerU.
## Installation
@@ -99,33 +98,13 @@ docker run -d -p 8010:8010 xunbu/docutranslate:latest
## Core Concept: Workflow
The core of the new DocuTranslate is the **Workflow**. Each workflow is a complete end-to-end translation pipeline designed for a specific file type. Instead of interacting with a massive class, you select and configure a suitable workflow based on your file type.
DocuTranslate is built around **Workflows**: each workflow is a complete translation pipeline for a specific file type.
**The basic usage flow is as follows:**
1. **Select Workflow**: Choose a workflow based on your input file type (e.g., PDF/Word or TXT), such as `MarkdownBasedWorkflow` or `TXTWorkflow`.
2. **Build Configuration**: Create the corresponding configuration object for the selected workflow (e.g., `MarkdownBasedWorkflowConfig`). This configuration object contains all necessary sub-configurations, such as:
* **Converter Config**: Defines how to convert the original file (e.g., PDF) to Markdown.
* **Translator Config**: Defines which LLM, API-Key, target language, etc., to use.
* **Exporter Config**: Defines specific options for the output format (e.g., HTML).
3. **Instantiate Workflow**: Create a workflow instance using the configuration object.
4. **Execute Translation**: Call the workflow's `.read_*()` and `.translate()` / `.translate_async()` methods.
5. **Export/Save Results**: Call `.export_to_*()` or `.save_as_*()` methods to get or save the translation results.
## Available Workflows
| Workflow | Applicable Scenarios | Input Formats | Output Formats | Core Configuration Class |
|:---|:---|:---|:---|:---|
| **`MarkdownBasedWorkflow`** | Handles rich text documents like PDF, Word, images, etc. Flow: `File -> Markdown -> Translate -> Export`. | `.pdf`, `.docx`, `.md`, `.png`, `.jpg`, etc. | `.md`, `.zip`, `.html` | `MarkdownBasedWorkflowConfig` |
| **`TXTWorkflow`** | Handles plain text documents. Flow: `txt -> Translate -> Export`. | `.txt` and other plain text formats | `.txt`, `.html` | `TXTWorkflowConfig` |
| **`JsonWorkflow`** | Handles JSON files. Flow: `json -> Translate -> Export`. | `.json` | `.json`, `.html` | `JsonWorkflowConfig` |
| **`DocxWorkflow`** | Handles docx files. Flow: `docx -> Translate -> Export`. | `.docx` | `.docx`, `.html` | `DocxWorkflowConfig` |
| **`XlsxWorkflow`** | Handles xlsx files. Flow: `xlsx -> Translate -> Export`. | `.xlsx`, `.csv` | `.xlsx`, `.html` | `XlsxWorkflowConfig` |
| **`SrtWorkflow`** | Handles srt files. Flow: `srt -> Translate -> Export`. | `.srt` | `.srt`, `.html` | `SrtWorkflowConfig` |
| **`EpubWorkflow`** | Handles epub files. Flow: `epub -> Translate -> Export`. | `.epub` | `.epub`, `.html` | `EpubWorkflowConfig` |
| **`HtmlWorkflow`** | Handles html files. Flow: `html -> Translate -> Export`. | `.html`, `.htm` | `.html` | `HtmlWorkflowConfig` |
> In the interactive interface, you can also export to PDF format.
**Basic flow:**
1. Select a workflow based on the file type
2. Configure the workflow (LLM, parsing engine, output format)
3. Execute the translation
4. Save the results
## Start Web UI and API Service
@@ -154,6 +133,129 @@ docutranslate -i
## Usage Examples
### Using the Simple Client SDK (Recommended)
The easiest way to get started is the `Client` class, which provides a simple API for translation:
```python
from docutranslate.sdk import Client

# Initialize the client with your AI platform settings
client = Client(
    api_key="YOUR_OPENAI_API_KEY",  # or any other AI platform API key
    base_url="https://api.openai.com/v1/",
    model_id="gpt-4o",
    to_lang="Chinese",
    concurrent=10,  # Number of concurrent requests
)

# Translate a single file (auto-detects file type)
result = client.translate("path/to/your/document.pdf")

# Save with default format (PDF -> html by default)
print(f"Translation complete! Saved to: {result.save()}")

# Or specify output format explicitly
# For PDF/markdown_based:
#   - "markdown": Markdown with embedded base64 images (default)
#   - "markdown_zip": Markdown with separate image files (ZIP archive)
#   - "html": HTML format
# For docx: "docx"
# For xlsx: "xlsx"
result.save(fmt="html")  # Save as HTML
result.save(fmt="markdown")  # Save as Markdown with embedded images
result.save(fmt="markdown_zip")  # Save as ZIP with separate images

# Save to custom location
result.save(output_dir="./my_translations", name="my_document.html")

# Export as base64 encoded string
base64_content = result.export(fmt="html")
print(f"Exported content length: {len(base64_content)}")

# You can also access the underlying workflow for advanced operations
# workflow = result.workflow
```
**Client Features:**
- **Auto-detection**: Automatically detects file type and selects the appropriate workflow
- **Flexible Configuration**: Override any default settings per translation call
- **Multiple Output Options**: Save to disk or export as Base64 string
- **Async Support**: Use `translate_async()` for concurrent translation tasks
#### Client SDK Parameters
| Parameter | Type | Default | Description |
|:---|:---|:---|:---|
| **api_key** | `str` | - | AI platform API key |
| **base_url** | `str` | - | AI platform base URL (e.g., `https://api.openai.com/v1/`) |
| **model_id** | `str` | - | Model ID to use for translation |
| **to_lang** | `str` | - | Target language (e.g., `"Chinese"`, `"English"`, `"Japanese"`) |
| **concurrent** | `int` | 10 | Number of concurrent LLM requests |
| **convert_engine** | `str` | `"mineru"` | PDF parsing engine: `"mineru"`, `"docling"`, `"mineru_deploy"` |
| **mineru_deploy_base_url** | `str` | - | Local minerU API address (when `convert_engine="mineru_deploy"`) |
| **mineru_token** | `str` | - | minerU API token (when using online minerU) |
| **skip_translate** | `bool` | `False` | Skip translation, only parse document |
| **output_dir** | `str` | `"./output"` | Default output directory for `save()` |
| **chunk_size** | `int` | 3000 | Text chunk size for LLM processing |
| **temperature** | `float` | 0.3 | LLM temperature parameter |
| **timeout** | `int` | 60 | Request timeout in seconds |
| **retry** | `int` | 3 | Number of retry attempts on failure |
| **provider** | `str` | `"auto"` | AI provider type (auto, openai, azure, etc.) |
| **force_json** | `bool` | `False` | Force JSON output mode |
| **rpm** | `int` | - | Requests per minute limit |
| **tpm** | `int` | - | Tokens per minute limit |
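
As a rough illustration of what a cap such as `concurrent` means, the sketch below limits the number of in-flight tasks with `asyncio.Semaphore`. It is not DocuTranslate's implementation: `run_limited` and `fake_request` are hypothetical names, and the short sleep stands in for an LLM call.

```python
import asyncio


async def run_limited(n_tasks: int, concurrent: int) -> int:
    """Run n_tasks fake requests, never more than `concurrent` at once.

    Returns the highest number of requests observed in flight.
    """
    semaphore = asyncio.Semaphore(concurrent)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:  # hypothetical stand-in for an LLM call
        nonlocal in_flight, peak
        async with semaphore:  # blocks while `concurrent` requests are running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(n_tasks)))
    return peak
```

Calling `asyncio.run(run_limited(n_tasks=25, concurrent=10))` returns a peak that never exceeds the cap, however many tasks are queued.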
#### Result Methods
| Method | Parameters | Description |
|:---|:---|:---|
| **save()** | `output_dir`, `name`, `fmt` | Save translation result to disk |
| **export()** | `fmt` | Export as Base64 encoded string |
| **supported_formats** | - | Get list of supported output formats |
| **workflow** | - | Access underlying workflow object |
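
Because `export()` returns a Base64 string, it has to be decoded before it can be written out as a file. Below is a minimal sketch using only the standard library. `exported` is a stand-in built locally for illustration (not real `export()` output), and `write_exported` is a hypothetical helper.

```python
import base64
import tempfile
from pathlib import Path


def write_exported(b64_content: str, path: Path) -> int:
    """Decode a Base64 string and write the raw bytes; return bytes written."""
    data = base64.b64decode(b64_content)
    path.write_bytes(data)
    return len(data)


# Stand-in for `result.export(fmt="html")`, built locally for illustration.
exported = base64.b64encode(
    "<html><body>你好</body></html>".encode("utf-8")
).decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "translated.html"
    written = write_exported(exported, target)
    restored = target.read_text(encoding="utf-8")
```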
`translate_async()` can be combined with `asyncio.gather` to translate several files concurrently:
```python
import asyncio

from docutranslate.sdk import Client


async def translate_multiple():
    client = Client(
        api_key="YOUR_API_KEY",
        base_url="https://api.openai.com/v1/",
        model_id="gpt-4o",
        to_lang="Chinese",
    )

    # Translate multiple files concurrently
    files = ["doc1.pdf", "doc2.docx", "notes.txt"]
    results = await asyncio.gather(
        *[client.translate_async(f) for f in files]
    )

    for r in results:
        print(f"Saved: {r.save()}")


asyncio.run(translate_multiple())
```
### Available Workflows (For Workflow API)
If you prefer more control, use the Workflow API directly. Here are the available workflows:
| Workflow | Applicable Scenarios | Input Formats | Output Formats | Core Configuration Class |
|:---|:---|:---|:---|:---|
| **`MarkdownBasedWorkflow`** | Handles rich text documents like PDF, Word, images, etc. Flow: `File -> Markdown -> Translate -> Export`. | `.pdf`, `.docx`, `.md`, `.png`, `.jpg`, etc. | `.md`, `.zip`, `.html` | `MarkdownBasedWorkflowConfig` |
| **`TXTWorkflow`** | Handles plain text documents. Flow: `txt -> Translate -> Export`. | `.txt` and other plain text formats | `.txt`, `.html` | `TXTWorkflowConfig` |
| **`JsonWorkflow`** | Handles JSON files. Flow: `json -> Translate -> Export`. | `.json` | `.json`, `.html` | `JsonWorkflowConfig` |
| **`DocxWorkflow`** | Handles docx files. Flow: `docx -> Translate -> Export`. | `.docx` | `.docx`, `.html` | `DocxWorkflowConfig` |
| **`XlsxWorkflow`** | Handles xlsx files. Flow: `xlsx -> Translate -> Export`. | `.xlsx`, `.csv` | `.xlsx`, `.html` | `XlsxWorkflowConfig` |
| **`SrtWorkflow`** | Handles srt files. Flow: `srt -> Translate -> Export`. | `.srt` | `.srt`, `.html` | `SrtWorkflowConfig` |
| **`EpubWorkflow`** | Handles epub files. Flow: `epub -> Translate -> Export`. | `.epub` | `.epub`, `.html` | `EpubWorkflowConfig` |
| **`HtmlWorkflow`** | Handles html files. Flow: `html -> Translate -> Export`. | `.html`, `.htm` | `.html` | `HtmlWorkflowConfig` |
> In the interactive interface, you can also export to PDF format.
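
The table can be read as a lookup from file extension to workflow. The helper below is a hypothetical sketch of that lookup; the `Client` performs its own detection, which may differ. Note that `.docx` is listed under two workflows, so routing it to `DocxWorkflow` here is an assumption.

```python
from pathlib import Path

# Extension -> workflow name, transcribed from the table above.
# Assumption: `.docx` goes to DocxWorkflow, although MarkdownBasedWorkflow
# also lists it as an input format.
WORKFLOW_BY_SUFFIX = {
    ".pdf": "MarkdownBasedWorkflow",
    ".md": "MarkdownBasedWorkflow",
    ".png": "MarkdownBasedWorkflow",
    ".jpg": "MarkdownBasedWorkflow",
    ".txt": "TXTWorkflow",
    ".json": "JsonWorkflow",
    ".docx": "DocxWorkflow",
    ".xlsx": "XlsxWorkflow",
    ".csv": "XlsxWorkflow",
    ".srt": "SrtWorkflow",
    ".epub": "EpubWorkflow",
    ".html": "HtmlWorkflow",
    ".htm": "HtmlWorkflow",
}


def pick_workflow(path: str) -> str:
    """Return the workflow name for a path, or raise ValueError if unknown."""
    suffix = Path(path).suffix.lower()
    try:
        return WORKFLOW_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no workflow known for {suffix!r}") from None
```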
### Example 1: Translate a PDF File (Using `MarkdownBasedWorkflow`)
This is the most common use case. We will use the `minerU` engine to convert the PDF to Markdown, and then translate it using an LLM. This example uses asynchronous execution.
@@ -193,22 +295,6 @@ async def main():
translator_config=translator_config, # Pass translator config
html_exporter_config=MD2HTMLExporterConfig(cdn=True) # HTML export config
)
# Using locally deployed mineru service
# from docutranslate.converter.x2md.converter_mineru_deploy import ConverterMineruDeployConfig
# converter_config = ConverterMineruDeployConfig(
# base_url = "http://127.0.0.1:8000",
# output_dir= "./output", # Due to mineru limitations, parsed files are saved to output_dir and need periodic cleaning
# backend= "pipeline",
# start_page_id = 0,
# end_page_id = 99999,
# )
# workflow_config = MarkdownBasedWorkflowConfig(
# convert_engine="mineru_deploy", # Specify parsing engine
# converter_config=converter_config, # Pass converter config
# translator_config=translator_config, # Pass translator config
# html_exporter_config=MD2HTMLExporterConfig(cdn=True) # HTML export config
# )
# 4. Instantiate Workflow
workflow = MarkdownBasedWorkflow(config=workflow_config)
@@ -237,254 +323,25 @@ if __name__ == "__main__":
asyncio.run(main())
```
### Example 2: Translate a TXT File (Using `TXTWorkflow`)
### Other Workflows
For plain text files, the process is simpler as it doesn't require a document parsing (conversion) step. This example uses asynchronous execution.
All workflows follow the same pattern. Import the corresponding config and workflow, then configure:
```python
import asyncio

from docutranslate.workflow.txt_workflow import TXTWorkflow, TXTWorkflowConfig
from docutranslate.translator.ai_translator.txt_translator import TXTTranslatorConfig
from docutranslate.exporter.txt.txt2html_exporter import TXT2HTMLExporterConfig


async def main():
    # 1. Build Translator Configuration
    translator_config = TXTTranslatorConfig(
        base_url="https://api.openai.com/v1/",
        api_key="YOUR_OPENAI_API_KEY",
        model_id="gpt-4o",
        to_lang="Chinese",
    )

    # 2. Build Main Workflow Configuration
    workflow_config = TXTWorkflowConfig(
        translator_config=translator_config,
        html_exporter_config=TXT2HTMLExporterConfig(cdn=True)
    )

    # 3. Instantiate Workflow
    workflow = TXTWorkflow(config=workflow_config)

    # 4. Read file and execute translation
    workflow.read_path("path/to/your/notes.txt")
    await workflow.translate_async()
    # Or use synchronous method
    # workflow.translate()

    # 5. Save results
    workflow.save_as_txt(name="translated_notes.txt")
    print("TXT file saved.")

    # You can also export the translated plain text
    text = workflow.export_to_txt()


if __name__ == "__main__":
    asyncio.run(main())
# TXT: from docutranslate.workflow.txt_workflow import TXTWorkflow, TXTWorkflowConfig
# JSON: from docutranslate.workflow.json_workflow import JsonWorkflow, JsonWorkflowConfig
# DOCX: from docutranslate.workflow.docx_workflow import DocxWorkflow, DocxWorkflowConfig
# XLSX: from docutranslate.workflow.xlsx_workflow import XlsxWorkflow, XlsxWorkflowConfig
# EPUB: from docutranslate.workflow.epub_workflow import EpubWorkflow, EpubWorkflowConfig
# HTML: from docutranslate.workflow.html_workflow import HtmlWorkflow, HtmlWorkflowConfig
# SRT: from docutranslate.workflow.srt_workflow import SrtWorkflow, SrtWorkflowConfig
# ASS: from docutranslate.workflow.ass_workflow import AssWorkflow, AssWorkflowConfig
```
### Example 3: Translate a JSON File (Using `JsonWorkflow`)
This example uses asynchronous execution. In `JsonTranslatorConfig`, `json_paths` lists the JSON paths to translate (in `jsonpath-ng` syntax); only values matching these paths are translated.
```python
import asyncio

from docutranslate.exporter.js.json2html_exporter import Json2HTMLExporterConfig
from docutranslate.translator.ai_translator.json_translator import JsonTranslatorConfig
from docutranslate.workflow.json_workflow import JsonWorkflowConfig, JsonWorkflow


async def main():
    # 1. Build Translator Configuration
    translator_config = JsonTranslatorConfig(
        base_url="https://api.openai.com/v1/",
        api_key="YOUR_OPENAI_API_KEY",
        model_id="gpt-4o",
        to_lang="Chinese",
        json_paths=["$.*", "$.name"]  # jsonpath-ng expressions; only values at matching paths are translated
    )

    # 2. Build Main Workflow Configuration
    workflow_config = JsonWorkflowConfig(
        translator_config=translator_config,
        html_exporter_config=Json2HTMLExporterConfig(cdn=True)
    )

    # 3. Instantiate Workflow
    workflow = JsonWorkflow(config=workflow_config)

    # 4. Read file and execute translation
    workflow.read_path("path/to/your/notes.json")
    await workflow.translate_async()
    # Or use synchronous method
    # workflow.translate()

    # 5. Save results
    workflow.save_as_json(name="translated_notes.json")
    print("JSON file saved.")

    # You can also export the translated JSON text
    text = workflow.export_to_json()


if __name__ == "__main__":
    asyncio.run(main())
```
### Example 4: Translate a Docx File (Using `DocxWorkflow`)
This example uses asynchronous execution.
```python
import asyncio

from docutranslate.exporter.docx.docx2html_exporter import Docx2HTMLExporterConfig
from docutranslate.translator.ai_translator.docx_translator import DocxTranslatorConfig
from docutranslate.workflow.docx_workflow import DocxWorkflowConfig, DocxWorkflow


async def main():
    # 1. Build Translator Configuration
    translator_config = DocxTranslatorConfig(
        base_url="https://api.openai.com/v1/",
        api_key="YOUR_OPENAI_API_KEY",
        model_id="gpt-4o",
        to_lang="Chinese",
        insert_mode="replace",  # Options: "replace", "append", "prepend"
        separator="\n",  # Separator used for "append", "prepend" modes
    )

    # 2. Build Main Workflow Configuration
    workflow_config = DocxWorkflowConfig(
        translator_config=translator_config,
        html_exporter_config=Docx2HTMLExporterConfig(cdn=True)
    )

    # 3. Instantiate Workflow
    workflow = DocxWorkflow(config=workflow_config)

    # 4. Read file and execute translation
    workflow.read_path("path/to/your/notes.docx")
    await workflow.translate_async()
    # Or use synchronous method
    # workflow.translate()

    # 5. Save results
    workflow.save_as_docx(name="translated_notes.docx")
    print("docx file saved.")

    # You can also export the translated docx bytes
    text_bytes = workflow.export_to_docx()


if __name__ == "__main__":
    asyncio.run(main())
```
### Example 5: Translate an Xlsx File (Using `XlsxWorkflow`)
This example uses asynchronous execution.
```python
import asyncio

from docutranslate.exporter.xlsx.xlsx2html_exporter import Xlsx2HTMLExporterConfig
from docutranslate.translator.ai_translator.xlsx_translator import XlsxTranslatorConfig
from docutranslate.workflow.xlsx_workflow import XlsxWorkflowConfig, XlsxWorkflow


async def main():
    # 1. Build Translator Configuration
    translator_config = XlsxTranslatorConfig(
        base_url="https://api.openai.com/v1/",
        api_key="YOUR_OPENAI_API_KEY",
        model_id="gpt-4o",
        to_lang="Chinese",
        insert_mode="replace",  # Options: "replace", "append", "prepend"
        separator="\n",  # Separator used for "append", "prepend" modes
    )

    # 2. Build Main Workflow Configuration
    workflow_config = XlsxWorkflowConfig(
        translator_config=translator_config,
        html_exporter_config=Xlsx2HTMLExporterConfig(cdn=True)
    )

    # 3. Instantiate Workflow
    workflow = XlsxWorkflow(config=workflow_config)

    # 4. Read file and execute translation
    workflow.read_path("path/to/your/notes.xlsx")
    await workflow.translate_async()
    # Or use synchronous method
    # workflow.translate()

    # 5. Save results
    workflow.save_as_xlsx(name="translated_notes.xlsx")
    print("xlsx file saved.")

    # You can also export the translated xlsx bytes
    text_bytes = workflow.export_to_xlsx()


if __name__ == "__main__":
    asyncio.run(main())
```
### Example 6: Config Options for Other Workflows (Using `HtmlWorkflow`, `EpubWorkflow`)
This example uses asynchronous execution.
```python
# HtmlWorkflow
from docutranslate.translator.ai_translator.html_translator import HtmlTranslatorConfig
from docutranslate.workflow.html_workflow import HtmlWorkflowConfig, HtmlWorkflow


async def html():
    # 1. Build Translator Configuration
    translator_config = HtmlTranslatorConfig(
        base_url="https://api.openai.com/v1/",
        api_key="YOUR_OPENAI_API_KEY",
        model_id="gpt-4o",
        to_lang="Chinese",
        insert_mode="replace",  # Options: "replace", "append", "prepend"
        separator="\n",  # Separator used for "append", "prepend" modes
    )

    # 2. Build Main Workflow Configuration
    workflow_config = HtmlWorkflowConfig(
        translator_config=translator_config,
    )
    workflow_html = HtmlWorkflow(config=workflow_config)


# EpubWorkflow
from docutranslate.exporter.epub.epub2html_exporter import Epub2HTMLExporterConfig
from docutranslate.translator.ai_translator.epub_translator import EpubTranslatorConfig
from docutranslate.workflow.epub_workflow import EpubWorkflowConfig, EpubWorkflow


async def epub():
    # 1. Build Translator Configuration
    translator_config = EpubTranslatorConfig(
        base_url="https://api.openai.com/v1/",
        api_key="YOUR_OPENAI_API_KEY",
        model_id="gpt-4o",
        to_lang="Chinese",
        insert_mode="replace",  # Options: "replace", "append", "prepend"
        separator="\n",  # Separator used for "append", "prepend" modes
    )

    # 2. Build Main Workflow Configuration
    workflow_config = EpubWorkflowConfig(
        translator_config=translator_config,
        html_exporter_config=Epub2HTMLExporterConfig(cdn=True),
    )
    workflow_epub = EpubWorkflow(config=workflow_config)
```
Key config options:
- **insert_mode**: `"replace"`, `"append"`, or `"prepend"` (for docx/xlsx/html/epub)
- **json_paths**: JSONPath expressions for JSON translation (e.g., `["$.*", "$.name"]`)
- **separator**: Text separator for `"append"` / `"prepend"` modes
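
A hypothetical sketch of how the three `insert_mode` values could combine a source segment with its translation. It assumes `"append"` places the translation after the original and `"prepend"` places it before, joined by `separator`. That reading is an assumption, and `apply_insert_mode` is not a DocuTranslate function.

```python
def apply_insert_mode(
    original: str,
    translated: str,
    insert_mode: str = "replace",
    separator: str = "\n",
) -> str:
    """Combine a source segment and its translation (illustrative only)."""
    if insert_mode == "replace":
        # Keep only the translation.
        return translated
    if insert_mode == "append":
        # Assumed: original first, translation after it.
        return f"{original}{separator}{translated}"
    if insert_mode == "prepend":
        # Assumed: translation first, original after it.
        return f"{translated}{separator}{original}"
    raise ValueError(f"unknown insert_mode: {insert_mode!r}")
```

Under these assumptions, `"append"` and `"prepend"` both produce bilingual output, while `"replace"` produces a translation-only document.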
## Prerequisites and Detailed Configuration
@@ -562,31 +419,47 @@ converter_config = ConverterDoclingConfig(
)
```
### 2.3. Locally Deployed MinerU Service
For offline/intranet environments, deploy `minerU` locally with its API enabled, then set `mineru_deploy_base_url` to your minerU API endpoint.
**Client SDK:**
```python
from docutranslate.sdk import Client

client = Client(
    api_key="YOUR_LLM_API_KEY",
    model_id="llama3",
    to_lang="Chinese",
    convert_engine="mineru_deploy",
    mineru_deploy_base_url="http://127.0.0.1:8000",  # Your minerU API address
)

result = client.translate("document.pdf")
result.save(fmt="markdown")
```
## FAQ
**Q: Why is the output still in the original language?**
A: Check the logs for errors. It is usually due to the AI platform running out of credits or network issues (check if system proxy needs to be enabled).
**Q: Output is in original language?**
A: Check the logs for errors. This is usually caused by exhausted API credits or network issues.
**Q: Port 8010 is occupied, what should I do?**
A: Use the `-p` parameter to specify a new port, or set the `DOCUTRANSLATE_PORT` environment variable.
**Q: Port 8010 occupied?**
A: Use `docutranslate -i -p 8011` or set `DOCUTRANSLATE_PORT=8011`.
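
A sketch of the lookup order these two options imply. It assumes an explicit `-p` value takes precedence over `DOCUTRANSLATE_PORT`, which in turn overrides the default of 8010. That ordering is an assumption, and `resolve_port` is a hypothetical helper, not part of DocuTranslate.

```python
import os
from typing import Mapping, Optional

DEFAULT_PORT = 8010


def resolve_port(
    cli_port: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Pick a port: CLI value, then environment variable, then default."""
    if cli_port is not None:
        return cli_port
    value = env.get("DOCUTRANSLATE_PORT")
    if value:
        return int(value)
    return DEFAULT_PORT
```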
**Q: Are scanned PDFs supported?**
A: Yes. Please use the `mineru` parsing engine, which has powerful OCR capabilities.
**Q: Scanned PDFs supported?**
A: Yes. Use the `mineru` engine, which has OCR capabilities.
**Q: Why is the first PDF translation very slow?**
A: If you are using the `docling` engine, it needs to download models from Hugging Face on the first run. Please refer to the "Network Issues Solutions" section above to speed up this process.
**Q: First PDF translation slow?**
A: `docling` needs to download models on the first run. Use a Hugging Face mirror or pre-download the model package.
**Q: How can I use it in an Intranet (Offline) environment?**
A: Yes, provided the following conditions are met:
**Q: Use in intranet/offline?**
A: Yes. Use a local LLM (Ollama/LM Studio) and a locally deployed minerU or `docling`.
1. **Local LLM**: Use tools like [Ollama](https://ollama.com/) or [LM Studio](https://lmstudio.ai/) to deploy the language model locally, and enter the local model's `base_url` in `TranslatorConfig`.
2. **Local PDF Parsing Engine** (Only needed for PDF parsing): Use the `docling` engine and follow the "Offline Use" instructions above to pre-download the model package.
**Q: PDF cache mechanism?**
A: `MarkdownBasedWorkflow` caches parsing results in memory (last 10 parses). Configure via `DOCUTRANSLATE_CACHE_NUM`.
**Q: How does the PDF parsing cache mechanism work?**
A: `MarkdownBasedWorkflow` automatically caches the results of document parsing (file-to-Markdown conversion) to avoid repeated parsing consuming time and resources. The cache is stored in memory by default and records the last 10 parses. You can modify the cache size via the `DOCUTRANSLATE_CACHE_NUM` environment variable.
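
A minimal sketch of a "last N parses" in-memory cache, assuming least-recently-used eviction and a size read from `DOCUTRANSLATE_CACHE_NUM` (default 10). `ParseCache` is a hypothetical class for illustration. The eviction policy and cache key that `MarkdownBasedWorkflow` actually uses are not specified here.

```python
import os
from collections import OrderedDict
from typing import Optional


class ParseCache:
    """Keep the most recently used parse results, up to max_items."""

    def __init__(self, max_items: Optional[int] = None) -> None:
        if max_items is None:
            # Assumed: the environment variable holds a plain integer.
            max_items = int(os.environ.get("DOCUTRANSLATE_CACHE_NUM", "10"))
        self.max_items = max_items
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, markdown: str) -> None:
        self._items[key] = markdown
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the least recently used
```

A key could be, for example, a hash of the file bytes plus the parsing engine name, so that the same file parsed by a different engine is not served from the cache; this too is an assumption.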
**Q: How to enable proxy support for the software?**
A: The software does not use the system proxy by default. You can enable it by setting `system_proxy_enable=True` in `TranslatorConfig`.
**Q: Enable proxy?**
A: Set `system_proxy_enable=True` in `TranslatorConfig`.
## Star History