Doc2X:⾼精度、⾼性价⽐⽂档解析 API，助力Arxiv论文智能解读Agent构建

前言

在AI大模型时代，RAG（Retrieval-Augmented Generation）检索增强生成技术已经成为构建智能知识库和问答系统的核心架构。然而，在实际项目实施过程中，开发者们往往会遇到一个关键痛点：如何高质量地将各种格式的文档转换为结构化数据，以便后续的向量化和检索。

传统的文档解析方案存在诸多局限性：开源工具精度不足，商业化产品价格昂贵，复杂文档（特别是包含公式、图表的学术文档）解析效果差强人意。正是在这样的背景下，Doc2X应运而生，为开发者提供了一个高精度、高性价比的文档解析解决方案。

官方网站：https://doc2x.noedgeai.com/
Doc2X API接口文档：https://noedgeai.feishu.cn/wiki/Q8QIw3PT7i4QghkhPoecsmSCnG1

Doc2X产品概览

Doc2X是一款专为开发者设计的文档解析API服务，能够将PDF、图片等多种格式的文档精准转换为Markdown、LaTeX、HTML、Word等结构化格式。其核心优势可以概括为以下几点：

🎯 卓越的解析精度

相比传统开源方案和其他商业化工具，Doc2X在复杂文档解析方面表现突出：

复杂布局处理：对于包含多栏布局、图文混排的文档，能够准确识别和保持结构
表格跨页合并：智能识别并合并跨越页面边界的表格，确保数据完整性
图片内容提取：不仅提取图片，还能识别图片中的文字内容和对应的caption

🧮 领先的公式识别能力

这是Doc2X的核心竞争优势之一：

多格式公式支持：无论是印刷体还是部分手写体公式，都能实现高精度识别
LaTeX标准输出：转换结果符合LaTeX标准，支持MathJax渲染
Word兼容性：转换的公式在Word中能够正确显示，避免乱码问题

💰 极致性价比

相比同类产品，Doc2X提供了更具竞争力的价格方案，让中小企业和个人开发者也能享受到高质量的文档解析服务。其中0.02元一页，

在官方体验平台最近也在搞新用户活动，大家可以体验一下效果，每日签到可以送解析页码额度

在使用Doc2X之前，我们先回顾下RAG系统构建中的关键步骤是什么？

多种使用方式

支持API调用

Doc2x API v2 PDF 接口文档:https://noedgeai.feishu.cn/wiki/Q8QIw3PT7i4QghkhPoecsmSCnG1

这个文档也提供了

官方SDK工具封装的pdfdeal

源码地址：https://github.com/NoEdgeAI/pdfdeal-docs
文档地址：https://noedgeai.github.io/pdfdeal-docs/zh/guide/

文档对新手非常友好，里面也有些教程，大家可以操作试试。

桌面端应用：支持多种平台安装和使用

RAG系统构建中的核心价值

数据预处理阶段的关键作用

在RAG系统的构建流程中，Doc2X主要发挥以下作用：

文档标准化：将各种格式的文档统一转换为机器友好的格式
信息完整性保障：确保公式、表格、图表等关键信息不丢失
结构化数据输出：为后续的文本分块和向量化提供高质量的数据源

提升RAG系统整体效果

高质量的文档解析直接影响RAG系统的最终表现：

检索准确性提升：

准确的文本内容确保关键信息能被正确索引
保留的文档结构有助于上下文理解
完整的公式和表格信息提升专业领域查询的召回率

生成质量改善：

结构化的输入数据让大模型能够更好地理解文档内容
准确的公式表示避免了生成过程中的理解偏差
丰富的上下文信息提升了答案的准确性和完整性

学术论文PDF解析效果

最近读论文比较多，刚好见到这个不凑的工具，相比开源工具，容易调用以及构建应用，笔者充值了10元，500页额度，来测试下论文解读的效果

笔者通过Doc2X对Arxiv解析之后的论文markdown内容输入到大模型服务中，然后输出整篇论文解读内容。下面我们尽量做到自动化：

根据查询词实现Arxiv论文列表检索
指定某个论文然后下载PDF文件
然后将PDF文件传入到Doc2X API服务进行解析
根据解析结果调用大模型进行论文解读八股文

下面我们看看怎么实现？

Arxiv论文检索

首先安装arxiv 包

pip install arxiv

pypi文档地址：https://pypi.org/project/arxiv/
下面我们实现Arxiv论文搜索以及PDF论文下载

import arxiv
import os
from typing import List, Optional, Generator
from pathlib import Pathclass ArxivSearcher:"""Arxiv论文搜索和下载工具类"""def __init__(self):"""初始化Arxiv客户端"""self.client = arxiv.Client()def search_papers(self, query: str, max_results: int = 10, sort_by: arxiv.SortCriterion = arxiv.SortCriterion.Relevance) -> List[arxiv.Result]:"""搜索论文Args:query: 搜索查询词max_results: 最大结果数量sort_by: 排序方式Returns:论文结果列表"""search = arxiv.Search(query=query,max_results=max_results,sort_by=sort_by)results = list(self.client.results(search))return resultsdef search_by_id(self, paper_ids: List[str]) -> List[arxiv.Result]:"""根据论文ID搜索Args:paper_ids: 论文ID列表Returns:论文结果列表"""search = arxiv.Search(id_list=paper_ids)results = list(self.client.results(search))return resultsdef download_paper(self, paper_id: str, download_dir: str = "./downloads", filename: Optional[str] = None) -> str:"""下载指定论文的PDFArgs:paper_id: 论文IDdownload_dir: 下载目录filename: 自定义文件名Returns:下载文件的完整路径"""# 确保下载目录存在Path(download_dir).mkdir(parents=True, exist_ok=True)# 搜索论文papers = self.search_by_id([paper_id])if not papers:raise ValueError(f"未找到ID为 {paper_id} 的论文")paper = papers[0]# 下载PDFif filename:filepath = paper.download_pdf(dirpath=download_dir, filename=filename)else:filepath = paper.download_pdf(dirpath=download_dir)return filepathdef print_paper_info(self, papers: List[arxiv.Result]) -> None:"""打印论文信息Args:papers: 论文结果列表"""for i, paper in enumerate(papers, 1):print(f"\n{i}. 标题: {paper.title}")print(f"   作者: {', '.join([author.name for author in paper.authors])}")print(f"   发布日期: {paper.published.strftime('%Y-%m-%d')}")print(f"   摘要: {paper.summary[:200]}...")print(f"   PDF链接: {paper.pdf_url}")print(f"   论文ID: {paper.entry_id.split('/')[-1]}")def search_and_display(self, query: str, max_results: int = 10) -> List[arxiv.Result]:"""搜索并显示论文信息Args:query: 搜索查询词max_results: 最大结果数量Returns:论文结果列表"""print(f"正在搜索: {query}")print(f"最大结果数: {max_results}")print("-" * 80)papers = self.search_papers(query, max_results)self.print_paper_info(papers)return papers# 使用示例
if __name__ == "__main__":# 创建搜索器实例searcher = ArxivSearcher()# 示例1: 搜索"Retrieval Augmented Generation"相关论文print("=" * 80)print("示例1: 搜索 'Retrieval Augmented Generation' 相关论文")print("=" * 80)rag_papers = searcher.search_and_display(query="Retrieval Augmented Generation RAG",max_results=5)if rag_papers:print("\n" + "=" * 80)print("示例2: 下载第一篇论文")print("=" * 80)first_paper = rag_papers[0]paper_id = first_paper.entry_id.split('/')[-1]try:downloaded_path = searcher.download_paper(paper_id=paper_id,download_dir="./downloads",filename=f"rag_paper_{paper_id}.pdf")print(f"论文已下载到: {downloaded_path}")except Exception as e:print(f"下载失败: {e}")

Doc2X论文解析

from pdfdeal import Doc2X
from pathlib import Path
from typing import Union, List, Tuple, Optional
import os
import zipfile
import shutilclass PDFParser:"""PDF解析器类，用于将PDF文件转换为Markdown内容"""def __init__(self, api_key: str, debug: bool = True, thread: int = 5, full_speed: bool = True):"""初始化PDF解析器Args:api_key: Doc2X API密钥debug: 是否开启调试模式thread: 线程数full_speed: 是否开启全速模式"""self.client = Doc2X(apikey=api_key,debug=debug,thread=thread,full_speed=full_speed)def _extract_zip_file(self, zip_path: str, extract_to: str = None) -> str:"""解压ZIP文件Args:zip_path: ZIP文件路径extract_to: 解压目标目录，如果为None则解压到ZIP文件同目录Returns:解压后的目录路径"""if not os.path.exists(zip_path):raise FileNotFoundError(f"ZIP文件不存在: {zip_path}")# 如果没有指定解压目录，则使用ZIP文件同目录if extract_to is None:extract_to = os.path.dirname(zip_path)# 创建解压目录Path(extract_to).mkdir(parents=True, exist_ok=True)# 解压文件with zipfile.ZipFile(zip_path, 'r') as zip_ref:zip_ref.extractall(extract_to)print(f"ZIP文件已解压到: {extract_to}")return extract_todef parse_pdf_to_markdown_with_auto_extract(self, pdf_path: str,output_path: str = "./Output",output_format: str = "md",ocr: bool = True,convert: bool = False,auto_extract: bool = True,keep_zip: bool = False) -> Tuple[Union[str, List[str]], List[dict], bool, str]:"""将PDF文件解析为Markdown内容并自动解压（如果生成了ZIP文件）Args:pdf_path: PDF文件路径output_path: 输出目录路径output_format: 输出格式，支持 'md', 'md_dollar', 'text', 'texts', 'detailed'ocr: 是否使用OCRconvert: 是否将 [ 和 [[ 转换为 $ 和 $$auto_extract: 是否自动解压ZIP文件keep_zip: 是否保留原ZIP文件Returns:成功转换的内容或文件路径、失败信息、是否有错误、解压目录路径的元组"""# 检查PDF文件是否存在if not os.path.exists(pdf_path):raise FileNotFoundError(f"PDF文件不存在: {pdf_path}")# 确保输出目录存在Path(output_path).mkdir(parents=True, exist_ok=True)# 调用Doc2X进行转换success, failed, flag = self.client.pdf2file(pdf_file=pdf_path,output_path=output_path,output_format=output_format,ocr=ocr,convert=convert,)extract_dir = None# 如果转换成功且需要自动解压if not flag and auto_extract:# 检查是否生成了ZIP文件if isinstance(success, str) and success.endswith('.zip'):try:# 解压ZIP文件extract_dir = self._extract_zip_file(success)# 如果不保留ZIP文件，则删除它if not keep_zip:os.remove(success)print(f"已删除ZIP文件: {success}")print(f"解压完成，文件位于: {extract_dir}")except Exception as e:print(f"解压ZIP文件时出错: {e}")elif isinstance(success, list):# 处理多个文件的情况for file_path in success:if isinstance(file_path, str) and file_path.endswith('.zip'):try:extract_dir = self._extract_zip_file(file_path)if not keep_zip:os.remove(file_path)print(f"已删除ZIP文件: {file_path}")except Exception as e:print(f"解压ZIP文件 {file_path} 时出错: {e}")return success, failed, flag, extract_dirdef parse_existing_zip(self, zip_path: str, extract_to: str = None, keep_zip: bool = False) -> str:"""解析已存在的ZIP文件Args:zip_path: ZIP文件路径extract_to: 解压目标目录keep_zip: 是否保留原ZIP文件Returns:解压后的目录路径"""extract_dir = self._extract_zip_file(zip_path, extract_to)if not keep_zip:os.remove(zip_path)print(f"已删除ZIP文件: {zip_path}")return extract_dirdef parse_pdf_to_markdown(self, pdf_path: str,output_path: str = "./Output",output_format: str = "md",ocr: bool = True,convert: bool = False,) -> Tuple[Union[str, List[str]], List[dict], bool]:"""将PDF文件解析为Markdown内容Args:pdf_path: PDF文件路径output_path: 输出目录路径output_format: 输出格式，支持 'md', 'md_dollar', 'text', 'texts', 'detailed'ocr: 是否使用OCRconvert: 是否将 [ 和 [[ 转换为 $ 和 $$Returns:成功转换的内容或文件路径、失败信息、是否有错误的元组"""# 检查PDF文件是否存在if not os.path.exists(pdf_path):raise FileNotFoundError(f"PDF文件不存在: {pdf_path}")# 确保输出目录存在Path(output_path).mkdir(parents=True, exist_ok=True)# 调用Doc2X进行转换success, failed, flag = self.client.pdf2file(pdf_file=pdf_path,output_path=output_path,output_format=output_format,ocr=ocr,convert=convert,)return success, failed, flagdef parse_pdf_to_text(self, pdf_path: str) -> str:"""将PDF文件解析为纯文本字符串Args:pdf_path: PDF文件路径Returns:解析后的文本内容"""success, failed, flag = self.parse_pdf_to_markdown(pdf_path=pdf_path,output_format="text")if flag:  # 有错误raise Exception(f"PDF解析失败: {failed}")return successdef parse_pdf_to_pages(self, pdf_path: str) -> List[str]:"""将PDF文件按页解析为文本列表Args:pdf_path: PDF文件路径Returns:按页分割的文本列表"""success, failed, flag = self.parse_pdf_to_markdown(pdf_path=pdf_path,output_format="texts")if flag:  # 有错误raise Exception(f"PDF解析失败: {failed}")return successdef parse_pdf_to_markdown_file(self, pdf_path: str,output_path: str = "./Output",custom_filename: Optional[str] = None) -> str:"""将PDF文件转换为Markdown文件并保存Args:pdf_path: PDF文件路径output_path: 输出目录路径custom_filename: 自定义输出文件名Returns:生成的Markdown文件路径"""output_names = Noneif custom_filename:output_names = [custom_filename]success, failed, flag = self.client.pdf2file(pdf_file=pdf_path,output_names=output_names,output_path=output_path,output_format="md",ocr=True)if flag:  # 有错误raise Exception(f"PDF转换失败: {failed}")return success[0] if isinstance(success, list) else successdef batch_parse_pdfs(self, pdf_paths: List[str],output_path: str = "./Output",output_format: str = "md") -> Tuple[List[str], List[dict], bool]:"""批量解析多个PDF文件Args:pdf_paths: PDF文件路径列表output_path: 输出目录路径output_format: 输出格式Returns:成功转换的文件路径列表、失败信息列表、是否有错误"""# 检查所有PDF文件是否存在for pdf_path in pdf_paths:if not os.path.exists(pdf_path):raise FileNotFoundError(f"PDF文件不存在: {pdf_path}")# 确保输出目录存在Path(output_path).mkdir(parents=True, exist_ok=True)# 批量转换success, failed, flag = self.client.pdf2file(pdf_file=pdf_paths,output_path=output_path,output_format=output_format,ocr=True)return success, failed, flagdef get_markdown_content(self, pdf_path: str) -> str:"""直接获取PDF的Markdown内容（不保存文件）Args:pdf_path: PDF文件路径Returns:Markdown格式的文本内容"""success, failed, flag = self.parse_pdf_to_markdown(pdf_path=pdf_path,output_format="text",convert=True  # 转换数学公式格式)if flag:  # 有错误raise Exception(f"PDF解析失败: {failed}")return success# 使用示例
if __name__ == "__main__":# 初始化解析器（需要替换为您的API密钥）parser = PDFParser(api_key="sk-8vnrrnhtttc6xtk1qout8cqti65g3ocz")# 示例2: 解析PDF并自动解压pdf_path = "downloads/recent_rag_paper_2505.22571v3.pdf"if os.path.exists(pdf_path):try:print("\n正在解析PDF并自动解压...")success, failed, flag, extract_dir = parser.parse_pdf_to_markdown_with_auto_extract(pdf_path=pdf_path,output_path="./auto_extract_output",output_format="md",auto_extract=True,keep_zip=False  # 不保留ZIP文件)if not flag:print(f"PDF解析成功！")if extract_dir:print(f"内容已自动解压到: {extract_dir}")else:print(f"生成的文件: {success}")else:print(f"PDF解析失败: {failed}")except Exception as e:print(f"解析PDF时出错: {e}")

能够正确解析论文中的图片

论文表格解析完全正确

论文解读八股文

我们基于调用大模型服务，传入论文markdown内容，然后生成以下各个部分内容

研究动机：分析论文研究的核心问题和背景
研究现状：总结该领域的研究现状和前人工作
创新点：分析论文的创新思路来源
解决方案：详细分析论文提出的解决方案
实验设计：分析实验设计和验证方法
研究结论：总结论文的主要发现和结论
未来方向：分析论文提出的未来研究方向
伪代码：基于论文内容生成核心算法的伪代码

下面笔者构建了一个Streamlit应用，我们使用看看怎么使用

首先我们搜索一些关于RAG的论文

然后选择某篇我们感兴趣的论文进行下载

然后通过Doc2X进行解析

调用DeepSeek实现论文解读

下面是解析结果，我们可以看下：

在这里插入图片描述
最后是伪代码生成：

import numpy as np
from typing import List, Dict, Tuple
from transformers import AutoModelForCausalLM, AutoTokenizerclass RAGInstruct:def __init__(self, corpus: List[str], retriever_model: str = "contriever-msmarco",llm_model: str = "gpt-4"):"""初始化RAG-Instruct生成器参数:corpus: 外部知识语料库retriever_model: 检索模型名称llm_model: 用于生成指令的大语言模型"""self.corpus = corpusself.retriever = self._load_retriever(retriever_model)self.llm = self._load_llm(llm_model)self.instruction_datasets = self._load_exemplar_datasets()def generate_rag_instructions(self, num_instructions: int = 40000,max_docs_per_instruction: int = 5) -> List[Dict]:"""生成RAG指令数据集参数:num_instructions: 要生成的指令数量max_docs_per_instruction: 每个指令关联的最大文档数返回:生成的RAG指令数据集"""dataset = []for _ in range(num_instructions):# 1. 随机选择一个RAG范式rag_paradigm = self._sample_rag_paradigm()# 2. 随机选择一个模拟指令作为模板exemplar_instruction = self._sample_exemplar_instruction()# 3. 基于模拟指令检索相关文档relevant_docs = self._retrieve_docs(exemplar_instruction, top_k=max_docs_per_instruction)# 4. 根据RAG范式筛选文档selected_docs = self._select_docs_by_paradigm(relevant_docs, rag_paradigm)# 5. 随机采样不相关文档作为噪声unrelated_docs = self._sample_unrelated_docs(selected_docs)# 6. 使用LLM生成RAG指令和回答instruction, answer = self._generate_with_llm(selected_docs, unrelated_docs, exemplar_instruction, rag_paradigm)# 7. 添加到数据集dataset.append({"instruction": instruction,"answer": answer,"relevant_docs": selected_docs,"unrelated_docs": unrelated_docs,"paradigm": rag_paradigm})return datasetdef _sample_rag_paradigm(self) -> str:"""从5种RAG范式中随机采样一种"""paradigms = ["r0", "r1", "r2", "r3", "r4"]  # 对应论文中的5种范式weights = [0.1, 0.2, 0.2, 0.3, 0.2]  # 每种范式的采样权重return np.random.choice(paradigms, p=weights)def _generate_with_llm(self, relevant_docs: List[str], unrelated_docs: List[str],exemplar_instruction: str,paradigm: str) -> Tuple[str, str]:"""使用LLM生成RAG指令和回答参数:relevant_docs: 相关文档列表unrelated_docs: 不相关文档列表exemplar_instruction: 模拟指令模板paradigm: RAG范式返回:(生成的指令, 生成的回答)"""# 构建LLM提示(简化版，实际实现更复杂)prompt = f"""<Documents>{self._format_docs(relevant_docs)}</Documents>Your task is to generate a question q* and response a* based on:- RAG Paradigm: {self._get_paradigm_description(paradigm)}- Simulated Instruction: {exemplar_instruction}"""# 调用LLM生成response = self.llm.generate(prompt)return self._parse_llm_response(response)