PyVision：基于动态工具的具身智能体

论文地址：

[2507.07998v1] PyVision: Agentic Vision with Dynamic Tooling

1. 背景

现有的智能体一般都是通过大模型规划调用已经预定义好的一些工具（具体来说也就是一些函数）来解决问题。这样就会导致在针对特征的任务上Agent去解决问题缺乏灵活性。所以这篇文章就提出了pyVision来在解决特定问题的时候，针对任务具体的生成一些工具（函数或者也这说是代码）来提高智能体解决问题的能力。

2.框架架构

从示意图中可以看到PyVision 使一个多语言大语言模型(MLLM) 能够在推理过程中动态生成并执行Python 代码。在每个会话中，MLLM 接收一个输入（Input），生成相应的Python 代码，并在一个隔离的Python 运行环境中执行它。生成的输出——文本、视觉或两者皆有——会反馈回MLLM 的上下文，使其能够在多轮中迭代和完善其推理，直到产生最终答案。

其中：

code_block_i 指的是MLLM 在第i 轮生成的Python 代码。
mm_clue_i 指的是Python 解释器执行后的多模态输出。

3.具体推理案例

在文章中提到了针对几个不同领域特定的任务，来使用pyVsion来解决视觉推理的例子。

3.1 视觉搜索

3.2 医学图像分析

3.3 符号视觉谜题

3.4视觉草图绘制

3.5 视觉差异比较

3.6 视频理解

4 论文结论

从图中可以看到选择的这几种的任务数据集上，其中：

MathVista和MathVision-mini其中主要是多模态数学。用于测试LLM在具有需要视觉感知和数值推理的数学问题的表现。

MMMU：其中主要是测试跨学科的特定领域推理。

VisualPUzzles和VLMAreBlind-mini:里面主要是符号视觉谜题组成，用于测试LLM探索对抽象、结构化视觉原语
进行解析和推理的极限。

V∗主要用于测试LLM精确识别微妙的视觉细节。

从图中可以看到，在GPT-4.1的使用了Pyvision之后的PyVision-GPT-4.1在 MathVista的测试上提升了1.8%（也就是从69.9%-71.7%）同样的也可以看到在其他任务上也得到了一些提升。但是相比于o1 o3这些模型上面，其实还是差了不少。也同样说明这个框架中所用的后端大模型对于整体在解决这些问题上也是很重要的。

5. 代码复现

项目源码：https://github.com/agents-x-project/PyVision

DEMO：https://huggingface.co/spaces/Agents-X/PyVision

源码解析：

1. 配置LLM的API配置

项目里面提供了三种LLM的配置，分别是openai， auzre和vllm。其中配置文件是放在：

./api_config_files/api_config_*

2. 提示词模版

#英文原版
{"retool": "Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process. The Python code will be executed by an external sandbox, and the output (wrapped in `<interpreter>output_str</interpreter>`) can be returned to aid your reasoning and help you arrive at the final answer. The Python code should be complete scripts, including necessary imports. \nEach code snippet is wrapped with `<code>\n```python\ncode snippet\n```\n</code>`.\nThe last part of your response should be in the following format:\n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>\n\n*user question:*\nAnswer the following Math Problem and put the answer in the format of \\boxed{{answer}}\n\n{query}\n\n\nRemember to place the final answer in the last part using the format: \n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>","vistool": "Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process. The Python code will be executed by an external sandbox.\n\nFor all the provided images, in order, the i-th image has already been read into the global variable `image_clue_i` using the PIL.Image.open() function. When writing Python code, you can directly use these variables without needing to read them again.\n\nSince you are dealing with the VQA task, you MUST use the python tool (e.g., matplotlib library) to analyze or transform images whenever it could improve your understanding or aid your reasoning. This includes but is not limited to zooming in, rotating, adjusting contrast, computing statistics, or isolating features. \n\nNote that when you use matplotlib to visualize data or further process images, you need to use plt.show() to display these images; there is no need to save them. Do not use image processing libraries like cv2 or PIL. If you want to check the value of a variable, you MUST use print() to check it.\n\nThe output (wrapped in `<interpreter>output_str</interpreter>`) can be returned to aid your reasoning and help you arrive at the final answer. The Python code should be complete scripts, including necessary imports. \nEach code snippet is wrapped with `<code>\n```python\ncode snippet\n```\n</code>`.\nThe last part of your response should be in the following format:\n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>\n\n*user question:*\nAnswer the following Problem with an image provided and put the answer in the format of \\boxed{{answer}}\n\n{query}\n\nRemember to place the final answer in the last part using the format: \n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>","vistool_with_img_info": "Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process. The Python code will be executed by an external sandbox.\n\nFor all the provided images, in order, the i-th image has already been read into the global variable `image_clue_i` using the PIL.Image.open() function. When writing Python code, you can directly use these variables without needing to read them again.\n\nSince you are dealing with the VQA task, you MUST use the python tool (e.g., matplotlib library) to analyze or transform images whenever it could improve your understanding or aid your reasoning. This includes but is not limited to zooming in, rotating, adjusting contrast, computing statistics, or isolating features. \n\nNote that when you use matplotlib to visualize data or further process images, you need to use plt.show() to display these images; there is no need to save them. Do not use image processing libraries like cv2 or PIL. If you want to check the value of a variable, you MUST use print() to check it.\n\nThe output (wrapped in `<interpreter>output_str</interpreter>`) can be returned to aid your reasoning and help you arrive at the final answer. The Python code should be complete scripts, including necessary imports. \nEach code snippet is wrapped with `<code>\n```python\ncode snippet\n```\n</code>`.\nThe last part of your response should be in the following format:\n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>\n\n*image resolution:*\n\nImage Width: {width}; Image Height: {height}\n\n*user question:*\nAnswer the following Problem with an image provided and put the answer in the format of \\boxed{{answer}}\n\n{query}\n\nRemember to place the final answer in the last part using the format: \n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>","vistool_with_img_info_multi_image": "Solve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process. The Python code will be executed by an external sandbox.\n\nFor all the provided images, in order, the i-th image has already been read into the global variable `image_clue_i` using the PIL.Image.open() function. When writing Python code, you can directly use these variables without needing to read them again.\n\nSince you are dealing with the VQA task, you MUST use the python tool (e.g., matplotlib library) to analyze or transform images whenever it could improve your understanding or aid your reasoning. This includes but is not limited to zooming in, rotating, adjusting contrast, computing statistics, or isolating features. \n\nNote that when you use matplotlib to visualize data or further process images, you need to use plt.show() to display these images; there is no need to save them. Do not use image processing libraries like cv2 or PIL. If you want to check the value of a variable, you MUST use print() to check it.\n\nThe output (wrapped in `<interpreter>output_str</interpreter>`) can be returned to aid your reasoning and help you arrive at the final answer. The Python code should be complete scripts, including necessary imports. \nEach code snippet is wrapped with `<code>\n```python\ncode snippet\n```\n</code>`.\nThe last part of your response should be in the following format:\n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>\n\n*image resolution:*\n\n{image_information}\n\n*user question:*\nAnswer the following Problem with an image provided and put the answer in the format of \\boxed{{answer}}\n\n{query}\n\nRemember to place the final answer in the last part using the format: \n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>","vistool_with_img_info_v2": "You are an agent - please keep going until the user’s query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved. \n\nSolve the following problem step by step. You now have the ability to selectively write executable Python code to enhance your reasoning process. The Python code will be executed by an external sandbox. \n\nYou MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.\n\nFor all the provided images, in order, the i-th image has already been read into the global variable `image_clue_i` using the PIL.Image.open() function. When writing Python code, you can directly use these variables without needing to read them again.\n\nSince you are dealing with the vision-related question answering task, you MUST use the python tool (e.g., matplotlib library) to analyze or transform images whenever it could improve your understanding or aid your reasoning. This includes but is not limited to zooming in, rotating, adjusting contrast, computing statistics, or isolating features. \n\nNote that when you use matplotlib to visualize data or further process images, you need to use plt.show() to display these images; there is no need to save them. Do not use image processing libraries like cv2 or PIL. If you want to check the value of a variable, you MUST use print() to check it.\n\nThe output (wrapped in `<interpreter>output_str</interpreter>`) can be returned to aid your reasoning and help you arrive at the final answer. The Python code should be complete scripts, including necessary imports. \nEach code snippet is wrapped with `<code>\n```python\ncode snippet\n```\n</code>`.\nThe last part of your response should be in the following format:\n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>\n\n*image resolution:*\n\nImage Width: {width}; Image Height: {height}\n\n*user question:*\nAnswer the following Problem with an image provided and put the answer in the format of \\boxed{{answer}}\n\n{query}\n\nRemember to place the final answer in the last part using the format: \n<answer>\n\\boxed{{'The final answer goes here.'}}\n</answer>","no_tool": "You are a helpful assistant. And you are dealing with the VQA tasks. Solve the visual questions step by step and give the correct answer. Note: put your answer in the format of \"\\boxed{{the right answer here}}\"\n *user question*:\n{query}","no_tool_no_cot": "Question:\n{query}\nGive the correct answer directly, in the format of \"Final Answer:\\boxed{{the final answer here}}\"\n"
}#中文版
{"retool": "逐步解决以下问题。您现在有能力选择性地编写可执行的Python代码来增强您的推理过程。Python代码将由外部沙盒执行，输出（包装在`<interpreter>output_str</interpreter>`中）可以返回以帮助您的推理并帮助您得出最终答案。Python代码应该是完整的脚本，包括必要的导入。\n每个代码片段都用`<code>\n```python\n代码片段\n```\n</code>`包装。\n您回答的最后部分应该采用以下格式：\n<answer>\n\\boxed{{'最终答案放在这里。'}}\n</answer>\n\n*用户问题：*\n回答以下数学问题并将答案放在\\boxed{{answer}}格式中\n\n{query}\n\n\n记住在最后部分使用以下格式放置最终答案：\n<answer>\n\\boxed{{'最终答案放在这里。'}}\n</answer>","vistool": "逐步解决以下问题。您现在有能力选择性地编写可执行的Python代码来增强您的推理过程。Python代码将由外部沙盒执行。\n\n对于所有提供的图像，按顺序，第i个图像已经使用PIL.Image.open()函数读入全局变量`image_clue_i`中。在编写Python代码时，您可以直接使用这些变量，而无需再次读取它们。\n\n由于您正在处理VQA任务，每当可以改善您的理解或帮助您的推理时，您必须使用python工具（例如，matplotlib库）来分析或转换图像。这包括但不限于放大、旋转、调整对比度、计算统计信息或隔离特征。\n\n请注意，当您使用matplotlib可视化数据或进一步处理图像时，您需要使用plt.show()来显示这些图像；无需保存它们。不要使用cv2或PIL等图像处理库。如果您想检查变量的值，您必须使用print()来检查它。\n\n输出（包装在`<interpreter>output_str</interpreter>`中）可以返回以帮助您的推理并帮助您得出最终答案。Python代码应该是完整的脚本，包括必要的导入。\n每个代码片段都用`<code>\n```python\n代码片段\n```\n</code>`包装。\n您回答的最后部分应该采用以下格式：\n<answer>\n\\boxed{{'最终答案放在这里。'}}\n</answer>\n\n*用户问题：*\n回答以下提供图像的问题并将答案放在\\boxed{{answer}}格式中\n\n{query}\n\n记住在最后部分使用以下格式放置最终答案：\n<answer>\n\\boxed{{'最终答案放在这里。'}}\n</answer>","vistool_with_img_info": "逐步解决以下问题。您现在有能力选择性地编写可执行的Python代码来增强您的推理过程。Python代码将由外部沙盒执行。\n\n对于所有提供的图像，按顺序，第i个图像已经使用PIL.Image.open()函数读入全局变量`image_clue_i`中。在编写Python代码时，您可以直接使用这些变量，而无需再次读取它们。\n\n由于您正在处理VQA任务，每当可以改善您的理解或帮助您的推理时，您必须使用python工具（例如，matplotlib库）来分析或转换图像。这包括但不限于放大、旋转、调整对比度、计算统计信息或隔离特征。\n\n请注意，当您使用matplotlib可视化数据或进一步处理图像时，您需要使用plt.show()来显示这些图像；无需保存它们。不要使用cv2或PIL等图像处理库。如果您想检查变量的值，您必须使用print()来检查它。\n\n输出（包装在`<interpreter>output_str</interpreter>`中）可以返回以帮助您的推理并帮助您得出最终答案。Python代码应该是完整的脚本，包括必要的导入。\n每个代码片段都用`<code>\n```python\n代码片段\n```\n</code>`包装。\n您回答的最后部分应该采用以下格式：\n<answer>\n\\boxed{{'最终答案放在这里。'}}\n</answer>\n\n*图像分辨率：*\n\n图像宽度：{width}；图像高度：{height}\n\n*用户问题：*\n回答以下提供图像的问题并将答案放在\\boxed{{answer}}格式中\n\n{query}\n\n记住在最后部分使用以下格式放置最终答案：\n<answer>\n\\boxed{{'最终答案放在这里。'}}\n</answer>","vistool_with_img_info_multi_image": "逐步解决以下问题。您现在有能力选择性地编写可执行的Python代码来增强您的推理过程。Python代码将由外部沙盒执行。\n\n对于所有提供的图像，按顺序，第i个图像已经使用PIL.Image.open()函数读入全局变量`image_clue_i`中。在编写Python代码时，您可以直接使用这些变量，而无需再次读取它们。\n\n由于您正在处理VQA任务，每当可以改善您的理解或帮助您的推理时，您必须使用python工具（例如，matplotlib库）来分析或转换图像。这包括但不限于放大、旋转、调整对比度、计算统计信息或隔离特征。\n\n请注意，当您使用matplotlib可视化数据或进一步处理图像时，您需要使用plt.show()来显示这些图像；无需保存它们。不要使用cv2或PIL等图像处理库。如果您想检查变量的值，您必须使用print()来检查它。\n\n输出（包装在`<interpreter>output_str</interpreter>`中）可以返回以帮助您的推理并帮助您得出最终答案。Python代码应该是完整的脚本，包括必要的导入。\n每个代码片段都用`<code>\n```python\n代码片段\n```\n</code>`包装。\n您回答的最后部分应该采用以下格式：\n<answer>\n\\boxed{{'最终答案放在这里。'}}\n</answer>\n\n*图像分辨率：*\n\n{image_information}\n\n*用户问题：*\n回答以下提供图像的问题并将答案放在\\boxed{{answer}}格式中\n\n{query}\n\n记住在最后部分使用以下格式放置最终答案：\n<answer>\n\\boxed{{'最终答案放在这里。'}}\n</answer>","vistool_with_img_info_v2": "您是一个代理 - 请继续直到用户的查询完全解决，在结束您的回合并让回给用户之前。只有在您确定问题已解决时才终止您的回合。\n\n逐步解决以下问题。您现在有能力选择性地编写可执行的Python代码来增强您的推理过程。Python代码将由外部沙盒执行。\n\n您必须在每次函数调用之前进行广泛规划，并对之前函数调用的结果进行广泛反思。不要仅通过进行函数调用来完成整个过程，因为这可能会损害您解决问题和深入思考的能力。\n\n对于所有提供的图像，按顺序，第i个图像已经使用PIL.Image.open()函数读入全局变量`image_clue_i`中。在编写Python代码时，您可以直接使用这些变量，而无需再次读取它们。\n\n由于您正在处理与视觉相关的问题回答任务，每当可以改善您的理解或帮助您的推理时，您必须使用python工具（例如，matplotlib库）来分析或转换图像。这包括但不限于放大、旋转、调整对比度、计算统计信息或隔离特征。\n\n请注意，当您使用matplotlib可视化数据或进一步处理图像时，您需要使用plt.show()来显示这些图像；无需保存它们。不要使用cv2或PIL等图像处理库。如果您想检查变量的值，您必须使用print()来检查它。\n\n输出（包装在`<interpreter>output_str</interpreter>`中）可以返回以帮助您的推理并帮助您得出最终答案。Python代码应该是完整的脚本，包括必要的导入。\n每个代码片段都用`<code>\n```python\n代码片段\n```\n</code>`包装。\n您回答的最后部分应该采用以下格式：\n<answer>\n\\boxed{{'最终答案放在这里。'}}\n</answer>\n\n*图像分辨率：*\n\n图像宽度：{width}；图像高度：{height}\n\n*用户问题：*\n回答以下提供图像的问题并将答案放在\\boxed{{answer}}格式中\n\n{query}\n\n记住在最后部分使用以下格式放置最终答案：\n<answer>\n\\boxed{{'最终答案放在这里。'}}\n</answer>","no_tool": "您是一个有用的助手。您正在处理VQA任务。逐步解决视觉问题并给出正确答案。注意：将您的答案放在\"\\boxed{{正确答案在这里}}\"格式中\n*用户问题*：\n{query}","no_tool_no_cot": "问题：\n{query}\n直接给出正确答案，格式为\"最终答案：\\boxed{{最终答案在这里}}\"\n"
}

3. 启动main.py

from openai import OpenAI
from inference_engine.vis_inference_demo_gpt import evaluate_single_data, evaluate_single_with_cleanup
from inference_engine.safe_persis_shared_vis_python_exe import PythonExecutor......
# Run inference with safe execution
print(f"Processing image: {args.image_path}")
print(f"Question: {args.question}")
print("Running inference with safe execution...")# messages, final_response = evaluate_single_with_cleanup(eval_args, data, client)
executor = PythonExecutor()
messages, final_response = evaluate_single_data(eval_args, data, client, executor)# Save results
os.makedirs(args.output_dir, exist_ok=True)if args.save_messages:messages_path = os.path.join(args.output_dir, "test_messages.json")with open(messages_path, "w", encoding="utf-8") as f:json.dump(messages, f, indent=4, ensure_ascii=False)print(f"Messages saved to: {messages_path}")

这里整体的逻辑也很简单，就是配置好参数之后使用evaluate_single_data函数进行执行得到模型推理结果。

4. inference_engine

这个模块是项目中最核心的代码，负责主要负责处理视觉问答（VQA）任务中的代码执行和推理过程。

evaluate_single_data

其中evaluate_single_data是整个系统的核心代码，实现了基于动态工具的具身视觉问答的完整流程。

# 参数提取和验证
prompt_template = args.prompt_template
prompt = args.prompt
exe_code = args.exe_code
max_tokens = args.max_tokens
temperature = args.temperature
api_name = args.api_name#提示模板选择逻辑
if "no_tool" in prompt:# 不使用工具的纯文本推理if len(image_path_list) == 1:messages = process_prompt_init(...)elif len(image_path_list) >= 2:messages = process_prompt_init_multi_images(...)
else:# 使用工具的推理if len(image_path_list) == 1:prompt = "vistool_with_img_info_v2"  # 单图像增强版messages = process_prompt_init(...)elif len(image_path_list) >= 2:prompt = "vistool_with_img_info_multi_image"  # 多图像messages = process_prompt_init_multi_images(...)
#迭代推理循环
while True:if exe_code and pred_stop_reason == "</code>":# 需要执行代码的情况# 1. 提取代码code_to_execute = response_text.split("```python")[-1].split("```")[0].strip()# 2. 执行代码exe_result = execute_codes([code_to_execute], messages, executor)[0][0]# 3. 处理执行结果if report == "Done":# 成功执行text_result = exe_result[0]['text']images_result = exe_result[0]['images']else:# 执行出错error_result = report# 4. 更新消息历史messages, new_image_clue_idx = update_messages_with_execute_content(...)# 5. 继续生成下一部分response_text, pred_stop_reason = call_chatgpt_api(...)else:# 不需要执行代码，完成推理final_response = response_textbreak

call_chatgpt_api 函数 - API调用封装

#多API支持
if client_type == "openai" or client_type == "azure":# OpenAI/Azure APIresponse = client.chat.completions.create(...)
elif client_type == "anthropic":# Claude APImessage = client.messages.create(...)
elif client_type == "vllm":# VLLM APIresponse = client.chat.completions.create(...)

#停止条件检测
# 检测停止序列
if stop and any(s in response_text for s in stop):for s in stop:if s in response_text:stop_reason = sbreak# 特殊处理代码块
if "<code>" in response_text:stop_reason = "</code>"

process_prompt_init 函数 - 提示构建

#图像编码处理
if "claude" in api_name:img_result = encode_image_with_resize(image_path)  # Claude需要调整尺寸
else:img_result = encode_image(image_path)  # 其他API直接编码

#消息结构构件
# 对于工具模式，添加image_clue标签
content.append({"type": "text", "text": "<image_clue_0>"})
content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}})
content.append({"type": "text", "text": "</image_clue_0>\n\n"})

execute_codes 函数 - 代码执行管理

def execute_codes(codes, messages, executor: PythonExecutor):no_code_idx = []codes_use = []# 过滤空代码for i, code in enumerate(codes):if code == "":no_code_idx.append(i)else:codes_use.append(code)# 批量执行代码batch_results = executor.batch_apply(codes_use, messages)return batch_results, no_code_idx

update_messages_with_execute_content 函数 - 执行结果整合

#执行成功的情况
if error_result is None:# 构建解释器消息interpreter_message_text_prefix = [{"type": "text", "text": f"<interpreter>\nText Result:\n{text_result}\nImage Result:\n"}]# 处理生成的图像if images_result is not None:for image_base64_item in images_result:interpreter_message_images = [{"type": "text", "text": f"<image_clue_{image_clue_idx}>"},{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64_item}"}},{"type": "text", "text": f"</image_clue_{image_clue_idx}>"}]image_content += interpreter_message_imagesimage_clue_idx += 1
#执行失败的图像
else:interpreter_message_text_prefix = [{"type": "text", "text": f"<interpreter>{error_result}"}]