Speech to Text in Depth: API Endpoints, Practical Examples, and Best Practices

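Overview
Endpoint Types and Models
Supported Audio Formats and File Size Limit
Quick Start
Audio Transcription
Audio Translation
Supported Languages
Timestamps
Handling Longer Audio
Prompting and Transcription Quality
Streaming Transcription
Streaming the Transcription of a Completed Audio File
Streaming the Transcription of a Live Recording
Tips for More Reliable Transcription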

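Overview

This article explains how to convert audio to text through an API, covering the endpoints and models, the main parameters, key technical details, and practical code examples. All examples use https://zzzzapi.com as the demo base URL. This domain is for demonstration only; replace it with your own or a compliant service address in real projects.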
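
As a minimal sketch of how the base URL and API key might be supplied without hard-coding them, the helper below reads both from environment variables. The function name load_client_settings is hypothetical, and the /v1 path suffix on the demo host is an assumption; check what your service actually expects.

```python
import os

# Demo host from this article; the "/v1" suffix is an assumption.
DEMO_BASE_URL = "https://zzzzapi.com/v1"


def load_client_settings(env=None):
    """Return keyword arguments suitable for OpenAI(...), read from the environment."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # Fail early instead of sending requests without credentials.
        raise RuntimeError("OPENAI_API_KEY is not set")
    base_url = env.get("OPENAI_BASE_URL", DEMO_BASE_URL).rstrip("/")
    return {"api_key": api_key, "base_url": base_url}


# Usage (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(**load_client_settings())
```

Keeping the key in the environment rather than in source code also makes it harder to leak through version control.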

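Endpoint Types and Models

The API provides two main speech-to-text endpoints:

Transcriptions: converts audio into text in the language spoken in the audio.
Translations: translates audio in other languages and transcribes it directly into English text.

Historically, both endpoints were backed by the open-source Whisper model (whisper-1). The transcriptions endpoint now also supports higher-quality model snapshots, which accept a more limited set of parameters:
- gpt-4o-mini-transcribe
- gpt-4o-transcribe

Update note: some newer models (such as the GPT-4o series) support fewer parameters and output formats, and can only return json or plain text.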
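
The model differences described in this article can be captured in a small pre-flight check. The sketch below encodes only the restrictions stated here (GPT-4o models return json or text only, timestamps are whisper-1 only, streaming is not available for whisper-1). The function name find_unsupported_options is hypothetical, and the rules should be re-checked against the current API documentation.

```python
WHISPER_MODEL = "whisper-1"
GPT4O_MODELS = {"gpt-4o-transcribe", "gpt-4o-mini-transcribe"}


def find_unsupported_options(model, response_format="json", stream=False,
                             timestamp_granularities=None):
    """Return a list of human-readable problems; an empty list means no known conflict."""
    problems = []
    if model in GPT4O_MODELS:
        if response_format not in {"json", "text"}:
            problems.append(f"{model} only returns json or text, not {response_format}")
        if timestamp_granularities:
            problems.append(f"{model} does not accept timestamp_granularities")
    elif model == WHISPER_MODEL:
        if stream:
            problems.append("whisper-1 does not support streaming")
    else:
        problems.append(f"unknown model: {model}")
    return problems
```

Running such a check before the request gives a clearer error message than a rejected API call.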

Supported Audio Formats and File Size Limit

File size limit: 25 MB maximum.
Supported input formats: mp3, mp4, mpeg, mpga, m4a, wav, webm.
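
These two limits are easy to check locally before uploading. The following is a minimal sketch; the helper name find_upload_problems is hypothetical, and reading 25 MB as 25,000,000 bytes is a deliberately conservative assumption, since the article does not say whether decimal or binary megabytes are meant.

```python
from pathlib import Path

# Conservative reading of "25 MB" (assumption: decimal megabytes).
MAX_UPLOAD_BYTES = 25 * 1000 * 1000
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}


def find_upload_problems(filename, size_bytes):
    """Return a list of reasons the file would likely be rejected (empty if none)."""
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems


# Usage with a real file:
#   path = Path("audio.mp3")
#   print(find_upload_problems(path.name, path.stat().st_size))
```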

Quick Start

Audio Transcription

The following Python example shows how to transcribe a local audio file to text:

Example filename: transcribe_audio.py

from openai import OpenAI

client = OpenAI()

# Open the audio file in binary mode; make sure the path is correct
with open("audio.mp3", "rb") as audio_file:
    # Transcribe with the higher-quality model (or use whisper-1, depending on your needs)
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcription.text)

The default response format is json, which contains the raw text.
To request plain text instead:

from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="text",
    )

# With response_format="text" the Python SDK returns a plain string
print(transcription)

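Security and error-handling recommendations:
- Obtain your API key through legitimate channels and make sure it cannot leak.
- Catch exceptions and set timeouts when calling the API.
- Avoid uploading audio that contains sensitive information.
- Respect rate limits and configure a sensible retry policy.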
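
The exception handling and retry advice above can be sketched as a small wrapper with exponential backoff. The name call_with_retries and its defaults are hypothetical; which exceptions are worth retrying depends on your client library.

```python
import random
import time


def call_with_retries(func, max_attempts=4, base_delay=0.5,
                      retry_on=(Exception,), sleep=time.sleep, jitter=0.0):
    """Call func(); on a retryable error wait base_delay * 2**n seconds and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Optional random jitter spreads out retries from concurrent clients.
            sleep(delay + random.uniform(0, jitter))


# Usage sketch (assumes the openai package; reopen the file on each attempt):
#   def transcribe():
#       with open("audio.mp3", "rb") as f:
#           return client.audio.transcriptions.create(model="whisper-1", file=f)
#   result = call_with_retries(transcribe, retry_on=(TimeoutError, ConnectionError))
```

Note that the official Python SDK also exposes its own timeout and max_retries client options, so a wrapper like this is mainly useful when you need custom behavior.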

See the official API documentation for parameter details; parameter support varies between models.

Audio Translation

The translations endpoint is only supported by the whisper-1 model. To translate non-English audio into English:

Example filename: translate_audio.py

from openai import OpenAI

client = OpenAI()

with open("german.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

print(translation.text)

The output is English text.
English is currently the only supported target language.

Supported Languages

The following languages are currently supported (identified by a subset of ISO 639-1/639-3 codes):

Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh.

The underlying model was trained on 98 languages, but only languages with a word error rate (WER) below 50% are listed as supported. Other languages can be tried, but quality is not guaranteed.

Timestamps

With the timestamp_granularities[] parameter, whisper-1 can add timestamps to the transcription at the segment or word level.

Example filename: transcribe_with_timestamps.py

from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# Word-level entries with their timestamps
print(transcription.words)

Only whisper-1 supports this parameter.
Timestamps are useful for subtitles, video editing, and similar scenarios.
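
To illustrate the subtitle use case, the sketch below groups word-level timestamps into SRT cues. It assumes each word is a dict with word, start, and end keys (times in seconds); the SDK may return objects with attributes instead, in which case they would need converting first. The function names are hypothetical.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group consecutive words into numbered SRT cues of at most max_words each."""
    cues = []
    for index in range(0, len(words), max_words):
        group = words[index:index + max_words]
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        text = " ".join(item["word"] for item in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Grouping by a fixed word count is the simplest choice; splitting at pauses or punctuation would usually read better.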

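Handling Longer Audio

Because of the 25 MB limit, long audio has to be split into chunks. Try not to cut in the middle of a sentence, since that loses context; a library such as pydub can help:

Example filename: split_audio.py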

from pydub import AudioSegment

# Load the audio
song = AudioSegment.from_mp3("good_morning.mp3")

# pydub works in milliseconds
ten_minutes = 10 * 60 * 1000

# Keep the first 10 minutes and export them as a new file
first_10_minutes = song[:ten_minutes]
first_10_minutes.export("good_morning_10.mp3", format="mp3")

Note: evaluate the security and compatibility of third-party tools such as pydub yourself.
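
The example above exports only the first ten minutes. To cover a whole recording, one option is to compute the boundaries of every chunk first. The helper below is a minimal sketch with a hypothetical name; it cuts at fixed intervals, so it does not by itself avoid splitting mid-sentence.

```python
def chunk_ranges(total_ms, chunk_ms):
    """Return (start_ms, end_ms) pairs that cover total_ms in chunks of at most chunk_ms."""
    if total_ms < 0 or chunk_ms <= 0:
        raise ValueError("total_ms must be >= 0 and chunk_ms must be > 0")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


# Usage with pydub (len(song) is the duration in milliseconds):
#   for index, (start, end) in enumerate(chunk_ranges(len(song), 10 * 60 * 1000)):
#       song[start:end].export(f"good_morning_{index}.mp3", format="mp3")
```

In practice the boundaries could be nudged toward nearby silences so that sentences stay intact.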

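Prompting and Transcription Quality

The prompt parameter can help the model recognize rare words and proper nouns, or keep a consistent text style:

Example filename: transcribe_with_prompt.py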

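from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="text",
        prompt="This recording is a lecture about OpenAI, GPT-4.5, and the future of artificial intelligence.",
    )

# With response_format="text" the Python SDK returns a plain string
print(transcription)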

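Typical uses of a prompt:
- Supplying domain vocabulary and proper nouns (such as DALL·E or GPT-3).
- Passing the previous segment's text to preserve context when audio is processed in chunks.
- Steering the output style (for example Simplified vs. Traditional Chinese, punctuation, or filler words).

whisper-1 only considers the final 224 tokens of the prompt, which limits how much context it can carry.
For multilingual input it uses a dedicated tokenizer; see the Whisper open-source package for details.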
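
Passing the previous segment's text as the prompt for the next chunk can be sketched as follows. The transcribe argument is a hypothetical callable (path, prompt) -> text that wraps the actual API call, and trimming the prompt to a number of whitespace-separated words is only a rough stand-in for the 224-token limit, not a real token count.

```python
def transcribe_chunks(chunk_paths, transcribe, max_prompt_words=150):
    """Transcribe chunks in order, feeding the tail of each result into the next prompt."""
    texts = []
    prompt = ""
    for path in chunk_paths:
        text = transcribe(path, prompt)
        texts.append(text)
        # Keep only the most recent words; older context would be ignored anyway.
        prompt = " ".join(text.split()[-max_prompt_words:])
    return " ".join(texts)


# Usage sketch (assumes a configured client):
#   def transcribe(path, prompt):
#       with open(path, "rb") as f:
#           return client.audio.transcriptions.create(
#               model="whisper-1", file=f, prompt=prompt).text
#   full_text = transcribe_chunks(["part_0.mp3", "part_1.mp3"], transcribe)
```

Word-based trimming will not work well for languages written without spaces, such as Chinese; a character count or a real tokenizer would be more appropriate there.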

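Streaming Transcription

Streaming the Transcription of a Completed Audio File

The API can stream the transcription of an already completed recording, for example an audio file or a segment delimited by your own turn detection:

Example filename: stream_transcription.py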

from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        response_format="text",
        stream=True,
    )
    for event in stream:
        print(event)  # print transcription events as they arrive

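A transcript.text.delta event arrives as soon as each piece of the transcription is ready, and a final transcript.text.done event carries the complete text.
The include[] parameter can request token log probabilities (logprobs), which help gauge the model's confidence.
Streaming is not available for whisper-1.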
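
Instead of printing raw events, the deltas can be accumulated into the final text. The sketch below assumes each event exposes a type attribute, delta events expose delta, and the done event exposes text. Those attribute names are inferred from the event names above and should be checked against the SDK you use.

```python
def collect_transcript(events, on_delta=None):
    """Consume streaming events and return the full transcript text."""
    parts = []
    for event in events:
        event_type = getattr(event, "type", None)
        if event_type == "transcript.text.delta":
            parts.append(event.delta)
            if on_delta is not None:
                on_delta(event.delta)  # e.g. update a live caption
        elif event_type == "transcript.text.done":
            # The final event carries the complete text; prefer it over the pieces.
            return event.text
    # Stream ended without a done event: fall back to what was received.
    return "".join(parts)


# Usage sketch:
#   text = collect_transcript(stream, on_delta=lambda d: print(d, end="", flush=True))
```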

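Streaming the Transcription of a Live Recording

With the Realtime API, you can push audio over a WebSocket connection and receive transcriptions while the recording is still in progress. Example connection URL:

wss://zzzzapi.com/v1/realtime?intent=transcription

Example payload for initializing a transcription session: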

{
  "type": "transcription_session.update",
  "input_audio_format": "pcm16",
  "input_audio_transcription": {
    "model": "gpt-4o-transcribe",
    "prompt": "",
    "language": ""
  },
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500
  },
  "input_audio_noise_reduction": {
    "type": "near_field"
  },
  "include": ["item.input_audio_transcription.logprobs"]
}

Pushing audio data:

{
  "type": "input_audio_buffer.append",
  "audio": "Base64EncodedAudioData"
}
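
The append message carries the raw audio as a Base64 string. A minimal sketch of building such messages from PCM bytes follows; the helper names and the chunk size are hypothetical, and actually sending the messages requires a WebSocket client of your choice.

```python
import base64
import json


def make_append_event(pcm_chunk):
    """Wrap raw audio bytes in an input_audio_buffer.append JSON message."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def split_pcm(pcm_bytes, chunk_bytes=3200):
    """Yield successive slices of at most chunk_bytes bytes."""
    for start in range(0, len(pcm_bytes), chunk_bytes):
        yield pcm_bytes[start:start + chunk_bytes]


# Usage sketch (ws is a hypothetical connected WebSocket object):
#   for chunk in split_pcm(recorded_pcm):
#       ws.send(make_append_event(chunk))
```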

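In VAD (voice activity detection) mode, the server returns an input_audio_buffer.committed event each time it detects a chunk of speech.
The API also returns events for speech start, speech stop, and transcription completion.
See the API documentation for the structure of the realtime session resource.
The WebSocket connection can be authenticated directly with an API key or with an ephemeral token, which can be obtained via POST /v1/realtime/transcription_sessions.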
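
On the receiving side, incoming messages can be routed by their type field. The sketch below tracks committed audio items and finished transcripts. Only input_audio_buffer.committed is named in this article; the completion event name conversation.item.input_audio_transcription.completed and the item_id and transcript fields are assumptions to verify against the API documentation.

```python
import json


class TranscriptCollector:
    """Track committed audio items and their transcripts from server events."""

    def __init__(self):
        self.committed_item_ids = []
        self.transcripts = {}

    def handle(self, raw_message):
        """Process one JSON message from the server and return its event type."""
        event = json.loads(raw_message)
        event_type = event.get("type")
        if event_type == "input_audio_buffer.committed":
            self.committed_item_ids.append(event.get("item_id"))
        elif event_type == "conversation.item.input_audio_transcription.completed":
            # Assumed event name and fields; not confirmed by this article.
            self.transcripts[event.get("item_id")] = event.get("transcript", "")
        return event_type
```

Unknown event types are simply returned to the caller, so new events do not break the loop.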

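Tips for More Reliable Transcription

For common challenges such as recognizing rare words, consider the following:
- Use a suitable prompt to provide context.
- Post-process the transcript with a model such as GPT-4 to correct the text.
- Use token log probabilities (logprobs) to help assess the output.
- When processing audio in multiple chunks, carry context over from one chunk to the next.
- Check which parameters each model supports and choose the model accordingly.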
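
As one way to use log probabilities for review, the sketch below flags tokens whose probability falls below a threshold. It assumes each logprobs entry is a dict with token and logprob keys, which is an assumption about the response shape, and the 0.5 threshold is arbitrary.

```python
import math


def low_confidence_tokens(logprobs, threshold=0.5):
    """Return (token, probability) pairs whose probability is below threshold."""
    flagged = []
    for entry in logprobs:
        probability = math.exp(entry["logprob"])  # convert log probability back to [0, 1]
        if probability < threshold:
            flagged.append((entry["token"], round(probability, 3)))
    return flagged
```

Flagged tokens are good candidates for manual review or for a correction pass with a language model.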

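Notes:
- The examples above are for technical demonstration only; adapt them to your own compliance and security requirements.
- For timeouts, retries, and rate limits, refer to the official API documentation and best practices.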
