This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Using the Evals API, we will grade model-generated responses to an audio message and prompt: first we sample responses from the model, then we use a model grader to score those sampled responses against the reference answer. Note that grading operates on the audio output of the sampled response.
Before audio support was added, evaluating an audio conversation required first transcribing it to text. Now you can pass in the original audio and sample audio responses from the model as well. This more accurately represents workflows such as customer support, where both the user and the agent communicate in audio. For grading, we will use an audio model to score the audio response with a model grader. Alternatively, or in combination, we could use the text transcript of the sampled audio and leverage the existing suite of text graders.
In this example, we will evaluate how well our model can:
- Generate appropriate responses to user prompts about an audio message
- Align with reference answers that represent high-quality responses
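To make the workflow concrete, the sketch below shows the two pieces described above as plain Python dictionaries: a sampling configuration that asks an audio-capable model for an audio response, and a model-grader criterion in which an audio model scores the sampled response against the reference answer. The field names and model name here are illustrative assumptions, not the exact Evals API schema; no network calls are made.

```python
# A minimal sketch of the two payload shapes involved. Field names and the
# model name are assumptions for illustration; consult the Evals API
# reference for the exact schema. No API calls are made here.

# Sampling configuration: ask an audio-capable model to answer the user's
# audio message and return an audio response.
sampling_params = {
    "model": "gpt-4o-audio-preview",      # assumed audio-capable model name
    "modalities": ["text", "audio"],      # request audio output alongside text
    "audio": {"voice": "alloy", "format": "wav"},
}

# Model-grader criterion: an audio model compares the sampled audio response
# to the reference answer and scores it on a 1-5 scale.
audio_grader = {
    "type": "score_model",                # model-based grader (hypothetical shape)
    "name": "audio_response_grader",
    "model": "gpt-4o-audio-preview",
    "input": [
        {
            "role": "system",
            "content": (
                "You are grading an audio response. Compare it to the "
                "reference answer and rate it from 1 to 5 for accuracy "
                "and tone."
            ),
        },
    ],
    "range": [1, 5],
    "pass_threshold": 4,
}

print(sampling_params["modalities"])
print(audio_grader["range"])
```

Keeping the sampled audio (rather than a transcript) in the grader input is what lets the grader judge qualities a transcript loses, such as tone of voice.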