Whisper

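Whisper is a speech recognition model that OpenAI released in September 2022. It was trained on 680,000 hours of audio and supports 99 languages, including Korean. OpenAI published the model as open source and also provides a hosted API for it.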

The model continues to receive version upgrades, but the API still uses the model name from the first version: whisper-1.

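The API's request body reuses parameters familiar from the language-model APIs, such as prompt and temperature. This fits the fact that Whisper's decoder generates text tokens autoregressively, much as a language model does. It does not by itself show that Whisper is built on one of OpenAI's LLMs.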

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
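The multitask format described above can be pictured as a short prefix of special tokens that tells the decoder what to do. The sketch below is only an illustration: the token strings follow the ones used in the open-source Whisper tokenizer, while the helper name build_decoder_prefix is hypothetical and the real tokenizer works with token ids rather than strings.

```python
# Simplified illustration of Whisper's task-specifier prefix.
# build_decoder_prefix is a hypothetical helper, not part of any library.

def build_decoder_prefix(language: str, task: str = "transcribe",
                         timestamps: bool = False) -> list[str]:
    """Return the special tokens that precede the predicted text."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Without this token the decoder is expected to emit timestamp tokens.
        prefix.append("<|notimestamps|>")
    return prefix


print(build_decoder_prefix("ko"))
# ['<|startoftranscript|>', '<|ko|>', '<|transcribe|>', '<|notimestamps|>']
```

Changing only the task token from transcribe to translate is what lets the same model switch between recognition and translation.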

[Speech recognition; speech to text]

The API takes an audio file and converts it to text.

It supports a range of audio file formats (m4a, mp3, mp4, mpeg, mpga, wav, webm, and others).
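A minimal sketch of a transcription call, assuming the official openai Python package is installed and OPENAI_API_KEY is set in the environment. The helper names has_supported_extension and transcribe are hypothetical, and the extension set only mirrors the formats listed above, so the service may accept more.

```python
from pathlib import Path

# Formats named in these notes; treated here as an assumption, not a full list.
SUPPORTED_EXTENSIONS = {"m4a", "mp3", "mp4", "mpeg", "mpga", "wav", "webm"}


def has_supported_extension(path: str) -> bool:
    """Check the file extension against the formats listed in the notes."""
    return Path(path).suffix.lower().lstrip(".") in SUPPORTED_EXTENSIONS


def transcribe(path: str) -> str:
    """Send one audio file to the hosted Whisper model and return the text."""
    if not has_supported_extension(path):
        raise ValueError(f"unsupported audio format: {path}")
    # Imported lazily so the helper above works without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    return result.text
```

Calling transcribe("meeting.mp3") would then return the recognized text as a plain string; the file name here is just a placeholder.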

A prompt can also be supplied: an optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language.

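Treating the transcribed text as the response, the response format can also be specified with response_format. The default is json; the allowed values are json, text, srt, verbose_json, and vtt.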

verbose_json: returns detailed information, including timestamps.
- When timestamps are returned, their granularity can be chosen: timestamp_granularities[] accepts word or segment. There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

temperature (the sampling temperature, between 0 and 1) can also be set.
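The optional parameters above can be collected in one place before the request is sent. The following is a sketch under stated assumptions: build_transcription_params is a hypothetical helper, and the rule that timestamp_granularities requires response_format to be verbose_json is taken from the API reference rather than from these notes.

```python
from typing import Optional, Sequence

RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
GRANULARITIES = {"word", "segment"}


def build_transcription_params(
    response_format: str = "json",
    prompt: Optional[str] = None,
    temperature: Optional[float] = None,
    timestamp_granularities: Optional[Sequence[str]] = None,
) -> dict:
    """Validate the optional settings and return keyword arguments."""
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unknown response_format: {response_format}")
    params: dict = {"model": "whisper-1", "response_format": response_format}
    if prompt is not None:
        params["prompt"] = prompt  # should be in the same language as the audio
    if temperature is not None:
        params["temperature"] = temperature
    if timestamp_granularities:
        unknown = set(timestamp_granularities) - GRANULARITIES
        if unknown:
            raise ValueError(f"unknown granularity: {sorted(unknown)}")
        # Assumption from the API reference: timestamps need verbose_json.
        if response_format != "verbose_json":
            raise ValueError("timestamp_granularities requires verbose_json")
        params["timestamp_granularities"] = list(timestamp_granularities)
    return params


print(build_transcription_params("verbose_json", temperature=0.0,
                                 timestamp_granularities=["word"]))
```

With the openai Python package, the resulting dictionary could be passed as client.audio.transcriptions.create(file=audio_file, **params), where client and audio_file are an already created client and an open binary file.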