Microsoft has released VibeVoice-ASR, a unified speech-to-text model available on Hugging Face that can transcribe up to 60 minutes of audio in a single pass, without splitting it into chunks.

Key features include:

  • Single-pass transcription for up to one hour of audio, which reduces context loss and keeps recognition stable throughout the recording.
  • Built-in diarization and timestamps that identify who is speaking and when.
  • Custom hotwords and user context input to improve recognition accuracy for domain-specific words and names.

The model outputs a structured transcription indicating Who spoke, When, and What was said.
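To make the Who / When / What structure concrete, here is a minimal Python sketch of how such a transcript could be represented and printed downstream. The announcement does not specify the model's actual output schema, so the `Segment` class, its field names, and the `render` helper are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized utterance. Field names are illustrative assumptions,
    not the model's documented output schema."""
    speaker: str  # Who spoke
    start: float  # When the utterance starts, in seconds
    end: float    # When the utterance ends, in seconds
    text: str     # What was said


def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (fractions truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Render segments in chronological order, one line per utterance."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        span = f"{format_timestamp(seg.start)} - {format_timestamp(seg.end)}"
        lines.append(f"[{span}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


# Made-up sample data, deliberately out of order to show the sorting.
segments = [
    Segment("Speaker 2", 65.0, 70.5, "Thanks, glad to be here."),
    Segment("Speaker 1", 0.0, 4.2, "Welcome to the show."),
]
print(render(segments))
```

Keeping speaker, time span, and text together in one record means a consumer can sort, filter by speaker, or export to a subtitle format without re-parsing free text.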

Explore VibeVoice-ASR on Hugging Face