Microsoft Releases VibeVoice-ASR Speech-to-Text Model
Microsoft has released VibeVoice-ASR, a unified speech-to-text model available on Hugging Face that can transcribe up to 60 minutes of audio in a single pass, without splitting the recording into chunks.
Key features include:
- Single-pass transcription for up to one hour, reducing context loss and maintaining stable speech recognition throughout the audio.
- Built-in diarization and timestamps that identify who is speaking and when.
- Custom hotwords and user context input to improve recognition accuracy for domain-specific words and names.
The model outputs a structured transcription that records who spoke, when, and what was said.
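To make the who/when/what structure concrete, here is a minimal Python sketch of how a downstream application might represent and use such a transcription. The `Segment` class, its field names, and the sample data are hypothetical and chosen for illustration only; they are not the model's actual output schema, which should be checked against the model card on Hugging Face.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed utterance (hypothetical schema, not the model's real format)."""
    speaker: str   # who spoke
    start: float   # when the utterance starts, in seconds
    end: float     # when the utterance ends, in seconds
    text: str      # what was said


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS, dropping fractional seconds."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def render(segments: list[Segment]) -> str:
    """Produce a readable transcript with one line per segment."""
    return "\n".join(
        f"[{format_timestamp(s.start)} - {format_timestamp(s.end)}] {s.speaker}: {s.text}"
        for s in segments
    )


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the total speaking duration in seconds for each speaker."""
    totals: dict[str, float] = defaultdict(float)
    for s in segments:
        totals[s.speaker] += s.end - s.start
    return dict(totals)


# Made-up sample data spanning most of a one-hour recording.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the meeting."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks, let's get started."),
    Segment("Speaker 1", 3590.0, 3598.0, "That wraps up the hour."),
]

print(render(segments))
print(speaking_time(segments))
```

Because diarization and timestamps arrive together with the text, summaries such as per-speaker talk time can be computed directly from the segments without a separate alignment step.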