
Speech to Text Conversion

Speech to text conversion is the process of converting spoken words into written text. This process is also often called speech recognition. Although these terms are almost synonymous, speech recognition is sometimes used to describe the wider process of extracting meaning from speech, i.e. speech understanding. The term voice recognition should be avoided, as it is often associated with the process of identifying a person from their voice, i.e. speaker recognition.

How does it work?

All speech-to-text systems rely on at least two models: an acoustic model and a language model. In addition, large-vocabulary systems use a pronunciation model. It is important to understand that there is no such thing as a universal speech recognizer. To get the best transcription quality, all of these models can be specialized for a given language, dialect, application domain, type of speech, and communication channel.
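As a rough illustration of how the acoustic model and the language model work together, the standard textbook formulation of statistical speech recognition is given below. It is a general sketch, not a description of any particular product: the symbols are assumptions introduced here, with X standing for the acoustic observations and W for a candidate word sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)
```

In this formulation, P(X | W) is the score given by the acoustic model (with the pronunciation model linking words to sounds), and P(W) is the score given by the language model. Specializing either model for a language, domain, or channel changes which word sequence wins.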

Like any other pattern recognition technology, speech recognition cannot be error free. The accuracy of a speech transcript is highly dependent on the speaker, the style of speech, and the environmental conditions. Speech recognition is harder than people commonly think, even for a human being. Humans are used to understanding speech, not to transcribing it, and only speech that is well formulated can be transcribed without ambiguity.

From the user's point of view, a speech-to-text system can be categorized based on its use: command and control, dialog system, text dictation, audio document transcription, etc. Each use has specific requirements in terms of latency, memory constraints, vocabulary size, and adaptive features.


The VoxSigma software suite offers large-vocabulary multilingual speech-to-text capabilities with state-of-the-art accuracy. It has been specifically designed for professional users who need to transcribe large quantities of audio and video documents, such as broadcast data, either in batch mode or in real time. It can also be used to analyze call-center data.

The complete voice-to-text conversion process is done in three steps. The software first identifies the audio segments containing speech, then it recognizes the language being spoken if it is not known a priori, and finally it converts the speech segments to text with time codes. VoxSigma includes adaptive features allowing the transcription of noisy speech, such as speech with background music. The result is a fully annotated XML document including speech and non-speech segments, speaker labels, words with time codes, high-quality confidence scores, and punctuation. This XML file can be directly indexed by a search engine, or alternatively can be converted into plain text.
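The conversion of an annotated XML transcript into plain text can be sketched as follows. This is a minimal illustration only: the element names (Transcript, SpeechSegment, Word) and attribute names (speaker, start, dur, conf) are hypothetical and invented for this example, and they should not be read as the actual VoxSigma output schema. The function name xml_to_plain_text and the min_conf threshold are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcript: the tag and attribute names are made up for
# illustration and are NOT the real VoxSigma XML schema.
SAMPLE_XML = """<Transcript>
  <SpeechSegment speaker="spk1" start="0.00" end="1.20">
    <Word start="0.00" dur="0.40" conf="0.98">hello</Word>
    <Word start="0.45" dur="0.50" conf="0.91">world</Word>
  </SpeechSegment>
  <SpeechSegment speaker="spk2" start="1.50" end="2.30">
    <Word start="1.50" dur="0.30" conf="0.42">good</Word>
    <Word start="1.85" dur="0.40" conf="0.95">morning</Word>
  </SpeechSegment>
</Transcript>"""


def xml_to_plain_text(xml_string: str, min_conf: float = 0.0) -> str:
    """Return one line of text per speech segment, prefixed by its speaker label.

    Words whose confidence score is below min_conf are dropped.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for segment in root.iter("SpeechSegment"):
        words = [
            word.text.strip()
            for word in segment.iter("Word")
            if word.text and float(word.get("conf", "1.0")) >= min_conf
        ]
        if words:
            speaker = segment.get("speaker", "unknown")
            lines.append(f"{speaker}: {' '.join(words)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(xml_to_plain_text(SAMPLE_XML))
    print(xml_to_plain_text(SAMPLE_XML, min_conf=0.5))
```

Filtering on the confidence score shows one way per-word confidence values might be used downstream, for example to keep only reliable words when building a search index.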

Vocapia Research also offers services to adapt, tune or create specific models or systems tailored to your needs. Tailoring models to your application is the best way to get the best possible results. High accuracy is essential to maximize your ROI because, to a first approximation, the cost of using a speech-to-text system is proportional to the system's error rate. Therefore, using a system with 80% accuracy (i.e. 20% error) may cost almost twice as much as using a system with 90% accuracy (i.e. 10% error). The same holds for systems with 90% and 95% accuracy: although the difference in error rate is only 5 percentage points, the first system makes twice as many errors as the second.
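The arithmetic behind this comparison can be checked with a short sketch. It assumes, as the text does, that cost is simply proportional to the error rate; the function names error_rate and relative_cost are introduced here for illustration.

```python
def error_rate(accuracy_pct: float) -> float:
    """Convert an accuracy percentage into an error percentage."""
    return 100.0 - accuracy_pct


def relative_cost(accuracy_a: float, accuracy_b: float) -> float:
    """Cost of system A relative to system B, assuming cost is
    proportional to error rate (a first approximation only)."""
    error_b = error_rate(accuracy_b)
    if error_b == 0:
        raise ValueError("system B has no errors; the ratio is undefined")
    return error_rate(accuracy_a) / error_b


if __name__ == "__main__":
    print(relative_cost(80, 90))  # 20% error vs 10% error
    print(relative_cost(90, 95))  # 10% error vs 5% error
```

Both comparisons give a ratio of 2, which is why a gain of a few accuracy points matters more as accuracy increases.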

Friday June 14, 2024

© Vocapia Research SAS,
2006-2023. All rights reserved.
