                Yes, but speech recognition accuracy varies greatly depending on
                many factors, including the type of speech (prepared, spontaneous,
                or conversational) and the noise level. You can expect very good
                results when transcribing a news anchor in a TV or radio news show,
                but much poorer results for someone engaged in a very casual
                conversation.
            
                Yes, the output of the VoxSigma software is an XML file that can be
                easily converted into plain punctuated text by discarding additional
                information such as word time-codes and word confidence scores.
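
                As a rough illustration of that conversion, here is a minimal Python
                sketch. The element and attribute names used below (Word, stime, dur,
                conf) and the convention that punctuation marks appear as separate
                word tokens are assumptions made for this example, not a description
                of the actual VoxSigma schema; check them against a real output file
                before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag and attribute names are assumptions for illustration.
SAMPLE = """<AudioDoc>
  <SegmentList>
    <SpeechSegment>
      <Word stime="0.52" dur="0.20" conf="0.98"> Hello </Word>
      <Word stime="0.72" dur="0.05" conf="0.95"> , </Word>
      <Word stime="0.80" dur="0.31" conf="0.91"> world </Word>
      <Word stime="1.11" dur="0.05" conf="0.97"> . </Word>
    </SpeechSegment>
  </SegmentList>
</AudioDoc>"""

# Tokens that attach to the preceding word without a space.
PUNCTUATION = {".", ",", "?", "!", ";", ":"}


def xml_to_text(xml_string: str, word_tag: str = "Word") -> str:
    """Keep only the word text, dropping time-codes and confidence scores."""
    root = ET.fromstring(xml_string)
    text = ""
    for word in root.iter(word_tag):
        token = (word.text or "").strip()
        if not token:
            continue
        if token in PUNCTUATION or not text:
            text += token          # no space before punctuation or first token
        else:
            text += " " + token
    return text


plain_text = xml_to_text(SAMPLE)
print(plain_text)
```

                The attributes are simply ignored here; a variant could filter out
                words whose confidence score falls below a chosen threshold.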
            
                It depends greatly on the language resources available for the
                specific language, and on the type of speech data you want to
                process. We support many languages, including Arabic, Cantonese,
                Czech, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi,
                Hungarian, Italian, Latvian, Lithuanian, Mandarin, Pashto, Persian,
                Polish, Portuguese, Romanian, Russian, Spanish, Swahili, Swedish,
                Turkish, Ukrainian and Urdu. Contact us to get a more precise answer
                for the languages you are interested in.
            
                Vocapia Research LVCSR systems come with fully trained language
                models, so the only information you have to provide to the system is
                the language being spoken. If the language is not known, it can be
                identified automatically (among 100 known languages) from the speech
                signal by using the VoxSigma language recognition software.
            
                First, you need a speech data set that is representative of the
                targeted data, along with a reference transcription. This data set
                must be large enough to yield a statistically significant accuracy
                estimate. It is common to use test sets with 3 to 5 hours of speech
                from at least 20 speakers. It is common practice to measure the word
                error rate (WER) rather than the accuracy, as it correlates with the
                cost of using the system. The WER is defined as the sum of the
                substitutions, insertions, and deletions, divided by the total number
                of words in the reference transcription. You can use the NIST sclite
                software to align the reference and hypothesized words, compute the
                WER, and analyze the errors.
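
                To make the WER definition concrete, here is a minimal Python sketch
                that computes it with a standard word-level edit distance. It is not
                a replacement for sclite: it assumes the transcripts are already
                normalized (same casing, no punctuation) and simply splits on
                whitespace. The function names are chosen for this example only.

```python
from typing import Iterable, Sequence, Tuple


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to turn the reference words into the hypothesis words."""
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1]


def word_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over a test set of (reference, hypothesis) transcript pairs:
    total errors divided by total number of reference words."""
    total_errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_errors += word_errors(ref_words, hypothesis.split())
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("the reference transcripts contain no words")
    return total_errors / total_ref_words


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate([("the cat sat on the mat", "the cat sit on mat")])
print(f"WER = {wer:.1%}")
```

                Because insertions count as errors, the WER can exceed 100% when the
                hypothesis contains many more words than the reference.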