As we have seen in other articles, Automatic Speech Recognition (ASR) technology uses devices and software to identify and process spoken language, and it can also be used to verify a person’s identity through their voice. ASR is everywhere, from automatic subtitles on YouTube to virtual assistants like Siri, Google Assistant, and Alexa.
However, while this technology has advanced significantly in recent years, it does not always produce accurate results. When speech is recognized and then transcribed to text, some words may be omitted, added, or transcribed incorrectly. The proportion of such mistakes is, in short, the “word error rate” (WER).
In this article, we will define in detail what WER is and why you should consider it when evaluating the ASR system you plan to implement in your Call Center.
What is WER?
Word Error Rate (WER) is a common metric for measuring the accuracy of voice-to-text conversion. It indicates how many errors occur in a transcription relative to the total number of words that were actually spoken.
To calculate WER, the total number of errors (the sum of substitutions, insertions, and deletions) is divided by the total number of spoken words, that is, the number of words in a correct reference transcript. This measure is based on Levenshtein distance applied at the word level: the minimum number of word edits needed to turn the system's transcription into the reference transcript. For example, if a transcription has 9 errors in a 36-word phone call, the WER would be 9 / 36 = 25%. A low WER usually signals higher ASR accuracy, while a high WER indicates lower accuracy.
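The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular ASR product: the function names `word_errors` and `wer` are made up for this example, and it assumes that lowercasing and splitting on whitespace is enough text normalization (real evaluations usually also normalize punctuation and numbers).

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance between a reference transcript
    and the ASR output (hypothesis)."""
    # Assumption: lowercase + whitespace split is sufficient normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # prev[j] = edits needed to turn the ref words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: a spoken word is missing
                curr[j - 1] + 1,     # insertion: an extra word was added
                prev[j - 1] + cost,  # substitution (or a match if cost is 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Errors divided by the number of words in the reference."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / n_words


reference = "i would like to cancel my order today"
hypothesis = "i would like cancel the order today"
print(wer(reference, hypothesis))  # 0.25 (one deletion + one substitution over 8 words)
```

Because insertions count as errors but do not add to the number of spoken words, a WER computed this way can exceed 100% when the system outputs many extra words.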
However, WER has several limitations as it does not consider the source of errors such as recording quality, background noise, microphone quality, technical or industry-specific terms, and speaker pronunciation. Additionally, WER does not take into account the importance of words for the specific purpose of the transcription. Although a transcription may have a low WER, it may be less useful if it omits relevant keywords for analysis, just as some systems with a relatively high WER can produce useful data in specific contexts. Therefore, it is important to consider how a speech recognition tool will handle data and which words are important for the purpose of transcription.
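To make this limitation concrete, here is a small sketch of a complementary check: the share of important keywords from the reference that survive in the transcription. The function `keyword_recall` and the keyword list are hypothetical choices for this example, not a standard metric defined in this article, and the sketch assumes single-word keywords and simple whitespace tokenization.

```python
def keyword_recall(reference: str, hypothesis: str, keywords: list[str]) -> float:
    """Fraction of the keywords present in the reference that also
    appear in the ASR output. Assumes single-word keywords."""
    ref_words = set(reference.lower().split())
    hyp_words = set(hypothesis.lower().split())
    # Only keywords that were actually spoken can be recovered.
    expected = {k.lower() for k in keywords} & ref_words
    if not expected:
        return 1.0  # nothing to find, so nothing was missed
    return len(expected & hyp_words) / len(expected)


reference = "i want to cancel my contract and request a refund"
keywords = ["cancel", "refund"]  # illustrative business-critical terms

# Both outputs contain two word errors against the 10-word reference
# (WER = 20%), but they are not equally useful for analysis.
output_a = "i want to council my contract and request a refined"
output_b = "i want cancel my contract and request the refund"

print(keyword_recall(reference, output_a, keywords))  # 0.0
print(keyword_recall(reference, output_b, keywords))  # 1.0
```

The two transcriptions have the same WER, yet only the second one keeps the words an analyst would search for, which is why WER is best read alongside a measure tied to the purpose of the transcription.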
How can WER help reduce errors in calls?
To reduce errors in call center calls, WER can be used to monitor the accuracy of the speech recognition system. If the WER is high, it means that the system is making many errors in transcribing the conversation.
When evaluating conversational AI solutions, keep in mind that WER is just one metric for evaluating ASR, and an imperfect one: it only counts errors and does not consider the variables that cause them. To reduce these errors, the following measures can be taken:
- Improve audio quality: The speech recognition system may struggle to transcribe accurately if there is a lot of background noise or the audio quality is poor. Make sure that both customers and agents have good audio quality on their devices.
- Model training: The speech recognition system uses machine learning algorithms to identify patterns in conversation. If the system is making many errors, the model can be trained with more data to improve the accuracy of the system.
- Limit vocabulary: If the system is recognizing incorrect words, the vocabulary used in the conversation can be limited to reduce the number of words the system must recognize.
- Use grammar and context: Grammar and context can be used to help the system better understand the conversation and reduce recognition errors.
However, a mistake many companies make is using the same dataset to both train and evaluate their models, which produces an artificially low WER (that is, artificially high accuracy) that will not hold up on new calls. WER should be measured on recordings the model has never seen, and it is important to select a tool that fits the specific needs of your company and the audio data that will be analyzed.
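One simple way to keep training and evaluation data apart is to assign each call to a split based on a hash of its identifier, so the same call always lands in the same set. The sketch below is only an illustration under assumptions: the call IDs, the 20% test fraction, and the function names `assign_split` and `split_calls` are invented for this example.

```python
import hashlib


def assign_split(call_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a call ID to 'train' or 'test'."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    # Turn the first 4 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "test" if bucket < test_fraction else "train"


def split_calls(call_ids: list[str], test_fraction: float = 0.2):
    """Return (train_ids, test_ids) with no call in both lists."""
    train, test = [], []
    for call_id in call_ids:
        if assign_split(call_id, test_fraction) == "test":
            test.append(call_id)
        else:
            train.append(call_id)
    return train, test


# Hypothetical call identifiers, for illustration only.
call_ids = [f"call-{i}" for i in range(1000)]
train_ids, test_ids = split_calls(call_ids)

# WER should then be reported only on the calls in test_ids.
print(len(train_ids), len(test_ids))
```

Since the assignment depends only on the ID, adding new recordings later never moves an existing call from one set to the other. If the same speakers appear in many calls, hashing a speaker ID instead of a call ID would be a stricter variant of the same idea.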
Upbe's ASR is designed to transcribe phone conversations in Spanish from all Spanish-speaking countries. It is specifically trained for the call center context, where there is background noise, overlapping voices, and limited-quality recordings. It is considered the best ASR, with the lowest WER (Word Error Rate), for Spanish telephone conversations.