Today, call recordings and transcriptions are an important part of an effective customer service experience. Text transcriptions of calls help companies not only measure the quality of the service they provide but also understand their customers' needs, so they can adapt their messaging or products accordingly. It is this drive to improve customer experience and service quality that is giving rise to Artificial Intelligence-based solutions, such as call transcribers, that make the task easier.
<<< Discover Recordia Transcription solution >>>
But what is call transcription software, how does it work, and what are its peculiarities? We are delighted to have our Machine Learning Developer, Miguel Lallena, tell us all about call transcription.
1. First of all, what is call transcription software?
Call transcription software is a service that converts what a person says in an audio recording directly into text. To do this, the software relies on a series of modules and models that take the audio, extract its waveform, identify the sounds it contains, and transform them into characters. These characters are then combined into words and, using a set of internal metrics, the software determines which words are most likely based on how they are usually combined.
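The last step, picking the most likely word, can be pictured with a minimal sketch. This is not how Recordia or any particular engine is implemented; the word frequencies here are invented, standing in for the engine's internal language-model metrics:

```python
# Toy sketch: choose the most likely word among candidate spellings
# using simple word frequencies (a stand-in for a real language model).
word_freq = {"ship": 120, "sheep": 30, "shop": 95}  # invented counts

def best_word(candidates):
    """Return the candidate the 'model' considers most likely."""
    return max(candidates, key=lambda w: word_freq.get(w, 0))

# The acoustic stage might produce several plausible spellings for
# the same stretch of audio; the language model breaks the tie.
print(best_word(["ship", "sheep"]))  # -> ship
```

A real engine scores words in context rather than in isolation, but the principle is the same: the acoustics propose, the language model decides.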
2. What problems do you think it solves?
One of the most important benefits of call transcription software is cost reduction. A company that wants to keep track of what its customers and employees say on calls would otherwise have to process a large number of recordings by hand, which requires several hours of work for each hour of audio. Automatic transcription software avoids that problem, needing only a fraction of that time to achieve the same result. It can also process multiple recordings at the same time, which translates into results at a lower cost and in less time.
Another benefit of call transcription software is text analysis. If this analysis were done manually, besides requiring a lot of time and effort, it would be susceptible to the subjective judgment of the person doing it, or to human error. Automated transcription software avoids this problem because a client only needs to enter a set of rules for the software to extract all the information for the analysis in a systematic and fast way.
3. Is voice identification the same as voice recognition?
Voice identification and voice recognition are different things.
On the one hand, we have voice identification, which is a biometric application that allows us to know which person is speaking by measuring voice parameters and comparing them with a database. On the other hand, we have voice recognition, which is the ability of software to identify what is being spoken. Speech recognition software can identify words and phrases in an audio file and convert them into a machine-readable format. Both technologies can be combined in the same product, but they are different.
To clarify, the two technologies differ in two main ways: what they look for and how they are trained. In terms of what they look for, voice identification looks for patterns inherent to a person, such as tone or timbre of voice, while voice recognition looks for words or phrases that are independent of who is speaking. In terms of training, voice identification needs software that can extract pitch and timbre information and associate it with a person, whereas voice recognition needs software that detects the phonemes being pronounced and associates them with words and phrases.
4. What is the WER rate and why is it important for call transcription software? What is the optimal WER rate percentage?
WER is an acronym for Word Error Rate. This value is obtained by transcribing an audio recording and comparing the result with the correct text. In other words, the WER is the proportion of words that are wrong. Wrong words can be words that were added to the transcription (insertions), words that should be there but are missing (deletions), or words that were replaced by the wrong word (substitutions). Dividing the sum of these errors by the total number of words in the correct text gives us the WER.
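The standard way to count those insertions, deletions, and substitutions is a word-level edit distance. As a minimal, self-contained sketch (real scoring tools add normalization and alignment reports on top of this):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    r, h = reference.split(), hypothesis.split()
    # Classic edit-distance dynamic programming, over words instead of characters.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deleting every reference word
    for j in range(len(h) + 1):
        d[0][j] = j                       # inserting every hypothesis word
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution (or match)
    return d[len(r)][len(h)] / len(r)

# One substitution ("was" -> "is") out of four reference words:
print(wer("the call was recorded", "the call is recorded"))  # 0.25
```

Note that because insertions count as errors, the WER can exceed 1.0 on a very bad transcription, which is why it is usually quoted as a percentage rather than capped at 100%.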
Of course, the optimal WER for call transcription software would be 0, since it would mean a perfect transcription. But that value cannot be consistently achieved due to the peculiarities of automatic transcription software. The job is to keep reducing it through training, testing, or audio cleaning. Around 20-30% is acceptable and is where most automatic transcription software is right now, but you must keep training it, as I say, always bearing in mind that the more you train a model, the harder it becomes to reduce the WER further.
5. What other metrics or parameters are important to consider in call transcription?
An important parameter to consider is the speed of transcription. A manual transcription can give us a perfect text, but it would take hours or even days to get that result. An automatic transcription software, on the other hand, is capable of transcribing in a fraction of the time it would take manually, even less than the actual length of the audio. We usually talk about the “5x” factor, which means that for every hour of work of the automatic transcription software, 5 hours of audio will be transcribed. Moreover, if we add to this the fact that several instances or transcriptions can be launched in parallel, this speed is multiplied.
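The arithmetic behind the "5x" factor is straightforward; a hedged back-of-the-envelope sketch (the factor and instance counts are illustrative, not a guarantee of any particular product):

```python
def processing_hours(audio_hours, speed_factor=5, parallel_instances=1):
    """Machine time needed to transcribe a batch of audio, given a
    real-time speed factor and a number of parallel workers."""
    return audio_hours / (speed_factor * parallel_instances)

# 100 hours of calls at the "5x" factor on a single instance:
print(processing_hours(100))                         # 20.0 hours
# The same batch spread over 4 parallel instances:
print(processing_hours(100, parallel_instances=4))   # 5.0 hours
```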
Another parameter to consider is confidence. Confidence is the degree to which the call transcription software is certain that the transcription is correct. A higher degree of confidence indicates that the call transcription software has performed correctly and therefore the transcribed text is more likely to be correct.
Finally, another parameter to consider is audio quality. The signal-to-noise ratio is an important element because audio with a lot of background noise will be harder to transcribe correctly. In fact, it is important to use this noise to train the transcription engine, since being able to distinguish the noise from the phonemes in the audio will help improve the other parameters we have been discussing.
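Signal-to-noise ratio is usually expressed in decibels, comparing the power of the speech to the power of the noise. A minimal sketch with synthetic samples (real engines estimate this frame by frame from the recording itself; the "speech" and "noise" here are toy waveforms):

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, from two lists of samples."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

# Toy stand-ins: a pure tone for "speech", a small repeating ramp for "noise".
clean = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
hiss = [0.01 * ((t * 7919) % 13 - 6) for t in range(8000)]
print(round(snr_db(clean, hiss), 1))  # roughly 25 dB: quite clean audio
```

As a rough intuition, the higher the dB figure, the easier the transcription; heavy background noise pushes it toward 0 dB, where speech and noise have equal power.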
6. What elements of a call are important for a call transcription software to detect?
One of the first elements to take into account is the identification of the language spoken in the calls because this is clearly the starting point for the transcription model that will be used to transcribe. If we have an audio in English, we cannot apply a Spanish transcription model to it and vice versa, even if we are dealing with similar languages such as Romance languages.
Another element to emphasize is the distinction between speakers in a recording. In the case of stereo audio, this distinction is easy because the agent and the client are each on a different channel. But the problem arises in two situations: several speakers in a mono recording, or the typical audio of an Internet (VoIP) call, where there is a host on one channel and several guests on another. In these situations, a process called diarization is performed, in which the audio is automatically analyzed and the voice of each speaker is distinguished.
Related to diarization, we find overlapping. Overlapping is the phenomenon that occurs when two or more people speak at the same time. In a company dedicated, for example, to telephone sales, overlapping will be an important factor to detect, since it provides information about how the agent treats the customer and about possible customer dissatisfaction.
Waiting periods are another element to detect. These can occur because of the ringing tone, hold music, or a silence within the conversation, and they can affect the company-customer relationship. It should always be borne in mind that a customer who is kept waiting for a long time is a dissatisfied customer, so it is essential for companies to detect these waiting periods and measure their duration.
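One common way to find such waiting periods is to scan frame-level energy for long quiet stretches. A minimal sketch, assuming 20 ms frames and an invented energy threshold (production systems use proper voice activity detection, not a fixed threshold):

```python
def waiting_periods(frame_energies, threshold=0.01, frame_ms=20, min_frames=100):
    """Find stretches of consecutive low-energy frames (silences or holds).

    Returns (start_ms, duration_ms) tuples for every quiet stretch at
    least `min_frames` frames long (100 frames of 20 ms = 2 seconds).
    """
    periods, start = [], None
    for i, e in enumerate(frame_energies):
        if e < threshold and start is None:
            start = i                                  # quiet stretch begins
        elif e >= threshold and start is not None:
            if i - start >= min_frames:                # long enough to report
                periods.append((start * frame_ms, (i - start) * frame_ms))
            start = None
    if start is not None and len(frame_energies) - start >= min_frames:
        periods.append((start * frame_ms, (len(frame_energies) - start) * frame_ms))
    return periods

# 2 s of speech, a 3 s hold, then 1 s of speech (20 ms frames):
energies = [0.2] * 100 + [0.001] * 150 + [0.3] * 50
print(waiting_periods(energies))  # [(2000, 3000)]
```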
Finally, another element to detect is the speed of speech, which can reveal a lot about an agent's or a customer's enthusiasm for a particular topic. It can be measured at a specific moment in the call or across the entire call, revealing how an agent's energy varies over the course of the conversation.
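The usual measure for this is words per minute, computed per segment from the transcript and the timestamps. A trivial sketch:

```python
def words_per_minute(transcript, duration_seconds):
    """Average speech rate over a call segment."""
    return len(transcript.split()) / (duration_seconds / 60)

# A 15-second answer containing 45 words:
segment = " ".join(["word"] * 45)
print(words_per_minute(segment, 15))  # 180.0
```

Comparing this figure segment by segment is what reveals the variation in energy mentioned above.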
7. Many companies have special needs because of the sector where they operate, can the transcription service be customized? What kind of customizations can be made?
Yes, it is possible to customize the transcription service; although it requires some effort on the client's part, it can be done. Several elements can be used for this. First, there are language models, which are especially useful for companies that operate internationally or in multilingual countries such as Belgium. In these circumstances, a company can have several language models and request that each audio file be transcribed with a specific model. Dialects of a language can even be included.
On the other hand, in each language, there will be the possibility of using a generic model or more specialized models depending on the sector where the company operates. For example, a model for a bank will mention mortgage, APR, or pension plan, while a model for the health sector will include words such as gastroenteritis, coronary, or stroke.
But if you want to go even further, there are customizations that allow you to add names of competitors or your own products to these dictionaries. The company can provide a list of words together with the pronunciation of each one of them so that they can be included in the dictionary that feeds the model.
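Conceptually, such a custom vocabulary is just extra entries merged into the engine's pronunciation dictionary. A hedged sketch (the entries, the competitor name, and the `register_words` helper are all invented for illustration; real engines have their own ingestion formats):

```python
# Hypothetical custom-vocabulary entries a company might supply:
# each word plus an IPA-style pronunciation.
custom_lexicon = {
    "Recordia": "r e k o r d i a",
    "ContactX": "k o n t a k t e k s",   # fictional competitor name
    "APR":      "eɪ p iː ɑː",
}

def register_words(engine_lexicon, additions):
    """Merge custom entries into the engine's pronunciation dictionary."""
    merged = dict(engine_lexicon)
    merged.update(additions)             # custom entries win on conflicts
    return merged

base = {"mortgage": "m ɔː ɡ ɪ dʒ"}
print(sorted(register_words(base, custom_lexicon)))
```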
8. Is it difficult to implement a transcription model?
It depends on the technology you use, but initially, rather than difficult, I would say it is a lengthy process. First, you need many conversations, hundreds or thousands of them, containing several hundred hours of audio stored in a specific format. Second, you also need the transcription of all those audio files into text, which requires a lot of work. Then you have to prepare the data, storing the transcriptions and all the other necessary elements together with the audio.
Next, we have to create a dictionary of words, and we need a model that associates sets of words with probabilities. It works in the same way as a cell phone's text predictor: each word has a probability associated with the next word. In addition, each word must be associated with its written pronunciation through, for example, the IPA, that is, a phonetic alphabet in which each sound is associated with a symbol.
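The text-predictor analogy can be made concrete with a tiny bigram model, where each word maps to counts of the words that follow it. The training text here is invented, and real systems use far larger corpora and longer contexts, but the mechanism is the same:

```python
from collections import defaultdict

# Tiny bigram model trained on an invented snippet of call dialogue.
training = "good morning how can i help you good morning i need help".split()

follows = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(training, training[1:]):
    follows[prev][nxt] += 1              # count each observed word pair

def predict_next(word):
    """Most probable next word given the previous one."""
    options = follows[word]
    return max(options, key=options.get) if options else None

print(predict_next("good"))  # morning ("good morning" was seen twice)
```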
Finally, once you have all of this, you will need an acoustic model that associates phonemes with sounds and is trained using neural networks, which requires more time because it processes a large amount of data, comparing the results with the correct ones and adjusting internally. As I say, it is a long process, and it also suffers from the problem that, if the computer fails, you practically have to start again from the beginning.
9. Are there any new developments in the world of call transcription?
Currently, the main effort in automatic transcription software is focused on improving accuracy, reducing the WER, and increasing transcription speed. But recently, new ideas and complementary features are emerging and adding value, such as keyword search. This helps companies keep track of how their agents interact with customers and whether they follow the scripts they should. Companies can also take the text of the transcripts and analyze it automatically to detect customer sentiment, to know whether customers are satisfied. An unhappy customer is a lost customer, and this can be detected through the transcript.
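In its simplest form, keyword search over a transcript is just a scan for a watch-list of terms and where they occur. A minimal sketch (real products match against timestamps and handle inflections; the call text and keyword list here are invented):

```python
def find_keywords(transcript, keywords):
    """Return each keyword with the word positions where it occurs."""
    words = transcript.lower().split()
    hits = {k: [] for k in keywords}
    for i, w in enumerate(words):
        token = w.strip(".,?!")          # drop trailing punctuation
        if token in hits:
            hits[token].append(i)
    return {k: v for k, v in hits.items() if v}

call = "Thank you for calling. Are you happy with your mortgage? A refund is possible."
print(find_keywords(call, ["mortgage", "refund", "cancel"]))
# {'mortgage': [9], 'refund': [11]}
```

Flagging which watch-list words appear, and which never do, is exactly what lets a supervisor check script compliance at scale.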
Finally, you can also analyze the syntax of the text as it can give you ideas of how the agent or the customer is expressing themselves and provide useful data. For example, an agent who speaks in the first person is an agent who is likely to be difficult to deal with.
10. And finally, what would you say to a company that is hesitating to use a call transcription software?
I would tell them that a call transcription software, properly used, will help them to increase the value of all the interactions they have with their customers. Every conversation you have will be a new source of data and information for the company, which will then have the ability to satisfy existing customers, attract new ones, and expand its sights to new horizons.
If you still want to learn more about call transcription software, click here.
See you soon!
