When talking about voice technology, many people have questions about the multitude of terms and definitions that are used in the field.
What is the difference between text-to-speech, voice changer, speech recognition, voice ID, paralinguistics or voice biometrics, just to name a few?
We have put together some definitions and explanations for the most widely used terms below. Moreover, we analyzed the popularity of each topic within the field of voice technology in order to understand the perceived priorities of the public.
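While the variety of terms within the field of voice technology is high and continuously rising, most can be categorized as the industry’s answer to one of the following three questions: WHAT is being said (semantics), WHO is speaking (speaker recognition, also called voice biometrics), and HOW is it being said (paralinguistics).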
The following table shows some of the most popular terms in voice tech. The analysis was conducted with the tool ubersuggest.com, which is based on data from search engines:
* mixed usage, sometimes also used as a synonym for “speech recognition”
** sometimes also used in a text-to-speech context
Source: analysis by PeakProfiling based on ubersuggest.com - listing terms popular in online search, September 2020
The most popular terms within voice tech refer to semantics: algorithms try to recognize what a speaker is saying and convert it to a written format (speech to text), or vice versa (text to speech). The field has a long research tradition with various waves of innovation. In recent years, success rates have been pushed beyond human-level performance by means of big data and progress in artificial intelligence.
Speech recognition systems are widely used in applications ranging from car interfaces to digital assistants like Amazon Alexa. However, considering the list of top search terms, it seems fair to assume that speech recognition is mostly used to craft voice messages and memos: search terms like voice recording / speech recording, voice typing, speech memo, etc. rank high in the list of the most searched-for terms (see below).
Overall, this segment within voice tech is pretty mature. Current progress is mostly incremental, for example improving recognition rates in noisy situations or for languages with smaller speaker populations.
The second well-known area within voice tech is speaker recognition, also called voice biometrics. Historically, there has been a close connection to speech recognition (see above), because recognizing ‘what’ someone is saying is easier if the analysis can be tailored to one specific person. However, speaker recognition is also used as a standalone solution to recognize voices, i.e. for verification or identification purposes.
Verification, also called authentication, means assessing whether the claimed identity of a speaker is correct, e.g. when attempting to enter a sensitive system. For example, this is sometimes used in banking applications or in physical alarm systems (high security doors, etc.). Technically, the task is comparatively simple: patterns in the voice are compared to previous recordings of the claimed speaker for similarity.
Identification is typically more complex, since the voice has to be compared against a wider set of recordings. However, the boundaries are increasingly blurring.
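To make the difference between the two tasks concrete, here is a minimal sketch in Python. It assumes that each recording has already been turned into a numeric feature vector (a "voiceprint") and that similarity is measured with cosine similarity. The vectors, the threshold of 0.8, and the function names are illustrative assumptions, not a description of any specific product.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(sample, enrolled, threshold=0.8):
    """Verification (1:1): does the sample match the one claimed speaker?"""
    return cosine_similarity(sample, enrolled) >= threshold

def identify(sample, enrolled_by_speaker, threshold=0.8):
    """Identification (1:N): best-matching enrolled speaker, or None."""
    best_name, best_score = None, threshold
    for name, enrolled in enrolled_by_speaker.items():
        score = cosine_similarity(sample, enrolled)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Toy 3-dimensional voiceprints (hypothetical values, for illustration only).
enrolled = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.35]

print(verify(sample, enrolled["alice"]))  # compare against one claimed identity
print(identify(sample, enrolled))         # search across all enrolled speakers
```

Verification needs a single comparison, while identification has to compare the sample against every enrolled speaker and also handle the case that the speaker is not enrolled at all.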
The field within voice tech with the least publicity is analyzing HOW someone is speaking, which is often called paralinguistics or voice sound analysis. In contrast to the ‘Who’ and the ‘What’ described above, even the definition is more complex in this case:
When looking at the topic from a linguistic perspective (linguistics being the overall scientific study of language), the HOW would fall into the subdiscipline of phonetics, which concerns the acoustics, production, and perception of speech sounds.
Yet much of the research approaches the topic from a different angle, namely from a communication point of view. In this view, HOW someone is speaking is seen as one type of non-verbal communication, like kinesics (the study of body movements and gestures), proxemics (the amount of space that people feel it necessary to set between themselves and others) or oculesics (the study of eye movement, eye behavior, gaze, and eye-related nonverbal communication). Specifically, paralinguistics is defined by the Cambridge dictionary as “connected with the ways in which people show what they mean other than by the words they use, for example by their tone of voice, or by making sounds with the breath”.
In recent years, the automatic analysis of the HOW by algorithms and artificial intelligence has gained prominence. PeakProfiling is a pioneer in the field. Specifically, we are addressing the topic with our background in musicology, analyzing musical categories in speech:
In order to compare the popularity of the three fields above, we used ubersuggest.com to analyze how many searches are conducted on search engines like Google. Specifically, we derived the top 2,000 keywords in voice tech (excluding purely generic terms like “voice”) and classified them into semantics vs. biometrics vs. paralinguistics. The result can be seen in the graphic below:
The results are clear: 66% of the top tail searches about voice tech on search engines concern semantics. Interestingly, text-to-speech is the biggest subcategory, closely followed by voice memo requests. Speech-to-text and translation services are the other big categories within the WHAT. The presumable reason for this popularity is the usage in day-to-day applications that make the interaction with content more comfortable, e.g. people using speech-to-text algorithms to dictate instead of typing themselves, and letting text-to-speech algorithms read text aloud so that they can consume it as audio.
26% of all searches are about speaker recognition (the WHO). Surprisingly, internet users’ interest seems to be less about classic subtopics in the field and more about the ability to change the voice (“voice changer”). This is relevant in gaming, for example, when players want to distort their own voice so that other humans can no longer recognize them (although they might still be easily recognizable by voice biometric algorithms). Not only the extent of this phenomenon is surprising, but also its development over time. Looking at the indexed, normalized search traffic for “voice changer” on Google, worldwide interest has recently been increasing strongly:
The HOW is the smallest category, accounting for only 8% of the top tail search traffic in voice tech. Furthermore, the vast majority of this category is not what we typically expect from a voice tech perspective, i.e. analyzing how voices sound using algorithms.
Instead, most search traffic in the field is about how people can improve or train their voices to sound better. Paralinguistics in a narrow sense only accounts for 0.6% of all top tail searches in voice tech. This demonstrates that, while the WHO and the WHAT are fields with a high degree of saturation, the HOW is comparatively still a dynamic field in its early phase.
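For readers who want to run a similar comparison on their own keyword list, the kind of classification and share calculation described above could be sketched as follows. This is a simplified illustration: the matching rules, the keywords, and the search volumes are made up for the example and are not the rules or data behind the figures in this article. It also assumes that shares are weighted by search volume.

```python
from collections import defaultdict

# Hypothetical substring hints per category (not the rules used for this article).
CATEGORY_HINTS = {
    "WHAT": ["text to speech", "speech to text", "voice memo", "translat"],
    "WHO": ["voice changer", "voice id", "biometric"],
    "HOW": ["paralinguistic", "voice training", "tone of voice"],
}

def classify(keyword):
    """Return the first category whose hint occurs in the keyword, else None."""
    lowered = keyword.lower()
    for category, hints in CATEGORY_HINTS.items():
        if any(hint in lowered for hint in hints):
            return category
    return None  # purely generic or unrelated keyword, excluded from the shares

def category_shares(volumes):
    """Percentage of classified search volume per category."""
    totals = defaultdict(int)
    for keyword, volume in volumes.items():
        category = classify(keyword)
        if category is not None:
            totals[category] += volume
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {c: 100 * v / grand_total for c, v in totals.items()}

# Made-up monthly search volumes, for illustration only.
example_volumes = {
    "text to speech online": 500,
    "voice memo app": 200,
    "voice changer for gaming": 250,
    "voice training exercises": 50,
    "voice": 900,  # generic term, matches no category and is ignored
}

shares = category_shares(example_volumes)
print(shares)
```

Simple substring rules like these misclassify ambiguous keywords, so a manual review of borderline terms is advisable before drawing conclusions from the resulting shares.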
Within this growing niche, PeakProfiling is a pioneer and leading provider of analytical solutions.