Voice has the potential to revolutionize the way we interact with machines. Today, it is still common to type commands into a laptop, a phone or any other device, such as a kitchen appliance or a car. Wouldn’t speaking with a machine be a much more natural way to interact?
The first clear signs of a shift towards voice interaction have been around for some time. As early as 2016, for example, Google reported that 20% of searches in the Google App in the US were done by voice.
Since then, the wide adoption of voice assistants like Siri, Alexa or Cortana has accelerated this development. According to Statista estimates, 4.2 billion digital voice assistants are used in devices around the world in 2020 - and this number is expected to double by 2024.
PeakProfiling is convinced that speech will become the primary way in which humans interact with machines. Important preconditions like accurate speech recognition are already sufficiently advanced today. At the same time, today’s speech recognition systems tend to understand the content of speech, but not its meaning.
Moreover, a plethora of information is communicated non-verbally. For example, saying “yes” in different tones of voice can mean anything from a strong confirmation to an obviously ironic “no”. This is where affective voice computing comes into play: recognizing human affect so that machines can eventually become more empathic.
Activity in the autonomic nervous system affects a person's speech, and affective algorithms can leverage this information to recognize emotion (Cummins et al., 2018). Detecting emotions from the human voice has a long research tradition (Gunawan, Alghifari, Morshidi, & Kartiwi, 2018). For example, Fairbanks and Pronovost wrote about “An experimental study of the pitch characteristics of the voice during the expression of emotion” as early as 1939.
Research has prospered ever since. In recent years, approaches applying artificial intelligence account for the largest part of the achievements.
Most scientific research features either dimensional approaches to classify emotions, e.g. valence vs. arousal, or categorical approaches like Ekman's Basic Emotions (Juslin & Scherer, 2008). Both can be reasonably well measured from the voice with standard analytical approaches.
While these approaches typically work well in the laboratory, many of them fail in real-life applications. For example, some of our clients who were systematically assessing tools on the market saw success rates drop towards chance levels after only slight changes in the environment. Why is this?
The most important reason lies in the data sets that many algorithms are based on. There are three types of emotion data (Gangamohan, Kadiri, & Yegnanarayana, 2016):
Acted (actors play the emotions)
Induced (emotions stimulated in the lab)
Natural (spontaneous speech annotated with emotion labels)
Many algorithms in the field are built purely on acted emotion data: actors play specific emotional states, often Ekman’s basic emotions. This comes with many advantages: data is widely available and can be produced in a controllable way. Actors are able to play the emotion in a very pure way - in “high quality”, so to speak. Similarly, recording quality is good and conditions are kept constant. Moreover, actors can sustain the emotion over longer time frames. Algorithms built on such data can achieve high success rates, often outperforming human raters.
However, when such algorithms are applied in real-life situations with spontaneous speech, performance typically crashes. Humans tend to express emotions in a much more subtle way - most of the time, voices are not loaded with massive emotions. Moreover, emotions can fluctuate within short time frames, and a constant expression of the same emotion over a long time is rather rare.
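To judge whether a success rate has really dropped "towards chance levels", it helps to compare it against simple baselines. The following is a minimal sketch, not a description of any specific tool: the function names and the example labels are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def chance_level(labels):
    """Accuracy of uniform random guessing over the distinct classes."""
    return 1.0 / len(set(labels))


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def accuracy(y_true, y_pred):
    """Share of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical reference labels and predictions, for illustration only.
y_true = ["anger", "joy", "sadness", "fear", "joy", "joy"]
y_pred = ["anger", "joy", "joy", "fear", "sadness", "joy"]

print(chance_level(y_true))       # 0.25 (four distinct classes)
print(majority_baseline(y_true))  # 0.5 ("joy" is 3 of 6 labels)
print(accuracy(y_true, y_pred))
```

With unbalanced classes, the majority baseline is usually the stricter reference: a model that only reaches this value on spontaneous speech has learned little beyond the class distribution.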
The other two forms of emotion data, induced and natural data, are typically less available and costly to generate. Moreover, emotion induction has a failure rate, and annotating real-world emotions raises questions about the “source of truth” for the emotion at hand. Overall, there is no single perfect way to generate a strong data basis on which to build the algorithms.
This is where PeakProfiling and its vast experience come in. We have found ways to combine the different data sources in a scalable way - acted, induced and natural emotion data are mixed for maximum robustness. We typically do this in an industry-specific manner, adjusting the signals for dedicated sources of noise. For instance, removing environmental sound is essential to achieve the highest quality in an industrial setup.
On this basis, we apply our unique analytics technology based on decades of research in quantitative musicology. If you are interested in measuring emotional states from the voice, just let us know and we will be happy to share some demo applications.
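As a rough illustration of what mixing data sources can look like, the sketch below draws training samples from acted, induced and natural pools according to fixed weights. It is an assumption made for illustration only and does not describe PeakProfiling's actual method: the function `mix_sources`, the file names and the weights are all hypothetical.

```python
import random


def mix_sources(sources, weights, n, seed=0):
    """Draw n samples, choosing the source of each one according to weights.

    sources: dict mapping a source name to a list of recordings
    weights: dict mapping a source name to its relative sampling weight
    """
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    names = list(sources)
    picks = rng.choices(names, weights=[weights[k] for k in names], k=n)
    # Sample with replacement inside each source, so small pools can be reused.
    return [(name, rng.choice(sources[name])) for name in picks]


# Hypothetical recording pools; natural data is scarcer than acted data.
sources = {
    "acted": [f"acted_{i:03d}.wav" for i in range(100)],
    "induced": [f"induced_{i:03d}.wav" for i in range(30)],
    "natural": [f"natural_{i:03d}.wav" for i in range(10)],
}
# Hypothetical weights that oversample the scarce natural recordings.
weights = {"acted": 1.0, "induced": 1.0, "natural": 1.0}

batch = mix_sources(sources, weights, n=12)
for source_name, recording in batch:
    print(source_name, recording)
```

Equal weights mean that each source contributes about a third of the batch regardless of pool size, which keeps the abundant acted data from dominating. Whether that ratio is appropriate depends on the application.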