More voice, less choice: The rise of voice interfaces and the decline of open source voice
Tux Theatre | Sun 24 Jan 4:40 p.m.–5:25 p.m.
Presented by
Kathy Reid
@kathyreid
https://kathyreid.com.au
Kathy Reid works at the intersection of open source, emerging technologies and the communities that bring them to life. She has twenty years' experience across development, developer and technical leadership, and management roles in education and emerging technology.
She is currently with Mozilla's Voice team, and is doing a PhD with ANU's 3A Institute on how open voice technology goes to scale.
Abstract
It's estimated that by 2023, there will be 8 billion voice assistants in the world. We find them on our mobile devices. We find them in our cars. We find them embodied in hardware devices, situated in our living rooms, our kitchens, our bedrooms. Voice user interfaces - using natural spoken language to issue commands to a computing system - have been around for decades, both in reality and in our imagination. Advances in machine learning and in embedded hardware mean that voice interfaces are experiencing a renaissance - just ask Siri, or Alexa, or OK Google.
At the same time, traditional mainstays of open source speech recognition and voice synthesis are in decline. For example, the CMU Sphinx team has moved on to a commercial project, and Common Voice and DeepSpeech from Mozilla are no longer actively supported, with their future (at the time of writing) unclear. The open source voice assistant community - Mycroft, Julius, Sepia, OVAL - is fragmented, and struggling to gain traction because content providers - such as Spotify - have the power to dictate which channels their content can be accessed through. This reinforces the power and dominance of proprietary players.
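As a concrete reminder of what is at risk of being lost, here is a minimal sketch of offline transcription with the archived DeepSpeech 0.9.x Python package. The model filename and the audio file are assumptions for illustration; the API calls (`Model`, `stt`) are from the 0.9.x release series.

```python
# Minimal offline speech-to-text with Mozilla DeepSpeech 0.9.x.
# Assumes `pip install deepspeech` and a downloaded acoustic model
# (e.g. deepspeech-0.9.3-models.pbmm). hello.wav is a hypothetical
# 16 kHz, 16-bit, mono WAV recording.
import wave

import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.9.3-models.pbmm")

with wave.open("hello.wav", "rb") as w:
    assert w.getframerate() == 16000, "DeepSpeech expects 16 kHz audio"
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # prints the recognised transcript
```

Everything here runs locally, with no audio sent to a third party - which is precisely what the proprietary assistants do not offer.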
Those same proprietary players are driven by commercial considerations to operate in markets that yield a profit - mostly white, Western markets. Without open source options, much of the Global South will be under-served by voice technology and its benefits - such as the ability to overcome low literacy, or to gain instant access to information in your own language without a computer.
If voice technology delivers benefits to some groups of people in the world while reinforcing digital divides around inclusion, equity and access for everyone else, shouldn't voice technology be a public good?
There are several things we can do to intervene in this state of affairs, including:
- gaining more widespread support for, and adoption of, open source voice assistants, which strengthens the case for content providers to support them;
- investing in voice technology for low-resource languages, through measures such as corpus creation and tooling for training models in those languages (see the sketch after this list);
- ensuring that we know the value our voice data has for proprietary providers, and choosing judiciously when we share it;
- and enabling more services that benefit marginalised groups to be delivered through voice user interfaces.
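To illustrate what corpus work looks like in practice, here is a minimal sketch of inspecting a Common Voice corpus with the Hugging Face `datasets` library. The dataset identifier, the Welsh ("cy") locale, and the field names are assumptions based on the legacy Common Voice release; they vary between dataset versions.

```python
# Sketch: inspecting a Common Voice corpus for a low-resource language.
# Assumes `pip install datasets` and that the legacy "common_voice"
# dataset with a Welsh ("cy") configuration is downloadable; field
# names may differ in newer Common Voice releases.
from datasets import load_dataset

cv_cy = load_dataset("common_voice", "cy", split="train")

print(len(cv_cy), "training clips")
example = cv_cy[0]
print(example["sentence"])       # the prompt the speaker read aloud
print(example["audio"]["path"])  # path to the corresponding audio clip
```

Corpora like this one - crowdsourced sentence-plus-recording pairs - are the raw material any open speech recogniser for an under-served language has to start from.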
But there are significant barriers to achieving voice technology that is more accessible to all, including:
- the vast amounts of data required for training (see the back-of-envelope sketch after this list);
- the high barriers to entry of deep learning and machine learning infrastructure;
- and the diversity of languages and accents.
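To give the first barrier some scale, here is a back-of-envelope sketch of the raw storage a speech training corpus occupies. The 10,000-hour figure is illustrative only - it is in the range often cited for large commercial models, not a quoted requirement.

```python
# Back-of-envelope: raw size of a speech training corpus.
# The 10,000-hour figure is illustrative, not a quoted requirement.
hours = 10_000
sample_rate = 16_000   # samples per second (16 kHz)
bytes_per_sample = 2   # 16-bit PCM, mono

total_bytes = hours * 3600 * sample_rate * bytes_per_sample
print(f"{total_bytes / 1e12:.2f} TB of raw audio")  # ~1.15 TB
```

And that is just storage: collecting, validating and transcribing that much speech, then training on it, is what puts open projects at a structural disadvantage.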
This talk will provide an overview of the landscape, and identify where, and how, we can best intervene to ensure everyone has a voice.