I’m Jeff (@jffwng) and I’m the Product Manager for the Universal Speech Translator project at Meta’s FAIR (Fundamental AI Research). We recently released a speech translation model that supports a primarily...
I was struck by this passage in the Universal Speech Translator project overview -- "Consider a sentence in German, “Ich möchte alle Sprachen übersetzen,” and its equivalent in Spanish, “Quisiera traducir todos los idiomas.” Both mean “I would like to translate all languages.” But translating from German to English in real time would be more challenging because the verb “translate” appears at the end of the sentence, while the word order in Spanish and English is similar." How did you approach this in your translation model for Hokkien? If real-time translation is not fully solved today, do you have any thoughts on how this will be solved in the future?
The featured system is not yet streaming-capable. Our aim is to eventually support real-time multimodal translation for many languages. We are currently developing streaming models for speech-to-speech translation, and we already have experience developing streaming models for speech-to-text translation (see, for example, https://ieeexplore.ieee.org/abstract/document/9414897).
Hello, what datasets are the speech translation models trained on? Does the team have audio recordings of the Hokkien language (that are labelled)? This is interesting to me for audio-only tribal dialects (e.g., tribal versions of Arabic).
The team has collected ~60 hours of English-Hokkien speech-to-speech translation data, but the main contributor to system quality is the large amount (~8,000 hours) of weakly supervised data that we generated from monolingual Hokkien speech by leveraging Mandarin as a pivot language and Mandarin-English translation.
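To give a rough sense of the pipeline, here is a minimal sketch of the pseudo-labeling idea. The two callables are hypothetical placeholders for a Hokkien-speech-to-Mandarin-text model and a Mandarin-English MT model, not our actual APIs:

    from typing import Callable, Iterable, List, Tuple

    def build_weakly_supervised_pairs(
        hokkien_clips: Iterable[str],
        hokkien_speech_to_mandarin_text: Callable[[str], str],
        mandarin_to_english: Callable[[str], str],
    ) -> List[Tuple[str, str]]:
        """Pseudo-label monolingual Hokkien speech using Mandarin as a pivot.

        Both callables are hypothetical placeholders: one stands in for a
        Hokkien speech -> Mandarin text model, the other for an existing
        Mandarin -> English translation model.
        """
        pairs = []
        for clip in hokkien_clips:
            zh_text = hokkien_speech_to_mandarin_text(clip)  # pivot transcription
            en_text = mandarin_to_english(zh_text)           # existing zh-en MT
            pairs.append((clip, en_text))                    # weak (speech, translation) pair
        return pairs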
What role will Meta AI play in advancing novel protein development and what implications might this have in the near future?
How does the Meta AI research team or UST project think about access to models? Is this translation model going to be accessible to researchers, developers, other companies?
Our team is part of FAIR (Fundamental AI Research), and one of our values is to be open, so we strive to open source data and models. For this project, you can access the model, code, and some data here: https://github.com/facebookresearch/fairseq/tree/ust. We also made our model available through Hugging Face here: https://huggingface.co/spaces/facebook/Hokkien_Translation. In the future we will continue to strive to provide open access to data and models to the extent possible.
Curious what some of the early iterations were, and whether there were any breakthroughs or it was a gradual iteration to your end result.
We had several research breakthroughs in the past couple of years before we built the English-Hokkien system. We first researched how to enable speech-to-speech translation for unwritten languages, simulated on written languages by using clean synthetic speech from TTS. We then improved model quality through self-supervised pre-training, data augmentation, and better model architectures. We built the English-Hokkien system once we had accumulated enough insights to show that our findings could be applied to a real-world use case and data setup.
Does the model have the ability to perform attribution on the tokens used at inference time?
By attribution, if you mean obtaining an alignment between source and target content, yes, it is possible via the attention mechanism (feel free to also check the papers for details on the model: https://arxiv.org/abs/2211.06474). Otherwise, could you elaborate on your question?
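To illustrate the general idea (this is a toy example with a random attention matrix, not our actual model), cross-attention weights can be turned into a crude source-target alignment like this:

    import torch

    # Toy illustration: derive a hard source-target alignment from a
    # cross-attention matrix. In a real seq2seq model this matrix would
    # come from a decoder cross-attention layer (often averaged over
    # heads); here it is just random.
    tgt_len, src_len = 5, 7
    attn = torch.softmax(torch.randn(tgt_len, src_len), dim=-1)

    # For each target position, take the most-attended source position.
    # Thresholding or smarter aggregation is common in practice.
    alignment = attn.argmax(dim=-1)
    for t, s in enumerate(alignment.tolist()):
        print(f"target position {t} <- source position {s}")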
Thank you Juan! By attribution I mean being able to determine the source of data that resulted in a given response for a given input. I will review the paper later, any call outs to specifics within https://arxiv.org/abs/2211.06474 would be appreciated as I am definitely a layperson here. Thanks again!
Thank you for clarifying! No, given an input and an automatic translation, we cannot identify a training sample or a subset of the training data that led the model to produce such translation.
What are you doing to ensure a lack of bias in the data that the model trains on?
We did several inspections of the content of the data collected from different sources and used data from the high-quality sources. Also, Hokkien has a wide variety of accents, and the Hokkien corpus (TAT) we chose was selected to cover as many accents across Taiwan as possible.
In the demos, the system translates between Hokkien and English, meaning one side of the translation does have a standard written representation. Does this model require that, or could it translate between two spoken-only languages?
The model can support translating between two spoken-only languages. However, having a written representation in a related language can improve the UnitY model's performance.
To the end-user, ML models are only as good as the interface that serves them.
Do you have any suggestions or ideas for the interface that might serve this model to end-users?
Do you think that voice will one day play an integral part in its interface? (A lot of the global population has low literacy, but can send voice memos, etc.)
I feel user interfaces would benefit if machine/speech translation were more integrated into existing workflows and user behaviors. Rather than having to take a break and open up a separate app for translations, translation should be something that "just works" or appears in a non-disruptive way. I do think voice will play an increasingly important role.
Possible that linguistics and the IPA can help? You could translate the spoken language into IPA, and then write a model which translates the IPA into English. Or does the IPA not actually capture all of Hokkien's nuance?
We do have prior work on leveraging IPA to improve our systems (see for example https://arxiv.org/abs/2204.05409). However, for Hokkien, it is very difficult to obtain an IPA form since grapheme-to-phoneme conversion tools do not support that language.
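For context, here is roughly what grapheme-to-phoneme conversion looks like with an open-source tool such as phonemizer for a supported language (English in this sketch); the point is that no comparable option exists for Hokkien, which is the gap mentioned above:

    # Requires: pip install phonemizer, plus the espeak-ng backend installed.
    from phonemizer import phonemize

    # G2P for a language the tool supports. There is no comparable
    # Hokkien option, which is why building an IPA pipeline for it is hard.
    ipa = phonemize("I would like to translate all languages.",
                    language="en-us", backend="espeak")
    print(ipa)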
For SpeechMatrix, it seems like multilingual performance is somewhat dependent on the choice of languages used during training. Is there a clever way we can choose languages during training, or is the ultimate answer just to use a bigger network?
Language selection for multilingual modeling is an open research problem. Choosing similar languages generally boosts the performance of low resource languages due to transfer learning but might hurt the performance of higher resource languages due to interference. And yes, increasing model capacity is one of the simple solutions.
Sweet, thanks for the reply! Quick follow-up: how do you gauge similarity here? Is it language similarity (syntax/semantics/word ordering), similarity of the source speech samples, or unit similarity?
I was referring to language similarity, for example choosing languages belonging to the same language family.
How do you evaluate speech-to-speech at the long tail? Seems like this would be increasingly difficult for less popular langs and might be different than for speech-to-text.
This is an open research area. Currently we rely on ASR-BLEU metrics for evaluating our speech-to-speech models, and the lack of enough data might make it hard to train high-quality ASR models for long-tail languages.
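In a bit more detail, ASR-BLEU means transcribing the system's output speech with an ASR model and scoring those transcripts against reference translations. A minimal sketch of the scoring step (assuming the ASR transcripts are already available) could look like this:

    # Requires: pip install sacrebleu
    # Sketch of the scoring half of ASR-BLEU: the generated target speech
    # has already been transcribed by an ASR model; we then compute BLEU
    # between those transcripts and the reference translations.
    import sacrebleu

    asr_transcripts = ["i would like to translate all languages"]
    references = [["I would like to translate all languages."]]  # one reference stream

    bleu = sacrebleu.corpus_bleu(asr_transcripts, references)
    print(f"ASR-BLEU: {bleu.score:.1f}")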
Can you preserve language-specific nuances with speech-to-speech?
For now we focus on translating the linguistic content and not preserving aspects of speech such as voice or emotion. For languages that have finer-grained concepts on certain topics, the translation is more likely to be constrained by the vocabulary available on the target language side, similar to the challenges faced by human translators.
Thanks for the reply! Quick follow-up: do translators have any tools to gauge the size of vocab gap b/w languages? Curious if there is any clever way to do this within the model itself.
Hokkien presents the challenge of not having a standard writing system. This means that we cannot rely on a transcription for the modeling and have to develop more direct or end-to-end approaches to the speech-to-speech translation task. In addition, we chose Hokkien as the first language to support because of its reach as well as our personal connection to it. Hokkien is a common language spoken by over 40 million people across China, Taiwan, Singapore, Malaysia, and the Philippines. We also have multiple team members who speak Hokkien, making it something we're personally passionate about.
What have been some of the surprising moments in developing this technology? E.g., what worked better than you thought it would? What did you think would help but ended up not helping very much? What took longer than you expected, and what took less time than you expected?
After we built the prototype of the system, I tested it with my parents, who speak Hokkien but don't know English. I was surprised that it worked well and that my parents could understand what the system said in Hokkien; they were very excited about it. The Hokkien speech synthesis part is more difficult than we thought because Hokkien is a tonal language. The current system still sometimes produces the wrong tone, which can change the meaning.
If you could give one piece of advice to the UST team two years ago, what would it be?
One thing that comes to mind is that it would be beneficial to plan ahead in order to find large amounts of data sources for raw audio and raw video as these resources become very scarce very quickly outside of the most spoken languages in the world.
Have a long-term vision, but also be ready to adapt and change quickly as the world changes around you.
One use case for this would be a meeting between manufacturers and buyers from different countries. How would the translator do in the case where there are people speaking over each other (but not the dominant sound)? Seems like for this to work well the room needs to be quite silent with one person speaking at a time.
Speaking over one another is a challenge, and translators currently don't do well here. Being able to distinguish who is speaking would help. This is still an open research area.
Do you need to set the language or is it recognized automatically? In my meetings there might be someone speaking Hokkien and another speaking Mandarin.
Currently the user has to select the right language direction to play with, as we don't yet support language identification, but it is something we're interested in adding in the future.
What are the biggest constraints for quality improvements here? My assumption is that it's data bottlenecks, like other projects in the space, but speech translation seems slightly different due to the vast amount of historical data we have to draw from.
Data is still the biggest bottleneck for quality improvements. However, our current speech-to-speech translation models are also constrained by other dependencies, such as unit extraction pipelines, vocoder quality, and the ASR models used for evaluation.
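To give a sense of what the unit extraction piece refers to: speech is encoded with a self-supervised model and the frame-level features are quantized into discrete units. Below is a rough sketch using torchaudio's HuBERT bundle and k-means; the checkpoint, layer, cluster count, and the file name example.wav are all illustrative rather than our production setup, and in practice the k-means codebook is fit on features from a large corpus, not a single clip:

    import torch
    import torchaudio
    from sklearn.cluster import KMeans

    # Encode audio with a self-supervised speech model (HuBERT here).
    bundle = torchaudio.pipelines.HUBERT_BASE
    model = bundle.get_model().eval()

    waveform, sr = torchaudio.load("example.wav")  # placeholder file name
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

    with torch.no_grad():
        # extract_features returns a list of per-layer feature tensors.
        layer_feats, _ = model.extract_features(waveform)
        frames = layer_feats[6].squeeze(0)  # intermediate layer, chosen for illustration

    # Quantize frame-level features into discrete units (cluster count is illustrative).
    kmeans = KMeans(n_clusters=100, n_init=10).fit(frames.numpy())
    units = kmeans.labels_
    print(units[:20])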
How much has this project cost since its inception? How does it contribute to revenue at the company, or if it does not yet contribute, how will it and when do you think that will start?
We're really proud to have pushed quality and language coverage (written and unwritten) to where we are today. That said, there are still challenges in the speed of translations, quality across different domains, and meaning beyond words. We're excited to take these on.
I'm a fairly standard server side software engineer. I'm just sort of lost about how AI could affect my career or what software development might look like going forward. My behavior hasn't changed that much but I keep feeling like we're on the verge of, like, a revolutionary transformation in what our jobs look like. What should I be looking at day to day, or year to year here?
I would suggest finding an AI expert as a mentor to help guide you through your career. A good way to start with ML/AI is to read textbooks (the Deep Learning book, Chris Bishop's book) and consult resources such as fast.ai or the Deep Learning course series on Coursera.
Pre-trained models are getting better and are making it easier for general software engineers to leverage ML models without necessarily needing to be an ML expert. Look into platforms that serve these models via an API.
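As one small example of that, a pre-trained translation model can be used in a few lines via the Hugging Face transformers pipeline; the task and model shown here are just a common public example, not related to the Hokkien system:

    # Requires: pip install transformers torch sentencepiece
    from transformers import pipeline

    # Load a small, publicly available pre-trained translation model.
    translator = pipeline("translation_en_to_fr", model="t5-small")

    result = translator("I would like to translate all languages.")
    print(result[0]["translation_text"])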
Can you explain the intuition behind why using Mandarin as the intermediate representation improves accuracy? Is it that there is a larger corpus of Hokkien to Mandarin training data and also a large corpus of Mandarin to English training data, but there is very little Hokkien to English training data?
Compared to text, speech has extra variations in the signal that are unrelated to the content, due to different speakers, background noise, etc. As a result, from the modeling perspective, having additional supervision from the text modality is helpful even for unwritten languages. From the data perspective, using Mandarin as an intermediate language indeed allows us to greatly increase the amount of data we can leverage through Mandarin-English data.
Amazing progress so far on the models! What ML advances do you think need to happen next for UST to significantly improve?
Thank you! Next, we would like to make more advances on streaming/low-latency models and support more (unwritten) languages. Important ML advances that will benefit us down the road are models that can better represent speech and text in the same space, better model architectures, improved multimedia data accessibility (larger amounts of audio and video data in multiple languages), and possibly large multimodal language models (similar to how the PaLM model is obtaining state-of-the-art results on text machine translation).
Can you, or will you, open source some or all of the intermediate tools you used to build the intermediate results of NLLB?
I mean the tagging platforms, intermediate mining tools, etc?
I noticed that several of the tools described in the paper are not available (at least to my knowledge)
A lot of the mining-related tools are available in this repo: https://github.com/facebookresearch/stopes. If you find something is missing or could be better supported, please feel free to open an issue and we will try to help with that. Thanks!
How does the team plan to address the challenges of accurately translating idiomatic expressions, slang, and other language nuances that cannot be easily translated through a literal interpretation of the words? I imagine differences in sentence structure and grammar can make it difficult for a machine learning model to accurately capture the intended meaning of the speaker.