
Pushing the frontiers of audio generation



Technologies

Published
30 October 2024
Authors

Zalán Borsos, Matt Sharifi and Marco Tagliasacchi

An illustration depicting speech patterns, iterative progress on dialogue generation, and a relaxed conversation between two voices.

Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Speech is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology for generating natural, dynamic voices continues to improve, we’re unlocking richer, more engaging digital experiences.

Over the past few years, we’ve been pushing the frontiers of audio generation, developing models that can create high-quality, natural speech from a range of inputs, like text, tempo controls and specific voices. This technology powers single-speaker audio in many Google products and experiments, including Gemini Live, Project Astra, Journey Voices and YouTube’s auto dubbing, and helps people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Working together with partners across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue to make complex content more accessible:

  • NotebookLM Audio Overviews turns uploaded documents into engaging and lively dialogue. With one click, two AI hosts summarize user material, make connections between topics and banter back and forth.
  • Illuminate creates formal AI-generated discussions about research papers to help make knowledge more accessible and digestible.

Here, we provide an overview of our latest speech generation research underpinning all of these products and experimental tools.

Pioneering techniques for audio generation

For years, we’ve been investing in audio generation research and exploring new ways to generate more natural dialogue in our products and experimental tools. In our earlier research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.

This extended our earlier work on SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input without compromising its quality. As part of the training process, SoundStream learns how to map audio to a range of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
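
The post doesn’t spell out the codec’s interface, but its role can be pictured as a simple encode/decode round trip. Here is a minimal, illustrative sketch of that interface; the class name, the random placeholder tokens and all the sizes (frame rate, levels per group, vocabulary) are assumptions for illustration, not SoundStream’s actual design.

```python
import numpy as np

class ToyNeuralCodec:
    """Interface-level stand-in for a neural audio codec like SoundStream.

    A real codec uses learned encoder/decoder networks with residual
    vector quantization; here we only mimic the shape of the data:
    waveform in -> grid of discrete token ids -> waveform out.
    """

    def __init__(self, frame_rate=50, num_levels=8, vocab_size=1024, seed=0):
        self.frame_rate = frame_rate    # token groups per second of audio
        self.num_levels = num_levels    # coarse-to-fine tokens per group
        self.vocab_size = vocab_size
        self.rng = np.random.default_rng(seed)

    def encode(self, waveform, sample_rate):
        num_frames = int(len(waveform) / sample_rate * self.frame_rate)
        # Placeholder: random ids where a trained encoder would run.
        return self.rng.integers(0, self.vocab_size,
                                 size=(num_frames, self.num_levels))

    def decode(self, tokens, sample_rate):
        num_samples = int(tokens.shape[0] / self.frame_rate * sample_rate)
        return np.zeros(num_samples)    # a trained decoder would synthesize audio

codec = ToyNeuralCodec()
audio = np.zeros(16000 * 2)                    # 2 seconds of 16 kHz audio
tokens = codec.encode(audio, sample_rate=16000)
print(tokens.shape)                            # (100, 8): time frames x levels
```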

AudioLM treats audio generation as a language modeling task that produces the acoustic tokens of codecs like SoundStream. As a result, the AudioLM framework makes no assumptions about the type or makeup of the audio being generated, and can flexibly handle a variety of sounds without needing architectural adjustments. This makes it a good candidate for modeling multi-speaker dialogues.
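
Put differently, once audio is a sequence of discrete tokens, generating it reduces to the same next-token-prediction loop used for text. The sketch below shows that loop in miniature; the placeholder model and the sampling details are ours, where a real system like AudioLM uses trained Transformer stages.

```python
import numpy as np

def sample_next_token(logits, rng, temperature=1.0):
    """Sample one token id from the categorical distribution over the vocab."""
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(model, prompt_tokens, num_new_tokens, seed=0):
    """Autoregressive decoding: each new acoustic token is conditioned on
    everything generated so far, exactly as in text language models."""
    rng = np.random.default_rng(seed)
    seq = list(prompt_tokens)
    for _ in range(num_new_tokens):
        logits = model(seq)                    # shape: (vocab_size,)
        seq.append(sample_next_token(logits, rng))
    return seq

# Placeholder "model" with uniform logits; a trained model would score
# continuations so that the decoded audio is coherent speech.
uniform_model = lambda seq: np.zeros(1024)
tokens = generate(uniform_model, prompt_tokens=[0, 1, 2], num_new_tokens=10)
print(len(tokens))  # 13
```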

Example of a multi-speaker dialogue generated by NotebookLM Audio Overview, based on a few potato-related documents.

Building on this research, our latest speech generation technology can produce 2 minutes of dialogue, with improved naturalness, speaker consistency and acoustic quality, when given a script of dialogue and speaker turn markers. The model also performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means it generates audio more than 40 times faster than real time.
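
The speed-up figure follows directly from those two numbers:

```python
audio_seconds = 120       # 2 minutes of generated dialogue
compute_seconds = 3       # upper bound for one inference pass on a TPU v5e
print(audio_seconds / compute_seconds)  # 40.0, i.e. over 40x real time
```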

Scaling our audio generation models

Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we created an even more efficient speech codec that compresses audio into a sequence of tokens at rates as low as 600 bits per second, without compromising the quality of its output.

The tokens produced by our codec have a hierarchical structure and are grouped by time frames. The first tokens within a group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
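
One natural way to serialize such a structure, assuming frame-major order (our illustration, with toy sizes), is to flatten the grid one time frame at a time, with the coarse phonetic and prosodic tokens leading each group:

```python
# Illustrative sizes only: tokens grouped by time frame, coarse levels first.
num_frames, num_levels = 4, 3
grid = [[f"f{t}_l{k}" for k in range(num_levels)] for t in range(num_frames)]

# Flatten frame by frame: within each group, level 0 (phonetic/prosodic)
# precedes the finer acoustic-detail levels.
flat = [token for frame in grid for token in frame]
print(flat[:6])  # ['f0_l0', 'f0_l1', 'f0_l2', 'f1_l0', 'f1_l1', 'f1_l2']
```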

Even with our new speech codec, producing a 2-minute dialogue requires generating over 5,000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
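
These figures are mutually consistent, as a quick back-of-the-envelope check (ours, not from the post) shows: a 600 bit-per-second budget over 2 minutes allows roughly 14 bits of information per token.

```python
bits_per_second = 600
dialogue_seconds = 2 * 60
num_tokens = 5000                      # "over 5,000 tokens" per 2-minute dialogue

total_bits = bits_per_second * dialogue_seconds
print(total_bits)                      # 72000 bits for the whole dialogue
print(total_bits / num_tokens)         # <= 14.4 bits per token
print(num_tokens / dialogue_seconds)   # >= ~41.7 tokens per second
```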

With this technique, we can efficiently generate acoustic tokens that correspond to the dialogue within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.
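
Putting the pieces together, inference is a loop that emits one frame group of acoustic tokens at a time and then hands the finished grid to the codec decoder. The sketch below is schematic and self-contained; the class names, the random stand-in tokens and the fixed stopping rule are illustrative assumptions, not the production system.

```python
import numpy as np

class ToyDialogueLM:
    """Stand-in for the dialogue model: emits one frame group of acoustic
    tokens per autoregressive step. The real model is a Transformer
    conditioned on the script and speaker turn markers."""

    def __init__(self, num_levels=8, vocab_size=1024, frames_to_emit=100, seed=0):
        self.num_levels = num_levels
        self.vocab_size = vocab_size
        self.frames_to_emit = frames_to_emit
        self.rng = np.random.default_rng(seed)

    def finished(self, script, groups):
        return len(groups) >= self.frames_to_emit

    def next_group(self, script, groups):
        # Placeholder: random ids where the trained model would predict.
        return self.rng.integers(0, self.vocab_size, size=self.num_levels)

def toy_decode(token_grid, sample_rate=16000, frame_rate=50):
    """Codec-decoder stand-in: a trained decoder would synthesize speech."""
    return np.zeros(int(token_grid.shape[0] / frame_rate * sample_rate))

def synthesize(script, lm, sample_rate=16000):
    groups = []
    while not lm.finished(script, groups):
        groups.append(lm.next_group(script, groups))
    return toy_decode(np.array(groups), sample_rate)

waveform = synthesize("A: Hi there! B: Hello!", ToyDialogueLM())
print(len(waveform))  # 32000 samples = 2 seconds at 16 kHz
```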

Animation showing how our speech generation model produces a stream of audio tokens autoregressively, which are decoded back into a waveform consisting of a two-speaker dialogue.

To teach our model how to generate realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. Then we finetuned it on a much smaller dataset of dialogue with high acoustic quality and precise speaker annotations, consisting of unscripted conversations from a number of voice actors, complete with realistic disfluencies (the “umm”s and “aah”s of real conversation). This step taught the model how to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.

In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we’re incorporating our SynthID technology to watermark non-transient AI-generated audio content from these models, helping safeguard against the potential misuse of this technology.

New speech experiences ahead

We’re now focused on improving our model’s fluency and acoustic quality, and on adding more fine-grained controls for features like prosody, while exploring how best to combine these advances with other modalities, such as video.

The potential applications for advanced speech generation are vast, especially when combined with our Gemini family of models. From enhancing learning experiences to making content more universally accessible, we’re excited to continue pushing the boundaries of what’s possible with voice-based technologies.

Acknowledgements

Authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alex Tudor, Victor Ungureanu, Karolis Misiunas, Sertan Girgin, Jonas Rothfuss, Jake Walker and Marco Tagliasacchi.

We thank Leland Rechis, Ralph Leith, Paul Middleton, Poly Pata, Minh Truong and RJ Skerry-Ryan for their critical efforts on dialogue data.

We’re very grateful to our collaborators across Labs, Illuminate, Cloud, Speech and YouTube for their outstanding work bringing these models into products.

We also thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Tokumine and James Zhao for their guidance on the project.
