Exploring Automated Voice Casting for Content Localization Using Deep Learning
Date & Time
Wednesday, November 11, 2020, 8:00 PM - 8:30 PM
Session Type
Technical session
Aansh Malik


The process of casting talent for dubbing source-language content into target languages, known as voice casting, consists largely of a manual workflow that could benefit considerably from increased automation. Recent advances in deep learning architectures for sequential data processing are providing the impetus needed to realize various AI-enabled audio processing workflows. In particular, applications such as speaker verification and speech synthesis have gained significant traction with the advent and maturing of recurrent neural networks. We explore the viability of applying advances in deep learning for text-independent speaker verification (TI-SV) to computer-aided voice casting. To this end, we propose and develop an automated voice casting tool that uses similarity scores generated from neural network embeddings, produced by a robust autoencoder model trained for TI-SV, to rank voiceover artists across different languages in the voice casting process. To evaluate the efficacy of the proposed approach, we conduct a subjective study that closely emulates a simplified voice casting process on actual voice testing kits (dubbing auditions) from our content. We also use casting decisions from casting experts to further evaluate the tool and to assess the subjectivity involved in the voice casting process. We achieve promising results for the automated tool and show that using similarity scores from speaker embeddings, produced by autoencoders trained on TI-SV, directly in the domain of voice casting could be a viable approach to automating the voice casting process, one that warrants further exploration.
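The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure the tool uses, so cosine similarity is an assumption here. The function names, artist names, and three-dimensional embedding values are also hypothetical and chosen only for illustration; real speaker embeddings would come from the trained model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def rank_candidates(source_embedding, candidate_embeddings):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {
        name: cosine_similarity(source_embedding, embedding)
        for name, embedding in candidate_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy values (assumption): embedding of the source-language actor's voice.
source = [0.9, 0.1, 0.3]

# Toy values (assumption): embeddings of target-language audition recordings.
candidates = {
    "artist_a": [0.8, 0.2, 0.4],
    "artist_b": [0.1, 0.9, 0.2],
    "artist_c": [0.5, 0.5, 0.5],
}

ranking = rank_candidates(source, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this sketch the candidate whose embedding points in the direction closest to the source voice is listed first; a casting expert would still make the final decision from the ranked list.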

Technical Depth of Presentation
What Attendees will Benefit Most from this Presentation
Engineers, managers, and technologists interested in AI and its novel applications in the media and entertainment industry.
Take-Aways from this Presentation
Deep learning-based neural speaker embedding systems have viable applications in voice-dependent media and entertainment workflows.
Voice casting for content localization is a difficult and subjective process that could benefit from trustworthy objective metrics to guide it.
Adopting AI-based technologies to augment manual procedures in media and entertainment can provide substantial benefits in scalability, quality assurance, and media output under creative guidance.