Davide Salvi, Clara Borrelli, Paolo Bestagini, Fabio Antonacci, Matthew Stamm, Lucio Marcenaro, Angshul Majumdar
The possibility of manipulating digital multimedia material is nowadays within everyone’s reach. In the audio case, anybody can create fake synthetic speech tracks using various methods with almost no effort [1]. These methods range from simple waveform concatenation operations to more complex neural networks [2], [3].
The misuse of counterfeit speech data can lead to severe threats [4]. Therefore, forensic researchers have devoted considerable effort to developing detectors capable of distinguishing original speech recordings from synthetically generated ones [5].
As synthetic speech becomes more prevalent, it is increasingly important not only to detect when a speech signal is forged but also to identify which specific algorithm was used to generate it. This second task is called synthetic speech attribution, and Figure 1 provides a visual representation of it. This problem has received less attention in the literature, but it is paramount for pinpointing the author of illicit material [6].
Figure 1. A representation of the synthetic speech attribution problem: given an artificially generated speech track, the problem is to detect which algorithm was used to synthesize it.
Due to its importance in the multimedia forensics scenario, synthetic speech attribution has been selected as the topic of the 2022 edition of the IEEE Signal Processing Cup (SP Cup). The SP Cup is a student competition in which undergraduate students form teams to work on real-life challenges. Each team should include one faculty member as an advisor, at most one graduate student as a mentor, and three to 10 undergraduate students. All of the teams participated in a preliminary competition, while the top three teams were selected to present their work at the final stage of the contest.
Due to the COVID-19 pandemic, the competition was held as a virtual event. Nonetheless, part of the competition took place onsite in Singapore at ICASSP 2022: the three finalist teams had the chance to present their work during the show-and-tell poster session (Figure 2) and then participated in the final awards ceremony.
Figure 2. The show-and-tell poster session at ICASSP 2022 in Singapore.
In this article, we share an overview of the IEEE SP Cup experience, including the competition tasks, participating teams, technical approaches, and statistics.
Solving the problem of synthetic speech attribution consists of determining which speech generator has been used to synthesize a speech track under analysis.
To comprehend the ability of forensic detectors to identify which type of synthetic speech generator forged a given track, it is essential to first review how the generators themselves synthesize speech.
During the whole competition, we only considered speech data generated with text-to-speech (TTS) methods. These are algorithms that take a text as input and vocalize it. Most of the adopted methods use a data-driven approach and aim to synthesize speech using the voices of the speakers seen during training. Consequently, the number of speakers that each algorithm can reproduce depends on how it has been trained. In general, TTS systems follow a two-stage pipeline: the first block takes a text as input and generates a mel spectrogram, while the second is a vocoder that translates the output of the first block into an actual waveform. Only a few methods follow an end-to-end approach, generating speech directly from the input text.
Given this brief overview of TTS systems, it is clear that both pipeline components can leave traces in the final audio track. In particular, artifacts can arise from both the spectrogram generator and the vocoder.
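As a small illustration of this two-stage structure, the Python sketch below (assuming the librosa library is available) computes an 80-band mel spectrogram, the kind of intermediate representation the first stage produces, and inverts it back to a waveform with Griffin-Lim standing in for a neural vocoder. The input here is a synthetic chirp rather than real speech, and no text-to-mel model is involved; this is only meant to show where each stage can leave its mark.

```python
# Sketch of the two-stage TTS structure: a mel spectrogram (stage 1 output)
# followed by a vocoder (stage 2). Griffin-Lim stands in for the neural
# vocoders used by real TTS systems; the "speech" is a synthetic chirp.
import librosa

sr = 16000  # sampling frequency used throughout the competition
y = librosa.chirp(fmin=100, fmax=4000, sr=sr, duration=2.0)  # placeholder signal

# Stage 1 output format: an 80-band mel spectrogram.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Stage 2: vocoder. Griffin-Lim phase reconstruction leaves characteristic
# artifacts, exactly the kind of trace an attribution system can exploit.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
```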
Synthetic speech attribution systems are based on the hypothesis that these traces are characteristic of each generation algorithm. Therefore, by analyzing them with signal processing and deep learning techniques, it is possible to link a given speech track to the TTS method used to synthesize it. In the literature, these traces have often been used to discriminate between real and synthetic speech data. Among the methods that have been presented in this regard, some exploit low-level artifacts [7], [8], while others leverage more complex aspects [9], [10].
However, in addition to synthetic speech detection, the synthetic speech attribution task is also crucial and worthy of investigation. In fact, knowing which system has been used to synthesize a given speech track could be of fundamental importance for identifying the author of illicit material [6], [11]. Similar problems have proved to be of great interest in other contexts (e.g., camera model identification [12] and recording device identification [13]), so it is natural to extend the attribution problem to the synthetic speech field.
Broadly speaking, synthetic speech attribution algorithms typically operate by designing a signal processing technique to extract a particular forensic trace from an audio signal. Synthetic speech “fingerprints” are then learned by extracting traces from many tracks synthesized by a particular generator and then repeating this process for several different TTS systems. After this, these traces are used as classification features when training a machine learning algorithm, such as a support vector machine or a neural network, to recognize a synthetic speech generation method.
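To make this generic recipe concrete, the following Python sketch implements a minimal version of it, with averaged mel-frequency cepstral coefficients (MFCCs) as the forensic feature and a support vector machine as the classifier. The data/<generator_name>/*.wav folder layout is hypothetical, and this is only an illustration of the general pipeline, not any team's actual system.

```python
# Minimal attribution pipeline: hand-crafted features (averaged MFCCs) feed a
# classical classifier (an SVM). Folder layout data/<generator_name>/*.wav is
# hypothetical, with one folder per TTS system.
from pathlib import Path
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mfcc_fingerprint(path, sr=16000, n_mfcc=20):
    """Summarize a track with the mean and standard deviation of its MFCCs."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

feats, labels = [], []
for gen_dir in sorted(Path("data").iterdir()):  # one folder per TTS system
    if not gen_dir.is_dir():
        continue
    for wav in gen_dir.glob("*.wav"):
        feats.append(mfcc_fingerprint(wav))
        labels.append(gen_dir.name)

X, y = np.stack(feats), np.array(labels)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X_tr, y_tr)
print("closed-set accuracy:", clf.score(X_te, y_te))
```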
The goal of this competition was for teams to design and develop a system for synthetic speech attribution: given a synthetically generated speech recording, detect which method among a list of candidates was used to synthesize it.
Teams had to use their signal processing and deep learning expertise to extract traces from audio data that could link them to different synthesizers.
The competition was split into two stages: an open competition that any eligible team could participate in and an invitation-only final competition. The open phase was itself divided into two parts, and each team’s score was computed as a weighted average of the performance in both parts. Finally, the three teams reporting the highest scores in the open competition were invited to the final competition.
Part 1 of the open competition was designed to give teams a well-defined problem and dataset that they could use to become familiar with the synthetic speech attribution task. Participants were provided with a dataset that they could use to train and test their classifiers.
The training dataset consisted of speech tracks from five different TTS systems, with 1,000 audio files generated by each. All audio files were provided as single-channel WAV files with a sampling frequency of 16 kHz. The textual content of all of the synthesized audio tracks was generated with GPT-2 [14].
The evaluation dataset contained 9,000 speech files generated from the five known systems plus five more, for a total of 10 different TTS methods. Participants were asked to identify the generation algorithm of each audio track and cluster together all of those generated by the systems unseen during training in an “unknown” class, resulting in an open-set classification problem. The accuracy of each system was used as the score for each team.
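One common way to turn a closed-set classifier into an open-set one, sketched below under the assumption of a probabilistic classifier such as the SVM pipeline shown earlier, is to accept the most likely known generator only when the prediction is confident enough and to assign the track to the “unknown” class otherwise. The 0.5 threshold is purely illustrative and would normally be tuned on validation data; this is not any team's actual decision rule.

```python
def open_set_predict(clf, X, threshold=0.5, unknown_label="unknown"):
    """Open-set decision by confidence thresholding on a probabilistic classifier."""
    proba = clf.predict_proba(X)            # shape: (n_tracks, n_known_classes)
    best = proba.argmax(axis=1)             # most likely known generator
    confident = proba.max(axis=1) >= threshold
    labels = clf.classes_[best].astype(object)
    labels[~confident] = unknown_label      # low confidence -> "unknown" class
    return labels
```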
To prevent “brute force” attempts to guess the generation system of each audio track, teams were allowed to submit only a limited number of classification attempts per day during the evaluation period.
Part 2 of the open competition was designed to present teams with a more challenging scenario: determining the TTS generation system for synthetic speech data that have been postprocessed.
In this part of the contest, teams were presented with audio data postprocessed using one or more operations and were asked to determine the algorithm used to synthesize them. Postprocessing operations are commonly applied to tracks before they are shared online, and these operations can potentially alter the forensic traces present in speech data.
We considered three postprocessing operations: Gaussian noise injection, reverberation addition, and MP3 compression. Teams were provided with a list of the operations and MATLAB scripts that they could use to generate augmented data. No additional training data were provided for this task. As in part 1 of the open competition, the teams had to classify speech data generated using 10 different TTS systems and cluster all of those generated by unseen algorithms into the “unknown” class.
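For readers who prefer Python, the following is a rough counterpart to two of these operations (the official MATLAB scripts and their exact parameter ranges are not reproduced here): additive Gaussian noise at a chosen signal-to-noise ratio and a crude synthetic reverberation obtained by convolving the track with an exponentially decaying noise burst. MP3 compression additionally requires an external encoder (e.g., through ffmpeg) and is omitted.

```python
# Hedged Python sketch of two of the augmentation operations; not the official
# MATLAB scripts used in the competition.
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)

def add_noise(y, snr_db=20.0):
    """Additive white Gaussian noise at a target signal-to-noise ratio (dB)."""
    noise = rng.standard_normal(len(y))
    scale = np.sqrt(np.mean(y**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
    return y + scale * noise

def add_reverb(y, sr=16000, rt60=0.4):
    """Convolve with an exponentially decaying noise burst (a crude synthetic room response)."""
    n = int(rt60 * sr)
    ir = rng.standard_normal(n) * np.exp(-6.9 * np.arange(n) / n)  # ~60-dB decay over rt60 seconds
    wet = fftconvolve(y, ir)[: len(y)]
    return wet / np.max(np.abs(wet))
```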
The accuracy of each system was used as the score for each team. The final score of the open competition was a weighted average between part 1 (70%) and part 2 (30%) scores.
The three highest scoring teams from the open competition stage were selected as finalists and invited to compete in the final competition.
These teams were provided with additional synthetic speech tracks for both training and testing. The additional training dataset contained 1,000 clean tracks from two new TTS generators, to be considered as part of the “unknown” class. The datasets provided in the open competition could be used as additional training material, although their ground-truth labels were not released. The final evaluation dataset contained 10,000 synthetic speech tracks generated with a total of 12 TTS systems (five known and seven unknown) and augmented using unrevealed techniques. In addition to the postprocessing operations considered in the open competition, time stretching, pitch shifting, and high-pass filtering were also used in this case, with variable parameters.
The presented datasets, MATLAB scripts, and other associated material are available for download through the dedicated webpage on the Piazza forum: https://piazza.com/ieee_sps/spring2022/spcup2022/home. A recently prepared enhanced version of the dataset has been made available in [15].
As in previous years, the SP Cup was run as an online class through the Piazza platform, which allowed continuous interaction with the teams. In total, more than 250 students registered for the course, and the number of contributions to the forum exceeded 275.
We received complete and valid submissions from 23 eligible teams from 18 different universities in 12 countries around the world: Europe (Finland, Italy, and Poland), Asia (Bangladesh, China, India, Iran, Israel, Pakistan, Sri Lanka, and South Korea), and North America (the United States).
The scoring platform used to host the competition was CodaLab. There, in addition to the student teams, a significant number of other groups (not eligible for the SP Cup) participated in the competition, for a total of 97 teams.
Figure 3 shows the scores obtained by the participating teams for both open competition tasks. Remarkably, all of the participants achieved impressive results, with almost half of the teams scoring an accuracy greater than 0.9 in both parts of the open competition. The score differences between the best performing teams were minimal. Nevertheless, the three finalist teams obtained the best scores in both tasks, with near-perfect accuracy.
Figure 3. The anonymized scores of the 23 teams for the two open competition subtasks: (a) open competition—part 1 and (b) open competition—part 2. Idx: index.
The approaches used by most of the participating teams were similar. All of them used supervised classification techniques borrowed from the machine learning and deep learning communities, fed with different representations of the input audio.
Several audio feature sets were considered (mel-frequency cepstral coefficients (MFCCs), constant-Q cepstral coefficients (CQCCs), spectrograms, X-vectors, etc.), along with raw audio used directly as the input of the classifiers.
Both traditional machine learning techniques (e.g., Gaussian mixture models and support vector machines) and more complex neural networks (e.g., ResNet, EfficientNet, and BiLSTM) were used as classification methods.
The most effective classifiers turned out to be those based on convolutional neural networks (CNNs) fed with raw audio as the input. Such algorithms let the neural network learn the best representation of the analyzed track without relying on predetermined features. A few teams used more isolated techniques, such as open-set nearest neighbors, denoising approaches, or systems that model human speech.
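The sketch below shows what such a raw-waveform classifier can look like in PyTorch; the layer sizes and depth are illustrative choices and do not reproduce any team's actual network.

```python
# Minimal raw-waveform 1D CNN: the network learns its own representation
# directly from audio samples instead of precomputed features.
import torch
import torch.nn as nn

class RawAudioCNN(nn.Module):
    def __init__(self, n_classes=6):  # five known generators + "unknown"
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.BatchNorm1d(16), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.BatchNorm1d(32), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.BatchNorm1d(64), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, 1, n_samples) raw waveform
        h = self.features(x).squeeze(-1)
        return self.classifier(h)

logits = RawAudioCNN()(torch.randn(2, 1, 16000))  # two 1-s tracks at 16 kHz -> (2, 6) logits
```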
One technique that was often used to improve classification performance is ensembling. This consists of performing several independent predictions on an audio track and fusing the resulting scores to increase the confidence of the final decision. In all cases in which this technique was used, it consistently increased the obtained accuracy, with the fused scores proving more reliable than those of the individual methods. The fusion strategies adopted include majority voting and score averaging.
It is worth noting that most of the higher accuracy teams benefitted from an augmented training dataset. Augmented tracks were generated both with the MATLAB scripts provided by the organizers and with other techniques, to increase classifier robustness.
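The two fusion rules mentioned above can be written in a few lines of NumPy. In this sketch, each model is assumed to output a matrix of per-class scores of shape (number of tracks, number of classes); the model list itself is assumed to exist.

```python
import numpy as np

def average_fusion(score_list):
    """Score averaging: mean of the per-model class scores, then argmax per track."""
    return np.mean(score_list, axis=0).argmax(axis=1)

def majority_vote(score_list):
    """Majority voting: each model votes for its argmax class; ties go to the lowest class index."""
    votes = np.stack([s.argmax(axis=1) for s in score_list])        # (n_models, n_tracks)
    n_classes = score_list[0].shape[1]
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)                                    # winning class per track
```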
In this section, we provide details about the three winning teams, an overview of their proposed approaches, and some feedback and perspectives received from them. Pictures of all team members are also shown in Figure 4.
Figure 4. Members of the three finalist teams at ICASSP 2022. (a) First place: Synthesizers. (b) Second place: IITH. (c) Third place: Students Procrastinating.
The proposed synthetic speech attribution method takes an audio recording as the input, transforms it into a log-mel spectrogram, extracts features using a CNN, and classifies the track among the five known algorithms and the unknown class. The approach is based on training on diverse data, data augmentation, the ensembling of various CNN backbones for supervised learning, and the incorporation of pseudo labels for semisupervised training. This constitutes a very effective classifier for synthetic speech attribution, and the approach shows outstanding performance on the evaluation datasets, validating the method.
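Pseudo-labeling, the semisupervised ingredient mentioned above, can be sketched in its simplest form as follows (this illustrates the general idea only, not the team's training recipe): predict on unlabeled tracks with a probabilistic classifier, keep the high-confidence predictions as pseudo labels, and add them to the training set for another round of training. The 0.9 threshold is an arbitrary illustrative value.

```python
def pseudo_label(clf, X_unlabeled, threshold=0.9):
    """Return the confidently predicted samples and their pseudo labels."""
    proba = clf.predict_proba(X_unlabeled)          # (n_tracks, n_classes)
    keep = proba.max(axis=1) >= threshold           # keep only confident predictions
    return X_unlabeled[keep], clf.classes_[proba.argmax(axis=1)[keep]]
```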
Existing state-of-the-art speech generation systems do an excellent job of capturing up to second-order correlations from natural speech. This motivated the team to pick features that capture higher order correlations, such as voice source features like linear prediction (LP) residuals. The approach consists of extracting information from LP residuals by introducing a set of trainable filter banks. Subsequently, segment-level encoding is performed using an X-vector architecture followed by multichannel self-attention, which offers differential temporal weighting. For comparison, the team developed a similar system without trainable filter banks for vocal tract system (VTS) features (i.e., the log-mel spectrogram), which capture second-order correlations. It was found that a system developed using voice source features is more robust than one using VTS features. To aid learning, a natural speech class was added, offering discrimination in the latent space; the team used the CMU and VCTK corpora for natural speech. The proposed lightweight system outperformed architectures like VGG and YAMNet by exploiting higher order correlation artifacts in spoofed speech.
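As a minimal illustration of the voice source features this approach builds on, the sketch below computes a frame-wise LP residual, i.e., what remains after inverse filtering each frame with its estimated LP coefficients. The trainable filter banks, X-vector encoder, and self-attention stages of the team's system are not reproduced here.

```python
# Frame-wise LP analysis followed by inverse filtering, with simple overlap-add.
import numpy as np
import librosa
from scipy.signal import lfilter

def lp_residual(y, order=16, frame_len=512, hop=256):
    """Approximate LP residual of a waveform via frame-wise inverse filtering."""
    residual = np.zeros_like(y)
    window = np.hanning(frame_len)
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start : start + frame_len]
        a = librosa.lpc(frame, order=order)          # LP coefficients, a[0] == 1
        residual[start : start + frame_len] += lfilter(a, [1.0], frame) * window
    return residual
```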
The team proposed an approach based on a CNN ensemble network in the time domain. The participants considered different spectral representations of the input audio (e.g., MFCCs, power spectral density, and the short-time Fourier transform) along with raw waveforms and classified them using a random forest and a ResNet architecture. This network was chosen because of its state-of-the-art accuracy and lower complexity on several recent speech detection datasets. Among the considered features, raw speech data were selected as the input for the models, as they provided more accurate results when used with a ResNet. The audio length was fixed at 10 s by repeating the clip and slicing, as 80% of the training data had a shorter duration. After further experimentation, the team adapted multiple highly accurate CNNs, including ResNet-, VGGNet-, and Inception-style 1D architectures and a pretrained YAMNet, all giving similar accuracy but different predictions. In the end, the best performing model was an ensemble network using majority voting to reduce the variance among these models and provide a highly accurate synthetic speech attribution pipeline for multiclass classification scenarios.
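The repeat-and-slice preprocessing described above can be sketched as a small helper that tiles the waveform until it reaches the 10-s target and then truncates it; the implementation below is illustrative, not the team's code.

```python
import numpy as np

def fix_length(y, sr=16000, target_s=10.0):
    """Tile a waveform to cover the target duration, then cut it to exactly target_s seconds."""
    target = int(target_s * sr)
    reps = int(np.ceil(target / len(y)))
    return np.tile(y, reps)[:target]
```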
The 10th edition of the SP Cup will be held at ICASSP 2023. Teams that are interested in the SP Cup competition may visit this link: https://signalprocessingsociety.org/get-involved/signal-processing-cup.
In addition to the SP Cup, the IEEE Signal Processing Society (SPS) promotes the Video and Image Processing (VIP) Cup. The last edition of the VIP Cup was held at the 2022 IEEE International Conference on Image Processing in Bordeaux, France. The theme of this competition was “Synthetic Image Detection.” For details on past and future VIP Cup editions, visit https://signalprocessingsociety.org/get-involved/video-image-processing-cup.
As the SP Cup 2022 Organizing Committee, we would like to express our warm gratitude to all of the people who took part in this competition: the participating teams, the judging panel, the local organizers, and the IEEE SPS Membership Board. Special thanks go to Dr. Gabriele Bunkheila (MathWorks), who organized a dedicated webinar, “MATLAB for Deep Learning in Audio and Speech Applications,” and MathWorks for its sponsorship. Paolo Bestagini is the corresponding author for this article.
Davide Salvi (davide.salvi@polimi.it) received his master’s degree in music and acoustic engineering from the Politecnico di Milano. He is a Ph.D. student in the Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy. During the IEEE Signal Processing Cup 2022, he led the entire data generation process, which involved synthesizing several speech tracks with diverse state-of-the-art text-to-speech techniques. His research interests include signal processing techniques for forensic analysis, synthetic speech detection and attribution, and splicing localization in audio tracks. He is a Student Member of IEEE.
Clara Borrelli (clara.borrelli@polimi.it) received her Ph.D. degree from the Politecnico di Milano. She is currently a postdoctoral researcher at the Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy. During the 2022 edition of the IEEE Signal Processing Cup, she prepared the submission system and monitored the forum. Her research interests include the application of machine learning and deep learning techniques to sound and music computing, music information retrieval, and multimedia forensics tasks. She is a Student Member of IEEE.
Paolo Bestagini (paolo.bestagini@polimi.it) received his Ph.D. degree in information technology from the Politecnico di Milano. He is an assistant professor with the Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy. He is an elected member of the IEEE Information Forensics and Security Technical Committee. He serves as an associate editor for IEEE Transactions on Circuits and Systems for Video Technology. He was a co-organizer of the IEEE Signal Processing Cup (SP Cup) 2018, and he initiated and organized the 2022 edition of the SP Cup. His research interests include multimedia forensics and acoustic signal processing for microphone arrays. He is a Member of IEEE.
Fabio Antonacci (fabio.antonacci@polimi.it) received his Ph.D. degree in information technology from the Politecnico di Milano. He is an associate professor with the Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy. He serves as a member of the IEEE Audio and Acoustic Signal Processing Technical Committee. He provided paramount information related to audio generation for the IEEE Signal Processing Cup. His research interests include space–time processing of audio signals for both speaker and microphone arrays (source localization, acoustic scene analysis, and rendering of spatial sound) and modeling of acoustic propagation (visibility-based beam tracing). He is a Member of IEEE.
Matthew Stamm (mstamm@drexel.edu) received his Ph.D. degree in electrical engineering from the University of Maryland, College Park. He is an associate professor with the Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104 USA. He serves as an associate editor for IEEE Transactions on Multimedia. He was a co-organizer of the IEEE Signal Processing Cup 2018, and he helped run the 2022 edition alongside the other organizers, providing insights on the forensic aspects. His research interests include information forensics, which involves developing techniques to detect multimedia forgeries, such as falsified images and videos; additionally, he develops and studies antiforensic countermeasures that an information attacker can use to disguise their forgeries. He is a Member of IEEE.
Lucio Marcenaro (lucio.marcenaro@unige.it) received his Ph.D. degree in electronics and computer engineering from the University of Genova. He is an associate professor of telecommunications with the Polytechnic School, Department of Electrical, Electronic, and Telecommunications Engineering and Naval Architecture, University of Genoa, 16145 Genoa, Italy. He has more than 20 years of experience in signal processing, image and video sequence analysis, and autonomous systems. He is or was an associate editor of IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, and EURASIP Journal on Image and Video Processing. He chaired the Student Service Committee of the IEEE Signal Processing Society supporting the IEEE Signal Processing Cup up to 2022. He is a Member of IEEE.
Angshul Majumdar (angshul@iiitd.ac.in) received his Ph.D. degree in electrical and computer engineering from the University of British Columbia, Vancouver, Canada. He is a professor at the Indraprastha Institute of Information Technology, New Delhi, Delhi 110020, India. He is an associate editor for IEEE Open Journal of Signal Processing and Elsevier’s Neurocomputing. He is currently the director of student services for the IEEE Signal Processing Society (SPS); prior to that, he was a member-at-large on the IEEE SPS Education Board, was chair of the Education Committee, and also served as chair of the Chapters Committee. His research interests include signal processing and machine learning with applications in smart grids, bioinformatics, remote sensing, and medical imaging. He is a Member of IEEE.
[1] D. Wang et al., “Fcl-Taco2: Towards fast, controllable and lightweight text-to-speech synthesis,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2021, pp. 5714–5718, doi: 10.1109/ICASSP39728.2021.9414870.
[2] T. Dutoit, “High-quality text-to-speech synthesis: An overview,” J. Elect. Electron. Eng. Aust., vol. 17, no. 1, pp. 25–36, Mar. 1997.
[3] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, “Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech,” in Proc. IEEE Spoken Lang. Technol. Workshop (SLT), 2021, pp. 492–498, doi: 10.1109/SLT48900.2021.9383551.
[4] “Fraudsters cloned company director’s voice in $35 million bank heist, police find,” Forbes, Oct. 2021. [Online]. Available: https://www.forbes.com/sites/thomasbrewster/2021/10/14/huge-bank-fraud-uses-deep-fake-voice-tech-to-steal-millions/?sh=37864b675591
[5] M. Todisco et al., “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” in Proc. Conf. Int. Speech Commun. Assoc. (INTERSPEECH), 2019, pp. 1008–1012, doi: 10.21437/Interspeech.2019-2249.
[6] C. Borrelli, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro, “Synthetic speech detection through short-term and long-term prediction traces,” EURASIP J. Inf. Secur., vol. 2021, no. 1, pp. 1–14, Dec. 2021, doi: 10.1186/s13635-021-00116-3.
[7] H. Malik, “Securing voice-driven interfaces against fake (cloned) audio attacks,” in Proc. IEEE Conf. Multimedia Inf. Process. Retrieval (MIPR), 2019, pp. 512–517, doi: 10.1109/MIPR.2019.00104.
[8] E. R. Bartusiak and E. J. Delp, “Synthesized speech detection using convolutional transformer-based spectrogram analysis,” in Proc. 55th Asilomar Conf. Signals, Syst., Comput. (ACSSC), 2021, pp. 1426–1430, doi: 10.1109/IEEECONF53345.2021.9723142.
[9] E. Conti et al., “Deepfake speech detection through emotion recognition: A semantic approach,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2022, pp. 8962–8966, doi: 10.1109/ICASSP43922.2022.9747186.
[10] L. Attorresi, D. Salvi, C. Borrelli, P. Bestagini, and S. Tubaro, “Combining automatic speaker verification and prosody analysis for synthetic speech detection,” in Proc. Int. Conf. Pattern Recognit. (ICPR), Springer-Verlag, 2022, arXiv:2210.17222.
[11] D. Salvi, P. Bestagini, and S. Tubaro, “Exploring the synthetic speech attribution problem through data-driven detectors,” in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), 2022, pp. 1–6, doi: 10.1109/WIFS55849.2022.9975440.
[12] M. Stamm, P. Bestagini, L. Marcenaro, and P. Campisi, “Forensic camera model identification: Highlights from the IEEE signal processing cup 2018 student competition,” IEEE Signal Process. Mag., vol. 35, no. 5, pp. 168–174, Sep. 2018, doi: 10.1109/MSP.2018.2847326.
[13] A. Giganti, L. Cuccovillo, P. Bestagini, P. Aichroth, and S. Tubaro, “Speaker-independent microphone identification in noisy conditions,” in Proc. 30th Eur. Signal Process. Conf. (EUSIPCO), 2022, pp. 1047–1051, doi: 10.23919/EUSIPCO55093.2022.9909800.
[14] A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, Feb. 2019.
[15] D. Salvi, B. Hosler, P. Bestagini, M. C. Stamm, and S. Tubaro, “TIMIT-TTS: A text-to-speech dataset for multimodal synthetic media detection,” Zenodo, 2022, doi: 10.5281/zenodo.6560159.
Digital Object Identifier 10.1109/MSP.2023.3268823