Written by Marco Fontani
“SEEING IS BELIEVING.” Or, rather, that’s what we used
to say. Since the beginning of time, seeing a fact or a piece of news depicted
in an image was far more compelling than reading it, let alone hearing about it
from someone else. This power of visual content probably stemmed from its
immediacy: looking at a picture takes less effort and training than reading
text, or even listening to words. Then, the advent of photography brought an
additional flavor of indisputable objectivity. Thanks to photography, pictures
could be used as a reliable recording of events.
Looking closer, however, it turns out that photographs
have been faked since shortly after their invention. One of the most famous
examples of historical hoaxes, dating back to the late 1860s, is Abraham
Lincoln’s head spliced over John Calhoun’s body, and cleverly so (Figure 1).
(Note: The full hoax description is available on hoaxes.org.)
Politics was indeed an important driver for image
manipulation throughout the years, as witnessed by many fake pictures created
to serve leaders of democracies and tyrannies. We have photos of the Italian
dictator Benito Mussolini proudly sitting on a horse that was held by an ostler
(the latter promptly erased), photos of Joseph Stalin where some subjects were
removed after they fell from grace, and so on. All these pictures were “fake”,
in the sense that they were not an accurate representation of what they
purported to show.
Of course, creating hoaxes with good, old-fashioned
analog pictures was not something everyone could do. It took proper tools,
training, and lots of time. Then, digital photography arrived, which was soon
followed by digital-image manipulation software and, a few years later, digital-image
sharing platforms. With advanced image editing solutions available at
affordable prices—or even for free—there was a boom in the possibilities of
creating fake pictures. Of course, you still needed suitable training and time
to obtain professional results, but this was nothing compared to working with analog pictures.
In the last couple of years, we have witnessed yet
another revolution in the manipulation of images: “deepfakes”. A deepfake is a
fake image or video generated with the aid of a deep artificial neural network.
It may involve changing a person’s face with someone else’s face (so-called
“face-swaps”), changing what a subject is saying (“lip-sync” fakes), or even
changing the words and movements of someone’s head so that they become like a
puppet guided by an actor (“re-enactment”). But how is this achieved? What are
these “deep artificial neural networks”? How can we fight deepfakes? In this
article, we’ll try to address these questions and bring some order to all of this.
Artificial Neural Networks
An artificial neural network (ANN) is a machine-learning
algorithm, and it’s not new at all. In fact, psychologist Frank Rosenblatt
proposed the first ANN as a way to model the human brain back in 1958. Like the
human brain, an ANN comprises many elementary units (neurons). Each neuron is
connected to other neurons through input connections and output connections,
and each connection is assigned a weight. The weighted contributions coming from
input neurons are summed together, and a single output value is computed using
an “activation function”. The obtained output is then sent to other neurons
through output connections. Neurons are distributed in layers: we have an input
layer, an output layer, and an arbitrary number of “hidden” layers in between.
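To make this concrete, here is a minimal sketch (in Python with NumPy; the layer sizes and values are purely illustrative) of how one layer of neurons computes its output: a weighted sum of the inputs, plus a bias, passed through an activation function.

    import numpy as np

    def sigmoid(x):
        # A common activation function, squashing any value into (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def layer_forward(inputs, weights, biases):
        # Each neuron sums its weighted inputs, adds a bias,
        # and applies the activation function to produce its output
        return sigmoid(weights @ inputs + biases)

    # A tiny example: 3 input neurons feeding a hidden layer of 2 neurons
    inputs = np.array([0.5, -1.0, 2.0])
    weights = np.random.randn(2, 3)   # one row of weights per hidden neuron
    biases = np.zeros(2)
    print(layer_forward(inputs, weights, biases))  # two activation values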
Like the human brain, an ANN must be trained with data.
Lots of data, ideally. The idea is that you need a labeled training dataset:
you feed one dataset element to the neural network, wait for the output to be
produced, then you measure how much of the output is wrong, and you
“backpropagate” the corrections to connection weights from the output layer
back to the input. Thus, training an ANN basically means updating its
connection weights until the produced output matches the expected one as
closely as possible.
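To illustrate this training loop, the sketch below (again plain NumPy, a toy example rather than a production recipe) trains a tiny network with one hidden layer on the classic XOR problem: forward pass, error measurement, backpropagation of the corrections, and weight update, repeated many times.

    import numpy as np

    rng = np.random.default_rng(0)

    # A toy labeled dataset: the XOR function
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # One hidden layer with 8 neurons and a single output neuron
    W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
    W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
    lr = 1.0

    for epoch in range(10000):
        # Forward pass: compute the network's current output
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)

        # Measure how wrong the output is, then backpropagate the
        # corrections from the output layer back toward the input
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)

        # Update the connection weights (output layer, then hidden layer)
        W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0)

    print(out.round(2))  # typically close to the expected labels 0, 1, 1, 0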
As simple as they are, ANNs are extremely powerful.
Technically speaking, they are “universal function approximators”, which means
they can be used to compute virtually anything, provided a sufficiently complex
network of neurons is allowed. And actually, ANNs have been used in many
applications: playing video games, recognizing handwritten characters, spam
filtering, cancer diagnosis, financial forecasting, image classification, and more.
Using ANNs for Deepfakes
Now, different tasks call for different network
architectures. In general, the word “deep” in deepfakes suggests that the
neural networks employed have a lot of hidden layers in order to be able to
carry out complex processing tasks. As far as deepfakes are concerned, there
are two neural network schemes that proved fundamental: auto-encoders and
generative adversarial networks (GANs).
An auto-encoder is a particular ANN that has the same
number of input and output neurons, but at least one hidden layer with a
smaller number of neurons—like in Figure 2, below.
The network is simply asked to recreate the input data
in the output layer. But since there is a hidden layer with fewer neurons (the
“bottleneck”), the network cannot simply copy elements from input to output
neurons. Instead, the network must compress the information into the bottleneck, then
decompress it, and map it to the output. In other words, the left part of
the network works as an encoder, the bottleneck layer provides the compressed
data, and the right part of the network works as a decoder.
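In code, such a network can be sketched as follows (a minimal PyTorch example; the layer sizes are chosen only for illustration, e.g. 784 inputs for a 28x28 grayscale image and a 32-neuron bottleneck).

    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, n_inputs=784, n_bottleneck=32):
            super().__init__()
            # Encoder: squeeze the input down to the narrow bottleneck
            self.encoder = nn.Sequential(
                nn.Linear(n_inputs, 256), nn.ReLU(),
                nn.Linear(256, n_bottleneck), nn.ReLU(),
            )
            # Decoder: expand the compressed code back to the original size
            self.decoder = nn.Sequential(
                nn.Linear(n_bottleneck, 256), nn.ReLU(),
                nn.Linear(256, n_inputs), nn.Sigmoid(),
            )

        def forward(self, x):
            code = self.encoder(x)     # compressed representation
            return self.decoder(code)  # reconstruction of the input

    # Training simply asks the network to reproduce its own input
    model = AutoEncoder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.rand(16, 784)                 # a dummy batch of flattened images
    loss = nn.MSELoss()(model(x), x)        # reconstruction error
    optimizer.zero_grad(); loss.backward(); optimizer.step()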
How is this related to deepfakes? Well, let’s imagine we
have a picture of Tom Cruise’s face and we want to swap it with Jim Carrey’s
face. First, we gather many images of both actors’ faces. Then, we train two
auto-encoders that share the same weights in the encoding stage (that is, from
the input to the bottleneck), but have dedicated decoders—one per actor. In
other words, at compression time, the network learns how to preserve “common
traits” shared by both faces, but at decoding time, only the peculiar traits of
each actor are reinforced. Now that we have these networks, the trick is that
we’ll use the “wrong” decoder network: we compress Tom Cruise’s face with the shared
encoder, but then we deliberately use Jim Carrey’s decoder to carry out the
decoding. The result will be a “Jim Carrey-fied” Tom Cruise face. Want an
example? This YouTube video that added Robert Downey Jr. and Tom Holland to the
Back to the Future cast is quite impressive.
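In code, the trick boils down to one shared encoder and two person-specific decoders, with the “wrong” decoder used at generation time. Here is a deliberately simplified PyTorch-style sketch (real implementations use convolutional layers on aligned face crops; the flattened 64x64x3 input size and the module names are just illustrative):

    import torch.nn as nn

    # One encoder shared by both actors, two dedicated decoders
    shared_encoder = nn.Sequential(nn.Linear(12288, 512), nn.ReLU(),
                                   nn.Linear(512, 128), nn.ReLU())
    decoder_cruise = nn.Sequential(nn.Linear(128, 512), nn.ReLU(),
                                   nn.Linear(512, 12288), nn.Sigmoid())
    decoder_carrey = nn.Sequential(nn.Linear(128, 512), nn.ReLU(),
                                   nn.Linear(512, 12288), nn.Sigmoid())

    # Training: each face is reconstructed through the SHARED encoder and its
    # OWN decoder (Cruise faces -> decoder_cruise, Carrey faces -> decoder_carrey)
    def reconstruct(face, decoder):
        return decoder(shared_encoder(face))

    # Swapping: encode a Tom Cruise face, but decode it with Jim Carrey's
    # decoder, producing a "Jim Carrey-fied" face with the same pose and expression
    def swap(face_cruise):
        return decoder_carrey(shared_encoder(face_cruise))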
Of course, the full face-swap pipeline is larger than
just these auto-encoders. It is necessary to first isolate faces from other
actors in each frame, then they must be warped and aligned to a “standard
position”. In this standard position, the swapping happens. Then they must be
warped back to their original position and “blended” into the original actor’s head.
(Note: Face swaps normally only change the region from the mouth to the
eyebrows, and they only marginally affect the hair or the jawline.)
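Putting the pieces together, the per-frame pipeline looks roughly like the pseudocode below (the helper functions are hypothetical placeholders standing in for a face detector, a landmark-based warper, and a blending routine, not a real library):

    def deepfake_frame(frame, shared_encoder, target_decoder):
        # 1. Detect and isolate the face to be replaced (hypothetical helper)
        face, landmarks = detect_face(frame)

        # 2. Warp and align the face to a standard, frontal position
        aligned, warp_params = align_to_standard_position(face, landmarks)

        # 3. Swap: encode with the shared encoder, decode with the target's decoder
        swapped = target_decoder(shared_encoder(aligned))

        # 4. Warp the swapped face back to its original position in the frame
        restored = warp_back(swapped, warp_params)

        # 5. Blend the new face into the original actor's head
        return blend_into_frame(frame, restored, landmarks)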
Now, let’s talk about generative adversarial networks.
GANs are the predominant deep-learning technology employed when new content needs
to be generated. For example, the popular website thispersondoesnotexist.com
generates faces of people that are “hallucinated” by a neural network. These
people do not exist! How do they achieve this? In the definition of GAN, “generative”
indicates that the goal is to generate new content, rather than classify,
predict, or compress. “Adversarial” means that a GAN is actually made of two
neural networks, a Generator and a Discriminator, playing one against the
other. The Discriminator is trained to distinguish real content—in our case,
real faces—from synthetically generated faces. The Generator’s goal instead is
to fool the Discriminator by producing a sufficiently realistic face starting
from random pixels. Typically, at the beginning of training, the Generator
produces almost random pixels and the Discriminator easily wins (i.e., it
correctly detects that the generated content is not a real face). However, at
every iteration the output of the Discriminator is given to the Generator as
feedback, so that the Generator can improve again and again (Figure 3).
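In simplified PyTorch terms, one training iteration of this game can be sketched as follows (the actual generator and discriminator architectures are omitted; here the generator is assumed to map a random noise vector to an image, and the discriminator to output a probability of the image being real):

    import torch
    import torch.nn as nn

    bce = nn.BCELoss()

    def gan_train_step(generator, discriminator, real_images, opt_g, opt_d, noise_dim=100):
        batch = real_images.size(0)
        real_labels = torch.ones(batch, 1)
        fake_labels = torch.zeros(batch, 1)

        # 1. Train the Discriminator to tell real faces from generated ones
        noise = torch.randn(batch, noise_dim)
        fake_images = generator(noise).detach()    # do not update G in this step
        loss_d = (bce(discriminator(real_images), real_labels) +
                  bce(discriminator(fake_images), fake_labels))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # 2. Train the Generator: its feedback is the Discriminator's verdict,
        #    so it improves whenever its fakes get classified as "real"
        noise = torch.randn(batch, noise_dim)
        loss_g = bce(discriminator(generator(noise)), real_labels)
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        return loss_d.item(), loss_g.item()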
If the two networks are properly designed, and enough
training material and processing power is provided, the Generator will
eventually produce extremely realistic faces, such as the one below taken from
the mentioned website (Figure 4).
Fighting Deepfakes
We have mentioned two deep-learning architectures used
for deepfake creation, and we have seen how realistic the generated content can
be. Let us now move to the other side of the battlefield and see what we can do
to detect deepfakes.
We can identify three main “macro-approaches”
to deepfake content detection. One possible route is to treat deepfakes as
classical images to be analyzed. In the end, regardless of how well the face
has been generated and inserted, most deepfake images or video frames are
nothing but the splicing of a fake face into a real picture. Therefore, classical
image and video forensic techniques—based on compression artifacts analysis,
noise consistency, and correlation analysis—have a chance to be successful in
detecting a deepfake (Verdoliva 2020). For example, Amped Authenticate’s ADJPEG
filter successfully detects many images generated with a popular face-swapping
app, as shown in Figure 5 below.
The ADJPEG filter works by finding double-compression
artifacts in the original part of the
image. This means that even if a more complex splicing system is used to
substitute the face, it will make little difference since it will affect the
manipulated part and not the original part, which is left untouched.
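As a toy illustration of the double-compression idea (a simplified “recompression error” analysis for intuition only, not how Amped Authenticate’s ADJPEG filter is implemented), one can re-save an image at a range of JPEG qualities and look for a suspicious minimum in the reconstruction error, which hints that the image had already been compressed at that quality before:

    import io
    import numpy as np
    from PIL import Image

    def recompression_error_curve(img, qualities=range(50, 101, 5)):
        # Re-save the image at several JPEG qualities and measure how much
        # each re-save changes the pixels; a pronounced minimum at a quality
        # below the current one suggests a previous JPEG compression.
        ref = np.asarray(img.convert("L"), dtype=np.float64)
        errors = {}
        for q in qualities:
            buf = io.BytesIO()
            img.save(buf, format="JPEG", quality=q)
            buf.seek(0)
            resaved = np.asarray(Image.open(buf).convert("L"), dtype=np.float64)
            errors[q] = float(np.mean((ref - resaved) ** 2))
        return errors

    # Usage: errs = recompression_error_curve(Image.open("questioned.jpg"))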
The second possible approach is to use deep learning on
the detection side. Researchers have been publishing tons of papers where more
and more complex neural networks are employed for deepfake content detection (Verdoliva
2020). The main issue of deep-learning-based detection techniques is that they
heavily depend on the training dataset. In other words, as long as you take a
large dataset, split it in two, then use one part to train the network and the
other to test it, things work nicely. But if you use the so-trained network on
images from a different dataset, performance drops dramatically, and this
severely limits the applicability of these data-driven approaches. Another
relevant problem is lack of explainability: it is normally very hard to explain
“how” the network reached the final classification, which is of course
problematic in a forensic scenario. Finally, if a valid detection network X is
made publicly available (as it should be for repeatability purposes, required
in forensics), there is a risk that attackers will build a GAN using X as the
discriminator, which means they could generate anti-forensic fakes specifically designed to fool it.
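A typical data-driven detector of this kind can be sketched as follows (PyTorch/torchvision, assuming face crops organized into “real” and “fake” folders; the dataset path and hyperparameters are illustrative): a pretrained CNN is simply fine-tuned as a binary real-vs-fake classifier.

    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    # Face crops assumed to be organized as data/train/real and data/train/fake
    tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    train_set = datasets.ImageFolder("data/train", transform=tf)
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

    # Start from a pretrained network and replace its last layer with 2 outputs
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, 2)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    model.train()
    for epoch in range(5):
        for images, labels in loader:
            loss = loss_fn(model(images), labels)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Such a model typically performs well on faces from the same dataset it
    # was trained on, and much worse on faces coming from a different source.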
The third macro-approach consists of visual consistency
and behavioral analysis methods. Contrary to the first two approaches, these
make explicit use of the fact that deepfakes basically involve the faces of
people, and there are many subtle things that could go wrong in the generation
process. We may find obvious clues of manipulation, such as inconsistencies in
eye color or earrings, as in Figure 6 below (although, admittedly, you
may well find real people with different colored eyes, or wearing just one earring).
These kinds of blatantly “strange” defects are becoming
less common as neural networks improve. However, there are other kinds of
anomalies that are harder for a network to avoid. For example, researchers
found that in deepfake videos, the tampered face has a much lower eye-blinking
rate than real faces (Li 2018). That’s probably because most neural networks
have limited time awareness, and they can hardly figure out the right moment
when eyes should blink. Of course, it would be possible for the attacker to
work around this issue by just “copying” the eye blinking time from the
original face into the tampered face, so even this anomaly may eventually be worked around.
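One simple way to quantify blinking is the “eye aspect ratio”: a ratio between the vertical and horizontal extent of the eye that collapses toward zero when the eye closes. The sketch below assumes that six (x, y) eye landmarks per frame are already available from any facial-landmark detector (the threshold and the landmark layout follow a common convention and are illustrative):

    import numpy as np

    def eye_aspect_ratio(eye):
        # `eye` holds six landmarks as NumPy (x, y) points:
        # the two corners p1, p4 and the lid points p2, p3, p5, p6
        p1, p2, p3, p4, p5, p6 = eye
        vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
        horizontal = np.linalg.norm(p1 - p4)
        return vertical / (2.0 * horizontal)

    def blinks_per_minute(eye_landmarks_per_frame, fps, threshold=0.2):
        # Count the frames where the eye goes from "open" to "closed"
        ratios = [eye_aspect_ratio(eye) for eye in eye_landmarks_per_frame]
        blinks = sum(1 for prev, cur in zip(ratios, ratios[1:])
                     if prev >= threshold and cur < threshold)
        minutes = len(ratios) / fps / 60.0
        return blinks / minutes  # an unusually low rate may hint at a deepfake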
Very recently, it has been shown that when the deepfaked
subject speaks, there are inconsistencies between the phonemes (elementary
units of sound, like those written next to words in
dictionaries to explain how to pronounce them) and the visemes (the elementary
movements of the mouth) (Agarwal 2020). Of course, designing an automated
analysis for this kind of anomaly is not trivial, and manually carrying out the
analysis is time consuming.
Finally, it is worth mentioning an analysis method
designed to protect world leaders or very popular individuals, for which
hundreds of hours of video are normally available (Agarwal 2019). The method
creates a personalized “profile” of the peculiar behavioral characteristics of
the original subject using training videos (e.g., the way they move the head
and eyebrows when speaking, or the way eye wrinkles vary over time). Now, if
the attacker used an actor’s face to “re-enact” the world leader’s face, the
faked face will follow the actor’s behavioral characteristics, not those of the
original subject. Therefore, extracting these characteristics from the
questioned video and comparing them to the individual’s profile may allow
spotting possible inconsistencies.
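Conceptually, such a profile can be modeled with a one-class classifier trained only on genuine footage. Here is a minimal scikit-learn sketch, assuming that per-clip behavioral feature vectors (e.g., statistics of head movements and eyebrow raises) have already been extracted by some other tool; the file names are placeholders:

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Each row describes one short clip of the real subject
    genuine_clips = np.load("leader_behavior_features.npy")   # assumed precomputed

    profile = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
    profile.fit(genuine_clips)

    # A questioned clip is then compared against the learned profile:
    # +1 means consistent with the real subject, -1 flags a possible re-enactment
    questioned = np.load("questioned_clip_features.npy")
    print(profile.predict(questioned))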
Deepfakes: Good or Evil?
As often happens with technology in general, there
are both constructive and malicious uses for deepfakes. On the constructive
side, think about the movie production industry: they can finally fix the
annoying out-of-sync mouth effect in dubbed movies at little cost. Even more,
they could easily animate an avatar of the main character at little expense,
compared to having video-editing specialists working hours for every second of
the movie (this is bad news for such professionals, though).
On the evil side, there are sadly several ways to
weaponize deepfake technology. Misinformation is one of them. It is getting
easier to create a video of a politician saying something they would never say.
Forging fake evidence is another possible misuse: someone could create an alibi
by swapping their face into someone else’s face so as to pretend they were in a
place at a certain time. Sadly enough, however, the main misuse of deepfakes is
currently related to non-consensual pornography. Women have found their faces
realistically spliced onto an actress’s face in sexually explicit videos, with
an obvious negative impact on their reputation. Recently, even bots (automatic
chat responders) have been created that will “undress” any woman. The attacker
sends in a picture of the dressed victim, and the bot creates a picture where
the victim is naked.
Such a variety of potential misuses certainly calls for
the development of reliable deepfake detection technologies, but also suggests
that technology cannot be the sole answer. People need to be educated about the
existence of deepfakes, and to be made aware that seeing is no longer believing—well,
in the digital world, at least. In a world where everyone can post news on the
internet, the ability to scrutinize an information source to decide about its
reliability is becoming increasingly important. All in all, there’s no surprise
that a complex threat such as deepfakes requires a combination of education,
intelligence, and technology.
About the Author
Marco Fontani graduated in Computer Engineering (summa cum laude) in 2010 at the
University of Florence (Italy) and earned his Ph.D. in Information Engineering
in 2014 at the University of Siena under the supervision of Prof. Mauro Barni.
He works as an R&D Engineer at Amped Software, where he coordinates
research activities. He participated in several research projects, funded by
the European Union and by the European Office of Aerospace Research and
Development. He is the author or co-author of several journal papers and
conference proceedings, and he is a member of the Institute of Electrical and
Electronics Engineers (IEEE) Information Forensics and Security Technical
Committee. He delivered training to law enforcement agencies and he has
provided expert witness testimony on several forensic cases involving digital
images and videos.
References
Agarwal, S., H. Farid, Y. Gu, M. He, K. Nagano, and H.
Li. 2019. Protecting world leaders against deep fakes. IEEE/CVF Conference
on Computer Vision and Pattern Recognition Workshops (CVPRW). Long Beach, CA.
Agarwal, S., H. Farid, O. Fried, and M. Agrawala. 2020. Detecting
deep-fake videos from phoneme-viseme mismatches. IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW). Seattle, WA.
Li, Y., M. Chang, and S. Lyu. 2018. In ictu oculi:
Exposing AI created fake videos by detecting eye blinking. IEEE
International Workshop on Information Forensics and Security (WIFS). Hong
Kong, Hong Kong. 2018:1-7.
Verdoliva, L. 2020. Media forensics and deepfakes: An
overview. arXiv preprint arXiv:2001.06564.