Deepfake Image Classification :
Table Of Contents :
1 .) Introduction
2 .) Business Case
3 .) EDA And Preprocessing
4 .) Base Model (VGG-16)
5 .) CG-FACE Model
6 .) Model Improvement
7 .) Final Result
8 .) Final Observation
9 .) References
1 .) Introduction :
Deepfake images are realistic-looking images generated by face manipulation methods such as Deepfakes, FaceSwap, Face2Face and NeuralTextures.
In recent years computer vision has also gained a new tool, the Generative Adversarial Network (GAN), introduced in 2014 by Ian Goodfellow, then a graduate student, and his colleagues.
As the saying goes, every coin has two sides. On the positive side, GANs can generate realistic images for training our neural networks, and the entertainment industry can use them in many ways. On the negative side, GANs are powerful enough to do real damage: imagine a fake news story that starts circulating with a fake facial image attached and harms communal harmony in society.
In the era of social media we can easily be misled about true events, which could be very harmful for our society.
All of this is possible today because we are sitting on a very powerful technology, but we will use the same technology to find a solution to this problem.
2 .) Business Case :
The first use case is identity fraud detection. In recent years banking and financial services have moved to online platforms, so video KYC and online document verification have become much more common.
Deepfake images and videos are a big headache here for financial firms and the banking sector.
The second use case is cyber security. Imagine a person who uses multiple deepfake images to operate various fake accounts, or to get past any digital security platform that gives much importance to the user's image.
The third use case is the identification of fake news. Fake news is becoming a more serious problem day by day: with high-speed internet, things go viral very quickly, and people often react to news without verifying it. Well-known examples can be found on social media platforms, where large amounts of fake news circulate daily.
3 .) EDA And Preprocessing :
Let's first visualize the images to see whether we can tell the fake images from the real ones just by looking at them.
Without the labels, it was not possible for a human to make a clear guess about which image is real and which is fake.
As we moved forward, we realized that a cropped and aligned face makes much more sense for deepfake image classification, because in the end we are building a CNN model that extracts the important facial features from the image to make the classification.
So let's build our own face aligner and extractor. For this we need OpenCV and Dlib, two image processing libraries that are very popular for image-related tasks in Python.
The extracted and aligned images can be seen below. We use two input sizes, (224 X 224) and (64 X 64), because VGG-16 and the proposed model expect different input shapes. We also use grayscale images: a normal RGB image has three channels (red, green and blue) whose values together set the color of each pixel, while a grayscale image has only one.
Fewer input channels mean fewer learnable parameters in the first convolution layer and less data to process per image, so the total training time goes down.
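To make the parameter saving concrete, here is a small sketch that counts the weights of a single convolution layer for RGB and for grayscale input. The 3 X 3 kernel and 64 filters are assumptions chosen to match the first layer of a VGG-style network, and the helper name is ours, not taken from the original implementation.

```python
def conv2d_param_count(in_channels, filters, kernel_size=3):
    """Weights plus one bias per filter for a single 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * filters
    biases = filters
    return weights + biases


# Assumed first layer: 64 filters with a 3 x 3 kernel
rgb_params = conv2d_param_count(in_channels=3, filters=64)
gray_params = conv2d_param_count(in_channels=1, filters=64)

print("RGB input :", rgb_params)
print("Gray input:", gray_params)
```

Only the first layer shrinks, since every later layer sees the same number of feature maps either way, so the saving in parameters is modest. The larger gain comes from storing and loading one channel per image instead of three.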
We stored all the preprocessed images in NumPy format and chose only 20,000 images from both data sets to keep the prediction problem balanced.
Label Encoding :
A label encoder is a categorical encoding tool. Suppose we are given three categories: Red, Blue and Green. The “label encoder” assigns an integer value to each label: its “fit function” builds a mapping over all the unique values, and its “transform function” then converts the categories to numerical values. With scikit-learn's LabelEncoder, for instance, the labels are sorted and numbered from 0, so Blue becomes 0, Green becomes 1 and Red becomes 2.
Here we have two class labels, ‘Fake’ and ‘Real’, so we chose label encoding to convert them into numerical values.
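As a rough illustration of what the fit and transform steps do, the sketch below reproduces the same behaviour with NumPy alone. The tiny label array is made up for the example, and this is not the code used in the case study.

```python
import numpy as np

# Made-up labels for five images
labels = np.array(["Real", "Fake", "Fake", "Real", "Fake"])

# "fit": collect the sorted unique classes
# "transform": replace every label by its index in that sorted list
classes, encoded = np.unique(labels, return_inverse=True)

print(classes)   # the class names, in sorted order
print(encoded)   # one integer code per image
```

Because the classes are sorted alphabetically, ‘Fake’ ends up as 0 and ‘Real’ as 1, which makes the fake images the negative class in the metrics that follow.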
Dividing Dataset for training :
4 .) Base Model VGG-16 :
VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman of the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieves 92.7% top-5 test accuracy on the ImageNet challenge, which uses 1000 classes drawn from the ImageNet dataset of over 14 million images.
The internal architecture of VGG-16 looks like this:
VGG-16 normally takes a 224 X 224 input, and the base model uses the same size, but with 1 channel instead of 3 because the input images have already been converted to grayscale.
The input image passes through 4 conv blocks, each ending in a max pooling layer. The conv blocks use 64, 128, 256 and 512 filters respectively, with padding = “same” and the “relu” activation in every convolution layer. (The original VGG-16 adds a fifth conv block, also with 512 filters.) Successive conv blocks extract progressively higher-level features for the classification.
In the standard form, the output of the flatten layer passes through a small dense network of two layers with relu activation, and the final output layer is a softmax layer.
A softmax layer is generally a good fit for the multiclass setting, because the softmax function turns the outputs into positive scores that sum to 1 across the classes:
\[ \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \]
Here each z_i is the raw output of one unit of the final dense layer, and the softmax score for each output is calculated by the formula above. Softmax can also be used in the binary setting: since we are solving a binary classification problem, there are two output classes in this case.
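The softmax score can be sketched in a few lines of NumPy. The two input values are made-up outputs of a final layer, used only to show the two-class case.

```python
import numpy as np


def softmax(z):
    """softmax(z_i) = exp(z_i) / sum_j exp(z_j)."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()        # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()


# Made-up raw outputs for the two classes
scores = softmax([2.0, 0.5])
print(scores, scores.sum())
```

With only two classes, the first softmax score equals a sigmoid applied to the difference of the two raw outputs, which is why softmax and sigmoid outputs are interchangeable for binary problems.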
Other parameters: for this task we use the Adam optimizer with an initial learning rate of 0.001. In Adam the size of each update depends on \hat{m}_t and \hat{v}_t, the bias-corrected estimates of the first and second moments of the gradient, and the learning rate \alpha relates to them as
\[ \theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \]
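To see how the learning rate and the two bias-corrected terms interact, here is a minimal sketch of a single Adam step in NumPy. The values beta1 = 0.9, beta2 = 0.999 and eps = 1e-8 are the commonly used defaults and are assumed here, since the post only states the learning rate. The quadratic toy function is also just an illustration.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


# Toy problem: minimise f(theta) = sum(theta^2), whose gradient is 2 * theta
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 4):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)

print(theta)
```

In these first steps each parameter moves by roughly the learning rate, whatever the size of its gradient, because the ratio of the two corrected moments stays close to 1 in magnitude.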
The loss is binary cross entropy, since this is a binary classification problem.
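For intuition, binary cross entropy can be written out by hand as below. This is a plain-Python sketch with made-up predictions, not the framework implementation used for training, and the clipping constant eps is an assumption added to avoid log(0).

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Same labels, once with confident correct predictions and once with confident wrong ones
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

The loss stays small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong.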
Metrics: the data set is perfectly balanced, so we are not dealing with a class imbalance problem, but we still use the F1 score along with accuracy as a metric, and for a reason.
The F1 score can be written as
\[ F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2\,TP}{2\,TP + FP + FN} \]
Here TP is the number of true positives, and FP and FN are the numbers of false positives and false negatives. The formula is straightforward in terms of precision and recall, but if we dig further we can see that its final form is written in terms of TP, FP and FN only: TN makes no contribution to the F1 score.
Intuitively, the F1 score is lowest when FP and FN are high and highest when they are low, and it always lies in the range 0 to 1. F1 is therefore sensitive to every misclassified point. In this case the fake images are our negative data points, so a fake image predicted as real counts as an FP and pulls the score down, and tracking F1 pushes us toward a model that is more sensitive to the fake images.
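A quick sketch confirms that the two ways of writing the F1 score agree and that TN never appears. The counts are made up for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); TN is not needed."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_from_precision_recall(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 80 true positives, 10 false positives, 20 false negatives
f1_a = f1_from_counts(tp=80, fp=10, fn=20)
f1_b = f1_from_precision_recall(tp=80, fp=10, fn=20)
print(f1_a, f1_b)
```

Both functions return the same value, and changing the number of true negatives would not change either result.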
5 .) CG-FACE Model :
This convolutional neural network model was proposed by L. Minh Dang, Syed Ibrahim Hassan and colleagues from the Department of Computer Science and Engineering, Sejong University, Seoul 143–747, Korea, in the paper “Deep Learning Based Computer Generated Face Identification Using Convolutional Neural Network”.
In the paper, GAN models such as BEGAN and PCGAN are used to generate the fake facial images, and the CelebA data set is used for the real class.
The primary argument of the paper is that big models such as VGG16 and other pretrained models contain a large number of convolution layers, and in many cases the features these models learn are not well suited to the problem. Because of their large number of parameters they also tend to overfit. To overcome this, the Sejong University team proposed CGFace and its variants for the deepfake image classification task.
This model is very shallow, and the research team argues from their experiments that a shallow network learns the right amount of features and does not overfit. The architecture takes a 64 X 64 input; again we use 1 channel rather than 3 because of grayscale. In total there are only five convolution layers, each followed by a max pooling layer, and the final features pass through a dense layer with a sigmoid activation function.
The kernel sizes of the convolution and max pooling layers were decided after experimenting with the dataset.
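To get a feel for how quickly five convolution and pooling stages shrink a 64 X 64 face, the sketch below traces the spatial size through the network. It assumes “same” padded convolutions and 2 X 2 max pooling with stride 2. These are illustrative assumptions only, since the kernel and pooling sizes actually used were tuned experimentally and are not stated here.

```python
def trace_feature_map_sizes(input_size=64, num_blocks=5, pool=2):
    """Spatial size after each conv + max-pool block under the stated assumptions."""
    sizes = [input_size]
    size = input_size
    for _ in range(num_blocks):
        # an assumed "same"-padded convolution keeps the size unchanged,
        # then the assumed pooling layer divides it by `pool`
        size = size // pool
        sizes.append(size)
    return sizes


sizes = trace_feature_map_sizes()
print(sizes)
```

Under these assumptions only a 2 X 2 grid per filter reaches the dense layer, which helps explain why the network stays small.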
Because the metric used in the research paper was AUC, we add it to our existing metrics.
AUC stands for area under the curve. To calculate it we draw a curve of the true positive rate (TPR) against the false positive rate (FPR). To calculate the TPR and FPR we first create our confusion matrix, which counts the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).
The TPR, also known as sensitivity, is the number of points correctly predicted as positive divided by the total number of actual positive points: TPR = TP / (TP + FN). Here TP counts the points that are predicted positive and are actually positive, and FN (false negatives) counts the points that are positive in reality but predicted as negative.
Similarly, the FPR is the number of false positive points divided by the total number of actual negative points: FPR = FP / (FP + TN).
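Both rates can be read straight off the confusion matrix counts, as this small sketch shows. The labels and predictions are made up, with 1 as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def tpr_fpr(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn)   # share of actual positives that were caught
    fpr = fp / (fp + tn)   # share of actual negatives wrongly flagged as positive
    return tpr, fpr


# Made-up example: 4 positives and 4 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(tpr, fpr)
```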
After calculating both rates at each classification threshold, we plot them against each other to get the AUC score. Typically the graph looks like this:
To calculate the AUC score we find the area under this curve (the shaded region). The score lies between 0 and 1. An AUC of 0.5 means the model is no better than random guessing, and the worst case is 0.0, where every ranking is reversed. To interpret the AUC, take 0.8 as an example: it means there is an 80% chance that the model scores a randomly chosen positive point higher than a randomly chosen negative point.
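The two views of AUC, area under the ROC curve and probability of ranking a positive above a negative, can be checked against each other with a short sketch. The scores below are made up, and in practice a library routine would normally be used instead of these hand-written helpers.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) at every threshold, from the strictest threshold to the loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_pairwise(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc_area = auc_trapezoid(roc_points(y_true, scores))
auc_rank = auc_pairwise(y_true, scores)
print(auc_area, auc_rank)
```

Both numbers come out the same, which is the reason the AUC can be read as a ranking probability.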
We trained both models for 500 epochs with a batch size of 64. Although the CG-Face model reached the lowest loss, VGG16 outperformed it by a significant margin. We should not forget, though, that VGG-16 takes a 224 X 224 input, so it has more image detail from which to extract important features for the classification.
6 .) Model Improvement :
The paper behind this improvement is still under review, but the method it proposes shows very good improvements.
It introduces a concept called the gram block, which can be inserted between the layers of an existing CNN to collect texture features. The paper found that texture features play an important role in telling original images apart from deepfake images.
Before moving to the gram block, let's first understand the concept of the gram matrix. For a feature matrix F whose columns are f_1, ..., f_n,
\[ G = F^{\top} F, \qquad G_{ij} = f_i \cdot f_j \]
As the formula suggests, we calculate the gram matrix by taking the dot product between every pair of columns of the feature matrix. The gram matrix is generally used in style transfer, but the researchers use it to preserve the importance of texture features in combination with the CNN layers. The suggested gram block fits into an existing model much as a skip connection fits into ResNet: in ResNet we add a skip connection between two layers, and here we insert a whole block between two convolution layers.
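Here is a minimal NumPy sketch of the gram matrix of a single feature map. It assumes a (height, width, channels) layout and treats each channel as one column of the feature matrix. The random feature map is only a stand-in for a real convolution output, and this is not the gram layer from the paper.

```python
import numpy as np


def gram_matrix(feature_map):
    """Dot product between every pair of channels of an (H, W, C) feature map."""
    h, w, c = feature_map.shape
    features = feature_map.reshape(h * w, c)   # one column per channel
    return features.T @ features               # shape (C, C)


# Random stand-in for the output of a convolution layer with 4 filters
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8, 4))

gram = gram_matrix(feature_map)
print(gram.shape)
```

The result has one row and one column per channel and no spatial axes, so it records which filters fire together rather than where they fire, which is what makes it a texture descriptor.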
The gram block consists of a convolution layer followed by a gram matrix layer and two more convolution layers, with a pooling layer applied at the end.
We use 4 gram blocks in our CG-Face model and train with the same parameters as before.
7 .) Final Result :
The final comparison of the metrics showed the most refined, and the best, results for our CG-FACE model.
The full implementation of this case study is available on my GitHub.
8 .) Final Observation :
1 .) The first key takeaway from this experiment is that a more complex CNN with more layers does not always work better: we get good results with our CG-Face model, which is less complex and has fewer convolution layers than VGG-16. CG-Face also works at a lower resolution. For VGG-16 we used a 224 X 224 input and for our model a 64 X 64 input, which means our model had a harder job, with far fewer pixels per face to work from.
2 .) From the experiment with the gram block we learned that texture features really do matter for the facial image classification problem: our results suggest that they enhance the power of a CNN model to an extent.
9 .) References :
1 .) The fake facial images used in this case study come from the 1 million fake faces data set, obtained from the link.
2 .) The real facial images come from the CELEB-HQ data set, at a size of 256x256, from the given link.
3 .) An implementation of the research paper, with some modifications for this problem.
4 .) The gram block paper, used for the model improvement.
I hope you enjoyed the blog. Feel free to share your ideas in the comments section below, or get in touch with me on LinkedIn.