Result Video (result.mp4).
Deepfake – an advanced technique for human image synthesis based on artificial intelligence.
Masking – cutting something out of the video or placing something on top of it.
SRC – source material.
DST – destination.
Model – a mathematical algorithm that is “trained” on data and human expert input to replicate the decision an expert would make when given the same information.
In this project, I will describe in detail the process of making a deepfake video where I swap my head with the head of the Norwegian actor Aksel Hennie, to mimic the facial expressions in the destination video.
The term deepfake is a portmanteau of “deep learning” and fake, where synthetic images and videos can be constructed through inputs decided by their creators.
To do this, I am going to use the open-source program DeepFaceLab (DFL) as the primary tool. See the steps below for usage:
Destination Video (data_dst.mp4).
A pre-trained model with 200,000 facial expressions at different viewing angles has been applied to the input material to increase the realism of the output video and reduce the training period.
Requirements to run the model
- GPU with 15 GB VRAM
- An Intel i5 or AMD Ryzen processor
- 12 GB of RAM
- 10 GB of free hard drive space
For this experiment, I used Tesla V100-SXM2 (16GB VRAM) on Google Colab.
2. Main part
The first step is to extract the frames from five interviews that will serve as the source data for the Aksel head dataset. These interviews are compiled into one video file, “data_src.mp4”, and placed in the workspace folder under the DeepFaceLab folder.
After opening “2) extract image from video data_src.bat”, an extraction rate of 3 FPS is chosen, since this mitigates duplicated heads with the same facial expression. The goal is a dataset with many different expressions, seen from different angles and with different eye directions.
Image 1. 2) extract image from video data_src.bat.
After the extraction, the images need to be cleaned up by removing every frame that contains no heads or is of too low quality. The frames are placed in the data_src folder in workspace.
Image 2. Example with highlighted pictures.
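One way to automate part of this cleanup — my own assumption, not a DFL feature — is to score each frame’s sharpness with the variance of its Laplacian and discard low-scoring frames before manual review:

```python
import numpy as np

def sharpness(img: np.ndarray) -> float:
    """Variance of the Laplacian response; low values indicate a blurry frame."""
    g = img.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])          # 4-neighbour Laplacian kernel
    return float(lap.var())

rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, (64, 64))     # high-frequency detail: "sharp"
flat = np.full((64, 64), 128)              # uniform grey: maximally "blurry"
print(sharpness(sharp) > sharpness(flat))  # True
```

A threshold would still need tuning against a handful of hand-labelled frames; the score only ranks frames relative to each other.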
The next step is to extract the heads. This is done by running “4) data_src faceset extract.bat” with the default settings and letting it run for a while.
Image 3. 4) data_src faceset extract.
These head images are placed in the aligned folder in data_src. Before doing anything else, heads of people other than the subject Aksel Hennie, blurry faces, and bad face tracks must be removed, as shown below in image 4.
Image 4. aligned folder in data_src.
After removing the poor-quality images, the dataset is ready: 4116 head images of Aksel in different positions and with different facial expressions.
Compiled Video of dataset images (00000.mp4).
Before masking the heads, the destination video, data_dst.mp4, has to be prepared. This is done by running “3) extract images from video data_dst FULL FPS.bat” to extract all of the frames of “data_dst.mp4”, and “5) data_dst faceset extract.bat” to extract the heads from those frames.
Compiled Video of face track (comp.mp4).
The next step is to mask the part that is going to be trained in the dst and src with “5.XSeg) data_dst mask – edit.bat” and “5.XSeg) data_src mask – edit.bat”. It is only necessary to mask heads in different angles and lighting conditions: for the src dataset of 4116 heads, only 21 masked heads were used before XSeg training, and for the dst dataset only 6 masks were used for 959 heads.
Image 5. src masking with XSeg.
Image 6. dst masking with XSeg.
Next, the XSeg mask model is trained with “5.XSeg) train.bat”.
Image 7. XSeg model training.
The XSeg training process was run with “5.XSeg) train.bat” until the masking of the heads was acceptable, as shown below in image 7. Then “5.XSeg) data_dst trained mask – apply.bat” and “5.XSeg) data_src trained mask – apply.bat” were used to apply the masks.
Image 7. XSeg mask model preview.
Then the training of the head model can be started. After starting “6) train SAEHD.bat”, the following settings are chosen:
- Autobackup every N hour ( 0..24 ?:help ) : 0
- Target iteration : 0
- Flip faces randomly ( y/n ?:help ) : n
- Batch_size ( ?:help ) : 8
- Resolution ( 64-640 ?:help ) : 288
- Face type ( h/mf/f/wf/head ?:help ) : head
- AE architecture: df
- AutoEncoder dims ( 32-1024 ?:help ) : 512
- Encoder dims ( 16-256 ?:help ) : 64
- Decoder dims ( 16-256 ?:help ) : 64
- Decoder mask dims ( 16-256 ?:help ) : 22
- Eyes and mouth priority ( y/n ?:help ) : n (but turned on after 150,000 iterations)
- Place models and optimizer on GPU ( y/n ?:help ) : y
- Use learning rate dropout ( y/n/cpu ?:help ) : n
- Use AdaBelief optimizer? ( y/n ?:help ) : n
- Enable random warp of samples ( y/n ?:help ) : y
- Masked training ( y/n ?:help ) : n (this should be y, since it reduces the training time by prioritizing what has been masked.)
- Uniform_yaw ( y/n ?:help ) : n
- GAN power ( 0.0 .. 10.0 ?:help ) : 0.0
- ‘True face’ power ( 0.0000 .. 1.0 ?:help ) : 0.0
- Face style power ( 0.0..100.0 ?:help ) and Background style power ( 0.0..100.0 ?:help ) : 0.0
- Color transfer for src faceset ( none/rct/lct/mkl/idt/sot ?:help ) : none
- Enable gradient clipping ( y/n ?:help ) : y
- Enable pretraining mode ( y/n ?:help ) : n
Image 8. SAEHD.bat training.
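The “df” architecture selected above trains a single shared encoder together with two separate decoders, one per identity; the swap works by decoding a dst latent with the src decoder. A minimal numpy sketch of that structure, with dense toy layers standing in for the real convolutional networks and tiny dimensions instead of the ae/e/d dims listed above:

```python
import numpy as np

rng = np.random.default_rng(42)
IMG, LATENT = 48, 16      # toy sizes; the real model uses 288x288 images, ae_dims 512

# One shared encoder and two per-identity decoders (dense stand-ins for conv nets).
W_enc = rng.standard_normal((IMG, LATENT)) * 0.1
W_dec_src = rng.standard_normal((LATENT, IMG)) * 0.1
W_dec_dst = rng.standard_normal((LATENT, IMG)) * 0.1

def encode(x):
    # Shared weights force src and dst faces into one common latent space.
    return x @ W_enc

def swap(dst_face):
    """The deepfake step: encode a dst face, decode it with the src decoder."""
    return encode(dst_face) @ W_dec_src

dst_face = rng.standard_normal(IMG)
out = swap(dst_face)
print(out.shape)          # (48,) -- a "src-styled" reconstruction of the dst face
```

During training, src faces are pushed through encoder + src decoder and dst faces through encoder + dst decoder; because the encoder is shared, crossing the decoders at inference time produces the identity swap.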
The model was then trained until the SRC loss value showed little improvement and the facial detail no longer improved; this was checked by inspecting the preview image shown in the picture below.
Image 9. Training from 300 to 2000000 iterations.
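“Little improvement in the loss” can be made concrete by comparing the mean loss over the two most recent windows of iterations; the following stopping check is my own heuristic sketch, not part of DFL:

```python
def plateaued(losses, window=1000, tol=0.001):
    """True when the mean loss of the last window barely improves on the one before it."""
    if len(losses) < 2 * window:
        return False
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    return prev - last < tol

falling = [1.0 / (i + 1) for i in range(4000)]    # loss curve that flattens out
print(plateaued(falling[:2000]), plateaued(falling))
```

In practice the preview images matter as much as the numbers: the loss can look flat while fine facial detail is still sharpening, which is why both were checked here.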
The last stage is to merge the trained model into data_dst.mp4. This is done with “7) merge SAEHD.bat” using the following settings:
- Mode: overlay
- mask_mode: learned-prd
- erode_mask_modifier: 0
- blur_mask_modifier: 0
- motion_blur_power: 0
- output_face_scale: -5
- color_transfer_mode: rct
- sharpen_mode : None
- blursharpen_amount : 0
- super_resolution_power: 16
- image_denoise_power: 0
- bicubic_degrade_power: 0
- color_degrade_power: 0
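The overlay mode is, at its core, alpha blending: the learned mask decides, per pixel, how much of the output comes from the swapped face versus the original dst frame (the erode and blur modifiers shrink and soften that mask first). A minimal numpy sketch of the blend, under that assumption:

```python
import numpy as np

def overlay_merge(dst, swapped, mask):
    """Composite the swapped face onto the dst frame: out = mask*swapped + (1-mask)*dst."""
    m = np.clip(mask, 0.0, 1.0)[..., None]   # broadcast the mask over RGB channels
    return m * swapped + (1.0 - m) * dst

dst = np.zeros((4, 4, 3))                    # toy dst frame (black)
swapped = np.ones((4, 4, 3))                 # toy swapped face (white)
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1.0   # face region in the centre
out = overlay_merge(dst, swapped, mask)
print(out[2, 2, 0], out[0, 0, 0])            # 1.0 0.0 -- face inside mask, dst outside
```

A soft (blurred) mask yields fractional values at the boundary, which is what hides the seam between the generated head and the original frame.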
After all of the frames are processed, “8) merged to mp4.bat” is started to compile the processed frames back into a video file. The output is result.mp4, shown at the top of the article.
dst, result and facetrack (final.mp4)
3. Summary
In the first stage of the project, the dataset was prepared by extracting frames from the source video as well as the destination video. The heads were then extracted with facial tracking.
The heads of the destination and source videos were masked so that obstructions in front of the face or head, such as fingers and logos, are not trained as part of the head.
In the last stage, the model was trained with the appropriate settings. In this case, training took almost two months, which is longer than necessary: masked training was set to “no”, which trains on the whole image instead of prioritizing the masked part, i.e. just the head. The loss value dropped significantly after enabling it; see the last example in image 9. Finally, the trained heads were merged into the destination video file, producing a result video in which the source head mimics all of the facial expressions of the destination video.
Side by Side (side-by-side.mp4)