Facial recognition has become an increasingly important technology in various fields, from security to social media applications. Two common deep-learning architectures used in facial recognition are FaceNet and the Siamese network. Both are trained on large datasets of labeled images and learn to encode the unique features of each face into a lower-dimensional feature space. In this blog post, we will explore the main differences between Siamese networks and FaceNet, with a particular emphasis on how they are trained.
FaceNet is a deep learning algorithm designed for face recognition that uses a specific type of neural network called a Convolutional Neural Network (CNN). The purpose of FaceNet is to take an input image of a person's face and map it to a lower-dimensional feature space, where each point in the space (a vector of numbers) corresponds to a specific face. The network is trained on a large dataset of face images and learns to capture the unique features of each face in a vector computed by a so-called feature extractor (ResNet is a commonly used choice). Once a face is mapped to the feature space, it can be compared to other faces in the space to determine whether they belong to the same person.
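To make this concrete, here is a minimal sketch of the comparison step. The `embed` function is a hypothetical stand-in for the trained CNN feature extractor, and the distance threshold is illustrative, not a tuned value:

```python
import math

def embed(image):
    """Placeholder for the trained CNN feature extractor.
    In a real system this would run the image through FaceNet
    and return its embedding vector; here it is hypothetical."""
    ...

def euclidean_distance(a, b):
    """Distance between two embeddings in the feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_person(emb1, emb2, threshold=1.0):
    """Two faces are judged to be the same identity when their
    embeddings lie within `threshold` of each other."""
    return euclidean_distance(emb1, emb2) < threshold
```

Recognition then reduces to a nearest-neighbour check: embeddings of the same person should sit close together, so a simple distance threshold separates "same" from "different".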
One of the key characteristics of FaceNet is its loss function, known as “triplet loss”, so named because it compares three images (a “triplet”) at a time.
The three images are an "anchor" image, a "positive" image, and a "negative" image. The anchor image is a picture of a face that the network needs to identify, the positive image is another image of the same person, and the negative image is an image of a different person.
The network is trained to learn to map the anchor image and the positive image to points that are close together in the feature space while mapping the anchor image and negative image to points that are far apart. The goal is to minimize the distance between the anchor and positive images while maximizing the distance between the anchor and negative images.
The triplet loss function is used to calculate the distance between the anchor, positive, and negative images in the feature space. If the network's predictions are correct and the anchor and positive images are close together while the anchor and negative images are far apart, the loss function will have a small value, indicating that the network is performing well. If the network's predictions are incorrect and the anchor and negative images are closer together than the anchor and positive images, the loss function will have a larger value, indicating that the network needs to adjust its weights to improve its performance.
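The idea above can be sketched in a few lines of plain Python. Note that the published triplet loss also includes a margin term, which forces the negative to be at least a fixed amount farther away than the positive before the loss drops to zero; the margin value below is illustrative:

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for one (anchor, positive, negative) triplet.
    It is zero when the negative is farther from the anchor than
    the positive by at least `margin`, and grows as the negative
    creeps closer than that."""
    return max(0.0,
               squared_distance(anchor, positive)
               - squared_distance(anchor, negative)
               + margin)
```

During training, the loss is averaged over many triplets and its gradient nudges the network's weights so that same-identity pairs move together and different-identity pairs move apart.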
Siamese architecture is a type of deep learning architecture that has been used in various facial recognition applications. In a nutshell, a siamese network learns by analyzing how similar or dissimilar two pictures (“a pair”) are.
A Siamese architecture has two identical neural networks that work together to compare two different images. Each network takes one of the input images and processes it to produce a lower-dimensional vector of that image. The two vectors are then combined (this operation is carried out by an architectural component called a “fully connected classifier”) and a single decimal number between 0 and 1 is returned. If this number is above a pre-determined threshold (e.g. 0.5), the network predicts that the two pictures belong to the same person; otherwise, they are predicted to belong to different identities.
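A minimal sketch of this comparison head follows. It assumes the two embeddings are combined by element-wise absolute difference (one common choice) and scored by a single fully connected unit with a sigmoid; the weights and bias below are illustrative placeholders, not learned values:

```python
import math

def sigmoid(x):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

def siamese_score(emb1, emb2, weights, bias):
    """Combine the two embeddings via element-wise absolute
    difference, then pass the result through a single fully
    connected unit with a sigmoid, yielding a score in (0, 1)."""
    diff = [abs(a - b) for a, b in zip(emb1, emb2)]
    return sigmoid(sum(w * d for w, d in zip(weights, diff)) + bias)

def same_identity(emb1, emb2, weights, bias, threshold=0.5):
    """Predict 'same person' when the score exceeds the threshold."""
    return siamese_score(emb1, emb2, weights, bias) > threshold
```

With negative weights, identical embeddings (zero difference) score near `sigmoid(bias)` while very different embeddings are pushed toward 0, which is the behaviour training would encourage.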
This process is repeated over many training iterations, allowing the algorithm to gradually improve its confidence and its ability to identify people correctly. Because the two networks are identical and share the same weights, the same features are extracted from both images, which helps to improve the accuracy of the comparison.
At the time of writing, FaceNet is considered one of the strongest architectures for face recognition. It is explicitly designed for this problem, while the Siamese architecture is a more general approach that can be used for various image recognition and classification tasks.
In general, FaceNet-like architectures perform better, largely thanks to the triplet loss, which helps the network define a better-separated lower-dimensional space. Because each learning iteration involves both a same-identity and a different-identity example, the network has more information for finding the best mapping for each image.
Siamese networks, on the other hand, cast the recognition problem as a binary classification task and may fail to produce a well-separated lower-dimensional space.