Introduction to the FaceNet Architecture


Facial recognition technology has been around for a while now. One of the most significant breakthroughs in facial recognition technology, contributing to its widespread application, is the development of FaceNet. FaceNet is a deep learning model that can accurately identify and verify faces. In this article, we will explore the FaceNet architecture, its core components, and its different variations.

What is FaceNet?

FaceNet is a neural network-based system that uses deep learning to recognize faces. Developed by Google researchers in 2015, the system uses a convolutional neural network (CNN) to map face images into a compact Euclidean space where distances directly correspond to a measure of face similarity. Two faces are then compared by measuring the distance between their points in this space: a small distance suggests the same person, a large one suggests different people.

FaceNet Architecture

In a nutshell, FaceNet uses a deep network that learns and extracts various facial features. These features are then mapped to a 128-dimensional space, where images belonging to the same person are close to each other and far from images belonging to different subjects.
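To make the idea of "close" and "far" concrete, here is a minimal sketch of face verification on top of embeddings. It assumes the embeddings have already been produced by a trained model; the random vectors below only stand in for real outputs, and the function names and the threshold value are illustrative assumptions (in practice the threshold is tuned on validation data).

```python
import numpy as np

def squared_l2_distance(emb_a, emb_b):
    """Squared Euclidean distance between two embedding vectors."""
    diff = np.asarray(emb_a, dtype=float) - np.asarray(emb_b, dtype=float)
    return float(np.dot(diff, diff))

def same_person(emb_a, emb_b, threshold=1.1):
    """Accept the pair as the same identity if the embeddings are close.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return squared_l2_distance(emb_a, emb_b) < threshold

# Toy 128-dimensional unit vectors standing in for real model outputs.
rng = np.random.default_rng(0)
anchor = rng.normal(size=128)
anchor /= np.linalg.norm(anchor)

# A slightly perturbed copy of the anchor plays the "same person" role.
same = anchor + 0.02 * rng.normal(size=128)
same /= np.linalg.norm(same)

# An unrelated random direction plays the "different person" role.
other = rng.normal(size=128)
other /= np.linalg.norm(other)

print(same_person(anchor, same), same_person(anchor, other))
```

With unit-length embeddings the squared distance always lies between 0 and 4, which is what makes a single fixed threshold workable.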

The core components of this architecture are briefly described below.

[Figure: FaceNet architecture diagram]


Deep neural network: the original FaceNet paper evaluated two families of backbone, one based on the Zeiler & Fergus architecture and one based on Inception (GoogLeNet) modules; the Inception variants are the ones most often associated with FaceNet. A core idea of Inception is the use of 1x1 filters for dimensionality reduction along the channel axis (for example, converting a 256x256x3 RGB image to a 256x256x1 feature map). Filters of other sizes (e.g., 3x3) are applied to the same input in parallel and their outputs are concatenated; this building block is called an “inception module”. Nowadays, many other backbone networks can be used in place of Inception, with ResNet being a common choice.
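As a rough illustration of how a 1x1 filter reduces dimensionality, the sketch below applies one to a toy input with plain NumPy. The function name and the averaging weights are assumptions made for the example; a real network learns its weights and would use a deep learning framework.

```python
import numpy as np

def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution to an (H, W, C_in) feature map.

    weights has shape (C_in, C_out). Every pixel's channel vector is
    multiplied by the same matrix, so H and W stay unchanged and only
    the number of channels changes.
    """
    return np.tensordot(feature_map, weights, axes=([2], [0]))

image = np.ones((256, 256, 3))        # toy RGB-shaped input
weights = np.full((3, 1), 1.0 / 3.0)  # hypothetical filter: average the channels
reduced = conv1x1(image, weights)

print(reduced.shape)  # (256, 256, 1)
```

The spatial resolution is preserved; only the channel count shrinks, which is why 1x1 filters are a cheap way to reduce the cost of the larger filters that follow.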

L2 Normalization: the output vector is divided by its L2 norm, also known as the Euclidean norm. The L2 norm is the distance of the vector from the origin in n-dimensional space, computed as the square root of the sum of its squared components. After normalization every output has unit length, so all embeddings lie on the surface of a hypersphere.
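A minimal NumPy sketch of this step follows. The function name and the small `eps` guard against division by zero are our own additions for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each vector (last axis) to unit Euclidean length."""
    x = np.asarray(x, dtype=float)
    # Square root of the sum of squared components, per vector.
    norm = np.sqrt(np.sum(x ** 2, axis=-1, keepdims=True))
    # eps avoids dividing by zero for an all-zero vector.
    return x / np.maximum(norm, eps)

v = np.array([3.0, 4.0])   # L2 norm is sqrt(9 + 16) = 5
unit = l2_normalize(v)     # [0.6, 0.8]
print(unit, np.linalg.norm(unit))
```

Because the function works along the last axis, the same call normalizes a whole batch of embeddings at once.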

Embeddings: the normalized output vector is the embedding of the input face in the feature space. An embedding can be seen as a “summary” of an identity. Using such a compact representation (128 values rather than the full image) is crucial for tackling the face recognition task effectively.

Loss Function: this is one of the differentiating features of FaceNet, especially when compared to standard Siamese architectures. The next section is dedicated entirely to FaceNet’s loss function: the triplet loss.

Triplet Loss

The triplet loss function is the key to training the FaceNet model. The loss function compares the distance between an anchor image and a positive image of the same person with the distance between the anchor image and a negative image of a different person. The goal is to minimize the distance between the anchor and the positive while maximizing the distance between the anchor and the negative. More precisely, a triplet stops contributing to the loss once the negative is farther from the anchor than the positive by at least a fixed margin.
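The description above can be sketched in a few lines of NumPy. This is an illustrative version operating on precomputed embeddings, not a training-ready implementation: it uses squared Euclidean distances, averages over the batch (summing is an equally valid convention), and the default margin of 0.2 is an assumption for the example.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean triplet loss over a batch of embedding triplets.

    Each argument has shape (..., D). A triplet contributes
    max(d(a, p) - d(a, n) + margin, 0), where d is the squared
    Euclidean distance.
    """
    anchor, positive, negative = (
        np.asarray(t, dtype=float) for t in (anchor, positive, negative)
    )
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(d_ap - d_an + margin, 0.0)))

a = np.array([1.0, 0.0])
p = np.array([0.6, 0.8])   # d(a, p) = 0.16 + 0.64 = 0.8
n = np.array([0.0, 1.0])   # d(a, n) = 1 + 1 = 2.0

print(triplet_loss(a, p, n))  # negative already far enough: loss is 0
print(triplet_loss(a, n, p))  # roles swapped: 2.0 - 0.8 + 0.2 = 1.4
```

The first call shows a triplet the network has nothing left to learn from, while the second shows a violated triplet that produces a positive loss.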

[Figure: FaceNet triplet loss]

The selection of triplets (the anchor, positive, and negative images given as input to the loss function) is crucial for the effectiveness of the training. The triplets should be carefully selected to ensure that the network learns to produce highly discriminative features for facial recognition. One approach is to use semi-hard triplets, which are triplets where the negative is farther from the anchor than the positive, but closer than the positive plus the chosen margin.

d(a, p) < d(a, n) < d(a, p) + margin

Another approach is to use hard triplets, which are triplets where the negative image is closer to the anchor image than the positive image. 

d(a, n) < d(a, p)
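The two inequalities can be turned into a small helper that labels a triplet from its two distances. This is a sketch under our own conventions: the function name is hypothetical, "easy" is our label for triplets whose negative already lies beyond the margin, and ties (d(a, n) equal to d(a, p)) are treated as semi-hard.

```python
def triplet_category(d_ap, d_an, margin=0.2):
    """Classify a triplet given d(a, p) and d(a, n)."""
    if d_an < d_ap:
        # Negative is closer to the anchor than the positive.
        return "hard"
    if d_an < d_ap + margin:
        # Negative is farther than the positive but still inside the margin.
        return "semi-hard"
    # Negative is beyond the margin: the triplet loss is already zero.
    return "easy"

print(triplet_category(0.5, 0.3))  # hard
print(triplet_category(0.5, 0.6))  # semi-hard
print(triplet_category(0.5, 0.9))  # easy
```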

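Hard triplets are more challenging to learn from but can be useful for improving the robustness of the network to variations in lighting, pose, and expression. The FaceNet paper notes, however, that always picking the hardest negatives can lead to bad local minima early in training and, in the worst case, to a collapsed model, which is one motivation for semi-hard selection. Triplet selection is therefore an important aspect of training FaceNet and requires balancing the desired performance against the available resources (computing power and time).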
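To give a feel for the cost involved, here is a minimal sketch of semi-hard triplet mining inside a single batch. It assumes the embeddings and identity labels for the batch are already available; the function names are hypothetical and the brute-force double loop is chosen for readability, not speed.

```python
import numpy as np

def pairwise_sq_dists(emb):
    """Matrix of squared Euclidean distances between all rows of emb."""
    diff = emb[:, None, :] - emb[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def mine_semi_hard(emb, labels, margin=0.2):
    """Return (anchor, positive, negative) index triplets that are semi-hard."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    dists = pairwise_sq_dists(emb)
    triplets = []
    n = len(labels)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # positive must be another image of the same identity
            d_ap = dists[a, p]
            # Negatives: different identity, farther than the positive,
            # but still inside the margin.
            mask = (
                (labels != labels[a])
                & (dists[a] > d_ap)
                & (dists[a] < d_ap + margin)
            )
            candidates = np.flatnonzero(mask)
            if candidates.size:
                # Keep the closest qualifying negative for this pair.
                neg = candidates[np.argmin(dists[a, candidates])]
                triplets.append((a, p, int(neg)))
    return triplets

emb = np.array([[0.0, 0.0], [0.5, 0.0], [0.6, 0.0], [2.0, 0.0]])
labels = np.array([0, 0, 1, 1])
print(mine_semi_hard(emb, labels))  # [(0, 1, 2)]
```

In this toy batch only one anchor-positive pair has a negative inside the semi-hard band, which shows why larger batches with several images per identity are usually needed to find enough useful triplets.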

[Figure: Triplet selection]

FaceNet Variations

Since the original development of FaceNet, several variations of the architecture have been proposed. One such variation is MobileFaceNet, which is designed to work with fewer parameters so that it can run on mobile devices with limited computational resources.

Another variation is SphereFace, which uses a different loss function than the triplet loss. Rather than comparing anchor, positive, and negative images, SphereFace trains a classifier with an angular softmax (A-Softmax) loss, which imposes a multiplicative margin on the angle between a feature and the weight vector of its class. This pulls features of the same identity into a tighter angular region and pushes different identities apart. The objective of SphereFace is to improve face recognition performance for open-set use cases: namely, those scenarios where the people to identify in production are not represented in your training data.
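SphereFace's published loss, A-Softmax, replaces the target-class logit cos(θ) of a normalized softmax classifier with a function ψ(θ) of the angle θ between a feature and its class weight vector. The sketch below assumes the piecewise definition ψ(θ) = (-1)^k cos(mθ) - 2k for θ in [kπ/m, (k+1)π/m], which keeps the function decreasing over the whole range from 0 to π. The function name and the default m = 4 are illustrative choices, and the rest of the loss (feature norm scaling, softmax, cross-entropy) is left out.

```python
import numpy as np

def angular_margin_logit(theta, m=4):
    """Target-class logit with a multiplicative angular margin m.

    theta is the angle (in radians, between 0 and pi) between a feature
    and its class weight vector. With m = 1 this reduces to cos(theta).
    """
    theta = np.asarray(theta, dtype=float)
    # Index of the interval [k*pi/m, (k+1)*pi/m] that contains theta.
    k = np.clip(np.floor(theta * m / np.pi), 0, m - 1)
    sign = np.where(k % 2 == 0, 1.0, -1.0)
    return sign * np.cos(m * theta) - 2.0 * k

angles = np.linspace(0.0, np.pi, 200)
plain = np.cos(angles)
with_margin = angular_margin_logit(angles, m=4)

# The margin version is never larger than the plain cosine, so a feature
# must sit at a smaller angle to its class to reach the same score.
print(bool(np.all(with_margin <= plain + 1e-9)))
```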


Facial recognition technology has come a long way since its inception, and FaceNet is a significant breakthrough in this field. The FaceNet architecture, with its triplet loss function, has proven to be highly accurate in recognizing and verifying faces: the original paper reported 99.63% accuracy on the Labeled Faces in the Wild benchmark. With variations like MobileFaceNet and SphereFace, the approach has become more versatile and adaptable to different use cases.
