# Loss Function Search for Face Recognition

## Presented by

Jan Lau, Anas Mahdi, Will Thibault, Jiwon Yang

## Introduction

Face recognition is a technology that assigns a face to a specific identity. The process involves two tasks: (1) identification, which classifies a face as a certain identity, and (2) verification, which determines whether two faces map to the same identity. A loss function measures how well a model's predictions fit the given data. In face recognition, loss functions are used to train convolutional neural networks (CNNs) to learn discriminative features. The softmax function converts the outputs of the last fully connected layer into a vector of class probabilities, each between 0 and 1 and summing to 1. Cross-entropy loss is the negative log of the probability assigned to the true class. Combining softmax probability with cross-entropy loss in the last fully connected layer of the CNN yields the softmax loss function:

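$L_1=-\log\frac{e^{w^T_yx}}{e^{w^T_yx} + \sum_{k \neq y}^K{e^{w^T_kx}}}$

Here $x$ is the feature vector, $w_k$ is the weight vector of class $k$, $y$ is the true class, and $K$ is the number of classes. Specifically for face recognition, $L_1$ is modified such that $w_y$ and $x$ are normalized, so that $w^T_yx$ reduces to the cosine of the angle $\theta_{w_y,x}$ between them, and $s$ is a scale parameter that replaces the magnitude of $w^T_yx$:

$L_2=-\log\frac{e^{s \cos{(\theta_{{w_y},x})}}}{e^{s \cos{(\theta_{{w_y},x})}} + \sum_{k \neq y}^K{e^{s \cos{(\theta_{{w_k},x})}}}}$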

This function is crucial in face recognition because it is used for enhancing feature discrimination. While there are different variations of the softmax loss function, they build upon the same structure as the equation above. Some of these variations will be discussed in detail in the later sections.
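As a concrete illustration of $L_1$ and $L_2$, the following is a minimal pure-Python sketch. The function names, the toy vectors, and the default scale `s=30.0` are assumptions made for this example, not values taken from the paper.

```python
import math


def softmax_loss(logits, y):
    """Cross-entropy of the softmax over `logits` for true class index `y` (L_1)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[y]


def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalized_softmax_loss(weights, x, y, s=30.0):
    """L_2: softmax loss on the scaled cosines s * cos(theta_{w_k, x})."""
    logits = [s * cosine(w_k, x) for w_k in weights]
    return softmax_loss(logits, y)


# Toy example: three class weight vectors and a feature close to class 0.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [0.9, 0.1]
loss_correct = normalized_softmax_loss(weights, x, y=0)  # small loss
loss_wrong = normalized_softmax_loss(weights, x, y=1)    # large loss
```

Because both $w_k$ and $x$ are normalized inside `cosine`, rescaling `x` leaves the loss unchanged; only the angles and the scale `s` matter.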

In this paper, the authors first identified that reducing the softmax probability is key to feature discrimination, and then designed two search methods on that basis (random and reward-guided). They evaluated the resulting Random-Softmax and Search-Softmax approaches against other face recognition algorithms on nine popular face recognition benchmarks.

## Previous Work

Margin-based softmax loss functions (with angular, additive, or additive angular margins) are important for learning discriminative features in face recognition. Hand-crafted methods such as A-Softmax, V-Softmax, AM-Softmax, and Arc-Softmax have been developed previously, but they require much design effort. Li et al. proposed AM-LFS, an AutoML method for loss function search, from a hyper-parameter optimization perspective. It automatically searches for loss functions during the training process by leveraging reinforcement learning, though its drawback is a complex and unstable search space.

## Motivation

Previous algorithms for facial recognition frequently rely on CNNs trained with metric learning loss functions such as contrastive loss or triplet loss. Without sensitive sample mining strategies, the computational cost of these functions is high. This drawback prompted the redesign of the classical softmax loss, which on its own cannot learn sufficiently discriminative features. Multiple softmax loss variants have since been developed, including margin-based formulations, but they often require fine-tuning of parameters and are susceptible to instability. Researchers therefore need to put a lot of effort into crafting a method within a large design space. AM-LFS takes an optimization approach to selecting hyperparameters for the margin-based softmax functions, but its aforementioned drawbacks are caused by the lack of direction in designing the search space.

To solve the issues associated with hand-tuned softmax loss functions and AM-LFS, the authors attempt to reduce the softmax probability to improve feature discrimination when using margin-based softmax loss functions. Formulating the margin-based softmax loss with only one required parameter, together with an improved search space explored with a reward-based method, allows the authors to determine the best option for their loss function.

## Problem Formulation

### Analysis of Margin-based Softmax Loss

Based on the softmax probability $p$ and the margin-based softmax probability $p_m$, the following relation can be derived:

$p_m=\frac{1}{ap+(1-a)}*p$

where $a=1-e^{s[\cos{(\theta_{w_y,x})}-f{(m,\theta_{w_y,x})}]}$ and $a \le 0$.

$a$ is considered a modulating factor and $h{(a,p)}=\frac{1}{ap+(1-a)} \in (0,1]$ is a modulating function. Because $h(a,p) \le 1$, we have $p_m \le p$ regardless of the margin function $f$: every margin-based softmax loss works by reducing the softmax probability, and it is this reduction that drives feature discrimination.
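The relation between $p$, $p_m$ and $a$ can be checked numerically. The sketch below assumes an additive margin $f(m,\theta)=\cos\theta-m$ (the form used by AM-Softmax); the scale, margin and cosine values are made-up illustrative numbers, and the function names are hypothetical.

```python
import math


def modulate(p, a):
    """p_m = h(a, p) * p with h(a, p) = 1 / (a * p + (1 - a)), for a <= 0."""
    return p / (a * p + (1.0 - a))


def softmax_prob(target_logit, other_logits):
    """Probability of the target class given its logit and the other logits."""
    num = math.exp(target_logit)
    return num / (num + sum(math.exp(z) for z in other_logits))


s, m = 16.0, 0.35                      # illustrative scale and margin
cos_y, cos_others = 0.7, [0.5, 0.1, -0.2]
others = [s * c for c in cos_others]

p = softmax_prob(s * cos_y, others)            # plain softmax probability
p_m = softmax_prob(s * (cos_y - m), others)    # margin applied to the target logit
a = 1.0 - math.exp(s * (cos_y - (cos_y - m)))  # a = 1 - e^{s[cos - f]}

# modulate(p, a) reproduces p_m, and p_m is smaller than p.
```

With $a=0$ the modulating function is 1 and the plain softmax probability is recovered; the more negative $a$ is, the more the probability is reduced.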

Compared to AM-LFS, this method involves only one parameter ($a$), which is also constrained, whereas AM-LFS has $2M$ unconstrained parameters that specify the piecewise linear functions the method requires. Also, the piecewise linear functions of AM-LFS ($p_m={a_i}p+b_i$) may not be discriminative because the resulting $p_m$ could be larger than the softmax probability.

### Random Search

The unified formulation $L_5$ is obtained by inserting the simple modulating function $h{(a,p)}=\frac{1}{ap+(1-a)}$ into the original softmax loss. It can be written as:

$L_5=-\log{(h{(a,p)}*p)}$ where $h \in (0,1]$ and $a \le 0$

To validate the unified formulation, the modulating factor $a$ is randomly set at each training epoch. This is referred to as Random-Softmax in the paper.
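A minimal sketch of the Random-Softmax idea follows. The summary does not say which distribution $a$ is drawn from, so uniform sampling on `[low, 0]` and the bound `low=-10.0` are assumptions for illustration only, as are the function names and the fixed probability `p`.

```python
import math
import random


def unified_loss(p, a):
    """L_5 = -log(h(a, p) * p) for a softmax probability p and a <= 0."""
    if a > 0:
        raise ValueError("the modulating factor must satisfy a <= 0")
    h = 1.0 / (a * p + (1.0 - a))
    return -math.log(h * p)


def sample_modulating_factor(rng, low=-10.0):
    """Draw a uniformly from [low, 0]; `low` is a hypothetical bound."""
    return rng.uniform(low, 0.0)


rng = random.Random(0)
p = 0.8  # softmax probability of the true class for some sample
for epoch in range(3):
    a = sample_modulating_factor(rng)  # a new modulating factor at every epoch
    loss = unified_loss(p, a)
    print(f"epoch {epoch}: a={a:.3f}, loss={loss:.4f}")
```

For any $a<0$ the loss is larger than the plain softmax loss $-\log p$, which is what pushes the network towards more discriminative features.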

### Reward-Guided Search

Unlike supervised learning, reinforcement learning (RL) is a behavioral learning model. It does not need labelled input/output pairs, and sub-optimal actions do not need to be explicitly corrected. Instead, the algorithm receives feedback from the data and uses it to work towards the best outcome. The system has an agent that guides the process by taking actions that maximize the cumulative reward. The RL process is shown in Figure 1. The cumulative reward is defined as:

$G_t \overset{\Delta}{=} R_t+R_{t+1}+R_{t+2}+⋯+R_T$

where $G_t$ is the cumulative reward from time $t$, $R_t$ is the immediate reward at time $t$, and $R_T$ is the reward at the final step $T$ of the episode.

$G_t$ is the sum of immediate rewards from an arbitrary time $t$ onward. It is a random variable because it depends on the immediate rewards, which in turn depend on the agent's actions and the environment's reaction to them.

Figure 1: Reinforcement Learning scenario
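As a small worked example of the cumulative reward $G_t$, the sketch below sums the rewards of one finished episode from each time step onward. The reward values are made up for illustration.

```python
def cumulative_reward(rewards, t):
    """G_t = R_t + R_{t+1} + ... + R_T for a finished episode."""
    return sum(rewards[t:])


episode_rewards = [1.0, 0.0, 2.0, 3.0]  # made-up R_0 ... R_T
returns = [cumulative_reward(episode_rewards, t) for t in range(len(episode_rewards))]
# returns == [6.0, 5.0, 5.0, 3.0]
```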

The reward function is what guides the agent in a certain direction. As mentioned above, the system receives feedback from the data to achieve the best outcome: the reward is adjusted based on the feedback received when a task is completed.

In this paper, RL is used to learn the distribution $g(a;\mu,\sigma)$ from which the modulating factor $a$ of the softmax loss is sampled. The hyperparameter $\mu$ of this distribution is updated after each epoch $e$ using the rewards $R(a_i)$ of the $B$ sampled candidates $a_i$, with step size $\eta$:

$\mu_{e+1}=\mu_e + \eta \frac{1}{B} \sum_{i=1}^B R{(a_i)}{\nabla_a}\log{(g(a_i;\mu,\sigma))}$
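The update of $\mu$ can be sketched as follows. Two assumptions are made here that the text above does not confirm: $g$ is taken to be a Gaussian density with mean $\mu$ and standard deviation $\sigma$, and the gradient of $\log g$ is taken with respect to $\mu$, as in the standard REINFORCE estimator. The function name and the numbers are hypothetical.

```python
def update_mu(mu, sigma, samples, rewards, lr):
    """One reward-weighted update of mu from B sampled candidates."""
    batch = len(samples)
    # Assumption: for a Gaussian, d/d(mu) log g(a; mu, sigma) = (a - mu) / sigma**2.
    grad = sum(r * (a - mu) / sigma ** 2 for a, r in zip(samples, rewards)) / batch
    return mu + lr * grad


# Two candidates around mu = -2; only the first one earns a reward,
# so mu moves towards it.
mu_next = update_mu(mu=-2.0, sigma=1.0, samples=[-1.0, -3.0], rewards=[1.0, 0.0], lr=0.1)
```

Candidates with higher reward pull $\mu$ towards themselves, so the sampling distribution drifts towards modulating factors that perform well on validation data.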

### Optimization

Calculating the reward involves a standard bi-level optimization problem over the sampled hyperparameters $\{a_1,a_2,\dots,a_B\}$: the inner level minimizes the training loss with respect to the model weights, while the outer level maximizes the reward with respect to $a$:

$\max_a R(a)=r(M_{w^*(a)},S_v)$

$w^*(a)=\arg\min_w \sum_{(x,y) \in S_t} L^a (M_w(x),y)$

In this case, the loss function is evaluated on the training set $S_t$ and the reward function on the validation set $S_v$. The weights $w$ are trained to minimize the loss, while $a$ is chosen to maximize the reward. At each epoch, the reward calculated for each of the $B$ candidate models $\{M_{w_e^1},M_{w_e^2},\dots,M_{w_e^B}\}$ gives its score, and the algorithm selects the model with the highest score. That model is carried into the next epoch, and the process is repeated until training converges. At the end, the algorithm takes the model with the highest score without retraining.
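The search procedure described above can be sketched as a loop. This is an illustration under stated assumptions, not the authors' implementation: candidates are drawn from a Gaussian and clipped to $a \le 0$, the update of $\mu$ uses the Gaussian score function, and `toy_train` and `toy_reward` are placeholders standing in for one epoch of CNN training and for validation accuracy.

```python
import random


def search_epoch(weights, mu, sigma, batch, train_one_epoch, reward_fn, rng, lr):
    """Run one epoch of reward-guided search and return (best model, new mu, best reward)."""
    # Sample B candidate modulating factors; clipping to a <= 0 is an assumption.
    candidates = [min(0.0, rng.gauss(mu, sigma)) for _ in range(batch)]
    # Inner level: train one model per candidate from the same starting weights.
    models = [train_one_epoch(weights, a) for a in candidates]
    # Outer level: score each trained model on the validation set.
    rewards = [reward_fn(model) for model in models]
    # Reward-weighted update of mu (assumed Gaussian score function).
    grad = sum(r * (a - mu) / sigma ** 2 for a, r in zip(candidates, rewards)) / batch
    # Keep the best-scoring model for the next epoch.
    best = max(range(batch), key=lambda i: rewards[i])
    return models[best], mu + lr * grad, rewards[best]


# Toy stand-ins: a "model" just remembers the a it was trained with,
# and the reward peaks at a = -2.
def toy_train(weights, a):
    return {"a": a}


def toy_reward(model):
    return 1.0 / (1.0 + (model["a"] + 2.0) ** 2)


rng = random.Random(0)
weights, mu = {"a": 0.0}, -1.0
for epoch in range(5):
    weights, mu, best_reward = search_epoch(
        weights, mu, sigma=0.5, batch=4,
        train_one_epoch=toy_train, reward_fn=toy_reward, rng=rng, lr=0.1,
    )
```

The model returned by the last epoch is used directly, which mirrors the point that no retraining is needed once the search has converged.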

## Results and Discussion

### Results on LFW, SLLFW, CALFW, CPLFW, AgeDB, CFP

On LFW, there is no noticeable difference between the algorithms proposed in this paper and the other algorithms. However, AM-Softmax achieved higher results than Search-Softmax, and Random-Softmax achieved the highest results, by 0.03%.

Random-Softmax outperforms the baseline softmax and is comparable to most of the margin-based softmax losses. Search-Softmax boosts performance further and beats most methods; specifically, when trained on the CASIA-WebFace-R data set, it achieves a 0.72% average improvement over AM-Softmax. The model proposed in the paper gives better results because its optimization strategy helps boost the discrimination power, and because candidates sampled from the proposed search space can approximate the margin-based loss functions well. The improvement on these test sets is small because they are relatively simple and the performance of all methods on them is near saturation, so tests on more complicated protocols are needed to assess performance further.

Table 1. Verification performance (%) of different methods on the test sets LFW, SLLFW, CALFW, CPLFW, AgeDB and CFP. The training set is CASIA-WebFace-R.

### Results on RFW

The RFW dataset measures racial bias and consists of four subsets: Caucasian, Indian, Asian, and African. Using this as the test set, Random-Softmax and Search-Softmax performed better than the other methods. Random-Softmax outperforms the baseline softmax by a large margin, which indicates that reducing the softmax probability enhances feature discrimination for face recognition. The reward-guided Search-Softmax method is also observed to be more likely to enhance discriminative feature learning, resulting in higher performance, as shown in Table 2 and Table 3.

Table 2. Verification performance (%) of different methods on the test set RFW. The training set is CASIA-WebFace-R.

Table 3. Verification performance (%) of different methods on the test set RFW. The training set is MS-Celeb-1M-v1c-R.

### Results on MegaFace and Trillion-Pairs

The different loss functions are tested again with more complicated protocols. On MegaFace, the identification (Id.) Rank-1 accuracy and the verification (Veri.) true positive rate (TPR) at a low false acceptance rate (FAR) of $1e-3$ are reported. On Trillion-Pairs, the identification TPR@FAR = $1e-6$ and the verification TPR@FAR = $1e-9$ are reported. The results are given in Tables 4 and 5.
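One common way to compute TPR at a fixed FAR from similarity scores is sketched below. The exact evaluation protocols of MegaFace and Trillion-Pairs are not reproduced here; the function name, the strict-threshold convention and the toy scores are assumptions for illustration.

```python
def tpr_at_far(genuine_scores, impostor_scores, far):
    """TPR at the threshold that lets through at most `far` of the impostor pairs."""
    ranked = sorted(impostor_scores, reverse=True)
    allowed = int(far * len(ranked))  # number of tolerated false accepts
    # A pair is accepted only if its score is strictly above the threshold.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for score in genuine_scores if score > threshold)
    return accepted / len(genuine_scores)


genuine = [0.55, 0.65, 0.75, 0.90]                   # same-identity pairs
impostor = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # different-identity pairs
tpr = tpr_at_far(genuine, impostor, far=0.25)        # threshold 0.6, so 3 of 4 genuine pairs pass
```

At a FAR as low as $1e-9$, the threshold is set by only a handful of the hardest impostor pairs, which is why such protocols need very large numbers of pairs and separate methods more clearly than the near-saturated verification sets.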

On the test sets MegaFace and Trillion-Pairs, Search-Softmax achieves the best performance among all the alternative methods. On MegaFace, Search-Softmax beats the best competitor, AM-Softmax, by a large margin. It also outperforms AM-LFS owing to the newly designed search space.

Table 4. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is CASIA-WebFace-R.

Table 5. Performance (%) of different loss functions on the test sets MegaFace and Trillion-Pairs. The training set is MS-Celeb-1M-v1c-R.

From the CMC curves and ROC curves in Figure 2, similar trends are observed at other measures. The same trend holds on Trillion-Pairs, where Search-Softmax loss is superior, with 4% improvements when trained on CASIA-WebFace-R and 1% improvements when trained on MS-Celeb-1M-v1c-R in both identification and verification. Based on these experiments, Search-Softmax loss performs well, especially at a low false positive rate, and shows strong generalization ability for face recognition.

## Conclusion

The paper argues that reducing the softmax probability is the key to enhancing feature discrimination for face recognition. To achieve this, a unified formulation for the margin-based softmax losses is designed. Two loss function search methods, one random and one reward-guided, were developed on top of it, and they were validated to be effective against six other methods on nine different test data sets.

## Critiques

• Thorough experimentation and comparison of results to the state of the art provided a convincing argument.
• The datasets used required some preprocessing, which may have improved the results beyond what the method would otherwise achieve.
• AM-LFS was re-implemented by the authors for experimentation (the code was not made public), so the comparison may not be accurate.
• The test data sets used to evaluate Search-Softmax and Random-Softmax are simple, and other methods already saturate on them. As a result, the proposed methods showed little advantage, since all methods produce very similar results. More complicated data sets need to be tested to prove the method's reliability.