A person's age is expressed in their facial structure, and human vision can estimate it from visual information alone. Can a computer be taught to recognize the age of a person as reliably as a human observer? Age estimation can be framed in two ways. The age can be determined directly from an image, giving an estimate in years; this is the regression problem. Alternatively, a person can be placed in an age group based on the photo; this is the classification problem. In this work, we test and compare both approaches. We train a custom neural network based on the EfficientNet model and evaluate its classification performance. We then compare the results with those of the publicly available regression estimator DeepFace to assess the relative performance of our model.
Principal Component Analysis (PCA) is derived from the Karhunen-Loève transformation. Given an s-dimensional vector representation of each face in a training set of images, PCA finds a t-dimensional subspace whose basis vectors correspond to the directions of maximum variance in the original image space. This new subspace is normally of much lower dimension (t << s). If the image elements are considered as random variables, the PCA basis vectors are defined as the eigenvectors of the scatter matrix.
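The eigenvector definition above can be illustrated with a minimal NumPy sketch. The function name `pca_basis` and the random toy data are assumptions made for illustration; they are not part of the project code, and real face vectors would be far higher-dimensional.

```python
import numpy as np

def pca_basis(X, t):
    """Return the mean and the top-t principal directions of the rows of X.

    X has shape (n_samples, s); the returned basis has shape (s, t).
    """
    mean = X.mean(axis=0)
    centered = X - mean
    # Scatter matrix of the centered data (s x s).
    scatter = centered.T @ centered
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(scatter)
    order = np.argsort(eigvals)[::-1][:t]
    return mean, eigvecs[:, order]

# Toy data (hypothetical): 200 samples with s = 5, most variance along the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([10.0, 3.0, 1.0, 0.5, 0.1])
mean, basis = pca_basis(X, t=2)
projected = (X - mean) @ basis  # t-dimensional representation, shape (200, 2)
```

For image data, s is usually much larger than the number of samples, so practical implementations tend to work with the singular value decomposition of the centered data instead of forming the s x s scatter matrix.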
Independent Component Analysis (ICA) minimizes both second-order and higher-order dependencies in the input data and attempts to find the basis along which the data, when projected onto it, are statistically independent.
Linear Discriminant Analysis (LDA) finds the vectors in the underlying space that best discriminate among classes. For all samples of all classes, the between-class scatter matrix SB and the within-class scatter matrix SW are defined. The goal is to maximize SB while minimizing SW, in other words, to maximize the ratio det|SB|/det|SW|. This ratio is maximized when the column vectors of the projection matrix are the eigenvectors of SW^-1 × SB.
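As a rough sketch of the eigenvector solution described above, the snippet below builds SW and SB for labeled data and takes the leading eigenvectors of SW^-1 × SB. The function name `lda_projection` and the two-class toy data are illustrative assumptions only.

```python
import numpy as np

def lda_projection(X, y, t):
    """Return an (s, t) projection matrix from the leading eigenvectors of SW^-1 SB."""
    s = X.shape[1]
    overall_mean = X.mean(axis=0)
    SW = np.zeros((s, s))  # within-class scatter
    SB = np.zeros((s, s))  # between-class scatter
    for label in np.unique(y):
        Xc = X[y == label]
        class_mean = Xc.mean(axis=0)
        SW += (Xc - class_mean).T @ (Xc - class_mean)
        d = (class_mean - overall_mean).reshape(-1, 1)
        SB += len(Xc) * (d @ d.T)
    # SW^-1 SB is not symmetric in general, so the general eigen-solver is used.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))
    order = np.argsort(eigvals.real)[::-1][:t]
    return eigvecs[:, order].real

# Toy data (hypothetical): two classes separated along the second axis,
# with large within-class spread along the first axis.
rng = np.random.default_rng(0)
class0 = rng.normal(size=(100, 2)) * np.array([5.0, 0.5]) + np.array([0.0, -2.0])
class1 = rng.normal(size=(100, 2)) * np.array([5.0, 0.5]) + np.array([0.0, 2.0])
X = np.vstack([class0, class1])
y = np.array([0] * 100 + [1] * 100)
W = lda_projection(X, y, t=1)
z = X @ W  # one-dimensional discriminant scores, shape (200, 1)
```

Unlike PCA, which would pick the high-variance first axis here, the discriminant direction follows the axis that separates the two classes.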
The face manifold in the subspace need not be linear. Kernel methods are a generalization of linear methods, and direct nonlinear manifold schemes are explored to learn this nonlinear manifold.
Evolutionary Pursuit is an eigenspace-based adaptive approach that searches for the best set of projection axes in order to maximize a fitness function that measures both the classification accuracy and the generalization ability of the system. Because the dimension of the solution space of this problem is too large, it is solved with a specific kind of genetic algorithm, from which the method takes its name.
Neural networks are a subset of machine learning, and they are at the heart of deep learning algorithms. They consist of layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node connects to others and has an associated weight and threshold. If the output of an individual node is above the specified threshold value, that node is activated and sends data to the next layer of the network. Otherwise, no data is passed along to the next layer. Convolutional neural networks are distinguished from other neural networks by their superior performance with image, speech, or audio signal inputs. They have three main types of layers: convolutional layers, pooling layers and fully-connected (FC) layers.
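The three layer types can be sketched in a few lines of NumPy. This is only a toy forward pass under simplifying assumptions (a single channel, one hand-picked 2x2 filter, ReLU standing in for the node threshold, all-ones dense weights); the helper names are hypothetical and it is not how the networks in this project were implemented.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows and columns that do not fit are dropped."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

def dense(x, weights, bias):
    """Fully-connected layer: weighted sum of all inputs plus a bias."""
    return weights @ x + bias

image = np.arange(36, dtype=float).reshape(6, 6)    # toy 6x6 "image"
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])        # toy diagonal-difference filter
features = np.maximum(conv2d(image, kernel), 0.0)   # convolution + ReLU, shape (5, 5)
pooled = max_pool(features)                         # pooling, shape (2, 2)
scores = dense(pooled.ravel(), np.ones((3, 4)), np.zeros(3))  # FC layer, 3 outputs
```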
Facebook Research presents a system called DeepFace that has closed most of the remaining gap on the most popular benchmark in unconstrained face recognition and is now at the brink of human-level accuracy. It is trained on a large dataset of faces acquired from a population vastly different from the one used to construct the evaluation benchmarks, and it outperforms existing systems with only minimal adaptation. The network architecture is based on the assumption that, once alignment is completed, the location of each facial region is fixed at the pixel level. It is therefore possible to learn from the raw RGB pixel values without applying several layers of convolutions, as is done in many other networks.
The OUI-Adience dataset of faces in the wild was used because of its size, the presence of ground-truth labels and its free availability. To improve the performance of our network, we skipped our own preprocessing step and used the aligned images provided in the dataset. Furthermore, only images with approximately frontal alignment were used (Eran Eidinger, Roee Enbar, and Tal Hassner, Age and Gender Estimation of Unfiltered Faces).
The dataset used in this project was taken from here.
This dataset contains 26,580 images portraying 2,284 individuals, classified into 8 age groups (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60+). The images are cropped and aligned to 816x816 px. The dataset is split into 5 folds to allow 5-fold cross-validation in which one fold at a time is held out. To prevent overfitting to individual identities, each fold contains different subjects. Each fold is described by a CSV file with 12 columns.
First, DeepFace was used for age regression. Since it is a regression solver, we used the estimated age of each person to assign subjects to the age buckets.
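from deepface import DeepFace

# metadataFilt, dbPath, gindxs and y_real are assumed to be defined earlier
y_model_noEnforce = []
i = 0
# iterate over all files in the fold
for index, row in metadataFilt.iterrows():
    # construct the filename of the aligned face image
    filename = dbPath + row['user_id'] + '/landmark_aligned_face.' + \
        str(row['face_id']) + '.' + row['original_image']
    # estimate the age; do not raise an error when no face is detected
    # (newer deepface releases may return a list of dicts; use tmp[0]['age'] there)
    tmp = DeepFace.analyze(filename, actions=['age'], enforce_detection=False)
    y_model_noEnforce.append([i, tmp['age']])
    print("Process image (row %d): " % i + filename)
    print("\t model: %d, expected: " % tmp['age'] + gindxs[y_real[i]])
    i = i + 1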
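The step of turning a regressed age into an age bucket is not shown in the listing above. A minimal sketch of one way to do it follows; the function name `age_to_bucket` is hypothetical, and the rule for ages that fall into the gaps between buckets (nearest bucket, ties going to the younger one) is an assumption, not necessarily the rule used in the project.

```python
# Bucket bounds in years, matching the eight Adience age groups.
AGE_BUCKET_BOUNDS = [(0, 2), (4, 6), (8, 12), (15, 20),
                     (25, 32), (38, 43), (48, 53), (60, 100)]

def age_to_bucket(age):
    """Return the index of the bucket containing age, or of the nearest bucket.

    Assumption: ages in a gap (for example 3 or 35) go to the closest bucket,
    and ties are resolved in favor of the younger bucket.
    """
    def distance(bounds):
        low, high = bounds
        if low <= age <= high:
            return 0
        return min(abs(age - low), abs(age - high))

    # min() returns the first index with the smallest distance, hence the tie rule.
    return min(range(len(AGE_BUCKET_BOUNDS)),
               key=lambda k: distance(AGE_BUCKET_BOUNDS[k]))

# Example: convert a list of regressed ages to bucket indices.
predicted_ages = [1, 28, 36, 120]
predicted_buckets = [age_to_bucket(a) for a in predicted_ages]
```

The gap rule matters for the comparison: a regressor can legitimately output an age such as 35, which belongs to none of the labeled groups, so any mapping introduces some arbitrary boundary decisions.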
For the second approach, the OpenCV dnn module was used with the Caffe model by Levi and Hassner.
import os
import cv2

# metadataFilt and dbPath are assumed to be defined earlier
AGE_BUCKETS = ["(0, 2)", "(4, 6)", "(8, 12)",
               "(15, 20)", "(25, 32)", "(38, 43)", "(48, 53)", "(60, 100)"]
# load the network definition and the pre-trained weights
prototxtPath = os.path.sep.join(["/home/jost/dev/ssip2021/models", "deploy_age.prototxt"])
weightsPath = os.path.sep.join(["/home/jost/dev/ssip2021/models", "age_net.caffemodel"])
ageNet = cv2.dnn.readNet(prototxtPath, weightsPath)
y_model_second = []
y_model_second_conf = []
i = 0
# iterate over all files in the fold
for index, row in metadataFilt.iterrows():
    # construct the filename of the aligned face image
    filename = dbPath + row['user_id'] + '/landmark_aligned_face.' + \
        str(row['face_id']) + '.' + row['original_image']
    face = cv2.imread(filename)
    # resize to the 227x227 network input and subtract the per-channel mean (BGR order)
    faceBlob = cv2.dnn.blobFromImage(face, 1.0, (227, 227),
                                     (78.4263377603, 87.7689143744, 114.895847746), swapRB=False)
    # predict age
    ageNet.setInput(faceBlob)
    preds = ageNet.forward()
    # the most probable bucket and its confidence
    j = preds[0].argmax()
    age = AGE_BUCKETS[j]
    ageConfidence = preds[0][j]
    y_model_second.append(age)
    y_model_second_conf.append(ageConfidence)
    # output data
    print("Process image (row %d): " % i + filename)
    #print("\t model: %s, expected: %s" % (age, gindxs[y_real[i]]))
    print("\t model: %s, expected: %s" % (age, metadataFilt['age'][index]))
    i = i + 1
Levi, G., & Hassner, T. (2015). Age and gender classification using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 34-42)
To showcase some of the existing pre-trained models, we integrated DeepFace and an OpenCV-compatible model (Levi and Hassner, see the methods section for more details) into an OpenCV-based live image acquisition GUI built using Jupyter. The OpenCV-based age classification is shown as an overlay on the recorded image, while the more thorough analysis results of the DeepFace model are calculated on demand and shown in the output window on the right.
See the code on GitHub.
For the third approach, a custom network was built on top of the EfficientNet network. A batch normalization layer was added before the EfficientNetB1 network. After it, batch normalization, 2D global max pooling and dropout layers were added before the final fully connected layer, which maps the features detected by EfficientNet to age groups. The definition of the model is given below. The network was initialized in Keras.
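# Imports are an assumption; the original notebook may have used a different Keras package
from tensorflow.keras.applications import EfficientNetB1
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout, GlobalMaxPooling2D
from tensorflow.keras.models import Sequential

# Define the model architecture
base_efficientnet_model = EfficientNetB1(input_shape=(240, 240, 3),
                                         include_top=False, weights='imagenet')
age_model = Sequential()
age_model.add(BatchNormalization(input_shape=(240, 240, 3)))
age_model.add(base_efficientnet_model)
age_model.add(BatchNormalization())
age_model.add(GlobalMaxPooling2D())
age_model.add(Dropout(0.5))
# one softmax output per age group (4 groups in this definition)
age_model.add(Dense(4, activation='softmax'))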
First, we measured the performance of DeepFace and of the pre-trained model so that we could compare them with our custom model.
Figure 1 - DeepFace performance, Figure 2 - Pre-trained model performance
While DeepFace struggled with classification, placing almost every subject into the same age group, the pre-trained model performed well. A concern with the pre-trained model is the dataset used for verification: the authors did not mention whether they used the same dataset for training, in which case the results measured here would be optimistic.
Figure 1 - Custom model performance, Figure 2 - Pre-trained model performance
Our custom 8-group model, based on EfficientNetB1, performed poorly. We attribute this to the unbalanced dataset, in which the (25, 32) group contains the most photos. We then introduced 4 combined groups: 0 - (0, 2) and (4, 6), 1 - (8, 12) and (15, 20), 2 - (25, 32) and (38, 43), 3 - (48, 53) and (60, 100). The model still performed poorly and gave almost the same results as before.
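The merging of the eight original groups into four combined groups can be written as a simple lookup table. The sketch below follows the grouping listed above; the name `COMBINED_GROUPS` and the short list of example labels are made up for illustration and do not reflect the real class distribution.

```python
from collections import Counter

# Original Adience label -> combined group index, as listed in the text.
COMBINED_GROUPS = {
    "(0, 2)": 0, "(4, 6)": 0,
    "(8, 12)": 1, "(15, 20)": 1,
    "(25, 32)": 2, "(38, 43)": 2,
    "(48, 53)": 3, "(60, 100)": 3,
}

# Hypothetical labels for a handful of images.
labels = ["(25, 32)", "(25, 32)", "(0, 2)", "(38, 43)", "(60, 100)", "(8, 12)"]
combined = [COMBINED_GROUPS[label] for label in labels]

# Counting the samples per combined group makes any remaining imbalance visible.
group_counts = Counter(combined)
```

Merging neighboring groups does not remove the imbalance by itself: the dominant (25, 32) group simply makes combined group 2 the dominant one, which is consistent with the nearly unchanged results.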
We attribute the problem to confounding facial features such as makeup, glasses, masks and the position of the face. With longer training and in combination with a few other methods, the model should give better results. In summary, the DeepFace network and our network showed the same issues and produced similar confusion matrices, while the pre-trained model had no such problems and did a decent job.
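For reference, a confusion matrix of the kind compared above can be computed in a few lines. This is a generic sketch with made-up labels, not the evaluation code used in the project; it illustrates the failure mode in which a model places every subject into the same group.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

# Hypothetical example: the model predicts group 2 for every subject.
y_true = [0, 1, 2, 2, 3]
y_pred = [2, 2, 2, 2, 2]
matrix = confusion_matrix(y_true, y_pred, n_classes=4)

# Accuracy is the sum of the diagonal divided by the number of samples.
accuracy = sum(matrix[k][k] for k in range(4)) / len(y_true)
```

A matrix with a single populated column, as in this example, is the signature of a model that has collapsed onto the majority class.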
In this project, we tested different algorithms that predict a person's age. This problem has many possible applications, from determining the age of a buyer at a vending machine to law enforcement and the verification of patient age in health systems where such records are insufficient. We tested the performance of both regression and classification models, which estimate the age directly or assign it to one of a set of predefined age groups, respectively. During the performance evaluation, the classification models proved to be more reliable, correctly predicting the majority of the age groups. Based on the knowledge obtained while exploring the existing methods, we attempted to develop a custom classification model. The custom model was built around an EfficientNet module, which extracts features from the image, followed by additional custom classification layers. Verification of our custom model exposed poor classification performance, with AUC values of 0.5 to 0.51. The measured ability of our network was thus only marginally better than pure guessing. This could be due to many different reasons, including insufficient training caused by the lack of time and resources. More generally, age determination turned out to be a complex problem. Facial age is difficult to evaluate reliably, since many factors besides age can influence the look of a person. Lifestyle, makeup, eyewear and surgical procedures can change the way a person looks. Additionally, classification algorithms cannot account for the effects of the surroundings, such as inhomogeneous illumination, an offset in white balance and partial occlusion of the face by objects (a cellphone, for example). All these factors make the problem more difficult.
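To put the AUC values quoted above into context, the sketch below computes a binary (one group versus the rest) AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name `binary_auc` and the example scores are illustrative assumptions; the project's own evaluation code is not shown here.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    labels contain 1 for the group of interest and 0 for all other groups;
    tied scores count as half a correctly ranked pair.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
informative = binary_auc(labels, [0.1, 0.4, 0.35, 0.8])     # scores carry some signal
uninformative = binary_auc(labels, [0.5, 0.5, 0.5, 0.5])    # same score for everyone
```

A model that assigns the same score to every subject reaches exactly 0.5, which is why values of 0.5 to 0.51 indicate performance close to guessing.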
Last but not least, the selection of an appropriate dataset with high-quality images for initial training could play an important role in training a reliable network, but the search for a good dataset is an ambitious and time-consuming endeavour. In the future, we could try a two-pronged approach: first, train a new neural network on different datasets for a longer period of time, and second, develop a completely different age estimator. A candidate could be an algorithm comprising two stages, where the first stage extracts fiducial points on the face based on the eye, nose, mouth and ear positions. The estimated age, combined with the fiducial point data, could then be fed into a shallow network that would evaluate the person's age. Since the network would be shallow, it would be faster to train and would need less input data to perform well. In summary, we have explored different existing algorithms and models and identified the key difficulties of the age estimation problem. We have proposed a new methodology for assessing a person's age that is based on a combination of facial fiducial points and could be combined with a more traditional network to obtain even better age estimation performance.
The team was organized by dividing the work to best suit the experience and knowledge of each member. Barbara Breš, having experience with building websites, was tasked with preparing the web framework and the final presentation. Dominik Babić took on the role of dataset analyst, searching for appropriate datasets and exploring the tagging metadata. Marin Drabić, being experienced in machine learning, took the role of lead developer: he designed the pipeline, structured the neural network and performed the training. Jošt Stergar, being the oldest member, took on the role of team leader, coordinating the work and taking care of the reports. Additionally, due to his experience with integrating software for image acquisition systems, he created the demo application. All members contributed to the literature overview and the final data analysis.
Researcher, Dataset analyst
Web and Presentation chief, Researcher
Lead developer, Researcher
Team leader, Researcher