Evolution of Face ID Industry: Insights from Online Programme Master of Computer Vision

The online Master’s programme ‘Master of Computer Vision’ recently held a webinar, ‘Face recognition. How does it work and how does it break?’, as part of its admission campaign, with the participation of the programme’s business partner, Huawei. Participants learned about the basic principles of Face ID algorithms, how face recognition technologies differ from other deep learning models, and some of the tricks that can mislead artificial intelligence systems.

The industrial webinar was hosted by Stepan Komkov, Senior Algorithm Engineer in Intelligent Image and Video Processing Lab at the Moscow Research Centre of Huawei. Following the webinar, Andrey Savchenko, Academic Supervisor of the Master of Computer Vision programme, Professor of the Department of Information Systems and Technologies at HSE University in Nizhny Novgorod, answered some questions about Face ID algorithms.

Andrey Savchenko
Photo: data.vk.company

Andrey, tell us about the tasks of machine perception of face images discussed at the seminar.

— Methods of face recognition for photos and videos have been studied for more than half a century. Since the 1990s, competitions have been held for the best commercial and open systems for face verification and identification. The task of verification is to determine whether two photographs show the face of the same person or of two different people. This is the task that is solved, for example, when unlocking a smartphone by facial recognition. The task of identification, in turn, answers the question of which person from a given list is depicted in the photo. Solving this problem helps us to create video surveillance systems, make biometric payments (FacePay), etc.

What methods are used to solve these tasks? 

— Both problems are solved with similar methods: each photo of a face or video frame is associated with a certain descriptor, a sequence of numbers that is similar for photos of the same person but differs significantly for photos of different people. Traditionally, face recognition researchers used hand-picked descriptors, for example, geometric characteristics calculated from quite a large number of key face points.
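The way one descriptor serves both tasks can be illustrated with a minimal sketch. It assumes descriptors are plain lists of numbers compared by Euclidean distance after normalisation; the toy three-number descriptors, the function names, and the threshold of 1.0 are all hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(v):
    """Scale a descriptor to unit length so that distances are comparable."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def distance(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def verify(desc_a, desc_b, threshold=1.0):
    """Verification: do two descriptors belong to the same person?
    The threshold is a hypothetical value; real systems tune it on data."""
    return distance(l2_normalize(desc_a), l2_normalize(desc_b)) < threshold

def identify(query, gallery):
    """Identification: which person in the gallery is closest to the query?
    `gallery` maps a person's name to one reference descriptor."""
    q = l2_normalize(query)
    return min(gallery, key=lambda name: distance(q, l2_normalize(gallery[name])))

# Toy descriptors (real ones typically have hundreds of components).
alice_photo_1 = [0.9, 0.1, 0.0]
alice_photo_2 = [0.8, 0.2, 0.1]
bob_photo = [0.0, 0.2, 0.9]

same_person = verify(alice_photo_1, alice_photo_2)
different_people = verify(alice_photo_1, bob_photo)
who = identify(alice_photo_2, {"alice": alice_photo_1, "bob": bob_photo})
```

Verification compares one pair against a threshold, while identification ranks every entry in the list, which is why the cost of identification grows with the number of enrolled people.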

How accurate were they, and when was the gap between human and machine facial recognition bridged?

— Such descriptors turned out to be extremely unstable: they changed significantly depending on the angle of the image, and accuracy dropped significantly with varying lighting and in the presence of glasses, hats, beards, moustaches, etc. In terms of bridging the gap between human and machine face recognition, we can go back to 2014, when the DeepFace neural network was introduced in the article ‘DeepFace: Closing the Gap to Human-Level Performance in Face Verification’. The network was trained using a specially assembled data set divided into classes (about 4.4 million photos of more than 4,000 people). It was assumed that the feature vector obtained at the output of the penultimate layer of the network is representative and can be used as a descriptor for the verification and identification of other people not represented in the training set. Its quality, although significantly higher than that of traditional counterparts, was still imperfect. A year later, FaceNet was introduced with a new training process, giving a powerful impetus to the growth of the industry. The study introduced the metric learning paradigm into face verification systems.

Why did the article about FaceNet spark an industry boom? 

— The model presented in the article achieved an unprecedented verification accuracy of 99.63% on the traditional LFW (Labeled Faces in the Wild) benchmark. As a result, the article about FaceNet has been cited more than 13,000 times and entered the top ten most popular publications in the field of computer vision. The authors came up with a special loss function, the triplet loss. Training with it optimises not the accuracy of face classification on the training data set, but exactly what is needed, namely the face descriptors extracted by the network: the distances between the feature vectors of different photos of a particular person should be smaller than the distances from those feature vectors to the feature vectors of any other person.
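The idea behind this loss function can be written down for a single triplet of descriptors. The sketch below assumes squared Euclidean distances and a margin of 0.2, as in the FaceNet article; the function names and the toy two-number descriptors are illustrative, and real training averages this quantity over many triplets and backpropagates it through the network.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Loss for one triplet of descriptors.

    anchor and positive come from the same person, negative from another.
    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`; otherwise it grows with the violation.
    """
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )

anchor = [1.0, 0.0]
positive = [0.9, 0.1]        # another photo of the same person
far_negative = [0.0, 1.0]    # a different person, already well separated
near_negative = [0.9, 0.0]   # a different person who is too close

satisfied = triplet_loss(anchor, positive, far_negative)
violated = triplet_loss(anchor, positive, near_negative)
```

Only the second triplet produces a positive loss, so only it would push the network to move the two people's descriptors apart.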

With this work, computer vision researchers realised that, in addition to using larger datasets and deeper neural network models, they could look for better mathematical formulations of the problem. Of course, having big data at the training stage also improves model quality. For example, in 2015, the closed set of face photos used for FaceNet training contained over 200 million images of eight million people. In recent years, datasets of comparable size have been made publicly available. For example, in 2021, Tsinghua University, together with Imperial College London, released the WebFace260M dataset of 260 million photos of four million people. Neural network models trained on it using further developments of the FaceNet ideas, in particular the ArcFace loss function, turn out to be comparable to the commercial analogues of well-known corporations.
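As a rough illustration of how ArcFace develops these ideas, the sketch below adds an angular margin to the true class before an ordinary softmax cross-entropy. It assumes the inputs are cosines between a normalised descriptor and normalised class centres; the scale of 64 and margin of 0.5 follow the defaults reported in the ArcFace article, while the function names are hypothetical. It ignores the special handling that real implementations apply when the angle plus the margin exceeds pi.

```python
import math

def arcface_logits(cosines, target, scale=64.0, margin=0.5):
    """Turn cosine similarities to class centres into ArcFace logits.

    The angle to the true class is increased by `margin` (in radians),
    which makes the true class look less similar during training and
    forces the network to pull same-person descriptors closer together.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding errors
        if j == target:
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    return logits

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

cosines = [0.5, 0.4]  # toy similarities to two class centres
loss_with_margin = cross_entropy(arcface_logits(cosines, target=0), 0)
loss_without_margin = cross_entropy(
    arcface_logits(cosines, target=0, margin=0.0), 0
)
```

With the margin, a descriptor that is only slightly closer to its own class than to another class still receives a large loss, so training keeps separating the classes.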

What problems and tasks do the Face ID industry and researchers of your programme face today? 

— One of the important tasks that we are currently solving is increasing the computational efficiency of face recognition. There are two different problems here. First, if the number of people to be recognised is large, it may take a long time to calculate the distances between the descriptor of the input video frame and the descriptors of all the reference photos, so you will either have to abandon real-time decision-making or develop specialised high-speed algorithms, such as approximate nearest neighbour search or sequential analysis of descriptors. Second, the descriptor calculation time for each video frame can also be very long, especially on low-performance mobile devices, so there is the challenge of developing specialised mobile descriptors adapted to the architecture of a particular device.
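To make the first problem concrete, here is a small sketch of one simple way to cut the cost of comparing a query against every reference descriptor. It is neither the approximate nearest neighbour search nor the sequential analysis mentioned above, but a simpler exact technique, often called partial distance search or early abandoning: the distance to a candidate is accumulated component by component, and the candidate is dropped as soon as it can no longer beat the best match found so far. The function name and the toy gallery are hypothetical.

```python
def nearest_with_early_abandon(query, gallery):
    """Return (name, squared_distance) of the closest reference descriptor.

    `gallery` maps a name to a descriptor of the same length as `query`.
    The result equals that of a full linear scan, but most candidates are
    rejected after looking at only part of their descriptor.
    """
    best_name, best_dist = None, float("inf")
    for name, reference in gallery.items():
        partial = 0.0
        for q, r in zip(query, reference):
            partial += (q - r) ** 2
            if partial >= best_dist:
                break  # this candidate can no longer win
        else:
            # The loop finished without a break: a new best match.
            best_name, best_dist = name, partial
    return best_name, best_dist

gallery = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.0, 0.2, 0.9],
    "carol": [0.1, 0.9, 0.1],
}
name, dist = nearest_with_early_abandon([0.8, 0.2, 0.1], gallery)
```

Here the candidates "bob" and "carol" are rejected after their first component alone. The saving grows with descriptor length, but the scan still visits every enrolled person, which is why very large galleries call for approximate search structures instead.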

Of course, there are other problems that researchers are currently working on. For example, the task of anti-spoofing is extremely relevant: countering attempts to deceive Face ID systems with specially prepared photos or videos of another person, as well as with masks or special makeup.

How can one deceive Face ID? 

— This is not easy, but it is possible. During the seminar, Stepan Komkov talked about his development, AdvHat. This is a special type of adversarial attack on a neural network that makes the resulting descriptor differ significantly from the descriptors of other photos of the same person’s face. In particular, the speaker suggested placing a special sticker on the front of a hat, which caused the best neural network models (at the time the article was published) to become totally confused and recognise a completely different person. Of course, it is always possible to develop a special filter to combat such attacks, but it is impossible to foresee all possible attacks in a typical video surveillance system, simply because of the time limits for real-time decision-making.

How does the design of the Master of Computer Vision programme meet the challenges of the computer vision industry?

— The Master of Computer Vision programme was developed by HSE University scientists and leading experts from Huawei, SBER Lab, Intel, AIRI, and Xperience.ai, who are all involved in advanced research in the field of computer vision. The educational concept of the programme combines current research in artificial intelligence, data analysis, and machine learning, including deep learning, with modern practices in computer vision, including methods for processing, analysing, and synthesising images and videos. The purpose of the programme is to prepare specialists who will be able to work on any project related to object recognition, development of video surveillance systems, creation of 3D reconstructions and photo filters, mobile applications for object recognition in photos and videos, and the introduction of CV in all types of production in industry, retail, medicine, banking, etc. Programme graduates apply for positions such as Computer Vision Software Engineer, Perception Engineer, 3D Perception / Computer Vision Algorithm Engineer, Computer Vision Testing Engineer, Computer Vision Scientist, Data Scientist, and Machine Learning Engineer. Interactive practical exercises within the programme provide an opportunity to concentrate on solving business tasks from prominent market players and allow students to grow quickly in the profession, building a career while still studying.