End-to-end gaze estimation with pseudo-labels
Implement a conventional webcam-based eye tracking pipeline as a single neural network that takes a webcam image frame as input.
Keywords: Appearance-based Gaze Estimation, Head Pose Estimation, End-to-end CNNs
Gaze estimation from a single commodity camera (such as a webcam or the front-facing camera of a mobile phone) is a challenging task with applications in automatically adapting user interfaces, analyzing user attention and awareness to better inform UI design, and providing low-cost aid to users with motor disabilities. Input images for gaze estimation are typically produced from full camera frames via 6-degrees-of-freedom head pose estimation and a so-called “data normalization” procedure [1], which additionally involves face detection and facial landmark localization. Accordingly, the accuracy and speed of each sub-step affects the performance of the final eye tracker. It is thus important both to improve head pose estimates and to design and implement an end-to-end architecture which can process a given input image without unnecessary communication between the CPU and GPU.
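For concreteness, the core of the data normalization procedure of [1] reduces to a single perspective warp that points a virtual camera at the face at a fixed distance and cancels head roll. Below is a minimal NumPy/OpenCV sketch of this idea; the function name, the default distance, and the patch size are illustrative assumptions rather than values fixed by [1].

```python
import cv2
import numpy as np

def normalize_face_patch(frame, face_center, head_rot, cam_matrix,
                         norm_cam_matrix, norm_distance=600.0,
                         patch_size=(224, 224)):
    """Warp a full camera frame into a normalized face patch (cf. [1]).

    face_center: 3D face position in camera coordinates (mm).
    head_rot:    3x3 head rotation matrix from 6-DoF head pose estimation.
    Names and default values here are illustrative assumptions.
    """
    distance = np.linalg.norm(face_center)
    z_scale = norm_distance / distance
    # Scale so the virtual camera sits at a fixed distance from the face.
    S = np.diag([1.0, 1.0, z_scale])

    # Rotate the virtual camera so its z-axis points at the face center,
    # using the head's x-axis to cancel roll.
    forward = face_center / distance
    down = np.cross(forward, head_rot[:, 0])
    down /= np.linalg.norm(down)
    right = np.cross(down, forward)
    right /= np.linalg.norm(right)
    R = np.stack([right, down, forward])  # rows: new x, y, z camera axes

    # Combined warp: normalized camera <- rotate/scale <- real camera.
    W = norm_cam_matrix @ S @ R @ np.linalg.inv(cam_matrix)
    return cv2.warpPerspective(frame, W, patch_size)
```

Because every sub-step (head pose, landmarks, this warp) feeds the next, errors here propagate directly into the gaze estimate, which motivates the end-to-end formulation below.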
A naïve method of doing this was attempted by Krafka et al. [2], who demonstrated 15 Hz performance on an older iPhone. However, their method would not necessarily generalize to new devices or cameras, or be robust to unseen head poses. In addition, subsequent work showed that correctly applying the “data normalization” procedure is essential for the best gaze direction estimation performance. As this procedure can be implemented in deep learning frameworks such as TensorFlow, it can be embedded as part of an end-to-end architecture. This in turn allows a head-pose estimation network to be tuned directly via the final gaze estimation loss. This is an important point, because in-the-wild datasets such as GazeCapture [2] do not provide ground-truth head pose, and the pre-processing pipeline (Fig. 1) only offers crude but sufficiently effective estimates.
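To sketch how this could look, the warp matrix can be built entirely from differentiable TensorFlow ops, so that gradients of the final gaze loss flow back into the head-pose network that predicted the pose. Here `pose_net`, `gaze_net`, and `bilinear_warp` are hypothetical placeholders; in particular, a resampler that is differentiable with respect to the warp parameters (a spatial-transformer-style bilinear sampler) would need to be implemented, since stock projective-warp ops typically only propagate gradients into the image.

```python
import tensorflow as tf

def build_warp(face_center, head_rot, cam_inv, norm_cam, norm_distance=600.0):
    """Build the normalization warp W = C_n @ S @ R @ C_r^{-1} from TF ops.

    Every operation here is differentiable, so gradients of the gaze loss
    can flow back into the network that predicted face_center and head_rot.
    Names and the default distance are assumptions for this sketch.
    """
    distance = tf.norm(face_center)
    z_scale = norm_distance / distance
    S = tf.linalg.diag(tf.stack([1.0, 1.0, z_scale]))

    forward = face_center / distance
    down = tf.linalg.cross(forward, head_rot[:, 0])
    down = down / tf.norm(down)
    right = tf.linalg.cross(down, forward)
    right = right / tf.norm(right)
    R = tf.stack([right, down, forward], axis=0)

    return norm_cam @ S @ R @ cam_inv


def end_to_end(frame, pose_net, gaze_net, cam_inv, norm_cam):
    # The head-pose network's (crude, pseudo-label-supervised) output
    # parameterizes the warp, so the gaze loss also tunes head pose.
    face_center, head_rot = pose_net(frame)
    W = build_warp(face_center, head_rot, cam_inv, norm_cam)
    # bilinear_warp: a resampler differentiable w.r.t. W, to be implemented
    # (e.g. spatial-transformer-style bilinear sampling).
    patch = bilinear_warp(frame, W)
    return gaze_net(patch)
```

Keeping the whole pipeline on the GPU in this way also removes the CPU–GPU round-trips of the conventional multi-stage pipeline.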
At the end of the project, the student should be able to quantitatively demonstrate improvements to point of regard (PoR) estimation on the GazeCapture dataset, and as a bonus be able to demonstrate real-time end-to-end gaze estimation.
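For the quantitative evaluation, PoR accuracy on GazeCapture is commonly reported as the mean Euclidean distance in centimetres between predicted and ground-truth on-screen gaze points (both given relative to the camera). A minimal sketch, with hypothetical names:

```python
import numpy as np

def mean_por_error_cm(pred_xy, true_xy):
    """Mean Euclidean point-of-regard error in centimetres.

    pred_xy, true_xy: arrays of shape (N, 2) holding on-screen gaze
    points as (x, y) offsets from the camera in cm, as annotated in
    GazeCapture.
    """
    return float(np.mean(np.linalg.norm(pred_xy - true_xy, axis=1)))
```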