Google Summer of Code 2021
The Project
Eye tracking has many applications, from driver safety to improved accessibility for people with disabilities. Current state-of-the-art eye trackers are very expensive and tend to be bulky systems that need to be carefully set up and calibrated. The pervasiveness of handheld devices with powerful cameras has now made it possible to have high-quality eye tracking right in our pockets!
In 2020, researchers at Google published a paper⁶ that reports an error of 0.6–1° at a viewing distance of 25–40 cm on a smartphone. This means that if you look at a spot on the phone from 25–40 cm away, the algorithm can predict the location of that spot to within 0.46±0.03 cm.
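To see how an angular error maps to an on-screen distance, simple trigonometry suffices: the on-screen error is roughly the viewing distance times the tangent of the angular error. The snippet below is a rough sanity check (not from the paper) showing that 0.6–1° at 25–40 cm lands in the same ballpark as the ~0.46 cm figure quoted above.

```python
import math

def angular_error_to_cm(error_deg: float, distance_cm: float) -> float:
    """On-screen error for a given angular error at a given viewing distance."""
    return distance_cm * math.tan(math.radians(error_deg))

# Rough check against the numbers quoted above
for error_deg, distance_cm in [(1.0, 25), (0.6, 40)]:
    err = angular_error_to_cm(error_deg, distance_cm)
    print(f"{error_deg}° at {distance_cm} cm ≈ {err:.2f} cm")
```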
The authors have not open-sourced their code or provided trained models. The aim of this project, therefore, is to replicate the reported results and then extend the functionality to also predict head position and more.
State of the Art Gaze Tracking
Traditional trackers range from specialized lenses or devices that are physically in contact with the eye to a wide range of non-contact methods. Current state-of-the-art systems shine near-infrared (NIR) light onto the eye and predict the gaze location from the reflections of this light.
Tobii¹, the current world leader in eye tracking, builds specialized glasses with integrated hardware to estimate the gaze location. These glasses come equipped with 16 illuminators and 4 eye cameras! They also build screen-based eye trackers that can be plugged into a laptop or PC.
Other competitors also build specialized hardware that uses NIR to illuminate the face and eyes and then runs complex, proprietary algorithms (generally not suited to mobile devices) to estimate the gaze location.
Most screen-based eye trackers have an accuracy of 0.4°–1°.
Machine Learning based Gaze Tracking
An increase in computational power, coupled with advances in deep learning research, has now made it possible to predict gaze from regular RGB images taken on a phone or laptop. These techniques don’t require any additional hardware, are much cheaper than specialized trackers, and can run efficiently even on mobile phones.
A few of the noteworthy machine learning based gaze tracking papers and their main contributions are listed below.
MPII Gaze² (2015)
Researchers at the Max Planck Institute for Informatics were among the first to release a large-scale public gaze dataset, along with a gaze estimation method based on convolutional neural networks (CNNs).
Fifteen participants were asked to use custom software on their laptops that displayed dots they had to focus on, and a total of 213,659 images were collected from the laptop webcams. To normalize the dataset, the eye images were rotated and scaled so that they appeared in line with the camera, with the camera at the origin of the 3D space.
To estimate the gaze direction, they used a multimodal CNN that takes the cropped eye image along with a head pose angle vector and outputs the yaw and pitch angles of the gaze direction. Their method achieved a mean error of 6.3°.
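To make the idea concrete, here is a minimal PyTorch sketch of such a multimodal network. The layer sizes and the 36×60 eye-crop resolution are illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class EyeGazeNet(nn.Module):
    """Multimodal CNN: grayscale eye crop + 2D head pose -> gaze yaw and pitch."""
    def __init__(self):
        super().__init__()
        # Small convolutional encoder for a 36x60 grayscale eye crop
        self.conv = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc_eye = nn.Sequential(nn.Flatten(), nn.Linear(50 * 6 * 12, 500), nn.ReLU())
        # The head pose (yaw, pitch) is concatenated with the image features
        self.fc_out = nn.Linear(500 + 2, 2)  # predicts gaze (yaw, pitch)

    def forward(self, eye_img, head_pose):
        feat = self.fc_eye(self.conv(eye_img))
        return self.fc_out(torch.cat([feat, head_pose], dim=1))

model = EyeGazeNet()
gaze = model(torch.randn(8, 1, 36, 60), torch.randn(8, 2))  # -> shape (8, 2)
```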
TabletGaze: Unconstrained Appearance-based Gaze Estimation in Mobile Tablets³ (2015)
Using a single Samsung tablet to capture images from 51 participants, researchers at Rice University created a new dataset for gaze estimation. Participants interacted with an app that displayed dots while in four body postures — standing, sitting, slouching, and lying down — which improved the variability of the dataset.
To detect the gaze location, they use only the cropped eye region: traditional feature extractors such as Histogram of Oriented Gradients (HoG) and Local Binary Patterns (LBP) generate a feature vector that is then passed to a regression model predicting the x and y coordinates of the gaze. A combination of multilevel HoG (mHoG) as the feature extractor and a random forest as the regressor gave the lowest error of 3.17±2.10 cm.
The paper further discusses how different aspects of the data affect the prediction and shows that partitioning the data by race and body posture improves the estimation accuracy, while wearing glasses does not significantly impact the result.
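Below is a rough sketch of this classical feature-plus-regressor pipeline, approximating multilevel HoG with skimage's standard HoG computed at two cell sizes and using scikit-learn's random forest. The crop size, HoG parameters, and toy data are assumptions for illustration only.

```python
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import RandomForestRegressor

def eye_features(eye_crop):
    """Approximate a multilevel HoG descriptor with HoG at two cell sizes."""
    return np.concatenate([
        hog(eye_crop, orientations=8, pixels_per_cell=(8, 8), cells_per_block=(2, 2)),
        hog(eye_crop, orientations=8, pixels_per_cell=(16, 16), cells_per_block=(2, 2)),
    ])

# Toy data: grayscale eye crops and on-screen gaze coordinates in cm
rng = np.random.default_rng(0)
crops = rng.random((100, 64, 96))           # stand-ins for real cropped eye regions
gaze_cm = rng.random((100, 2)) * [20, 12]   # stand-ins for (x, y) gaze labels

X = np.stack([eye_features(c) for c in crops])
reg = RandomForestRegressor(n_estimators=100).fit(X, gaze_cm)
pred_xy = reg.predict(X[:1])                # predicted (x, y) gaze location in cm
```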
Eye Tracking for Everyone⁴ (2016)
Presented by researchers at MIT CSAIL at CVPR 2016, Eye Tracking for Everyone introduced an even larger dataset (GazeCapture), with a lot of variation in lighting, distance from the camera, head pose, and so on.
Leveraging the power of the internet and Amazon Mechanical Turk, they were able to capture 2,445,504 images from 1,474 participants, all collected on Apple devices (iPhones and iPads). The dataset also contains sensor data from the device, which gives an idea of the phone's position during image capture. This dataset is freely available and is hence the dataset we will use for all our experiments.
They introduce iTracker, a deep network architecture for estimating the gaze location. From the original image, the face region and the two eye regions are cropped out and each sent through a CNN. Additionally, a face grid, a binary matrix indicating where the face is in the frame, is also fed to the network. The network outputs the x and y coordinates of the estimated gaze. Their lowest error was 1.34 cm.
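A minimal PyTorch sketch of the iTracker idea follows: convolutional branches for the face and the two eyes (the eye branch is shared), a small fully connected branch for a 25×25 face grid, and a fusion head that regresses the (x, y) gaze location. The layer sizes are illustrative and much smaller than the branches used in the paper.

```python
import torch
import torch.nn as nn

def conv_branch():
    # Same structure for the face and eye branches (sizes are illustrative)
    return nn.Sequential(
        nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(4), nn.Flatten(),
    )

class ITrackerLike(nn.Module):
    """Face crop + two eye crops + 25x25 face grid -> (x, y) gaze on screen."""
    def __init__(self):
        super().__init__()
        self.face = conv_branch()
        self.eyes = conv_branch()  # one branch, shared by both eyes
        self.grid = nn.Sequential(nn.Flatten(), nn.Linear(25 * 25, 128), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(3 * 64 * 4 * 4 + 128, 128), nn.ReLU(), nn.Linear(128, 2),
        )

    def forward(self, face, left_eye, right_eye, face_grid):
        feats = torch.cat([self.face(face), self.eyes(left_eye),
                           self.eyes(right_eye), self.grid(face_grid)], dim=1)
        return self.head(feats)

model = ITrackerLike()
xy = model(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224),
           torch.randn(4, 3, 224, 224), torch.randn(4, 1, 25, 25))  # -> (4, 2)
```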
Training Person-Specific Gaze Estimators from User Interactions with Multiple Devices⁵ (2018)
This 2018 paper showcased how training a backbone model on data from a variety of sources improves the overall accuracy of almost all device-specific models. The authors collected data from 22 participants as they gazed at different points on 5 different devices: a 5.1-inch mobile phone, a 10-inch tablet, a 14-inch laptop, a 24-inch desktop computer, and a 60-inch smart TV. Using this extensive dataset, they train a single CNN and then have device-specific decoders that come into play when predicting the gaze location on a particular device. The additional knowledge gained from the different screen sizes, viewing angles, viewing distances, and so on helps the model generalize better.
They use an AlexNet-based architecture and report an error of 1.4 cm on the mobile phone screen, 2.2 cm on the tablet, 2.5 cm on the laptop, 3.5 cm on the desktop computer, and 8.6 cm on the smart TV.
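The core idea — a shared encoder with one lightweight decoder per device — can be sketched as follows. The torchvision AlexNet backbone and the head sizes are stand-ins, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiDeviceGaze(nn.Module):
    """One shared image encoder, one small decoder head per device type."""
    def __init__(self, devices=("phone", "tablet", "laptop", "desktop", "tv")):
        super().__init__()
        backbone = models.alexnet(weights=None)
        self.encoder = nn.Sequential(backbone.features, nn.Flatten())  # shared across devices
        self.heads = nn.ModuleDict({
            d: nn.Sequential(nn.Linear(256 * 6 * 6, 128), nn.ReLU(), nn.Linear(128, 2))
            for d in devices
        })

    def forward(self, image, device: str):
        # Returns the (x, y) gaze location on that device's screen
        return self.heads[device](self.encoder(image))

model = MultiDeviceGaze()
xy = model(torch.randn(2, 3, 224, 224), device="phone")  # -> shape (2, 2)
```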
Accelerating Eye Movement Research via Accurate and Affordable Smartphone Eye Tracking⁶ (2020)
Finally, the paper that we are going to implement and extend. This paper by Google reports high accuracy as well as efficient deployment. Data was collected from 26 participants using a phone placed on a mount. This dataset has not been released to the public.
The main contribution of this paper is per-person calibration, which further fine-tunes the model and yields a very low error for a particular person. The calibration process is simple and standard in eye tracking research.
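One possible way to implement such calibration, sketched below, is to fit a lightweight per-person regressor (here, a support vector regression) on top of features from a frozen base model, using a handful of frames captured while the user looks at known on-screen dots. The feature dimension, number of calibration frames, and choice of SVR are assumptions for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Hypothetical calibration data: embeddings from the frozen base model for ~30
# frames collected while the user looks at known calibration dots.
base_features = np.random.randn(30, 128)        # stand-in for base-model embeddings
dot_locations_cm = np.random.rand(30, 2) * 10   # known (x, y) of the calibration dots

# Fit a lightweight per-person regressor on top of the frozen base model
personal_model = MultiOutputRegressor(SVR(kernel="rbf")).fit(base_features, dot_locations_cm)

# At inference time, base-model features for a new frame are mapped to a
# personalized gaze estimate
personal_xy = personal_model.predict(np.random.randn(1, 128))
```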
The architecture they use is a multimodal CNN: the eye crops, along with eye corner landmarks, are fed to the network, which outputs the x and y coordinates of the gaze location. As mentioned in the introduction, they report an error of 0.6–1° at a viewing distance of 25–40 cm, which is far better than all previous methods!
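A minimal sketch of such a network is shown below, assuming two RGB eye crops and four eye-corner landmarks (eight coordinates). The layer sizes and input resolutions are illustrative, not the configuration from the paper.

```python
import torch
import torch.nn as nn

class SmartphoneGazeNet(nn.Module):
    """Two eye crops + eye-corner landmark coordinates -> (x, y) gaze location."""
    def __init__(self):
        super().__init__()
        self.eye_cnn = nn.Sequential(            # shared between both eyes
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.landmark_fc = nn.Sequential(nn.Linear(8, 64), nn.ReLU())  # 4 corners x (x, y)
        self.head = nn.Sequential(
            nn.Linear(2 * 64 * 4 * 4 + 64, 128), nn.ReLU(), nn.Linear(128, 2),
        )

    def forward(self, left_eye, right_eye, landmarks):
        feats = torch.cat([self.eye_cnn(left_eye), self.eye_cnn(right_eye),
                           self.landmark_fc(landmarks)], dim=1)
        return self.head(feats)

model = SmartphoneGazeNet()
xy = model(torch.randn(4, 3, 128, 128), torch.randn(4, 3, 128, 128),
           torch.randn(4, 8))  # -> shape (4, 2)
```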
Conclusion
Before starting any project, it’s always a good idea to get a sense of what we’re working toward, why it’s important, and what the current state of the art is. This article introduces the project and clearly defines the target we are trying to achieve.
In the coming weeks, we’ll dive deeper into the dataset as well as the method described in the Google paper⁶ and build a robust, efficient, and accurate eye tracker.
References
1. Tobii Eye Tracking — https://www.tobiipro.com/
2. Zhang, Xucong, et al. “Appearance-based gaze estimation in the wild.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
3. Huang, Qiong, Ashok Veeraraghavan, and Ashutosh Sabharwal. “TabletGaze: dataset and analysis for unconstrained appearance-based gaze estimation in mobile tablets.” Machine Vision and Applications 28.5 (2017): 445–461.
4. Krafka, Kyle, et al. “Eye tracking for everyone.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
5. Zhang, Xucong, et al. “Training person-specific gaze estimators from user interactions with multiple devices.” Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 2018.
6. Valliappan, Nachiappan, et al. “Accelerating eye movement research via accurate and affordable smartphone eye tracking.” Nature Communications 11.1 (2020): 1–12.