With the online advertising sector estimated to have spent US$740.3 billion in 2023, it is easy to understand why advertising companies invest significant resources in this particular strand of computer vision research.
Though insular and proprietary by nature, the industry occasionally publishes research that hints at more sophisticated internal work in facial and eye-gaze recognition, including age recognition, a central element of demographic analytics.
Estimating age in an in-the-wild advertising context is of interest to advertisers who may be targeting a particular age demographic. In this experimental example of automatic facial age estimation, the age of performer Bob Dylan is tracked across the years. Source: https://arxiv.org/pdf/1906.03625
Although these studies rarely appear in public repositories such as Arxiv, they use legitimately recruited and compensated participants as the basis for AI-driven analyses that aim to determine how much, and in what way, viewers are engaging with advertising.
DLIB's Histogram of Oriented Gradients (HOG) is frequently used in facial pose estimation systems. Source: https://www.computer.org/csdl/journal/ta/2017/02/07475863/13rrunvyarn
Animal Instinct
In this regard, the advertising industry is naturally interested in detecting false positives (cases where an analytical system misinterprets a subject's behavior), and in establishing clear criteria for when the person watching its commercials is not fully engaging with the content.
Where screen-based advertising is concerned, research tends to focus on two problems across two environments. The environments are "desktop" and "mobile", each with particular characteristics that require bespoke tracking solutions; and the problems, from the advertiser's standpoint, are represented by "owl" behavior and "lizard" behavior: the tendency not to pay full attention to the advertisement that is right in front of the viewer.
Examples of "owl" and "lizard" behavior, from the subject advertising research project. Source: https://arxiv.org/pdf/1508.04028
If you look away from the intended advertisement with your whole head, that is "owl" behavior; if your head pose stays static but your eyes wander away from the screen, that is "lizard" behavior. In terms of analytics and the testing of new advertisements under controlled conditions, these are essential actions for a system to be able to capture.
A new paper from SmartEye's Affectiva acquisition addresses these issues, offering an architecture that leverages several existing frameworks to provide a combined set of features across all the required conditions and possible responses.
Examples of true and false positives detected by the new attention system for various distraction signals, shown separately for desktop and mobile devices. Source: https://arxiv.org/pdf/2504.06237
The authors state*:
“Limited research has delved into monitoring attention during online ads. These studies focused on estimating head pose or gaze direction to identify instances of diverted gaze, but disregard crucial parameters such as device type (desktop or mobile), camera placement relative to the screen, and screen size. These factors significantly influence attention detection.
“In this paper, we propose an architecture for attention detection that encompasses detecting various distractors, including both the owl and lizard behaviors of gazing off-screen, as well as speaking, drowsiness (through yawning and eye closure), and leaving the screen unattended.
“Unlike previous approaches, our method integrates device-specific features, such as device type, camera placement, screen size (for desktops), and camera orientation (for mobile devices), alongside raw gaze estimation, to improve the accuracy of attention detection.”
The new paper is titled Monitoring Viewer Attention During Online Ads, and comes from four researchers at Affectiva.
Methods and Data
Due mainly to the secretive and closed nature of such systems, the new paper presents its findings only as an ablation study, rather than comparing the authors' approach directly to rivals. Further, the paper does not generally adhere to the usual format of computer vision literature. Therefore, let's take a look at the research as it is presented.
The authors emphasize that only a limited number of studies have addressed attention detection specifically in the context of online advertising. In the AFFDEX SDK, which offers real-time multi-face recognition, attention is inferred solely from head pose.
Examples from the AFFDEX SDK, an emotion-recognition system that relies on head pose as an indicator of attention. Source: https://www.youtube.com/watch?v=c2cwb5jhmby
In the 2019 collaboration Automatic Measurement of Visual Attention to Video Content using Deep Learning, a dataset of around 28,000 participants was annotated for various inattentive behaviors, including gazing away, closing the eyes, or engaging in unrelated activities, and a CNN-LSTM model was trained to detect attention from facial appearance over time.
An example of predicted attention states for viewers watching video content, from the 2019 paper. Source: https://www.jeffcohn.net/wp-content/uploads/2019/07/attention-13.pdf.pdf
However, the earlier work did not account for device-specific factors, such as whether participants were using a desktop or mobile device, nor did it consider screen size or camera placement. Furthermore, while the AFFDEX system focuses solely on identifying gaze diversion and omits other sources of distraction, the 2019 work attempts to detect a broader set of behaviors, but its use of a single shallow CNN was, the authors contend, insufficient for this task.
The authors observe that much of the prior research along these lines is not optimized for ad testing, which has different needs compared to domains such as driving or education, where camera placement and calibration are usually fixed in advance; ad testing instead relies on uncalibrated setups and must operate within the limited gaze range of desktop and mobile devices.
Therefore they devise an architecture for detecting viewer attention during online ads, leveraging two commercial toolkits: AFFDEX 2.0 and the SmartEye SDK.
An example of facial analysis from AFFDEX 2.0. Source: https://arxiv.org/pdf/2202.12059
These prior works extract low-level features such as facial expressions, head pose, and gaze direction. These features are then processed to produce high-level indicators, including gaze position on the screen, yawning, and speaking.
The system identifies four distraction types: off-screen gaze; drowsiness; speaking; and unattended screens. It also adjusts gaze analysis according to whether the viewer is on a desktop or a mobile device.
Dataset: Gaze
The authors used four datasets to power and evaluate the attention-detection system: three focusing individually on gaze, speaking, and yawning, and a fourth drawn from real-world ad-testing sessions containing a mixture of distraction types.
Due to the specific requirements of the work, custom datasets were created for each of these categories. All of the curated datasets were sourced from a proprietary repository featuring millions of sessions of participants watching ads in home or workplace environments; because of participant-consent restrictions, the authors state that the datasets for the new work cannot be publicly released.
To build the gaze dataset, participants were asked to follow a dot moving to different points on the screen, including its edges, and then to look away from the screen in four directions (up, down, left, and right), with the sequence repeated three times. In this way the relationship between capture and coverage was established:
Screenshots of the gaze video stimulus on (a) desktop and (b) mobile devices. The first and third frames show instructions to follow a moving dot, while the second and fourth prompt participants to look away from the screen.
The moving-dot segments were labeled as attentive and the off-screen segments as inattentive, producing a labeled dataset of both positive and negative examples.
Each video lasted around 160 seconds, with separate versions created for desktop and mobile platforms, at resolutions of 1920×1080 and 608×1080, respectively.
A total of 609 videos were collected, consisting of 322 desktop and 287 mobile recordings. Labels were applied automatically based on the video content, and the dataset was split into 158 samples for training and 451 for testing.
Dataset: Speaking
In this context, one of the criteria defining "inattention" is when a person speaks for longer than one second (anything shorter might be a momentary comment, or even a cough).
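As an illustration of how such a duration rule could be applied to per-frame predictions, here is a minimal sketch; the function, frame rate, and threshold are hypothetical and not taken from the paper.

```python
# Minimal sketch (not the authors' code): flag speaking as inattention only when
# a run of per-frame "mouth moving" predictions lasts longer than one second.
from typing import List

def speaking_inattention_mask(speaking_frames: List[bool], fps: float = 30.0,
                              min_duration_s: float = 1.0) -> List[bool]:
    """Return a per-frame mask that is True only inside speaking runs
    lasting at least `min_duration_s` seconds."""
    min_len = int(round(min_duration_s * fps))
    mask = [False] * len(speaking_frames)
    run_start = None
    for i, s in enumerate(speaking_frames + [False]):  # sentinel closes the final run
        if s and run_start is None:
            run_start = i
        elif not s and run_start is not None:
            if i - run_start >= min_len:
                for j in range(run_start, i):
                    mask[j] = True
            run_start = None
    return mask
```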
Since the controlled environment does not record or analyze audio, speaking is inferred by observing the movement of estimated facial landmarks. Therefore, to detect speaking without audio, the authors created a dataset based entirely on visual input, drawn from their internal repository and split into two parts. The first of these comprised around 5,500 videos, each manually labeled by three annotators as speaking or not speaking (4,400 of these were used for training and validation, and 1,100 for testing).
The second part comprised 16,000 sessions automatically labeled based on session type: 10,500 featuring participants silently watching ads, and 5,500 featuring participants expressing opinions about a brand.
Dataset: Yawning
Several "yawning" datasets exist, including YawDD and Driver Fatigue, but the authors argue that none are suitable for ad-testing scenarios, since they either feature simulated yawns or include facial contortions that could be confused with fear, or with other, non-yawning actions.
The authors therefore used 735 videos from their internal collection, each featuring a jaw drop lasting more than one second. Each video was manually labeled by three annotators as containing an active or an inactive yawn. Only 2.6 percent of frames contained active yawns, highlighting the class imbalance, and the dataset was split into 670 videos for training and 65 for testing.
Dataset: Distraction
The distraction dataset was likewise drawn from the authors' ad-testing repository, in which participants viewed real advertisements with no assigned tasks. A total of 520 sessions (193 on mobile and 327 on desktop) were randomly selected and manually labeled by three annotators as attentive or inattentive.
The inattentive behaviors included off-screen gaze, speaking, drowsiness, and unattended screens. The sessions span diverse regions around the world, with desktop recordings more common, owing to the more flexible placement of webcams.
Attention Model
The proposed attention model processes low-level visual features (facial expressions, head pose, and gaze direction) extracted from the aforementioned AFFDEX 2.0 and SmartEye SDK.
These are then converted into high-level indicators, with each distractor handled by a separate binary classifier trained on its own dataset, allowing independent optimization and evaluation.
Schema of the proposed monitoring system.
The gaze model determines whether the viewer is looking at or away from the screen using normalized gaze coordinates, with separate calibrations for desktop and mobile devices. Supporting this process is a linear Support Vector Machine (SVM), trained on spatial and temporal features, which incorporates a memory window to smooth out rapid gaze shifts.
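To make that description concrete, here is a minimal sketch assuming normalized per-frame gaze coordinates; the feature set, SVM settings, and window length are illustrative guesses rather than the paper's actual configuration.

```python
# Illustrative sketch only: a linear SVM on normalized gaze features, with a
# simple majority-vote "memory window" to smooth rapid gaze shifts.
import numpy as np
from sklearn.svm import LinearSVC

def gaze_features(gaze_xy: np.ndarray) -> np.ndarray:
    # Per-frame features: normalized gaze position plus short-term gaze velocity.
    velocity = np.vstack([np.zeros((1, 2)), np.diff(gaze_xy, axis=0)])
    return np.hstack([gaze_xy, velocity])

def train_gaze_svm(gaze_xy: np.ndarray, labels: np.ndarray) -> LinearSVC:
    # Labels: 1 = on-screen (attentive), 0 = off-screen (inattentive).
    clf = LinearSVC(C=1.0)
    clf.fit(gaze_features(gaze_xy), labels)
    return clf

def smoothed_predictions(clf: LinearSVC, gaze_xy: np.ndarray, window: int = 15) -> np.ndarray:
    # Majority vote over the last `window` frames, so brief glances do not flip the state.
    raw = clf.predict(gaze_features(gaze_xy))
    smoothed = raw.copy()
    for i in range(len(raw)):
        lo = max(0, i - window + 1)
        smoothed[i] = int(np.round(raw[lo:i + 1].mean()))
    return smoothed
```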
To detect speaking without audio, the system used cropped mouth regions and a 3D-CNN trained on both conversational and non-conversational video segments. Labels were assigned based on session type, with temporal smoothing reducing false positives that can arise from brief mouth movements.
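The following is a toy 3D-CNN, in PyTorch, of the kind that could classify speaking from stacked mouth crops; the layer sizes and clip dimensions are assumptions and do not reflect the architecture reported in the paper.

```python
# A minimal 3D-CNN sketch for classifying speaking vs. not-speaking from mouth-crop clips.
import torch
import torch.nn as nn

class MouthSpeaking3DCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # input: (B, 3, T, H, W) mouth crops
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),          # pool spatially, keep temporal resolution
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # global pooling over time and space
        )
        self.classifier = nn.Linear(32, 1)                # single logit: speaking vs. not

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        x = self.features(clips)
        return self.classifier(x.flatten(1))

# Example: a batch of two 16-frame mouth-crop clips at 64x64 resolution.
logits = MouthSpeaking3DCNN()(torch.randn(2, 3, 16, 64, 64))
```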
Yawning was detected using full-face image crops, to capture broader facial motion, with a 3D-CNN trained on manually labeled frames (though the task was complicated by the low frequency of natural yawning during ad viewing, and by its similarity to other expressions).
Screen abandonment was identified through the absence of a face or an extreme head pose, with predictions made by a decision tree.
The final attention state was then determined by a fixed rule: if any one module detected inattention, the viewer was marked as inattentive, an approach that prioritizes sensitivity and that was tuned separately for desktop and mobile contexts.
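A minimal sketch of that fusion rule might look like the following; the signal names and the dataclass are illustrative, not taken from the authors' code.

```python
# Sketch of the "sensitivity-first" fusion rule described above: if any
# distraction module fires, the viewer is marked inattentive.
from dataclasses import dataclass

@dataclass
class DistractionSignals:
    off_screen_gaze: bool
    speaking: bool
    drowsy: bool
    screen_unattended: bool

def is_attentive(signals: DistractionSignals) -> bool:
    """Viewer is attentive only when no distraction module has triggered."""
    return not (signals.off_screen_gaze or signals.speaking
                or signals.drowsy or signals.screen_unattended)
```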
Tests
As mentioned earlier, the tests follow an ablation-style methodology, in which components are removed and the effect on the result is noted.
The various categories of perceived inattention identified in the study.
The gaze model identified off-screen behavior through three key steps: normalizing raw gaze estimates, fine-tuning the output, and estimating the screen size of desktop devices.
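A highly simplified sketch of the normalization idea is shown below; the geometry and parameter names are assumptions, intended only to convey why camera offset and screen size matter.

```python
# Hypothetical sketch: map raw gaze estimates into screen-relative coordinates so
# the on-screen classifier is comparable across camera placements and screen sizes.
import numpy as np

def normalize_gaze(raw_gaze_xy: np.ndarray,
                   camera_offset_xy: np.ndarray,
                   screen_size_wh: np.ndarray) -> np.ndarray:
    """Shift gaze by the camera's offset from the screen centre, then scale by the
    (estimated) screen dimensions so on-screen gaze falls roughly in [-0.5, 0.5]."""
    return (raw_gaze_xy - camera_offset_xy) / screen_size_wh
```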
To understand the importance of each component, the authors removed them individually and evaluated performance on 226 desktop and 225 mobile videos drawn from two of the datasets. The results, measured with G-mean and F1 scores, are shown below.
Results showing the performance of the full gaze model, alongside versions with each component removed.
In every case, performance degraded when a step was omitted. Normalization proved particularly valuable on desktops, where camera placement varies more than on mobile devices.
The study also evaluated how well visual features predict mobile camera orientation: face position, head pose, and eye gaze scored 0.75, 0.74, and 0.60 individually, while their combination reached 0.91, highlighting, the authors note, the advantage of integrating multiple cues.
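The cue-combination idea can be sketched as a single classifier over concatenated features; the logistic-regression choice and feature layout here are assumptions for illustration only.

```python
# Illustrative sketch: concatenate face position, head pose, and gaze features
# and let one classifier predict mobile camera orientation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def orientation_classifier(face_pos: np.ndarray, head_pose: np.ndarray,
                           gaze: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    features = np.hstack([face_pos, head_pose, gaze])  # one row per session
    return LogisticRegression(max_iter=1000).fit(features, labels)
```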
The speaking model, trained on vertical lip distance, achieved a ROC-AUC of 0.97 on the manually labeled test set, and 0.96 on the larger automatically labeled dataset, showing consistent performance across both.
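A vertical lip-distance feature of this kind can be computed from facial landmarks roughly as follows; the 68-point landmark indices are an assumption, since the paper may use a different landmark scheme.

```python
# Hypothetical helper: vertical lip distance from facial landmarks, normalized by
# face height so it is comparable across subjects and camera distances.
# Indices follow the common 68-point convention (62 = inner upper lip centre,
# 66 = inner lower lip centre).
import numpy as np

def vertical_lip_distance(landmarks: np.ndarray, face_height: float) -> float:
    upper, lower = landmarks[62], landmarks[66]
    return float(np.linalg.norm(lower - upper)) / face_height
```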
The yawning model reached a ROC-AUC of 96.6 percent using the mouth aspect ratio alone, improving to 97.5 percent when combined with action-unit predictions from AFFDEX 2.0.
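The mouth aspect ratio is a standard geometric cue for mouth opening; the sketch below uses hypothetical 68-point landmark indices, which may differ from the paper's landmark scheme.

```python
# Illustrative mouth-aspect-ratio (MAR) computation: vertical mouth opening
# divided by mouth width, using inner-lip landmarks from the 68-point convention.
import numpy as np

def mouth_aspect_ratio(landmarks: np.ndarray) -> float:
    vertical = np.linalg.norm(landmarks[66] - landmarks[62])    # inner lip opening
    horizontal = np.linalg.norm(landmarks[64] - landmarks[60])  # inner mouth corners
    return float(vertical / (horizontal + 1e-6))
```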
The unattended-screen model classified moments as inattentive when both AFFDEX 2.0 and SmartEye failed to detect a face for more than one second. To assess the validity of this rule, the authors manually annotated all such no-face events in the real distraction dataset, identifying the underlying cause of each. Ambiguous cases (such as camera occlusion or video distortion) were excluded from the analysis.
As shown in the results table below, only 27% of the "no face" activations were caused by users physically leaving the screen.
The varied reasons why no face was found, in certain instances.
The paper states:
“Despite making up only 27% of the instances that triggered the no-face signal, the unattended-screen signal was activated for other reasons that also indicate inattention, such as gazing off-screen at extreme angles, excessive movement, or occluding the face with an object or a hand.”
In the final quantitative test, the authors evaluated how progressively adding different distraction signals (off-screen gaze via both gaze and head pose, drowsiness, speaking, and unattended screens) affected the overall performance of the attention model.
Testing was carried out on two datasets: the real distraction dataset and the test subset of the gaze dataset. G-mean and F1 scores were used to measure performance (though speaking and drowsiness were excluded from the gaze-dataset analysis, due to their limited relevance in that context). The metrics sketch following this paragraph shows how they are computed.
Attention detection improved consistently as more distraction types were added, as shown below, with off-screen gaze, the most common distractor, providing the strongest baseline.
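For reference, G-mean (the geometric mean of sensitivity and specificity) and F1 can be computed as in this minimal sketch.

```python
# Minimal sketch of the two evaluation metrics used in the paper's tests.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

def g_mean(y_true, y_pred) -> float:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return float(np.sqrt(sensitivity * specificity))

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
print(g_mean(y_true, y_pred), f1_score(y_true, y_pred))
```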
The effect of adding a variety of distraction signals to the architecture.
Of these results, the paper states:
“From the results, we can conclude that the integration of all distraction signals contributes to enhanced attention detection.
“Second, the improvement in attention detection is consistent across both desktop and mobile devices. Third, inattention in the real dataset's mobile sessions is more easily detected, giving better performance on mobile devices compared to desktops. Fourth, adding the drowsiness signal yields a relatively slight improvement compared to the other signals, as it occurs only rarely.
“Finally, the unattended-screen signal yields a relatively larger improvement on mobile devices compared to desktops, since a mobile device can easily be left unattended.”
The authors also compared their model to AFFDEX 1.0, a previous system used in ad testing; even the head-pose-based gaze detection in the current model outperformed AFFDEX 1.0 on both device types.
“This improvement is a result of incorporating head movements in both the yaw and pitch directions, as well as normalizing head pose to account for minor changes. The significant head movements in the real mobile dataset caused the head model to perform similarly to AFFDEX 1.0.”
The authors close the paper with a (perhaps more interesting) qualitative round of tests, shown below.
Sample outputs from the attention model across desktop and mobile devices, with each row offering examples of true and false positives for various distraction types.
The authors state:
‘Results show that our model effectively detects various distractors in uncontrolled settings. However, it can produce false positives in certain edge cases, such as severe head tilts while gaze remains on the screen, occluded mouths, excessively blurry eyes, or heavily darkened faces.’
Conclusion
The results represent a measured but meaningful advance over prior work, but the deeper value of the research lies in the glimpse it offers into a persistent drive to access the viewer's internal state. Though the data was gathered with consent, the methodology points toward future frameworks that could extend beyond structured market-research settings.
This slightly paranoid conclusion is only bolstered by the cloistered, constrained, and jealously guarded nature of this particular strand of research.
* The authors' inline citations have been converted to hyperlinks.
First published on Wednesday, April 9th, 2025