Enabling delightful user experiences via predictive models of human attention – Google AI Blog


People have the remarkable ability to take in a tremendous amount of information (estimated to be ~10^10 bits/s entering the retina) and selectively attend to a few task-relevant and interesting regions for further processing (e.g., memory, comprehension, action). Modeling human attention (the result of which is often called a saliency model) has therefore been of interest across the fields of neuroscience, psychology, human-computer interaction (HCI) and computer vision. The ability to predict which regions are likely to attract attention has numerous important applications in areas like graphics, photography, image compression and processing, and the measurement of visual quality.

We’ve previously discussed the possibility of accelerating eye movement research using machine learning and smartphone-based gaze estimation, which earlier required specialized hardware costing up to $30,000 per unit. Related research includes “Look to Speak”, which helps users with accessibility needs (e.g., people with ALS) to communicate with their eyes, and the recently published “Differentially private heatmaps” technique to compute heatmaps, like those for attention, while protecting users’ privacy.

In this blog, we present two papers (one from CVPR 2022, and one just accepted to CVPR 2023) that highlight our recent research in the area of human attention modeling: “Deep Saliency Prior for Reducing Visual Distraction” and “Learning from Unique Perspectives: User-aware Saliency Modeling”, together with recent research on saliency-driven progressive loading for image compression (1, 2). We showcase how predictive models of human attention can enable delightful user experiences such as image editing to minimize visual clutter, distraction or artifacts, image compression for faster loading of webpages or apps, and guiding ML models towards more intuitive human-like interpretation and model performance. We focus on image editing and image compression, and discuss recent advances in modeling in the context of these applications.

Attention-guided image editing

Human attention models usually take an image as input (e.g., a natural image or a screenshot of a webpage), and predict a heatmap as output. The predicted heatmap on the image is evaluated against ground-truth attention data, which are typically collected by an eye tracker or approximated via mouse hovering/clicking. Earlier models leveraged handcrafted features for visual clues, like color/brightness contrast, edges, and shape, while more recent approaches automatically learn discriminative features based on deep neural networks, from convolutional and recurrent neural networks to more recent vision transformer networks.
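To make the evaluation step concrete, here is a minimal sketch of scoring a predicted heatmap against a ground-truth attention map. The post does not say which metric is used; the Pearson correlation coefficient (often abbreviated CC) shown here is one commonly used saliency metric and is chosen only for illustration. The function name and the tiny 2x2 maps are hypothetical.

```python
import math


def correlation_coefficient(pred, truth):
    """Pearson correlation between two heatmaps given as lists of rows.

    Illustrative only: the post does not name its evaluation metric.
    """
    p = [v for row in pred for v in row]
    t = [v for row in truth for v in row]
    if len(p) != len(t):
        raise ValueError("heatmaps must have the same number of pixels")

    mean_p = sum(p) / len(p)
    mean_t = sum(t) / len(t)
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    norm_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    norm_t = math.sqrt(sum((b - mean_t) ** 2 for b in t))

    # A constant map has no variance, so correlation is undefined; report 0.
    if norm_p == 0 or norm_t == 0:
        return 0.0
    return cov / (norm_p * norm_t)


# Hypothetical ground truth (e.g., from an eye tracker) and a prediction.
ground_truth = [[0.0, 1.0], [2.0, 3.0]]
prediction = [[0.1, 0.9], [2.2, 2.8]]
score = correlation_coefficient(prediction, ground_truth)
```

A score near 1 means the prediction ranks regions much like the ground truth does; a score near -1 means it ranks them in reverse.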

In “Deep Saliency Prior for Reducing Visual Distraction” (more information on this project site), we leverage deep saliency models for dramatic yet visually realistic edits, which can significantly change an observer’s attention to different image regions. For example, removing distracting objects in the background can reduce clutter in photos, leading to increased user satisfaction. Similarly, in video conferencing, reducing clutter in the background may increase focus on the main speaker (example demo here).

To explore what types of editing effects can be achieved and how these affect viewers’ attention, we developed an optimization framework for guiding visual attention in images using a differentiable, predictive saliency model. Our method employs a state-of-the-art deep saliency model. Given an input image and a binary mask representing the distractor regions, pixels within the mask are edited under the guidance of the predictive saliency model such that the saliency within the masked region is reduced. To make sure the edited image is natural and realistic, we carefully choose four image editing operators: two standard image editing operations, namely recolorization and image warping (shift); and two learned operators (we do not define the editing operation explicitly), namely a multi-layer convolution filter, and a generative model (GAN).
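The optimization loop described above can be sketched in a few lines. This is a toy illustration under strong assumptions, not the paper's method: the deep saliency model is replaced by a hand-written stand-in (squared contrast against the mean of the unmasked background), the only operator is a per-pixel intensity change, and the gradient is derived by hand instead of by automatic differentiation. All function names and values are hypothetical.

```python
def toy_saliency(image, background_mean):
    """Stand-in saliency: squared contrast of each pixel against the background.

    Assumption: a real system would call a pre-trained deep saliency network.
    """
    return [[(p - background_mean) ** 2 for p in row] for row in image]


def masked_saliency(image, mask, background_mean):
    """Total saliency inside the distractor mask (the quantity to minimize)."""
    sal = toy_saliency(image, background_mean)
    return sum(s for srow, mrow in zip(sal, mask)
               for s, m in zip(srow, mrow) if m)


def reduce_distraction(image, mask, steps=50, lr=0.1):
    """Gradient descent on masked pixels only; unmasked pixels stay untouched."""
    background = [p for row, mrow in zip(image, mask)
                  for p, m in zip(row, mrow) if not m]
    bg_mean = sum(background) / len(background)

    edited = [row[:] for row in image]
    for _ in range(steps):
        for r, mrow in enumerate(mask):
            for c, m in enumerate(mrow):
                if m:
                    # d/dx of (x - bg_mean)^2 is 2 * (x - bg_mean).
                    grad = 2.0 * (edited[r][c] - bg_mean)
                    edited[r][c] -= lr * grad
    return edited, bg_mean


# A bright distractor pixel in the middle of a uniform grayscale background.
image = [[0.2, 0.2, 0.2],
         [0.2, 0.9, 0.2],
         [0.2, 0.2, 0.2]]
mask = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
edited, bg_mean = reduce_distraction(image, mask)
```

In this toy setting the masked pixel drifts toward the background intensity, which lowers its stand-in saliency while leaving every pixel outside the mask unchanged.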

With these operators, our framework can produce a variety of powerful effects, with examples in the figure below, including recoloring, inpainting, camouflage, object editing or insertion, and facial attribute editing. Importantly, all these effects are driven solely by the single, pre-trained saliency model, without any additional supervision or training. Note that our goal is not to compete with dedicated methods for producing each effect, but rather to demonstrate how multiple editing operations can be guided by the knowledge embedded within deep saliency models.

Examples of reducing visual distractions, guided by the saliency model with multiple operators. The distractor region is marked on top of the saliency map (red border) in each example.

Enriching experiences with user-aware saliency modeling

Prior research assumes a single saliency model for the whole population. However, human attention varies between individuals: while the detection of salient clues is fairly consistent, their order, interpretation, and gaze distributions can differ substantially. This offers opportunities to create personalized user experiences for individuals or groups. In “Learning from Unique Perspectives: User-aware Saliency Modeling”, we introduce a user-aware saliency model, the first that can predict attention for one user, a group of users, and the general population, with a single model.

As shown in the figure below, core to the model is the combination of each participant’s visual preferences with a per-user attention map and adaptive user masks. This requires per-user attention annotations to be available in the training data, e.g., the OSIE mobile gaze dataset for natural images; FiWI and WebSaliency datasets for web pages. Instead of predicting a single saliency map representing attention of all users, this model predicts per-user attention maps to encode individuals’ attention patterns. Further, the model adopts a user mask (a binary vector with the size equal to the number of participants) to indicate the presence of participants in the current sample, which makes it possible to select a group of participants and combine their preferences into a single heatmap.
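As a rough illustration of how a binary user mask can select participants and merge their maps into one heatmap, consider the sketch below. It assumes the per-user attention maps are already available as small grids and combines the selected ones by summing and normalizing; that combination rule is an assumption made for this example, since the post does not spell out how the model merges preferences internally. The function name and data are hypothetical.

```python
def combine_user_maps(per_user_maps, user_mask):
    """Merge the attention maps of the users selected by a binary mask.

    per_user_maps: one 2D grid (list of rows) per participant.
    user_mask: binary list, one entry per participant (1 = include).
    Returns a heatmap normalized to sum to 1 (assumed combination rule).
    """
    if len(per_user_maps) != len(user_mask):
        raise ValueError("mask length must equal the number of participants")

    selected = [m for m, keep in zip(per_user_maps, user_mask) if keep]
    if not selected:
        raise ValueError("user mask selects no participants")

    rows, cols = len(selected[0]), len(selected[0][0])
    summed = [[sum(m[r][c] for m in selected) for c in range(cols)]
              for r in range(rows)]

    total = sum(sum(row) for row in summed)
    if total == 0:
        return summed
    return [[v / total for v in row] for row in summed]


# Three hypothetical participants looking at a 1x2 image (left, right).
maps = [
    [[1.0, 0.0]],  # participant 0 looks left
    [[0.0, 1.0]],  # participant 1 looks right
    [[0.0, 1.0]],  # participant 2 looks right
]
group_a = combine_user_maps(maps, [1, 1, 0])  # [[0.5, 0.5]]
group_b = combine_user_maps(maps, [0, 1, 1])  # [[0.0, 1.0]]
```

Setting every mask entry to 1 gives a population-level heatmap, and a mask with a single 1 gives one participant's map, mirroring the one-user, group, and general-population cases described in the text.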

An overview of the user-aware saliency model framework. The example image is from the OSIE image set.

During inference, the user mask allows making predictions for any combination of participants. In the following figure, the first two rows are attention predictions for two different groups of participants (with three people in each group) on an image. A conventional attention prediction model will predict identical attention heatmaps. Our model can distinguish the two groups (e.g., the second group pays less attention to the face and more attention to the food than the first). Similarly, the last two rows are predictions on a webpage for two distinctive participants, with our model showing different preferences (e.g., the second participant pays more attention to the left region than the first).

Predicted attention vs. ground truth (GT). EML-NET: predictions from a state-of-the-art model, which will have the same predictions for the two participants/groups. Ours: predictions from our proposed user-aware saliency model, which can predict the unique preference of each participant/group correctly. The first image is from the OSIE image set, and the second is from FiWI.

Progressive image decoding centered on salient features

Besides image editing, human attention models can also improve users’ browsing experience. One of the most frustrating and annoying user experiences while browsing is waiting for web pages with images to load, especially in conditions with low network connectivity. One way to improve the user experience in such cases is with progressive decoding of images, which decodes and displays increasingly higher-resolution image sections as data are downloaded, until the full-resolution image is ready. Progressive decoding usually proceeds in a sequential order (e.g., left to right, top to bottom). With a predictive attention model (1, 2), we can instead decode images based on saliency, making it possible to send the data necessary to display details of the most salient regions first. For example, in a portrait, bytes for the face can be prioritized over those for the out-of-focus background. Consequently, users perceive better image quality earlier and experience significantly reduced wait times. More details can be found in our open source blog posts (post 1, post 2). Thus, predictive attention models can help with image compression and faster loading of web pages with images, and improve rendering for large images and streaming/VR applications.
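The idea of sending the most salient regions first can be sketched as a simple ordering problem. The example below splits a saliency map into square tiles and sorts them by mean saliency, falling back to the usual top-to-bottom, left-to-right order on ties. It is only an illustration of the ordering step: the tile size, function name, and saliency values are hypothetical, and it does not model how any particular image codec packs or transmits its data.

```python
def tile_order(saliency, tile):
    """Return tile origins (row, col), most salient tile first.

    saliency: 2D grid (list of rows) of predicted attention values.
    tile: side length of each square tile, in pixels.
    """
    rows, cols = len(saliency), len(saliency[0])
    scored = []
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            block = [saliency[r][c]
                     for r in range(r0, min(r0 + tile, rows))
                     for c in range(c0, min(c0 + tile, cols))]
            scored.append((sum(block) / len(block), r0, c0))

    # Highest mean saliency first; ties keep sequential (raster) order.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r0, c0) for _, r0, c0 in scored]


# Hypothetical saliency map: the bottom-right quadrant (say, a face)
# attracts the most attention.
saliency_map = [
    [0.0, 0.0, 0.1, 0.1],
    [0.0, 0.0, 0.1, 0.1],
    [0.2, 0.2, 0.9, 0.9],
    [0.2, 0.2, 0.9, 0.9],
]
order = tile_order(saliency_map, tile=2)
```

A purely sequential decoder would start at tile (0, 0); with this ordering the bottom-right tile is scheduled first, so the region most likely to be looked at sharpens earliest.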


We’ve shown how predictive models of human attention can enable delightful user experiences via applications such as image editing that can reduce clutter, distractions or artifacts in images or photos for users, and progressive image decoding that can greatly reduce the perceived waiting time for users while images are fully rendered. Our user-aware saliency model can further personalize the above applications for individual users or groups, enabling richer and more unique experiences.

Another interesting direction for predictive attention models is whether they can help improve robustness of computer vision models in tasks such as object classification or detection. For example, in “Teacher-generated spatial-attention labels boost robustness and accuracy of contrastive models”, we show that a predictive human attention model can guide contrastive learning models to achieve better representation and improve the accuracy/robustness of classification tasks (on the ImageNet and ImageNet-C datasets). Further research in this direction could enable applications such as using radiologists’ attention on medical images to improve health screening or diagnosis, or using human attention in complex driving scenarios to guide autonomous driving systems.


This work involved collaborative efforts from a multidisciplinary team of software engineers, researchers, and cross-functional contributors. We’d like to thank all the co-authors of the papers/research, including Kfir Aberman, Gamaleldin F. Elsayed, Moritz Firsching, Shi Chen, Nachiappan Valliappan, Yushi Yao, Chang Ye, Yossi Gandelsman, Inbar Mosseri, David E. Jacobes, Yael Pritch, Shaolei Shen, and Xinyu Ye. We also want to thank team members Oscar Ramirez, Venky Ramachandran and Tim Fujita for their help. Finally, we thank Vidhya Navalpakkam for her technical leadership in initiating and overseeing this body of work.

