A Model of Visual Attention addresses the observed and/or predicted behavior of human and non-human primate visual attention. Models can be descriptive, mathematical, algorithmic or computational and attempt to mimic, explain and/or predict some or all of visual attentive behavior. A Computational Model of Visual Attention not only includes a process description for how attention is computed, but also can be tested by providing image inputs, similar to those an experimenter might present to a subject, and then seeing how the model performs by comparison.
This article presents an overview of a wide variety of models of visual attention that have been presented over the past few decades. A number of model classes will be defined within an organizational taxonomy, in an attempt to organize a rapidly growing literature and with a view towards guiding future research. The taxonomy will reflect the differing schools of thought as well as the different modeling strategies. Further, it is important to keep in mind that not all models were developed with the same goals and that modelers do not always follow only one school of thought or strategy. Motivations for all models come from two sources. The first is the desire to understand the human perceptual capability to select, process and act upon parts of one's sensory experience differentially from the rest. The second is the need to reduce the quantity of sensory information processed by a perceptual system (see Computational Foundations for Attentive Processes).
This article focuses on models whose goal is to provide an understanding of all or part of human or non-human primate visual attention. The bulk of models that focus primarily on the development of artefacts for computer vision or robotic systems will not be mentioned, even if they include significant biological inspiration. Biological relevance is the key here, that is, research that attempts to model a particular set of experimental observations and simultaneously makes predictions that would extend that set and could be verified by future experiments. We try not to judge any model but to provide factual information about modeling in general, about kinds of models (or modeling 'camps'), and about the kinds of functions different models cover. Interested readers can draw their own conclusions.
An important class of models is not covered here, solely because of the emphasis on models that claim to explain the biology of attention: the many efforts that use aspects of attentive processing in applied settings, in robotics, surveillance and other applications. Fortunately, an excellent recent survey exists for those interested (Frintrop et al. 2010).
A Model of Visual Attention is a description of the observed and/or predicted behavior of human and non-human primate visual attention. Models can employ natural language, system block diagrams, mathematics, algorithms or computations as their embodiment and attempt to mimic, explain and/or predict some or all of visual attentive behavior. Of importance are the accompanying assumptions, the set of statements or principles devised to provide the explanation, and the extent of the facts or phenomena that are explained. These cannot all be laid out here due to the resulting article length, but the reader is encouraged to follow the citations provided. Models must be tested by experiments, and such experiments replicated, both with respect to their explanations of existing phenomena and with respect to their predictive validity.
A Computational Model of Visual Attention is an instance of a model of visual attention; it not only includes a formal description of how attention is computed, but also can be tested by providing image inputs, similar to those an experimenter might present to a subject, and then seeing how the model performs by comparison. The bulk of this article will focus on computational models. It should be pointed out that this definition differs from the usual, almost casual, use of the term 'computational' in the area of neurobiological modeling, where it has come to mean almost any model that includes a mathematical formulation of some kind. Mathematical equations can be solved and/or simulated on a computer, and thus the term computational has seemed appropriate to many authors. Marr's levels of analysis (Marr 1982) provide a different view. He specified three levels of analysis: the computational level (a formal statement of the problems that must be overcome), the algorithmic level (the strategy that may be used), and the implementation level (how the task is actually performed in the brain or in a computer, solving the problems laid out at the computational level, using the strategies of the algorithmic level and adding in the details required for their implementation). Our use of the term 'computational model' is intended to capture models that specify all three of Marr's levels in a testable manner. Our description of the functional elements of attention in Section 3 corresponds to Marr's first level of analysis, the problems that must be addressed. The terms 'descriptive', 'data-fitting' and 'algorithmic' as used here describe three different methodologies for specifying Marr's algorithmic level of analysis. Section 2 will provide definitions and further discussion of the model classification strategy used here.
Models of attention are complex, providing mechanisms and explanations for a number of functions, all tied together by a control system; this is basically the specification at Marr's 'computational level' of analysis. More detail on each of these tasks is provided in Visual Attention. Due to this complexity, model evaluation is not a simple matter and objective conclusions are still elusive.
The point of the next section is to create a context for such models; this enables one to see their scientific heritage, to distinguish models on the basis of their modeling strategy, and to situate new models appropriately to enable comparisons and evaluations.
We present a taxonomy of models, with computational models lying at the intersection of how the biological community and the computer vision community view attentive processes (see Figure 2). There are two main roots in this lattice: one for the uses of attentive methods in computer vision and one for the development of attention models in the biological vision community. Although both have proceeded independently, and indeed the use of attention appears in the computer vision literature before most biological models, the major point of intersection is the class of computational models (using the definition given above).
It is quite clear that the motivations for all the modeling efforts come from two sources. The first is the deep interest in understanding a perceptual capability that has been observed for centuries, that is, the ability to select, process and act upon parts of one's sensory experience differentially from the rest. The second is the need to reduce the quantity of sensory information entering any system, biological or otherwise, by selecting or ignoring parts of the sensory input. Although the motivations seem distinct, the conclusion is the same: in reality the motivation for attention in any system is to reduce the quantity of information to process in order to complete some task (see Computational Foundations for Attentive Processes). But depending on one's interest, modeling efforts do not always have the same goals. One may be trying to model a particular set of experimental observations; one may be building a robotic vision system in which attention selects landmarks for navigation; one may be interested in eye movements, or in the executive control function, or in any one or more of the functional elements described in Visual Attention. As a result, comparing whole models is not straightforward, fair, or useful. Comparing pieces that represent the same functionality is more relevant, but there are so many of these combinations that it would be an exercise beyond the scope of this overview.
The use of attentive methods has pervaded the computer vision literature, demonstrating the importance of reducing the amount of information to be processed. It is important to note that several early analyses of the extent of the information load issue appeared (Uhr 1972, Feldman and Ballard 1982, Tsotsos 1987) with converging suggestions for its solution, those convergences appearing in a number of the models below (particularly those of Burt 1988 or Tsotsos 1990). Specifically, the methods can be grouped into four categories. Within modern computer vision, there are many variations and combinations of these themes because, despite the impressive increases in the power of modern computers, the inherent difficulty of processing images demands attentional processes (see Computational Foundations for Attentive Processes).
One way to reduce the amount of an image to be processed is to concentrate on the points or regions that are most interesting or relevant for the next stage of processing (such as for recognition or action). The idea is that perhaps 'interestingness' can be computed in parallel across the whole image and then those interesting points or regions can be processed in more depth serially. The first of these methods is due to Moravec (1981) and since then a large number of different kinds of 'interest point' computations have been used. It is interesting to note the parallel here with the Saliency Map Hypothesis described below.
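The flavor of such an operator can be conveyed with a short sketch. The following is a minimal Moravec-style interest measure, a simplification for illustration rather than Moravec's exact formulation; the 3x3 averaging window and the wrap-around shifts are assumptions of this sketch:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def moravec_interest(img):
    """Moravec-style 'interestingness': for each pixel, the minimum over
    four shift directions of the locally averaged squared difference
    between the image and a one-pixel-shifted copy. Corner-like points
    score high because intensity changes in every direction; uniform
    regions and edges score low in at least one direction."""
    img = img.astype(float)
    scores = []
    for dy, dx in [(1, 0), (0, 1), (1, 1), (1, -1)]:
        shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)  # wraps at borders; acceptable for a sketch
        scores.append(uniform_filter((img - shifted) ** 2, size=3))
    return np.minimum.reduce(scores)

# 'Interestingness' is computed in parallel over the whole image; the
# top-scoring locations are then candidates for serial, in-depth processing.
img = np.random.rand(64, 64)
interest = moravec_interest(img)
ys, xs = np.unravel_index(np.argsort(interest, axis=None)[::-1][:5], interest.shape)
```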
The computational load is due not only to the large number of image locations (that number is not so large as to cause much difficulty for modern computers), but rather to the combinatorial number of possible subsets of locations or regions. In perceptual psychology, how the brain might organize items into groups is a major concern, pioneered by the Gestaltists (Wertheimer 1923). Thus, computer vision has used grouping strategies following Gestalt principles in order to limit the possible subsets of combinatorially defined items to consider. The first such use appeared in Muerle and Allen (1968) in the context of object segmentation.
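To see why grouping helps, note that n ungrouped items admit 2^n possible subsets, whereas g groups of roughly n/g items admit only about g * 2^(n/g) within-group subsets, an enormous reduction. The following is a minimal sketch of proximity grouping; the single-linkage clustering and the max_gap threshold are illustrative assumptions, not a reconstruction of Muerle and Allen's method:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def proximity_groups(points, max_gap=10.0):
    """Group 2-D points by Gestalt proximity: single-linkage clustering,
    cut wherever the nearest-neighbour gap exceeds max_gap. Later stages
    can then consider combinations within groups instead of all subsets
    of all points."""
    Z = linkage(points, method='single')   # merge by smallest inter-point distance
    return fcluster(Z, t=max_gap, criterion='distance')

points = np.array([[0, 0], [1, 1], [2, 0], [50, 50], [51, 52]])
labels = proximity_groups(points)          # e.g. [1, 1, 1, 2, 2]: two proximity groups
```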
Human eyes move, and humans move around their world in order to acquire visual information. Active vision in computer vision uses intelligent control strategies applied to the data acquisition process depending on the current state of data interpretation (Bajcsy 1985, Tsotsos 1992). A variety of methods have appeared following this idea, perhaps the earliest one most relevant to this discussion is the robotic binocular camera system of Clark and Ferrier (1988), featuring a salience-based fixation control mechanism.
The application of domain and task knowledge to guide or predict processing is a powerful tool for limiting processing, a fact that has been formally proved (Tsotsos 1989; Parodi et al. 1998). The first use was for oriented line location in a face-recognition task (Kelly 1971). The first instance for temporal window prediction was in a motion recognition task (Tsotsos et al. 1980).
Clearly, in this class the major motivation has always been to provide explanations for the characteristics of biological, especially human, vision. Typically, these models have been developed to explain a particular body of experimental observations. This is a strength: the authors usually are the ones who have done some or all of the experiments and thus completely understand the experimental methods and conclusions. Simultaneously, however, it is also a weakness, because such models are often difficult to extend to a broader class of observations. Along the biological vision branch, the three classes identified here are:
These models are described primarily using natural language and/or block diagrams. Their value lies in the explanation they provide of certain attentional processes; the abstractness of that explanation is also their major weakness because it is typically open to interpretation. Classic models, even though they were motivated by experiments in auditory attention, have been very influential. Early Selection (Broadbent 1958), Late Selection (Deutsch & Deutsch 1963, Moray 1969, Norman 1968), and Attenuator Theory (Treisman 1964) are all descriptive models. Others such as Feature Integration Theory (Treisman and Gelade 1980), Guided Search (Wolfe et al. 1989), Animate Vision (Ballard 1991), Biased Competition (Desimone and Duncan 1995), FeatureGate (Cave 1999), the Feature Similarity Gain Model (Treue & Martinez-Trujillo 1999), RNA (Shipp 2004), and the model of Knudsen (2007) are also considered descriptive. The Biased Competition Model has garnered many followers, mostly due to its combination of competition with top-down bias, concepts that actually appeared in earlier models (such as Grossberg 1982 or Tsotsos 1990). These are conceptual frameworks, ways of thinking about the problem of attention. Many have played important, indeed foundational, roles in how the field has developed.
These models are mathematical and are developed to capture parameter variations in experimental data in as compact and parsimonious a form as possible. Their value lies primarily in how well they fit experimental data, and in the interpolation or extrapolation of parameter values to other experimental scenarios. Good examples are the Theory of Visual Attention (Bundesen 1990) and the set of models that employ normalization as a basic processing element. An early one is the model of Reynolds et al. (1999), which proposed a quantification of the Biased Competition model. Subsequently, this was refined further into the Normalization Model of Attention, a marriage of divisive normalization with biased competition (Reynolds & Heeger 2009). At about the same time a further normalization model appeared, the Normalization Model of Attentional Modulation (Lee & Maunsell 2009), showing how attention changes the gain of responses to individual stimuli and why attentional modulation is more than a gain change when multiple stimuli are present in a receptive field.
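The core idea can be illustrated with a one-dimensional sketch in the style of Reynolds & Heeger (2009): the stimulus drive is multiplied by an attention field and then divided by a pooled suppressive drive. The Gaussian pooling, the parameter values, and the purely spatial (rather than spatial-and-featural) formulation here are simplifying assumptions of this sketch:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normalization_model(stim_drive, attn_field, sigma=0.1, pool_sigma=5.0):
    """A minimal normalization-model-of-attention sketch: the excitatory
    drive E(x) is modulated multiplicatively by the attention field A(x)
    (A = 1 where unattended), and the response is this modulated drive
    divided by a suppressive drive (the modulated drive pooled over
    space) plus a semi-saturation constant sigma."""
    excitatory = attn_field * stim_drive                    # attentional modulation
    suppressive = gaussian_filter(excitatory, pool_sigma)   # pooled suppressive drive
    return excitatory / (suppressive + sigma)

x = np.linspace(-10, 10, 201)
stim = np.exp(-x**2 / 2)                  # a single stimulus at the centre
attn = 1.0 + 2.0 * np.exp(-x**2 / 8)      # attention field centred on the stimulus
response = normalization_model(stim, attn)
```

In the full model the pooling runs over both space and feature, and the relative extents of the stimulus and the attention field determine whether attention produces contrast-gain or response-gain changes.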
These models provide mathematics and algorithms that govern their performance and as a result present a process by which attention might be computed and deployed. They do not, however, provide sufficient detail or methodology for the model to be tested on real stimuli; instead, these models often provide simulations to demonstrate their actions. In a real sense they are a combination of descriptive and data-fitting models: they provide enough descriptive detail to be simulated, while showing good comparison to experimental data at qualitative (and perhaps also quantitative) levels. The best known of these models is the Saliency Map Model (Koch and Ullman 1985 - defined in Section 2.3.2); it has given rise to many subsequent models. It is interesting to note that the Saliency Map Model is strongly related to the Interest Point Operations on the other side of this taxonomy. Other algorithmic models include Adaptive Resonance Theory (Grossberg 1982), Temporal Tagging (Niebur et al. 1993; Usher and Niebur 1996), Shifter Circuits (Anderson and Van Essen 1987), Visual Routines (Ullman 1984), CODAM (Taylor and Rogers 2002), and a SOAR-based model (Wiesmeyer & Laird 1990).
As mentioned earlier, the point of intersection between the computer vision and biological vision communities is represented by the set of computational models in the taxonomy. Computational Models not only include a process description for how attention is computed, but also can be tested by providing image inputs, similar to those an experimenter might present to a subject, and then seeing how the model performs by comparison. The biological connection is key, and pure computer vision efforts are not included here. Under this definition, computational models generally provide more complete specifications and permit more objective evaluations as well. This greater level of detail is a strength but also a weakness, because there are more details that require experimental validation.
Many models have elements from more than one class so the separation is not a strict one. Computational models necessarily are Algorithmic Models and often also include Data-Fitting elements. Nevertheless, in recent years four major schools of thought have emerged, schools that will be termed 'hypotheses' here since each has both supporting and detracting evidence. In what follows, an attempt is made to provide the intellectual antecedents for each of these major hypotheses. The taxonomy is completed in Section 2.4 when several instances of each of the classes are added.
This hypothesis focuses on how attention solves the problems associated with stimulus selection and then transmission through the visual cortex. The issues of how signals in the brain are transmitted to ensure correct perception appear, in part, in a number of works. Milner (1974), for example, mentions that attention acts in part to activate feedback pathways to the early visual cortex for precise localization, implying a pathway search problem. The complexity of the brain's network of feed-forward and feedback connectivity highlights the physical problems of search, transmission and finding the right path between input and output (see Felleman and Van Essen 1991). Anderson and Van Essen's Shifter Circuits proposal (Anderson & Van Essen 1987) was presented primarily to solve these physical routing and transmission problems, using control signals to each layer of processing that shift selected inputs from one path to another. The routing issues, described in (Tsotsos et al. 1995), are: 1) A single unit at the top of the visual processing network receives input from a sub-network of converging inputs, and thus from a large portion of the visual field (the Context Problem - see Figure 3a); 2) A single event at the input will affect a large number of units in the network due to a diverging feed-forward signal, resulting in a loss of localization information (the Blurring Problem - see Figure 3b); 3) Two separate visual events in the visual field will activate two overlapping sub-networks of units and connections, whose region of overlap will contain units whose activity is a function of both events. Thus, each event interferes with the interpretation of other events in the visual field (the Cross-Talk Problem - see Figure 3c).
Any model that uses a biologically plausible network of neural processing units needs to address these problems. One class of solutions is that of an attentional 'beam' through the processing network as shown in Figure 3d.
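The Blurring and Context Problems are easy to demonstrate numerically. The following sketch propagates a single input event up a 1-D feed-forward pyramid; layer sizes are kept constant and subsampling is omitted purely for simplicity, and the fan-in and depth are illustrative assumptions rather than any particular model's circuit:

```python
import numpy as np

def pyramid_spread(n_inputs=64, fan_in=3, n_layers=5):
    """Each unit pools fan_in neighbouring units from the layer below.
    A single active input location affects an ever-wider set of units on
    the way up (the Blurring Problem); equivalently, each top-layer unit
    receives input from a large portion of the visual field (the Context
    Problem)."""
    affected = np.zeros(n_inputs)
    affected[n_inputs // 2] = 1.0                       # one 'event' at the input layer
    for layer in range(1, n_layers + 1):
        spread = np.convolve(affected, np.ones(fan_in), mode='same')
        affected = (spread > 0).astype(float)           # a unit is either affected or not
        print(f"layer {layer}: {int(affected.sum())} units affected")

pyramid_spread()   # the affected region grows by fan_in - 1 units per layer
```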
Models that fall into the Selective Routing class include Pyramid Vision (Burt 1988), Olshausen et al. (1993), Selective Tuning (Tsotsos et al. 1995; Zaharescu et al. 2004, Tsotsos et al. 2005; Rodriguez-Sanchez et al. 2007, Rothenstein et al. 2008), NeoCognitron (Fukushima 1986), and SCAN (Postma et al. 1997).
This hypothesis has its roots in Feature Integration Theory (Treisman and Gelade 1980) and appears first in the class of algorithmic models above (Koch and Ullman 1985). It includes the following elements (see Figure 4): (i) an early representation composed of a set of feature maps, computed in parallel, permitting separate representations of several stimulus characteristics; (ii) a topographic saliency map where each location encodes the combination of properties across all feature maps as a conspicuity measure; (iii) a selective mapping into a central non-topographic representation, through the topographic saliency map, of the properties of a single visual location; (iv) a winner-take-all (WTA) network implementing the selection process based on one major rule: conspicuity of location (minor rules of proximity or similarity preference are also suggested); and (v) inhibition of the selected location that causes an automatic shift to the next most conspicuous location. Feature maps code conspicuity within a particular feature dimension. The saliency map combines information from each of the feature maps into a global measure where points corresponding to one location in a feature map project to single units in the saliency map. Saliency at a given location is determined by the degree of difference between that location and its surround. The models of Clark & Ferrier (1988), Sandon (1990) (the first implementation of the Koch & Ullman model), Itti et al. (1998), Itti & Koch (2000), Walther et al. (2002), Navalpakkam & Itti (2005), Itti & Baldi (2006), SERR (Humphreys & Müller 1993), Zhang et al. (2008), and Bruce & Tsotsos (2009) are all in this class. The drive to discover the best representation of saliency or conspicuity is a major current activity; whether or not a single such representation exists in the brain remains an open question, with evidence supporting many potential loci (summarized in Tsotsos et al. 2005).
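A minimal sketch conveys elements (i)-(v). The colour-opponent feature maps, single-scale centre-surround differences and Gaussian inhibition-of-return kernel below are illustrative simplifications of this sketch; Itti et al. (1998), for example, use multi-scale pyramids, orientation channels and a more elaborate normalization scheme:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround(feature_map, c=2.0, s=6.0):
    """Conspicuity within one feature dimension: absolute difference
    between a fine ('center') and a coarse ('surround') blurred copy."""
    return np.abs(gaussian_filter(feature_map, c) - gaussian_filter(feature_map, s))

def saliency_map(image_rgb):
    """Feature maps computed in parallel, each converted to a conspicuity
    map, normalized, and summed into one topographic saliency map."""
    r, g, b = (image_rgb[..., i].astype(float) for i in range(3))
    features = [(r + g + b) / 3.0,            # intensity
                r - g,                        # crude red-green opponency
                b - (r + g) / 2.0]            # crude blue-yellow opponency
    maps = [center_surround(f) for f in features]
    maps = [(m - m.min()) / (m.max() - m.min() + 1e-9) for m in maps]
    return sum(maps)

def attend(sal, n_fixations=3, ior_sigma=8.0):
    """Winner-take-all with inhibition of return: repeatedly select the
    most conspicuous location, then suppress its neighbourhood so the
    selection shifts automatically to the next most conspicuous one."""
    sal = sal.copy()
    yy, xx = np.mgrid[:sal.shape[0], :sal.shape[1]]
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)
        fixations.append((y, x))
        sal *= 1.0 - np.exp(-((yy - y)**2 + (xx - x)**2) / (2 * ior_sigma**2))
    return fixations
```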
The earliest conceptualization of this idea seems to be due to Grossberg who between 1973 and 1980, presented ideas and theoretical arguments regarding the relationship among neural oscillations, visual perception and attention (see Grossberg 1980). His work led to the ART model that provided details on how neurons may reach stable states given both top-down and bottom-up signals and play roles in attention and learning (Grossberg 1982). Milner also suggested that the unity of a figure at the neuronal level is defined by synchronized firing activity (Milner 1974). von der Malsburg (1981) wrote that neural modulation is governed by correlations in temporal structure of signals and that timing correlations signal objects. He defined a detailed model of how this might be accomplished, including neurons with dynamically modifiable synaptic strengths that became known as von der Malsburg synapses. Crick & Koch (1990) later proposed that an attentional mechanism binds together all those neurons whose activity relates to the relevant features of a single visual object. This is done by generating coherent semi-synchronous oscillations in the 40-70Hz range. These oscillations then activate a transient short-term memory. Models subscribing to this hypothesis typically consist of pools of excitatory and inhibitory neurons connected as shown in Figure 5. The actions of these neuron pools are governed by sets of differential equations; it is a dynamical system. Strong support for this view appears in a nice summary by Sejnowski and Paulsen (2006). The model of Hummel & Biederman (1992) and those from Deco's group - Deco & Zihl (2001), Corchs & Deco (2001), Deco, Pollatos & Zihl (2002) - are within this class. A number of other models exist but do not conform to our definition of computational model; they are mathematical models that only provide simulations of their performance. As such, we cannot include them here but do provide these citations because of the intrinsic interest in this model class (Niebur et al. (1993), Usher & Niebur (1996), Kazanovich & Borisyuk (1999), Wu & Guo (1999)). Clearly, there is room for expansion of these models into computational form. This hypothesis remains controversial (see Shadlen and Movshon 1999).
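The flavour of such a dynamical system can be shown with a single pair of pools. The sketch below uses the classic limit-cycle parameters of Wilson & Cowan (1972) rather than any particular attention model; for external drive P in the right range the coupled pools oscillate, and in temporal-tagging models it is the synchrony of such oscillations across pools that binds and 'tags' the attended stimulus:

```python
import numpy as np

def ei_pools(P=1.25, T=100.0, dt=0.05):
    """One excitatory (E) and one inhibitory (I) neural pool of the kind
    shown in Figure 5, with Wilson & Cowan's (1972) limit-cycle
    parameters. Returns the E-pool activity over time, which oscillates
    for suitable external drive P."""
    Se = lambda v: 1.0 / (1.0 + np.exp(-1.3 * (v - 4.0)))   # excitatory activation
    Si = lambda v: 1.0 / (1.0 + np.exp(-2.0 * (v - 3.7)))   # inhibitory activation
    E, I, trace = 0.1, 0.1, []
    for _ in range(int(T / dt)):
        dE = -E + (1 - E) * Se(16 * E - 12 * I + P)   # self-excitation, inhibition from I
        dI = -I + (1 - I) * Si(15 * E - 3 * I)        # driven by E, self-inhibition
        E, I = E + dt * dE, I + dt * dI
        trace.append(E)
    return np.array(trace)

activity = ei_pools()   # oscillatory trace; its rhythm is the 'temporal tag'
```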
The emergent attention hypothesis proposes that attention is a property of large assemblies of neurons involved in competitive interactions (of the kind mediated by lateral connections), and that selection is the combined result of local dynamics and top-down biases (see Figure 6). In other words, there is no explicit selection process of any kind: the evolution of the dynamical system of equations alone leads to single peaks of response that represent the focus of attention. Duncan (1979) provided an early discussion of properties of attention having an emergent quality in the context of divided attention. Grossberg's 1982 ART (Adaptive Resonance Theory) model played a formative role here. Such an emergent view took further root with work on the role of emergent features in attention by Pomerantz and Pristach (1989) and Treisman and Paterson (1984). Later, Styles (1997) suggested that attentional behavior emerges as a result of the complex underlying processing in the brain. Shipp's review (2004) concludes that this is the most likely hypothesis. The models of Heinke and Humphreys' SAIM (1997, 2003), Hamker (1999; 2000; 2004; 2005; 2006), Spratling (2008), Deco and Zihl (2001), and Corchs and Deco (2001) belong in this class, among others. Clearly, there must be mechanisms that support the process behind this; Hamker's model provides a good view of how this might be accomplished and shows, for example, how interactions between hierarchical representations are employed. Desimone and Duncan (1995) view their biased competition model as a member of this class, writing "attention is an emergent property of slow, competitive interactions that work in parallel across the visual field". In turn, many of the models in this class are also strongly based on Biased Competition.
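Emergence in this sense is easy to demonstrate with Grossberg's (1973) shunting competitive network, a natural fit given ART's formative role here. The network below contains no selection step at all; with the faster-than-linear signal function f(x) = x^2, the dynamics themselves quench all but the largest initial activity. The specific parameter values and the tiny four-unit example are assumptions of this sketch:

```python
import numpy as np

def emergent_winner(x0, A=0.01, B=1.0, dt=0.01, T=100.0):
    """Grossberg's (1973) shunting competitive network, in the spirit of
    Figure 6: each unit excites itself and inhibits all others. With a
    faster-than-linear signal function the dynamics alone drive all but
    the initially largest unit to zero; the surviving peak of activity
    is the focus of attention."""
    f = lambda x: x ** 2                 # faster-than-linear => winner-take-all
    x = np.array(x0, dtype=float)
    for _ in range(int(T / dt)):
        total = f(x).sum()
        # shunting on-center/off-surround: self-excitation, inhibition by all others
        x += dt * (-A * x + (B - x) * f(x) - x * (total - f(x)))
    return x

stimulus = np.array([0.50, 0.52, 0.50, 0.49])   # near-equal bottom-up inputs
bias = np.array([0.00, 0.00, 0.05, 0.00])       # weak top-down bias toward unit 2
print(emergent_winner(stimulus + bias))         # ~[0, 0, 0.99, 0]: unit 2 wins
```

A weak top-down bias added to near-equal bottom-up inputs is enough to determine which unit survives, exactly the combination of local dynamics and top-down bias that the hypothesis describes.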
A number of models have appeared over the years that borrow from the major attentional hypotheses and as noted earlier, many borrow from more than one. This section will classify a number of computational models conforming to the definition presented earlier. The directory of models follows while Figure 1 groups them according to their foundational ideas.
Model Directory:
AIM: Bruce & Tsotsos (2005; 2009)
ART: Grossberg (1975; 1982), Carpenter et al. (1998)
ClaFer: Clark & Ferrier (1988)
DraLio: Draper & Lionelle (2005)
FastGBA: Sharma (2016)
Hamker: Hamker (1999; 2000; 2004; 2005; 2006)
HumBie: Hummel & Biederman (1992)
LanDen: Lanyon & Denham (2004)
LeeBux: Lee et al. (2003)
LiZ: Li (2001)
MORSEL: Mozer (1991)
NeoCog: Fukushima (1986)
NeurDyn: Deco & Zihl (2001), Corchs & Deco (2001), Deco, Pollatos & Zihl (2002)
NowSej: Nowlan & Sejnowski (1995)
OliTor: Oliva et al. (2003)
OlshAn: Olshausen et al. (1993)
PC/BC-DIM: Spratling (2008)
PyrVis: Burt (1988)
SAIM: Heinke & Humphreys (1997; 2003)
Sandon: Sandon (1990)
SCAN: Postma et al. (1997)
SERR: Humphreys & Müller (1993)
SM: Itti et al. (1998)
SMOC: Itti & Koch (2000)
SMSurp: Itti & Baldi (2006)
SMTask: Navalpakkam & Itti (2005)
ST: Tsotsos et al. (1995)
STActive: Zaharescu et al. (2004)
STBind: Tsotsos et al. (2008), Rothenstein et al. (2008)
STFeature: Rodriguez-Sanchez et al. (2007)
STRec: Tsotsos et al. (2005)
SUN: Zhang et al. (2008)
SunFish: Sun et al. (2008)
UshNie: Usher & Niebur (1996)
vaHeGi: van de Laar et al. (1997)
VISIT: Ahmad (1992)
WalItt: Walther et al. (2002)
Figure 1 makes clear that the Saliency Map hypothesis seems most popular. Further, it is evident that few of the possible combinations of hypotheses have been explored. We would suggest that those empty joint classes are potentially valuable avenues of exploration, because it is clear that no single hypothesis covers the full breadth of attentional behavior, as was argued in Section 2 and further discussed in Section 3.
What are the functional elements of attention that a complete modeling effort must include? This is a difficult question and several previous papers have attempted to address it. Itti and Koch (2001), for example, review the state of attentional modeling, but from a point of view that assumes attention is primarily a bottom-up process, based largely on their notion of saliency maps. Knudsen (2007) provides a more recent review; his perspective favors an early selection model. He proposes a number of functional components fundamental to attention: working memory, competitive selection, top-down sensitivity control, and filtering for stimuli that are likely to be behaviorally important (salience filters). In his model, the world is first filtered by the salience filters in a purely bottom-up manner, creating the various neural representations on which competitive selection is based. The role of top-down information and control is to compute sensitivity control affecting the neural representations by incorporating the results of selection, working memory and gaze. A third functional structure is that of Hamker (1999), whose work is an excellent example of the neuro-dynamical approach. The focus is on excitatory and inhibitory neural pools, the ordering of their effects as well as the neural sites affected; top-down bias is a simple bias arising from area IT. 'What' and 'where' functions are separated: features are computed and represented in the ventral stream and spatial location in the dorsal stream. A review by Rothenstein & Tsotsos (2008) presents a classification of models with details on the functional elements each includes. Finally, Shipp (2004) provides yet another useful overview, comparing several different models along the dimension of how they map onto system-level circuits in the brain. He presents his Real Neural Architecture (RNA) model for attention, integrating several different modes of operation - parallel or serial, bottom-up or top-down, pre-attentive or attentive - found in cognitive models of attention for visual search.
It would seem that there is value in providing an additional perspective, namely one that is orthogonal to the neural correlates of function and independent of model and modeling strategy. This alternate functional decomposition is presented in Visual Attention and covers the breadth of visual attention from information reduction, to representations, to control, to the external manifestations of attentional behavior. It is fair to say that a complete model should account for each; it is also fair to say that no model yet comes close. These functional elements are listed below. We invite modelers to annotate each with a brief description of how their model provides the functionality listed; those details are beyond the scope of this article. The main elements of attention are now given; they are detailed further in Visual Attention, where one can also see the appropriate citations and biological evidence.
An important point here is that models of visual attention should be able to deal with each of these elements. It would be in the best interests of the readers of this article for each modeler to provide some annotation through this article (perhaps through the use of a sub-page) on how their model incorporates these attentional elements. It would form a major contribution to the comparison of models.
The above lists of elements are unlikely to be complete, nor the optimal partitioning of the problem, but they are representative of most current thinking. The effectiveness of any model, regardless of type as laid out in Section 2, is determined by how well it provides explanations for what is known about as many of the above functional elements as possible. As important, models must be falsifiable, that is, they must make testable predictions regarding new behaviors or functions not yet observed - behaviors that are not easily deduced from current knowledge, that are counterintuitive - that would enable one to support or reject the model. To test all the models on these criteria is beyond the scope of this article but is a necessary task for anyone wishing to answer the question "Which is the best model of visual attention?"
Nevertheless, several authors are making strong attempts at comparative evaluation using large databases of images and providing executable code that others can use. Primarily, these evaluations are for models in the Saliency Map Hypothesis class that focus on representations of saliency to drive fixation. Itti's Neuromorphic Vision Toolkit was the first; more recently others, such as Bruce, Draper and Lionelle, and Zhang et al., have provided serious evaluations and public databases for others to use. We add that Draper & Lionelle (2003) laid out the first steps for a principled comparative evaluation. This is very positive, even though the statistical validity of databases and the relevant comparative dimensions remain issues needing more work.
We thank Mazyar Fallah, Heather Jordan, Fred Hamker and an anonymous reviewer for their comments on earlier drafts.