The ability to attend visually to an environment and to combine that information with input from other senses (e.g. auditory information) is a multisensory, evolutionary advantage. Localising a snake, for example, is easier when spatial information is combined across senses (movement in the undergrowth with a ‘rattle’ sound) than when relying on a single modality. Attentional mechanisms are therefore adapted to be integrated so that they can pick out and combine the salient features of any given environment as needed, even without the conscious control of the observer. In the visual domain, space-based attention, as demonstrated by Posner (1980), is an attentional prioritisation of a location. Posner’s well-known cueing paradigm used a central fixation point and peripherally located objects to show that, when an observer fixates the centre of a display screen, exogenous cueing (any non-informative event presented before target onset that captures attention) can manipulate how quickly a spatial location is attended to without direct eye movements. Thus, it is possible to attend to an area covertly, without looking directly at it. Subsequent investigation demonstrated the variability of spatial attention. It was likened to a metaphorical spotlight, able to adapt its attentional spread according to the task (Eriksen and Eriksen, 1974). Thus, local features can either help or hinder depending on their similarity or their salience (LaBerge, 1983). Furthermore, Egly, Driver and Rafal (1994) demonstrated that spatial attention can be object based, since reaction times were quicker for targets within the cued object than for targets in an alternative object. Attention therefore spreads fastest within the cued object. However, beyond a 250 millisecond (msec) timeframe, the benefit is reversed and responses are delayed, a phenomenon known as inhibition of return (IOR; Klein and Ivanoff, 2008).
During an IOR task, participants fixate on an uninformative central location within a visual display, and the reaction time to locate a target appearing peripherally is measured. On some trials the target appears at the cued location and on others it does not. Consequently, it is possible to manipulate which side will be responded to quickest, because IOR enables new stimuli to be prioritised over old through inhibitory tagging (Klein and Ivanoff, 2008). Moving objects also show this tendency (Tipper, Driver and Weaver, 1991). Since object movement does not affect this preference, this study will consider what happens to IOR when the object’s identity and location are ambiguous. Moreover, given this ambiguity, can spatial attention be manipulated cross-modally by the addition of sound? For instance, visual motion perception is altered by the addition of sound when discs approach one another from either end of a visual display and the point of coincidence is occluded. When no additional stimulus is present, the discs are perceived as passing one another (streaming). However, a sound presented at the moment of coincidence changes the percept to discs bouncing off each other (Sekuler, Sekuler and Lau, 1997). This is because the senses are designed to cooperate (Spence et al., 2000). Interestingly, IOR has likewise been demonstrated to be supramodal (Spence et al., 2000). Thus, since IOR can travel with an object (Tipper et al., 1991) and motion perception can be altered by sound, the question remains open of whether IOR could be manipulated to travel with an object during a bounce/stream paradigm. This investigation will seek to identify whether a combination of sound and vision can affect how objects are perceived and attended to. To achieve this, a typical Posner (1980) cueing task will be adapted into the bounce/stream paradigm demonstrated by Sanabria, Correa, Lupianez and Spence (2004).
If IOR tends to follow this stream/bounce perception, it is expected that inhibition should apply to different sides depending on whether or not the sound is present. With the sound, inhibition should return to the starting location and RTs should be slower when cue and target are on the same side of the screen.



An opportunity sample of sixty-eight healthy volunteers, 17–50 years of age (mean age 21 years), took part in the study. All were naive to the purposes of the experiment. Six participants were excluded (see results section). The remaining sixty-two participants consisted of fifty-seven right handed and five left handed individuals. All reported having normal hearing and normal or corrected-to-normal vision.


Stimulus presentation was controlled, and response times (RT) and error rates were recorded, by a Mac computer (Superlab) and keyboard. Headphones delivered the sound.


The display layout can be seen in Figure 1. Display measurements were as follows: discs 1.2 cm (diameter); distance of discs to top and bottom of display 6 cm; distance from discs to occluder 8 cm; visual occluder 9.7 cm × 4.7 cm; distance of occluder to top and bottom of display 6 cm; display box 23.5 cm × 13.5 cm; central fixation 1.1 cm × 1.1 cm.


A 2 × 2 within-subjects design was used. The first factor was block type, with two levels: sound or silent. The second factor was side of target, with two levels: same (same screen side as the cue) or different (opposite screen side to the cue), randomised throughout a block. The dependent variable was RT in msec. The order of the blocks was decided beforehand (by coin toss), and blocks alternated between sound and silent conditions for a total of 4 blocks.


Care was taken to ensure that participants did not see the experiment beforehand, and participants were advised that the experiment was designed to see whether what they heard and saw affected how quickly they reacted to a visual target. All participants were tested in individual booths and given the same information. Participants chose a preferred responding hand in advance and were told to react as quickly and accurately as possible. Clear instruction was given to focus on the fixation point (central black cross) throughout trials and to keep the chosen hand ready at the keyboard. Each participant was given a ten-trial practice to become familiar with the task, in the same block type as had been allocated as the starting block. If more practice was required, more trials were given. Participants wore headphones in all blocks.

For silent trials, a white horizontal rectangle was centrally placed within a black screen. Inside it were two black discs, one on either side of the screen, equidistant from the centre (as shown in figure 1). At the start of every trial the display was shown for 500 msec before a non-informative white circle (cue) flashed for 200 msec within one of the two discs to capture covert attention. Afterwards, the discs moved to the opposite side of the screen, taking 1000 msec (figure 1, T3). Whilst the discs were occluded (figure 1, T2), the fixation point flashed momentarily. On same side trials, the target (white asterisk) subsequently appeared within a disc on the same side of the screen as the cue. On different side trials, the target appeared on the opposite side of the screen. On catch trials no target appeared and, if the trial was correctly rejected, it terminated after 2 seconds. For half of the trials the target appeared immediately after movement stopped, whilst for the other half there was a 200 msec delay. Participants responded with the keyboard space bar, which ended the trial and initiated the next.

For the sound block the same format was used, but during occlusion the sound (click) was delivered for 100 msec at the coincidence point of the discs. Each block contained a mixture of same, different and catch trials, randomly presented. Each block had sixty trials: twenty-four same side and twenty-four different side trials randomised amongst twelve catch trials.
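The block structure described above can be illustrated with a short sketch. It is not the SuperLab configuration used in the study; it only assumes that each sixty-trial block splits into 24 same side, 24 different side and 12 catch trials, and that the four blocks alternate between sound and silent from whichever type the coin toss selected. The function and field names (`build_block`, `build_session`, `kind`) are hypothetical.

```python
import random


def build_block(sound, rng):
    """Return one shuffled block of 60 trial descriptions."""
    # Assumed split: 24 same side + 24 different side + 12 catch = 60 trials.
    trials = (
        [{"sound": sound, "kind": "same"} for _ in range(24)]
        + [{"sound": sound, "kind": "different"} for _ in range(24)]
        + [{"sound": sound, "kind": "catch"} for _ in range(12)]
    )
    rng.shuffle(trials)  # trial types are randomly intermixed within a block
    return trials


def build_session(first_block_sound, rng):
    """Return four blocks alternating sound/silent, starting with the coin-toss result."""
    blocks = []
    for i in range(4):
        sound = first_block_sound if i % 2 == 0 else not first_block_sound
        blocks.append(build_block(sound, rng))
    return blocks


# Example: a session whose coin toss selected a sound block first.
session = build_session(True, random.Random(0))
```

The 0 msec versus 200 msec target delay is left out of the sketch because the text does not say how the two delays were assigned to trial types.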

Figure 1. A schematic illustration of a sound trial (encouraging the perception of a bounce) and a silent trial (encouraging the perception of streaming). The black arrows below the discs indicate the direction of movement (before and after the occlusion) for that trial. T1: onset of motion at the beginning of the trial (after the cue has been deployed); T2: the occlusion point, with the central fixation point flashing; T3: the re-emergence of the discs after coincidence.


The data from six participants were excluded from analysis; four were omitted due to a lack of understanding of the task and two due to a failure to complete it. Trials where the target was not present (catch trials) were placed into four categories and counted. These trials were grouped according to whether or not the sound was present and whether they were accurately ignored (correct rejections) or accidentally responded to (false alarms). The false alarm rate was 4.1%. Trials where the target was present were counted as hits if the response was made between 50 and 1500 msec after target onset; the rest were counted as misses (2.6% misses overall). Target trials were grouped according to whether the target and cue had appeared on the same side or on different sides. For each participant and each combination of sound (present/absent) and side (same/different), the median reaction time (RT) was calculated using only the trials that were hits. Overall, four scores were calculated for each participant. The inter-participant means of median RTs were then calculated for each condition (see figure 2).
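As an illustration of the scoring steps just described, the following sketch classifies trials and computes the per-condition median RT for one participant. It is not the analysis script used in the study. The trial dictionary layout, the use of `None` for "no response", and the treatment of the 50 and 1500 msec bounds as inclusive are all assumptions made for the example.

```python
from statistics import median

HIT_WINDOW_MS = (50, 1500)  # response window stated in the text (bounds assumed inclusive)


def classify(trial):
    """Label one trial as hit, miss, false_alarm or correct_rejection."""
    rt = trial["rt"]  # None means that no response was made
    if trial["side"] == "catch":
        return "correct_rejection" if rt is None else "false_alarm"
    if rt is not None and HIT_WINDOW_MS[0] <= rt <= HIT_WINDOW_MS[1]:
        return "hit"
    return "miss"


def median_rts(trials):
    """Return the median hit RT per (sound, side) cell for one participant."""
    cells = {}
    for trial in trials:
        if classify(trial) == "hit":
            cells.setdefault((trial["sound"], trial["side"]), []).append(trial["rt"])
    return {cell: median(rts) for cell, rts in cells.items()}


# Made-up trials for illustration only; these are not data from the study.
example_trials = [
    {"sound": True, "side": "same", "rt": 400},
    {"sound": True, "side": "same", "rt": 500},
    {"sound": True, "side": "same", "rt": 1800},   # too slow: counted as a miss
    {"sound": False, "side": "different", "rt": 350},
    {"sound": False, "side": "catch", "rt": None},  # correct rejection
    {"sound": False, "side": "catch", "rt": 600},   # false alarm
]
participant_medians = median_rts(example_trials)
```

Averaging each cell of `participant_medians` across participants would give the inter-participant means of median RTs.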

Figure 2 Inter-participant means of median reaction times to locate a same screen side (dark grey) versus different screen side (light grey) cued target by condition (sound versus silent). Silent condition standard deviations were 68.5 (same), 65.7 (different).

A two-way (sound by side) within-subjects ANOVA was conducted with mean of median RT as the dependent variable. There was a significant main effect of condition (sound vs silent), F(1, 61) = 6.65, p = .013, and a significant main effect of side (same vs different), F(1, 61) = 134.98, p < .001. There was also a significant sound by side interaction, F(1, 61) = 51.13, p < .001. Planned pairwise comparisons (t-tests) revealed that participants’ RTs were significantly faster for different side trials than for same side trials in the silent condition, t(61) = 12.30, p < .001, showing no object-based IOR in the silent condition, as responses were quickest for the cued object (in its new location). Observers were also significantly faster to react to different side trials than to same side trials in the sound condition, t(61) = 5.24, p < .001, evidencing IOR for the object after a perceptual bounce facilitated by sound. Thus, sound may be able to influence IOR in visual perception, but without conclusive evidence of object-based IOR in the silent condition, this cannot be inferred.
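For readers who want to see how these tests relate to the four median RTs per participant, a minimal sketch follows. It relies on a general property of 2 × 2 within-subjects designs: each single-degree-of-freedom effect can be tested with a one-sample t-test on a per-participant contrast score, and the corresponding F(1, n − 1) equals t squared. The sketch is not the software used for the reported analysis, and the dictionary keys (`silent_same` and so on) are hypothetical names.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(scores):
    """t statistic (df = n - 1) testing whether the mean of the scores differs from zero."""
    n = len(scores)
    return mean(scores) / (stdev(scores) / sqrt(n))


def side_effect_scores(cells, condition):
    """Per-participant same-minus-different RT within one condition (paired comparison)."""
    return [c[condition + "_same"] - c[condition + "_diff"] for c in cells]


def interaction_scores(cells):
    """Per-participant difference of differences: silent side effect minus sound side effect."""
    silent = side_effect_scores(cells, "silent")
    sound = side_effect_scores(cells, "sound")
    return [a - b for a, b in zip(silent, sound)]


def interaction_f(cells):
    """F(1, n - 1) for the sound by side interaction, obtained as t squared."""
    return one_sample_t(interaction_scores(cells)) ** 2
```

Here `cells` would be a list with one dictionary of four median RTs per participant. Passing the output of `side_effect_scores` for a single condition to `one_sample_t` gives the paired t-test for that condition.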


The aim of the study was to investigate whether the addition of sound could alter visual perception. To achieve this, the Posner (1980) style cuing task was modified with a bounce/stream paradigm (Sanabria et al., 2004) in an attempt to influence the perception of IOR. Contrary to expectation, the silent condition yielded no evidence of IOR consistent with a stream, as reported by Sanabria et al. (2004): object-based IOR should have travelled with the object, creating an advantage for the side originally cued. This would have been consistent with object-based IOR as evidenced by Tipper et al. (1991). Rather, observers were faster on the different side (with the cued object). Interestingly, for the sound condition, IOR was demonstrated (in line with the perception of a bounce), suggesting that IOR can be influenced by the addition of sound. Unfortunately, without coherence between conditions this cannot be concluded. Therefore, further consideration of the results is required.

Given that IOR is a known component of spatial attention that can be produced in most attentional experiments (Klein and Ivanoff, 2008), that it still applies even when objects move (Tipper et al., 1991) and that it can be shown between different sensory modalities (Spence et al., 2000), it is likely that methodological issues account for the findings. One possibility is eye position. The central fixation point is an essential element in all covert attention tasks and, whilst the importance of central fixation was conveyed, eye movements were not tracked and therefore fixation cannot be relied upon. Posner and Cohen (1984), for example, demonstrated that IOR requires that attention is drawn back to the fixation point after cuing. Consequently, if the eyes move within the trial, the inhibitory effect could remain at the environmental position where the cue had occurred. This is consistent with both findings, and with the reduced effect (figure 2) evidenced in the sound compared to the silent condition since, regardless of the perceptual set, the same side (cue to target) would be inhibited if this was the last place of energy. Thus, in the sound condition, the findings may not have arisen from multisensory visual perception, but from a failure to remain fixated.

Equally, Tipper et al. (1991) evidenced that IOR in static displays has an additional inhibitory component compared to moving displays. Thus, with moving displays, covert attention may be more easily disrupted, since object-based attention requires additional mental resources (Chen, 2012). For example, in the moving paradigm of Tipper et al. (1991), the objects rotated around the fixation point, remaining equidistant from fixation at all times. Moreover, at no point were they occluded. In the current paradigm, however, the objects were occluded behind the fixation area. Therefore, if the inhibitory tagging was disrupted during the period of occlusion, IOR may have diminished. This is possible because ambiguity of movement has been highlighted as a potential pitfall in moving displays (Reppa, Schmidt and Leek, 2012). Tipper, Brehaut and Driver (1990), however, evidenced object-based IOR with occluding columns. Therefore, compared to Tipper et al. (1991), interference may have been increased in this instance by the proximity of the objects to fixation, which changed substantially throughout the task; the objects even made contact with the object in which fixation was presented.

Importantly, the substantial changes in movement could have altered the attentional spread beyond fixation to include the incoming discs (Eriksen and Eriksen, 1974), especially since the similarity of surrounding features is known to interfere (LaBerge, 1983) and the distance between objects is also crucial (Franconeri, Jonathan and Scimeca, 2010). Thus, without any extra perceptual cues available, and coupled with the objects’ similarity, it is conceivable that these factors compromised covert attention more markedly in the silent block than in the sound block. Moreover, the task combination may have diminished the inhibitory tagging (Egly et al., 1994) through interference (Franconeri et al., 2010). However, without tracking eye movements, this claim cannot be substantiated.

Evidence of eye movements was one of the criticisms raised in relation to the supramodal paradigm of Spence et al. (2000); the use of eye tracking with a subset of participants in a further replication of this study would therefore be beneficial. More importantly, with eye movements accounted for, the other methodological issues, such as ambiguity of movement across fixation, could be considered. However, as demonstrated in other cross-modal investigations (Kennett, Spence and Driver, 2002), visual restriction can be an important feature in studying multisensory perception. Indeed, for the perception of a stream/bounce effect, the visual restriction is warranted. Thus, it may be possible simply to relocate the ambiguity to a location away from fixation. The movement would still need to be directed across both left and right visual hemifields (Sereno and Kosslyn, 1991), so as not to increase the difficulty of the task. However, the space between objects and fixation could be changed to reduce the likelihood of interference (Franconeri et al., 2010). For example, Tipper, Jordan and Weaver (1999) evidenced scene-based inhibition at multiple locations on the screen; thus, if the display were moved equidistantly above fixation, it may be possible to eradicate, or at least re-evaluate, the aforementioned possible limitations.


In conclusion, IOR in a cross-modal attention paradigm such as this can be difficult to demonstrate. However, the results show promise regarding the cross-modal influence of sound upon spatial attention. Given that the display used has important methodological constraints, a change in the spatial qualities of the display is needed to rule out interference between fixation and covert attention. Equally, since eye movements were not ruled out, future investigations would also need to factor eye tracking into the methodology. With these alterations, and the eventual observation of object-based IOR, it would be possible to advance the understanding of how sound can affect perception and attention. In addition, it would facilitate a better understanding of the cross-modal elements of spatial attention more generally.