How do second language learners go about their listening when they view captioned videos? Replication studies of Taylor (2005), Winke et al. (2013) and Rodgers and Webb (2017)

Michael Yeldham

doi:10.1017/S0261444823000228

Published online by Cambridge University Press: 13 July 2023

Michael Yeldham*, Jilin University, Changchun, China
*Email: mayeldham@hotmail.com

Article contents

Abstract
Introduction
The three studies proposed for replication
Taylor, G. (2005). Perceived processing strategies of students watching captioned video. Foreign Language Annals, 38(3), 422–427
Winke, P., Gass, S., & Sydorenko, T. (2013). Factors influencing the use of captions by foreign language learners: An eye-tracking study. Modern Language Journal, 97(1), 254–275
Rodgers, M. P. H., & Webb, S. (2017). The effects of captions on EFL learners’ comprehension of English-language television programs. CALICO Journal, 34(1), 20–38
Conclusion
Footnotes
References

Rights & Permissions

Abstract

In recent years, interest in using captioned videos for second language learning has grown immensely, partly owing to the explosion of available materials and the rapid increase in viewing platforms. The captioning affords many learners access to authentic videos ordinarily out of their reach, and teachers often employ the videos to help improve their learners’ listening. However, there is the view that learners mainly just read the captions, and that the viewing largely enhances their reading skills, instead. There is an increasing amount of research investigating this issue, much of which needs to be further verified through replication. This article outlines how three key relevant studies may be replicated, with an emphasis on examining the impact of the captioned viewing on the learners’ listening. Two of the studies, by Taylor (2005) and Winke et al. (2013), examine viewers’ processing strategies, which can include the use of the audio, caption and visual modalities. The other study, by Rodgers and Webb (2017), examines how viewing over the long term impacts learners’ comprehension.

Type: Replication Research
Information: Language Teaching, First View, pp. 1–13

DOI: https://doi.org/10.1017/S0261444823000228
Copyright: Copyright © The Author(s), 2023. Published by Cambridge University Press

1. Introduction

Captioning (or same language subtitling) is commonly used to scaffold video viewing for second language (L2) learners, with the captioning affording many such learners access to authentic videos ordinarily out of their reach. Apart from their motivating qualities, and potential for increasing learners’ L2 knowledge (especially vocabulary learning),¹ the videos are shown mainly to improve the learners’ listening abilities. However, there is concern that the learners may largely just read the captions at the expense of listening to the speakers (Borras & Lafayette, 1994; Vandergrift & Cross, 2014), so that the viewing does little more than bolster their reading skills.

Interest in captioned viewing for L2 learning has grown immensely, partly owing to the vast amounts of captioned materials available, which can be viewed on a range of devices, both inside and outside the classroom (Montero Perez, 2022). Its use in the classroom still often needs to shake off the tag of mainly being a leisure activity (Vanderplank, 2019), although such resistance to its use is likely to lessen if research continues to show its potential benefits for language learning. Replication studies are part of this effort. Such studies can provide more insight into the validity, reliability and credibility of the original study, increase its confirmatory power, contribute to a better understanding of it, and help the field generalize from it or reach further conclusions about it (McManus, 2021; Porte, 2013; Porte & McManus, 2019). Consequently, in this article, three studies are proposed for replication that, taken together, examine how learners tend to process captioned videos and also whether such viewing helps to develop learners’ listening.

Listening comprehension, as it is commonly defined (Buck, 2001; Field, 2019), involves processing the speaker's audio message, together with accompanying contextual information, often in the form of visual cues; this is distinct from reading comprehension, which involves processing written information. Much of the research conducted on L2 captioned video viewing has examined its short-term effectiveness in enhancing learners’ comprehension of the usually limited number of short video clips shown in the particular study (e.g., Hayati & Mohmedi, 2011; Huang & Eskey, 1999–2000). However, despite findings that the captions commonly do improve comprehension, it has been unclear whether this was a function of the learners’ improved listening ability or mainly just because of the assistance offered by reading the captions when viewing those particular videos. As Vanderplank (2016) argues in reference to whether captioned viewing improves learners’ listening:

The question [ … ] is not so much whether a learner understands better watching a video clip or TV programme with captions rather than without, but whether watching captioned audiovisual material can help to make a learner a better listener (p. 76).

[ … ] whether watching captioned programmes and films over the longer period helps develop the knowledge and skills required to follow and understand a wide variety of speech in the foreign language without captions (pp. 55, 56).

Only a small number of studies have examined whether learners actually listen while viewing captioned videos. Most of these have involved showing the learners captioned videos then eliciting, through self-report methods, their perceptions of how they processed the videos (e.g., Chai & Erlam, 2008; Sydorenko, 2010; Taylor, 2005). Other studies have examined the learners longitudinally, looking for any longer-term impact of the viewing, including whether the learners’ listening, and/or particular aspects of their listening, tends to benefit (Charles, 2017; Mitterer & McQueen, 2009; Vanderplank, 1988, 2019). However, as Yeldham (2018) pointed out in his overview of such studies, results have sometimes been contradictory, which has often been seen as a function of differing design aspects between the studies and/or possible shortcomings in some of the methods and procedures used in them.

2. The three studies proposed for replication

The first of the three studies proposed for replication, by Taylor (2005), directly examines the viewers’ balance of listening, caption reading and use of visuals. It garners learners’ perceptions, after a video-viewing session, of how they tended to process the video, and also uses comprehension tests to compare the viewing performance of more-experienced and less-experienced L2 learners in caption and no-caption conditions. The second study, by Winke et al. (2013), relies largely on eye-tracking methodology to provide insight into viewers’ real-time processing of captioned video, showing directly how the captioning may affect this processing through the video. The third study, an experimental study of learners’ captioned viewing, by Rodgers and Webb (2017), examines how viewing over the longer term can influence the learners’ comprehension.

The three studies are chosen for replication because they are quite influential (based on the large number of citations in Google Scholar), and have helped to increase, in their own way, our understanding of L2 captioned video use. Replications of the studies could potentially further help to address the key questions of (1) whether learners listen, and to what extent, during captioned viewing, (2) what factors may influence this, and (3) whether learners’ listening benefits from the captioned viewing over time.

3. Taylor, G. (2005). Perceived processing strategies of students watching captioned video. Foreign Language Annals, 38(3), 422–427

3.1 Background to the study

As mentioned, only a small number of studies have examined how L2 learners process captioned video in terms of the balance of the three main modalities of audio, captions and visuals, with most of the studies being cross-sectional ones. Most of these cross-sectional studies have focused solely on either less-proficient viewers (e.g., Caimi, 2006; Sydorenko, 2010), or proficient ones (e.g., Chai & Erlam, 2008).

Many of the studies have found that, consistent with cognitive load theory (Mayer, 2009; Sweller, 1988), less-proficient viewers or viewers watching challenging videos often find their attentional resources overloaded by the burden of information from all three modalities. Consequently, many such viewers resort to reading the more salient captions in the video (Caimi, 2006; Pujolà, 2002; Sydorenko, 2010), at the expense of listening to the speakers' more transitory and often less clear aural message (owing to connected speech, and so on).

By comparison, studies have also found that proficient learners tend to make more effective use of the aural/caption/visual information, serving to enhance their comprehension (Pujolà, 2002; Vanderplank, 1988, 2016). This is consistent with Paivio's (2007) dual processing theory, which contends that efficiently combining the verbal and non-verbal information facilitates comprehension.

However, results have sometimes been inconsistent, as the tendency to rely on reading the captions has also been found among proficient learners (Chai & Erlam, 2008), not just lower-proficiency ones, prompting the need for further investigation. Taylor (2005) is one of the few studies that has examined learners at two different ability levels. It is also one of the very few studies where the primary concern has been to examine the viewers’ use of listening/reading/visuals (other studies have often made this a secondary concern behind other factors, such as vocabulary learning). Taylor's (2005) study also includes both qualitative and quantitative data, to triangulate its findings; replication would give the opportunity to modify key aspects of the study and re-examine its major findings.

3.2 The original study

Taylor (2005) investigated US university second-semester Spanish foreign language (FL) students’ use of captioned video. There were two components of the study. One, using learner reflective written protocols, examined: (a) the processing strategies learners used while watching a captioned video; and (b) whether these strategies differed between two learner groups based on their years of experience in learning the L2, with one group having one year of experience (first-year students) and the other three to four years (referred to by Taylor as 3-year students, but here as third-year students).

The other component of the study used language tests to examine the interaction between viewing type (caption [treatment] vs no-caption [control]) and years of L2 experience (first-year [less experience] vs third-year [more experience]). Taylor's decision to base the groupings on years of experience learning the L2 (rather than proficiency level) was because of his view that those “exposed to more opportunities for reading and listening in the L2 [ … ] would have developed more and better strategies for coping with the input” (p. 424).

The students watched a 10-minute documentary about the history and consumption of Spanish foods, taken from a first-year Spanish textbook, Puentes; the documentary also included four native speakers each talking about making different Spanish foods. The video had been piloted and considered suitable for the participants’ level. After viewing the video, two tests were used to assess the participants’ comprehension of it: a free recall test, then a 15-item multiple choice (MC) test. Then data concerning the caption group's viewing strategies were gathered through the learner reflective written protocols, where “they explained how they utilized the captions, audio, and video in their attempts to understand (strategy use)” (Taylor, 2005, p. 424). Thirty-five students completed these reflective protocols: 17 first-year and 18 third-year students.

Results from the reflective protocols were reported as follows. Firstly, six of the 17 first-year students stated that they found the captions distracting or confusing, or that they had difficulty using the three modalities of image, sound and captions simultaneously. By contrast, only two of the 18 third-year students reported these difficulties. Secondly, more third-year students (nine) than first-year students (four) reported being able to utilize all three modalities in their effort to understand the video. Finally, overall, most of the students (26 of 35) reported having at least attempted to listen to the video – although with little difference between the groups (13 for both first-year and third-year students).

In assessing the learners’ comprehension of the video's content, for the multiple-choice (MC) test, use of t-tests found there was no difference between the caption and no-caption group for either the first-year or third-year students. The t-test results for the free recall also found no difference between the caption group (N = 18) and no-caption group (N = 12) for the third-year students. However, for the first-year students on this test, surprisingly the no-caption group (N = 24) significantly outperformed the caption group (N = 17). This finding lent some support to the first-year students’ perceptions in their reflective written protocols, that the captioned viewing tended to overload many of them cognitively – thus impeding their comprehension of the video.

3.3 Approach to the replication of Taylor (2005)

Close and approximate replications aim to examine how intentionally changing one variable (close replication), or more than one variable (approximate replication), may impact the findings of the original study (McManus, 2021; Porte & McManus, 2019). To observe the effect of these changes and enable sound comparison with the original study, though, both approaches require maintaining the other aspects of the study as faithfully as possible to the original study. However, Taylor's study tended to lack rigor in some areas of its design and analysis, and some key aspects of its methodology were not reported, making such replications impossible to execute correctly.

A conceptual replication of the study, though, would be feasible. Such replication commonly involves addressing the same research questions and/or theoretical basis as the original study but makes changes to the methodology, largely to provide better insights into the original findings (Porte & McManus, 2019). Consequently, I would suggest the methodological changes summarized below. It may be advisable, though, to make these changes in successive replications, perhaps in the order in which they are presented here. This would serve to isolate the impact of each change in comparing the results from the modification with the results of the original study.

One such change would be to employ a standardized language test to distinguish the two learner groupings based on proficiency level rather than on their years of studying the L2, the criterion used in the original study. It would be essential to include a component in the test to assess listening – and also most probably one to assess reading and, possibly, reading speed² to determine the learners’ likely abilities to utilize the captions (Chai & Erlam, 2008; Vanderplank, 2016). Basing the learner groups on years of study means there could have been potentially large proficiency differences within these groups, which may have influenced the results. Learners’ processing strategies, in terms of their success or otherwise, are often a function of their available attentional resources (Sweller, 1988), and proficiency level would play a major role in that. Thus, this modification could test the veracity of the original findings, from both its experimental and qualitative components.

Another inclusion in a replication would be to use separate videos tailored to the two learner groups’ proficiency levels,³ rather than the single video used in the original study. This is because showing the one video to two groups of differing proficiency really just amounts to the lower-proficiency group viewing a video that is difficult for them, and the higher-proficiency group viewing a video that is comparatively easy for them. More valid results would be derived from having each group view a similar type of video, but with the difficulty tailored for learners at their level. That would facilitate examination of the learners’ processing strategies based on their proficiency level (see the benefits, above), rather than on text difficulty.⁴ The original study was concerned with the impact of listener ability rather than text difficulty on viewing outcome, so this modification in the replication study could further test the credibility of the original study's results. It would also make the study a more valuable addition to captioned viewing theory, as amongst researchers, proficiency level has commonly been considered a more salient feature than text difficulty (e.g., Chai & Erlam, 2008; Pujolà, 2002).

Note that with this approach, the learners would either need to be allocated to the groups based on the results of the standardized test, or be drawn from different study levels (e.g., first year vs second year students) with their differences in proficiency confirmed by the test. Ideally here, the sample sizes of the four learner groups could be increased (in the original study one group only contained 12 learners) to enhance generalizability of the results.

A conceptual replication could also modify the qualitative aspect of the study. In the original study, qualitative insights into the learners’ processing strategies relied on the learners’ reflective written protocols. Researchers have often acknowledged how some learners find it difficult to be aware of and/or explain their cognitive processes (Winke et al., 2013). Also, the amount of listening going on amongst the learners was fairly loosely assessed in the original study (i.e., it was based on the number of learners who said they attempted to listen). Consequently, it would be valuable to employ a more objective measure to gather this data – which in turn could serve to examine the robustness of the original study's qualitative results.

One such method is paused transcription (Field, 2008; Yeldham, 2017), which can add insight into whether the viewer is listening to the speakers or reading the captions – and quantify this information. With this method, the video is intermittently paused, and at each pause a black screen appears for a long enough period for the learners to transcribe the last five or six words they heard immediately prior to the pause: this is when the words are still active in working memory (Field, 2008). In preparation for the task, the researcher arranges the captions so that at approximately half of these pauses, one or more of the words on the caption purposely mismatches those that are spoken (for example, like this without realizing it [spoken]/like this and not realize it [caption]).⁵ Therefore, what the learners write down tends to reflect whether they were listening to the speaker or reading the captions. Note that the method works most effectively when the video is a suitable difficulty level for the learners (for an explanation, see Yeldham, 2018, pp. 382, 383). Consequently, the method would be ideal to use in a conceptual replication that employs different texts tailored to each learner proficiency grouping, rather than just one text for all the learners as in Taylor's study.
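As a rough illustration of how transcriptions at the mismatch pauses might be scored, the Python sketch below labels each response according to whether it is closer to the spoken wording or the caption wording. The function names, the word-level similarity measure (the standard library's difflib.SequenceMatcher) and the 0.1 margin for an "unclear" label are assumptions made for this illustration; they are not part of the procedure described by Field (2008) or Yeldham (2017), and any threshold would need piloting.

```python
from difflib import SequenceMatcher


def _words(text):
    """Lower-case a string and split it into words, dropping punctuation."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() or ch == "'" else " "
        for ch in text.lower()
    )
    return cleaned.split()


def classify_response(response, spoken, caption, margin=0.1):
    """Label a transcription as closer to the 'audio' or 'caption' wording.

    Returns 'unclear' when the two similarity scores differ by less than
    `margin` (a hypothetical threshold, chosen only for illustration).
    """
    resp = _words(response)
    sim_audio = SequenceMatcher(None, resp, _words(spoken)).ratio()
    sim_caption = SequenceMatcher(None, resp, _words(caption)).ratio()
    if abs(sim_audio - sim_caption) < margin:
        return "unclear"
    return "audio" if sim_audio > sim_caption else "caption"


# Mismatch pair taken from the example in the text.
spoken = "like this without realizing it"
caption = "like this and not realize it"

for response in ("like this without realizing it", "like this and not realize it"):
    print(response, "->", classify_response(response, spoken, caption))
```

Counting the "audio" and "caption" labels across all mismatch pauses would then give one possible per-learner index of listening versus caption reading.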

Paused transcription has its drawbacks, though. Some researchers may consider it too invasive on the viewing process, and it does not examine the learners’ use of the visual modality (although this modality is somewhat peripheral to the more central issue of listening vs reading in captioned viewing). Consequently, an alternative to this paused transcription, or perhaps an adjunct to it, would be to gather quantifiable self-report data from the learners through closed-ended questionnaire questions. As used by Sydorenko (2010), the learners could provide rating scale rankings of their perceived (1) attention to, and (2) benefits from, each of the (a) audio, (b) caption and (c) visual modalities⁶ of the video. This could be followed by a question asking the learners to explain their responses. The scaffolding of this structured approach could probably help to elicit useful data from the learners.

Finally, as mentioned at the beginning of this section, the original study had some deficiencies and at some stage of the replication process, the researcher ought to consider addressing some of those that remain, which are summarized below. Addressing these aspects could improve the study's reliability – and also help facilitate any further future replication of the study.

First, the researcher would need to establish and report key aspects of the two comprehension tests used in the study, especially their reliability.⁷ Reliability of the MC test could be established through an item-total correlation analysis of learners’ scores from either a pilot study or from the actual study. For the free recall test, it would be essential to establish and report how the protocols are marked – whether they are based on idea units, t-units, or some other unit (e.g., see Foster et al., 2000). It would also be important to calculate and report the inter-rater reliability, and perhaps also intra-rater reliability, of the marking for this test.
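To make the item-total analysis concrete, the following Python sketch computes a corrected item-total correlation for each MC item, that is, the Pearson correlation between scores on the item and the learners' totals on the remaining items. The score matrix is invented for illustration (six learners, four items), the function names are hypothetical, and the "corrected" variant is one common choice rather than something specified in the original study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (nan if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def corrected_item_total(scores):
    """Correlate each item with the total of the *other* items.

    `scores` is a list of rows, one per learner, with 1 (correct) or
    0 (incorrect) for each item.
    """
    n_items = len(scores[0])
    correlations = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]
        correlations.append(pearson(item, rest))
    return correlations


# Hypothetical pilot data: the last item behaves opposite to the others.
pilot_scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

for index, r in enumerate(corrected_item_total(pilot_scores), start=1):
    print(f"item {index}: r = {r:.2f}")
```

An item with a low or negative correlation, such as the last item in this invented data set, would be a candidate for revision or removal before the main study.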

In analyzing the data from these tests, adding the use of two-way ANOVA (rather than just a series of t-tests, as in the original study) would add rigor to the analysis. The two-way ANOVA could test for an interaction effect between the independent variables of viewing type (caption vs no-caption) and L2 experience/proficiency (less vs more), on the dependent variable of comprehension. Also, the data analysis would be strengthened by adding effect size measures.
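The logic of the two-way ANOVA can be sketched in a few lines of Python. The example below partitions the sums of squares for a 2 × 2 design (viewing type by proficiency) and reports F ratios and partial eta squared as an effect size. It rests on several assumptions: the scores are invented, the function name is hypothetical, and the calculation is valid only for equal cell sizes, whereas a real replication with unequal groups would use an established statistics package and report p-values, which are omitted here.

```python
def mean(values):
    return sum(values) / len(values)


def two_way_anova_balanced(cells):
    """Two-way ANOVA for a balanced design.

    `cells` maps (viewing, proficiency) -> list of scores, with the same
    number of scores in every cell. Returns SS, df, F and partial eta
    squared for both main effects and the interaction.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if any(len(v) != n for v in cells.values()):
        raise ValueError("this sketch assumes equal cell sizes")

    grand = mean([x for v in cells.values() for x in v])
    a_mean = {a: mean([x for b in b_levels for x in cells[(a, b)]]) for a in a_levels}
    b_mean = {b: mean([x for a in a_levels for x in cells[(a, b)]]) for b in b_levels}

    # Partition the between-cell variation into A, B and A x B.
    ss_a = n * len(b_levels) * sum((m - grand) ** 2 for m in a_mean.values())
    ss_b = n * len(a_levels) * sum((m - grand) ** 2 for m in b_mean.values())
    ss_cells = n * sum((mean(v) - grand) ** 2 for v in cells.values())
    ss_ab = ss_cells - ss_a - ss_b
    ss_within = sum((x - mean(v)) ** 2 for v in cells.values() for x in v)

    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    df_within = len(cells) * (n - 1)
    ms_within = ss_within / df_within

    table = {}
    for name, ss, df in (
        ("viewing", ss_a, df_a),
        ("proficiency", ss_b, df_b),
        ("interaction", ss_ab, df_a * df_b),
    ):
        table[name] = {
            "SS": ss,
            "df": df,
            "F": (ss / df) / ms_within,
            "partial_eta_sq": ss / (ss + ss_within),
        }
    return table


# Invented comprehension scores, two learners per cell.
scores = {
    ("caption", "less"): [4, 6],
    ("caption", "more"): [8, 10],
    ("no-caption", "less"): [6, 8],
    ("no-caption", "more"): [7, 9],
}

for effect, stats in two_way_anova_balanced(scores).items():
    print(effect, stats)
```

The interaction row is the one of interest here, since it indicates whether any effect of captioning differs between the less and more proficient groups.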

4. Winke, P., Gass, S., & Sydorenko, T. (2013). Factors influencing the use of captions by foreign language learners: An eye-tracking study. Modern Language Journal, 97(1), 254–275

4.1 Background to the study

The background to Winke et al. (2013) is similar to that outlined earlier in discussing the background for Taylor (2005). As with many of those cross-sectional studies mentioned earlier (e.g., Caimi, 2006; Sydorenko, 2010), the study by Winke et al. (2013) also examined lower-proficiency L2 learners’ captioned viewing strategies. However, unlike the other studies, which relied mainly on learner self-report methods to obtain a generalized view of the learners’ processing, Winke et al. (2013) was more concerned with examining such processing more precisely, in real time as the learner progressed through the video. Also, while Winke et al. (2013) examined “what learners attend to when watching foreign language videos with captions” (p. 259), the focus was chiefly on the learners’ use of captions. Consequently, the researchers used eye-tracking as their principal data gathering method (supplemented by a post-video interview), pointing out that eye-tracking provided “a more direct measure” (p. 259) than self-report. Other eye-tracking studies have been conducted with L2 viewers, but have mainly been concerned with issues such as vocabulary learning from the captions, and the effect of different caption types on learner uptake (e.g., Montero Perez et al., 2015). Winke et al. (2013) was published in Modern Language Journal, and with a few modifications, replicating it could examine the veracity of its main results, and also potentially provide further useful fine-grained insight into how viewers strategically integrate and/or alternate their use of listening, reading and visuals through the duration of a captioned video.

4.2 The original study

Winke et al. (2013) examined the video processing strategies employed by learners of different FLs: Arabic, Chinese, Spanish and Russian. The main rationale for including these various FLs was that given their different levels of distance from English (including their scripts), the learners of each FL may employ different strategies to process the video.

The participants were 33 second-year US university students. They were in their fourth semester of study of their FL; the American Council on the Teaching of Foreign Languages (ACTFL) guidelines consider such learners to be low-intermediate (Arabic, Chinese) or intermediate (Russian, Spanish) level. The numbers learning each FL were Arabic (N = 7), Chinese (N = 7), Russian (N = 8) and Spanish (N = 10).

In the study, each learner viewed two 2–3-minute nature documentary clips, which had been dubbed and captioned according to the FL they were studying. One clip was of salmon migration, and the other of a bear protecting its cub; the former topic was seen as common knowledge, so was considered to have greater content familiarity for the viewers than the latter.⁸ Four instructors had considered these clips to be at a suitable difficulty level for the participants.

The eye-tracking technology at the time of the study required the learner's head to be stabilized via a chin rest for the tracker to follow their gaze. To analyze caption-viewing time, the researchers used the amount of time spent (i.e., the sum of fixation durations) on the captioned area on the screen, analyzing data from the periods when the captions were on the screen. They employed this approach to make the results comparable with past first language (L1) eye-tracking studies that had investigated the amount of time spent by viewers reading subtitles.
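The caption-viewing measure described above can be approximated as follows. This Python sketch sums the durations of fixations that fall inside a rectangular caption area while a caption is displayed, then divides by the total caption display time. The data structure, the pixel coordinates, the rule that a fixation counts if it starts during a caption interval, and the choice of denominator are all assumptions made for illustration; the original study's exact computation may differ.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    start_ms: int      # onset of the fixation relative to video start
    duration_ms: int   # length of the fixation
    x: float           # horizontal gaze position in pixels
    y: float           # vertical gaze position in pixels


def caption_dwell_proportion(fixations, caption_box, caption_intervals):
    """Share of caption display time spent fixating the caption area.

    `caption_box` is (left, top, right, bottom) in pixels and
    `caption_intervals` is a list of (start_ms, end_ms) display periods.
    """
    left, top, right, bottom = caption_box
    on_screen_ms = sum(end - start for start, end in caption_intervals)
    dwell_ms = 0
    for fix in fixations:
        in_box = left <= fix.x <= right and top <= fix.y <= bottom
        captioned = any(start <= fix.start_ms < end for start, end in caption_intervals)
        if in_box and captioned:
            dwell_ms += fix.duration_ms
    return dwell_ms / on_screen_ms if on_screen_ms else 0.0


# Hypothetical recording: a 1280 x 720 video with captions in the bottom strip.
caption_box = (0, 600, 1280, 720)
caption_intervals = [(0, 2000), (3000, 5000)]
fixations = [
    Fixation(100, 300, 640, 650),   # on the caption
    Fixation(500, 400, 640, 300),   # on the image area
    Fixation(1200, 500, 300, 660),  # on the caption
    Fixation(2500, 300, 640, 650),  # caption area, but no caption displayed
    Fixation(3200, 600, 700, 640),  # on the caption
]

print(caption_dwell_proportion(fixations, caption_box, caption_intervals))
```

Because this measure counts any fixation in the caption area, it cannot by itself distinguish reading from merely looking at the caption.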

The most important finding from the eye tracking data was that the viewers fixated on the captioned area 68% of the time. This addressed research question (RQ) 1, which asked how often FL learners read captions. RQ2 asked how this applied to the specific FL, and RQ3, how it varied with content familiarity. In relation to RQ2, some tendency was found for a greater reliance on captions the wider the distance of the L2 and its script from English. In particular, the Arabic learners spent more time reading the captions than the Russian and Spanish learners. Also, in relation to RQ3, the Chinese learners relied significantly less on the captions when viewing the familiar content video than the unfamiliar content one, with no such significant difference found for the other three FL groups. Overall, though, the study found a complex effect of various factors on viewers’ processing strategies and caption use, including the L2 and L2 script, content familiarity, variation in movement in the videos, and learner proficiency.

The post-video interview focused mainly on the learner's opinion of the videos, and their use of the captions, relying on insights into the learner's use of the other modalities to emerge as they discussed factors influencing their caption use. While many learners were unable to explain their behaviors, some, though, revealed: (1) that their use of modality could vary through a text; (2) that during challenging sections of the text they found reading easier than listening for deciphering meaning; and (3) that reading also helped them segment words from the stream of speech. Some learners, though, especially learners of Chinese, tended to look at the captions in the hope of being able to understand the video, but sometimes found these difficult to read.

4.3 Approach to the replication of Winke et al. (2013)

In the research, Winke et al. meticulously outlined their methodology, potentially accommodating a close or approximate replication of the study. However, an obstacle that might deter researchers from choosing to replicate it was its involvement of four different learner groups each studying a different L2. Consequently, to make things easier for the researcher, an optional initial change for a replication study would be to reduce the number of L2s involved.

One particular result from the original study worth further investigating, through a close/approximate replication, was the tendency for learners to rely more heavily on reading the captions the greater the distance of their L2 and its script from English (found in addressing RQ2). Given the leeway to use fewer L2s than in the original study, this could be investigated using (a minimum of) two L2s, both of which would differ from those in the original study. In other words, this replication could examine two groups of L1 English speakers, with one group being learners of an L2 that shares various features with English, including its Latin script (e.g., French or German); and a second group being learners of an L2 that differs greatly from English, and uses a non-Latin script (e.g., Japanese or Thai). Outcomes similar to those in the original study could then endorse the findings of that study.

Note that in this replication, to examine the effect of these differing L2s, other factors should be maintained as closely as possible to the original study. As in the original study, one such factor is for the participants to be students in their fourth semester of studying the L2. Winke et al., however, noted there was some inconsistency among their fourth semester students’ proficiency levels, wondering at times through the study whether, aside from their L2, this variation in proficiency influenced their viewing behaviors. Consequently, a second replication could be undertaken, in which those participating were limited to a common proficiency level (perhaps around low-intermediate level), based on the results of a multilingual language test. This homogeneity amongst the participants could enhance the reliability of this study's results, further testing the credibility of the results in the original study.

A conceptual replication of the study is also strongly advised. Chiefly, this would involve modifying how the eye-tracking data was analyzed. Rather than basing it on the overall time learners fixate on the captioned area, as in the original study, a replication could focus on learners’ fixations and their duration. Winke et al. (2013, p. 296) acknowledged their approach did not capture the “individuals’ depth or level of caption processing.” They pointed out that this alternative, fixation-based approach is widely used in eye-tracking research, and discussed various possibilities involved in this more nuanced use of the data, possibilities that are relevant to a replication study.⁹ In particular, they pondered whether some caption fixations may not actually constitute reading. This fits in with Bisson et al.'s (2014) observation that even when viewers are looking at a caption it does not guarantee they are reading it – that often a caption may simply attract the viewer's gaze because of its sudden appearance or salience on the screen. (It could be that in these situations, learners are listening, or perhaps monitoring, or thinking about, what they have comprehended.) Indeed, if findings from the replication study showed a multitude of non-word fixations in the captioned area, this might suggest the 68% caption viewing figure of the original study (found in addressing RQ1) inflated the actual time spent reading.

Note that this fixation analysis could also capture the viewers’ use of accompanying visuals, including their shifts between the captions and visual images (d'Ydewalle & Gielen, 1992). In particular, viewers’ fixation rates¹⁰ on features such as objects, movements, and the characters’ interactions, gestures and expressions could indicate the semantic importance of these to the viewer (Suvorov, 2015), thus feeding into the wider issue of how the modalities interact in captioned viewing.
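A minimal sketch of how such a fixation analysis might be tabulated, assuming the eye tracker exports one record per fixation with screen coordinates and a duration. The `Fixation` record, the caption band coordinates and the function names are hypothetical and are not taken from Winke et al. (2013). Fixation rate is computed as fixations per second, following Suvorov's (2015) definition.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    x: float            # horizontal gaze position in pixels
    y: float            # vertical gaze position in pixels
    duration_ms: float  # how long the gaze stayed stable


# Hypothetical caption band: bottom 20% of a 1920x1080 screen.
CAPTION_AREA = (0, 864, 1920, 1080)  # left, top, right, bottom


def in_area(fix, area):
    """Return True if the fixation falls inside the rectangular area."""
    left, top, right, bottom = area
    return left <= fix.x <= right and top <= fix.y <= bottom


def summarize(fixations, viewing_seconds):
    """Split fixations into caption vs. image areas and summarize each."""
    summary = {
        "caption": {"count": 0, "total_ms": 0.0},
        "image": {"count": 0, "total_ms": 0.0},
    }
    for fix in fixations:
        name = "caption" if in_area(fix, CAPTION_AREA) else "image"
        summary[name]["count"] += 1
        summary[name]["total_ms"] += fix.duration_ms
    for stats in summary.values():
        count = stats["count"]
        stats["rate_per_s"] = count / viewing_seconds
        stats["mean_ms"] = stats["total_ms"] / count if count else 0.0
    return summary


def count_shifts(fixations):
    """Count gaze shifts between the caption area and the image area."""
    labels = [in_area(f, CAPTION_AREA) for f in fixations]
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)
```

A fuller analysis would need word-level areas of interest within the caption band to separate word fixations from non-word fixations; the single caption rectangle here only distinguishes captions from the rest of the screen.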

During the eye-tracking, it could also be beneficial to utilize an online introspective technique, particularly verbal report. Winke et al. commented on the challenge of eliciting expansive and reliable data from the post-video interview. In the verbal reports, the learners could report their thoughts at intermittent brief stoppages through the text. Notably, this could potentially elicit the learners’ perceptions of their use of the audio/captions/visual modalities at various stages through the text. Eye-tracking technology has progressed beyond when Winke et al.'s participants had to watch the videos with their heads stabilized on a chin rest, and current eye trackers allow more freedom of movement, thus better accommodating such researcher-viewer interaction.

Perhaps in addition to the verbal report, or in preference to it if the verbal reporting process was deemed too intrusive on the viewing task, stimulated recall could be used (Gass & Mackey, 2016). Preferably, this method could be employed once the eye-tracking (and verbal report) data were analyzed – ideally soon after the data were gathered. In this stimulated recall, the learners would again watch the video¹¹ to stimulate a retrospective account of their thoughts during the task. They could also respond to researcher questions on areas of interest from the earlier analyzed data.

If these self-report measures of verbal report and/or stimulated recall produced results similar to those of the original study, this could give the original results more credibility.

5. Rodgers, M. P. H., & Webb, S. (2017). The effects of captions on EFL learners’ comprehension of English-language television programs. CALICO Journal, 34(1), 20–38

5.1 Background to the study

Only a relatively small number of longitudinal studies have been conducted to investigate the influence of captioned viewing on L2 learners’ listening over time. Vanderplank's (1988, 2016, 2019) influential qualitative studies of upper-intermediate to advanced level learners have produced many key findings. These include that: (1) viewers tend to overcome initial inexperience or attitudinal obstacles to the use of captions (Vanderplank, 1988); (2) many viewers improve their abilities to the extent of being able to alternate their use of spoken and caption cues as needed, or to utilize the spoken, visual and caption cues in unison (Vanderplank, 1988); (3) less-proficient learners tend to use captions as a scaffold, while proficient learners use them more as a backup to their listening, when needed (Vanderplank, 2016); and (4) caption-dependence to the exclusion of listening tends to occur, but usually on particularly challenging videos (Vanderplank, 2019).

Another strand of research with a longitudinal leaning has examined the development of learners’ bottom-up listening skills – skills that are essential for effective listening (Vandergrift, 2004) – resulting from captioned viewing (Charles, 2017; Charles & Trenkic, 2015; Mitterer & McQueen, 2009). These phonological skills are specific to listening, so if they develop significantly, this suggests the viewers were listening during the video, and learning the skills by mapping what they heard onto the captioned information. In one experimental study, Mitterer and McQueen (2009) found captioned viewing could help retune English learners’ phonological listening skills, in terms of significantly helping them to improve their comprehension of broad English accents. In another study, Charles (2017) found English learners’ word segmentation skills developed significantly from captioned viewing. Charles (2017) also found the learners’ listening comprehension significantly improved on a standardized listening test over the relatively short four-week period of the study.

Rodgers and Webb's (2017) study is one of the few experimental studies that tracked learners’ comprehension of captioned video over a lengthy period. Interestingly, findings from the study, where learners watched multiple episodes of a TV series, were divided over the value of the captioning. The study was published in CALICO Journal, and is worth replicating because, with some adjustments to its design, a replication can potentially produce more valid insights into the impact of such viewing.

5.2 The original study

Rodgers and Webb's (2017) experimental study examined learners’ comprehension of captioned video viewed over ten weeks. The learners watched ten episodes of an American TV comedy-drama series called Chuck. A pilot group had rated the series as very enjoyable, and the researchers considered this a series the learners were likely to watch on their own outside the classroom. (Vanderplank, 2016, has emphasized the need for learner interest in the videos, to enhance their learning potential.) The researchers investigated: (1) whether captioned viewing increased learners’ comprehension for each of the episodes, or whether this varied between episodes; and (2) whether captioned viewing increased learners’ comprehension over time (longitudinally), that is, from the first to the tenth episode.

The participants were first- and second-year students from a Japanese university, classes from which learners are considered to have English proficiency of around pre-intermediate to intermediate levels. Their proficiency was also further assessed through the vocabulary levels test.¹² Fifty-one of the students watched the series with captions, while 321 watched it without captions as a control group.

The learners usually watched one episode per week. Each of the ten episodes was, on average, 43 minutes long, so there were around 430 minutes, or seven hours, of viewing. To assess their comprehension of each episode, the learners took a listening comprehension test immediately after viewing it. Each of these tests, meticulously designed based on Buck's (2001) default listening construct, contained true/false, multiple-choice (MC), and sequencing questions, with an average of 74 items per test, after being subjected to item-analysis procedures. The items were translated into Japanese (partly to eliminate the learners’ need to read the items in English, to isolate the listening construct) and the reliability of each test was established at an earlier piloting stage, which involved ascertaining “whether the test items actually measure the intended underlying trait they were designed to measure: listening comprehension ability” (p. 28). To address memory constraints in the test, each episode was divided into six “viewing sections” of roughly seven minutes each. Learners would preview the questions relevant to an upcoming section, and the episode would be stopped at the end of the section for the learners to complete the questions.

Results for the weekly comparisons showed that over the ten weeks, the caption group outscored the no-caption group each week. However, a series of one-way ANOVAs found this difference was only significant for the three most difficult episodes (the initial episode and episodes 4 and 7; these were judged most difficult because both groups scored slightly lower for them on the test); this was when the scaffolding from the captions appeared to be most useful. For the initial episode, this was when neither group was used to the characters, context or general storyline.

The researchers noted the advantage for the captioned group over the no-caption group was very small, though, compared with past studies (e.g., Huang & Eskey, 1999–2000; Montero Perez et al., 2014). They suggested this may have been because the no-caption group's accumulated knowledge of the series and its characters over the successive episodes¹³ helped bolster their comprehension of it – an advantage not afforded in the past studies, which often just used a small number of short, unrelated informational clips.

The other main concern of the study was whether captioning improved learners’ comprehension over the duration of the study. Here, after the caption group outscored the no-caption group on the initial episode, there were fairly similar gains for both groups by the tenth episode. The researchers thus concluded that after this initial episode the captioning had little impact on viewers’ development. They again suggested this was likely owing to the ability of the no-caption group to utilize accumulated knowledge of the series to compensate for the absence of captioning.

5.3 Approach to the replication of Rodgers and Webb (2017)

Rodgers and Webb (2017) made a number of suggestions for replication of their research, many of which could be worth pursuing. One was that the “study could [ … ] be replicated with learners with other L1s” (p. 35).

This could be conducted as a close replication with learners from an L1 that shares the Latin script with English. In the original study, it could be that the experimental group of Japanese-speaking participants, whose L1 script obviously differs greatly from that of English, gained only minor advantage from the captions (compared with the no-caption control group) because they had difficulties reading these captions at sufficient speed (see Winke et al., 2013). The replication study might find the learners, with their L1 use of Latin script, have fewer difficulties reading the English captions, and thus gain greater advantage from them. Such a result would suggest the findings from the original study were an artefact of the participants’ L1. (Conversely, finding no such difference between the studies could increase the original study's confirmatory power.) Note that such a close replication would be feasible, as the materials and procedures in the original study were clearly described, in great detail.¹⁴

There are also possible conceptual replications of the study that could bear fruit in different ways. One such replication could examine the result related to the learners' longitudinal development, that beyond the first episode of the series, captioning appeared to hold little benefit for such development. The researchers emphasized throughout their study that the assessment of the learners’ comprehension of the ten episodes, and thus of their longitudinal development, was done through listening comprehension tests. However, in a replication, to assess the learners’ longitudinal development of their listening comprehension, a standardized listening test might be used. Unlike the testing in the original study, this would assess the learners’ development of their listening, detached from the video content.

Use of this test harks back to Vanderplank's (2016) comment that one can only really gauge improvement in learners’ listening comprehension over time from captioned viewing if the viewing helps them “develop the knowledge and skills required to follow and understand a wide variety of speech in the foreign language without captions” (p. 56). This standardized listening test would provide such an assessment, and likely a reliable one, of whether the two viewer groups improved their listening ability over the course of the study. It would also allow the amount of this development to be compared between the two groups.

This test would ideally be conducted: (1) prior to the first episode of the video series; (2) after about the third episode of the series (by which time the learners should be accustomed to its various enduring characteristics – according to the original study); and (3) also after the final episode of the series. As well as permitting examination of the findings from the original study, results from the test could also contribute further to the wider issue in L2 captioned viewing of whether, and to what extent, captioned viewing can develop L2 learners’ listening abilities (i.e., free of captions) over time.
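The three-point testing schedule above lends itself to a simple gain-score comparison. The following is a minimal sketch, assuming each learner has a score from the same standardized listening test at each of the three administrations. The function name and all scores are invented for illustration and are not data from any study; a real replication would also need an appropriate inferential test.

```python
from statistics import mean


def mean_gains(scores):
    """Mean gain per interval for a list of (pre, mid, post) score tuples."""
    return {
        "pre_to_mid": mean(mid - pre for pre, mid, post in scores),
        "mid_to_post": mean(post - mid for pre, mid, post in scores),
        "pre_to_post": mean(post - pre for pre, mid, post in scores),
    }


# Invented scores, one (pre, mid, post) tuple per learner.
caption_group = [(20, 23, 27), (18, 20, 25), (22, 24, 26)]
no_caption_group = [(21, 22, 25), (19, 21, 22), (20, 22, 24)]

caption_gains = mean_gains(caption_group)
no_caption_gains = mean_gains(no_caption_group)

# Positive values mean the caption group gained more over the whole study.
difference = caption_gains["pre_to_post"] - no_caption_gains["pre_to_post"]
```

Separating the pre-to-mid gain from the mid-to-post gain reflects the rationale for the second administration: development after the learners have become accustomed to the series can be examined on its own.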

Another replication could adhere to Rodgers and Webb's (2017) suggestion that: “Replication of this study utilizing a variety of television programs [ … ] may provide a more conclusive assessment of the effects of captions on language learning through viewing television” (p. 28). As suggested here, a number of episodes of the main TV series could be replaced by episodes from other TV series, maybe with each type alternated through the study. The results from this replication, both in terms of its longitudinal results and results for each of the ten post-episode tests, could thus perhaps serve to rule out, or verify, whether the close results between the caption and no-caption group in the original study were owing to the homogeneity of its episodes. In other words, the replication would test the veracity of the researchers’ view in the original study that the results were, thus, largely an artefact of the learners’ accumulated familiarity with the series and its characters – that this may have helped the original control group compensate for their non-use of captions in comprehending the videos.¹⁵

Finally, probably at or towards the end of any chain of replications, the researcher could consider modifying the post-episode tests to enhance their validity, and thus the reliability of the results – both of that study and of any future replications of that study. In discussing Rodgers and Webb's (2017) research, Vanderplank (2016) criticized the dual procedures in the tests of allowing question preview before each of the six viewing sections, and then frequently stopping the video after each section for a considerable time for learners to answer the questions. He argued that this priming and disruption, respectively, “does raise questions about the validity of the design and procedures, and the generalisations that might be made from the findings” (p. 121). While he acknowledged the need for the tests, Vanderplank felt the procedures that were used moved how the learners viewed the videos too far away from how they would ordinarily view them. Consequently, to modify the tests, question preview could probably be dispensed with, and for each episode the number of items could easily be halved from their average number of 74. Around 40 items could still be sufficient for reliable insight into the learners’ comprehension (and also make the construction of each test less onerous for the researcher). This could allow the number of viewing sections to be reduced from six ideally down to three, with only two stoppages in the video for the first two banks of questions, with the final questions answered post-episode. Utilizing mostly questions that test comprehension of the video's main ideas, rather than its details, could lessen memory effects for the viewers.
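Whether around 40 items would remain sufficiently reliable could be estimated before the tests are shortened. One standard tool for this is the Spearman-Brown prophecy formula, which predicts the reliability of a test whose length changes by a given factor, assuming the removed items are comparable to those retained. This formula is not mentioned in the original study or in Vanderplank (2016), and the reliability value of 0.90 below is purely hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by length_factor."""
    return (length_factor * reliability) / (
        1 + (length_factor - 1) * reliability
    )


original_items = 74         # average items per test in the original study
shortened_items = 40        # shortened length suggested in the text
assumed_reliability = 0.90  # hypothetical value, not taken from the study

predicted = spearman_brown(
    assumed_reliability, shortened_items / original_items
)
print(round(predicted, 2))
```

Under this assumed starting value, the predicted reliability of the 40-item test is roughly 0.83. The actual figure would depend on the reliability estimates obtained in piloting and on which items are retained.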

6. Conclusion

There is obviously still much to be learned about the relationship between captioned viewing and L2 learner listening. It is crucial to know more about this because many teachers currently employ captioned videos with the primary aim of improving their learners’ listening. It is, thus, important to know how much listening actually tends to take place (compared with reading the captions and utilizing the visual information), what influences this, and how much it actually improves the learners’ listening comprehension. Hopefully, replication of these three influential studies can increase and extend our understanding of these studies’ original findings, and in turn, enhance our knowledge of learner listening in this multi-modal environment of captioned viewing.

Michael Yeldham is a Professor in the School of Foreign Language Education at Jilin University, Changchun, China. He has been involved in TESOL for over 30 years as a teacher and researcher. His main research interest is L2 listening, on which he has published numerous articles, including a systematic review of studies examining the role of listening in L2 captioned video viewing.

Footnotes

¹ Learning the aural forms of words does help develop learners’ listening abilities, but the point here is that vocabulary learning is associated more with L2 acquisition, or “listening to learn” as opposed to “learning to listen” (Vandergrift, 2004).

² Or alternatively use a short questionnaire asking the learners to self-rate these reading abilities (Vanderplank, 2019).

³ It is important to first get permission from the copyright holders before using videos in this or any similar study.

⁴ Immediately before viewing the video, Taylor presented and pronounced for the learners a list of words from it as an aid to its comprehension. Avoid doing this, to make the viewing task more authentic.

⁵ Each pause is inserted at irregular intervals, with a minimum of ten seconds of dialogue usually separating each one. The rationale behind the irregular placements is so the learners do not anticipate when a pause will come, and the rather lengthy time between pauses is to encourage use of the learners’ regular video comprehension processes – rather than a potential focus on bottom-up processing to prepare for the pauses.

⁶ And perhaps for comparison, have the control group rate their use of the audio and visual modalities.

⁷ I acknowledge there may have been legitimate reasons for some of these shortfalls, such as space restrictions for the article, or perhaps largely just because of common practice at the time of the study.

⁸ After viewing each video, the learner answered MC comprehension tests. These were given to encourage the learner to pay attention to the videos. These tests also presumably focused the learner on trying to comprehend the videos rather than to use them mainly for some other purpose, such as to learn new vocabulary.

⁹ Longer fixations on specific words likely indicate their greater importance to the viewer (Rayner, 1998; Suvorov, 2015) – although one might need to factor in other influences: readers tend to fixate for longer on long words, as do poor readers on words they find hard to understand (Rayner, 1998). For further information on fixations and eye-tracking caption research, see Montero Perez (2022) and Montero Perez et al. (2015).

¹⁰ Fixation rate is the number of fixations (eye movements that stabilize over a stationary object) per second (Suvorov, 2015).

¹¹ Or if they had earlier provided a verbal report while watching it, perhaps a video recording of them undertaking the original task, with the recording showing on the screen both the original video and their reactions to it and any comments they made about it during the verbal report.

¹² This study was part of Rodgers’ (2013) Ph.D. study that also examined vocabulary development from captioned viewing – so use of the vocabulary test appears to be largely associated with that.

¹³ Listening relies on top-down and bottom-up information (Vandergrift, 2004), so this accumulated knowledge may have involved various aspects of the context, storylines, characters and their relationships (top-down information), and the characters’ accents and other features of their speech (bottom-up information).

¹⁴ The control group sample size need not be as large as the massive sample of 321 in the original study, though.

¹⁵ It might also be beneficial to gather qualitative data to help explain the results of some of the replications, perhaps at the least a question after the learners’ final, week 10 session, asking them to reflect on their viewing strategies and whether/how these had changed through the study (Vanderplank, 2016).

References

Bisson, M. J., Van Heuven, W. J., Conklin, K., & Tunney, R. J. (2014). Processing of native and foreign language subtitles in films: An eye tracking study. Applied Psycholinguistics, 35(2), 399–418. doi:10.1017/S0142716412000434

Borras, I., & Lafayette, R. C. (1994). Effect of multimedia courseware subtitling on the speaking performance of college students of French. Modern Language Journal, 78(1), 61–75. doi:10.2307/329253

Buck, G. (2001). Assessing listening. Cambridge University Press.

Caimi, A. (2006). Audiovisual translation and language learning: The promotion of intralingual subtitles. The Journal of Specialised Translation, 6, 85–98.

Chai, J., & Erlam, R. (2008). The effect and the influence of the use of video and captions on second language learning. New Zealand Studies in Applied Linguistics, 14(2), 25–44. https://search.informit.org/doi/10.3316/informit.997746379301329

Charles, T. (2017). The role of captioned video in developing speech segmentation for learners of English as a second language. [Unpublished doctoral dissertation]. University of York.

Charles, T., & Trenkic, D. (2015). Speech segmentation in a second language: The role of bi-modal input. In Gambier, Y., Caimi, A., & Mariotti, C. (Eds.), Subtitles and language learning (pp. 173–198). Peter Lang.

d'Ydewalle, G., & Gielen, I. (1992). Attention allocation with overlapping sound, image, and text. In Rayner, K. (Ed.), Eye movements and visual cognition: Scene perception and reading (pp. 415–427). Springer-Verlag.

Field, J. (2008). Bricks or mortar: Which parts of the input does a second language listener rely on? TESOL Quarterly, 42(3), 411–432. doi:10.1002/j.1545-7249.2008.tb00139.x

Field, J. (2019). Second language listening: Current ideas, current issues. In Schwieter, J. W. & Benati, A. (Eds.), The Cambridge handbook of language learning (pp. 283–319). Cambridge University Press.

Foster, P., Tonkin, A., & Wigglesworth, G. (2000). Measuring spoken language: A unit for all reasons. Applied Linguistics, 21(3), 354–375. doi:10.1093/applin/21.3.354

Gass, S. M., & Mackey, A. (2016). Stimulated recall methodology in applied linguistics and L2 research (2nd ed.). Routledge.

Hayati, A., & Mohmedi, F. (2011). The effect of films with and without subtitles on listening comprehension of EFL learners. British Journal of Educational Technology, 42(1), 181–192. doi:10.1111/j.1467-8535.2009.01004.x

Huang, H.-C., & Eskey, D. E. (1999–2000). The effects of closed-captioned television on the listening comprehension of intermediate English as a second language (ESL) students. Journal of Educational Technology Systems, 28(1), 75–96. doi:10.2190/RG06-LYWB-216Y-R27

Mayer, R. E. (2009). Multimedia learning (2nd ed.). Cambridge University Press.

McManus, K. (2021). Replication and open science in applied linguistics research. In Plonsky, L. (Ed.), Open science in applied linguistics. John Benjamins. (Preprint version, accepted for publication 27 October 2021).

Mitterer, H., & McQueen, J. M. (2009). Foreign subtitles help but native language subtitles harm foreign speech perception. PLoS One, 4(11), e7785. doi:10.1371/journal.pone.0007785

Montero Perez, M. (2022). Second or foreign language learning through watching audio-visual input and the role of on-screen text. Language Teaching, 55(2), 163–192. doi:10.1017/S0261444821000501

Montero Perez, M., Peters, E., & Desmet, P. (2014). Is less more? Effectiveness and perceived usefulness of keyword and full captioned video for L2 listening comprehension. ReCALL, 26(01), 21–43. doi:10.1017/S0958344013000256

Montero Perez, M., Peters, E., & Desmet, P. (2015). Enhancing vocabulary learning through captioned video: An eye-tracking study. Modern Language Journal, 99(2), 308–328. doi:10.1111/modl.12215

Paivio, A. (2007). Mind and its evolution: A dual coding theoretical approach. Erlbaum.

Porte, G. (2013). Who needs replication? CALICO Journal, 30(1), 10–15. doi:10.11139/cj.30.1.10-15

Porte, G., & McManus, K. (2019). Doing replication research in applied linguistics. Routledge.

Pujolà, J.-T. (2002). CALLing for help: Researching language learning strategies using help facilities in a web-based multimedia program. ReCALL, 14(2), 235–262. doi:10.1017/S0958344002000423

Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3), 372–422. doi:10.1037/0033-2909.124.3.372

Rodgers, M. P. H. (2013). English language learning through viewing television: An investigation of comprehension, incidental vocabulary acquisition, lexical coverage, attitudes, and captions. [Unpublished doctoral dissertation]. Victoria University of Wellington.

Rodgers, M. P. H., & Webb, S. (2017). The effects of captions on EFL learners’ comprehension of English-language television programs. CALICO Journal, 34(1), 20–38. doi:10.1558/cj.29522

Suvorov, R. (2015). The use of eye tracking in research on video-based second language (L2) listening assessment: A comparison of context videos and content videos. Language Testing, 32(4), 463–483. doi:10.1177/0265532214562099

Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285. doi:10.1207/s15516709cog1202_4

Sydorenko, T. (2010). Modality of input and vocabulary acquisition. Language Learning and Technology, 14(2), 50–73. http://dx.doi.org/10125/44214

Taylor, G. (2005). Perceived processing strategies of students watching captioned video. Foreign Language Annals, 38(3), 422–427. doi:10.1111/j.1944-9720.2005.tb02228.x

Vandergrift, L. (2004). Listening to learn or learning to listen? Annual Review of Applied Linguistics, 24, 3–25. doi:10.1017/S0267190504000017

Vandergrift, L., & Cross, J. (2014). Captioned video: How much listening is really going on? Contact, 40(3), 31–33.

Vanderplank, R. (1988). The value of teletext subtitles in language learning. ELT Journal, 42(4), 272–281. doi:10.1093/elt/42.4.272

Vanderplank, R. (2016). Captioned media in foreign language learning and teaching. Palgrave Macmillan.

Vanderplank, R. (2019). ‘Gist watching can only take you so far’: Attitudes, strategies and changes in behaviour in watching films with captions. The Language Learning Journal, 47(4), 407–423. doi:10.1080/09571736.2019.1610033

Winke, P., Gass, S., & Sydorenko, T. (2013). Factors influencing the use of captions by foreign language learners: An eye-tracking study. Modern Language Journal, 97(1), 254–275. doi:10.1111/j.1540-4781.2013.01432.x

Yeldham, M. (2017). Techniques for researching L2 listeners. System, 66, 13–26. doi:10.1016/j.system.2017.03.001

Yeldham, M. (2018). Viewing L2 captioned videos: What's in it for the listener? Computer Assisted Language Learning, 31(4), 367–389. doi:10.1080/09588221.2017.1406956
