Volume 4 (2) ~ November 2012

ISSN # 2150-5772 – This article is the intellectual property of the authors and CIT. If you wish to use this article in your teaching or in another format, please credit the authors and the CIT International Journal of Interpreter Education.

A Study of Interpreting Accreditation Testing Formats in Australia

Sedat Mulayim
RMIT University, Melbourne, Australia
Correspondence to: sedat.mulayim@rmit.edu.au



Testing modes used in assessing actual interpreting skills vary around the globe from live practical demonstrations before a panel of assessors to recorded audio sessions subsequently assessed by examiners. Whether testing modes have an impact on assessment outcome has been a point of debate among assessors and assessees. In Australia, three testing modes are commonly used: audio, video, and live-simulated. The National Accreditation Authority for Translators and Interpreters (NAATI), the authority that oversees interpreter accreditation in Australia, uses an audio mode in interpreter accreditation tests. In the paraprofessional interpreter accreditation test, the candidate interprets two audio-recorded dialogues of approximately 300 words each. The test also includes sections on social/cultural understanding and ethics of the profession. The test is administered by a supervisor who records the candidate’s performance; the recorded test then is forwarded to two examiners for marking. NAATI-approved training providers, on the other hand, have traditionally used a live-simulated mode in which two role players read two dialogues and candidates individually interpret the dialogues in the presence of two examiners. Recently, some training providers, including RMIT University in Melbourne, Australia, have introduced a video mode in accreditation tests as well as in training. In this mode, a candidate interprets two video-recorded dialogues set as per NAATI accreditation test standards. The candidate’s performance is video-recorded and then forwarded to examiners. Concerns were expressed about whether the video and audio formats would assess a candidate’s actual interpreting skills accurately. The view was that interpreting tests administered using audio or video modes limit the opportunity for interaction between interpreters and speakers. 
An opposing view was that audio testing was a better method because core interpreting skills can be tested and, because candidates will not feel as pressured or anxious, cognitive processing capacity will be reserved for concentrating on content, thus more capacity will remain available to undertake transfer between two languages.

Testing Modes and Impact on Assessment

There is some literature from the interpreting discipline investigating the impact that the three modes (face-to-face, telephone, and video) may have on quality of service, mainly in medical, legal, and conference interpreting settings (Locatis et al., 2010; Moser-Mercer, 2005; Swaney, 1997). Overall, the studies do not indicate significant differences in terms of key quality indicators such as accuracy or unjustified additions or omissions. However, opinions from users and interpreters vary. Some prefer interpreter participation via video or telephone for reasons such as maintaining privacy or professional distance; others prefer these modes because an interpreter who is not physically present interferes less in the professional setting, such as a patient-doctor consultation. Some professionals, however, prefer a face-to-face interpreter as they believe that close interaction between the client and the interpreter is essential for successful communication.
In other disciplines, there appears to be significant interest in similar testing/assessment modes that are of relevance to interpreting skills assessment. Straus, Miles, and Levesque (2001) studied the effects of video-conference, telephone, and face-to-face media on interviewer and applicant judgments in employment interviews. The three types of communication media are similar to the modes used in interpreter accreditation tests. In mock interviews, the researchers asked the six interviewers to rate 60 applicants’ general abilities, likability, communication understanding, and conversation fluency. The researchers also measured applicants’ self-consciousness and the degree to which they felt at ease during the interviews. Results show that interviewers evaluated applicants more favorably in telephone versus face-to-face interviews. The difference was stronger for less physically attractive applicants, which suggests that the telephone filtered negative visual cues. The researchers also provided an alternative explanation for this finding, which is that communicating by telephone imposes lower cognitive workload, and consequently applicants may have been better able to focus on the content of their responses. Both interviewers and applicants reported more difficulty regulating and understanding conversations in video-conference interviews versus face-to-face settings. Applicants reported being much more self-conscious about their nonverbal behavior in face-to-face versus telephone interviews.
Riddle et al. (2002) studied the differences in audio-recorded versus video-recorded doctor-patient interactions. The researchers recruited 47 patients, 12 doctors, and eight raters. The raters were asked to rate the doctor-patient interactions using various audio and video observational systems commonly used in assessing doctor-patient interactions. The findings of this study indicate that ratings of audio-recorded doctor-patient interactions are not equivalent to ratings of video-recorded encounters, even though raters were using the same coding system and analyzing the same doctor-patient exchanges, and that analyses based on audio-only data may not be sufficiently sensitive to raters’ interpretations of behaviors, especially when judgments need to be made to address incongruence between verbal and nonverbal cues.
In another study, Ryan and Costa-Giomi (2004) asked raters from three groups to rate 10 piano performances from audiotapes (sound only) and from videotapes (sound and image). Group 1 comprised 18 boys and 14 girls aged 12; Group 2 comprised non-music undergraduate students who were taking an introductory piano class; and Group 3 comprised undergraduate music students. The participants were also asked to rate the attractiveness of the performers from brief videos. Results from this study support the existence of an attractiveness bias in the evaluation of musical performance. The results also show that evaluations of audio-visual recordings of musical performances are more reliable than evaluations of audio recordings. The researchers argue that care should be taken when using the more reliable means of evaluation, such as videotapes or DVDs, because doing so risks favoring a particular group of performers.
The above studies suggest that the modes in question may impact on performance and consequently may affect the assessment result to varying degrees. This article discusses whether the findings of these studies apply to interpreting skills testing. It provides some empirical data that may inform the debate on testing modes and offers some guidance, albeit limited, for training providers. The data collection mainly focused on whether the testing mode impacts on student achievement in an accreditation test and, if so, whether there were any identifiable factors within that mode that contributed to this, such as test anxiety or advantages in visual cues.

The Use of the Three Dialogue Interpreting Assessment Modes in Australia

In Australia, NAATI sets standards for various levels of interpreting accreditation and conducts interpreter accreditation tests. A number of approved interpreter training providers conduct their own interpreter accreditation tests, the results of which are recognized by NAATI. The tests are predominantly given in one of three modes: audio, video, and live.
The accreditation test at the paraprofessional level, which is the focus of this discussion, involves two bilingual dialogues of approximately 300 words each. No segment may exceed 35 words. The test also has social/cultural understanding and ethics-of-the-profession sections, in which candidates are expected to answer four questions per section (two in each language). However, approved training programs may administer only the dialogue-interpreting section of the accreditation test, as social/cultural understanding and ethics content is covered in other subjects and assessed separately. At present, NAATI (2008) requires students studying with approved training providers to achieve a minimum result of 70% (or competent with distinction) out of a nominal mark of 100 in the dialogue-interpreting test held under NAATI testing standards, and a minimum result of a pass (or competent) in the other subjects.

Audio Mode

In this testing mode, mainly used by NAATI, candidates undertake testing in a soundproof recording room. The assessment supervisor operates the recording equipment. In the paraprofessional interpreter accreditation test, two scripted dialogues are played to the candidate from a master recording. A briefing in English of the dialogue scenario is played before each of the dialogues (the briefings do not have to be interpreted). The candidate is prompted to start interpreting after the briefing. The candidate can signal the assessment supervisor if s/he needs to have a segment repeated; only a limited number of repetitions are permitted without penalty. The assessment supervisor will not answer questions regarding the content of the dialogue. The candidate’s interpreting performance is audio-recorded for later marking.

Video Mode

In this testing mode, candidates attempt two video-recorded scripted dialogues. A briefing in English for each scenario is played; these do not have to be interpreted. The candidate is required to interpret the dialogues segment by segment. There is a pause after each segment to allow for interpreting. The candidate can signal the assessment supervisor if s/he needs to have a segment repeated, but the assessment supervisor will not answer questions regarding the content of the dialogues. The candidate can ask for a limited number of repetitions without being penalized. The candidate’s interpreting performance is video-recorded for later marking.

Live-Simulated Mode

A test in this mode is conducted in a classroom. During the test, two role players read out two scripted dialogues. One of the role players is a native English speaker; the other is a native speaker of the candidate’s other language. A briefing in English for the scenario is read out (the briefing does not have to be interpreted). The candidate is required to interpret the dialogues segment by segment. The role players pause at the end of each segment to allow for the candidate to interpret. Two examiners observe the performance of the candidate and assess in the room.

Data Collection


For the purposes of generating data, students from the Chinese (Mandarin) language group who had completed two thirds of the Diploma of Interpreting, a 6-month full-time vocational education and training program, were invited to participate in the study. Mandarin speakers were selected purely because of resourcing considerations such as the availability of markers and recording booths. Five student interpreters from the Mandarin language group volunteered to take part. Students are selected for the program through a bilingual intake test in which applicants are expected to demonstrate a sufficient level of bilingual skills. The Diploma of Interpreting is a NAATI-approved program, and its final interpreting test is designed and assessed according to NAATI accreditation test standards. Students who achieve a minimum of 70 out of 100 in both dialogues are recommended to NAATI for accreditation at the paraprofessional level.
Two scripted dialogues for each mode were designed and produced according to the NAATI paraprofessional accreditation test standards in terms of complexity and word-count ranges for each segment and for the whole dialogue. The topics of the dialogues were from common community interpreting topics such as welfare, education, and community health and included broad and routine language as stipulated in NAATI accreditation test standards. Each student was asked to interpret a set of two dialogues under each mode (a total of six different dialogues per student): a live-simulated setting with two role players and two examiners present, an audio setting where a test administrator was present, and a video-recorded setting where a test administrator was present.
Two examiners were asked to mark the students’ performances in each mode. The examiners did not have any direct contact with the students prior to the study. The students’ performances in all three modes were video-recorded.


Analysis of variance (ANOVA) was used to process the data collected (see Appendix A). SPSS for Windows Version 16.0 was used for statistical analyses. The variables of marker and student were added to the ANOVA as it was perceived that they may impact on the result. Student and marker were treated as random factors.
Table 1. Mean scores and standard deviations for test mode

Test mode    Mean (out of 100)    Standard deviation
Audio        68.10                6.73
Video        65.80                8.34
Live         66.82                7.79

Table 2. Results of ANOVAs

Source                          F test*    p value**
Test Mode                       0.904      .568
Marker                          7.147      .184
Student                         12.381     .057
Test Mode × Marker              0.901      .444
Test Mode × Student             1.025      .487
Marker × Student                1.138      .404
Test Mode × Marker × Student    1.502      .198

*An F test is used to compare the means of two or more groups involved in the study. It can be used with any sample size higher than two.
**If the p value is below .05 (or below .01), the result is regarded as statistically significant; that is, there is a significant difference between the means. If p > .05, there is no significant difference between the means.
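The decision rule in the note above can be applied mechanically to the reported values. The following Python sketch is illustrative only and is not part of the original SPSS analysis; the names `ALPHA` and `anova_results` are assumptions introduced here, and the figures are copied from Table 2.

```python
ALPHA = 0.05  # significance threshold named in the table note

# (F statistic, p value) for each source of variance, copied from Table 2.
anova_results = {
    "Test Mode": (0.904, 0.568),
    "Marker": (7.147, 0.184),
    "Student": (12.381, 0.057),
    "Test Mode x Marker": (0.901, 0.444),
    "Test Mode x Student": (1.025, 0.487),
    "Marker x Student": (1.138, 0.404),
    "Test Mode x Marker x Student": (1.502, 0.198),
}

# Sources whose p value falls below the threshold.
significant = [src for src, (_, p) in anova_results.items() if p < ALPHA]

# Source that comes closest to the threshold (smallest p value).
closest = min(anova_results, key=lambda src: anova_results[src][1])

for src, (f_stat, p) in anova_results.items():
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{src:<30} F = {f_stat:>6.3f}  p = {p:.3f}  {verdict}")

print("Significant sources:", significant)
print("Closest to threshold:", closest)
```

Under this rule no source reaches significance; the Student factor, at p = .057, comes closest.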
Variances in the three testing modes, between markers, and among students were analyzed, as well as variances in marking consistency under different test modes. In addition, variances were assessed in marks each student received under different testing modes, in marks each marker gave to students, and in marks under different test modes among markers.
Overall, analyses of student performance under the different testing modes showed p > .05; that is, none of the differences were statistically significant. Transfer skills (that is, comprehension in one language and expression in another) did not appear to be significantly impacted by the mode of delivery. The results achieved by the students ranged from 53 to 80, mostly within the 60%–69% (credit) range, which is quite acceptable at this stage of their studies but less than the minimum of 70% (distinction) required for NAATI accreditation. Only Student 5 achieved a clear NAATI accreditation result in the audio and live-simulated modes.
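The 70-mark requirement can be checked against the audio-mode marks listed in Appendix A. The sketch below is a hedged illustration, not the procedure used by NAATI or by the study: the article does not state how the two examiners' marks are combined, so two assumed readings are tried (every individual mark must reach 70, or the examiners' average per dialogue must reach 70). The function and variable names are introduced here for illustration.

```python
PASS_MARK = 70  # minimum per dialogue for recommendation to NAATI

# Audio-mode marks from Appendix A: for each student, two dialogues,
# each marked by (Examiner A, Examiner B).
audio_marks = {
    1: [(74, 64), (70, 70)],
    2: [(68, 74), (65, 66)],
    3: [(65, 64), (64, 74)],
    4: [(62, 59), (62, 54)],
    5: [(79, 74), (79, 75)],
}

def passes_strict(dialogues):
    """Assumed reading 1: every examiner's mark on every dialogue is >= 70."""
    return all(mark >= PASS_MARK for pair in dialogues for mark in pair)

def passes_averaged(dialogues):
    """Assumed reading 2: the examiners' mean mark on every dialogue is >= 70."""
    return all(sum(pair) / len(pair) >= PASS_MARK for pair in dialogues)

strict = [s for s, d in audio_marks.items() if passes_strict(d)]
averaged = [s for s, d in audio_marks.items() if passes_averaged(d)]

print("Strict reading:  ", strict)
print("Averaged reading:", averaged)
```

Both readings single out Student 5 in the audio mode, which is consistent with the statement above.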

Observations and Data Collected Through Questionnaires and Interviews

The students were also asked to complete a brief questionnaire and attend a short, semistructured interview to collect data about their experiences under each testing mode (see Appendix B). Most participants believed that the live-simulated setting was more “real,” with real interactions with speakers. They also believed that physical appearance does affect the assessor’s judgement and said they would choose more formal attire when attending tests using live-simulated mode, because “that will give examiners a good impression.” Another advantage they stressed was that they could ask questions when they encountered unfamiliar words and expressions. Most of the participants asked the role players to explain unfamiliar words or concepts at least once. However, most of them reported nervousness in the live-simulated setting and anxiety about how their body language and nervousness may have given a bad impression to the examiners. One of the participants reported being distracted by the role players’ facial expressions after each of her renditions: “The reaction of the role players distracted me. I could not help thinking about the role player’s expressions—did I omit something? Was my rendition right or wrong?”
Most of the participants claimed they felt less anxious when sitting in the video and audio interpreting booth and said they could concentrate more on the interpreting task; however, they thought the setting was less “real.” The students all noted the fact of not being able to ask questions as a downside of these two modes. Most participants reported they did not find much difference between the audio mode and the video mode in terms of having visuals in video mode as an additional feature, as they were busy taking notes most of the time.
The markers were also interviewed. Both markers believed that the live-simulated setting provided a more complete view of candidate performance and said that they did not believe the candidate’s appearance, or lack of it, had an impact on their assessment of a performance.

Discussion of Findings

Test Anxiety

Test anxiety, a factor that clearly affects test performance, has attracted interest among psychologists and researchers from different disciplines. According to Moshe Zeidner (1998), “it has been found to interfere with performance both in laboratory settings as well as in real life testing situations in schools or colleges. The higher the reported test anxiety scores, the greater the problems reported in the processing of information” (p. 215). Test anxiety is multidimensional, affecting people in different ways, and different test settings generate test anxiety differently. Researchers in other disciplines such as management have examined test anxiety in three communication media—videoconferencing, telephone, and face-to-face—similar to the three interpreter testing modes.
Research on test anxiety has emerged in the field of language testing, an area closely related to interpreting and translation studies. Bachman (1990) stated that “test performance is affected by the characteristics of the methods used to elicit test performance. . . . Some test takers may perform better in the context of an oral interview than they would sitting in a language lab speaking into a microphone. . . . Performance on language tests varies as function both of an individual’s language ability and of the characteristics of the test method” (pp. 111–113).
Zeidner (1998) provided a list of information processing deficits in high-test-anxious individuals, for example,

Research suggests that the presence of an external observer may negatively impact upon examinees’ anxiety in an evaluative situation (Zeidner, 1998). “The presence of an external observer or audience in the test situation may be particularly debilitating for high-test-anxious subjects, who may be more responsive to the potential evaluation of others and react to such evaluation with increased levels of anxiety” (p. 228). This is relevant to live-simulated test settings where role players are, in effect, observers during the assessment. As mentioned earlier, most participants in our study reported nervousness in live-simulated test settings.

Trading a More “Realistic” Interaction for a Less Anxious Mind

Although research suggests that in-person encounters are more highly rated by interpreting users and interpreters (Locatis et al., 2010), no significant difference in exam results between audio/video settings and the live-simulated setting has been noted. This may be due to negative impacts on performance caused by the comparatively high level of anxiety in the live-simulated setting offsetting benefits associated with this mode (i.e., being “more real,” allowing real interactions with speakers, and “allowing candidates to ask questions”). This explanation is supported by the study of the effects of video-conference, telephone, and face-to-face media on interviewer and applicant judgments in employment interviews mentioned earlier, in which the researchers provided an alternative explanation for their finding. According to Straus and colleagues (2001, p. 374), “Communicating by telephone imposes lower cognitive workload, and consequently, applicants may have been better able to focus on the content of their responses.”

Fairness-Related Issues in the Three Testing Modes

Fairness in testing refers to testing that is free of bias, equitable treatment of all examinees in the testing process, and fairness and equality in the outcomes of testing (AERA, APA, & NCME, 1999). In this study, it was noted that some issues with testing can potentially negatively affect the fairness of the tests, which in turn may affect the accreditation testing system as a whole. It is acknowledged that “absolute fairness to every examinee is impossible to attain, if for no other reason than the fact that tests have imperfect reliability and that validity in any particular context is a matter of degree” (AERA, APA, & NCME, 1999, p. 73). However, it is worthwhile to bring these issues to the attention of test developers and researchers.


Bias on the part of the assessors or raters of a particular professional performance can impact on assessment outcome. As Richard Stiggins, an expert in performance assessment, has stressed, it is critical that the scoring procedures are designed to assure that “performance ratings reflect the examinee’s true capabilities and are not a function of the perceptions and biases of the persons evaluating the performance” (as cited in Linn, Baker, & Dunbar, 1991, p. 18).

Appearance Bias

Appearance bias is particularly relevant to the study of the three testing modes. Although both markers in our study responded that appearance was not a consideration in marking, studies conducted in other disciplines demonstrate the existence of an attractiveness bias in the evaluation of professional performance. Therefore, it can be argued that, compared with candidates who take tests administered in an audio mode, candidates who take tests administered in video or live-simulated settings are more likely to be subject to appearance bias. Different degrees of risk in relation to biases resulting from differences in testing modes may translate into differences in scores. This is supported by findings in the study of doctor-patient interactions, in which ratings of audio-taped doctor-patient interactions were not equivalent to ratings of videotaped encounters, even though the raters used the same coding system and analyzed the same doctor-patient exchanges.

Cost and Efficiency

Interpreting accreditation tests are labor intensive and costly. As Clifford (2001) stated, an assessment may be valid, reliable, and equitable, but high cost and unreasonably elaborate procedures may prevent its use. The live-simulated mode is the most costly and least efficient of the three testing modes.


For each language group, the live-simulated mode involves at least two role players reading scripted dialogues to each candidate. The examiners and role players are required to sit through the entire assessment process. Compared with the live-simulated mode, video and audio modes cost much less. The passages are prerecorded; no role players are required during the assessment process; and just one exam supervisor is needed to operate the recording equipment. Because of higher requirements for technical sophistication and resources, the costs of producing prerecorded video dialogues and recording video-mode renditions are higher than the costs of administering tests in the audio mode.

In a live-simulated test setting, both of the markers have to be present and watch every rendition at the same time. In audio and video test settings, in contrast, a number of recordings can be done at the same time. Markers do not have to be in the same room at the same time and sit through the entire interpreting process; they can mark the renditions at a later time.


This study has a number of limitations. First, the scale of the study is small, with a small number of candidates, markers, dialogues, topics, and only one spoken language group, so the results must be interpreted with caution. Students who volunteered for the study had completed two thirds of a 6-month full-time program, accredited at paraprofessional level. This limits the representativeness of the findings. Additionally, the students involved knew the purpose of the study and that this was not a real accreditation test. If students had believed they were taking a real test for accreditation, they may have reacted differently (e.g., become more nervous).


In this article, I have reported on a comparison of three interpreter accreditation assessment modes currently in use in Australia. The results from this study did not reveal statistically significant differences in the marks achieved by students under the three testing modes and do not show that the testing mode has a significant impact on the result. Nor did the study find any evidence that testing mode may have a significant impact on the performance of a candidate/student during the actual transfer process. By exploring various discussions and research conducted in other disciplines on how different settings (i.e., face-to-face, audio, and audio-visual) may affect performance and rating, this study indicates a need to further examine different aspects of the three common interpreter test modes as part of a more in-depth study of interpreter testing design.


American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, UK: Oxford University Press.
Clifford, A. (2001). Discourse theory and performance-based assessment: Two tools for professional interpreting. Meta, 46(2), 365–378.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15–21.
Locatis, C., Williamson, D., Gould-Kabler, C., Zone-Smith, L., Detzler, I., Roberson, J., . . . Ackerman, M. (2010). Comparing in-person, video, and telephonic medical interpretation. Journal of General Internal Medicine, 25(4), 345–350.
Moser-Mercer, B. (2005). Remote interpreting: The crucial role of presence. Bulletin VALS-ASLA, 81.
National Accreditation Authority for Translators and Interpreters (NAATI). (2008). NAATI examiners’ manual. Canberra, Australia: Author.
Riddle, D. L., Albrecht, T. L., Coovert, M. D., Penner, L. A., Ruckdeschel, J., Blanchard, C., . . . Urbizu, D. (2002). Differences in audio-taped versus videotaped physician-patient interactions. Journal of Nonverbal Behavior, 26(4), 219–239.
Ryan, C., & Costa-Giomi, E. (2004). Attractiveness bias in the evaluation of young pianists’ performances. Journal of Research in Music Education, 52(2), 141–154.
Straus, S. G., Miles, J. A., & Levesque, L. L. (2001). The effects of videoconference, telephone, and face-to-face media on interviewer and applicant judgments in employment interviews. Journal of Management, 27(3), 363–381.
Swaney, L. (1997). Thoughts on live vs. telephone and video interpretation. NAJIT Proteus, 6(2).
Zeidner, M. (1998). Test anxiety: The state of the art. New York, NY: Plenum Press.

Appendix A: Marks Achieved in Each Testing Mode

Note that each examiner marked two dialogues for each candidate.

Audio Mode

           Marks (out of 100 as per NAATI accreditation test marking guidelines)
Student    Examiner A    Examiner B
1          74            64
           70            70
2          68            74
           65            66
3          65            64
           64            74
4          62            59
           62            54
5          79            74
           79            75
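As a cross-check, the audio-mode marks above can be summarized with a few lines of Python. This is a minimal sketch, not the SPSS procedure used in the study; it assumes that the Table 1 figures pool all 20 marks (five students, two dialogues, two examiners) and use the sample standard deviation. The variable names are introduced here for illustration.

```python
from statistics import mean, stdev

# Audio-mode marks from the table above: for each student, two dialogues,
# each marked by (Examiner A, Examiner B).
audio_marks = {
    1: [(74, 64), (70, 70)],
    2: [(68, 74), (65, 66)],
    3: [(65, 64), (64, 74)],
    4: [(62, 59), (62, 54)],
    5: [(79, 74), (79, 75)],
}

# Flatten to one list of marks (assumption: all marks are pooled).
all_marks = [
    mark
    for dialogues in audio_marks.values()
    for pair in dialogues
    for mark in pair
]

audio_mean = mean(all_marks)
audio_sd = stdev(all_marks)  # sample SD, n - 1 in the denominator

print(f"n = {len(all_marks)}, mean = {audio_mean:.2f}, SD = {audio_sd:.2f}")
```

Under these assumptions the result agrees with the audio row of Table 1 (mean 68.10, standard deviation 6.73).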


Video Mode

Student Marks (out of 100 as per NAATI accreditation test marking guidelines)
Examiner A Examiner B


Live-Simulated Mode


           Marks (out of 100 as per NAATI accreditation test marking guidelines)
Student    Examiner A    Examiner B
1          64            60
           64            65
2          70            68
           70            72
3          73            57
           75            65
4          57            55
           55            53
5          80            72
           80            76

Appendix B: Post-Recording Questionnaire

Please answer the following questions. You may provide your answers in either Chinese or in English. Please use a separate sheet if you need more space to write.
1. Of the three assessment modes, which mode do you believe allowed you to demonstrate your ability the most?
a) Audio mode
b) Video mode
c) Live-simulated mode
2. Please list in your opinion both the positive and negative sides of the three modes.

3. Do you think physical appearance affects your judgement of a student’s performance?
a) Yes
b) No
4. Do you think dress code is important for interpreting evaluation session? Has your teacher ever talked to you about dress code?
a) No
b) Yes
5. Do you think your body language during the assessment would affect the assessors’ judgement of your performance?
a) No
b) Yes