馬場 正太郎
日本言語テスト学会誌 (ISSN:21895341)
vol.22, pp.44-64, 2019 (Released:2020-02-07)

The purpose of this study is to propose a way of generating beneficial washback from high-stakes testing, from the perspective of educational psychology. In the current educational climate in Japan, high school students are required to make the best use of language tests such as EIKEN, GTEC, TEAP, TOEFL, IELTS, and Cambridge English Exam in order to improve their four-skill English proficiency. Although this educational reform has been criticized for its aggressive implementation, there has been little discussion of how to induce beneficial washback while reducing negative washback. Therefore, it is necessary not only to debate the flaws of the reform but also to seek practical solutions. In this paper, the recent standards-based educational reform efforts in Japan are first reviewed briefly. Next, the concepts of washback and validity are introduced to argue that washback should be considered a consequential aspect of validity. Then, an effective way to induce beneficial washback is discussed on the basis of previous studies in educational psychology. Specifically, by introducing research on learners’ beliefs about tests, this study illustrates what kinds of beliefs lead to beneficial washback. Lastly, the practical implications of this study and the need for future research are discussed.
Takeshi KATO
Japan Language Testing Association
日本言語テスト学会誌 (ISSN:21895341)
vol.22, pp.23-43, 2019 (Released:2020-02-07)

Over the last four decades, the constructs of complexity, accuracy, and fluency have been a focus in the analysis of language learners’ performance. However, due to the polysemous nature of complexity, more and more sub-constructs have been proposed, making holistic measurement difficult. This study aims to construct a more appropriate measurement model of L2 complexity by implementing finer-grained and relatively novel linguistic indices that capture subordinate constructs not measurable by conventional indices. Using five natural language processing tools, conventional and fine-grained indices of complexity were computed from 503 argumentative essays written by Japanese learners of English. First, exploratory factor analysis was performed on the linguistic index values to extract the factor structures behind them. Second, confirmatory factor analysis was conducted to confirm whether the structure fit the data. Finally, a structural equation model in which complexity constructs predict essay scores was tested to evaluate its applicability to writing evaluation. The series of factor analyses showed that the extracted factor structures fit the data reasonably well for syntactic complexity (CFI = .901 and RMSEA = .071) and for lexical complexity (CFI = .978 and RMSEA = .051). Furthermore, the structural equation model, which was proposed as a predictive model, accounted for 32.3% of the variance in essay scores (CFI = .916 and RMSEA = .077). Overall, the findings showed the effectiveness of the proposed approach, which combined conventional linguistic features with fine-grained and relatively novel indices.
vol.15, pp.1-23, 2012

A test specification (spec) is a generative document from which equivalent language test items or tasks can be produced. There are many formats for specs, but all share two elements: sample(s) of the items/tasks and guiding language that describes the sample(s). Through consensus-building and feedback, specs evolve and stabilize. A complete illustrative test spec is presented, based on workshops held in Japan in late 2011. A new problem in spec-driven test development is posed in this paper: releasability, which refers to whether a spec should be shared outside of the test development team, and if so, when and in what form. The illustrative spec is again used to explore releasability. A number of theoretical questions are posed about spec release, and future research about spec release is encouraged.
David ALLEN Tatsuro TAHARA
Japan Language Testing Association
日本言語テスト学会誌 (ISSN:21895341)
vol.24, pp.3-22, 2021 (Released:2022-05-25)

Washback research in language education aims to demonstrate, explain, and ultimately predict the impact of tests on teaching and learning in educational contexts. A recent review in the international arena (Cheng et al., 2015) has revealed a rapidly growing field of empirical washback research, yet only two studies were identified as occurring in the Japanese context. The present article therefore sought to more fully document the washback research conducted in Japan prior to 2021 with the aim of facilitating future research in this important area. Following an extensive online search, 32 empirical washback studies in the Japanese context were identified. These studies were analyzed in terms of the following information: publication details, test(s) involved, context and participants, methodology, aspects of washback investigated, and type of consequence targeted. The review reveals a wealth of empirical literature that has adopted a variety of research methods and designs to investigate the impact of a variety of tests, notably that of university entrance exams. On the basis of these previous studies, a series of recommendations are made for future washback research in Japan.
Japan Language Testing Association
日本言語テスト学会誌 (ISSN:21895341)
vol.22, pp.3-22, 2019 (Released:2020-02-07)

This study aims to reveal whether any specific type of vocabulary learning strategy (VLS) leads to higher scores on semi-contextualized word meaning tests, a multiple-choice gap-filling format in which short written contexts are provided. A total of 132 first-year university students learning English as a foreign language completed a VLS questionnaire and a semi-contextualized word meaning test. The relationship between these two variables was examined using Pearson’s correlation analysis, confirmatory factor analysis, and exploratory factor analysis. The results demonstrated that the relationships between VLS use and test scores were very weak (all rs < .20), regardless of the strategy type. The smaller correlations compared to those reported in previous studies using vocabulary size tests may be caused by the more complicated constructs involved in the semi-contextualized word meaning test, which requires not only receptive knowledge of word meanings but also reading comprehension skills and knowledge of word forms and usage in a sentence. However, imagery strategies, such as creating a mental image of word forms, had a very weak but significant positive correlation with the test scores. Based on these results, this study further discusses how Japanese high school students who will take examinations that employ the semi-contextualized word meaning test format should learn vocabulary.
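As a rough illustration of the Pearson correlation analysis mentioned in this abstract, the following minimal Python sketch computes r between strategy-use ratings and test scores. The function name `pearson_r` and all data values are hypothetical and invented for illustration; they are not taken from the study.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has zero variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: mean questionnaire rating for one strategy type (1-5 scale)
# and word meaning test scores for six invented learners.
strategy_use = [2.0, 3.5, 4.0, 1.5, 3.0, 4.5]
test_scores = [14, 12, 15, 13, 16, 14]

r = pearson_r(strategy_use, test_scores)
print(f"r = {r:.3f}")
```

A real analysis would also need a significance test and a far larger sample; this sketch only shows how the coefficient itself is obtained.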
LEE WonKey
vol.15, pp.25-42, 2012

Korea's experiment with using an internet-based speaking test of English for university admission qualifications is unprecedented and hence still controversial. This paper briefly discusses how test-takers' English speaking performance can be tested and rated over the internet, and then considers the possibilities and problems of using this speaking test for university admission qualifications.
Japan Language Testing Association
日本言語テスト学会誌 (ISSN:21895341)
vol.22, pp.65-88, 2019 (Released:2020-02-07)

Integrated writing tasks are becoming popular in the field of language testing, but it remains unclear how teachers assess integrated writing tasks holistically and/or analytically and which approach is more effective. This exploratory study aims to investigate the reliability and validity of teacher-raters’ holistic and analytic ratings and to reveal their perceptions of grading the integrated writing task on the Test of English as a Foreign Language Internet-based Test (TOEFL iBT). Thirty-six university students completed a reading-listening-writing task. Seven raters scored the 36 compositions using both a holistic and an analytic scale, and completed a questionnaire about their perceptions of the scales. Results indicated that the holistic and analytic scales exhibited high inter-rater reliability and that there were high correlations between the two rating methods. In analytic scoring, which contained four dimensions, namely, content, organization, language use, and verbatim source use, the dimensions of content and organization were highly correlated with the overall analytic score (i.e., the mean score of the four dimensions). However, the dimension of verbatim source use was found to be peculiar in terms of construct validity for the analytic scale. The analyses also indicated various challenges the raters faced while scoring. Their perceptions varied particularly regarding verbatim source use: Some raters tended to emphasize the intricate process of textual borrowing while others stressed the difficulty in judging multiple types and degrees of textual borrowing. Pedagogical implications for the selection and use of rubrics as well as the teaching and assessment of source text use are suggested.
日本言語テスト学会誌 (ISSN:21895341)
vol.19, no.2, 2016

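Chapter 1: Historical Background of JLTA
  大友賢二
  Randy Thrasher
Chapter 2: JLTA and Testing Research
  James Dean Brown
  Oryang Kwon
  Anthony Green
  John Read
  Fred Davidson
  David Beglar
  Barry O'Sullivan
  WonKey Lee
  浪田克之介
  中村優治
  木下正義
  池田央
  羽鳥博愛
  田中正道
  野呂忠司
  柳瀬陽介
Chapter 3: Future Testing Research
3.1 Requirements for Appropriate Test Use
  3.1.1 Validity and reliability
  3.1.2 Washback and impact
  3.1.3 Fairness, codes of ethics, and standardization
3.2 Principles of Appropriate Test Development and Use
  3.2.1 Test purposes and constructs
  3.2.2 Test specifications
  3.2.3 Test formats for receptive skills
  3.2.4 Test formats for productive skills
  3.2.5 Item writing and task design
  3.2.6 Development of rating scales
  3.2.7 Rating (scoring) by raters and rater training
  3.2.8 Test standardization and equating
  3.2.9 Validation
  3.2.10 Feedback of results to stakeholders
3.3 Assessment of Language Knowledge and Skills
  3.3.1 Assessing listening
  3.3.2 Assessing reading
  3.3.3 Assessing monologic speaking
  3.3.4 Assessing dialogic speaking
  3.3.5 Assessing integrated speaking
  3.3.6 Assessing independent writing
  3.3.7 Assessing integrated writing
  3.3.8 Assessing vocabulary
  3.3.9 Assessing grammar
  3.3.10 Assessing spelling
  3.3.11 Assessing pronunciation
  3.3.12 Assessing Japanese as a second language
3.4 New Directions in Assessment
  3.4.1 Can-Do assessment
  3.4.2 Linking assessment to the Common European Framework of Reference for Languages
  3.4.3 Assessing the language ability of young learners
  3.4.4 Assessing English teachers
  3.4.5 Classroom assessment
  3.4.6 Assessment for specific purposes
  3.4.7 Computer-adaptive testing (theory)
  3.4.8 Computer-adaptive testing (practice)
3.5 Theories and Methods in Language Testing Research
  3.5.1 Classical test theory
  3.5.2 Generalizability theory
  3.5.3 Rasch analysis of dichotomous items
  3.5.4 Many-facet Rasch analysis
  3.5.5 Item response theory
  3.5.6 Latent rank theory
  3.5.7 Cognitive diagnostic modeling
  3.5.8 Differential item functioning
  3.5.9 Confirmatory factor analysis
  3.5.10 Multilevel analysis
  3.5.11 Meta-analysis
  3.5.12 Qualitative methods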
三上 明洋
日本言語テスト学会誌 (ISSN:21895341)
vol.21, pp.82-101, 2018 (Released:2018-12-24)

The aims of this study are to evaluate the content validity of a reflection tool for EFL teachers’ professional development in Japan, called the Self-Evaluation Checklist for EFL Teachers (SECEFLT), and to provide validity evidence for interpreting and using SECEFLT scores through Kane’s (2006) argument-based approach. SECEFLT was originally developed by Mikami (2015) to promote EFL teachers’ reflection on their professional competencies. It was revised by Mikami (2018) through a construct validation process using both exploratory and confirmatory factor analyses. To gather further validity evidence related to the content aspect of the revised SECEFLT, a survey was conducted with a panel of experts consisting of six English language teachers (all English language education majors) at teacher education departments in national universities in Japan. The experts were asked to evaluate the extent to which each item in the revised SECEFLT was relevant to the content domain it aimed to measure, as well as the overall relevance of the revised SECEFLT to that content domain. The results showed that each individual item in the scale had adequate content validity and that, judging from the individual item evaluations, the scale as a whole was also adequate. The experts also judged the revised SECEFLT to be content-valid when asked directly whether it was appropriate overall. Based on the study results, interpretive arguments are discussed using Kane’s (2006) framework for indicators of theoretical constructs.
Japan Language Testing Association
日本言語テスト学会誌 (ISSN:21895341)
vol.21, pp.3-20, 2018 (Released:2018-12-24)

Constructed-response tasks have captured the attention of testers and educators for some time (e.g., Cureton, 1951), because they present goal-oriented, contextualized challenges that prompt examinees to deploy cognitive skills and domain-related knowledge in authentic performances. Such performances present a distinct advantage when teaching, learning, and assessment focus on what learners can do rather than merely emphasizing what they know (Wiggins, 1998). Over the past several decades, communicative performance tasks have come to play a crucial role in language assessments on a variety of levels, from classroom-based tests, to professional certifications, to large-scale language proficiency exams (Norris, 2009, 2016). However, the use of such tasks for assessment purposes remains contentious, and numerous language testing alternatives are available at potentially lower cost and degree of effort. In order to facilitate decisions about when and why to adopt task-based designs for language assessment, I first outline the relationship between assessment designs and their intended uses and consequences. I then introduce two high-stakes examples of language assessment circumstances (job certification and admissions testing) that suggest a need for task-based designs, and I review the corresponding fit of several assessments currently in use for these purposes. In relation to these purposes, I also suggest some of the positive consequences of task-based designs for language learners, teachers, and society, and I point to the dangers of using assessments that do not incorporate communicative tasks or do so inappropriately. I conclude by highlighting other circumstances that call for task-based designs, and I suggest how advances in technology may help to address associated challenges.
MIN Hoky
vol.17, pp.3-15, 2014

This paper introduces the writing section of the National English Ability Test (NEAT), developed for college entrance examinations in Korea. It discusses the nature of the NEAT writing tasks, the rating domains, and the rater training procedure.
vol.17, pp.43-58, 2014

In the construction of Sentence Repetition (SR) tests, both sentence length and the pause between the presentation of the target sentence and its repetition have been found to play an important role in SR performance. The present study investigated how these two factors influence task difficulty and the concurrent validity of SR with TOEIC scores. To this end, 79 Japanese high school students took the English SR test under two conditions: with and without a pause. The SR sentences were of varying lengths. The results showed that, while long sentences were more difficult to recall correctly than short ones, the pause made short sentences more difficult and long sentences easier to repeat. Multiple regression analyses indicated that performance under the most difficult condition, in which the test-takers repeated long sentences without any pause, showed the highest concurrent validity with TOEIC. The implications of the present findings for calibrating SR for second language test-takers, along with some suggestions for further research, are discussed.
vol.17, pp.19-39, 2014

This study investigates how distractors function in multiple-choice listening tests. Conventionally, each distractor is evaluated by its attractiveness. In other words, distractors that can plausibly be chosen by test takers are considered to perform well on the test. On the other hand, distractors that are chosen by few test takers are recognized as performing poorly. Considering that test takers have to choose only one among several options, it can be assumed that some unselected distractors may have in fact performed adequately. Therefore, it is prudent to independently analyze the attractiveness of distractors. A total of 75 Japanese university students evaluated their confidence in selecting both correct and incorrect answers. The results indicated that (a) the least chosen distractors were not always the least attractive, (b) less proficient listeners were more likely to be allured by distractors, and (c) more proficient listeners were more likely to answer with higher confidence. The researcher explains the process of eliminating distractors and reevaluates the unselected distractors.
vol.17, pp.59-80, 2014

The purpose of the present study is to examine the appropriateness of item response theory (IRT) for language testing. Although this issue has long been discussed by many researchers (e.g., Henning, 1992; Blais & Laurier, 1995), the appropriateness of IRT has not been demonstrated using language testing data, especially listening test data. Therefore, the dimensionality of listening test data was examined by several approaches in the present study. The 50 questions in the listening section of the TOEFL Sample Test (6th edition) were administered to 392 students as a part of their usual General English classes. The data were analyzed using two approaches: the factor-analytic approach and the principal component analytic approach. The analyses of the 30 questions in Part A suggested that a second dimension was less likely to exist. As for the analyses of the 20 questions in Parts B and C, however, two out of the five analyses affirmed the possibility of a second dimension. These mixed results suggested that whether multidimensionality is detected may depend on the methods used. Moreover, it was found that the different tasks tended to measure different dimensions, even though they seem to measure the same language skill. In addition, there was a fairly large amount of unexplained variance in the data, which suggests a great deal of noise that could not be aligned along dimensions. An implication is that tests using IRT should be more construct-valid. Lastly, it is observed that unidimensionality is a continuum rather than one value of a binary variable, and that whether it appears depends on the methods used to seek it.
日本言語テスト学会誌 (ISSN:21895341)
vol.15, pp.101-114, 2012

This study investigated the relationship between the TOEIC Bridge and TOEIC test scores, in particular, the extent to which the TOEIC Bridge test scores can predict the TOEIC test scores. The participants in this study were 292 non-English major students who took both the TOEIC Bridge and TOEIC tests in 2009. They were first-year students enrolled in a private university in Western Japan. Their scores on both tests were statistically examined using regression analysis. The results of the study showed that (1) the scores of the TOEIC Bridge and TOEIC tests were moderately correlated and (2) the TOEIC Bridge scores significantly predicted the TOEIC scores. Equations for estimating the TOEIC scores using the TOEIC Bridge scores were also specified, from which a comparison of the predicted TOEIC scores from the ETS study and the present study was constructed. The results of the comparison showed that the predicted scores from the two studies had similar intercepts and slopes for a certain range of TOEIC Bridge scores, but that the predicted scores diverged above this range.
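The kind of estimating equation described in this abstract can be sketched as an ordinary least-squares line of the form predicted TOEIC = intercept + slope × TOEIC Bridge. The Python below is only an illustration under stated assumptions: the helper names `fit_line` and `predict` are hypothetical, and the score pairs are invented, not the study's data or its reported coefficients.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when the predictor has zero variance")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Return the predicted outcome for a single predictor value."""
    return intercept + slope * x


# Invented score pairs (TOEIC Bridge, TOEIC) for illustration only.
bridge_scores = [100, 116, 128, 140, 152, 160]
toeic_scores = [255, 310, 370, 400, 470, 480]

intercept, slope = fit_line(bridge_scores, toeic_scores)
print(f"predicted TOEIC = {intercept:.1f} + {slope:.2f} * Bridge")
print(f"prediction for a Bridge score of 130: {predict(intercept, slope, 130):.0f}")
```

Comparing two such equations, as the abstract describes, amounts to comparing their intercepts and slopes over the range of predictor scores where both were estimated.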