Author
馬場 正太郎
Publisher
日本言語テスト学会
Journal
日本言語テスト学会誌 (ISSN:21895341)
Volume, issue, pages and publication date
vol.22, pp.44-64, 2019 (Released:2020-02-07)

The purpose of this study is to propose, from the perspective of educational psychology, a way of generating beneficial washback from high-stakes testing. In the current educational climate in Japan, high school students are required to make the best use of language tests such as EIKEN, GTEC, TEAP, TOEFL, IELTS, and Cambridge English Exam in order to improve their four-skill English proficiency. Although this educational reform has been criticized for its aggressive implementation, there has been little discussion of how to induce its beneficial washback while reducing its negative washback. It is therefore necessary not only to debate the flaws of the reform but also to seek practical solutions. In this paper, recent standards-based educational reform efforts in Japan are first reviewed briefly. Next, the concepts of washback and validity are introduced to argue that washback should be considered a consequential aspect of validity. Then, an effective way to induce beneficial washback is discussed on the basis of previous studies in educational psychology. Specifically, by introducing research on learners’ beliefs about tests, this study illustrates what kinds of beliefs lead to beneficial washback. Lastly, the practical implications of this study and the need for future research are discussed.
Author
前田 啓朗
Publisher
日本言語テスト学会
Journal
外国語教育評価学会研究紀要
Volume, issue, pages and publication date
no.3, pp.119-126, 2000-09-01

This paper first reviews what are regarded as the preferable ways to measure and analyse conceptual variables such as language proficiency, motivation, attitude, and strategy use. Since it is difficult to measure such concepts using only a few indexes, many observed variables are used for that purpose. Exploratory Factor Analysis (EFA), a kind of multivariate analysis, can be used to posit latent variables (factors) behind the observed ones. Caution is needed here, because each solution offered by EFA is not absolute and varies according to how the EFA is conducted. Secondly, this paper investigates five nationwide journals (ARELE, JACET Bulletin, JALT Journal, JLTA Journal, Language Laboratory) that are published in Japan and focus broadly on language teaching and learning, covering the five years from 1995 to 1999. All 15 EFAs conducted in them are analysed and described according to the method set out above. Finally, points to be aware of in further study are noted on the basis of the research tendencies observed over these five years.
Author
DAVIDSON Fred
Publisher
日本言語テスト学会
Journal
日本言語テスト学会誌
Volume, issue, pages and publication date
vol.15, pp.1-23, 2012
Number of citations
1

A test specification (spec) is a generative document from which equivalent language test items or tasks can be produced. There are many formats for specs, but all share two elements: sample(s) of the items/tasks and guiding language that describes the sample(s). Through consensus-building and feedback, specs evolve and stabilize. A complete illustrative test spec is presented, based on workshops held in Japan in late 2011. A new problem in spec-driven test development is posed in this paper: releasability, which refers to whether a spec should be shared outside of the test development team, and if so, when and in what form. The illustrative spec is again used to explore releasability. A number of theoretical questions are posed about spec release, and future research about spec release is encouraged.
Author
前田 啓朗
Publisher
日本言語テスト学会
Journal
日本言語テスト学会研究紀要
Volume, issue, pages and publication date
no.6, pp.140-147, 2004-08-30

This paper presents (1) the limitations of causal analyses, (2) how causal analyses are conducted in English language education research in Japan, (3) the problems seen in those causal analyses, and (4) how those problems can be addressed in further research. Causal analysis, especially analysis based on the multiple regression model, is a powerful tool for predicting a dependent variable from several independent variables. However, when the focus is on the degree of causal effect of each independent variable, the problem of multicollinearity, which is caused by correlations among the independent variables, arises. Moreover, when a stepwise method is used to decide which independent variables should be included, multicollinearity may again cause problems by deleting independent variables that reasonably seem to contribute to the dependent variable. After reviewing these limitations of multiple regression models, eleven articles in English language education research in Japan were re-examined in terms of those problems. Finally, some suggestions, such as using correlation analysis instead of regression models, are presented.
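As a rough illustration of the multicollinearity problem described in this abstract, the sketch below computes the variance inflation factor (VIF) for a regression with two independent variables, where the VIF of each predictor reduces to 1 / (1 - r^2), with r the correlation between the two predictors. The function names and the toy scores are assumptions made for this example; they are not taken from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor regression.

    With exactly two predictors, regressing one on the other gives
    R^2 = r^2, so VIF = 1 / (1 - r^2).
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical toy scores (illustration only, not data from the paper).
vocabulary = [10, 12, 15, 18, 20, 25]
grammar = [11, 13, 14, 19, 21, 24]

print("r   =", round(pearson_r(vocabulary, grammar), 3))
print("VIF =", round(vif_two_predictors(vocabulary, grammar), 1))
```

A VIF near 1 means the predictors are nearly uncorrelated; a large VIF signals that the individual regression coefficients are unstable, which is the situation the abstract warns about when coefficients are read as degrees of causal effect.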
Author
LEE WonKey
Publisher
日本言語テスト学会
Journal
日本言語テスト学会誌
Volume, issue, pages and publication date
vol.15, pp.25-42, 2012

Korea's experiment in using an internet-based English speaking test for university admission qualifications is unprecedented and hence still controversial. This paper briefly discusses how test-takers' English speaking performance can be tested and rated over the internet, and then considers the possibilities and problems of using this speaking test for university admission qualifications.
Authors
田中 博晃 前田 啓朗
Publisher
日本言語テスト学会
Journal
日本言語テスト学会研究紀要
Volume, issue, pages and publication date
no.6, pp.128-139, 2004-08-30

The purpose of this study was to examine the construct of amotivation. When amotivation is measured, negative items in a questionnaire cause attenuation of correlation, which may bias the construct of amotivation. A questionnaire was constructed on the basis of Noels, Pelletier, Clement, and Vallerand (2000), and it included both positive items (P-type) and negative items of amotivation (N-type). By analyzing the questionnaire data using Confirmatory Factor Analysis to correct for attenuation, we examined the systematic error caused by negative items. The results showed that (1) an artificial factor was identified when positive and negative items of amotivation were analyzed by Exploratory Factor Analysis; (2) the construct of amotivation was supported when a 7-factor model of motivation was examined by applying Confirmatory Factor Analysis to the P-type questionnaire; and (3) the P-type questionnaire was more appropriate than the N-type questionnaire as a measure of amotivation, because bipolarity between amotivation and self-determined forms of motivation was clearly identified in the P-type questionnaire.
Publisher
日本言語テスト学会
Journal
日本言語テスト学会誌 (ISSN:21895341)
Volume, issue, pages and publication date
vol.19, no.2, 2016

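Chapter 1: Historical Background of JLTA
  大友賢二, ランディ・スラッシャー
Chapter 2: JLTA and Testing Research
  ジェームズ・ディーン・ブラウン, クォン・オリャン, アンソニー・グリーン, ジョン・リード, フレッド・デヴィッドソン, デイヴィッド・ベグラー, バリー・オサリバン, ウォンキー・リー, 浪田克之介, 中村優治, 木下正義, 池田央, 羽鳥博愛, 田中正道, 野呂忠司, 柳瀬陽介
Chapter 3: The Future of Testing Research
3.1 Requirements for Appropriate Test Use
  3.1.1 Validity and reliability
  3.1.2 Washback and impact
  3.1.3 Fairness, codes of ethics, and standardization
3.2 Principles of Appropriate Test Development and Use
  3.2.1 Test purpose and constructs
  3.2.2 Test specifications
  3.2.3 Test formats for receptive skills
  3.2.4 Test formats for productive skills
  3.2.5 Item writing and task design
  3.2.6 Development of rating scales
  3.2.7 Rating (scoring) by raters and rater training
  3.2.8 Test standardization and equating
  3.2.9 Validation
  3.2.10 Feedback of results to stakeholders
3.3 Assessment of Language Knowledge and Skills
  3.3.1 Assessing listening
  3.3.2 Assessing reading
  3.3.3 Assessing monologic speaking
  3.3.4 Assessing dialogic speaking
  3.3.5 Assessing integrated-skills speaking
  3.3.6 Assessing independent writing
  3.3.7 Assessing integrated-skills writing
  3.3.8 Assessing vocabulary
  3.3.9 Assessing grammar
  3.3.10 Assessing spelling
  3.3.11 Assessing pronunciation
  3.3.12 Assessing Japanese as a second language
3.4 New Directions in Assessment
  3.4.1 Can-Do assessment
  3.4.2 Linking assessment to the Common European Framework of Reference for Languages
  3.4.3 Assessing the language ability of young learners
  3.4.4 Assessing English teachers
  3.4.5 Classroom assessment
  3.4.6 Assessment for specific purposes
  3.4.7 Computer-adaptive testing (theory)
  3.4.8 Computer-adaptive testing (practice)
3.5 Theories and Methods in Language Testing Research
  3.5.1 Classical test theory
  3.5.2 Generalizability theory
  3.5.3 Rasch analysis of dichotomous items
  3.5.4 Many-facet Rasch analysis
  3.5.5 Item response theory
  3.5.6 Latent rank theory
  3.5.7 Cognitive diagnostic modeling
  3.5.8 Differential item functioning
  3.5.9 Confirmatory factor analysis
  3.5.10 Multilevel analysis
  3.5.11 Meta-analysis
  3.5.12 Qualitative methods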
Author
柳瀬 陽介
Publisher
日本言語テスト学会
Journal
日本言語テスト学会研究紀要 (ISSN:2433006X)
Volume, issue, pages and publication date
vol.11, pp.77-95, 2008-09-20 (Released:2017-08-07)

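This three-dimensional understanding of language communication ability reorganizes related concepts on the basis of the discussion accumulated in earlier theories of language communication ability. It is not a mere change of terminology: the main original contributions of this paper are that it (a) emphasizes the workings of mind-reading ability, (b) reinstates physical ability, and (c) makes explicit the two senses of "knowledge" in linguistic ability. It would, however, be going too far to claim that these points were entirely absent from earlier theories. Appendix 1 relates, somewhat forcibly, the concepts of past theories to the concepts of this paper. Have these improvements overcome the five challenges stated in the "Purpose" section of this paper? Challenge (1) was to explain, more theoretically than Bachman's concept of strategic competence does, the process by which knowledge of language is used in communication. This was achieved by positing mind-reading ability (a), which showed that, prior to linguistic communication, humans have mechanisms for reading other people's minds, as typified by "theory of mind", and that, particularly in communication that makes sophisticated use of language, language use follows the principle of relevance. Challenge (2) was to make explicit the role of the body in linguistic communication. The paper not only revived the "psychophysiological mechanisms" once proposed by Bachman (1990) in the form of "linguistic physical ability", but also, by positing "non-linguistic physical ability", showed that there is a domain that applied linguistics has not valued but whose importance is keenly felt in everyday life. Challenge (3) was to clarify, at least to some degree, the interactive nature of linguistic communication. By foregrounding the concept of mind-reading ability, the paper showed that a theory of language communication ability that does not concretely assume a specific communication partner is insufficient as a theory of communication. This, however, remains an interactivity in which the other is incorporated within the individual, and it could still be called an individualistic conception. The "emergence" of communication, which Hymes (1972) described in pioneering fashion, has also not yet been discussed. This remains a task for the future (discussed below). Challenge (4) was to develop a discussion biased toward neither the language pole nor the communication pole. This challenge was overcome by treating mind-reading ability and linguistic ability as independent, representing them orthogonally, and showing theoretically, on the resulting two-dimensional plane, the range of variation from communication achieved almost by mind-reading ability alone to linguistic communication that relies heavily on linguistic ability. Challenge (5) was to aim for a discussion that gives an overview of language communication ability as a whole. This is considered to have been achieved by explaining the whole with a simple structural schema of three factors (three dimensions) while developing an argument that allows more detailed explanation within each dimension. The three-dimensional understanding of language communication ability presented in this paper thus builds on the development of earlier theories while making a new and original contribution.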
Author
三上 明洋
Publisher
日本言語テスト学会
Journal
日本言語テスト学会誌 (ISSN:21895341)
Volume, issue, pages and publication date
vol.21, pp.82-101, 2018 (Released:2018-12-24)

The aims of this study are to evaluate the content validity of a reflection tool for EFL teachers’ professional development in Japan, called the Self-Evaluation Checklist for EFL Teachers (SECEFLT), and to provide validity evidence for interpreting and using SECEFLT scores through Kane’s (2006) argument-based approach. SECEFLT was originally developed by Mikami (2015) to promote EFL teachers’ reflection on their professional competencies. It was revised by Mikami (2018) through a construct validation process using both exploratory and confirmatory factor analyses. To gather further validity evidence related to the content aspect of the revised SECEFLT, a survey was conducted with a panel of experts consisting of six English language teachers (all English language education majors) at teacher education departments in national universities in Japan. The experts were asked to evaluate the extent to which each item in the revised SECEFLT was relevant to the content domain it aimed to measure, as well as the overall relevance of the revised SECEFLT to that content domain. The results showed that each individual item in the scale was appropriate in terms of content validity and that, judging from the individual item evaluations, the scale as a whole was also appropriate. It was also confirmed that the experts judged the revised SECEFLT to be content-valid when asked directly whether it was appropriate overall. Based on the study results, interpretive arguments are discussed using Kane’s (2006) framework for indicators of theoretical constructs.
Authors
HIRAI Akiyo KOIZUMI Rie
Publisher
日本言語テスト学会
Journal
日本言語テスト学会研究紀要
Volume, issue, pages and publication date
no.11, pp.1-20, 2008-09-20

Among the different types of rating scales used in scoring speaking performance, the EBB (Empirically derived, Binary-choice, Boundary-definition) scale is claimed to be easy to use and highly reliable (Turner & Upshur, 1996; 2002). However, it has been questioned whether the EBB scale can be applied to other tasks. Thus, in this study, an EBB scale was compared with an analytic scale in terms of validity, reliability, and practicality. Fifty-two EFL learners were asked to read and retell four stories in a semi-direct Story Retelling Speaking Test (SRST). Their performances were scored using these two rating scales, and the scores were then compared using generalizability theory, a multitrait-multimethod approach, and a questionnaire delivered to the raters. As a result, the EBB scale, which consists of four criteria, was found to be more generalizable (i.e., reliable) than the analytic scale and to generally assess the intended constructs. However, the present EBB scale turned out to be less practical than the analytic scale because of its binary format and because it had more levels in each criterion. Further revisions seeking a better scale for the SRST are suggested.
Author
MIN Hoky
Publisher
日本言語テスト学会
Journal
日本言語テスト学会誌
Volume, issue, pages and publication date
vol.17, pp.3-15, 2014

This paper introduces the writing section of the National English Ability Test (NEAT), developed for the college entrance exam in Korea. It discusses the nature of the NEAT writing tasks, the rating domains, and the rater training procedure.
Authors
SUNADA Midori SUZUKI Yuichi
Publisher
日本言語テスト学会
Journal
日本言語テスト学会誌
Volume, issue, pages and publication date
vol.17, pp.43-58, 2014

In the construction of a Sentence Repetition Test (hereafter, SR), both sentence length and the pause between the presentation of the target sentence and its repetition have been found to play an important role in SR performance. The present study investigated how these two factors influence task difficulty and the concurrent validity of SR with TOEIC scores. To this end, 79 Japanese high school students took the English SR under two conditions: with and without a pause. The SR sentences were of varying lengths. The results showed that, while long sentences were more difficult to recall correctly than short ones, the pause made short sentences more difficult and long sentences easier to repeat. Multiple regression analyses indicated that performance under the most difficult condition, where the test-takers repeated long sentences without any pause, showed the highest concurrent validity with TOEIC. The implications of the present findings for calibrating SR for second language test-takers, along with some suggestions for further research, are discussed.
Author
IIMURA Hideki
Publisher
日本言語テスト学会
Journal
日本言語テスト学会誌
Volume, issue, pages and publication date
vol.17, pp.19-39, 2014

This study investigates how distractors function in multiple-choice listening tests. Conventionally, each distractor is evaluated by its attractiveness. In other words, distractors that are plausibly chosen by test takers are considered to perform well on the test, whereas distractors that are chosen by few test takers are regarded as performing poorly. Considering that test takers have to choose only one among several options, it can be assumed that some unselected distractors may in fact have performed adequately. Therefore, it is prudent to analyze the attractiveness of each distractor independently. A total of 75 Japanese university students evaluated their confidence in selecting both correct and incorrect answers. The results indicated that (a) the least chosen distractors were not always the least attractive, (b) less proficient listeners were more likely to be attracted to distractors, and (c) more proficient listeners were more likely to answer with higher confidence. The researcher explains the process of eliminating distractors and reevaluates the unselected distractors.
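The conventional approach mentioned in this abstract, judging a distractor by how often it is chosen, can be sketched as a simple tally of option-selection proportions for one item. The function name, the option labels, and the toy responses below are assumptions made for this illustration and do not come from the study; the confidence-rating analysis the study itself used is not reproduced here.

```python
from collections import Counter


def option_proportions(responses, options):
    """Return the share of test takers who chose each option of one item.

    responses: list of chosen option labels, one per test taker.
    options:   iterable of all option labels offered for the item.
    """
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    total = len(responses)
    # Options nobody chose still appear, with a proportion of 0.0.
    return {opt: counts.get(opt, 0) / total for opt in options}


# Hypothetical responses to one four-option item (illustration only).
toy_responses = ["A", "B", "A", "C", "A", "B", "A", "A"]
for option, share in option_proportions(toy_responses, "ABCD").items():
    print(option, round(share, 3))
```

In this tally an option that nobody selects gets a proportion of zero, which is exactly the kind of distractor the abstract argues may still have been attractive before being eliminated.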
Author
TSUCHIHIRA Taiko
Publisher
日本言語テスト学会
Journal
日本言語テスト学会誌
Volume, issue, pages and publication date
vol.17, pp.59-80, 2014

The purpose of the present study is to examine the appropriateness of item response theory (IRT) for language testing. Although this issue has long been discussed by many researchers (e.g., Henning, 1992; Blais & Laurier, 1995), the appropriateness of IRT has not been demonstrated using language testing data, especially listening test data. Therefore, the dimensionality of listening test data was examined by several approaches in the present study. The 50 questions in the listening section of the TOEFL Sample Test (6th edition) were administered to 392 students as a part of their usual General English classes. The data were analyzed using two approaches: the factor-analytic approach and the principal component analytic approach. The analyses of the 30 questions in Part A showed little possibility of a second dimension. In the analyses of the 20 questions in Parts B and C, however, two out of the five analyses affirmed the possibility of a second dimension. These mixed results suggest that whether multidimensionality is detected may depend on the methods used. Moreover, it was found that different tasks tended to measure different dimensions, even though they seem to measure the same language skill. In addition, there was a fairly large amount of unexplained variance in the data, which suggests a great deal of noise that could not be aligned along dimensions. One implication is that tests using IRT should be more construct-valid. Lastly, it is observed that unidimensionality is a continuum rather than one position of a binary variable, and its appearance depends on the methods used to seek it.
Author
Shizuka Tetsuhito
Publisher
日本言語テスト学会
Journal
日本言語テスト学会研究紀要
Volume, issue, pages and publication date
vol.6, pp.108-127, 2004

The purpose of this study was to explore the potential of 'invisible-gap filling' items primarily as an in-house achievement measure for reading-oriented courses and secondarily as a more general overall-ability measure. More specifically, it compared multiple-matching 'invisible-gap filling' items and their 'visible' counterparts in terms of item facility, item discrimination, test reliability, and test validity. Eighty-eight Japanese first-year university students took a 25-item invisible-gap filling test and its visible counterpart, along with two 25-item c-tests, the combination of which constituted a semester-end examination for a reading-oriented course. The invisible and visible gap filling tests were based on the same passage covered in the course. The target words (i.e., words to fill the gaps) were also the same in both versions, making the salience of the gaps the only difference between the two. Hence, any differences in psychometric properties between the two versions should be attributable to the difference in gap visibility. One c-test was created from a passage already covered in class and the other from a new passage. The former served as an achievement criterion while the latter was considered a proficiency criterion. Results indicated that the invisible-gap filling items had (1) lower facility values, (2) higher discriminations, (3) higher reliability, (4) higher validity as an achievement measure, and (5) higher validity as a proficiency measure than their visible counterparts. Based on these findings, it is contended that invisible gap filling is a technique that can be used to produce reliable and valid achievement tests with relative ease. After discussing possible limitations of the format, two possible modifications are proposed.
Author
斉田 智里
Publisher
日本言語テスト学会
Journal
日本言語テスト学会研究紀要
Volume, issue, pages and publication date
no.10, pp.119-133, 2007-10-01

This research compared concurrent calibration under a polytomous IRT model and a dichotomous IRT model using English achievement test data. Two forms of an English achievement test for senior high school students were composed of testlets (groups of items) to eliminate the effect of dependence among within-testlet items. The two forms were equated with common testlets through a polytomous IRT model. The testlet parameter estimates and the category characteristic curves were analyzed on a common scale. The results showed that one form was more difficult than the other, as the test designers had intended. The mean of the ability parameter estimates for the more difficult form was higher than that for the easier form. These findings yielded useful feedback for test designers. Item parameter estimates of independent dichotomous items, ability parameter estimates, and the amount of test information derived by concurrent calibration under the graded response model (polytomous IRT model) and the two-parameter logistic model (dichotomous IRT model) were compared. The results showed similar parameter estimates for the two IRT models. The standard errors of the ability parameter estimates for the two models were also highly correlated. The two-parameter logistic model provided a greater amount of test information than the graded response model.
Author
YOSHIDA Hiroko
Publisher
日本言語テスト学会
Journal
日本言語テスト学会誌 (ISSN:21895341)
Volume, issue, pages and publication date
vol.15, pp.101-114, 2012

This study investigated the relationship between the TOEIC Bridge and TOEIC test scores, in particular, the extent to which the TOEIC Bridge test scores can predict the TOEIC test scores. The participants in this study were 292 non-English major students who took both the TOEIC Bridge and TOEIC tests in 2009. They were first-year students enrolled in a private university in Western Japan. Their scores on both tests were statistically examined using regression analysis. The results of the study showed that (1) the scores of the TOEIC Bridge and TOEIC tests were moderately correlated and (2) the TOEIC Bridge scores significantly predicted the TOEIC scores. Equations for estimating the TOEIC scores using the TOEIC Bridge scores were also specified, from which a comparison of the predicted TOEIC scores from the ETS study and the present study was constructed. The results of the comparison showed that the predicted scores from the two studies had similar intercepts and slopes for a certain range of TOEIC Bridge scores, but that the predicted scores diverged above this range.
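The kind of prediction equation this abstract refers to can be illustrated with an ordinary least-squares line that predicts one test score from another. This is only a minimal sketch: the function names and the toy score pairs are assumptions invented for the example, and the intercept and slope it yields have no relation to the equations reported in the study or by ETS.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x_new):
    """Predicted y for a new x under the fitted line."""
    return intercept + slope * x_new


# Hypothetical (predictor score, criterion score) pairs, illustration only.
bridge_scores = [100, 110, 120, 130, 140, 150]
toeic_scores = [250, 290, 310, 360, 380, 430]

b0, b1 = fit_line(bridge_scores, toeic_scores)
print("intercept =", round(b0, 2), "slope =", round(b1, 2))
print("predicted score at 125:", round(predict(b0, b1, 125), 1))
```

Two fitted lines, such as those from two different studies, can then be compared by evaluating `predict` over the same range of predictor scores and checking where the predictions begin to diverge.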
Author
Sugita Yoshihito
Publisher
日本言語テスト学会
Journal
日本言語テスト学会研究紀要
Volume, issue, pages and publication date
no.13, pp.21-40, 2010-11-15

This article examines the main data from a task-based writing performance test in which five junior high school teachers participated as novice raters. The purpose of this research is to implement a task-based writing test (TBWT), developed on the basis of a construct-based processing approach to testing, and to examine the reliability and validity of the assessment tasks and rating scales. Accuracy and communicability were defined as constructs, and test development proceeded in three stages: designing and characterizing writing tasks, reviewing existing scoring procedures, and drafting rating scales. Each of the forty scripts collected from twenty undergraduate students was scored by the five novice raters, and the analyses were done using FACETS. The results indicated that all novice raters displayed acceptable levels of self-consistency, and that there were no significant differences in scoring across the two tasks and the overall impression, which provided reasonable fit to the Rasch model. The modified scales associated with the five rating categories and their specific written samples were shown to be mostly comprehensible and usable by raters, and the results demonstrated that the students' ability was effectively measured using these tasks and rating scales. However, further research is needed on how to eliminate inter-rater differences.
Author
Sato Takanori
Publisher
日本言語テスト学会
Journal
日本言語テスト学会研究紀要
Volume, issue, pages and publication date
vol.13, pp.1-20, 2010

The purpose of the present study was to examine the validity of 16 can-do items taken from the EIKEN can-do list (STEP, 2008). A total of 2,571 Japanese junior high school students were asked to assess their degree of confidence in the 16 can-do statements: four items each from EIKEN Grade 5, Grade 4, Grade 3, and Grade Pre-2. The present study employed the Rasch model to investigate whether (a) the items are unidimensional, (b) their item difficulty is appropriate, (c) item difficulty correlates with the items' EIKEN grades, and (d) the students' confidence levels correlate with their proficiency levels. The results showed that the can-do items are highly reliable and unidimensional. However, the students tended to feel that the items were unchallenging, especially the speaking and listening items.