
Researcher: Yung-Chin Yen (顏永進)
Thesis Title: A 4PL-Based Error-Correction Mechanism for Reviewable Computerized Adaptive Testing (於可回溯電腦化適性測驗中加入4PL錯誤校正機制)
Advisor: Ho, Rong-Guey (何榮桂)
Degree: Doctoral
Department: Graduate Institute of Information and Computer Education
Year of Publication: 2010
Academic Year of Graduation: 98 (ROC calendar; 2009–2010)
Language: English
Number of Pages: 108
Keywords: item response theory (IRT), computerized adaptive testing (CAT), reviewable CAT, upper asymptote parameter, four-parameter logistic (4PL) IRT model, rearrangement procedure
Document Type: Academic thesis
Abstract (translated from the Chinese):

    During the administration of a computerized adaptive test (CAT), giving examinees the opportunity to review and revise items they have already answered may let them correct items missed through carelessness, so that the test result better matches their ability; this is the basic assumption of reviewable CAT. However, allowing examinees to change their responses during a CAT may leave the subsequent items unable to estimate the examinee's ability effectively, which in turn biases the test result and lowers the precision of ability estimation. This study aimed to use the four-parameter logistic IRT model (4PL IRT model) to reduce the impact of such review-induced inappropriate items on ability estimation, and to compare its ability-estimation performance with that of the three-parameter logistic IRT model (3PL IRT model).

    The study consisted of three experiments. The first, a simulation, and the second, an empirical study, examined how the upper asymptote parameter affects the precision and efficiency of CAT ability estimation under simulated and real testing conditions, respectively. The third, a simulation, examined whether the 4PL IRT model can effectively reduce the ability-estimation bias caused by review-induced inappropriate items. The results showed that the 4PL IRT model alleviates the underestimation of ability caused by careless errors early in the test and provides more precise ability estimates than the 3PL IRT model; under normal administration conditions, the 4PL model also improves overall testing efficiency. In addition, combined with the rearrangement procedure, the 4PL IRT model resolves the ability-estimation problems that inappropriate items create in reviewable CAT, improving both the precision and the efficiency of ability estimation. Finally, the empirical results showed that the English ability of female senior high school students was significantly higher than that of their male peers.
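
    For reference, a common parameterization of the 4PL item response function named in the keywords (the 3PL model is the special case d_j = 1) is

        \[
        P_j(\theta) = c_j + \frac{d_j - c_j}{1 + e^{-a_j(\theta - b_j)}},
        \]

    where \(\theta\) is the examinee's ability and \(a_j\), \(b_j\), and \(c_j\) are item \(j\)'s discrimination, difficulty, and lower-asymptote (pseudo-guessing) parameters. The upper asymptote \(d_j < 1\) leaves even a high-ability examinee a probability \(1 - d_j\) of a careless miss, which bounds how strongly a single early error can distort the likelihood.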

Abstract (English):

    The underlying hypothesis of reviewable CAT is that, after rereading or rethinking an item, examinees may correct careless mistakes they have made. The test score would then be closer to the examinee's actual ability once those mistakes are corrected, and prohibiting item review in CAT may therefore lead to underestimating examinees' abilities. However, changing the answer to one item in a CAT may make the subsequent items no longer appropriate for estimating the examinee's ability, and such inappropriate items in a reviewable CAT can introduce bias into ability estimation and decrease its precision. This study evaluated the performance of the four-parameter logistic (4PL) model by comparing it with the three-parameter logistic (3PL) model, and applied it to reduce the impact of inappropriate items on reviewable CAT.

    Three experiments were conducted in this study. The first two, one a simulation and one empirical, evaluated the performance of the 4PL IRT model by comparing the measurement precision and efficiency of 3PL- and 4PL-based CAT under simulated and empirical conditions; the third examined whether implementing the 4PL model reduces the impact of inappropriate items on reviewable CAT. The results indicated that the 4PL IRT model could (1) improve the estimation precision of CAT under a poor-start administration condition, (2) improve the estimation efficiency of CAT under a normal administration condition, and (3) serve as a valuable solution for reducing the estimation bias introduced by inappropriate items in reviewable CAT. Finally, the language achievement of female senior-high-school examinees was higher than that of males in both midterm scores and CAT-estimated ability.
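
    To make the error-correction mechanism concrete, below is a minimal illustrative sketch in Python. The item parameters, the six-item test, and the grid-search maximum-likelihood estimator are hypothetical stand-ins, not the thesis's actual item bank or estimation method; the sketch only shows why an upper asymptote d < 1 lets the ability estimate recover from an early careless slip, the effect described above.

        import math

        def p_4pl(theta, a, b, c, d):
            # 4PL item response function; setting d = 1 recovers the 3PL model.
            return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

        def mle_theta(items, responses):
            # Illustrative grid-search maximum-likelihood ability estimate.
            # items: (a, b, c, d) tuples; responses: 1 = correct, 0 = incorrect.
            grid = [g / 100.0 for g in range(-400, 401)]  # theta in [-4, 4]
            def loglik(theta):
                ll = 0.0
                for (a, b, c, d), u in zip(items, responses):
                    p = p_4pl(theta, a, b, c, d)
                    ll += math.log(p if u else 1.0 - p)
                return ll
            return max(grid, key=loglik)

        # A strong examinee slips on the easiest item, answers the next four
        # correctly, and misses only the hardest one (hypothetical data).
        difficulties = [-1.5, 0.0, 0.5, 1.0, 1.5, 2.5]
        responses = [0, 1, 1, 1, 1, 0]
        items_3pl = [(1.5, b, 0.2, 1.00) for b in difficulties]
        items_4pl = [(1.5, b, 0.2, 0.98) for b in difficulties]

        # With d = 0.98 the early slip contributes at least log(1 - 0.98) to
        # the log-likelihood at any theta, instead of an unbounded penalty
        # under the 3PL model, so the 4PL estimate sits noticeably higher.
        print("3PL theta-hat:", mle_theta(items_3pl, responses))
        print("4PL theta-hat:", mle_theta(items_4pl, responses))

    Running the sketch yields a 4PL estimate well above the 3PL one for the same response pattern, mirroring the thesis's finding that the 4PL model alleviates underestimation after a poor start.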

Table of Contents:

    List of Tables
    List of Figures
    Chapter 1. Introduction
        1.1. Background and Motivation
        1.2. Purpose
        1.3. Scope and Limitation
    Chapter 2. Literature Review
        2.1. Item Response Theory
            2.1.1. IRT Assumptions
            2.1.2. One-, Two-, and Three-Parameter Logistic IRT Models
            2.1.3. 4PL IRT Model
            2.1.4. Examinee Ability Estimation
        2.2. Computerized Adaptive Testing
        2.3. Reviewable CAT
            2.3.1. Arguments against Allowing Review
            2.3.2. Arguments Supporting Item Review
        2.4. Solutions for Reviewable CAT
            2.4.1. Limiting Answer Review and Change Procedure
            2.4.2. Rearrangement Procedure
        2.5. Gender Differences in Language Achievement
    Chapter 3. Method
        3.1. Experiment 1
            3.1.1. Participants
            3.1.2. Item Bank
            3.1.3. Simulation Procedure
        3.2. Experiment 2
            3.2.1. Participants
            3.2.2. Item Bank
            3.2.3. Procedure
        3.3. Experiment 3
            3.3.1. Procedure
    Chapter 4. Results and Discussion
        4.1. Experiment 1
            4.1.1. The Theta Convergence of 3PL- and 4PL-Based CAT
            4.1.2. The Precision of 3PL- and 4PL-Based CAT
            4.1.3. The Efficiency of 3PL- and 4PL-Based CAT
        4.2. Experiment 2
            4.2.1. The Theta Convergence of 3PL- and 4PL-Based CAT
            4.2.2. The Precision of 3PL- and 4PL-Based CAT
            4.2.3. The Efficiency of 3PL- and 4PL-Based CAT
        4.3. Experiment 3
            4.3.1. The Precision of Four Solutions for Reviewable CAT
            4.3.2. The Efficiency of Four Solutions for Reviewable CAT
    Chapter 5. Conclusion and Suggestion
        5.1. Discussion
        5.2. Suggestion
        5.3. Conclusion

