Graduate student: 張國豐 (Chang, Kuo-Feng)
Thesis title: 分類目標與選題限制對於高階試題反應理論之電腦化分類測驗效能的影響 (The Influences of Target Classification Traits and Item Selection Constraints on the Efficiency of Computerized Classification Testing Using High-Order Item Response Theory)
Advisor: 陳柏熹 (Chen, Po-Hsi)
Degree: Master (碩士)
Department: 教育心理與輔導學系 (Department of Educational Psychology and Counseling)
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar)
Language: Chinese
Pages: 120
Keywords (Chinese): 高階試題反應理論、電腦化分類測驗、分類目標、選題限制
Keywords (English): high-order item response theory, computerized classification testing, target classification traits, item selection constraints
DOI: https://doi.org/10.6345/NTNU202205107
Document type: Academic thesis
This thesis applies high-order item response theory (HIRT) to the computerized classification testing (CCT) setting and examines how the target classification traits, the Fisher Information item selection method, the number of cutting points, and the maximum test length affect the efficiency of HIRT-based computerized classification testing (HIRT-CCT), in order to offer recommendations for implementing such tests in the future. The three-parameter HIRT was adopted as the test model, and the ability confidence interval (ACI) combined with item selection based on the provisional ability estimate (estimate-based, EB) served as the classification method. Under this design, the study compared three target classification traits (classifying on the second-order latent trait, classifying on the first-order latent traits, and classifying on both the first- and second-order latent traits), three Fisher Information (FI) item selection methods (maximizing information on the second-order latent trait, FI2; maximizing information on the first-order latent traits, FI1; and maximizing information on both, FI1+2), two numbers of cutting points (one and two), and four maximum test lengths (15, 30, 60, and 90 items for one cutting point; 30, 60, 90, and 120 items for two cutting points) with respect to HIRT-CCT performance. It further examined how adding item selection constraints to HIRT-CCT (item exposure control and content balancing) affects the classification results. The dependent variables were classification accuracy, average test length, maximum item exposure rate, item pool usage rate, and content balance (the percentage of items selected from each content area).
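The higher-order structure referred to above can be written out explicitly. The following is a minimal sketch in my own notation (the symbols \(\lambda_d\), \(\theta^{(2)}\), and \(\varepsilon\) are not taken from the thesis), assuming the common formulation in which each first-order trait is a linear function of a single second-order trait and item responses follow a three-parameter logistic model on the first-order trait:

\[
\theta_{jd} = \lambda_d\,\theta^{(2)}_{j} + \varepsilon_{jd}, \qquad \varepsilon_{jd}\sim N\!\bigl(0,\;1-\lambda_d^{2}\bigr),
\]
\[
P\bigl(X_{jid}=1 \mid \theta_{jd}\bigr) = c_{i} + \frac{1-c_{i}}{1+\exp\!\bigl[-D\,a_{i}\,(\theta_{jd}-b_{i})\bigr]},
\]

where \(j\) indexes examinees, \(d\) the first-order traits, and \(i\) the items measuring trait \(d\); \(\lambda_d\) is the factor loading linking trait \(d\) to the second-order trait; \(a_i\), \(b_i\), \(c_i\) are the discrimination, difficulty, and pseudo-guessing parameters of the 3PL model; and \(D\) is a scaling constant. The residual variance \(1-\lambda_d^{2}\) is one common standardization, not necessarily the one used in the thesis.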
The results showed that, among the three target classification traits, classifying on the first-order latent traits yielded results similar to classifying on both the first- and second-order latent traits, whereas classifying on the second-order latent trait produced results that differed from the other two. In addition, as the maximum test length increased, the classification accuracy of both the first- and second-order traits improved.
Regarding the three FI item selection methods: for classification accuracy, FI2 did not effectively improve the classification accuracy of the second-order trait. FI1 improved the classification accuracy of the first-order trait with the lowest factor loading (the trait with the second-highest loading remained unchanged or improved), but lowered the accuracy of the trait with the highest factor loading; as the maximum test length increased, however, the classification accuracy of the three FI methods converged. For the percentage of forced classifications and the average test length, FI1 yielded the smallest values when the classification target was the first-order traits or both the first- and second-order traits, whereas FI2 yielded the smallest values when the target was the second-order trait. For content balance, FI1+2 spread the selected items evenly across content areas; FI2 tended to select more items from the pool with the highest factor loading, and FI1 more items from the pool with the lowest factor loading, although the differences in the percentage of items selected per content area shrank as the maximum test length increased. Moreover, as the maximum test length increased, the percentage of forced classifications decreased while classification accuracy, average test length, and item pool usage rate increased; as the number of cutting points increased, classification accuracy decreased while the percentage of forced classifications, average test length, and item pool usage rate increased.
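To make the three FI criteria concrete, here is a minimal Python sketch under the linear higher-order structure written out above. The function names, the dictionary layout of the item pool, and the use of the squared loading to map first-order information onto the second-order trait are illustrative assumptions, not code or notation from the thesis.

```python
import numpy as np

def fisher_info_3pl(theta, a, b, c, D=1.0):
    """Fisher information of a 3PL item at ability theta (standard closed form)."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    q = 1 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1 - c)) ** 2

def select_item(items, theta_hat, lambdas, criterion="FI1+2"):
    """Pick the unused item maximizing the chosen information criterion.

    items: list of dicts with keys 'a', 'b', 'c', 'dim' (first-order trait index), 'used'.
    theta_hat: provisional estimates of the first-order traits.
    lambdas: factor loadings linking the first-order traits to the second-order trait.
    All structures are illustrative assumptions.
    """
    best, best_val = None, -np.inf
    for idx, it in enumerate(items):
        if it["used"]:
            continue
        d = it["dim"]
        info_1 = fisher_info_3pl(theta_hat[d], it["a"], it["b"], it["c"])
        # Under theta_d = lambda_d * theta2 + error, information about the
        # second-order trait is (illustratively) lambda_d**2 times info_1.
        info_2 = lambdas[d] ** 2 * info_1
        val = {"FI1": info_1, "FI2": info_2, "FI1+2": info_1 + info_2}[criterion]
        if val > best_val:
            best, best_val = idx, val
    return best
```

With a helper of this kind, FI1 ranks items only by information about the first-order trait being measured, FI2 only by the loading-weighted information about the second-order trait, and FI1+2 by their sum, which mirrors the three conditions compared above.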
Regarding the item selection constraints, adding item exposure control effectively capped the maximum item exposure rate, but slightly lowered classification accuracy, slightly raised the percentage of forced classifications and the average test length, and substantially raised the item pool usage rate. Adding content balancing control yielded an even distribution of items across content areas, but removed the differences among the three FI item selection methods.
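The two constraints discussed in the preceding paragraph can be layered on top of any maximum-information rule before the information criterion is applied. The sketch below is only illustrative and is not the procedure used in the thesis: content balancing restricts candidates to the content area currently furthest below its target proportion, and exposure control skips items whose running exposure rate has reached a ceiling (a crude stand-in for probabilistic methods such as Sympson–Hetter).

```python
def constrained_candidates(items, counts_by_content, targets, n_administered,
                           exposure_counts, n_examinees, r_max=0.2):
    """Filter the item pool before information-based selection.

    counts_by_content: items already given to this examinee, per content area.
    targets: desired proportion of the test for each content area (dict).
    exposure_counts: number of examinees who have seen each item so far.
    r_max: ceiling on the item exposure rate. All names are illustrative.
    """
    # Content balancing: pick the content area whose observed share lags
    # its target proportion the most.
    def deficit(area):
        share = counts_by_content[area] / max(n_administered, 1)
        return targets[area] - share
    needed_area = max(targets, key=deficit)

    candidates = []
    for idx, it in enumerate(items):
        if it["used"] or it["content"] != needed_area:
            continue
        # Exposure control: drop items already at the exposure ceiling.
        if exposure_counts[idx] / max(n_examinees, 1) >= r_max:
            continue
        candidates.append(idx)
    return candidates
```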
Overall, for all three target classification traits, HIRT-CCT performed best with a maximum test length of 30 and the FI1+2 selection method when there was one cutting point, and with a maximum test length of 60 and the FI1+2 method when there were two cutting points. Moreover, adding item exposure control and content balancing kept the item exposure rate within the specified bound, distributed the selected items evenly across content areas, and substantially improved the item pool usage rate, while having little effect on the efficiency of HIRT-CCT.
This study implements high-order item response theory (HIRT) in computerized classification testing (CCT) and investigates the influences of target classification traits, Fisher Information (FI) item selection methods, numbers of cutting points, maximum test lengths, and item selection constraints on the efficiency of HIRT-CCT. The 3PLM-HIRT was employed as the test model, and the ability confidence interval with estimate-based (EB) item selection was used as the classification method. Five independent variables were manipulated: (a) target classification traits: the second-order latent trait, the first-order latent traits, and both the second- and first-order latent traits; (b) FI item selection methods: maximizing FI for the second-order latent trait (FI2), for the first-order latent traits (FI1), and for both (FI1+2); (c) number of cutting points: 1 or 2; (d) maximum test length: 15, 30, 60, or 90 items for one cutting point and 30, 60, 90, or 120 items for two cutting points; and (e) item selection constraints: neither item exposure nor content balancing control, item exposure control only, content balancing control only, and both controls. Five major dependent variables were examined: (a) classification accuracy, (b) average test length, (c) maximum item exposure rate, (d) item pool usage rate, and (e) content balance (the percentage of selected items for each content area).
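The ability-confidence-interval rule named above decides, after each administered item, whether to classify the examinee or to continue testing. A minimal sketch of that decision is given below; the function name, the normal-approximation interval, and the way cut points partition the scale are my own assumptions for illustration.

```python
from scipy.stats import norm

def aci_decision(theta_hat, se, cut_points, alpha=0.05):
    """Ability-confidence-interval classification decision.

    Returns a category index if the (1 - alpha) confidence interval around
    the provisional estimate falls entirely between two adjacent cut points,
    otherwise None (keep testing). Names and the normal-approximation
    interval are illustrative assumptions.
    """
    z = norm.ppf(1 - alpha / 2)
    lower, upper = theta_hat - z * se, theta_hat + z * se
    bounds = [-float("inf")] + sorted(cut_points) + [float("inf")]
    for k in range(len(bounds) - 1):
        if bounds[k] <= lower and upper <= bounds[k + 1]:
            return k          # CI lies wholly inside category k: classify
    return None               # CI straddles a cut point: administer another item
```

When the maximum test length is reached before the interval clears every nearby cut point, the examinee has to be classified from the point estimate alone, which is presumably what the "forced classification" percentage in the results refers to.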
The main results are summarized as follows:
1. For the three types of target classification traits, the results for classifying on the first-order latent traits were similar to those for classifying on both the second- and first-order latent traits, whereas classifying on the second-order latent trait produced different results. In addition, classification accuracy increased as the maximum test length increased.
2. For the three FI item selection methods, in terms of classification accuracy, FI2 did little to improve the classification accuracy of the second-order latent trait. FI1 improved the classification accuracy of the first-order trait with the lowest factor loading (the trait with the second-highest loading remained unchanged or improved), but lowered the accuracy of the trait with the highest factor loading; however, the three methods converged as the maximum test length increased. In terms of the percentage of forced classification and the average test length, FI1 yielded the lowest percentage of forced classification and the shortest average test length when the target was the first-order traits or both the second- and first-order traits, whereas FI2 yielded the lowest values when the target was the second-order trait. In terms of content balance, the selected items were distributed nearly evenly across content areas under FI1+2; more items were selected from the pool with the highest factor loading under FI2, and more from the pool with the lowest factor loading under FI1, although the differences across content areas decreased as the maximum test length increased.
3. For the four maximum test lengths, as the maximum test length increased, the percentage of forced classification decreased, while classification accuracy, average test length, and pool usage rate increased.
4. For the two numbers of cutting points, as the number of cutting points increased, classification accuracy decreased, while the percentage of forced classification, average test length, and pool usage rate increased.
5. For the item selection constraints, although item exposure control kept the maximum item exposure rate under control, it slightly decreased classification accuracy, slightly increased the percentage of forced classification and the average test length, and substantially increased pool usage rate. As for content balancing control, although it maintained an even content distribution, it eliminated the differences among the three FI item selection methods.
In sum, for all three types of target classification traits, HIRT-CCT performed best with a maximum test length of 30 and the FI1+2 method in the one-cutting-point context, and with a maximum test length of 60 and the FI1+2 method in the two-cutting-point context. In addition, imposing item exposure and content balancing controls on HIRT-CCT not only kept the item exposure rate under control and maintained an even content distribution but also improved pool usage rate, while having little effect on the efficiency of HIRT-CCT.