
Graduate student: Joseph P. Lavallee (李炯方)
Thesis title: Validation Issues in an Angoff Standard Setting: A Facets-based Investigation (以多面向Rasch模式為基礎檢驗Angoff標準設定法的效度議題)
Advisor: Lin, Sieh-Hwa (林世華)
Degree: Doctor
Department: Department of Educational Psychology and Counseling
Year of publication: 2012
Academic year of graduation: 100
Language: English
Number of pages: 118
Keywords: standard setting, Angoff method, many-facet Rasch model, rater effects, Common European Framework of Reference, rating quality
Thesis type: Academic thesis
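  • In recent years, standard setting methods have flourished in educational practice, and the modified Angoff method is the most widely used among them. The Angoff method assumes that trained judges can, based on item difficulty, accurately estimate the probability that a minimally competent examinee who just meets the preset standard will answer each item correctly. Because standard setting relies on subjective judgment, it is important to find suitable tools for ensuring the quality of the judges' ratings. The many-facet Rasch model (MFRM) has been widely used in subjective rating contexts, and in standard setting procedures in particular, to check whether negative rater effects arise during rating and degrade rating quality. The MFRM, however, rests on the basic assumption that no group-level rater effects exist. Because most studies have only the rating data and cannot obtain relatively objective item difficulty data against which to test this assumption, very few studies have examined it. When the Angoff method is used, external item response data are often available in addition to the judges' estimates of item difficulty and of whether examinees can reach the preset standard. This study therefore uses the external item response data and the judges' data from an Angoff standard setting to cross-validate the basic assumption of the MFRM. It then uses the MFRM to examine the three assumptions of the Angoff method and the fit between the rating data and the model.
    To carry out the Angoff method, the researcher asked 18 English as a foreign language (EFL) teaching professionals to serve as judges and to link 40 English reading items and 40 English listening items to the B1 level of the Common European Framework of Reference. To detect negative rater effects, the study used the indices provided by the MFRM to detect three rater effects that commonly arise during rating: leniency/severity, inaccuracy, and centrality/extremism. The probabilities estimated in the Angoff standard setting served as the internal frame of reference, and the item difficulty estimates from the test administration served as the external frame of reference. First, the MFRM indices were used to detect rater effects in both frames, and the standard setting results in the two frames were compared. Second, raw scores and MFRM indices were used to test the basic assumptions of the Angoff method.
    The main findings are as follows:
    1. Across the two frames, the results for the judges' leniency/severity, inaccuracy, and centrality/extremism were inconsistent. This discrepancy raises doubts about using the Angoff method on its own to set cut scores. The test of the group-level effect assumption also found a group-level centrality effect within the internal frame of reference. This indicates that the presence of group-level rater effects should be checked before the MFRM is used.
    2. Regarding the assumptions of the Angoff method, the BPS and item functioning assumptions were violated. The more serious failure was that nearly all judges were unable to use the probability scale to assess the ability of minimally competent examinees.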

    Introduction: The use of standards-based scores in education has grown in recent years and the modified Angoff standard setting method is perhaps the most widely used procedure for establishing these standards. In this method, trained judges imagine students who just meet the standard in question and estimate the likelihood of their responding correctly to each item on the test being aligned to the standard. The method assumes that trained judges can accurately represent students who just meet the standard, represent how test items function and quantify their estimation of the likelihood of student success for each item. All three assumptions have been called into question. More generally, the subjective nature of all standard setting methods has resulted in a focused search for tools to evaluate the quality of judges’ decisions.
    The many-facet Rasch model (MFRM) has been proposed for use in detecting rater effects generally and for evaluating standard setting results in particular. Use of the MFRM, however, relies on the further assumption that no group-level rater effects exist. Because only internal, judge-generated data is available in most cases, this assumption is usually not evaluated and little research exists on how plausible the assumption is in real settings or on how robust results are to violations of the assumption. As external item response information often is available when the Angoff method is used, an Angoff setting provides a rare opportunity to test this assumption of the MFRM. Thus, the two-fold purpose of this study is to first evaluate the suitability of the many-facet Rasch model using data from an Angoff standard setting, and then to evaluate the assumptions of the Angoff method using the MFRM.
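For orientation, the generic rating-scale form of the many-facet Rasch model is sketched below. This is the commonly cited general formulation, not necessarily the specification fitted in this study, and the symbols are assumptions introduced here for illustration only.

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k
```

Here $P_{nijk}$ is the probability that object of measurement $n$ receives category $k$ from rater $j$ on item $i$, $\theta_n$ is the measure of $n$, $\delta_i$ is the difficulty of item $i$, $\alpha_j$ is the severity of rater $j$, and $\tau_k$ is the difficulty of category $k$ relative to category $k-1$. Rater effects such as leniency/severity are then read from the estimated $\alpha_j$ and the associated fit statistics.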

    Method: The data consisted of the first round estimates of a panel of 18 trained EFL professionals serving as judges in an operational Angoff standard setting linking two 40-item English exams (one reading, one listening) to the Common European Framework of Reference B1 proficiency level, and of the item response data from the original administration of the exams. MFRM indices were identified for the detection of three broad types of rater effects: leniency/severity, inaccuracy and centrality/extremism. These indices include estimated parameters and standard errors, residuals and residual-based indices, separation statistics and correlations between ratings and model indices. The probability estimates made by the Angoff judges were used to construct an ‘internal’ frame of reference, and the item difficulty estimates from the test administration were used to construct an ‘external’ frame of reference. Indices from the many-facet Rasch model were used to examine the subjective ratings of the Angoff judges for the presence of rater effects in both frames and the results were compared. In the second stage of the study, the assumptions of the modified Angoff method were assessed, using raw score and MFRM indices.
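The basic Angoff computation can be sketched as follows: each judge's cut score is the sum of that judge's probability estimates over the items, and the panel cut score is the mean of the judges' cut scores. The function name and the small data set are hypothetical and are not taken from the thesis, which used 18 judges and 40 items per exam.

```python
from statistics import mean


def angoff_cut_score(estimates):
    """Return (panel cut score, per-judge cut scores).

    estimates: one list per judge, holding that judge's estimated
    probability (0..1) that a borderline examinee answers each item correctly.
    """
    n_items = len(estimates[0])
    if any(len(judge) != n_items for judge in estimates):
        raise ValueError("every judge must rate the same number of items")
    # A judge's cut score is the expected raw score of the borderline examinee.
    per_judge = [sum(judge) for judge in estimates]
    # The panel cut score is the average of the judges' cut scores.
    return mean(per_judge), per_judge


# Hypothetical ratings: 3 judges x 4 items (illustration only).
ratings = [
    [0.60, 0.70, 0.50, 0.80],
    [0.55, 0.65, 0.45, 0.75],
    [0.70, 0.80, 0.60, 0.90],
]
panel_cut, judge_cuts = angoff_cut_score(ratings)
print(round(panel_cut, 2), [round(c, 2) for c in judge_cuts])
```

The spread of the per-judge cut scores gives a first raw-score impression of leniency/severity before any model-based indices are consulted.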

    Results: In the first phase, results differed across frames for all three rater effects. The leniency/severity indicators suggested greater agreement between judges in the internal frame than in the external frame, although a similar number of judges were flagged (four in both the internal and external frames for reading; two in the internal and three in the external frame for listening). Inaccuracy effects were sharply underestimated within the internal frame of reference: six judges were flagged in the internal frame and nine in the external frame for reading; for the listening test, two and four judges were flagged in the internal and external frames respectively. Results for centrality/extremism differed even more markedly: for the reading test, four judges were flagged for centrality and five for extremism in the internal frame while 17 judges were flagged for centrality in the external frame; for the listening test, 10 judges were flagged for centrality and one judge for extremism in the internal frame while all 18 judges were flagged for centrality in the external frame. Group-level indicators did indicate the presence of group-level centrality and inaccuracy effects within the internal frame of reference, suggesting their possible use in evaluating the assumption of the model prior to use.
    In terms of the assumptions of the Angoff method, the BPS and item functioning assumptions appear to have been violated to some extent, but the most striking failure was the inability of nearly all judges to accurately quantify their assessments using the probability scale. The ‘centrality’ or ‘central tendency’ bias, in particular, was displayed by nearly all judges, compressing the Angoff metric. This compression of the scale appears to have been largely responsible for the distorted results for the MFRM leniency/severity and centrality/extremism indices in the internal frame noted above. Further, this scale compression appears to have distorted the cut scores, leading to differences in pass/fail rates: for the reading test, the pass rates within the internal frame across the three rounds of the standard setting were 46.4%, 37.8% and 37.7%, while the corresponding pass rates in the external frame were 38.1%, 29.0% and 27.2%; for the listening test, the pass rates in the internal frame were 35.4%, 35.4% and 31.5%, compared to 31.0%, 31.0% and 27.1% in the external frame.
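One simple way to picture this scale compression is to compare a judge's probability estimates with the probabilities implied by a Rasch model for a borderline examinee, both on the logit scale. The sketch below is an illustration under stated assumptions, not the analysis used in the thesis: the borderline ability, the item difficulties, the simulated judge, and the use of a regression slope below 1 as a sign of centrality are all hypothetical.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def compression_slope(judge_probs, reference_probs):
    """Least-squares slope of judge logits on reference logits.

    A slope well below 1 means the judge's estimates are squeezed
    toward the middle of the scale relative to the reference.
    """
    x = [logit(p) for p in reference_probs]
    y = [logit(p) for p in judge_probs]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical external frame: borderline ability and item difficulties in logits.
theta_borderline = 0.0
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
reference = [rasch_probability(theta_borderline, b) for b in difficulties]

# Simulated judge whose logits are half as spread out as the reference.
compressed = [rasch_probability(0.5 * theta_borderline, 0.5 * b) for b in difficulties]

slope = compression_slope(compressed, reference)
print(round(slope, 3))
```

In this toy case the simulated judge's estimates span a narrower range of probabilities than the reference values, which is the kind of pattern a centrality bias would produce.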

    Discussion: The critical assumption underlying use of the MFRM for detecting rater effects was found not to hold in the present case, casting doubt on the use of the model in standard setting situations for which only internal data (from the judges’ estimates) is available. More positively, the group-level indicators within the internal frame were found to be sensitive to inaccuracy and centrality effects and thus may serve to help check the suitability of the model for use where no external data is available.
    The assumptions of the Angoff method were also found to be violated. In particular, a centrality or central tendency bias was shown to persist across all three rounds and to distort results. In view of previous research into central tendency, the present findings are consistent with the possibility that the Angoff method is inherently highly susceptible to the distorting effects of this bias. More generally, the centrality bias seems likely to pose a serious threat in many rating situations, both to the validity of ratings and to the accuracy of indicators used to evaluate these ratings.
    Future research should focus on refining our understanding of when the MFRM is likely to be appropriate for use; on solutions to problems with the Angoff method (perhaps in the form of procedural modifications or score adjustments); and on what rating situations are likely to be susceptible to the centrality bias and how it might be reduced or eliminated.

    ACKNOWLEDGEMENTS  i
    ABSTRACT (CHINESE)  ii
    ABSTRACT (ENGLISH)  iv
    TABLE OF CONTENTS  vii
    LIST OF FIGURES  viii
    LIST OF TABLES  ix
    CHAPTER 1 INTRODUCTION  1
        1.1 Significance of the Current Study  1
        1.2 Research Questions  3
        1.3 Terminology  4
    CHAPTER 2 LITERATURE REVIEW  6
        2.1 The Angoff Method: Assumptions and Validity Threats  6
        2.2 Detection of Rater Effects with the MFRM  15
        2.3 Assumption of the Use of the MFRM for Detecting Rater Effects  32
    CHAPTER 3 METHODS  35
        3.1 Methodological Overview  35
        3.2 Exam Items and Calibrations  36
        3.3 Angoff Standard Setting  37
        3.4 Analysis  42
    CHAPTER 4 RESULTS  46
        4.1 Assumption of the MFRM  46
        4.2 Assumptions of the Angoff Method  82
    CHAPTER 5 DISCUSSION AND CONCLUSION  87
        5.1 Summary of Results  87
        5.2 Implications and Suggestions  90
        5.3 Limitations of the Present Study  94
        5.4 Future Research Directions  94
    REFERENCES  96
    APPENDICES  106
        Appendix A Item Quality Statistics from Original Administration of Test  106
        Appendix B CEFR Scales Used to Provide Performance Level Descriptors  108
        Appendix C Angoff Judge Response Form  110
        Appendix D Results for all MFRM Indices  111

