
Graduate Student: 張夏石 (Michael Scott Sommers)
Thesis Title: An evaluation of judges in an Angoff standard setting
Advisor: 陳柏熹
Degree: Doctor
Department: Department of Educational Psychology and Counseling
Year of Publication: 2017
Academic Year of Graduation: 105
Language: English
Pages: 120
Keywords: Angoff, judges, standard setting
DOI: https://doi.org/10.6345/NTNU202203072
Thesis Type: Academic thesis
    Abstract (Chinese)

    In a standard setting, expert judges match Performance Level Descriptors (PLDs) to scores on a standardized test and use them to classify students' performance. This procedure typically determines what scores mean for students and how decision makers use the test: for example, Pass/Fail decisions, or Excellent/Average/Fail classifications. In other words, such decisions are closely tied to the evaluations made by standard setting judges. In a typical standard setting, a panel of expert judges is trained, judges whether examinees at a given performance level can answer the test items correctly, and then discusses the judgments together. The organizers of the standard setting provide feedback so that judges understand how their decisions would affect examinees' pass and fail rates and other uses of the test. In addition, throughout the standard setting, judges are asked during training to self-report their familiarity with and confidence in the relevant concepts and ideas, and whether they are applying them correctly. The Angoff method is one of the most widely used methods for setting cutscores. In this method, a panel of expert judges makes judgments about students' ability to answer the test items correctly, one item at a time. Although this procedure is very important, little is known about how best to prepare judges for their role in a standard setting.
    The data for this study were collected from a standard setting that matched items from a locally developed foreign language test, created at a university in Taiwan, to the Common European Framework of Reference (CEFR); both a listening panel and a reading panel were conducted. The study examined the relationship between two commonly used methods of preparing judges for an Angoff standard setting and judge accuracy. Judge accuracy was measured by the correlation between judgments and item difficulty (the p-value correlation), the Root Mean Square Error (RMSE), and the Cutoff Score Judgment (CSJ). In the first evaluation, judges were trained in the PLDs and then tested on their knowledge of the PLDs as it relates to the test; no relationship was found between this knowledge and judge accuracy. In the second evaluation, judges' familiarity with and confidence in the concepts and ideas introduced during training were correlated with their measured accuracy; again, no relationship was found between familiarity or confidence and the final measured accuracy. Beyond these main findings, it was further observed that the precise wording of the instructions is very important to judge accuracy, and that RMSE and CSJ lead to different decisions about accuracy than the p-value correlation. Conclusions and suggestions for future research on training Angoff standard setting judges are offered, and the limitations of the study are noted.
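    For reference, the cutscore implied by Angoff judgments is conventionally computed as below. This is a standard textbook sketch, not a formula quoted from the thesis, and the notation is assumed: $p_{ji}$ is judge $j$'s estimated probability that a borderline examinee answers item $i$ correctly, with $I$ items and $J$ judges.

    $$\hat{c} \;=\; \frac{1}{J}\sum_{j=1}^{J}\sum_{i=1}^{I} p_{ji}$$

    Each judge's implied cutscore is the sum of that judge's item judgments (the expected raw score of a borderline examinee), and the panel cutscore is the mean of these sums across judges.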

    Abstract (English)

    In a standard setting, groups of expert judges evaluate verbal descriptions of performance (Performance Level Descriptors, or PLDs) contained in a standard and match them to scores on a standardized test that place students into performance categories. This procedure is often used to decide what scores mean for the students and policy makers who use the tests; for example, Pass/Fail decisions, as well as Excellent/Average/Fail decisions, are often tied to how tests are evaluated by standard setting judges. In a typical standard setting, panels of expert judges are trained, evaluate test items, and are then given time to discuss their results with other judges. Feedback provided by the standard setting organizers lets judges see how their decisions would affect students' Pass/Fail rates and other decisions the test will be used to make. In addition, throughout the standard setting, judges are asked to give self-reports about their familiarity with and confidence in their understanding of the concepts and ideas presented during training, and about whether they are applying them correctly. The Angoff method is one of the most widely used methods for setting cutscores. In this method, panels of expert judges make judgments about the ability of students to correctly answer test items presented one at a time. Despite the importance of this procedure, little is known about how best to prepare judges for their role in the standard setting. Data were gathered from a standard setting held at a university in Taiwan to match items from a locally developed foreign language test with the Common European Framework of Reference (CEFR). Both a listening and a reading panel were conducted. The study evaluated two commonly used methods of preparing judges for an Angoff standard setting and their relationship with judge accuracy. Judge accuracy was measured by the p-value correlation, the Root Mean Square Error (RMSE), and the Cutoff Score Judgment (CSJ). For the first evaluation, judges were trained in the PLDs and then given a test of their knowledge of the PLDs, whose scores were compared with the three measures of judge accuracy. No relationship was found between tested knowledge of the PLDs and judge accuracy. The second evaluation correlated familiarity with and confidence in the concepts and ideas introduced during the training period with the measured accuracy of the judge. Once again, no relationship was found between familiarity and confidence and the final measured accuracy of the judge. In addition to the main findings, it was also observed that the exact wording of the instructions given to judges is very important to their accuracy, and that RMSE and CSJ lead to different decisions about accuracy than the p-value correlation. Future directions for research on the training of Angoff standard setting judges are suggested, as are the limitations of this study.
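    The three accuracy measures named above can be illustrated with a short sketch. The code below is not from the thesis: the ratings and p-values are invented, NumPy is an assumed dependency, and CSJ is operationalized here as the conventional Angoff cutscore (the sum of a judge's item judgments), which may differ from the exact definition used in the study.

    import numpy as np

    # ratings[j, i]: judge j's Angoff judgment for item i -- the estimated
    # probability that a borderline (minimally competent) examinee answers
    # item i correctly. All values are invented for illustration.
    ratings = np.array([
        [0.60, 0.45, 0.80, 0.55, 0.70],  # judge 1
        [0.55, 0.50, 0.75, 0.60, 0.65],  # judge 2
        [0.70, 0.40, 0.85, 0.50, 0.75],  # judge 3
    ])

    # p_values[i]: empirical proportion correct (item p-value) for item i.
    p_values = np.array([0.58, 0.47, 0.82, 0.53, 0.69])

    for j, judge in enumerate(ratings, start=1):
        # p-value correlation: Pearson r between the judge's ratings
        # and the empirical item p-values.
        r = np.corrcoef(judge, p_values)[0, 1]
        # RMSE between the judge's ratings and the item p-values.
        rmse = np.sqrt(np.mean((judge - p_values) ** 2))
        # Implied cutoff score: the sum of the item judgments, i.e., the
        # expected raw score of a borderline examinee.
        cutscore = judge.sum()
        print(f"judge {j}: r = {r:.2f}, RMSE = {rmse:.3f}, cutscore = {cutscore:.2f}")

    # Panel-level cutscore: the mean of the judges' implied cutscores.
    print(f"panel cutscore: {ratings.sum(axis=1).mean():.2f}")

    One reason the measures can disagree, as the abstract reports, is that Pearson r is invariant to a uniform shift in a judge's ratings, while RMSE and the implied cutscore are not: a judge who rank-orders items perfectly but rates every item 0.15 too high shows a high correlation yet a large RMSE and an inflated cutscore.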

    ACKNOWLEDGEMENTS
    ABSTRACT (CHINESE)
    ABSTRACT (ENGLISH)
    TABLE OF CONTENTS
    LIST OF TABLES
    CHAPTER 1 INTRODUCTION
      1.1 Significance of the Current Research
      1.2 Research Questions
      1.3 Terminology
    CHAPTER 2 LITERATURE REVIEW
      2.1 Standard Setting Method
      2.2 The Angoff Method
      2.3 Training and the Angoff Standard Setting Method
      2.4 Problems with the Angoff Method
    CHAPTER 3 METHODS
      3.1 Materials
      3.2 Judges
      3.3 Procedures
      3.4 Assessment Tools
      3.5 Assessment Expectations
      3.6 Data Analysis
    CHAPTER 4 RESULTS
    CHAPTER 5 CONCLUSIONS & DISCUSSION
      5.1 Summary of Results
      5.2 Other Important Findings
      5.3 Future Research Directions
      5.4 Limitations of the Present Study
    REFERENCES
    APPENDIXES
      Appendix 1 Common European Framework of Reference - Global Scale
      Appendix 2 Informed Consent Form
      Appendix 3 Security Form
      Appendix 4 Angoff Panelist Record Form
      Appendix 5 Panelist Information Form
      Appendix 6 PART I. Procedures
      Appendix 7 PART II. Common European Framework
      Appendix 8 PART III. The University Practical English Test
      Appendix 9 Review of Standard Setting Procedures
      Appendix 10 Angoff Standard Setting. Final Evaluation
      Appendix 11 Cutscore statistics for the Standard Setting – reading
      Appendix 12 Cutscore statistics for the Standard Setting – listening

    REFERENCES
    American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
    Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (pp. 508-600). Washington, DC: American Council on Education.
    Bond, T. G., & Fox, C. M. (2001). Applying the Rasch Model: Fundamental Measurement in the Human Sciences. Mahwah, NJ: Erlbaum.
    Brandon, P. R. (2004). Conclusions about frequently studied modified Angoff standard-setting topics. Applied Measurement in Education, 17(1), 59–88.
    Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2002). A comparison of Angoff and Bookmark standard setting methods. Journal of Educational Measurement, 39(3), 253-263.
    Chi, M. T. H., Glaser, R., & Farr, M. J. (Eds.). (1988). The Nature of Expertise. Hillsdale, NJ: Erlbaum.
    Cizek, G. J. (1996). An NCME instructional module on setting passing scores. Educational Measurement: Issues and Practice, 15(2), 20-31.
    Cizek, G. J. (2001). Conjectures on the rise and call of standard setting: An introduction to context and practice. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 3-17). Mahwah, NJ: Erlbaum.
    Cizek, G. J. (Ed.). (2012). Setting performance standards: Foundations, methods, and innovations (2nd ed.). New York, NY: Routledge.
    Cizek, G. J. (2012a). The forms and functions of evaluations in the standard setting process. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 165-178). New York, NY: Routledge.
    Cizek, G. J., & Bunch, M. B. (2007). Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests. Thousand Oaks, CA: Sage.
    Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23(4), 31–50.
    Clauser, J. C., Margolis, M. J., & Clauser, B. E. (2014). An examination of the replicability of Angoff standard setting results within a generalizability theory framework. Journal of Educational Measurement, 51(2), 127-140.
    Clauser, B. E., Mee, J., Baldwin, S. G., Margolis, M. J., & Dillon, G. F. (2009). Judges' use of examinee performance data in an Angoff standard‐setting exercise for a medical licensing examination: An experimental study. Journal of Educational Measurement, 46(4), 390-407.
    Clauser, B. E., Mee, J., & Margolis, M. J. (2013). The effect of data format on integration of performance data into Angoff judgments. International Journal of Testing, 13(1), 65-85.
    Clauser, B. E., Swanson, D. B., & Harik, P. (2002). A multivariate generalizability analysis of the impact of training and examinee performance information on judgments made in an Angoff-style standard-setting procedure. Journal of Educational Measurement, 39(4), 269–290.
    Council of Europe. (2001). Common European framework of reference for languages. Cambridge: Cambridge University Press.
    Council of Europe. (2009). Manual for relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.
    Crocker, L., & Zieky, M. (1994). Joint Conference on Standard Setting for Large-Scale Assessments. Washington, DC: National Assessment Governing Board.
    Cronbach, L. J. (1988). Five perspectives on validation argument. In H. Wainer & H. Braun (Eds.), Test Validity (pp. 3–17). Hillsdale, NJ: Erlbaum.
    Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.
    Cross, L. H., Impara, J. C., Frary, R. B., & Jaeger, R. M. (1984). A comparison of three methods for establishing minimum standards on the National Teacher Examinations. Journal of Educational Measurement, 21(2), 113-129.
    Egan, S. J., Dick, M., & Allen, P. J. (2012). An experimental investigation of standard setting in clinical perfectionism. Behaviour Change, 29(3), 183-195.
    Elman, B. A. (2000). A cultural history of civil examinations in late imperial China. University of California Press.
    Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
    Engelhard, G. (2007). Evaluating bookmark judgments. Rasch Measurement Transactions, 21, 1097-1098.
    Engelhard, G., & Anderson, D. W. (1998). A binomial trials model for examining the ratings of standard setting judges. Applied Measurement in Education, 11(3), 209-230.
    Fitzpatrick, A. R. (1989). Social influences in standard setting: The effects of social interaction on group judgments. Review of Educational Research, 59(3), 315-328.
    George, S., Haque, M. S., & Oyebode, F. (2006). Standard setting: comparison of two methods. BMC Medical Education, 6(1), 46.
    Glaser, R. (1963). Instructional technology and the measurement of learning outcomes. American Psychologist, 18(8), 519–522.
    Goodwin, L.D. (1999). Relations between observed item difficulty levels and Angoff minimum passing levels for a group of minimally competent examinees. Applied Measurement in Education, 12(1), 13-28.
    Green, D. R., Trimble, C. S., & Lewis, D. M. (2003). Interpreting the results of three different standard setting procedures. Educational Measurement: Issues and Practice, 22(1), 22–32.
    Halpin, G., Sigmon, G., & Halpin, G. (1983). Minimum competency standards set by three divergent groups of raters using three judgmental procedures. Educational and Psychological Measurement, 47(1), 977-983.
    Hambleton, R. K. (1980). Test score validity and standard-setting methods. In R. A. Berk (Ed.), Criterion-referenced measurement: The state of the art (pp. 80-123). Baltimore, MD: Johns Hopkins University Press.
    Hambleton, R. K. (2001). Setting performance standards on educational assessments and criteria for evaluating the process. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 89-116). Mahwah, NJ: Erlbaum.
    Hambleton, R. K., Pitoniak, M. J., & Copella, J. M. (2012). Essential steps in setting performance standards on educational tests and strategies for assessing the reliability of results. In Cizek G. J. (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 47–76). New York, NY: Routledge.
    Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433-470). Westport, CT: Praeger.
    Hertz, N. R., & Chinn, R. N. (2002, April). The role of deliberation style in standard setting for licensing and certification examinations. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
    Holden, R. (2010). Face validity. In I. B. Weiner & W. E. Craighead (Eds.), The Corsini Encyclopedia of Psychology (4th ed., pp. 637-638). Hoboken, NJ: Wiley.
    Hurtz, G. M., & Auerbach, M. A. (2003). A meta-analysis of the effects of modifications to the Angoff method on cutoff scores and judgment consensus. Educational and Psychological Measurement, 63(4), 584–601.
    Huynh, H., & Schneider, C. (2005). Vertically moderated standards: Background, assumptions, and practices. Applied Measurement in Education, 18(1), 99-113.
    Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688.
    Impara, J.C., & Plake, B.S. (1998). Teachers’ ability to estimate item difficulty: A test of the assumptions in the Angoff standard-setting method. Journal of Educational Measurement, 35(1), 69-81.
    Jaeger, R. M. (1991). Selection of judges for standard‐setting. Educational Measurement: Issues and Practice, 10(2), 3-14.
    Johnson, E. J. (1988). Expertise and decision under uncertainty: Performance and process. In M. Chi, R. Glaser, & M. J. Farr (Eds.), The Nature of Expertise. (pp. 209-228). Hillsdale, NJ: Lawrence Erlbaum Associates.
    Kaftandjieva, F. (2010). Methods for Setting Cut Scores in Criterion-referenced Achievement Tests: A Comparative Analysis of Six Recent Methods with an Application to Tests of Reading in EFL. EALTA publication. Retrieved March 25, 2013, from http://www.ealta.eu.org/documents/resources/FK_second_doctorate.pdf
    Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527-535.
    Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 19–51). Mahwah, NJ: Lawrence Erlbaum Associates.
    Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: Praeger.
    Larkin, J. H., McDermott, J., Simon, D. P., & Simon, H. A. (1980). Expert and novice performance in solving physics problems. Science, 208, 1335-1342.
    Lavallee, J. (2012). Validation issues in an Angoff standard setting: A facets-based investigation. Unpublished PhD dissertation, Department of Educational Psychology and Counseling, National Taiwan Normal University, Taipei, Taiwan.
    Linn, R. L. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher, 32, 3-13.
    Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems: Implications of requirements of the No Child Left Behind Act of 2001. Educational Researcher, 31, 3–16.
    Linn, R. L., & Shepard, L. A. (1997). Item-by-item standard setting: Misinterpretations of judge’s intentions due to less than perfect item inter-correlations. In Council of Chief
    State School Officers National Conference on Large Scale Assessment, Colorado Springs, CO.
    Lissitz, R. W. & Huynh H. (2003). Vertical equating for state assessments: Issues and solutions in determination of adequate yearly progress and school accountability. Practical Assessment, Research & Evaluation, 8(10). Retrieved March 25, 2012 From http://pareonline.net/getvn.asp?v=8&n=10
    Lissitz, R. W. & Wei, H. (2008).Consistency of standard setting in an augmented state testing system. Educational Measurement, 27(2), 46-56.
    Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635–694.
    Loomis, S. C. (2012). Selecting and training standard setting participants. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd ed., pp. 107-134). New York, NY: Routledge.
    Lorge, I., & Kruglov, L. (1953). A suggested technique for the improvement of difficulty prediction of test items. Educational and Psychological Measurement, 12(4), 554-561.
    Margolis, M. J., & Clauser, B. E. (2014). The impact of examinee performance information on judges' cut scores in modified Angoff standard-setting exercises. Educational Measurement: Issues and Practice, 33(1), 15-22.
    McGinty, D. (2005). Illuminating the “Black Box” of standard setting: An exploratory qualitative study. Applied Measurement in Education, 18(3), 269–287.
    Mee, J., Clauser, B. E., & Margolis, M. J. (2013). The impact of process instructions on judges’ use of examinee performance data in Angoff standard setting exercises. Educational Measurement: Issues and Practice, 32(3), 27-35.
    Messick, S. (1981). Constructs and their vicissitudes in educational and psychological measurement. Psychological Bulletin, 89(3), 575–588.
    Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13–103). Washington, DC: American Council on Education and National Council on Measurement in Education.
    Messick, S. (1998). Test validity: A matter of consequence. Social Indicators Research, 45(1-3), 35–44.
    Michigan State Department of Education. (2007, February). Retrieved from http://www.michigan.gov/documents/mde/MI-ELPA_Tech_Report_final_199596_7.pdf
    Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark procedure: Psychological perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 249-281). Mahwah, NJ: Erlbaum.
    National Council on Measurement in Education. (2015). Retrieved from http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorV
    Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14(2), 3-19.
    Nelson, D. S. (1994). Job analysis for licensure and certification exams: Science or politics? Educational Measurement: Issues and Practice, 13(3), 29-35.
    Norcini, J., Lipner, R., Langdon, L., & Strecker, C. (1987). A comparison of three variations on a standard-setting method. Journal of Educational Measurement, 24(1), 56-64.
    Norcini, J. J. & Shea, J. A. (1997). The credibility and comparability of standards. Applied Measurement in Education, 10(1), 39–59.
    Plake, B., & Giraud, G. (1998). Effect of a modified Angoff strategy for obtaining item performance estimates in a standard setting study. Paper presented at the Annual Meeting of the American Educational Research Association. San Diego, Calf.
    Plake, B. S., Melican, G. J., & Mills, C. N. (1991). Factors Influencing Intrajudge Consistency During Standard‐Setting. Educational Measurement: Issues and Practice, 10(2), 15-16.
    Raymond, M. R., & Reid, J. B. (2001). Who made thee a judge? Selecting and training participants for standard setting. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 119-157). Mahwah, NJ: Erlbaum.
    Reckase, M. D. (2000). The ACT/NAGB standard setting process: How "modified" does it have to be before it is no longer a modified-Angoff process? Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
    Reckase, M. D. (2006). Rejoinder: Evaluating standard setting methods using error models proposed by Schulz. Educational Measurement: Issues and Practice, 25(3), 14-17.
    Roach, A. T., McGrath, D., Wixon, C., & Talapatra, D. (2010). Aligning an early childhood assessment to state kindergarten content standards: application of a nationally recognized alignment framework. Educational Measurement: Issues and Practice, 29(1), 25-37.
    Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88(2), 413-428.
    Schafer, W. D. (2005). Criteria for standard setting from the sponsor’s perspective. Applied Measurement in Education, 18(1), 61-81.
    Schoonheim‐Klein, M., Muijtjens, A., Habets, L., Manogue, M., Van Der Vleuten, C., & Van der Velden, U. (2009). Who will pass the dental OSCE? Comparison of the Angoff and the borderline regression standard setting methods. European Journal of Dental Education, 13(3), 162-171.
    Shepard, L.A. (1980). Standard setting issues and methods. Applied Psychological Measurement, 4(4), 447-467.
    Shepard, L. A. (1994). Implications for standard setting of the National Academy of Education evaluation of the National Assessment of Educational Progress achievement levels. In Proceedings of the joint conference on standard setting for large-scale assessments of the National Assessment Governing Board and the National Center for Educational Statistics (pp. 143–159). Washington, DC: U.S. Government Printing Office.
    Smith, R. L., & Smith, J. S. (1988). Differential use of item information by judges using Angoff and Nedelsky procedures. Journal of Educational Measurement, 25(4), 259-274.
    Taube, K.T. (1997). The incorporation of empirical item difficulty data in the Angoff standard-setting procedure. Evaluation and the Health Professions, 20(4), 479-498.
    Taylor, J. (2014, July 17). Difference Between Within-Subject and Between-Subject [Blog post]. Retrieved from http://www.statsmakemecry.com/smmctheblog/within-subject-and-between-subject-effects-wanting-ice-cream.html
    van de Watering, G., & van der Rijt, J. (2006). Teachers’ and students’ perceptions of assessments: A review and a study into the ability and accuracy of estimating the difficulty levels of assessment items. Educational Research Review, 1(2), 133-147.
    Verhoeven, B. H., Verwijnen, G. M., Muijtjens, A. M. M., Scherpbier, A. J. J. A., & Van der Vleuten, C. P. M. (2002). Panel expertise for an Angoff standard setting procedure in progress testing: item writers compared to recently graduated students. Medical Education, 36(9), 860-867.
    Wessen, C. (2010). Analysis of Pre- and Post-Discussion Angoff ratings for evidence of social influence effects. Unpublished MA Dissertation, Department of Psychology, University of California, Sacramento.
    Wiley, A., & Guille, R. (2002). The occasion effect for “at-home” Angoff ratings. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
    Yin, P. & Schultz, E. M. (2005). A comparison of cut scores and cut score variability from Angoff-based and Bookmark-based procedures in standard setting. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.
