COMPARISON OF ITEM RESPONSE THEORY MODELS FOR AKM NUMERACY ASSESSMENT IN SENIOR HIGH SCHOOL STUDENTS IN SOUTH SULAWESI

DOI: https://doi.org/10.30605/tcxgxk51

Authors

Keywords

Item response theory, Rasch model, 2PL, 3PL, Numeracy assessment

Abstract

The Asesmen Kompetensi Minimum (AKM) is the cornerstone of Indonesia's national large-scale assessment framework and is designed to measure foundational numeracy competencies across the student population. Selecting an appropriate psychometric model for calibrating AKM items is critical for valid score interpretation, equitable measurement, and evidence-based instructional policy. This study presents an empirical comparison of three Item Response Theory (IRT) models, namely the one-parameter logistic (Rasch) model, the two-parameter logistic (2PL) model, and the three-parameter logistic (3PL) model, applied to a 30-item AKM numeracy instrument administered to 500 senior high school students in South Sulawesi, Indonesia. Item parameters were estimated by marginal maximum likelihood (MML), and the models were compared on model-data fit and measurement precision. The Rasch model produced the lowest Akaike Information Criterion (AIC = 15,178.11) and Bayesian Information Criterion (BIC = 15,304.54), alongside the highest marginal test information (TIF = 5.427) and reliability (.844), indicating superior parsimony and precision relative to the 2PL and 3PL models. Item difficulty parameters ranged from b = −2.788 (Item 23) to b = 0.541 (Item 22), reflecting adequate breadth of the numeracy construct. The 2PL yielded the smallest mean chi-square item misfit, whereas the 3PL added parameter complexity without a meaningful gain in fit. These findings suggest that the Rasch model is the preferred framework for operational AKM calibration; practical guidance is provided for contexts in which the 2PL or 3PL model may be appropriate.
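
To make the model comparison concrete, the following is a minimal Python sketch of the three item response functions and the two information criteria named in the abstract. It is an illustration under stated assumptions, not the study's estimation code: the function names are hypothetical, the logistic metric is used without the 1.7 scaling constant, and the Rasch model is treated as the special case with discrimination fixed at 1 and no guessing parameter. The last helper back-calculates the number of free parameters implied by a reported AIC/BIC pair; the abstract does not state that number, so the result should be read only as a consistency check.

```python
import math


def irf_3pl(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability, b: item difficulty,
    a: discrimination, c: lower asymptote (pseudo-guessing).
    With c = 0 this reduces to the 2PL; with c = 0 and a = 1
    it reduces to the Rasch form assumed in this sketch.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def aic(log_lik, k):
    """Akaike Information Criterion for k free parameters."""
    return -2.0 * log_lik + 2.0 * k


def bic(log_lik, k, n):
    """Bayesian Information Criterion for k parameters and n examinees."""
    return -2.0 * log_lik + k * math.log(n)


def implied_param_count(aic_value, bic_value, n):
    """Solve BIC - AIC = k * (ln n - 2) for k (hypothetical helper)."""
    return (bic_value - aic_value) / (math.log(n) - 2.0)


# Difficulty values quoted in the abstract, evaluated for an
# examinee of average ability (theta = 0 is an assumption here).
p_easiest = irf_3pl(0.0, b=-2.788)  # Item 23
p_hardest = irf_3pl(0.0, b=0.541)   # Item 22

# Consistency check on the reported Rasch AIC and BIC with N = 500.
k_rasch = implied_param_count(15178.11, 15304.54, 500)

print(round(p_easiest, 2), round(p_hardest, 2), round(k_rasch, 1))
```

Under these assumptions an average-ability examinee answers the easiest item correctly with probability of roughly 0.94 and the hardest with roughly 0.37, and the reported AIC and BIC differ by an amount consistent with about 30 free parameters, which would match one difficulty parameter per item.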

Published

2026-06-09

How to Cite

COMPARISON OF ITEM RESPONSE THEORY MODELS FOR AKM NUMERACY ASSESSMENT IN SENIOR HIGH SCHOOL STUDENTS IN SOUTH SULAWESI. (2026). Pedagogy: Jurnal Pendidikan Matematika, 11(2), 1003-1020. https://doi.org/10.30605/tcxgxk51