Differential Item Functioning (DIF) in Terms of Gender in the Reading Comprehension Subtest of a High-Stakes Test

Document Type: Research Paper


Sharif University of Technology


Validation is an important enterprise, especially when a test is high-stakes. Demographic variables such as gender and field of study can affect test results and their interpretation. Differential Item Functioning (DIF) analysis is a way to check that a test does not favor one group of test takers over another. This study investigated gender DIF in the 35-item reading comprehension subtest of a high-stakes test, the University of Tehran English Proficiency Test (the UTEPT), using the three-step logistic regression procedure (Zumbo, 1999). The participants were 3,398 male and female test takers who took the UTEPT as a partial requirement for entering a PhD program at the University of Tehran. Three sets of effect-size criteria were applied: Cohen’s (1988), Zumbo’s (1999), and Jodoin and Gierl’s (2001). Although the 35 items showed “small” effect sizes according to Cohen’s classification, they did not display DIF under the other two criteria. It can therefore be concluded that the reading comprehension subtest of the UTEPT favors neither males nor females.
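The three-step logistic regression procedure named in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis: the data are simulated, the function names (`dif_test`, `classify`) are hypothetical, and Nagelkerke's R-squared is an assumed choice of effect-size measure, since the abstract does not say which R-squared was used. The cut-offs in `classify` are the values commonly cited for the Zumbo (1999) and Jodoin and Gierl (2001) criteria, not figures taken from this paper. Step 1 enters the total score, step 2 adds group membership, and step 3 adds the score-by-group interaction; the 2-df chi-square compares step 3 with step 1.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def prob(beta, x):
    """Predicted probability of a correct response (linear predictor clamped)."""
    eta = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, eta))))

def log_likelihood(X, y, iters=25):
    """Maximized log-likelihood of a logistic regression via Newton-Raphson."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = prob(beta, xi)
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    hess[i][j] += p * (1 - p) * xi[i] * xi[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    ll = 0.0
    for xi, yi in zip(X, y):
        p = prob(beta, xi)
        ll += math.log(p if yi else 1 - p)
    return ll

def nagelkerke_r2(ll_model, ll_null, n):
    """Nagelkerke R-squared (an assumed effect-size measure for this sketch)."""
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    return cox_snell / (1 - math.exp(2 * ll_null / n))

def dif_test(item, total, group):
    """Fit the three nested models and return the 2-df test and R-squared change."""
    n = len(item)
    p_bar = sum(item) / n
    ll_null = n * (p_bar * math.log(p_bar) + (1 - p_bar) * math.log(1 - p_bar))
    ll1 = log_likelihood([[1.0, t] for t in total], item)                    # step 1
    ll2 = log_likelihood([[1.0, t, g] for t, g in zip(total, group)], item)  # step 2
    ll3 = log_likelihood([[1.0, t, g, t * g] for t, g in zip(total, group)], item)  # step 3
    r1, r2, r3 = (nagelkerke_r2(ll, ll_null, n) for ll in (ll1, ll2, ll3))
    chi2 = max(0.0, 2 * (ll3 - ll1))
    return {"chi2": chi2,
            "p_value": math.exp(-chi2 / 2),  # exact chi-square tail for 2 df
            "delta_r2": r3 - r1,             # total DIF effect size
            "uniform_r2": r2 - r1,           # part due to the group main effect
            "nonuniform_r2": r3 - r2}        # part due to the interaction

def classify(p_value, delta_r2):
    """Apply commonly cited cut-offs; the second label is by magnitude only."""
    zumbo_dif = p_value <= 0.01 and delta_r2 >= 0.13
    label = "negligible" if delta_r2 < 0.035 else "moderate" if delta_r2 <= 0.070 else "large"
    return {"zumbo_dif": zumbo_dif, "jodoin_gierl": label}

# Simulated data only: 1,500 examinees, two groups, one fair and one biased item.
random.seed(0)
n = 1500
group = [i % 2 for i in range(n)]
theta = [random.gauss(0, 1) for _ in range(n)]
total = [t + random.gauss(0, 0.3) for t in theta]  # stand-in for a standardized total score

def simulate(shift):
    """Item responses; `shift` is the logit advantage given to group 1."""
    return [int(random.random() < 1 / (1 + math.exp(-(1.2 * t - 0.3 + shift * g))))
            for t, g in zip(theta, group)]

fair = dif_test(simulate(0.0), total, group)
biased = dif_test(simulate(1.5), total, group)
print(classify(fair["p_value"], fair["delta_r2"]), fair)
print(classify(biased["p_value"], biased["delta_r2"]), biased)
```

In a real analysis the `item`, `total`, and `group` lists would come from the scored response data, and the procedure would be repeated for each of the 35 items. Splitting the R-squared change into the step 2 and step 3 increments separates uniform from nonuniform DIF.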


Alderson, C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. New York: Cambridge University Press.

Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37, 1-15.

Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. Braun (Eds.), Test validity (pp. 19-32). Hillsdale, NJ: Erlbaum.

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Brown, H. D. (2004). Language assessment: Principles and classroom practices. London: Longman.

Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment. New York: McGraw-Hill.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 137-166). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

French, A. A., & Miller, T. R. (1996). Logistic regression and its use in detecting differential item functioning in polytomous items. Journal of Educational Measurement, 33 (3), 315-332.

Geranpayeh, A., & Kunnan, A. J. (2007). Differential item functioning in terms of age in the Certificate in Advanced English examination. Language Assessment Quarterly, 4 (2), 190-222.

Hatch, E., & Farhady, H. (1982). Research design and statistics for applied linguistics. Rowley, Massachusetts: Newbury House.

Hatch, E., & Lazaraton, A. (1997). The research manual: Design and statistics for applied linguistics. Boston, MA: Heinle & Heinle Publishers.

Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329-349.

Kim, M. (2001). Detecting DIF across the different language groups in a speaking test. Language Testing, 18, 89-114.

Lai, J. S., Teresi, J., & Gershon, R. (2005). Procedures for the analysis of differential item functioning (DIF) for small sample sizes. Evaluation & the Health Professions, 28 (3), 283-294.

McNamara, T., & Roever, C. (2006). Language testing: The social dimension. New York: Blackwell Publishing.

Monahan, P. O., McHorney, C. A., Stump, T. E., & Perkins, A. J. (2007). Odds ratio, delta, ETS classification, and standardization measures of DIF magnitude for binary logistic regression. Journal of Educational and Behavioral Statistics, 32 (1), 92-109.

Mousavi, S. A. (2009). An encyclopedic dictionary of language testing. Tehran: Rahnama Press.

Van den Noortgate, W., & De Boeck, P. (2005). Assessing and examining differential item functioning using logistic mixed models. Journal of Educational and Behavioral Statistics, 30 (4), 443-464.

O'Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are associated with differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255-267). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Pae, T. (2004). Gender effect on reading comprehension with Korean EFL learners. System, 32, 265-281.

Park, T. (2006). Detecting DIF across different language and gender groups in the MELAB essay test using the logistic regression method. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 4, 81-96.

Perrone, M. (2006). Differential item functioning and item bias: Critical considerations in test fairness. Columbia University Working Papers in TESOL & Applied Linguistics, 6 (2), 1-3.

Rezaee, A., & Salehi, M. (2008). The construct validity of a language proficiency test: A multitrait multimethod approach. TELL, 2 (8), 93-110.

Roever, C. (2001). Web-based language testing. Language Learning & Technology, 5 (2), 84-94.

Roever, C. (2005). “That’s not fair!” Fairness, bias, and differential item functioning in language testing. Retrieved November 18, 2006, from the University of Hawai’i System Web site: http://www2.hawaii.edu/~roever


Salehi, M., & Rezaee, A. (2009). On the factor structure of the grammar section of University of Tehran English Proficiency Test (the UTEPT). Indian Journal of Applied Linguistics, 35 (2), 169-187.

Scherbaum, C. A., & Goldstein, H. W. (2008). Examining the relationship between race-based differential item functioning and item difficulty. Educational and Psychological Measurement, 68, 537-553.

Shohamy, E. (2000). Fairness in language testing. In A. J. Kunnan (Ed.), Fairness and validation in language assessment: Selected papers from the 19th Language Testing Research Colloquium, Orlando, Florida (pp. 15-19). Cambridge, UK: Cambridge University Press.

Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27 (4), 361-370.

Swanson, D. B., Clauser, B. E., Case, S. M., Nungester, R. J., & Featherman, C. (2002). Analysis of differential item functioning (DIF) using hierarchical logistic regression models. Journal of Educational and Behavioral Statistics, 27 (1), 53-75.

Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17, 323-340.

Teresi, J. (2004). Differential item functioning and health assessment. Columbia University Stroud Center and Faculty of Medicine. New York State Psychiatric Institute, Research Division, Hebrew Home for the Aged at Riverdale. 1-24.

Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense.