Enem (Exame Nacional do Ensino Médio), the second largest university entrance exam in the world, is taken by more than 3 million participants every year.
Conservatives, including the current President, Jair Bolsonaro, argue that the exam fails to measure candidates' abilities because of items they perceive to be biased. The criticized questions are often those regarding the country's military dictatorship, women's rights and LGBTQI+ themes. We analyzed billions of responses from millions of students over the past 11 years to verify whether these test items effectively assess students' abilities, and we found that Bolsonaro's criticism is unjustified.
Jair Bolsonaro, the current Brazilian President, has been trying to change the questions on the Enem test since taking office. This effort has a technical argument as its formal justification: that certain items do not measure students' proficiency and are only on the test to promote ideological indoctrination. To our knowledge, this is the first time a group, journalistic or academic, has put these claims to the test. Thus, the findings presented here, that the justifications presented by the president are false, are unprecedented.
Furthermore, we showed that other items, which were not targets of Bolsonaro's criticism, are inefficient at assessing a student's knowledge. This conclusion was corroborated when we found that correct answers on these items were not counted toward students' final grades. In other words, the model used by the institute responsible for the test (Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira – INEP) identified these items as so unfit that getting them right gave no information about how good the student was, leading to no increase in the final grade. None of the millions of test participants were aware of this at the time, since the parameters for calculating the grade are not disclosed by the government.
This article was the front-page headline of Folha de São Paulo, one of the largest and most traditional newspapers in the country.
- Item response theory (IRT): To build the theoretical model of the answers expected from students according to their abilities, against which we compared the empirical data (using Bock's (1972) chi-squared method). IRT was also used to ascertain the ability of the test's items to discriminate students according to their proficiency;
- Factor analysis: To check whether students with more knowledge were more likely to answer an item correctly (using the Crit coefficient);
- Biserial correlation: To check whether answering one item correctly was correlated with answering other items correctly;
- Web scraping: To get the links to download the data and to obtain images of the test items;
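One of the building blocks above, the correlation between a binary item outcome and a continuous score, can be sketched quickly. The snippet below is an illustrative stand-in (the project's actual analysis was done in R); the toy arrays and the `point_biserial` helper are hypothetical, not the project's data or code.

```python
import numpy as np

def point_biserial(item_correct, total_score):
    """Correlation between a binary item outcome (0/1) and a continuous score.

    The point-biserial coefficient is simply Pearson's r computed with one
    dichotomous variable, so np.corrcoef does the work.
    """
    item_correct = np.asarray(item_correct, dtype=float)
    total_score = np.asarray(total_score, dtype=float)
    return np.corrcoef(item_correct, total_score)[0, 1]

# Toy data: students with higher overall scores also tend to get this item
# right, so the correlation comes out clearly positive.
scores = np.array([10, 25, 30, 45, 50, 60, 70, 85, 90, 95], dtype=float)
answers = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
r = point_biserial(answers, scores)
```

A well-functioning item yields a clearly positive coefficient; values near zero suggest the item's outcome carries little information about overall ability.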
Tools and technologies:
- PostgreSQL: To store the data from the students and the tests;
- R: To obtain and format the data, run the analyses and produce the plots;
- Adobe Illustrator: To edit, lay out and prepare plots for print.
What was the hardest part of this project?
Evaluating the quality of a test's items, that is, whether each one efficiently measures students' knowledge, is not a simple task.
We based our analysis on Item Response Theory (IRT), a complex statistical model used to estimate students' abilities in different areas. This is the same model Enem uses to calculate students' grades; hence, our analysis rests on the same assumptions as the test itself.
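To give a sense of the model: Enem's grading is based on a three-parameter logistic (3PL) IRT model, in which the probability of a correct answer depends on the student's ability and on three item parameters. The sketch below uses illustrative parameter values, not Enem's actual calibration, and Python rather than the R used in the project.

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic (3PL) IRT model: probability that a student
    with ability `theta` answers correctly an item with discrimination `a`,
    difficulty `b` and guessing parameter `c`."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# A well-behaved item: probability of success rises steeply with ability.
low = p_correct(theta=-2.0, a=1.5, b=0.0, c=0.2)
high = p_correct(theta=+2.0, a=1.5, b=0.0, c=0.2)

# An item with near-zero discrimination barely separates weak from strong
# students -- the kind of "unfit" item the analysis flagged.
flat_low = p_correct(-2.0, a=0.05, b=0.0, c=0.2)
flat_high = p_correct(+2.0, a=0.05, b=0.0, c=0.2)
```

The gap between `high` and `low` is what makes an item informative: when the discrimination parameter collapses toward zero, getting the item right says almost nothing about the student.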
Moreover, one of our journalists took a college course on the subject, taught by one of the greatest specialists in the field in Brazil, in order to obtain the knowledge necessary for the analyses.
Structuring the PostgreSQL database was a complex operation, as it was necessary to store 12 billion responses from 70 million students who took the tests between 2009 and 2019. The task was further hampered by the lack of standardization across the files provided by the government for different years. This work took three months.
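At that scale, a narrow long-format table, one row per (student, item) response, is the natural layout. The schema below is a hypothetical sketch of that shape, not the project's actual PostgreSQL schema; SQLite in memory stands in here purely for illustration.

```python
import sqlite3

# Hypothetical long-format schema: one row per (student, item) response.
# The project used PostgreSQL; SQLite only stands in to show the table shape.
DDL = """
CREATE TABLE responses (
    year        INTEGER NOT NULL,  -- exam edition (2009-2019)
    student_id  INTEGER NOT NULL,  -- anonymized participant id
    item_id     INTEGER NOT NULL,  -- item within that year's test
    correct     INTEGER NOT NULL   -- 1 if answered correctly, else 0
);
CREATE INDEX idx_responses_item ON responses (year, item_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute("INSERT INTO responses VALUES (2019, 1, 42, 1)")
n = conn.execute("SELECT COUNT(*) FROM responses").fetchone()[0]
```

Indexing by (year, item_id) matches the access pattern of item-level analysis, where all responses to one item in one edition are pulled at once.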
After this step, it was necessary to write R code to assess whether the items were able to discriminate students according to knowledge, that is, whether, for each item, the chance of a correct answer was related to the student's ability in that knowledge area.
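The core of that check can be sketched as: bin students by ability and compare the observed share of correct answers in the bottom and top bins. The snippet below uses simulated data and Python (the published analysis used R with IRT-based expected curves); every variable here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for real responses: 10,000 students with latent
# abilities; a discriminating item is answered correctly more often by
# the more able students.
theta = rng.normal(size=10_000)
p = 1 / (1 + np.exp(-1.2 * theta))   # logistic item response
correct = rng.random(10_000) < p

# Bin students into ability deciles and compute the observed proportion
# correct per bin; a discriminating item shows a steadily rising curve.
cuts = np.quantile(theta, np.linspace(0.1, 0.9, 9))
deciles = np.digitize(theta, cuts)
share = np.array([correct[deciles == d].mean() for d in range(10)])
gap = share[-1] - share[0]  # large gap = the item separates students well
```

For an unfit item, `share` stays roughly flat across deciles and `gap` collapses toward zero: getting it right is unrelated to ability, which is exactly why such items added nothing to the final grade.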
Finally, there was an extensive effort to come up with a visual solution that would convey the complexity of the theory in a way readers could understand. This included a quiz with the unfit questions from the test, so readers could see how candidates react to them and understand the problems first-hand.
What can others learn from this project?
We believe there is value in translating technical knowledge as a way of doing journalism and contesting authorities' declarations. This project started when we tried to understand how the government calculated students' grades. In doing so, we struggled with extremely specific technical knowledge. Although this would often be reason enough to abandon the task, we did not let ourselves be intimidated and sought to understand the model in depth. This knowledge gave us the tools to assess the quality of the test's items, and with it we were able to challenge the narrative that the current federal government tried to impose.