For the first time, the Portuguese Education Ministry gave access to an anonymized database with the grades of 11th- and 12th-grade students in public schools. The database included information such as whether a student received “escalão A”, a subsidy given to poorer students. The data allowed us to conclude that poorer students have lower marks in every subject, and also that they tend to choose Humanities rather than STEM.
The story showed how Portuguese public schools are not yet able to close the gap between students from poorer and richer families. Although national exam data had already shown that students from poorer backgrounds tend to have lower marks on those exams, this database was far more comprehensive. Apart from the trend, mentioned above, of students from poorer families avoiding STEM, we were also able to observe a bigger gap in subjects like Geometry and English than in, for example, Sociology. Confirming those trends would require a more comprehensive database (covering several school years, something the Portuguese authorities did not provide), but they might give public authorities some clues about what could be done to close the gap between poorer and richer students.
We used the R programming language to analyze the database and ggplot2 to produce draft charts. Choosing the right visualization was a challenge, since we were not sure about the best way to show our findings.
We used Flourish to make the final visualizations and Scrollama.js to build the scroller.
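As an illustration of the kind of comparison described above, the sketch below computes, per subject, the difference between the mean grade of students without the subsidy and the mean grade of students with it. It is written in Python rather than R and is not the code used for the story; the record layout, the function name `grade_gap_by_subject` and every grade in it are invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (subject, receives_escalao_a, grade).
# All values are invented for illustration; none come from the ministry's database.
records = [
    ("Geometry", True, 10.0),
    ("Geometry", True, 12.0),
    ("Geometry", False, 14.0),
    ("Geometry", False, 16.0),
    ("Sociology", True, 13.0),
    ("Sociology", True, 15.0),
    ("Sociology", False, 14.0),
    ("Sociology", False, 16.0),
]


def grade_gap_by_subject(rows):
    """Return {subject: mean grade without subsidy - mean grade with subsidy}."""
    grouped = defaultdict(lambda: {True: [], False: []})
    for subject, subsidized, grade in rows:
        grouped[subject][subsidized].append(grade)

    gaps = {}
    for subject, groups in grouped.items():
        # Skip subjects where one of the two groups has no students.
        if groups[True] and groups[False]:
            gaps[subject] = mean(groups[False]) - mean(groups[True])
    return gaps


gaps = grade_gap_by_subject(records)

# List subjects from the widest gap to the narrowest.
for subject, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{subject}: {gap:.1f}")
```

Sorting by the size of the gap makes it easy to see which subjects separate the two groups most, which is the comparison behind the Geometry and Sociology example.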
What was the hardest part of this project?
Since the original database included information for 102,947 students, and their marks in every subject, it was a massive database that posed some challenges. Also, since it included so much information, it was a challenge to find what the real story could be. The database also had some problems (for example, the unique ids in the anonymized database were not so unique after all…). That meant the analysis involved putting several questions to the authorities that provided it.
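A check for repeated identifiers of the kind mentioned above could look like the following Python sketch. It assumes a table with one row per student, which may not match the real layout of the database; the field name `student_id` and the rows themselves are hypothetical.

```python
from collections import Counter

# Hypothetical rows, assuming one row per student.
# The field names and values are invented for illustration.
rows = [
    {"student_id": "A1", "year": 11},
    {"student_id": "B2", "year": 12},
    {"student_id": "A1", "year": 12},
    {"student_id": "C3", "year": 11},
]


def duplicated_ids(rows, key="student_id"):
    """Return {id: number of rows} for every id that appears more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


duplicates = duplicated_ids(rows)
print(duplicates)
```

Any identifier returned by such a check is a candidate question for the data provider, since the data alone cannot say whether two rows describe the same student.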
The second challenge was to determine whether some of those conclusions could just be noise in the data. The lack of a database from previous years meant that it was impossible to know if the trend was observed only in those 102,947 students. Those students were also studying during a pandemic, and we were not able to evaluate how much that affected the results of our analysis.
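One common way to ask whether a gap between two groups could be noise is a permutation test: shuffle the group labels many times and count how often a gap at least as large as the observed one appears by chance. The Python sketch below shows the idea on invented grades. It is not the method used for the story, and it only addresses chance variation within a single cohort; it cannot replace data from previous years or isolate the effect of the pandemic.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=2000, seed=0):
    """One-sided permutation test for mean(group_b) - mean(group_a).

    Returns the observed gap and the share of shuffles whose gap is
    at least as large (with the usual +1 correction).
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if mean(shuffled_b) - mean(shuffled_a) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)


# Invented grades for illustration only.
with_subsidy = [10, 11, 12, 11, 10, 12]
without_subsidy = [14, 15, 13, 16, 14, 15]

observed_gap, p_value = permutation_p_value(with_subsidy, without_subsidy)
print(f"observed gap: {observed_gap}, p-value: {p_value:.4f}")
```

A small p-value suggests the gap is unlikely to come from random assignment of students to the two groups, but it says nothing about whether the same gap would appear in another school year.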
What can others learn from this project?
That data journalism can contribute to public debate and can also offer clues for further academic research.