Armanbot: automated baseball news generation

Country/area: Cuba

Organisation: Postdata.club, Artificial Intelligence Research Group – Havana University

Organisation size: Small

Publication date: 14 Aug 2020

Credit: Yudivián Almeida Cruz, Roberto Balboa, Frank Saddan Naranajo

Project description:

In Cuba, baseball is the national sport, and people follow the performance of Cuban players in the different leagues, especially in Major League Baseball (MLB). This means following a large number of matches and players and reporting on how each player performed. Armanbot uses AI methods to generate daily news about the performance of Cubans in MLB. Each day, the Cuban players who took part in each match are identified, and their performance and impact on the results of their teams are described. From this, an article is generated and published, all automatically.

Impact reached:

With Armanbot, the performance of every Cuban player in every major league match of the 2020 season was followed day by day and in full. It was the first automatic news generation project carried out in Cuba. It quickly drew interest from other newsrooms in the country, both for monitoring Cuban players in the major leagues and for extending the approach to other leagues, including the Cuban national league. Its impact on national journalists also opened a debate, which had not existed before, about the possibilities and effects of automatic news generation in newsrooms.

Techniques/technologies used:

Baseball-Reference (BR) was used as the data source for the system. BR reports the statistics of each day's matches and maintains a database of MLB players. Each day, the data for the matches played and for the Cuban players who took part were obtained and structured automatically.

To determine which players were relevant, information was collected from ESPN on the outstanding players of each matchday in previous seasons. This information was combined with BR statistics to build a training set, which was used to train a machine learning model, based on logistic regression, that classifies each player's performance on a given day.
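As an illustration of this classification step, here is a minimal sketch of a logistic regression classifier written in plain Python. It is not the project's actual model: the three features (hits, home runs, runs batted in), the toy rows and labels, and the function names are all assumptions invented for this example, and a real training set would come from the ESPN and BR data described above.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict_outstanding(weights, bias, x, threshold=0.5):
    """Return True when the model rates the stat line as outstanding."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score) >= threshold


# Invented toy data (hypothetical): [hits, home_runs, runs_batted_in]
# for one player in one game; label 1 means "outstanding performance".
TOY_ROWS = [
    [0, 0, 0], [1, 0, 0], [0, 0, 1], [1, 0, 1],
    [3, 1, 3], [2, 1, 2], [4, 0, 2], [2, 2, 4],
]
TOY_LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic_regression(TOY_ROWS, TOY_LABELS)
print(predict_outstanding(weights, bias, [3, 1, 3]))
print(predict_outstanding(weights, bias, [0, 0, 0]))
```

In practice a newsroom would more likely rely on an existing machine learning library than on a hand-written training loop; the loop is spelled out here only to keep the sketch self-contained.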

With the data for each match of the day and the most prominent players identified, the text of the news is generated using a rule-based heuristic strategy. Rules were defined for the different possible situations, and the final text is assembled from sentence templates and sentence fragments. For each sentence or fragment, several alternative wordings were written that can be used interchangeably. A similar scheme was adopted for the titles. For the summary of the news, the summarization algorithm provided by Gensim was used. In this way, the information to be published is generated for each day: title, summary and content of the report.
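The rule-and-template scheme can be sketched roughly as follows. The situations, the rules that select them, the template wordings, and the placeholder names "Player A" and "Team X" are all hypothetical; the project's real rules and templates are not given in this description.

```python
import random

# Hypothetical templates: several interchangeable wordings per situation.
TEMPLATES = {
    "home_run": [
        "{player} homered for {team}.",
        "{player} went deep to help {team}.",
    ],
    "multi_hit": [
        "{player} collected {hits} hits for {team}.",
        "{player} finished with {hits} hits in the {team} lineup.",
    ],
    "default": [
        "{player} appeared for {team}.",
        "{player} saw action with {team}.",
    ],
}


def pick_situation(stats):
    """Apply simple rules (assumed for illustration) to choose a situation."""
    if stats.get("home_runs", 0) > 0:
        return "home_run"
    if stats.get("hits", 0) >= 2:
        return "multi_hit"
    return "default"


def render_sentence(player, team, stats, rng=random):
    """Pick one of the alternative templates at random and fill it in."""
    template = rng.choice(TEMPLATES[pick_situation(stats)])
    return template.format(
        player=player,
        team=team,
        hits=stats.get("hits", 0),
        home_runs=stats.get("home_runs", 0),
    )


# A seeded generator makes the choice of wording reproducible.
print(render_sentence("Player A", "Team X", {"hits": 3}, rng=random.Random(0)))
```

Choosing at random among equivalent wordings is what gives the generated texts some variety from day to day, while the rules keep each sentence tied to what actually happened in the data.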

To publish the daily news, we used GitHub Pages, with GitHub Actions for the automated tasks (scraping, classification, text generation). A scheduled task runs daily from the repository of the news site; it generates the news and adds it to a file in JSON format. From this file, using HTML5, JavaScript and jQuery, the news that Armanbot writes for us each day is displayed in the browser.
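The step in which the scheduled task adds the day's news to the JSON file might look like the sketch below. The file layout (a JSON list of article objects), the field names `date`, `title`, `summary` and `content`, and the function name are assumptions made for this example; the actual schema used by the site is not described here.

```python
import json
from pathlib import Path


def append_article(json_path, article):
    """Add one day's article to a JSON file holding a list of articles."""
    path = Path(json_path)
    if path.exists():
        articles = json.loads(path.read_text(encoding="utf-8"))
    else:
        articles = []
    # Drop any earlier entry for the same date so that rerunning the
    # scheduled task does not publish the same day twice.
    articles = [a for a in articles if a.get("date") != article["date"]]
    articles.append(article)
    path.write_text(
        json.dumps(articles, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return articles


# Hypothetical usage with placeholder text:
# append_article("news.json", {
#     "date": "2020-08-14",
#     "title": "placeholder title",
#     "summary": "placeholder summary",
#     "content": "placeholder content",
# })
```

Keeping the published data in a single static file is one way to match the stated goal of zero infrastructure maintenance: the page only needs to fetch and render that file in the browser.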

What was the hardest part of this project?

The main challenge in this automatic news generation project was to ensure that the texts generated from the data were not mere statistical reports: they needed variety and a certain empathy with readers, even though they follow a fixed information scheme. In addition, all of this had to be achieved with relatively simple methods that do not require resource-intensive research and development, so that the approach is available to any newsroom and can later be extended to other newsrooms interested in related projects. On top of this, the entire process had to be automated and transparent, with zero infrastructure maintenance costs.

What can others learn from this project?

Automatic news generation is within reach of many newsrooms and can be applied in different domains in a controlled manner. It can contribute to a better distribution of work in small newsrooms, or in those that must cover recurring events over long periods of time.

This is one of the contributions that AI is already making to journalism, and it can be achieved with methods from this field that are within reach of simple collaborations (as we did at Postdata.club) between different actors from both journalism and research.

Moreover, Postdata.club, with only three permanent staff in its newsroom, was able to fully cover the performance of all Cuban players in all major league matches. Without Armanbot, the team could not have done it.

Project links: