What we captured
2 years of crawling
As a pilot project in targeted web crawling, the National Library of Luxembourg spent two years archiving websites, social media profiles and online news media related to the local and national election campaigns. While we were able to include the Facebook and Twitter profiles of candidates and political parties, capturing all relevant social media content was out of reach, both because of technical hurdles and because the content changed faster than we could crawl it.
Fields and topics
Like the article selection, this manual procedure was carried out by one person, who screened each article once; the resulting characterisation is therefore subjective. For the 2017 local elections, over 2600 news articles were harvested and tagged. News coverage of the 2018 national elections, however, produced over 8200 articles.
While the “Topic” field was also tagged manually, we wanted to move to automated tagging for the fields “Political party”, “Electoral district”, “Publication date” and “Votes received”. The publication dates were captured by a web scraping tool from the live web versions of the articles. By extracting the text from the archived WARC files, we would like to identify the political candidates named in each article, along with mentions of political parties and other patterns that can be matched with regular expressions (such as nicknames or official titles). Once the mentioned parties and candidates are known, the electoral district can be deduced automatically. Since the election results are in, the number of votes received by each candidate or party mentioned in an article can also be recorded in a field.
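The automated tagging described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the tooling actually used for the project: it assumes the article text has already been extracted from the WARC files, and every candidate name, alias, party label, district assignment and vote count in the lookup tables is a made-up placeholder. The function names `build_pattern` and `tag_article` are likewise hypothetical.

```python
import re

# Hypothetical lookup tables: names, aliases, districts and vote counts are
# placeholders for illustration only, not real election data.
CANDIDATES = {
    "Jane Example": {
        "district": "Centre",
        "votes": 1200,
        "aliases": ["J. Example", "Minister Example"],
    },
    "John Sample": {
        "district": "Sud",
        "votes": 950,
        "aliases": ["Johnny Sample"],
    },
}
PARTIES = {
    "Party A": ["Party A", "PA"],
    "Party B": ["Party B", "PB"],
}


def build_pattern(variants):
    """Compile one regex that matches any of the variants as a whole word."""
    # Longest variants first, so longer aliases win over shorter ones.
    ordered = sorted(variants, key=len, reverse=True)
    body = "|".join(re.escape(v) for v in ordered)
    # Lookarounds stop "PA" from matching inside a longer word such as "PAX".
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)")


def tag_article(text):
    """Return the candidates, parties, districts and votes found in one article."""
    candidates = sorted(
        name
        for name, info in CANDIDATES.items()
        if build_pattern([name] + info["aliases"]).search(text)
    )
    parties = sorted(
        party
        for party, variants in PARTIES.items()
        if build_pattern(variants).search(text)
    )
    # Assumption: the district is deduced from the mentioned candidates only.
    districts = sorted({CANDIDATES[c]["district"] for c in candidates})
    votes = {c: CANDIDATES[c]["votes"] for c in candidates}
    return {
        "candidates": candidates,
        "parties": parties,
        "districts": districts,
        "votes": votes,
    }


if __name__ == "__main__":
    sample = "Minister Example met Johnny Sample; PB published its programme."
    print(tag_article(sample))
```

Matching is case-sensitive here so that short party abbreviations do not collide with ordinary words. A real pipeline would also need to handle spelling variants across languages and candidates who share a surname, which this sketch does not attempt.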
Because this method of processing news articles is very time-consuming, it may go beyond the Luxembourg Web Archive's goal of collecting and preserving websites, and it will most likely not become a regular exercise. The main objective of this project was to present the possibilities and benefits of working with archived web materials, using a relatable subject and an easily accessible format. Moreover, it lets us share insights about our activities without requiring access to the actual web archive.