What we captured
About the collection
We started the covid-19 collection on 18th March 2020. On that date, there were 81 known cases in Luxembourg. One of the first steps was to go through all news outlet pages, gathering every relevant article about the coronavirus. At this stage, it was already clear that the coronavirus would become the single most extensive subject we had ever captured, both in terms of its impact on society and on the internet.
Participation from the public
For other collections, we have already launched calls for participation and invited political parties to share their online presence in order to improve the lists we use for web harvests. For the covid-19 collection, we addressed our call to a larger audience: the crisis has touched every aspect of society and community, making it impossible for us to research and detect every relevant website on our own. The response was overwhelmingly positive, and we received many contributions from smaller communities and minorities whose experiences didn’t get any coverage in the news. Without the call for participation, we would thus have been unable to capture this vital aspect of life during the coronavirus pandemic.
Priorities and methods
Our coverage of covid-19 focuses on websites, news outlets and Twitter. Unfortunately, we have limited coverage of Facebook, YouTube and other social media platforms. This is because we have to prioritise our technical resources over an undetermined period of time, as the end date for the collection has not yet been defined. We have combined several methods since March 2020:
– Manual crawls with the Archive-It tool
– Domain crawls in collaboration with the Internet Archive
– Continuous online news media crawls in collaboration with the Internet Archive
– Manual crawls with Webrecorder/Conifer
Scope and limitations
These methods vary in the kind of sites that are captured, the number of URLs included in each crawl and the frequency of captures. For instance, some websites are only captured once, whereas media outlets have been captured on a daily or weekly basis since June. As manual crawls with Archive-It are limited by a data budget, we have to select the seeds for each category carefully. In return, manual crawls allow for a higher capture frequency and focus on high-priority sites. Nevertheless, we still aim to capture the larger picture, for example by collecting data from all websites with a .lu domain. Such large-scale domain crawls are normally done twice a year. For the covid-19 collection, we were able to add a further domain crawl at the beginning of April, bringing these larger-scale captures of the Luxembourgish web to a four-month rhythm between December 2019 and December 2020.
Due to the unpredictable nature of the pandemic and the unprecedented situation in Luxembourg and the world, it is difficult to determine a fixed end date for the crawls in this collection. We plan to keep collecting while the subject is still dominating the news, the internet and people’s lives, yet we have to manage our resources, which may affect the pace of harvesting.