What we collect is more than screenshots or code.
We try to capture every site exactly as it is, in as much depth and detail as possible.
A website is crawled by our “robot”
The Heritrix web spider crawls every part of a website (the same technique that search engines use for indexing).
All elements of the site are downloaded
Text, images, documents, layout… everything that is publicly available on the site is downloaded.
An exact copy of the site is created
The archival copy of the website can be browsed with all the functionalities of the original.
Over time, each new copy of a website adds to a timeline for that site, which can be navigated via our Wayback Machine.
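The crawl process described above is essentially a breadth-first traversal: start from a seed address, download the page, discover its links, and repeat for every link that stays within scope. The sketch below is an illustrative Python toy, not Heritrix itself (which is a Java application); the site structure and URLs are invented, and the HTTP fetch is simulated with a dictionary lookup.

```python
from collections import deque

# Toy site: each "URL" maps to the list of links found on that page.
# A real spider would fetch pages over HTTP and parse the HTML.
SITE = {
    "https://example.lu/": ["https://example.lu/about", "https://example.lu/news"],
    "https://example.lu/about": ["https://example.lu/"],
    "https://example.lu/news": ["https://example.lu/news/item1", "https://other.com/"],
    "https://example.lu/news/item1": [],
}

def crawl(seed, in_scope):
    """Breadth-first crawl from `seed`, archiving every in-scope URL once."""
    archived, frontier = set(), deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in archived or not in_scope(url):
            continue
        archived.add(url)                  # "download" the page
        for link in SITE.get(url, []):     # discover outgoing links
            frontier.append(link)
    return archived

# Restrict the crawl to the site's own host, as a domain crawl would.
pages = crawl("https://example.lu/", lambda u: u.startswith("https://example.lu/"))
```

The scope function is the key knob: for a broad crawl it would admit any ".lu" address, while a targeted crawl would use a curated seed list.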
Information for webmasters
The Luxembourg Web Archive harvests websites automatically in accordance with the law of the 25th of June 2004 “portant réorganisation des instituts culturels de l’Etat” and the “Règlement grand-ducal du 6 novembre 2009 relatif au dépôt légal”. Publications in non-material forms which are accessible to the public through electronic means, for instance through the Internet, are subject to legal deposit in Luxembourg.
The websites that are harvested through these means enrich the patrimonial collections of the National Library of Luxembourg, which can thus collect and preserve digital publications for future generations.
We have our methods
Broad (or domain) crawls are conducted twice a year and create a snapshot of all “.lu” addresses, plus additional lists of websites determined by the Luxembourg Web Archive.
These crawls cover a large number of websites at once, but may be slow to capture sites that change rapidly or that disappear between two harvests.
Targeted (or event) crawls try to harvest as much information as possible about a certain event over a limited time frame. The seed lists for event crawls are short, but the frequency of captures will likely be higher. There is always a start and an end date to event crawls, which may be determined in advance (e.g. for election crawls) or may depend on the urgency of unexpected events (e.g. natural catastrophes).
Selective crawls cover a specific topic or field of interest that has a higher priority for the web archive. This priority may stem from the pace at which the information changes, or from the importance of the topic to the cultural heritage of Luxembourg. The seed lists will expand over time and receive additional harvests, complementing the coverage provided by broad crawls.
Every collection, whether resulting from broad crawls or targeted crawls, is defined by our collection policy.
The different policy components, such as the frame, objectives and contents of the collection, are explained in the following categories.
What is the topic, occasion or motivation behind the creation of the collection?
How broad or narrow is the field of interest for this topic, and what is the time frame for harvests?
Strengths and areas of interest
Different types of websites pose different challenges when capturing their contents. This means there are necessarily limits to the completeness of coverage, both in the number of seeds and in the changes captured on websites. Indicating selective, extensive, comprehensive or broad coverage for different parts of a collection helps in understanding the limitations and priorities of the project.
Which seeds are used to build the collection? Which types of websites are included?
Frequency of captures
How often the different types of seeds should be captured.
Volume of data and number of documents to be captured.
The types of documents to be found in the collection, and possible exclusions.
Criteria for foreign websites
Does the topic warrant the inclusion of websites outside of the .lu domain?
How we collect
Based on our experience with targeted crawls on the topics of local, national, European and social elections from 2017 to 2019, we can illustrate the process of building a collection and how selection criteria shape our special collections.
We need to look in the right places to select the right elements for each collection; in practice, this means determining whether a given seed should be included in the seed list. The criteria we set up help web archive users follow the structure and reasoning behind our special collections. Furthermore, a consistent policy helps with the organisation of, and search for, additional seeds throughout the project.
Research and organisation
The categories could also be described as “stakeholders in the elections”. In order to find as much information on each stakeholder as possible, we research the different types of seeds for each entry in a category. For example, for political party “X”, we look up their website, Facebook page, Twitter page, etc., and continue to do the same for political party “Y”, then all local sections, all candidates, and so on. Not every stakeholder has a seed of each type, and the correct seeds can be difficult to find. It is therefore necessary to check entry by entry for every seed type.
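The entry-by-entry check described above amounts to iterating over every (stakeholder, seed type) pair and recording only the seeds that actually exist. The sketch below illustrates this in Python; the party names, seed types and URLs are invented for the example.

```python
# Hypothetical seed types checked for each stakeholder; a real project
# would include more (Instagram, YouTube, local-section sites, etc.).
SEED_TYPES = ["website", "facebook", "twitter"]

# Invented research results: not every stakeholder has every seed type.
stakeholders = {
    "Party X": {"website": "https://party-x.lu",
                "facebook": "https://facebook.com/partyx"},
    "Party Y": {"website": "https://party-y.lu",
                "twitter": "https://twitter.com/partyy"},
}

def build_seed_list(entries):
    """Collect one seed per (stakeholder, seed type), skipping missing types."""
    seeds = []
    for name, found in entries.items():
        for seed_type in SEED_TYPES:
            url = found.get(seed_type)  # may be absent for this stakeholder
            if url:
                seeds.append((name, seed_type, url))
    return seeds

seed_list = build_seed_list(stakeholders)
```

Keeping the seed type alongside each URL makes it easy to report coverage per category and to spot gaps that still need research.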
Archiving social media
Social media arguably represent some of the most unique content on the web, with no comparable counterpart in print media. In the context of politics especially, the campaign information found on political party websites is, for the most part, also published in brochures, flyers and posters. Social media allow politicians to maintain more individual platforms and, most importantly, to exchange with the public, other politicians, journalists, etc. Social media are also the most volatile source of information, since new content appears and disappears very quickly. For a number of technical and budgetary reasons, social media are the most difficult and time-consuming part of an election campaign to research and harvest.
While the Luxembourg Web Archive strives to grow over time and expand its inventory of the Luxembourg web space, there are a few exceptions to this inclusive approach:
– Websites will not be harvested, or only partially harvested, if the technical resources necessary for their archival are not justified by their relation to the LWA’s areas of interest.
– Websites will be excluded from the browsable Web Archive if their contents have been deemed illegal for any reason.
Exclusions on grounds of privacy concerns are generally not applicable. Under the legal deposit mandate, the National Library of Luxembourg is only concerned with the browsable, openly accessible part of the web, also known as the surface web. Since this is only part of the World Wide Web (which in turn is only part of the Internet), our field of activity does not come into contact with information that is not meant to be publicly available.
Even though the Luxembourg Web Archive is not required to ask permission to archive a website under the legal deposit, we try to be as transparent and communicative as possible about our interactions with websites.