How it works

The process

What we collect is more than screenshots or code.
We try to capture every site exactly as it is, in as much depth and detail as possible.

1

A website is crawled by our “robot”

The Heritrix web spider crawls every part of a website (the same technique search engines use to index the web).

2

All elements of the site are downloaded

Text, images, documents, layout… everything that is publicly available on the site is downloaded.

3

An exact copy of the site is created

The archival copy of the website can be browsed with all the functionalities of the original.
Over time, each new copy of a website adds to a timeline for that site, which can be navigated via our Wayback Machine.
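As a purely illustrative sketch of this crawl, download and copy cycle, the short Python program below fetches pages, keeps a copy of every element it downloads, and follows the links it finds. It is not how Heritrix actually works (Heritrix is a much more sophisticated Java application that stores its captures in standard WARC files); the seed URL and the page limit are placeholders.

    # Minimal, illustrative crawl -> download -> copy loop (not the archive's real software).
    import urllib.parse
    import urllib.request
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            # Collect every href/src so that text, images and documents are all followed.
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)

    def crawl(start_url, max_pages=10):
        seen, queue, captures = set(), [start_url], {}
        while queue and len(captures) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url) as response:
                    body = response.read()                    # step 2: download the element
            except OSError:
                continue
            captures[url] = body                              # step 3: keep an exact copy
            parser = LinkExtractor()
            parser.feed(body.decode("utf-8", errors="replace"))
            for link in parser.links:                         # step 1: crawl on to every linked part
                queue.append(urllib.parse.urljoin(url, link))
        return captures

    # captures = crawl("https://example.lu/")  # placeholder seed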

Information for webmasters

The Luxembourg Web Archive harvests websites automatically in accordance with the law of the 25th of June 2004 “portant réorganisation des instituts culturels de l’Etat” and the “Règlement grand-ducal du 6 novembre 2009 relatif au dépôt légal”. Publications in non-material forms which are accessible to the public through electronic means, for instance through the Internet, are subject to legal deposit in Luxembourg.

The websites harvested in this way enrich the heritage collections of the National Library of Luxembourg, which can thus collect and preserve digital publications for future generations.

Mission & legal framework

How does it work?

The harvesting is done using the Heritrix web spider.
Because this program does not fully interpret JavaScript, it can sometimes generate false URLs.
This is of course not the intention of the Luxembourg Web Archive, but it cannot be avoided with the current state of the technology.
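To make the false-URL issue concrete: a crawler that does not execute JavaScript can only scan the script source for strings that look like links. The following Python fragment is an invented illustration of that kind of speculative extraction, not Heritrix’s actual code.

    import re

    # A script fragment of the kind a crawler might encounter (invented example).
    javascript = 'var base = "/articles/"; window.location = base + articleId + ".html";'

    # Without running the script, the crawler can only guess from string literals.
    candidates = re.findall(r'"(/[^"]*)"', javascript)
    print(candidates)  # ['/articles/'] is extracted, but the real pages are /articles/<id>.html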

robots.txt

The spider of the Luxembourg Web Archive respects the robots.txt file, with a few exceptions. Any file necessary for the complete display of a webpage (e.g. CSS, images, …) is downloaded even if it appears in the robots.txt exclusion list. Moreover, the landing pages of all sites are collected regardless of the robots.txt settings. In any event, the BnL reserves the right to change this policy as needed, in accordance with the “Règlement grand-ducal du 6 novembre 2009 relatif au dépôt légal”.
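As an illustration of the underlying mechanism (not of the archive’s actual software), the Python sketch below shows how a generic crawler consults robots.txt rules; the domain and the rules themselves are invented.

    from urllib import robotparser

    # A hypothetical robots.txt, given here directly as lines of text.
    rules = [
        "User-agent: *",
        "Disallow: /css/",
        "Disallow: /private/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    print(rp.can_fetch("*", "https://example.lu/private/report.pdf"))  # False: an ordinary crawler skips it
    print(rp.can_fetch("*", "https://example.lu/css/site.css"))        # False as well, but the LWA spider
    # would still download the stylesheet because it is needed to display the page,
    # and it always collects a site's landing page regardless of these rules.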

We have our methods

Broad crawls

Broad (or domain) crawls are run twice a year and create a snapshot of all “.lu” addresses, plus additional lists of websites determined by the Luxembourg Web Archive.
These crawls cover a large number of websites at once, but they can be too slow to capture sites that change rapidly or that disappear between two harvests.

Targeted crawls

Targeted (or event) crawls try to harvest as much information as possible about a certain event over a limited time frame. The seed lists for event crawls are short, but the frequency of captures is likely to be higher. Event crawls always have a start and an end date, which may be determined in advance (e.g. for election crawls) or may depend on the urgency of unexpected events (e.g. natural catastrophes).

Selective crawls

Selective crawls cover a specific topic or field of interest that has a higher priority for the web archive. This priority may be linked to the pace at which the information changes, or to the importance of the topic for the cultural heritage of Luxembourg. The seed lists expand over time and receive additional harvests, complementing the coverage provided by broad crawls.

Collection policy

Every collection, whether it results from broad crawls or targeted crawls, is defined by our collection policy.
The different policy components, such as the framework, objectives and contents of the collection, are explained in the following categories.

Collection proposal

Intent
What is the topic, occasion or motivation behind the creation of the collection?
Scope
How broad or narrow is the field of interest for this topic, and what is the time frame for harvests?

Strengths and areas of interest

Different types of websites pose different challenges when it comes to capturing their contents. This means that there are necessarily limits to the completeness of coverage, both in the number of seeds and in the changes captured on each website. Indicating selective, extensive, comprehensive or broad coverage for different parts of a collection helps in understanding the limitations and priorities of the project.

Harvest strategy

Seed list
Which seeds are used to build the collection, and which types of websites are included.

Frequency of captures
How often the different types of seeds should be captured.
Collection size
Volume of data and number of documents to be captured.
Document types
The types of documents to be found in the collection, and possible exclusions.

Criteria for foreign websites

Does the topic warrant the inclusion of websites outside of the .lu domain?
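For illustration only, these policy components could be written down as one structured record per collection. The sketch below is hypothetical: the names, dates and figures are invented and do not describe an actual LWA collection.

    # Hypothetical collection proposal, mirroring the policy components above.
    collection_proposal = {
        "intent": "Document the public debate around a national election",
        "scope": {
            "breadth": "selective",                            # how narrow or broad the field of interest is
            "harvest_window": ("2023-06-01", "2023-11-30"),    # invented time frame
        },
        "harvest_strategy": {
            "seed_list": ["https://party-x.lu/", "https://news-example.lu/"],  # placeholder seeds
            "capture_frequency": {"party websites": "weekly", "news sites": "daily"},
            "collection_size": {"max_data_gb": 500, "max_documents": 2_000_000},
            "document_types": {"included": ["html", "pdf", "images"], "excluded": ["video"]},
        },
        "foreign_websites": "include only if the content clearly concerns Luxembourg",
    }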

Targeted crawls:

How we collect

Based on our experience with targeted crawls on local, national, European and social elections from 2017 to 2019, we can illustrate the process of building a collection and how selection criteria shape our special collections.

Selection criteria

We need to make sure we look in the right places and select the right elements for each collection; in practice, this means deciding whether a seed should be included in the seed list. The criteria we set up help web archive users follow the structure and reasoning behind our special collections. Furthermore, a consistent policy helps with organising and searching for additional seeds throughout the project.

Topicality:

Is this about politics?
Does the website show information about politics or topics relevant to the elections?

Relevance:

Is this important?
How relevant and important is the website’s content to the topic, and how likely is it that more interesting information will appear at a later point in time? This matters for determining the depth of the harvest, because not all of a website might be relevant to the topic. It can be difficult to predict how relevant a seed will turn out to be, so the general approach is to be more inclusive rather than too selective.

Public interest:

Should people know about this?
This question plays a role in drawing certain lines: for instance, “private” profiles of political candidates on social media, which do not represent them in their function as politicians, would not be included because they concern only their private lives and are therefore not relevant to the elections. These profiles might be publicly visible on social networks, but they are clearly not meant to be interpreted in a political context. The same principles were largely applied to tabloid and yellow press articles about a person’s private life.

Uniqueness of information:

Haven’t I seen this somewhere else?
This is an important factor in determining the urgency and frequency of collecting websites. Even though the websites of political parties are an essential part of the collection, their content can be rather static and may only be updated at rare intervals during the election campaign. Moreover, their content is repeated many times over in press conferences, posters, flyers, etc. More unique information, found for example in personal blogs, video channels, comments and discussions on social media, is also more likely to have a short life span on the web, which makes some of these sources the most difficult to harvest.

If there are no major objections with respect to any of these criteria, a website can be added as a seed, which helps in systematically looking for similar websites. Furthermore, this examination gives a first idea of the next step of the archiving process.

Research and organisation

The categories could also be described as “stakeholders in the elections”. In order to find as much information on each stakeholder as possible, we research the different types of seeds for each entry in a category. For example, for political party “X” we look up its website, Facebook page, Twitter page, etc., and then do the same for political party “Y”, for all local sections, all candidates and so on. Not every stakeholder has a seed of each type, and the correct seeds can be difficult to find. Therefore, it is necessary to check entry by entry for every seed type, as sketched below.
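As a hypothetical illustration of this entry-by-entry research, the stakeholders and their seed types can be pictured as a small grid; the names and URLs below are invented.

    # Hypothetical seed research grid: one entry per stakeholder, one field per seed type.
    # None marks a seed type that does not exist or has not been found yet.
    seed_grid = {
        "Party X": {
            "website":  "https://party-x.lu/",
            "facebook": "https://www.facebook.com/partyx",
            "twitter":  "https://twitter.com/partyx",
        },
        "Party Y": {
            "website":  "https://party-y.lu/",
            "facebook": None,
            "twitter":  "https://twitter.com/party_y",
        },
    }

    # Listing the gaps shows which entries still have to be checked, one by one.
    missing = [(stakeholder, seed_type)
               for stakeholder, seeds in seed_grid.items()
               for seed_type, url in seeds.items() if url is None]
    print(missing)  # [('Party Y', 'facebook')]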

Archiving social media

Social media arguably represent some of the most unique content on the web, with no comparable counterpart in print media. Especially in the context of politics, the campaign information on political party websites is for the most part also published in brochures, flyers and posters. Social media allow politicians to have more individual platforms and, most importantly, exchanges with the public, other politicians, journalists, etc. Social media are also the most volatile source of information, since new content appears and disappears very quickly. For a number of technical and budgetary reasons, social media are the most difficult and time-consuming part of an election campaign to research and harvest.

Facebook

Costly and often unreliable – Facebook crawls have proven to produce results of varying quality, due to Facebook’s active efforts to prevent the archiving of pages and profiles.
In many cases we can only include Facebook seeds to show that they are part of the collection, but we are far from being able to expect any sense of completeness in coverage.

Twitter

Less costly and more reliable than Facebook, Twitter crawls produce better results and are mostly limited by time constraints for harvests, since we might still miss information even with daily crawls.
Twitter also follows a different policy regarding web archiving: it is actually encouraged, and several methods are made available for that purpose.

YouTube

Video files are generally larger and take up more space than other types of media. Therefore, YouTube pages can be costly to harvest.

Instagram

We have very little experience in archiving Instagram pages, but the plan is to include Instagram in future targeted crawls.

Exclusion criteria

While the Luxembourg Web Archive strives to grow over time and expand its inventory of the Luxembourg web space, there are a few exceptions to this inclusive approach:

– Websites will not be harvested, or will only be partially harvested, if the technical resources necessary for their archiving are not justified by their relation to the LWA’s areas of interest.
– Websites will be excluded from the browsable Web Archive if their contents have been deemed illegal for any reason.
Exclusions on the grounds of privacy concerns are generally not applicable. Under the legal deposit mandate, the National Library of Luxembourg is only concerned with the browsable, openly accessible part of the web, also known as the surface web. Since this is only part of the World Wide Web (which is itself only part of the Internet), our field of activity does not come into contact with information that is not supposed to be publicly available.
Even though the Luxembourg Web Archive is not required to ask for permission to archive a website under the legal deposit, we try to be as transparent and communicative as possible about our interactions with websites.