The following requirements can help website developers prepare their websites to improve the quality and completeness of web archive captures. These recommendations are based on the criteria established by the National Archives’ UK Government Web Archive.
HTML versions and HTTP protocols
We can archive and replay all versions of HTML to date. Serve everything on your website over either the HTTP or HTTPS protocol.
Documents and file sharing
We cannot capture file content hosted on web-based collaborative platforms and file-sharing services such as SharePoint, Google Docs and Box. You should make these files available in ways that are accessible to the crawler – for example, as downloadable files hosted on the root domain.
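For illustration, the fragment below contrasts a link into a file-sharing service with a directly hosted file. The file names and paths shown are hypothetical.

    <!-- Not capturable: the file lives on a third-party collaboration platform -->
    <a href="https://docs.google.com/document/d/EXAMPLE-ID/edit">Annual report (Google Docs)</a>

    <!-- Capturable: the same document hosted as a downloadable file on the root domain -->
    <a href="https://mywebsite.com/documents/annual-report.pdf">Annual report (PDF)</a>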
Links and URLs
– ‘Orphaned’ content (content that is not linked to from within your website) will not be captured. You will need to provide a list of these orphaned URLs as an XML sitemap or supplementary URL list before the crawl is launched (see the sitemap sketch after this list).
– Links in binary files attached to websites (links included in .pdf, .doc, .docx, .xls, .xlsx and .csv documents) cannot be captured. All resources linked to in these files must also be linked to from simple web pages, or you will need to provide a list of these links as an XML sitemap or supplementary URL list before the crawl is launched.
– Where possible, use meaningful URLs such as https://mywebsite.com/news/new-report-launch rather than https://mywebsite.com/5lt35hwl. As well as being good practice, this can help when you need to redirect users to the web archive.
– Avoid using dynamically generated URLs.
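A supplementary URL list supplied as an XML sitemap follows the standard sitemap protocol. The sketch below illustrates the structure; the URLs listed are hypothetical.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- One <url> entry per page or file the crawler would not otherwise discover -->
      <url>
        <loc>https://mywebsite.com/reports/orphaned-report.html</loc>
      </url>
      <url>
        <loc>https://mywebsite.com/downloads/data-annex.csv</loc>
      </url>
    </urlset>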
Interactive graphs, maps and charts
– Avoid interactive content where possible as we often have difficulty archiving these resources and retaining their functionality.
– If you have vital interactive graphs, maps or charts please let us know as we may be able to attempt to capture them using experimental technology.
– In all cases, if content of these types cannot be avoided, please provide alternative ‘crawler-friendly’ methods for accessing and displaying it. Where visualisations are used, the underlying data should always be accessible in as simple a form as possible – for example in a .txt or .csv file (see the sketch below).
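As an illustration, the fragment below pairs a JavaScript-rendered chart with a plain link to the underlying data. The element names and file paths are hypothetical.

    <!-- Interactive chart rendered by JavaScript; the crawler may not capture it -->
    <div id="spending-chart" data-source="/data/spending-2023.csv"></div>

    <!-- Crawler-friendly fallback: a plain link to the underlying data file -->
    <p><a href="/data/spending-2023.csv">Download the spending data (CSV)</a></p>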
Database and lookup functions
– If your site uses databases to support its functions, these can only be captured in a limited fashion. We can capture snapshots of database-driven pages if these can be retrieved via a query string, but cannot capture the database used to power the pages.
– For example, we should be able to capture the content generated at https://www.mywebsite.lu/mypage.aspx?id=12345&d=true, since the page will be dynamically generated when the web crawler requests it, just as it would be for a standard user request. This works where the data is retrieved using an HTTP GET request, as in the example above (see the sketch below).
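The hypothetical lookup forms below illustrate the difference. When a form submits via GET, each result page has its own query-string URL that the crawler can request; when it submits via POST, the result page has no such URL and cannot be captured.

    <!-- Capturable: results appear at a URL with a query string, e.g. /mypage.aspx?id=12345 -->
    <form action="/mypage.aspx" method="get">
      <input type="text" name="id">
      <button type="submit">Look up</button>
    </form>

    <!-- Not capturable: the result page has no stable query-string URL for the crawler to request -->
    <form action="/mypage.aspx" method="post">
      <input type="text" name="id">
      <button type="submit">Look up</button>
    </form>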
In most cases, a website that has been designed to meet W3C web accessibility standards should also be easy to archive.
Always use simple, standard web techniques when building a website. There are few limits to a website builder’s creativity when using the standard World Wide Web Consortium (W3C) recommendations. Overly complex and non-standard website design increases the likelihood of problems for users, for web archiving and for search engine indexing.
Intranet and secured content areas
We cannot archive content that is protected behind a login, even if you provide us with the login details.
If content is hosted behind a login because it is not appropriate for it to be publicly accessible, it should be managed there until its sensitivity falls away, at which point it can be published to the open website. Alternatively, you can liaise with your information management team about whether it should be preserved through other methods as part of the public record.