Is your website archive-compliant?

The following requirements can help website developers prepare their websites and so improve the quality and completeness of web archive captures. These recommendations are based on the criteria established by the National Archives’ UK Government Web Archive.

1

HTML versions and HTTP protocols

We can archive and replay all versions of HTML to date. Serve everything on your website over either the HTTP or HTTPS protocol.

2

Video, infographic, audio and multimedia content

– Streaming video or audio cannot be captured; make it also accessible via progressive download, over HTTP or HTTPS, using absolute URLs, and without obscuring the source URL (see the sketch at the end of this list).

– Link to audio-visual (AV) material with absolute URLs instead of relative URLs.


– Provide transcripts for all audio and video content.


– Provide alternative methods of accessing information held in infographics, videos and animations.


– We cannot usually capture content which is protected by a cross-domain file or embedded within a cross-domain iframe. This is most often multimedia content which is embedded in web pages but hosted on another domain. If this is the case for any of your content, please ensure that it is made available to the crawler in an alternative way.
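
– For illustration, the sketch below embeds a video via an absolute URL to a progressive-download file and links a transcript as an alternative; the domain and file names are hypothetical:

<!-- absolute URL to a progressive-download MP4, not a streaming manifest -->
<video controls>
  <source src="https://www.mywebsite.lu/media/launch-event.mp4" type="video/mp4">
  Download the video: <a href="https://www.mywebsite.lu/media/launch-event.mp4">launch-event.mp4</a>
</video>
<p><a href="https://www.mywebsite.lu/media/launch-event-transcript.html">Read the transcript of this video</a></p>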

3

Documents and file sharing

We cannot capture file content hosted on web-based collaborative platforms and file-sharing services such as SharePoint, Google Docs and Box. You should make these files available in ways which are accessible to the crawler – for example as downloadable files hosted on the root domain.
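
For example, rather than linking to a document on a file-sharing platform, host a copy on the root domain and link to it directly (the URLs here are hypothetical):

Instead of: <a href="https://docs.google.com/document/d/abc123/edit">Annual report</a>
Prefer: <a href="https://www.mywebsite.lu/documents/annual-report.pdf">Annual report (PDF)</a>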

4

Site structure and sitemaps

– Include a human-readable HTML sitemap on your website. It makes content more accessible, especially when users are accessing the archived version, as it provides an alternative to interactive functionality.

– Have an XML sitemap. This greatly speeds up our ability to capture and quality-assure our website archives.
Please see https://www.sitemaps.org/ for further details. Link the sitemap from your robots.txt file (RFC 9309), as shown in the example at the end of this list.

– Where possible, keep all content under one root URL. Any content hosted under root URLs other than the target domain, sub-domain or microsite is unlikely to be captured. Typical examples include documents hosted in the cloud (such as on amazonaws.com), newsletters hosted by services such as Mailchimp, and services that link through external domains.


– If using pagination (../page1, ../page2 and so on), you will also need to include all URLs from that pagination structure in your HTML or XML sitemap, as the crawler can sometimes misinterpret recurrences of a similar pattern as a crawler trap and therefore may only crawl to a limited depth.
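
– For illustration, a minimal XML sitemap listing the paginated URLs, together with the robots.txt line that points the crawler at it (all URLs are hypothetical; see sitemaps.org for the full protocol):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mywebsite.lu/</loc></url>
  <url><loc>https://www.mywebsite.lu/news/page1</loc></url>
  <url><loc>https://www.mywebsite.lu/news/page2</loc></url>
</urlset>

And in robots.txt:

Sitemap: https://www.mywebsite.lu/sitemap.xml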

5

Links and URLs

– ‘Orphaned’ content (content that is not linked to from within your website) will not be captured. You will need to provide a list of orphaned URLs as an XML sitemap or supplementary URL list before the crawl is launched (see the example at the end of this list).

– Links inside binary files attached to websites (links included in .pdf, .doc, .docx, .xls, .xlsx and .csv documents) cannot be followed by the crawler. All resources linked to in these files must also be linked to on simple web pages, or you will need to provide a list of these links as an XML sitemap or supplementary URL list before the crawl is launched.


– Where possible, use meaningful URLs such as https://mywebsite.com/news/new-report-launch rather than https://mywebsite.com/5lt35hwl. As well as being good practice, this can help when you need to redirect users to the web archive.


– Avoid using dynamically-generated URLs.
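
– For illustration, a supplementary URL list can be as simple as a plain-text file with one absolute URL per line (the URLs below are hypothetical; confirm the preferred format with the archiving team before the crawl):

https://www.mywebsite.lu/reports/2019-review.html
https://www.mywebsite.lu/documents/annex-b.pdf
https://www.mywebsite.lu/news/archive/old-press-release.html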

6

Dynamically-generated content and scripts

– Client-side scripts should only be used where they are the most appropriate way to deliver the intended functionality.

– Make sure any client-side scripting is publicly viewable over the internet – do not use encryption to hide the script.

– As much as you can, keep your code in readily accessible, separate script files (for example, with a .js extension) rather than coded directly into content pages, as this will help diagnose and fix problems.

– Avoid using dynamically-generated date functions. Use the server-generated date rather than the client-side date: any date generated dynamically on the client will show the date on which the archived website is viewed, not the date of capture.

– Avoid using dynamically-generated URLs.

– Page content that is generated dynamically by client-side scripting cannot be captured. This may affect the archiving of websites constructed in this way.

– Wherever possible, the page design should make sure that content is still readable and links can still be followed by using the <noscript> element (see the sketch at the end of this list).

– When using JavaScript to design and build, follow a ‘progressive enhancement’ approach. This works by building your website in layers:

1. Code semantic, standards-compliant (X)HTML or HTML5
2. Add a presentation layer using CSS
3. Add rich user interactions with JavaScript

– This is an example of a complex combination of JavaScript, which will cause problems for archive crawlers, search engines, and some users: javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSectionItems','Page$1')


– This is a preferred example of a well-designed URL scheme with simple links:

<a href="content/page1.htm" onclick="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSectionItems','Page$1')">1</a>

– Always design for browsers that do not support JavaScript or have JavaScript disabled.

– Provide alternative methods of access to content, such as plain HTML.
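
– As a sketch of the layered approach above, the link below works as plain HTML without JavaScript, is enhanced by an external script where one is available, and uses <noscript> to point users at an alternative route; the file names are hypothetical:

<!-- the href works on its own; enhance.js (hypothetical) may add richer behaviour -->
<a href="content/page1.htm" class="enhanced-nav">1</a>
<script src="/scripts/enhance.js"></script>
<noscript>
  <p>JavaScript is disabled. All pages remain reachable via the <a href="/sitemap.htm">HTML sitemap</a>.</p>
</noscript>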

7

Interactive graphs, maps and charts

– Avoid interactive content where possible as we often have difficulty archiving these resources and retaining their functionality.

– If you have vital interactive graphs, maps or charts please let us know as we may be able to attempt to capture them using experimental technology.


– In all cases, if content of these types cannot be avoided, please provide alternative ‘crawler friendly’ methods for accessing and displaying it. Where visualisations are used, the underlying data itself should always be accessible in as simple a way as possible – for example in a .txt or .csv file.
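
– For example, a page carrying an interactive chart can also link directly to the underlying data (the file names here are hypothetical):

<div id="budget-chart"><!-- interactive chart rendered by client-side script --></div>
<p>Data behind this chart: <a href="https://www.mywebsite.lu/data/budget-2023.csv">budget-2023.csv</a></p>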

8

Menus, search and forms

– Use static links, link lists and basic page anchors for menus and navigation elements, rather than using JavaScript and dynamically generated URLs.

– Any function that requires a ‘submit’ operation, such as dropdown menus, forms, search and checkboxes, will not archive well. Always provide an alternative method to access this content wherever possible, and make sure you provide a list of links that are difficult to reach as an XML sitemap or supplementary URL list before the crawl is launched.
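
– As a sketch, a navigation menu built from static links (hypothetical paths) can be followed by the crawler, unlike a menu that assembles its URLs in JavaScript or relies on a ‘submit’ action:

<nav>
  <ul>
    <li><a href="/about/">About us</a></li>
    <li><a href="/news/">News</a></li>
    <li><a href="/publications/">Publications</a></li>
  </ul>
</nav>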

9

Database and lookup functions

– If your site uses databases to support its functions, these can only be captured in a limited fashion. We can capture snapshots of database-driven pages if these can be retrieved via a query string, but cannot capture the database used to power the pages.

– For example, we should be able to capture the content generated at https://www.mywebsite.lu/mypage.aspx?id=12345&d=true since the page will be dynamically generated when the web crawler requests it, just as it would be for a standard user request. This works where the data is retrieved using an HTTP GET request, as in the above example.
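
– For the crawler to discover such pages, each query-string URL still needs to be linked from a simple page or listed in the sitemap, for example:

<a href="https://www.mywebsite.lu/mypage.aspx?id=12345&amp;d=true">Record 12345</a>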

10

POST requests and Ajax

– We can’t archive content that relies on HTTP POST requests, since no query string is generated. Using POST parameters is fine in certain situations, such as search queries, but you must make sure that the content is also accessible via a query-string URL that is visible to the crawler, otherwise it will not be captured (see the sketch at the end of this list).

– We are unlikely to be able to capture and replay any content which uses HTTP POST requests, Ajax or similar.


– Always provide an alternative method to access this content wherever possible, and make sure you provide a list of links that are difficult to reach as an XML sitemap or supplementary URL list before the crawl is launched.
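
– As a sketch, a search form that submits with method="get" produces a query-string URL (for example /search?q=budget) which can also be linked or listed for the crawler, whereas a method="post" form produces no such URL; the paths here are hypothetical:

<!-- GET: results have a bookmarkable, crawlable URL such as /search?q=budget -->
<form action="/search" method="get">
  <input type="text" name="q">
  <button type="submit">Search</button>
</form>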

11

W3C compliance

In most cases, a website that has been designed to meet W3C web accessibility standards should also be easy to archive.

Always use simple, standard web techniques when building a website. There are few limits to a website builder’s creativity when using the standard World Wide Web Consortium (W3C) recommendations. Using overly complex and non-standard website design increases the likelihood of problems for users, for web archiving, and for search engine indexing.

12

Website backups (as files)

We cannot accept website ‘dumps’ or ‘backups’ from content management systems or databases, whether supplied on hard drives, CDs, DVDs or any other external media. Only snapshots directly crawled by our system are accepted into the archive.

13

Intranet and secured content areas

We cannot archive content that is protected behind a login, even if you provide us with the login details.

If content is hosted behind a login because it is not appropriate for it to be publicly accessible, it should be managed there until its sensitivity falls away and it can then be published to the open website. Alternatively, you can liaise with your information management team about whether it should be preserved through other methods as part of the public record.