Understanding web legal deposit

The BnF is responsible for the legal deposit of the French Web. Its collection of archived sites, which is one of the oldest and richest in the world, is open to anyone wishing to carry out research.

Machine room © Gilles Coulon/ Tendance Floue/ BnF

Legal framework

Legal deposit, which was introduced in the 16th century, has the aim of preserving the memory of all French publishing production, whatever the target audience (scientific, artistic, leisure, etc.). It has adapted to changes in the media it uses, creating a unique and irreplaceable heritage collection.

Since the DADVSI law (law on copyright and related rights in the information society) of 1 August 2006 and its implementing decree of 2011, the BnF has been responsible for collecting, preserving, referencing and providing public access to websites in the French domain under the legal deposit scheme. Unlike traditional legal deposit for printed publications, web legal deposit does not require any action on the part of the website producer, as the data is collected automatically by a bot. This law, which transposed the European directive on copyright and related rights in the information society into French law, also introduced a number of exceptions to copyright and related rights into French legislation.

In particular, it introduced into the French Heritage Code (Articles L.132-4, L.132-5 and L.132-6) an exception to intellectual property rights (copyright, related rights and the rights of database producers) in favour of organisations responsible for legal deposit. Organisations responsible for legal deposit may now legally, without having to request prior authorisation or pay any remuneration (French Heritage Code, Articles L.131-1 to L.133-1 and R.131-1 to R.133-1):

reproduce works for the purposes of legal deposit on any medium and by any process: collection, preservation, consultation,
make these works available for consultation by accredited researchers on individual consultation workstations.

How the web crawler works

The BnF uses Heritrix, a web crawler, which acts as a “hoover” or “harvester” of websites. Launched on a list of initial URL addresses known as “seeds”, it extracts the links in the code of the pages, following them like an automated web surfer. It then copies the components (pages, images, etc.) it finds that fall within the scope of the crawl.
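The seed-and-follow process described above can be illustrated with a minimal sketch. This is not Heritrix's actual implementation: the in-memory `PAGES` dictionary, the `crawl` and `in_scope` functions and the host-based scope rule are all hypothetical, chosen so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "https://example.fr/": '<a href="/a">A</a> <a href="https://other.com/">ext</a>',
    "https://example.fr/a": '<a href="/b">B</a> <img src="/logo.png">',
    "https://example.fr/b": '<a href="/">home</a>',
    "https://example.fr/logo.png": "",
    "https://other.com/": '<a href="https://example.fr/hidden">x</a>',
}


class LinkExtractor(HTMLParser):
    """Collects href/src attribute values found in a page's markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def in_scope(url, allowed_hosts):
    """Assumed scope rule: keep only URLs whose host is explicitly allowed."""
    return urlparse(url).hostname in allowed_hosts


def crawl(seeds, allowed_hosts, fetch=PAGES.get):
    """Breadth-first crawl from the seeds; returns URLs in capture order."""
    queue = deque(seeds)
    seen = set(seeds)
    captured = []
    while queue:
        url = queue.popleft()
        body = fetch(url)
        if body is None:  # nothing to copy at this address
            continue
        captured.append(url)
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen and in_scope(absolute, allowed_hosts):
                seen.add(absolute)
                queue.append(absolute)
    return captured


print(crawl(["https://example.fr/"], {"example.fr"}))
```

In this sketch the link to `other.com` is extracted but never followed, because it falls outside the assumed scope. A production crawler applies far richer scope rules and handles politeness delays, retries and archival storage formats.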

Frequency and depth

The frequency and depth (all or part of a site) of crawls are adapted to the nature of the sites and the rate at which they are updated, so as to keep successive versions that are representative of their development. Each capture is precisely dated and referenced, making it possible to go back in time and navigate within the archived sites using the “Archives de l’internet” application.
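Because each capture is dated, going back in time amounts to finding the capture closest to a requested date. The sketch below shows one way this lookup could work. It is an assumption made for illustration, not a description of how the "Archives de l'internet" application is built, and the `closest_capture` function and the sample dates are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def closest_capture(timestamps, target):
    """Return the capture datetime nearest to target.

    `timestamps` must be sorted in ascending order.
    """
    if not timestamps:
        raise ValueError("no captures available")
    i = bisect_left(timestamps, target)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = timestamps[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda t: abs(t - target))


# Hypothetical capture dates for a single archived site.
captures = [
    datetime(2010, 1, 1),
    datetime(2015, 6, 1),
    datetime(2020, 3, 15),
]

print(closest_capture(captures, datetime(2016, 1, 1)))
```

A site crawled more frequently has more entries in such a list, so the nearest capture tends to lie closer to the requested date.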

Points of attention and limitations

Web technologies evolve faster than crawling tools, so the harvesting bot may fail to collect a site in its entirety. This is particularly the case with dynamic websites. Some sites also block the BnF’s bots, which prevents them from being crawled.

Collection from social media is unstable: depending on the period, the collection of certain content (Facebook, Instagram, Twitter) is partial or impossible owing to security controls. For example, Facebook content has not been collectable since the end of 2020, nor Twitter content since July 2023.

Videos embedded in a site cannot be collected for technical reasons. However, videos have been collected from YouTube channels (since 2017) and Dailymotion channels (from 2007 to 2013).

How crawls are organised

Collection does not claim to be exhaustive; it is based on the principle of representativeness. To this end, the BnF combines two complementary crawl methods:

broad crawls: Carried out once a year, the aim of this type of crawl is to have a sample of as many sites as possible. The list of these sites is provided by partner registrars, such as the Association française pour le nommage de l’internet en coopération (Afnic) and OVH. Every year, the BnF strives to improve its web coverage: between 2007 and 2022, the number of domains collected rose from 0.9 million to 5.8 million (i.e. around 60% of the French Web).
focused crawls: These crawls vary in frequency and depth and cover several tens of thousands of sites selected by librarians at the BnF and in printer legal deposit libraries in the French regions, as well as by specialists and researchers.
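The broad crawl figures quoted above can be turned into two rough derived numbers. The calculation below is back-of-the-envelope arithmetic on the rounded values given in the text, not an official BnF statistic, and the variable names are illustrative.

```python
# Figures quoted in the text (rounded).
domains_2007 = 0.9e6
domains_2022 = 5.8e6
coverage_2022 = 0.60  # "around 60%" of the French Web

# How many times more domains were collected in 2022 than in 2007.
growth_factor = domains_2022 / domains_2007

# Total size of the French Web implied by the 60% coverage figure.
implied_total = domains_2022 / coverage_2022

print(f"growth factor: {growth_factor:.1f}x")
print(f"implied total: {implied_total / 1e6:.1f} million domains")
```

Since the 60% figure is itself approximate, the implied total should be read as an order of magnitude only.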

Within these focused crawls, so-called “ongoing” crawls focus on reference sites, in line with the BnF’s collections, in particular its print collections. Cooperative thematic crawls document cross-disciplinary themes and major events such as elections.

Finally, there are urgent crawls, in response to unexpected events that have a major impact on society and are relayed spontaneously via social media.

News is well represented in the collections through online press sites, newspapers in PDF format, and titles from the regional daily press and social media.

Collection building policy

Created by the BnF and its partners in the French regions, these collections cover a wide range of disciplines and themes. Their aim is to build up, by sampling, the memory of the French Web, and to give an account of the diversity of this essential medium for the study of representations and changes in our ways of creating, communicating, entertaining ourselves, campaigning, travelling, etc. The oldest collections were acquired retrospectively from Internet Archive for the period 1996 to 2000.

The collections do not aim to be exhaustive. The goal is a representative sample that captures the actions, knowledge, ideas and representations circulating on the Web on a given subject, and that reflects the diversity of the Web as a medium and space for exchange.

The selections complement and extend the BnF’s printed collections, in accordance with the Documentary Charter. The aim is to:

Collect objects that are now natively digital: candidate manifestos, research blogs, performance programmes, etc.;
Cover current research in a given discipline: academic sites, organisation of a disciplinary field, conferences and events, training bodies and programmes, etc.;
Capture the appropriation of a field by various stakeholders, the diversity of actions and representations (academic web and also amateur history blogs, blogs by well-known writers and readers’ blogs, participatory science, resources on both art music and popular music);
Document amateur and emerging practices (online writing, digital art), and everyday sites (video games);
Record debates and discussion, and diversity of opinion;
Document renewed forms of social commitment and activism with the arrival of the internet (online voting, digital public services, etc.).

Discover the BnF’s web archive collections

Contact

Web legal deposit

depot.legal.web@bnf.fr
