Your website is harvested by the BnF's robot
This operation is performed in accordance with the legal deposit of the Internet as established by the French Heritage Code (art. L131-1 to L133-1 and R131-1 to R133-1), following the Law on Copyright of 1 August 2006. Legal deposit is one of the main means available to the BnF to ensure the growth and development of its collections.
Harvesting settings
The BnF uses a spider called Heritrix (http://crawler.archive.org) to harvest websites. The robot’s identification field is “User-Agent: Mozilla/5.0 (compatible; bnf.fr_bot; …)”. It always applies strict politeness rules (enforcing a delay between two successive requests) so as not to overload producers’ servers.
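Site administrators who want to identify this traffic in their logs can look for the robot’s User-Agent token. A minimal sketch, assuming log analysis in Python; the function name and the sample strings are illustrative (the full BnF User-Agent is abbreviated in this notice and is not reproduced here):

```python
def is_bnf_bot(user_agent: str) -> bool:
    """Return True if a request's User-Agent identifies the BnF legal-deposit robot."""
    # The "bnf.fr_bot" token is the identification field quoted in the notice.
    return "bnf.fr_bot" in user_agent

# Sample User-Agent strings (the second is abbreviated, as in the notice):
print(is_bnf_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))    # False
print(is_bnf_bot("Mozilla/5.0 (compatible; bnf.fr_bot; ...)"))  # True
```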
Robots.txt protocol
In accordance with the Heritage Code (art L132-2-1), the BnF is authorized to disregard the robot exclusion protocol, also called robots.txt. This protocol aims to direct the activity of crawlers used by search engines, by filtering out non-text and/or non-indexable content (binary files such as images, sounds, videos, style sheets or administration files).
To accomplish its legal deposit mission, the BnF may choose to collect some of the files covered by robots.txt when they are needed to reconstruct the original form of the website (particularly image or style sheet files). This non-compliance with robots.txt does not conflict with the protection of private correspondence guaranteed by law, because all data made available on the Internet are considered public, whether or not they are filtered by robots.txt.
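To illustrate the mechanism the notice describes, here is a short sketch of how an ordinary crawler honours robots.txt, using Python’s standard `urllib.robotparser`. The rules below are a hypothetical example, not BnF policy: a compliant search-engine crawler would skip the `/images/` path, whereas the BnF may still fetch such files to reconstruct a page’s original form.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules excluding non-text content from indexing.
rules = """User-agent: *
Disallow: /images/
Disallow: /css/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler checks each URL before fetching it:
print(parser.can_fetch("*", "https://example.org/article.html"))     # True
print(parser.can_fetch("*", "https://example.org/images/logo.png"))  # False
```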
Javascript processing
Interactive web pages use JavaScript, which builds links and triggers actions based on events (page load, navigation in a menu, mouse clicks, scrolling, etc.).
As Heritrix cannot process all JavaScript precisely, it may generate false URLs; this behaviour is not considered an error in the robot’s functionality (https://github.com/internetarchive/heritrix3/wiki/crawling%20JavaScript).
The BnF strives to avoid generating these false URLs, and to focus on relevant URLs, by integrating numerous filters into its harvest profiles.
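A toy illustration of why false URLs arise, assuming a naive extractor in Python; the page fragment and the regular expression are illustrative, not Heritrix’s actual extraction logic. The string `"/page/" + id` only becomes a complete URL when the script runs, but a crawler scanning the source text may pick up the quoted fragment `/page/` as if it were a link:

```python
import re

# A page fragment mixing a real link with a script that builds a URL at runtime.
snippet = '<a href="/about.html">About</a>\n<script>document.location = "/page/" + id;</script>'

# Naive extraction: treat every quoted string starting with "/" as a URL.
candidates = re.findall(r'"(/[^"]*)"', snippet)
print(candidates)  # ['/about.html', '/page/']
```

Here `/page/` is a false URL: it never exists on the server, which is why harvest profiles filter such candidates out.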
Inquiries
If the performance of your website is affected by this operation, please report it by email to robot@bnf.fr. We will propose a solution as soon as possible.