Web scraping and generative AI: how to prevent unauthorized personal data collection?

Generative Artificial Intelligence (“AIG”) is certainly one of the most powerful and challenging technologies in the current landscape, offering undeniable benefits in terms of efficiency, faster operations and improved quality of work. At the same time, it is one of the most controversial, largely because of the data protection issues it raises.

AIG models require copious amounts of data (including personal data) for their training, often sourced through large-scale, indiscriminate collection from the web. One of the most widespread techniques used to gather training data for such systems is web scraping.

Against this backdrop, the Italian Data Protection Authority (the “Garante Privacy”) opened an investigation into web scraping, which led to the publication on May 20 of a specific information note on web scraping and AIG (the “Information Note”).

With the Information Note, the Garante Privacy aims to outline possible measures that providers of websites and online platforms operating in Italy, whether public or private, acting as controllers of the personal data they publish, could implement to prevent third parties from collecting that data to train AIG models, where such collection is inconsistent with the legal basis and purpose of the publication.

What is web scraping and what issues does it raise regarding personal data protection?

Web scraping is a technique used to collect, store and retain, in a methodical and automated manner, large and indiscriminate amounts of information and data publicly available online or made available in controlled-access areas. The collected data are then used for targeted analysis, processing and other purposes.

As AIG has gained popularity, web scraping has grown exponentially, allowing faster and more comprehensive automated data collection, which is in turn used to train AIG systems. The information that such techniques can extract is diverse and certainly includes personal data: consider, for example, contact details, biometric and geolocation data, personal preferences or even browsing behavior. In such cases, i.e., when web scraping involves the collection of information attributable to an identified or identifiable person, a personal data protection issue arises (for a more in-depth analysis of the additional legal aspects related to and arising from web scraping activities, please refer to our previous contribution, available here, in Italian).

More specifically, in these cases, compliance turns on identifying an appropriate legal basis for the processing of such data and on adhering to the general principles set forth in Regulation (EU) 2016/679 (the “GDPR”). Providers of websites and online platforms, acting as data controllers, must therefore comply with the obligations regarding transparency, publicity, reuse and access, and adopt the necessary security measures. Indeed, the fact that personal data is publicly accessible does not amount to consent to its unrestricted use.

The guidelines from the Garante Privacy to website and online platform providers

In addition to the obligations the GDPR already imposes on data controllers, the Garante Privacy, through its Information Note, set out guidelines for website and online platform providers on the measures they could adopt to mitigate the effects of third-party web scraping aimed at training AIG systems.

Specifically, the Garante Privacy identified four measures (remedial, though not definitive) of a technical, technical-organizational and legal nature:

  1. creation of restricted areas: this involves setting up areas of the website or platform that can be accessed only after registration, thus removing the data from public availability. However, the Garante Privacy emphasizes that this measure must not result in excessive data processing by the controller, in breach of the principle of minimization under Article 5(1)(c) of the GDPR, for instance by imposing additional and unjustified registration requirements on users;
  2. insertion of specific clauses in the terms of service: this is a purely legal safeguard that operates ex post. If such clauses are breached, website and platform providers would be entitled to take legal action against the other party for breach of contract;
  3. network traffic monitoring, by means of a technical device that is able to detect any unusual data flow in and out of a website or online platform, thereby enabling the adoption of appropriate protective countermeasures;
  4. intervention on bots. Since web scraping relies on the use of bots, the Garante Privacy highlights that any technique that can limit bot access represents an effective way to limit the automated data collection activity carried out through such software.

Such techniques include, but are not limited to, CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) checks, which require an action that only a human being can perform, periodic modification of the HTML markup, or embedding content or data within multimedia objects, such as images.
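By way of illustration only (this is not part of the Garante Privacy's guidance), one common bot-limiting technique is to filter incoming requests by their User-Agent header, denying access to known AI crawlers. The bot names below are merely examples of user-agent tokens published by some crawler operators; a real deployment would maintain an up-to-date list and combine this filter with other measures, since user agents can be spoofed:

```python
# Illustrative sketch of a User-Agent filter for known AI crawlers.
# The tokens below are examples only; they are not an exhaustive or
# authoritative list, and spoofed agents will slip through.
BLOCKED_AGENTS = ("GPTBot", "CCBot", "Google-Extended")


def is_allowed(user_agent: str) -> bool:
    """Return False if the request appears to come from a blocked crawler."""
    ua = user_agent.lower()
    return not any(bot.lower() in ua for bot in BLOCKED_AGENTS)
```

In practice such a check would sit in the web server configuration or in request middleware, returning an HTTP 403 when `is_allowed` is false.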

However, as the Garante Privacy points out, none of these measures is in itself sufficient to completely prevent web scraping. These safeguards should therefore be implemented on the basis of an independent assessment by the data controller, conducted case by case according to the specific context and in compliance with the principle of accountability, as well as with the data protection principles set forth in the GDPR.
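Purely as an illustration of the network traffic monitoring measure (point 3 above), the idea can be approximated with a sliding-window rate check per client: a client issuing far more requests than a human plausibly would is flagged for countermeasures. The threshold and window below are hypothetical, and a production system would rely on far more sophisticated anomaly detection:

```python
import time
from collections import defaultdict, deque

# Hypothetical parameters for the sketch; real values depend on the site.
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

# Per-client timestamps of recent requests.
_history: dict[str, deque] = defaultdict(deque)


def is_suspicious(client_ip: str, now: float = None) -> bool:
    """Record a request and report whether the client exceeds the rate limit."""
    now = time.monotonic() if now is None else now
    window = _history[client_ip]
    window.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS
```

A flagged client could then be rate-limited, challenged with a CAPTCHA, or blocked, in line with the protective countermeasures the Information Note contemplates.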

The intervention of the Garante Privacy is crucial for promoting greater awareness among companies in the use of AI tools. In this regard, ensuring a balanced and multidisciplinary approach will be essential.

Indeed, while various technical and legal measures may act as deterrents against unauthorized web scraping, they could also slow down the development of new AIG technologies. Caution must therefore be exercised in their adoption, balancing the interests involved and implementing preventive and mitigating measures that are proportionate and not overly burdensome. This process should also involve all relevant stakeholders, including technical and legal experts in the field.