Web scraping & AI system training: Italian DPA opens a consultation on website security measures

On November 22, 2023, the Italian Data Protection Authority (the “Authority”) announced that it had opened a consultation on the collection of personal data available online to train the algorithms that form the basis of artificial intelligence (“AI”) systems. Specifically, the consultation focuses on public and private websites and assesses whether they have adopted adequate security measures to prevent third parties from collecting personal data on a large scale (also known as “web scraping”) for the purpose of training AI algorithms.

Websites that make personal data available online act as data controllers and are therefore required to adopt appropriate security measures to prevent such data from being unlawfully used by third parties, for example for purposes other than those for which they were originally collected. In this regard, the training of AI systems typically relies on the massive collection of information and personal data that were published online for specific purposes (such as news reporting, administrative transparency or social media disclosure). These data are then reused to train and nurture such systems, enabling them to regenerate, recreate, reprocess and reuse the information for a multitude of purposes that were not made clear at the time of data collection.

Moreover, the consultation aims to collect comments and feedback from various market players, such as business associations, consumer associations, experts and academic representatives, in order to identify security measures, whether already adopted or adoptable, against large-scale data collection.

What do we know about web scraping?
The term “web scraping” (also known as “data scraping”) refers to an IT technique or procedure used to collect data from a website or application by automatic means, often without the permission of its operator. This online data collection is typically carried out using a specialized program or script that simulates the browsing activity of real users, with the aim of automatically analyzing and extracting specific information from public or private domains. In addition, several software programs can process the data to create databases and reprocess the analyzed content, for example into structured formats (usually tabular or textual data) that third parties can easily reuse in their own systems and for their own purposes.
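To make the extraction step concrete, the following is a minimal sketch in Python of how a script can turn page markup into structured records. It parses only an inline sample string and performs no network access; the markup, the class names ("offer", "name", "price") and the helper names are hypothetical and chosen purely for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; a real page would be far more complex.
SAMPLE_HTML = """
<ul>
  <li class="offer"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="offer"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


class OfferParser(HTMLParser):
    """Collects one dictionary per <li class="offer"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []       # extracted records, one per offer
        self._field = None   # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "offer":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a recognized <span>.
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_offers(html):
    """Return the offers found in the markup as a list of dictionaries."""
    parser = OfferParser()
    parser.feed(html)
    return parser.rows
```

The resulting list of dictionaries is the kind of tabular output that can then be stored in a database or reused elsewhere, which is precisely why the nature of the extracted data matters from a legal standpoint.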

Web scraping is a data collection technique commonly used by websites that offer users a comparison of information from different sites. Examples include online platforms that compare the prices of goods and services so that users can buy at the cheapest price, and the programs used by search engines (so-called spiders) that analyze the content of websites in order to rank them.
The technique itself is not considered illegal, but it may raise several critical legal issues, depending on the type of information and personal data extracted and on the purposes for which they are subsequently used.

What are the legal aspects concerning and arising from web scraping activities?
The legal aspects regarding the processing and protection of personal data are well known, not least as a result of the consultation launched by the Authority. In addition, web scraping raises issues regarding the protection of intellectual property when the information extracted is protected by copyright and is used for unauthorized purposes, cybersecurity issues arising from unauthorized access to a website’s servers, and unfair competition issues when competitors’ information is used to obtain a specific commercial benefit.

Focusing on the data protection aspects, the Authority has repeatedly issued sanctioning orders to limit the use of such techniques in violation of the applicable laws, most recently with a measure dated May 17, 2023, which sanctioned the company owning a platform that had created telephone directories by extracting a large number of names, addresses and telephone numbers of online users and publishing them.

The exploitation of personal data obtained through web scraping techniques may, indeed, constitute processing for purposes other than those for which the data were originally collected from the data subject, for example their subsequent disclosure, dissemination or use for marketing and profiling purposes. Where this happens without an appropriate legal basis and, especially, without the free, valid and informed consent of the data subject, both at the time of the actual collection and subsequently when the data are reused, the processing violates the applicable provisions on the protection of personal data.

In this regard, it should be noted that the fact that personal data are public, for example because they are published in registers and records and are consequently available to anyone by default, does not mean that they can be used freely, that is, without considering the provisions of the GDPR. The use of such techniques can therefore result in unlawful processing of personal data even when the information is collected from public databases.

In the same way, the use of personal data to train AI systems, such as ChatGPT, may constitute unlawful processing of personal data when the data are used in the absence of an appropriate legal basis and in violation of the applicable data protection provisions.

Moreover, the use of web scraping techniques to train AI systems raises concerns regarding intellectual property law. In particular, compliance with copyright law is being considered by the European authorities in the context of adopting the AI Act. In this regard, one of the most significant aspects concerns the use of copyrighted works to create databases of information and data used for the training and learning of AI systems (data available and accessible online and obtained through web scraping mechanisms). If these works are protected by copyright, their use without authorization, including reproduction, processing, modification, distribution, and so on, could lead to violations of the proprietary rights belonging to the authors of the works.

Another relevant aspect is that the analysis conducted by AI systems for training purposes can be considered a reproduction, even if only temporary, of the data and sources used, including any protected works or entire portions of the databases employed. This could fall under one of the two exceptions to copyright infringement related to “text and data mining” activities, as outlined in Directive (EU) 2019/790 and in Italian Legislative Decree n. 177/2021, which transposed the Directive. However, it is important to clarify that the extraction of text and data from works and other materials contained in networks or databases, even for profit, is allowed only on condition that:
• legitimate access to the content is obtained for the purpose of extracting text and data;
• the copyright owner and/or the owner of related rights and/or the database owner have not expressly reserved the extraction of text and data (“opt-out mechanism”) “in an appropriate manner, such as machine-readable means in the case of content made publicly available online”, limiting such extraction activities to their exclusive control.
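
As an illustration of what a machine-readable reservation might look like in practice, the sketch below uses a robots.txt file, which is only one possible means and is not a format prescribed by the Directive. It relies on Python's standard urllib.robotparser module; the crawler name "ExampleAIBot", the example.com URLs and the helper may_extract are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site reserves all content against one
# named crawler and keeps /private/ closed to every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_extract(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit this crawler to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler that respects the reservation would call may_extract before each request and skip any URL for which it returns False. Whether a given robots.txt rule is legally sufficient as an opt-out remains a question of interpretation; the sketch only shows that the rule can be read and honored automatically.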

Which security measures could be adopted?
The outcome of the consultation will help to clarify and define the security measures that the Authority deems suitable for allowing websites, as data controllers, to limit the massive collection of personal data through web scraping techniques. In the meantime, it is possible to identify some technical measures that operators can implement to protect websites against “unwanted” intrusions and prevent the extraction of information and data.

These measures include the use of anti-bot services, robots.txt files, the blocking of bot IP addresses and verification tests such as captchas. Where possible, some websites can also create reserved areas that are accessible only through an authentication process, so that information is made available at different levels of access. Additionally, from a legal perspective, it is essential to adopt specific terms of use for the site that include an absolute prohibition on using web scraping techniques for the systematic retrieval of data and information. In this way, it could be easier to take legal action to safeguard rights in case of violations of the contractual terms, with the aim of obtaining a restraining order and, potentially, compensation for damages suffered.
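
By way of example, one common building block behind anti-bot services and IP blocking is a per-client rate limit. The following is a minimal sketch in Python, assuming a sliding time window per IP address; the class name, the thresholds and the sample IP addresses are hypothetical, and a production setup would normally rely on a dedicated service or the web server's own controls rather than on hand-written code.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of accepted requests

    def allow(self, client_ip: str, now: float) -> bool:
        """Return True if the request is accepted, False if it should be blocked."""
        hits = self._hits[client_ip]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Usage: three requests per ten seconds for each client address.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
decisions = [limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0)]
```

Here the first three requests from the same address are accepted and the fourth is refused, while requests from other addresses are unaffected. A limit of this kind slows down bulk extraction but does not stop a scraper that rotates addresses, which is why it is usually combined with the other measures listed above.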