Ireland’s Data Protection Commission Launches Inquiry into Google’s AI Model

by Matthias Horn

The rapid advancement of artificial intelligence products has brought with it a host of legal and ethical challenges. In a significant development, Ireland’s Data Protection Commission (DPC) has launched an inquiry into Google’s AI model, focusing on the legality of data scraping practices used for training AI systems. This move underscores the growing concern among regulators about data privacy and intellectual property rights in the age of AI.

On 12 September 2024, the DPC announced an official investigation into Google’s methods of collecting and processing data for its AI models. The inquiry, opened under Section 110 of the Data Protection Act 2018, aims to assess whether Google’s data scraping practices comply with Article 35 of the General Data Protection Regulation (GDPR). Specifically, the DPC will examine whether Google was obliged to undertake an assessment prior to engaging in the processing of the personal data of EU/EEA data subjects associated with the development of its foundational AI model, Pathways Language Model 2 (PaLM 2). The inquiry forms part of the DPC’s wider efforts, undertaken in conjunction with its EU/EEA peer regulators, to regulate the processing of the personal data of EU/EEA data subjects in the development of AI models and systems.

Data Scraping for AI Training

Data scraping involves extracting large amounts of information from websites and online platforms, often using automated bots. For AI developers, this practice is essential for gathering the vast datasets required to train machine learning models effectively. However, the legality of data scraping is complex, intersecting with data protection laws, intellectual property rights, and contractual obligations.
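To illustrate how little machinery such collection requires, the following minimal Python sketch fetches a single page and collects its paragraph text, exactly the kind of fragment that can end up in a training corpus. It assumes the third-party requests and beautifulsoup4 packages; the URL is purely hypothetical, and a real crawler would operate at vastly larger scale.

```python
# A minimal scraping sketch: fetch one page and collect its paragraph
# text. Names, e-mail addresses, and other personal data published on
# the page can end up in a training corpus exactly this way.
# Requires the third-party packages: requests, beautifulsoup4.
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> list[str]:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

if __name__ == "__main__":
    # Hypothetical URL; a production crawler would also have to respect
    # robots.txt, rate limits, and the site's terms of service.
    corpus = scrape_page("https://example.com/articles")
    print(f"Collected {len(corpus)} text fragments")
```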

Under the GDPR, any processing of personal data must have a lawful basis, such as user consent or legitimate interest. Data scraping that involves personal information triggers compliance requirements, including transparency, data minimization, and purpose limitation. Where processing is likely to result in a high risk to data subjects, data controllers must conduct a Data Protection Impact Assessment (DPIA) before the processing begins. A DPIA requires an assessment of the impact of the envisaged processing operations on the protection of personal data. The DPC’s inquiry concerns whether Google has complied with the obligations of Article 35 GDPR.

Data Privacy Considerations

One of the primary concerns regarding AI model training through data scraping is its potential impact on data privacy. AI models like Google’s PaLM 2 are built on vast datasets, which may include personal information scraped from various online sources without the data subjects’ explicit consent. This can lead to several risks.

One of the main risks is ‘loss of control’: individuals often lose control over how their data is used once it has been scraped and incorporated into AI training datasets, and data subjects may not even be aware that their data has been collected. ‘Inaccuracy’ also becomes more likely, as the integration of personal data into AI models can produce biases or errors in AI outputs, which may affect individuals if these models are used for decision-making purposes. Finally, the risk of ‘re-identification’ increases: even when data is anonymized, it may be possible to re-identify individuals, especially where AI models are trained on large and diverse datasets.
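A toy example shows how re-identification by linkage can work: joining a supposedly anonymized dataset with a public one on quasi-identifiers such as postal code and birth year. All data in this Python sketch is fabricated for illustration.

```python
# Toy linkage attack with fabricated data: an "anonymized" record is
# re-identified by joining it to a public dataset on quasi-identifiers
# (postal code and birth year).
anonymized_records = [
    {"postal_code": "10115", "birth_year": 1990, "diagnosis": "X"},
]
public_records = [
    {"name": "Jane Doe", "postal_code": "10115", "birth_year": 1990},
]

for anon in anonymized_records:
    for pub in public_records:
        if (anon["postal_code"], anon["birth_year"]) == (
            pub["postal_code"], pub["birth_year"]
        ):
            print(f"Re-identified {pub['name']}: diagnosis {anon['diagnosis']}")
```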

Given these risks, the GDPR emphasizes the need to protect the fundamental rights and freedoms of data subjects, which may be at significant risk in cases of extensive data scraping.

Legitimate Interest under GDPR: Can Article 6(1)(f) Justify Data Scraping?

One possible justification Google could advance for processing personal data to train its AI model is the “legitimate interest” provision under Article 6(1)(f) of the GDPR. This provision allows the processing of personal data where it is necessary for the legitimate interests of the data controller or a third party, provided those interests are not overridden by the interests or fundamental rights and freedoms of the data subject.

However, for this argument to hold, Google would need to demonstrate that its legitimate interests in training the AI model outweigh the potential risks to individuals’ privacy rights. This is a contentious point: the rights of data subjects are strongly protected under the GDPR, and the interest of a commercial entity in developing advanced AI models may not be seen as sufficiently compelling in comparison. In addition, Article 6(1)(f) cannot serve as a legal basis where special categories of data within the meaning of Article 9(1) GDPR are processed, making legitimate interest a weak defence whenever sensitive personal data is involved.

Additionally, organizations processing personal data under the legitimate interest provision must ensure transparency and provide individuals with mechanisms to object to such processing. This is often challenging in the context of data scraping, where individuals may not even be aware that their data has been collected.

De-Risking? Anonymization and Pseudonymization

To mitigate the risks associated with AI scraping, organizations can employ techniques such as anonymization and pseudonymization. Anonymization involves stripping personal data of all identifiers, making it impossible to trace the data back to an individual. Pseudonymization, while less secure, involves replacing identifiable information with artificial identifiers, reducing the risk of re-identification.
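As a simplified illustration, the following Python sketch pseudonymizes direct identifiers with a keyed hash before data enters a training pipeline. The field names and the secret handling are assumptions made for the example, not a description of any company’s actual pipeline.

```python
# Pseudonymization sketch: replace direct identifiers with a keyed hash
# (HMAC-SHA256) so records stay linkable without exposing identities.
# The key and field names are illustrative assumptions; the key would
# have to be stored separately and securely for this to qualify as
# pseudonymization under the GDPR.
import hashlib
import hmac

SECRET_KEY = b"store-me-in-a-key-vault"  # assumption: externally managed secret

def pseudonymize(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

record = {"name": "Jane Doe", "email": "jane@example.com", "comment": "Great post!"}
safe_record = {
    "name": pseudonymize(record["name"]),
    "email": pseudonymize(record["email"]),
    "comment": record["comment"],  # free text can still leak identity
}
print(safe_record)
```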

While these techniques reduce risks to data subjects, they are not foolproof. The risk of re-identification remains a concern, especially when datasets are combined with other sources of information, and anonymization must be robust enough to ensure that individuals cannot be re-identified, which is increasingly difficult in the age of big data. Even so, thorough anonymization and pseudonymization are the measures that offer organizations the greatest legal certainty and should be recommended because they reduce risk, although they are not absolute solutions, particularly given advances in re-identification technology.

High Risk? Is a Data Protection Impact Assessment (DPIA) Required?

One of the key questions the DPC will address is whether Google was required to conduct a Data Protection Impact Assessment (DPIA) before scraping and processing personal data. Under Article 35 of the GDPR, a DPIA is mandatory when processing operations are likely to result in a high risk to the rights and freedoms of individuals. In the view of the European authorities, this includes cases where new technologies are employed, particularly when personal data is processed on a large scale; ultimately, however, a court has to decide whether the authorities’ legal opinions are in line with the law. The core question of the investigation may therefore be whether Google’s processing actually gave rise to a high risk. If it did, and personal data was scraped in high-risk processing without a proper DPIA, this could constitute a significant breach of GDPR obligations. The DPIA matters because its legal purpose is to help organizations assess and mitigate potential privacy risks, and the total absence of one could lead to hefty fines and intense regulatory scrutiny.

If a DPIA was thoroughly conducted, however, and followed by the implementation of appropriate mitigation measures such as those described above, the scraping might be lawful.

DPIAs are intended to serve as a risk assessment tool that helps organizations analyze how personal data processing affects individuals and how the resulting risks can be mitigated. In Google’s case, the DPC is likely investigating whether the company sufficiently evaluated the privacy risks posed by scraping personal data for its AI models and whether adequate safeguards were put in place. The European Data Protection Board (EDPB) has issued guidance stressing the importance of such assessments, particularly for high-risk technologies like AI. Recent EDPB guidance on AI-related risks further suggests that training models on personal data, even from publicly available sources, can trigger GDPR obligations that companies must not overlook.

Similar Regulatory Actions and Lawsuits

The DPC’s inquiry into Google is not an isolated case. Regulatory bodies worldwide are increasingly scrutinizing AI data practices.

DPC’s Inquiry into Grok

Earlier, the DPC welcomed the agreement of X (formerly Twitter) to suspend its processing of personal data for the purpose of training its AI tool, Grok. The DPC’s press release on the matter highlighted concerns about transparency and the lawful basis for processing personal data without user consent. The Grok case mirrors the issues at stake with Google’s AI models, underscoring the need for tech companies to align their AI training processes with data protection laws.

Intellectual Property Issues

Beyond data privacy, data scraping raises intellectual property (IP) concerns. Websites and online platforms often contain content protected by copyright, including text, images, and code. Scraping such content for AI training can infringe on the rights of content creators and publishers.

The doctrine of “fair use” or “fair dealing” provides limited exceptions for using copyrighted material without permission, typically for purposes like criticism, news reporting, or research. However, commercial use of copyrighted content for AI training may not fall under these exceptions, exposing companies to legal risks.

In Europe, the Database Directive also offers protection for databases, which could include compilations of data scraped from the web. Unauthorized extraction or reutilization of substantial parts of a database may constitute infringement.

Similar Cases Worldwide

Beyond regulatory action, data scraping for AI training is increasingly being tested in courts worldwide.

The New York Times vs. OpenAI

In a notable case, The New York Times (NYT) sued OpenAI, the developer of ChatGPT, over the unauthorized use of its content to train language models. The NYT argues that such practices violate its intellectual property rights and could undermine its subscription-based business model. Negotiations between the parties preceded the lawsuit, and the case highlights the tension between AI development and IP protection.

Getty Images vs. Stability AI

Another significant lawsuit involves Getty Images suing Stability AI, the company behind the AI art generator Stable Diffusion. Getty Images alleges that Stability AI copied millions of images without permission to train its model, infringing on copyrights and trademarks. The outcome of this case could set important precedents for the use of copyrighted material in AI training.

Legal Actions against and by Meta

The DPC fined Meta €265 million in November 2022 after investigating a breach involving the scraping of personal data from over 500 million Facebook users. The case concerned Meta’s failure to implement sufficient technical measures to prevent third-party scraping, a violation of Article 25 GDPR on data protection by design and by default.

Meta challenged this decision, claiming that the DPC misinterpreted the GDPR and that it should not be held accountable for data scraping performed by third parties. Meta argued that the publicly accessible data used by scrapers was consistent with user settings and did not constitute a breach. Meta has been granted leave to appeal the decision in the Irish High Court, and the case is ongoing.

In parallel, the company filed lawsuits against several entities for scraping user data from Facebook and Instagram, emphasizing the illegality of unauthorized data extraction under its terms of service and data protection laws. In 2024, Meta dropped most of those cases after a US court ruled in favour of one of the sued companies.

Implications for the Future

The DPC’s inquiry into Google could have far-reaching implications for the AI industry. If the DPC finds that Google’s data scraping practices violate the GDPR, it could lead to significant fines and force changes in how AI models are trained.

Companies may need to reassess their data collection methods, potentially moving away from indiscriminate scraping towards more ethical and legal data sourcing strategies. This could include obtaining explicit consent, using synthetic or anonymized data, or licensing data from content providers.

Regulators worldwide might follow the DPC’s lead, increasing oversight and enforcement actions related to AI and data practices. This could result in a more fragmented regulatory landscape, with companies navigating different rules in various jurisdictions.

Balancing Innovation and Legal Compliance

The tension between fostering AI innovation and ensuring legal compliance is a central challenge. While large datasets are crucial for developing advanced AI models, they must be obtained and used in ways that respect individuals’ rights and comply with existing laws.

Some experts advocate for clearer regulations and guidelines that balance these interests. Proposals include creating legal frameworks that allow for certain types of data use in AI training while protecting privacy and IP rights, or establishing industry standards for ethical data sourcing.

Conclusion

The DPC’s inquiry into Google’s AI model underscores the urgent need to address the legal complexities of data scraping for AI training. As regulators, companies, and stakeholders grapple with these issues, the outcomes will shape the future of AI development and data governance. Ensuring compliance with data protection and intellectual property laws is not just a legal obligation but also a matter of public trust.

Data privacy risks can be substantially mitigated through anonymization and pseudonymization techniques. Anonymization removes or irreversibly transforms identifiable data so that individuals can no longer be traced from the dataset, which significantly reduces the risks associated with data breaches because personal data is no longer linkable to individuals. Pseudonymization, meanwhile, substitutes identifiers with pseudonyms, limiting re-identification risks while still enabling data use under controlled conditions. Both techniques play a crucial role in enhancing data security, especially when large datasets are involved in AI training and analytics. On the other hand, re-identification risks are dynamic as technologies evolve; there are no absolute solutions.

Organisations must navigate these challenges thoughtfully, prioritizing ethical practices while driving innovation. The evolving legal landscape will require adaptability and collaboration to harness the benefits of AI responsibly and lawfully.

Matthias Horn is a Data Privacy Lawyer at Ottobock, Berlin (Germany). In addition, he advises organizations on issues at the interface between data, technology and law.