Navigating Data Privacy Challenges in the Age of Large Language Models


In recent years, large language models like OpenAI’s ChatGPT have revolutionized the way we interact with technology – as indicated in previous posts here at Tiber, there are countless ways that ChatGPT can be utilized across all industries. However, as these AI models become more prevalent, data privacy concerns have also been on the rise, from online thinkpieces to protests to legal challenges in court. With these large language models ingesting tremendous amounts of “public” data for training, there are critical questions about whether that practice complies with current legal protections. Many of these concerns stem from the EU, where the GDPR provides a comprehensive framework for consumer data privacy protections.


At Tiber Solutions, we embrace newer technology – from using ChatGPT to help build tools for clients and debug code, to building innovative data analytics solutions with cutting-edge tools. We are excited about and committed to working with this groundbreaking technology. On the flip side of that same coin, we are also dedicated to ensuring our clients have solid data practices, including education around data privacy and protection. Just as we love to explore the breadth and depth of tools that we can use as technology improves, we also understand the need to remain informed on best practices and challenges as they unfold in this dynamic environment.


In this blog post, we aim to describe the current state of data privacy in the EU, discuss how large language models may fall short of compliance under the GDPR framework, and consider what that may mean for data privacy protections here in the US.


ChatGPT and the GDPR

The GDPR, which became effective in May of 2018, is loosely based on the Fair Information Practice Principles (FIPPs), a set of principles and guidelines that establish best practices for the collection, use, and protection of personal information.

The key principles of the GDPR are the following:

  1. Lawfulness, Fairness, and Transparency
  2. Purpose Limitation
  3. Data Minimization
  4. Accuracy
  5. Storage Limitation
  6. Integrity and Confidentiality
  7. Accountability
  8. Lawful Basis for Processing
  9. Individual Rights
  10. Cross-Border Data Transfer Safeguards


There are several key areas where data collection for AI models can conflict with the GDPR. Scraping the web for free-floating information, using public repositories, and using online images or videos that contain personal information are generally off-limits to most companies in most situations, as these practices can violate several of the principles in the GDPR.


One such challenge took place in March 2023, when Italy’s data regulator issued a temporary emergency decision demanding that OpenAI stop processing Italians’ personal data for training, resulting in ChatGPT being blocked across all of Italy. Four problems were cited: 1) There are no age controls to limit usage; 2) It can provide inaccurate information about an individual or entity; 3) People were not informed their data was collected; 4) There is “no legal basis” for collecting personal information for training purposes. Several of these issues were partially resolved in April for Italy and the broader EU, with OpenAI adding more comprehensive privacy notices, opt-out forms for data collection, and age verification.


However, the question of “legal basis” for collecting personal information has still not been settled. Legal basis, as defined by the GDPR, can be established through:

  1. Consent
  2. Contractual obligation
  3. Legal obligation
  4. Vital interests
  5. Public interests
  6. Legitimate interests

Since none of the first three apply, OpenAI would have to demonstrate that ChatGPT falls under one of the latter three bases to proceed in the EU without significant penalties.


Moreover, the waters remain muddy regarding the initial, historical dataset used to train the model. Since no permissions were given for that data, is its use legal? How can users correct, add, or delete information if their data has already been processed to train ChatGPT? How can the principle of data minimization apply to training datasets when no specific purpose is stated other than training language models – and therefore no limit exists on the amount of data deemed necessary? These questions have not yet been fully tested in court, but they point to clear potential violations of the GDPR that must be addressed for ChatGPT to continue being used globally.


What comes next?

As the world’s most comprehensive legislation around data privacy, the GDPR has had and will continue to have influence on developing data privacy protections around the globe. While Italy was the first to set off the alarm, data regulators in France, Germany, and Spain have begun their own investigations into ChatGPT’s data practices. Even in the United States, which has traditionally taken a laissez-faire approach to federal data privacy regulation in favor of innovation, individual states have been implementing their own data privacy programs. All 50 states and the District of Columbia have data breach notification laws, and several are developing their own comprehensive data privacy programs.


In particular, California – home to Silicon Valley – has instituted the California Consumer Privacy Act (CCPA), with a new amendment (the CPRA) effective 2023. It echoes many of the guidelines set in the GDPR and the FIPPs, with the California Attorney General aggressively pursuing compliance among companies with Californian clientele. Following California’s lead, Connecticut and Colorado have also passed their own respective state privacy laws, and conversations have arisen about a federal data privacy program that may encompass existing legislation like HIPAA, FCRA, GLBA, COPPA, and others that have acted as patchwork data privacy solutions in specific industries.


As a result, many privacy concerns stemming from the EU will likely apply to consumers in California and a growing number of states, and may eventually have to be addressed federally. Moreover, consumers in the United States have their own share of concerns about ChatGPT: copyright protection is a prominent one, with OpenAI being sued by writers whose works were used to train ChatGPT, and the FTC has begun an investigation into OpenAI over the publication of false information.


Data privacy is a key concern that must be addressed for large language models and other kinds of AI to be fully adopted worldwide. The legal challenges faced within the EU under the GDPR, along with the potential for these (and separate) challenges to migrate to the US and other countries, mean that anyone utilizing such technology must remain vigilant about ongoing news regarding generative AI. In the coming years, this larger dialogue will shape how we leverage the power of large language models for innovation and progress. ChatGPT is an incredible tool that is disrupting not only the data analytics and IT industry but every sector of the economy. Here at Tiber, we believe these breakthroughs, together with the emerging discussion around data privacy, represent a crucial path toward harnessing the power of AI to provide tremendous value.