Introduction: Our Quest to Make a Difference in Data Privacy Protection on a Global Scale
Why we've developed this collective across industry and disciplines to create a standard that protects personal information through AI/ML.
Today our information is strewn all over the web. Every time we create an account for an application, a social media platform, or a favorite retail site, we open up avenues for a never-ending footprint that bears witness to our intentions, motivations, and behaviors, all linked to our identities.
The advent of AI/ML opened up opportunities to contextualize the web and every piece of data on it, surfacing understanding and insights more efficiently to drive better decisions for businesses and governments.
The fallout of this opportunity has been the emergence of industries that profit from the use of personal information: data brokers, advertising platforms, and emerging technologies leveraging biometric, facial recognition, location, and sensor data. When the biggest companies that collect, aggregate, and contextualize data fail to change the way they operate, more vulnerabilities are unleashed.
While legislation continues to address the complaints and harms that emerge within industry, there also needs to be a proactive will to mitigate these downstream impacts without hampering innovation.
Enter BigScience: Mitigating Privacy Harms in Large Language Models
BigScience’s year-long research workshop was launched to address the impact that Artificial Intelligence and Natural Language Processing (NLP), and in particular a powerful new technology called large language models, have on society. Working groups were formed, one of which addressed the protection of Personally Identifiable Information (PII) in the large datasets used to train large models.
As part of this, international researchers and privacy practitioners came together to address the privacy impacts of large language models. These models are capable of communicating like people, but can sometimes expose the private information they were trained on. Our efforts focused mainly on removing PII from large datasets and on how to mitigate PII leakage.
We began this project to answer the following questions:
What does privacy mean generally to the datasets and stakeholders impacted by the BigScience artifacts, including models?
What are ways to protect data privacy? Policies that promote privacy norms, legal constructs such as licenses, or technology?
What are some general policies for privacy that BigScience should adhere to?
We dove deep to understand the players in the industry: the data collectors and data processors, and their respective roles in the collection, aggregation, and use of data.
We also studied different privacy legislations to determine the commonalities among jurisdictions around the world. This helped us establish the legal definitions of what is deemed personal or personally identifiable information (PII) and how it can be deconstructed into varying degrees of sensitivity.
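For illustration only, here is a hypothetical sketch of what such a deconstruction might look like; the tier names and example categories below are our own shorthand, not a taxonomy drawn from any particular legislation:

```python
from enum import Enum

class Sensitivity(Enum):
    DIRECT = "direct identifier"    # identifies a person on its own
    QUASI = "quasi-identifier"      # identifying only in combination with other data
    SPECIAL = "special category"    # e.g. health or biometric data (GDPR Article 9)

# Hypothetical examples, used only to illustrate "varying degrees of sensitivity".
EXAMPLE_CATEGORIES = {
    "full_name": Sensitivity.DIRECT,
    "government_id": Sensitivity.DIRECT,
    "postal_code": Sensitivity.QUASI,
    "date_of_birth": Sensitivity.QUASI,
    "health_record": Sensitivity.SPECIAL,
}
```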
We probed legislation to understand data privacy from the perspectives of data as property and of individuals’ rights over their personal information, and drew upon the accountability and responsibility of the Data Stewardship Organization (DSO), with the ultimate aim of defining, detecting, and remediating data along these principles:
Anonymity/Privacy: Individuals represented in datasets can be harmfully targeted. Datasets must not infringe on individuals’ privacy in this way without informed and positive consent. Similarly, the Right to be Forgotten (Article 17(2) of the GDPR) offers remediation through deletion where “it is no longer possible to discern personal data without disproportionate effort.”
Autonomy: All people involved have a “right to autonomy”. This includes:
Consent (Article 6(1) of the GDPR): Informed consent from data creators/collectors/controllers, and from those who are uniquely represented (PII) in the dataset.
Contestation, aka the Right of Access (Article 12(1) of the GDPR): Individuals with data in the dataset will have the ability to request that their data be removed or anonymized. They should also have relatively easy access to know that they are in the data.
Transparency, aka the Right to be Informed (Article 14 of the GDPR):
Dataset transparency can include documenting the dataset creation process, articulating motivations and use cases, describing dataset statistics and characteristics, and/or publishing the dataset in full.
Developers should publish information about the design of AI tools, lowering the barrier for people to know how systems make decisions.
Ideally there is enough transparency that datasets and systems can be fully audited and understood by users and regulators.
Inclusion/Representation: Datasets aim to maximally represent the diversity of human language use. What this means is further refined by the other ethical considerations we assert.
Finally, we took time to carefully define personal information (in some jurisdictions, also known as personally identifiable information). This laid the groundwork for understanding context and potential harm to individuals. Through this work we developed some initial rules and frameworks for detection and remediation.
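As a simplified, purely illustrative sketch of what such a rule could look like (the patterns and placeholder tags below are hypothetical and far less sophisticated than context-aware approaches), detection can pair a pattern with a remediation action:

```python
import re

# Minimal illustration: each rule pairs a detection pattern with a remediation tag.
# Real pipelines combine rules like these with NER models and contextual decisioning.
RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[US_SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b(?:\+\d{1,3}[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{4}\b"), "[PHONE]"),
]

def redact(text: str) -> str:
    """Apply each detection rule and replace matches with its placeholder tag."""
    for pattern, tag in RULES:
        text = pattern.sub(tag, text)
    return text

print(redact("Contact Jane at jane.doe@example.com or 555-867-5309."))
# -> Contact Jane at [EMAIL] or [PHONE].
```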
Next Evolution: A Global Collaboration to Develop a Data Privacy Protection Standard
But as we did this work, we found that the tools we built could be more generally useful, and that the open source community lacked a harmonized approach to managing privacy.
We quickly realized that the goal of an open privacy framework had outgrown BigScience and needed an existence of its own, so we gathered privacy practitioners from across industry to help make privacy tools more readily accessible to all.
This blog will chart our progress in developing the policy framework and the NLP specifications for detection, decisioning, and transformation.
The ultimate goal is to build a language-independent specification, along with a set of guidelines and recommendations that can be shared with underserved communities globally (based on their language specifications) so they too can help minimize the risks of PII exposure. This specification could take the form of a data interface, a program interface, or an API for PII processing and management (or a mixture of these).
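To make that concrete, here is a rough, hypothetical sketch of what a minimal detect/decide/transform interface could look like; the names and signatures below are illustrative assumptions, not the final specification:

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class PIISpan:
    """One detected piece of PII: where it is, what kind it is, and the detector's confidence."""
    start: int
    end: int
    category: str      # e.g. "EMAIL", "PHONE", "PERSON_NAME"
    confidence: float

class PIIProcessor(Protocol):
    """Hypothetical three-stage interface: detection -> decisioning -> transformation."""

    def detect(self, text: str, language: str) -> List[PIISpan]:
        """Find candidate PII spans in the input text."""
        ...

    def decide(self, text: str, spans: List[PIISpan]) -> List[PIISpan]:
        """Apply policy and context to keep only the spans that must be remediated."""
        ...

    def transform(self, text: str, spans: List[PIISpan], strategy: str = "redact") -> str:
        """Remediate the approved spans, e.g. by redaction, masking, or pseudonymization."""
        ...
```

Keeping detection, decisioning, and transformation as separate steps would let a community swap in language-specific detectors or jurisdiction-specific policies without changing the rest of the pipeline.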
Why We’re Doing This
Here are some quotes from participants in this important project:
I’ve worked in Machine Learning Research for 8 years and one of the standout things for me was how the data we generate was being used liberally to train large language models or “foundation” models. When I joined my current role as a Data Scientist for a leading e-discovery and data breach response company, I realized how much private data is out there. With more opaque LLMs being trained, it’s crucial that we standardize data to preserve privacy and facilitate open research about the ways to maintain these standards. ~Shamik Bose
I’ve been in Privacy for 6 years, but I’ve also been in advertising and data marketing for 20 years. While my career was a well-intentioned iteration to make things easier and more profitable for advertisers and marketers, I realized we were crossing a very dangerous line. From targeted advertising to micro-targeting, we don’t need, nor should we have the right to people’s personal information, especially as technology inevitably will enable more digital footprints to our lives. For those who say, “I’m not worried about privacy because I have nothing to hide”, I say this: The data you produce can be woven in ways that will be outside of your control. One bad day. One bad post on social media can be taken out of context. Narratives will be seen as truth. Your word against the world. We, as data professionals, have to be stewards to minimize these types of harms against individuals and groups. It’s as much about individual rights, as much as it is about individual freedom. ~Hessie Jones
Data privacy has two angles: The individual’s angle and the organization’s angle. While we often hear about the importance of protecting our own privacy, we lack the visibility of how organizations with which we share our data safeguard it and treat it. Despite the criticality of privacy protection, it is still relatively difficult for organizations to find the right mechanisms to preserve our privacy as users, especially when it comes to training Machine Learning models. Therefore, I believe that improving and enriching privacy-preserving technologies would allow organizations to better handle our data as individuals. Furthermore, it would unlock different data sharing opportunities around important topics such as healthcare and human rights. ~Omri Mendels
I think privacy tech should be available to all who want to use it, and so we should help create an open standard for its adoption, in the non-profit, educational and industrial sector, everywhere in the world. ~Huu Nguyen, Co-Founder, Ontocord, LLC
While privacy is a universal inalienable right in my view, too often the operational definition of privacy and the way it is perceived by the population is highly variable across jurisdictions, sometimes stemming from genuine cultural differences and divergent world views. Having worked on privacy-preserving applications in three different continents, I have been fascinated at the differences in expectations and values but remain more convinced than ever about the universality of it. In a globalized digital world, it is only fair that ‘privacy’ is afforded to all participants in line with their individualized expectations of what it implies. ~Suhas Pai
AI is quickly becoming a vacuous buzzword, and privacy but another compliance checkbox. After hearing a fake "We value your privacy!" one time too many, I decided to help companies that genuinely wish to do a better job, but lack the tooling.
And so we launched PII Tools as the automated, no-more-excuses B2B alternative. We combine strong engineering with pragmatic focus on the day-to-day business aspects of sensitive data protection. As such, the Global Data Privacy Protection Standard is a great complement to our mission – shining light into places that were previously dark and full of secrets… whether by design or omission or lack of resources. The world deserves better. ~Radim Řehůřek, Founder and CTO at pii-tools.com
We are trying to clear a path in the privacy jungle, so that both users and industry might find a consistent common ground to work in. And the challenge, but also the opportunity, is achieving that consistency across regulations, regions and cultures. So that the potential of AI-based solutions for improving everybody's lives does not come with obnoxious strings attached. ~Paulo Villegas
Today, there remains a gap between the vision for where privacy should be for everyone and business reality. While there are companies trying to address the gap, currently there is no end-to-end standard for businesses to create plug-and-play solutions. Often, businesses either use a hodgepodge of solutions that create frictions, or one blanket solution to overly remove data while hindering business performance, creating a dichotomy that privacy sacrifices technology development. What we're creating here is about creating context-aware standards that allow plug-and-play pivot for the industry, moving on from dichotomy to alignment on privacy-by-design. ~Ian Yu
Our Team
Shamik Bose
Mathilde Bras
Jenny Chim
Pierre Colombo
Carlos Ferrandis
Hessie Jones
Omri Mendels
Huu Nguyen
Mitchel Ondili
Suhas Pai
Marc Pamies
Long Phan
Eduardo González Ponferrada
Radim Řehůřek
Patricia Thaine
Paulo Villegas
Angeline Wairegi
Ian Yu
If you would like to participate in this open project, if you have ideas on how we can improve what we’re doing, or if you have access to communities who would benefit from our project, please contact hessiej1228@gmail.com.