Towards a Common Privacy API: Introducing PIISA
A proposed programming framework for an interoperable open source Personally Identifiable Information System Architecture (PIISA)
API Motivation and Design Principles
Previously, we introduced our approach to developing a data privacy protection standard and proposed a framework for classifying and scoring the risk associated with personal information. To enable broader adoption, we quickly realized we needed a robust way to incorporate the framework into applications through an API (Application Programming Interface) based system architecture.
Developing an interoperable and general framework for PII (Personally Identifiable Information) management is a hard problem, so we began with a foundation of design principles. The following principles guide the work of translating our vision and framework into an API architecture with an agile development mindset.
Modularity & Composability: We want to break the full problem down into smaller pieces, so that each component can be addressed independently and the pieces can then be assembled into the full framework.
Usefulness: We aim to make our incremental components functional from the very beginning. This way, users can gain value from the beginning without waiting for the full framework to be in place.
Development Scalability: We plan to scale the development and deployment effort, building on the above two principles with incremental and partial working solutions. With composability, we can freely substitute some elements with more powerful modules when they are available, enabling us to mix and match modules.
Adaptability: In contrast to the one-size-fits-all approach, our approach to the framework is that it should be easily configurable and allow users to create custom solutions without much effort.
With these principles in mind, we propose an interoperable open source Personally Identifiable Information System Architecture (PIISA).
Overview of the proposed API architecture
We describe below our proposal for a general PII management framework, with the caveat that it remains a work in progress that will need community input and support. The framework consists of five major blocks: 1) Preprocess, 2) Detect, 3) Decide, 4) Transform, and 5) Visualize & Deliver. These blocks are implemented modularly and independently, providing two types of API endpoints:
Program Interfaces: Connect to supported libraries to exercise the desired process for that block
Data Interfaces: Connect processed data to the next PIISA block
For example, a user will first utilize the program interfaces on the Preprocess Block to extract text with a supported library, and then send the extracted text, or the data, to the Detect Block through data interfaces. The user will then interact with the Detect Block’s program interfaces to detect PII. To ensure data is consistent and easily connected throughout the entire framework, the data format of elements passing between the blocks is standardized.
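The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PIISA specification: the function names (`preprocess`, `detect`, `run_pipeline`), the dictionary keys, and the toy "contains an @" detector are all hypothetical. The point is that each block exposes a callable (its program interface) and hands a standardized structure to the next block (its data interface).

```python
from typing import Callable, Dict, List


def preprocess(raw_text: str) -> List[Dict]:
    """Hypothetical Preprocess block: split raw text into paragraph chunks."""
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    return [{"id": i, "payload": p} for i, p in enumerate(paragraphs)]


def detect(chunks: List[Dict]) -> List[Dict]:
    """Hypothetical Detect block: flag chunks containing '@' (toy detector)."""
    return [
        {"chunk_id": c["id"], "type": "EMAIL_CANDIDATE"}
        for c in chunks
        if "@" in c["payload"]
    ]


def run_pipeline(data, blocks: List[Callable]):
    """Pass the output of each block to the next one, in order."""
    for block in blocks:
        data = block(data)
    return data


result = run_pipeline(
    "Hello there.\n\nWrite to jane@example.org today.",
    [preprocess, detect],
)
print(result)
```

Because every block consumes and produces the same kind of structure, a block can be swapped for a more capable one without touching the rest of the chain.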
Today, solutions to handle PII are mostly custom and are developed in-house by individual users and companies. Although there are several libraries available for various privacy treatment purposes made public by organizations who have developed them, they are hardly interoperable. The Personally Identifiable Information System Architecture (PIISA) aims to solve the interoperability problem.
Processing Blocks
Of the five major blocks, the first four are processing blocks that process the data, and the last block is for statistical evaluation and final usage. (Please refer to the draft specification, which contains more details on some of the interfaces.)
1) Preprocess
One of the biggest challenges in PII detection work is that users have to adapt their inputs to the libraries they select. If multiple libraries are selected, then multiple formats have to be created for the same document, often without clear traceability. The PIISA framework tries to limit such complexity by taking in raw documents and producing a standardized format that is used throughout the entire process: a document is represented as a set of chunks, so users can easily trace which chunks containing PII were derived from which documents.
Preprocess is the first block of the PIISA framework. In order to work with our standardized framework, data has to be in consistent formats throughout. The Preprocess block first reads a document in an arbitrary format, which could be a Word document, a web page or a PDF file, and then normalizes the document into a standardized format.
The normalized document contains a simplified version of the original content, with high-level structure and all the text data. After the normalized version is produced, documents could be delivered to the Detect block via API. In our model, a document delivered by the Preprocess block is structured as a set of chunks, each having an ID, a payload, and an optional context (to aid further processing). Chunk-based processing will allow the treatment of potentially large documents in a scalable way.
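As a minimal sketch of the chunk structure described above (an ID, a payload, and an optional context), the following Python uses dataclasses. The class names, field names, and the paragraph-based splitting rule are assumptions made for illustration; the draft specification may define them differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    id: str                         # unique within the document
    payload: str                    # the text of this chunk
    context: Optional[dict] = None  # optional hints for later blocks


@dataclass
class NormalizedDocument:
    doc_id: str
    chunks: List[Chunk] = field(default_factory=list)


def normalize(doc_id: str, text: str) -> NormalizedDocument:
    """Hypothetical normalizer: one chunk per blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = [
        Chunk(id=f"{doc_id}-{n}", payload=p, context={"position": n})
        for n, p in enumerate(paragraphs)
    ]
    return NormalizedDocument(doc_id, chunks)


doc = normalize("doc1", "Title\n\nFirst paragraph.\n\nSecond paragraph.")
for chunk in doc.chunks:
    print(chunk.id, repr(chunk.payload))
```

Prefixing each chunk ID with the document ID is one simple way to keep the traceability from chunk back to source document.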
2) Detect
As normalized documents are delivered by the Preprocess block, the Detect block is in charge of processing those documents and performing PII detection. Thanks to many privacy-conscious communities and organizations, there are many PII detection libraries out there, but each has its own pros and cons. Choosing which library to use, and then adapting to the output of the selected library, often leads to a customized spaghetti codebase. The Detect block does precisely the opposite, serving as a unified interface to all the different technologies with a consistent output.
The Detect block can wrap multiple libraries under the hood and allows users to select a set of PII Detectors, utilizing multiple technologies to detect PII present in the documents. These could include a preset ontology, preset regular expressions, or even trained named entity recognition models. PII Detectors can be further configured, for example for specific text languages and applicable countries for contextual considerations.
The Detect block outputs a list of the PII candidates that it has identified. Each candidate represents a text segment that likely contains PII, and the output for each candidate contains associated information, such as the document and chunk ID, the PII detector(s) that flagged it, the PII type, and a score or confidence level.
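To make the shape of a PII candidate concrete, here is a hedged sketch of a single regex-based detector. The `PiiCandidate` fields mirror the information listed above, but the names, the simplified email pattern, and the fixed 0.9 score are illustrative assumptions rather than part of any PIISA specification.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class PiiCandidate:
    doc_id: str
    chunk_id: str
    start: int      # character offset inside the chunk payload
    end: int
    text: str
    pii_type: str
    detector: str   # which detector produced this candidate
    score: float    # confidence; 0.9 below is an arbitrary placeholder


# Deliberately simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect_emails(doc_id: str, chunk_id: str, payload: str) -> List[PiiCandidate]:
    """Hypothetical detector: return one candidate per email-like match."""
    return [
        PiiCandidate(doc_id, chunk_id, m.start(), m.end(), m.group(),
                     "EMAIL", "regex-email", 0.9)
        for m in EMAIL_RE.finditer(payload)
    ]


candidates = detect_emails("doc1", "doc1-0", "Contact jane@example.org for details.")
for c in candidates:
    print(c)
```

A detector backed by a named entity recognition model could return the same `PiiCandidate` structure, which is what would let the Decide block treat all detectors uniformly.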
3) Decide
What should be considered PII? The answer is highly contextual. For example, a famous politician’s name in a news article most likely would not be considered PII, but if the same person’s name appears in a medical document, then it should be treated as PII. Context matters, just as much as jurisdiction does. Even within the same country, different jurisdictions may regulate PII differently. Today, technical users have to be experts on policies and legal documents, and unfortunately they often rely on guesswork to decide how to treat PII for their specific use cases. Even when legal experts are available, the communication and meetings needed to align on a basic understanding are expensive. No common standard exists today.
The Decide block takes away that initial guesswork. In our initial research, we compiled a Legal Playbook to understand jurisdiction and unique considerations when it comes to privacy, which led to our PII policy framework and the S^3 decision matrix. (Please reference this post which details the S^3 framework.)
The Decide block’s mission is to take the PII candidates produced by the Detect block and consolidate them into a final set of PII elements in the text that need to be addressed. During the consolidation process, users can configure languages and jurisdiction, define sensitivity, scarcity, and specificity, supply document metadata, and so on.
The Decide block outputs the same collection of PII candidates, annotated with the result of the decision. The Decision operation will involve a number of considerations, which may be carried out by different technologies (rules, models, statistical analysis, etc.). For example, a disease detection may have a different sensitivity if a Person detection occurs in proximity to it. For our S^3 framework, we focused on Highly Sensitive Information only. The disease example above illustrates the principle that “Non-PII can become PII whenever additional information is made publicly available, in any medium and from any source, that, when combined with other available information, could be used to identify an individual”.
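The proximity example can be expressed as a simple rule. The sketch below is a toy illustration, not the S^3 decision matrix: the `Candidate` class, the "high"/"low" labels, and the 50-character window are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    start: int
    end: int
    pii_type: str
    sensitivity: str = "unknown"


def gap(a: Candidate, b: Candidate) -> int:
    """Characters between two spans (zero or negative if they touch or overlap)."""
    return max(a.start, b.start) - min(a.end, b.end)


def decide(candidates: List[Candidate], window: int = 50) -> List[Candidate]:
    """Hypothetical rule: a DISEASE mention is 'high' sensitivity only when a
    PERSON mention lies within `window` characters of it."""
    persons = [c for c in candidates if c.pii_type == "PERSON"]
    for c in candidates:
        if c.pii_type == "DISEASE":
            near_person = any(gap(c, p) <= window for p in persons)
            c.sensitivity = "high" if near_person else "low"
    return candidates


decided = decide([
    Candidate(0, 8, "PERSON"),
    Candidate(30, 38, "DISEASE"),    # 22 characters after the person
    Candidate(400, 408, "DISEASE"),  # far from any person
])
for c in decided:
    print(c.pii_type, c.sensitivity)
```

A real Decide block would combine several such rules with jurisdiction and document metadata; the sketch only shows how one contextual consideration can modify a candidate.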
4) Transform
Once PII candidates have been detected and evaluated with contextual information, the final processing step is to decide what to do with them. In some cases, such as the disclosure of sensitive documents, redaction may be preferred. In other cases, such as training a machine learning model, replacing the original candidates with dummy values or synthetic data may be preferred.
The Transform block’s objective is to do precisely that. It takes the “decided” PII entities and acts upon them, depending on the intended purpose. The same PII candidates can lead to different outcomes while sharing the same interface. For example, users can perform different types of anonymization, such as redacting or replacing with dummy values or synthetic PII data, while still visualizing the same PII and tracing back through the decision process to understand, with additional metadata on the decision, why a certain span was determined to be PII.
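The idea of different outcomes behind one interface can be sketched as interchangeable strategy functions. Everything here is hypothetical: the `(start, end, pii_type)` span tuple, the `redact` and `placeholder` strategies, and the assumption that spans do not overlap.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, pii_type); assumed non-overlapping


def redact(text: str, pii_type: str) -> str:
    """Mask every character of the PII span."""
    return "*" * len(text)


def placeholder(text: str, pii_type: str) -> str:
    """Replace the PII span with a tag naming its type."""
    return f"<{pii_type}>"


def transform(text: str, spans: List[Span],
              strategy: Callable[[str, str], str]) -> str:
    """Apply `strategy` to each span, right to left so that earlier
    offsets remain valid after each replacement."""
    for start, end, pii_type in sorted(spans, reverse=True):
        text = text[:start] + strategy(text[start:end], pii_type) + text[end:]
    return text


text = "Call Jane Doe at 555-0100."
spans = [(5, 13, "PERSON"), (17, 25, "PHONE")]

print(transform(text, spans, redact))       # Call ******** at ********.
print(transform(text, spans, placeholder))  # Call <PERSON> at <PHONE>.
```

A synthetic-data strategy would plug into the same `transform` call; only the strategy function changes.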
5) Visualize & Deliver
The last step in the chain focuses on adapting, for consumption, the result of all prior processing. The destination can be:
a human, in which case UI & UX considerations come into play, to be able to communicate the result of PII processing in a clear and efficient way
automated processing (such as feeding data to a model training flow); in this case it is important to use delivery formats suitable for efficient and unambiguous parsing, and to incorporate any additional algorithmic information that might be useful downstream.
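For the automated-processing case, one possible delivery format is JSON Lines (one JSON object per line), which is easy to stream and parse unambiguously. The source does not prescribe a format, so this choice, the record fields, and the helper names below are assumptions for illustration.

```python
import json
from typing import Dict, Iterable, List


def to_jsonl(records: Iterable[Dict]) -> str:
    """Serialize each record as one JSON object per line.
    json.dumps escapes newlines inside strings, so lines stay intact."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def from_jsonl(data: str) -> List[Dict]:
    """Parse JSON Lines back into a list of records."""
    return [json.loads(line) for line in data.split("\n") if line]


# Hypothetical delivery records: transformed text plus decision metadata.
records = [
    {"chunk_id": "doc1-0", "text": "Call <PERSON> at <PHONE>.",
     "pii": [{"type": "PERSON", "start": 5, "end": 13}]},
    {"chunk_id": "doc1-1", "text": "No PII here.", "pii": []},
]

payload = to_jsonl(records)
print(payload)
```

Keeping the decision metadata next to the transformed text lets downstream consumers see what was changed without access to the original document.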
Future Work
We are currently developing the framework’s standard and a prototype implementation. The standard specification defines the minimum structure for these program and data interfaces to guarantee basic interoperability across blocks in the architecture. Based on the specification and our four design principles, we are also implementing a prototype to serve as a proof of concept.
To provide a gradual approach, the first API definition will be local, with software elements connected directly to each other (our proof-of-concept implementation is being done as a set of Python packages). Later on, we will also define an equivalent REST API, so that blocks can be interconnected across a network, opening the path to network-provided PII cloud services. We would love your feedback and participation in the development of PIISA. You can find the latest specification here.
Some of the open questions we continue to grapple with include:
Scope of Rights. What rights are we promoting with our API? What are the intended use cases? How do we handle data portability, erasure and correction, the right to know and access, and so on?
Security. How might sensitive information be managed while protecting the privacy and security of that data? How do you protect PII that you send to other organizations and outside your borders?
Supported Data Types. What kinds of data might contain PII that you would want processed? How might secure and sensitive data be stored and accessed in order to perform PII processing?
If you would like to get involved in answering these questions and building out our initial PIISA implementation, join us! To participate in this open project or just to share ideas on how we can improve what we are doing, please contact Hessie Jones at hessiej1228@gmail.com.