TikTok for Developers
PrivacyGo Data Clean Room: A new tool for data collaboration
by Vini Jaiswal, Open Source Manager, Dayeol Lee, Research Scientist and Mingshen Sun, Research Scientist at TikTok
Tech @ TikTok
Privacy
Open source

Introduction

As part of our ongoing efforts in Privacy Innovation, we are excited to release a new open-source project called PrivacyGo Data Clean Room. The project enables data collaboration on private datasets without compromising individual data. It started as a TikTok use case that called for interactive tooling, strong security and privacy protection, strict access control, accurate data analytics, and easy deployment.

Technology background and challenges

Data collaboration is not a new concept, and numerous data collaboration frameworks already exist. These frameworks apply different privacy-enhancing technologies (PETs), each with its own strengths and weaknesses. Traditional data protection methods, such as encryption at rest and in transit, offer limited protection while data is being processed. SQL policy and differential privacy are two common solutions, but both face limitations when data must be verified and collaborated on before it is released. Applications and data remain vulnerable to attacks at runtime, regardless of infrastructure access privileges, which leaves organizations that store and process sensitive and regulated data exposed to potential security breaches. Additionally, without remote attestation it is difficult to verify the integrity and authenticity of the computing environment, which raises concerns about the security of data in use. To address these gaps, we designed a two-stage data clean room that combines different technologies to balance usability, accuracy, and privacy.

What is PrivacyGo Data Clean Room?

PrivacyGo Data Clean Room (PGDCR) is an open-source project to easily build and deploy data collaboration frameworks to the cloud using trusted execution environments (TEEs). PGDCR achieves this by combining different privacy-enhancing technologies (PETs) in different stages. In the programming stage, the platform allows data consumers to explore the dataset while providing interactive data usage and protecting data providers' privacy by provisioning two different datasets. In the execution stage, the workload runs in an isolated environment, and data providers can manage the data, code, and output space based on the results of their testing.
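To make the idea of provisioning two datasets concrete, here is a minimal sketch in Python. It is not PGDCR's actual API: the `ProvisionedDataset` class, the stage names, and the `average_age` function are hypothetical names chosen for illustration. The point is only that the same analysis code can run against a privacy-preserving view while it is being written and against the full data once it runs in the isolated environment.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ProvisionedDataset:
    """Hypothetical pairing of the two datasets a data provider supplies."""
    programming_view: tuple  # synthetic, random, or public records
    execution_view: tuple    # full private records, read only in the isolated stage

    def load(self, stage: str) -> tuple:
        # Return the view that matches the stage the code is running in.
        if stage == "programming":
            return self.programming_view
        if stage == "execution":
            return self.execution_view
        raise ValueError(f"unknown stage: {stage!r}")


def average_age(rows) -> float:
    """Analysis code written once in the notebook and reused in both stages."""
    return mean(row["age"] for row in rows)


# Illustrative records only; no real data is implied.
dataset = ProvisionedDataset(
    programming_view=({"age": 30}, {"age": 50}),
    execution_view=({"age": 31}, {"age": 45}, {"age": 62}),
)
draft_result = average_age(dataset.load("programming"))
final_result = average_age(dataset.load("execution"))
```

In this sketch the data consumer only ever sees `draft_result` while exploring; `final_result` would be computed inside the execution stage, subject to the data provider's output controls.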

Our approach: Two-stage Data Clean Room

PrivacyGo Data Clean Room utilizes different PETs in different processing stages to maximize usability while protecting individual data privacy. Specifically, PGDCR divides data analytics into two stages: Programming Stage and Secure Execution Stage.

  1. Programming Stage: The data scientist uses a Jupyter Notebook interface to explore the general data structure and statistical characteristics. The data providers decide how to protect the privacy of their data. For example, they can provide differentially private synthetic data, completely random data, or partial public data. These options limit the leakage of individual data records; differentially private synthetic data does so with a mathematical bound. The finished notebook files can then be submitted to the Secure Execution Stage.
  2. Secure Execution Stage: The submitted notebook file is built into an image and scheduled to a confidential virtual machine (CVM) in the cloud. The data providers can set up their data so that only an attested program can fetch it. Through attestation, the data providers control which program can access their data. The TEE also assures data scientists of the integrity of their program and the legitimacy of the execution output by providing a JWT-based attestation report.
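The attestation-gated access described in the Secure Execution Stage can be sketched roughly as follows. This is an illustration under stated assumptions, not PGDCR's implementation: the `image_digest` claim name, the helper functions, and the unsigned demo token are all hypothetical, and the sketch deliberately skips signature verification, which a real deployment must perform against the attestation service's public keys before trusting any claim.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature.

    Illustration only: real attestation tokens must have their signature
    verified before any claim is trusted.
    """
    _header_b64, payload_b64, _signature_b64 = token.split(".")
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def may_release_data(claims: dict, allowed_digests: set) -> bool:
    """Data provider policy: release data only to an approved workload image.

    The 'image_digest' claim name is an assumption made for this sketch.
    """
    return claims.get("image_digest") in allowed_digests


def make_demo_token(claims: dict) -> str:
    """Build an unsigned demo token so the example is self-contained."""
    def b64(obj) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return ".".join([b64({"alg": "none"}), b64(claims), ""])


# The data provider approves one image digest (placeholder value).
approved = {"sha256:approved-notebook-image"}
token = make_demo_token({"image_digest": "sha256:approved-notebook-image"})
release = may_release_data(decode_jwt_payload(token), approved)
```

The design choice worth noting is that the policy is keyed on what program is running rather than on who is asking, which is what lets a data provider approve a specific notebook image instead of granting broad access to a user.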

The system is based on virtual machines in the cloud and provides integrity protection, access control, and testing capabilities. It generates an attestation report that can be publicly verified for authenticity. Currently, it only supports one-way collaboration, but we plan to extend it to multi-way collaboration in the future and add automated checks.

Benefits of a two-stage Data Clean Room with TEE

Data provider decides a protection mechanism
  • Data at Programming Stage: random data, DP synthetic data, or public data
  • Code/output filtering at Secure Execution Stage: can implement coarse-grained policy, instead of per-query policy
Trusted execution environment (TEE)
  • Offers transition of trust in multi-way data collaboration settings
  • Ensures integrity of code and output
  • Provides an attestation report that can be used as proof of execution
Accurate results in Secure Execution Stage
  • Full data access is securely enabled via TEE
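As one example of what "DP synthetic data" for the Programming Stage could look like, the sketch below builds a differentially private histogram with the Laplace mechanism and samples synthetic records from it. This is a generic textbook technique, not necessarily what PGDCR or any data provider uses; the function name and parameters are hypothetical. It assumes a single categorical attribute, a fixed public list of categories, and that neighboring datasets differ by adding or removing one record (so each count has sensitivity 1).

```python
import random
from collections import Counter


def dp_synthetic_sample(records, categories, epsilon, n_samples, rng=None):
    """Sample synthetic categorical records from a Laplace-noised histogram.

    `categories` must be a fixed, public list; deriving it from the private
    records would itself leak information.
    """
    rng = rng or random.Random()
    counts = Counter(records)
    noisy = []
    for category in categories:
        # Difference of two exponentials with rate epsilon is Laplace noise
        # with scale 1/epsilon, matching sensitivity 1 for a count query.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        # Clamping at zero is post-processing and does not weaken the guarantee.
        noisy.append(max(counts.get(category, 0) + noise, 0.0))
    if sum(noisy) == 0:
        # Degenerate case: fall back to uniform sampling over the public domain.
        return [rng.choice(categories) for _ in range(n_samples)]
    return rng.choices(categories, weights=noisy, k=n_samples)


# Illustrative input only.
private_records = ["US"] * 40 + ["SG"] * 25 + ["BR"] * 10
public_categories = ["US", "SG", "BR", "JP"]
synthetic = dp_synthetic_sample(
    private_records, public_categories, epsilon=1.0, n_samples=20,
    rng=random.Random(0),
)
```

A smaller `epsilon` adds more noise and gives stronger privacy at the cost of a less faithful synthetic sample, which is acceptable here because accurate results come from the Secure Execution Stage rather than from the synthetic data.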

Features of PrivacyGo Data Clean Room

PGDCR is a great tool for data collaboration and offers the following features:

  • Interactive programming: PGDCR integrates with an existing Jupyter Notebook interface such that data analysts can program interactively with popular languages like Python.
  • Multi-party collaboration: PGDCR allows multi-party data collaboration without the parties needing to send private data to each other.
  • Cloud-ready functionality: PGDCR can be easily deployed in TEEs in the cloud, including Google Confidential Space.
  • Accurate results: PGDCR does not sacrifice accuracy for data privacy. This is achieved by a two-stage approach with different PETs applied to each stage.

Use cases for PrivacyGo Data Clean Room

Because the system is built on top of cloud infrastructure, it allows for multi-cloud usage and provides benefits such as transition of trust, integrity of code and output, and monitoring. Potential use cases range from providing transparency to researchers to enabling data analytics for marketing purposes, and include:

  • Trusted Research Environments (TREs): Some data may be valuable for research in public health, economic impact, and many other fields. A TRE is a secure environment where authorized and vetted researchers and organizations can access the data. The data provider can choose to use PGDCR to build their TRE.
  • Advertisement and marketing: Advertisement is a popular use case of data collaboration frameworks. PGDCR can be used for lookalike segment analysis for advertisers, or ad tracking with private user data.
  • Machine learning: PGDCR can be useful for machine learning involving private data or models. For example, a private model provider can provide their model for fine-tuning, but not reveal the actual model in the Programming Stage.

Availability

The project was open-sourced at the Confidential Computing Summit in San Francisco on June 6, 2024 and is available on the GitHub repo. You can get started by following our getting started guide on GitHub. We are releasing an alpha version, which may be missing some necessary features. This version uses Google Cloud Platform as the backend and currently supports computation on CPU. Data provisioning, policy, and attestation are manual in the current alpha version. Our roadmap includes expanding to multi-user collaboration, making the platform extensible to multiple backends, automating data provisioning, policy, and attestation, and supporting computation on both CPU and GPU.

Community

The PrivacyGo Data Clean Room is an open-source project, and we invite the open source community and the wider industry to contribute to the development of the project. You can follow our contribution guide for detailed instructions. We use GitHub Issues to track community-reported issues and GitHub Pull Requests for accepting changes. Read our Code of Conduct to keep our community approachable and respectful.

Conclusion

The launch of PrivacyGo Data Clean Room is a significant step forward for TikTok in the confidential computing space. By offering a confidential computing solution that meets industry demand for data collaboration, TikTok can strengthen privacy protection on its platform without hindering data collaboration.


Stay up to date by following us on Twitter/X and LinkedIn!
