PETAce in Action: Enhancing privacy during Friends Matching in TikTok

by Yongchuan Niu, Research Engineer and Waiyu Lam, Software Engineer at TikTok

Tech @ TikTok

Research

Privacy

In our last post about PETAce, we introduced how advanced cryptography can be used to enhance privacy. In this post, we dive into an application of PETAce at TikTok: protecting privacy when users authorize access to phone contacts for finding friends in TikTok.

I. Recap: PETAce and the Role of Cryptography in Privacy

Cryptography-related techniques play a crucial role in various Privacy-Enhancing Technologies (PETs). It provides precise and provable privacy guarantees while maintaining the ability to gain insights from sensitive datasets. Moreover, cryptographic techniques are often used together with other PETs like differential privacy and federated learning to build a holistic and comprehensive solution to provide sufficient privacy protection throughout the data pipeline lifecycle. PETAce (Privacy-Enhancing Technologies via Applied Cryptography Engineering) is TikTok's PET framework, which is based on state-of-the-art cryptographic research results. It provides various privacy-enhancing computing capabilities such as secure multi-party computation, private set intersection, private information retrieval, and homomorphic encryption, allowing parties to work together on data processing tasks while protecting their private data.

II. Contact Book Matching: A Concrete Example of PETs using PETAce

To understand how PETAce preserves data privacy, let's dive into its application at TikTok for matching friends in TikTok using user private phone contacts. Our approach is based on private set intersection (PSI) with advanced cryptographic techniques, which ensure minimal data collection and set a new standard for safeguarding user sensitive data, enhancing trust and security.

With this new solution, we encourage users to share their contacts, facilitating the discovery of potential friends using the TikTok apps, with the assurance that we will minimize the information exposed to the other parties throughout the process. TikTok users can not only embark on a journey to connect with those they know, but are also more likely to engage with and enjoy the enhanced social functionalities provided by the platform, fostering a vibrant and connected community.

Contact Book Matching for Social Interaction

Contact book matching, often referred to as contact synchronization, is a process of identifying and connecting the user with individuals who are also using TikTok and have matching contact information, typically phone numbers. When a user authorizes access to their contact book, the server can diligently identify potential users who have registered TikTok account and share the phone numbers within the contacts. If there is a match, it allows users to discover and connect with people who they may know. This feature facilitates easier and more personalized connection-building, enhancing the user experience by leveraging existing relationships and making it convenient to find friends or acquaintances on the platform with more friend interaction.

To prioritize data privacy and ensure the confidentiality of user phone contacts throughout the matching process, our initial attempt is to employ the SHA-256 hashing mechanism [1], which is a process of data anonymization to mask sensitive information and ensure no single original item can be identified by other parties, including TikTok. This approach transforms the original data into a fixed-length hash value, making it computationally infeasible for attackers to solve for the original information. However, in case of contact book matching, the hashing method is not strong enough to prevent plain email or mobile recovery due to the nature of their low entropy [2]. The malicious party can take advantage of brute force attacks and reveal the full list of contacts from users without their consent.

To address this potential risk of information leakage, we implement our new privacy-preserving techniques as a proactive measure, offering a more secure approach to ensure a robust and confidential data-sharing experience for our users. Specifically, this new solution provides an additional layer of protection against the vulnerabilities associated with low-entropy information encodings. Our goal is to allow users to seamlessly authorize their phone contacts without revealing any sensitive information.

What is PSI?

Private set intersection is a secure multiparty computation cryptographic technique where two parties hold sets of elements (hash value of phone contacts in this case), and allow one or both of them to discover the intersection between those sets by comparing encrypted versions of these sets. In this scenario:

Neither party (i.e., TikTok users) reveals anything to the counterparty except for the elements in the intersection
Neither party (i.e., TikTok users) reveals anything sensitive about their sets to an eavesdropper

The challenge in PSI is how two or more parties can compute the intersection of their private input sets, while the elements that are not in the intersection remain private. Our PSI protocol is based on the Elliptic Curve Diffie–Hellman [3], which is a key agreement protocol that allows two parties, each having an elliptic curve public-private key pair, to establish a shared secret over an insecure channel. The overall protocol achieves outstanding computational complexity and bandwidth capacity compared to other kinds of PSI protocol such as private set intersection based on RSA blind signature.

Initial Design of Protocols

The following simplified algorithm indicated the basic idea of our protocols:

Suppose the server maintains a secret key b and a set of phone contacts {y}. The client owns a set of phone contacts {x}. Let H denote a cryptographic hash function, and E denote an elliptic curve that both the client and server agree on.

The protocol executes as follows:

The client generates a random secret key a.
The client iteratively applies the H function to compute hash values {H(x)} of each of the values in the original set of user phone contacts.
For each of these hashed values, the client applies "blind" operations and calculates the "blinded" hash value {H(x)^a} using secret key a.
The client sends the series of calculated points {H(x)^a} to the server.
Once receiving the request, the privacy server calculates the shared secrets corresponding to each element received from the client {H(x)^a} using a persistent secret key b. We denote the resulting series of points as {H(x)^ab} , which is a set of "double-blinded" hash values of the original phone contact from the client.
The server will apply the same hash function H and "blind" operation with the secret key b to all the phone contacts stored in the server, like the same way the client did. We denote the result list as {H(y)^b}, which is a set of blinded hash values of the original phone contact from the server.
The server then sends back the list of single-blinded hash values {H(y)^b}, as well as the corresponding double-blinded values {H(x)^ab} in the response to the client.
After receiving the response, the client performs an "unblind" operation to each of the double blinded values {H(x)^ab} using the secret key a . This will result in a reversed set of points {H(x)^b} which are only blinded by the secret key b.
For each of the single-blinded hash values {H(x)^b} derived from the last step, the client now can locally query {H(y)^b} received from the server to see if there is any element of {H(x)^b} is also an element in the set of {H(y)^b} . If the original phone contacts x from the client match any phone contacts y from the server, their corresponding blinded value H(x)^b and H(y)^b will be equivalent.
The matching set consists of the original phone contact present from the client, which is also present from the server. It will be further passed to the original flow of contact book matching to identify friend relations.

How to use PETAce?

In the contact book matching scenario, we use PSI based on the Elliptic Curve Diffie-Hellman (ECDH-PSI) [3] to enhance user privacy. PETAce provides multiple ways to support ECDH-PSI development. In the following, we will describe how to use PETAce to develop the ECDH-PSI protocol.

PETAce provides user-friendly Python APIs for the ECDH-PSI protocol. Below is an example illustrating how to use it in practice.

from petace.setops.psi import PSI
from petace.setops.pysetops import NetParams, NetScheme, PSIScheme

# Init network
net_params = NetParams()

# Init psi engine
psi = PSI(party, net_params, NetScheme.SOCKET, PSIScheme.ECDH_PSI)

# Compute set intersection.
ret = psi.process(data, obtain_result)

PETAce also offers a versatile C++ APIs for the ECDH-PSI protocol. The following example demonstrates how this API can be utilized in practice.

#include "nlohmann/json.hpp"

#include "network/net_factory.h"
#include "solo/prng.h"

#include "setops/psi/ecdh_psi.h"

void ecdh_psi_example() {
    // Read JSON config.
    nlohmann::json params = nlohmann::json::parse(in, nullptr, true);

    // Connect net io.
    auto net = petace::network::NetFactory::get_instance().build(petace::network::NetScheme::SOCKET, net_params);

    // Run ecdh-psi.
    petace::setops::EcdhPSI psi;
    psi.init(net, params);
    psi.process(net, input_keys, output_keys);
}

Furthermore, customized ECDH-PSI protocols can be developed based on PETAce-Solo, which can be more flexible to meet business requirements.

User Flow

The entire workflow for privacy-preserving contact book matching is illustrated in the following steps

User authorizes a contact book with list of phone contacts with prompts
The contact book is hashed and blinded locally before sending it to the server for contact matching
The privacy center invokes the server to get the response with server secret set
The client will then perform the local matching to exclude all user contacts that are not in the data breaches
The filtered contact lists would be further sent to server for friends matching

What Level of Privacy does PSI ensure?

All social media apps face a challenge in balancing user privacy with features that rely on some degree of data sharing. PSI represents an important step forward in upholding TikTok's responsibility to protect the privacy of our users. Our protocols effectively remove privacy barriers that traditionally hinder secure data sharing and users can confidently share their contact books, knowing that their data is shielded from the server and other users.

Privacy-Preserving Data Sharing: The cryptographic operations ensure that neither parties can identify original values of user data or data membership, in the counterpart's non-intersectioning dataset.
Server-Agnostic Operation with data minimization: The server remains entirely agnostic to user data, ensuring no storage or learning of information that doesn't belong to it.
End-to-End Encryption: Every step of the process is fortified with end-to-end encryption, guaranteeing that user data is secure during transmission and inaccessible to any unauthorized entities, including the server. We implement well-established and widely trusted cryptographic techniques for secure secret exchange, which is guaranteed by provable security techniques in cryptography. We use NIST recommended 256-bit elliptic curve secp256r1, which supplies approximately 128-bit security, and it is computationally infeasible to attack secp256r1-based ECDH-PSI protocol. Therefore, based on the Diffie-Hellman Assumption, our system guarantees robustness against various attacks and ensures that only authorized parties can perform the intersection.

How Efficient and Robust is PSI in Practice?

It's important to acknowledge that the incorporation of privacy protocols could lead to a negative impact on user experience due to additional data processing and escalated data transmission between the client and server in the phase of secret exchanging. To facilitate a seamless integration with PET, we offered the following significant optimizations to ensure secure and effective computation of intersection-related stats from private datasets, making it a practical product for various applications where data sharing is essential.

Pre-processing of phone contacts into cipher form on the server: Cryptographic operations during runtime could be expensive, especially considering the substantial data volume on the server. The practical issue arises concerning both latency and bandwidth capacity. Our approach is to asynchronously pre-process the contact data on the server, transform them into cipher format and store them into persistent storage, which allows them to be efficiently shared directly with the client.
Partitioning the set of phone contacts with shared indexing: Instead of processing the whole set of phone contacts from each party, we let the client generate a set of small sharding indexes from the initial bytes of the hashed prefix of contacts. This bucketization technique enables partitioning extensive datasets into smaller subsets of all necessary entries and reduces the amount of data that needs to be transmitted from the server to the client. The risk of potential phone contact leakage through the disclosure of the shared index is negligible, given that there might be billions of potential phone contacts sharing the same index. We choose the index size carefully to balance privacy and performance.
Using Bloom Filters to compress the blinded data set returned from the server: We are leveraging the inherent structure of Bloom filters to further reduce the bandwidth overhead of large volumes of data that need to be transferred. We dump all the data entries into a bloom filter on the server and return it back as a response to the client. This data structure is particularly well-suited for scenarios where there is a need to verify the presence of an element. It is important to note that the FPR of bloom filter is preserved by restricting the optimal size as threshold to avoid saturated bloom filter.