TikTok for Developers
Privacy-Preserving Data Collection from Unknown Domains
by Privacy and Security Data Lifecycle Management and Privacy Innovation Lab
Research
Privacy

In this article, we propose a Quasi-Local Privacy Protocol for collecting items from unknown domains, leveraging central Differential Privacy (DP) techniques and cryptographic tools within the Encryption-Shuffling-Analysis (ESA) architecture. This approach provides privacy guarantees comparable to Local Differential Privacy (LDP) while significantly reducing computational overhead.


Collecting data from unknown domains is a common challenge in real-world applications, such as identifying new words in text inputs, or tracking emerging URLs. Current solutions often suffer from high computational and communication costs. To address this, we propose a Quasi-Local Privacy Protocol for collecting items from unknown domains. Our protocol combines central Differential Privacy (DP) techniques with cryptographic tools within the Encryption-Shuffling-Analysis (ESA) framework. This approach provides privacy guarantees similar to Local Differential Privacy (LDP) while significantly reducing computational overhead.


Our protocol uses an auxiliary server to construct histograms without accessing the original data. The technique integrates a stability-based histogram method into the ESA framework: the auxiliary server builds histograms from encrypted messages, so the original data can never be traced. By eliminating the need for message segmentation and reconstruction, the protocol achieves accuracy close to the central DP model while maintaining LDP-like privacy guarantees.


Protocol Overview


Figure 2. Privacy-preserving data collection protocol


Our protocol uses the stability-based histogram technique [1] for collecting new data. To prevent direct data collection and tracing by the server, we integrate the ESA framework. Additionally, we introduce an auxiliary server to construct the histogram, similar to a central curator in the central DP model, but with a crucial distinction: the auxiliary server does not access the original data messages. Instead, it only receives the data encrypted with the server's public key, along with its hashed value. This ensures the auxiliary server gains no knowledge of the original message beyond the irreversible hash.


To construct the histogram, the auxiliary server counts messages by tallying their hashed values. To enhance security, messages passing between clients and the auxiliary server through the shuffler are further encrypted with the auxiliary server's public key.


The overall framework contains four phases and is shown in Figure 2. We now describe each step in detail.


Before data collection begins, the server and the auxiliary server send their public keys to each client. They also agree on a set of privacy parameters (ϵ, δ) for the DP guarantees of the release.


Phase 1: On-Device Processing

Each client secures their data through the following steps:

  1. Data Encryption with Server's Public Key: The client encrypts the data with the server's public key.
  2. Data Hashing: The private data is then hashed with a hash function that is agreed upon in advance and identical across all clients.
  3. Encryption with the Auxiliary Server's Public Key: Finally, the encrypted data and its hashed value are encrypted with the auxiliary server's public key.
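As a concrete illustration, the three on-device steps can be sketched in Python. This is a minimal sketch, not the production implementation: the `encrypt` helper and `client_prepare` name are hypothetical, and `encrypt` is only a stand-in for real public-key encryption (e.g., RSA-OAEP or ECIES) that tags the payload with a key identifier so the data flow is visible.

```python
import hashlib
import json

def encrypt(public_key: str, payload: bytes) -> bytes:
    # Stand-in for real public-key encryption (e.g., RSA-OAEP or ECIES);
    # it merely labels the payload with the key id so the flow is visible.
    return json.dumps({"key": public_key, "ct": payload.hex()}).encode()

def client_prepare(item: str, server_pk: str, aux_pk: str) -> bytes:
    # Step 1: encrypt the raw item with the server's public key.
    inner = encrypt(server_pk, item.encode())
    # Step 2: hash the item with the function shared by all clients,
    # so identical items produce identical digests.
    digest = hashlib.sha256(item.encode()).hexdigest()
    # Step 3: wrap (ciphertext, hash) under the auxiliary server's key.
    bundle = json.dumps({"ct": inner.decode(), "h": digest}).encode()
    return encrypt(aux_pk, bundle)
```

With real encryption, only the auxiliary server can open the outer layer and only the server can open the inner one; the digest is the sole piece of information the auxiliary server learns about the item.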


Phase 2: Shuffler's Anonymization and Shuffling

The encrypted message is passed to the shuffler through an end-to-end encrypted channel. After receiving each client's input, the shuffler performs the standard anonymization and shuffling steps.
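Functionally, the shuffler only needs to drop sender identity and randomize message order. A minimal sketch, with a hypothetical `shuffle_batch` name and a `seed` parameter added purely to make the sketch testable:

```python
import random

def shuffle_batch(messages, seed=None):
    """messages: list of (client_id, payload) pairs."""
    # Anonymization: drop client identifiers, keeping only the payloads.
    batch = [payload for _client_id, payload in messages]
    # Shuffling: emit a uniformly random permutation so arrival order
    # carries no information about which client sent which payload.
    random.Random(seed).shuffle(batch)
    return batch
```

A production shuffler would additionally batch inputs and strip timing metadata before forwarding anything to the auxiliary server.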


Phase 3: Auxiliary Server's Differential Privacy Protection

The auxiliary server applies Differential Privacy (DP) protection when releasing the item names.

  1. Decrypt Messages: The first step is to decrypt the messages received from the shuffler using the secret key.
  2. Hash Frequency Calculation: Since each client uses the same hash function, the hashed results for different clients with identical items must be identical. The auxiliary server calculates the frequency of each hashed result and attaches the corresponding encrypted data to it.
  3. Add DP Noise: The auxiliary server then adds Laplace noise with the appropriate scale to the frequency of each hashed result. For those hashed results whose noisy frequencies exceed a given threshold, the auxiliary server randomly samples from the corresponding encrypted data and releases it to the server.
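The frequency-calculation and noise-addition steps above can be sketched as follows. The `release` and `laplace` names are ours, the input is assumed to be already-decrypted (hash, ciphertext) pairs, and each user is assumed to contribute one item, giving a count sensitivity of 1:

```python
import math
import random
from collections import defaultdict

def laplace(scale: float, rng: random.Random) -> float:
    # Inverse-transform sampling of Laplace(0, scale) from Uniform(-1/2, 1/2).
    u = rng.random() - 0.5
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1 - 2 * abs(u))

def release(decrypted, epsilon: float, threshold: float, seed=None):
    """decrypted: list of (hash, server_ciphertext) pairs from the shuffler."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for h, ct in decrypted:
        buckets[h].append(ct)  # identical items share a hash bucket
    released = []
    for cts in buckets.values():
        noisy = len(cts) + laplace(1.0 / epsilon, rng)  # sensitivity 1
        if noisy > threshold:
            # One randomly sampled ciphertext per surviving bucket.
            released.append(rng.choice(cts))
    return released
```

The released ciphertexts are still encrypted under the server's public key, so the auxiliary server never learns the item names it releases.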


Phase 4: Server Decrypts Messages

In the final phase, the server decrypts the messages received from the auxiliary server to obtain the plaintext, which contains the item names.


Privacy and Utility Analysis

Privacy Analysis: The proposed protocol ensures differential privacy. Specifically, the released encrypted item set S is (ϵ, δ)-differentially private when the Laplace noise scale and the release threshold T are set according to the agreed privacy parameters.


Utility Analysis: We conduct experiments for our unknown domain DP solution with Kaggle's URL dataset. For convenience, we select the first 100,000 data records and limit the analysis to the first 20 characters of each URL. This is because many URLs contain client-ID-like strings towards the end, making them unique and difficult to analyze effectively for frequency-based approaches.


The observed tradeoff is that as the privacy budget ϵ decreases, the threshold T increases. As a result, fewer URLs are collected, reflecting a higher level of privacy protection at the expense of data availability. This trend is clearly illustrated in Figure 2.


Additionally, we observe that as ϵ increases, the number of collected URLs tends to stabilize. However, it's important to note that the stable number, approximately 5,500 URLs, is still significantly lower than the total number of distinct URLs, which is around 40,000. This indicates that even at higher values of ϵ, the method does not capture all unique URLs.


The reason is given by Theorem 4: the threshold T is approximately 1 - log(2δ)/ϵ, which is greater than 1 for most values of ϵ. Since the majority of URLs in the dataset have a count of 1, and the Laplace mechanism adds zero-mean noise, most of these URLs remain unrevealed. This property ensures that rare, outlier user data remains inaccessible.
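Plugging representative parameters into this approximation shows how far the threshold sits above a count of 1 (the function name is ours; the formula is the approximation quoted from Theorem 4):

```python
import math

def approx_threshold(epsilon: float, delta: float) -> float:
    # T ≈ 1 - log(2*delta) / epsilon, per the approximation above.
    return 1 - math.log(2 * delta) / epsilon

# With epsilon = 1 and delta = 1e-6, T is roughly 14: a URL seen only
# once would need +13 of zero-mean Laplace noise to be released.
print(round(approx_threshold(1.0, 1e-6), 1))  # prints 14.1
```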


For a detailed analysis, please refer to our latest paper. The source code is available at PrivacyGo.


[1] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas, “Releasing search queries and clicks privately,” in Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 171–180.

