TikTok for Developers
Privacy-Preserving Data Collection from Unknown Domains
by Privacy and Security Data Lifecycle Management and Privacy Innovation Lab
Research
Privacy

In this article, we propose a Quasi-Local Privacy Protocol for collecting items from unknown domains, leveraging central Differential Privacy (DP) techniques and cryptographic tools within the Encryption-Shuffling-Analysis (ESA) architecture. This approach provides privacy guarantees comparable to Local Differential Privacy (LDP) while significantly reducing computational overhead.


Collecting data from unknown domains is a common challenge in real-world applications, such as identifying new words in text inputs, or tracking emerging URLs. Current solutions often suffer from high computational and communication costs. To address this, we propose a Quasi-Local Privacy Protocol for collecting items from unknown domains. Our protocol combines central Differential Privacy (DP) techniques with cryptographic tools within the Encryption-Shuffling-Analysis (ESA) framework. This approach provides privacy guarantees similar to Local Differential Privacy (LDP) while significantly reducing computational overhead.


Our protocol uses an auxiliary server to construct histograms without accessing the original data. The technique integrates a stability-based histogram method into the ESA framework: the auxiliary server builds histograms from encrypted messages, so the original data can never be traced. By eliminating the need for message segmentation and reconstruction, the protocol achieves accuracy close to the central DP model while maintaining LDP-like privacy guarantees.


Protocol Overview


Figure 2. Privacy-preserving data collection protocol


Our protocol uses the stability-based histogram technique [1] for collecting new data. To prevent direct data collection and tracing by the server, we integrate the ESA framework. Additionally, we introduce an auxiliary server to construct the histogram, similar to a central curator in the central DP model, but with a crucial distinction: the auxiliary server does not access the original data messages. Instead, it only receives the data encrypted with the server's public key, along with its hashed value. This ensures the auxiliary server gains no knowledge of the original message beyond the irreversible hash.


To construct the histogram, the auxiliary server counts messages by tallying their hashed values. To enhance security, messages passing between clients and the auxiliary server through the shuffler are further encrypted with the auxiliary server's public key.


The overall framework contains four phases and is shown in Figure 2. We now describe each step in detail.


Before data collection begins, the server and the auxiliary server send their public keys to each client. They also agree on a set of privacy parameters (ϵ, δ) for the DP guarantees of the release.


Phase 1: On-Device Processing

Each client secures their data through the following steps:

  1. Data Encryption with Server's Public Key: The client encrypts the data with the server's public key.
  2. Data Hashing: The private data is then hashed with a hash function that is agreed upon in advance and identical across all clients.
  3. Encryption with the Auxiliary Server's Public Key: Finally, the encrypted data and its hashed value are encrypted with the auxiliary server's public key.
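As a concrete illustration, the three on-device steps can be sketched in Python. This is a minimal sketch, not the production implementation: the `encrypt` helper and `client_prepare` name are hypothetical, and `encrypt` is only a stand-in for real public-key encryption (e.g., RSA-OAEP or ECIES) that tags the payload with a key identifier so the data flow is visible.

```python
import hashlib
import json

def encrypt(public_key: str, payload: bytes) -> bytes:
    # Stand-in for real public-key encryption (e.g., RSA-OAEP or ECIES);
    # it merely labels the payload with the key id so the flow is visible.
    return json.dumps({"key": public_key, "ct": payload.hex()}).encode()

def client_prepare(item: str, server_pk: str, aux_pk: str) -> bytes:
    # Step 1: encrypt the raw item with the server's public key.
    inner = encrypt(server_pk, item.encode())
    # Step 2: hash the item with the function shared by all clients,
    # so identical items produce identical digests.
    digest = hashlib.sha256(item.encode()).hexdigest()
    # Step 3: wrap (ciphertext, hash) under the auxiliary server's key.
    bundle = json.dumps({"ct": inner.decode(), "h": digest}).encode()
    return encrypt(aux_pk, bundle)
```

With real encryption, only the auxiliary server can open the outer layer and only the server can open the inner one; the digest is the sole piece of information the auxiliary server learns about the item.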


Phase 2: Shuffler's Anonymization and Shuffling

The encrypted message is passed to the shuffler through an end-to-end encrypted channel. After receiving each client's input, the shuffler performs the standard anonymization and shuffling steps.
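Functionally, the shuffler only needs to drop sender identity and randomize message order. A minimal sketch, with a hypothetical `shuffle_batch` name and a `seed` parameter added purely to make the sketch testable:

```python
import random

def shuffle_batch(messages, seed=None):
    """messages: list of (client_id, payload) pairs."""
    # Anonymization: drop client identifiers, keeping only the payloads.
    batch = [payload for _client_id, payload in messages]
    # Shuffling: emit a uniformly random permutation so arrival order
    # carries no information about which client sent which payload.
    random.Random(seed).shuffle(batch)
    return batch
```

A production shuffler would additionally batch inputs and strip timing metadata before forwarding anything to the auxiliary server.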


Phase 3: Auxiliary Server's Differential Privacy Protection

The auxiliary server applies Differential Privacy (DP) protection when releasing the item names.

  1. Decrypt Messages: The first step is to decrypt the messages received from the shuffler using the secret key.
  2. Hash Frequency Calculation: Since each client uses the same hash function, the hashed results for different clients with identical items must be identical. The auxiliary server calculates the frequency of each hashed result and attaches the corresponding encrypted data to it.
  3. Add DP Noise: The auxiliary server then adds Laplace noise with the appropriate scale to the frequency of each hashed result. For those hashed results whose noisy frequencies exceed a given threshold, the auxiliary server randomly samples from the corresponding encrypted data and releases it to the server.
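The frequency-calculation and noise-addition steps above can be sketched as follows. The `release` and `laplace` names are ours, the input is assumed to be already-decrypted (hash, ciphertext) pairs, and each user is assumed to contribute one item, giving a count sensitivity of 1:

```python
import math
import random
from collections import defaultdict

def laplace(scale: float, rng: random.Random) -> float:
    # Inverse-transform sampling of Laplace(0, scale) from Uniform(-1/2, 1/2).
    u = rng.random() - 0.5
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1 - 2 * abs(u))

def release(decrypted, epsilon: float, threshold: float, seed=None):
    """decrypted: list of (hash, server_ciphertext) pairs from the shuffler."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for h, ct in decrypted:
        buckets[h].append(ct)  # identical items share a hash bucket
    released = []
    for cts in buckets.values():
        noisy = len(cts) + laplace(1.0 / epsilon, rng)  # sensitivity 1
        if noisy > threshold:
            # One randomly sampled ciphertext per surviving bucket.
            released.append(rng.choice(cts))
    return released
```

The released ciphertexts are still encrypted under the server's public key, so the auxiliary server never learns the item names it releases.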


Phase 4: Server Decrypts Messages

In the final phase, the server decrypts the messages received from the auxiliary server to obtain the plaintext, which contains the item names.


Privacy and Utility Analysis

Privacy Analysis: The proposed protocol ensures differential privacy. Specifically, the released encrypted item set S is (ϵ, δ)-differentially private when the Laplace noise scale and the release threshold T are set according to the agreed privacy parameters.


Utility Analysis: We conduct experiments for our unknown domain DP solution with Kaggle's URL dataset. For convenience, we select the first 100,000 data records and limit the analysis to the first 20 characters of each URL. This is because many URLs contain client-ID-like strings towards the end, making them unique and difficult to analyze effectively for frequency-based approaches.


The observed tradeoff is that as the privacy budget ϵ decreases, the threshold T increases. As a result, fewer URLs are collected, reflecting a higher level of privacy protection at the expense of data availability. This trend is clearly illustrated in Figure 2.


Additionally, we observe that as ϵ increases, the number of collected URLs tends to stabilize. However, it's important to note that the stable number, approximately 5,500 URLs, is still significantly lower than the total number of distinct URLs, which is around 40,000. This indicates that even at higher values of ϵ, the method does not capture all unique URLs.


The reason is given by Theorem 4: the threshold T is approximately 1 - log(2δ)/ϵ, which is greater than 1 for most values of ϵ. Since the majority of URLs in the dataset have a count of 1, and the Laplace mechanism adds zero-mean noise, most of these URLs remain unrevealed. This property ensures that rare, outlier user data remains inaccessible.
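Plugging representative parameters into this approximation shows how far the threshold sits above a count of 1 (the function name is ours; the formula is the approximation quoted from Theorem 4):

```python
import math

def approx_threshold(epsilon: float, delta: float) -> float:
    # T ≈ 1 - log(2*delta) / epsilon, per the approximation above.
    return 1 - math.log(2 * delta) / epsilon

# With epsilon = 1 and delta = 1e-6, T is roughly 14: a URL seen only
# once would need +13 of zero-mean Laplace noise to be released.
print(round(approx_threshold(1.0, 1e-6), 1))  # prints 14.1
```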


For a detailed analysis, please refer to our latest paper. The source code is available at PrivacyGo.


[1] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas, “Releasing search queries and clicks privately,” in Proceedings of the 18th International Conference on World Wide Web, 2009, pp. 171–180.

