Using Local Differential Privacy Frequency Estimation for Secure Data Collection

by Privacy and Security Data Lifecycle Management and Privacy Innovation Lab

Research

Privacy

This article introduces a privacy-preserving protocol for frequency estimation developed by TikTok Privacy Innovation. This protocol enhances existing frameworks with improved accuracy, privacy, and communication efficiency under Local Differential Privacy (LDP).

Monitoring user activity is a critical goal in application-level data processing, as it improves user experience and supports advanced functionalities. For instance, analyzing the frequency with which users visit specific websites (URLs) can help identify unusual traffic patterns. Such anomalies may indicate potential security vulnerabilities, enabling timely alerts and interventions. However, user behaviour and interactions usually occur locally, making direct access to this data challenging due to privacy concerns. To address these challenges, we have developed a locally differentially private frequency estimation protocol that extends and generalizes Apple’s Count-Mean-Sketch (CMS) method. Our protocol introduces tunable parameters, offering an optimized balance between utility, privacy, and communication costs.

Introduction

Differential Privacy (DP) provides robust privacy guarantees for individuals within a dataset while still enabling meaningful data analysis. Among the various DP models, Local Differential Privacy (LDP) is particularly favoured for real-world applications due to its ability to obscure data at the source, eliminating the need for a central data curator.

Frequency estimation, a fundamental task in data analytics, has various implementations under the LDP model. Typically, clients encode their responses, perturb the encoded values, and send them to an aggregator, who then decodes the values to estimate item frequencies. The "pure" LDP framework includes several protocols, such as Google's RAPPOR, which enable precise analysis and comparison of differential privacy techniques. However, it faces challenges when handling large domain sizes. To mitigate this problem, hash functions are commonly used, but they can introduce collisions, reducing accuracy. Apple’s Count-Mean-Sketch (CMS) framework addresses these issues by incorporating a statistical data structure, though it lacks optimized perturbation and aggregation parameters.

We propose the Generalized Count-Mean-Sketch (GCMS) protocol, which builds on Apple's CMS by accounting for randomness in responses and hash collisions, improving accuracy under the LDP model. GCMS reduces communication costs while maintaining privacy and provides a utility guarantee for optimal parameter design.

To further enhance privacy, we use the Encryption-Shuffling-Analysis (ESA) framework. Clients' encoded outputs are encrypted and shuffled by a trusted shuffler, breaking the link between clients and responses and amplifying privacy. The server then aggregates and analyzes the shuffled data.

General Count Mean Sketch (GCMS) Protocol

Overview of the protocol

Figure 1. General Count Mean Sketch Protocol in the Encryption - Shuffling - Analysis Framework.

Preparation:

Before data collection, the server generates k independent hash functions, where each hash function deterministically maps any input to a discrete number from 1 to m, with m being the hashing range. The server then sends all the hash functions and its public key pk to each client.

The entire data collection pipeline is illustrated in Figure 1. It consists of three distinct phases, each operating on different platforms:

The initial phase involves all on-device algorithms, including hash encoding and privatization, and encryption with the server's public key pk.
The encrypted privatized data is then transmitted through an end-to-end encrypted channel and received by the shuffler. The shuffler then forwards the data to the server after a random shuffling operation.
Finally, the server decrypts the input data, performs data aggregation, and obtains the Frequency Lookup Table - a sketch matrix M.

We now detail operations in each phase.

Phase 1: On-device operations.

We present the process for a single data privatization and encryption, which consists of the following steps.

Hash Encoding: Each client first uniformly picks a number i from 1 to k, and calculates a hashed value of their raw data d with the i-th hash function: r = hash(i, d).

Probabilistic Inclusion: Each client initializes their privatized vector as an empty set, then adds the hashed value to x with probability p ∈ [0.5,1].

Probabilistic Extension: Set the extension domain for each client as a set containing all values but r. If r is added to x, then uniformly select s-1 elements from the extension domain and append them to x. If r is not added to x, then uniformly select s elements from the extension domain and append them to x. Return the privatized vector along with the selected hash index i.

Encryption and Release: Each client encrypts the vector and the hash index with the server's public key pk, then releases the results to the shuffler.

Phase 2: Shuffler's anonymization and shuffling.

This phase, which is standard, involves Anonymization and Shuffling: After receiving each client's input. The shuffler removes all traceable information from the received message, e.g., the communication header. It then randomly shuffles the anonymized messages in batches and then releases the data to the server. Additionally, the shuffler may block clients who have already sent a certain amount of messages to limit their contributions. This helps filter out heavy clients or counter data poisoning attacks, effectively limiting sensitivity for client-level DP analysis.

Phase 3: Server's aggregation and frequency estimation.

The server first decrypts messages with the secret key. Then, the server-side algorithm constructs a sketch matrix, in which the rows are indexed by hash functions, and each row i is the sum of the privatized vector of clients who selected the i-th hash function. To estimate the frequency of a message d, the server calculates all the hashed values of d, and then aggregates the total count from the sketch matrix M to get C(d):

C(d) = 0
for each row i in M:
    C(d) += M[i][hash(i, d)]

An unbiased estimator for estimating the numbers of occurring is

Privacy and utility analysis of GCMS

Privacy guarantee: the GCMS protocol achieves both LDP and DP guarantee for users' instantiated inputs. We start with the LDP guarantee. Formally, the privatized vector x is ϵ -locally differentially private with