TikTok for Developers
SecureNumpy: Empowering Data Scientists with Secure Multi-Party Computation
by The TikTok Privacy Innovation Team

Last year, we announced PETAce, one of the initiatives undertaken by Privacy Innovation at TikTok to research innovative ways of safeguarding the privacy and security of user data. PETAce is a comprehensive framework for enhancing privacy with applied cryptography.

Within the PETAce ecosystem, we now introduce SecureNumpy, a data analysis module designed to bridge the gap between NumPy, a fundamental package for scientific computing with Python, and the privacy-preserving paradigms of secure Multi-Party Computation (MPC).


PETAce Ecosystem

Privacy-Enhancing Technologies via Applied Cryptography Engineering (PETAce) is a framework for privacy-preserving computing. It can provide a strong guarantee of privacy by allowing data to be analyzed and computed without revealing any sensitive information during data processing. It consists of the following parts:

  • The user interface layer provides users with high-level programming interfaces for collaborative data analysis, joint SQL query, and privacy-preserving machine learning.
  • The virtual machine is responsible for parsing high-level language into MPC operators, and performing automatic optimization and scheduling.
  • The protocol layer includes secure multi-party computation protocols, such as general two-party secure computation protocols, private set intersection, private information retrieval, and more.
  • The primitive layer consists of standard cryptographic algorithms and protocols, differential privacy mechanisms, abstract network interfaces, and more.

Introduction

The use of large-scale data and machine learning has significantly transformed the way the industry processes data and creates knowledge. However, this technological advancement also brings significant privacy threats, especially when dealing with confidential information. Traditional data analysis tools, such as NumPy, mainly function on plaintext data, making such data more vulnerable to potential misuse or unauthorized access.

Secure Multi-Party Computation (MPC) offers a solution to this problem by allowing computations to be performed on encrypted data, thus maintaining data privacy throughout the analysis process. Despite its privacy protections, MPC is often perceived as having complex interfaces and steep learning curves, limiting its adoption among a wider audience.

SecureNumpy is a library designed to bridge the gap between NumPy, a fundamental package for scientific computing with Python, and the privacy-preserving paradigms of MPC. With its intuitive interface, SecureNumpy provides a seamless experience that allows users to perform NumPy's powerful array manipulations and mathematical functions on encrypted data without compromising their privacy. SecureNumpy aims to address privacy and security concerns while offering the following:

  • Make MPC easier to use: By providing an interface similar to NumPy, SecureNumpy significantly lowers the barrier to entry for MPC, empowering more practitioners without deep cryptographic expertise.
  • Facilitate secure data analysis: SecureNumpy facilitates collaboration between organizations by enabling the extraction of valuable insights from shared datasets while ensuring the strict confidentiality of each participant's individual contributions.
  • Enhance performance and efficiency: SecureNumpy strives to optimize the underlying MPC operations and make secure computation practical for real-world applications.
  • Encourage open research and collaboration: As an open-source library, SecureNumpy is expected to facilitate a collaborative environment that combines cryptography and data science. Interdisciplinary collaboration will produce innovative solutions that prioritize user privacy.

In summary, SecureNumpy merges the simplicity and power of NumPy with the security guarantees of MPC, creating an effective and secure tool for data collaboration analysis.

Design Principles

In the development of SecureNumpy, we adopted five main design principles:

  1. Security and reliability: SecureNumpy maintains strict cryptographic standards and leverages state-of-the-art techniques for data confidentiality, correctness of calculations, and resistance to attacks. MPC ensures the confidentiality and integrity of the data in the semi-honest setting, and the data is secret-shared among participants to prevent single-party access.
  2. Simplicity and ease of use: SecureNumpy simplifies the complexity of MPC, providing a straightforward interface reminiscent of NumPy. The goal is to enable users to write code naturally and learn quickly. Users can write code as if they were using NumPy, without understanding the underlying cryptographic operations. Interface familiarity significantly reduces the learning curve, allowing new users to become proficient in a short time.
  3. Flexibility: SecureNumpy employs a user-friendly interface design that mirrors the NumPy suite, allowing users to engage with it without learning unfamiliar APIs, significantly lowering the learning curve. More importantly, this approach also empowers users to develop new functions that do not yet exist in the library, thus further enhancing its functionality. For example, even though SecureNumpy does not currently offer a built-in function to compute the ciphertext log function, users can apply the Taylor expansion method to construct a log function based on the existing multiplication function. This design approach not only promotes SecureNumpy's scalability, but also encourages users to customize and extend it to meet their specific needs, catering to a wide range of complex computing requirements. Through the use of modularity and plug-in mechanisms, users can easily integrate custom functions, thereby making SecureNumpy more flexible and powerful in multi-party computing environments.
  4. Efficiency: SecureNumpy, as a core component of PETAce, provides users with an easy-to-use API. For the underlying core protocol, the industry's most advanced research results were integrated and implemented in C++ to ensure optimal performance.
  5. Openness and transparency: SecureNumpy prioritizes openness and transparency by embracing the open-source philosophy, offering users access to its source code and detailed development documentation, facilitating community review and contributions.
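The log example mentioned under Flexibility can be sketched in plain NumPy. The function below is a hypothetical illustration, not part of SecureNumpy: it builds ln(x) from additions, subtractions, and multiplications only, using the series ln(1 + u) = u - u^2/2 + u^3/3 - ..., which converges only for 0 < x < 2. Whether the same function runs unchanged on a SecureArray depends on the operators that the library supports.

```python
import numpy as np

def taylor_log(x, n_terms=60):
    """Approximate ln(x) for 0 < x < 2 using only +, -, and *.

    Hypothetical helper for illustration; not a SecureNumpy function.
    """
    u = x - 1.0
    power = u          # holds u**k for the current k
    result = u * 1.0   # k = 1 term
    for k in range(2, n_terms + 1):
        power = power * u
        sign = 1.0 if k % 2 == 1 else -1.0
        # Multiply by a plaintext constant instead of dividing by k.
        result = result + power * (sign / k)
    return result

x = np.array([0.5, 1.0, 1.5])
approx = taylor_log(x)
```

Inputs outside (0, 2) would first have to be rescaled into that interval, and the number of terms trades precision against the number of secure multiplications.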


Key Designs of SecureNumpy

The N-dimensional array (SecureArray)

SecureArray is the foundational object provided by SecureNumpy, designed specifically to enable MPC while maintaining a user-friendly interface similar to that of NumPy arrays. SecureNumpy aims to bring the power and simplicity of NumPy's array operations to the domain of secure computations, allowing users to perform complex data manipulations and analyses without compromising privacy or security.

A SecureArray is a fixed-size multidimensional container of items of the same type and size, distributed between two parties. Like NumPy, SecureArray provides facilities such as shape and dtype, as well as explicit indexing capabilities. SecureArray also supports operator overloading, enabling basic operations like addition, subtraction, multiplication, division, and comparison, with either SecureArray or numpy.ndarray.

Key Features

  • Operator Overloading: SecureArray supports operator overloading, which allows users to perform arithmetic and logical operations using standard Python operators. This means you can use operators like + for addition, - for subtraction, * for multiplication, / for division, and more, directly on SecureArray instances.
    • Arithmetic Operations: You can perform secure addition, subtraction, multiplication, and division between SecureArray instances or between SecureArray and numpy.ndarray.
    • Comparison Operations: You can also perform secure comparison operations such as ==, !=, <, <=, >, and >=, enabling secure conditional logic without revealing the underlying data.
    • Element-wise Operations: All these operations are applied element-wise, similar to how they work in NumPy, ensuring a consistent and intuitive experience.
  • Attributes: Array attributes are integral to the structure and behavior of the array itself. The properties associated with an array through its attributes can be accessed, and occasionally modified, without recreating a new array.
    • Shape: The shape attribute allows users to query the shape of the SecureArray. This is particularly useful for understanding the structure of the data and for performing operations that require specific shapes.
    • Number of Dimensions (ndim): The ndim attribute provides the number of dimensions (axes) of the SecureArray.
    • Size: The size attribute returns the total number of elements in the SecureArray.
    • Data Type (dtype): The dtype attribute specifies the data type of the elements stored in the SecureArray, such as np.float64 and np.bool_.
  • Methods: SecureArray provides some useful methods to operate on an array. All the methods return an array result. Common methods are as follows.
    • Shape manipulation: SecureArray supports common shape manipulation methods such as reshape and transpose. The reshape method allows users to change the shape of the SecureArray and return a new array. The transpose method returns a new array with axes transposed. These operations are essential for preparing data for specific algorithms that require inputs of a certain shape or for mathematical computations that require rearranging dimensions.
    • Array conversion: In contrast to NumPy, SecureNumpy is a two-party scientific computing library. Each party holds a share of the data, rather than the original data. At some point, you may want to perform an operation on this share, such as saving it to a file. Therefore, we implemented the to_share method, which transforms the local share into a numpy.ndarray for subsequent user operations. We also provide the fromshare function for restoring the shares of both parties to a SecureArray.
    • Reveal: After executing a series of ciphertext computations, the reveal_to function allows you to reveal the result to one of the parties. This function restores the SecureArray to a numpy.ndarray and transmits it to that party, while the other party receives None.
  • Indexing: SecureArray can be indexed using the standard Python arr[obj] syntax. Basic indexing is available, including single element indexing and slicing.
    • Single element indexing: A single index can be used to access individual elements of a SecureArray. Indexing a 2-d array with a single index returns a 1-d array. Negative indices are also supported. If x is a 2-d array, then you can use x[0], x[-2], x[0, 2], and x[0][2] to get different objects.
    • Slicing: A slice object is constructed by start:stop:step. SecureArray supports slice indexing, and step must equal 1. If x is a 2-d array, then you can use x[:2], x[-2:], and x[0:, 2:] to get different objects.
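To make the idea of a "share" more concrete, here is a toy sketch of two-party additive secret sharing in plain NumPy. It is an assumption chosen for illustration: the article does not describe SecureNumpy's actual sharing scheme or number encoding, and the names encode, share, and reconstruct are hypothetical. Floats are encoded as fixed-point integers, split into two random-looking shares modulo 2^64, and recovered only when both shares are combined.

```python
import numpy as np

SCALE = 2 ** 16  # fixed-point scaling factor (illustrative choice)

def encode(x):
    """Encode floats as fixed-point integers, viewed as unsigned 64-bit words."""
    values = np.atleast_1d(np.asarray(x, dtype=np.float64))
    return np.round(values * SCALE).astype(np.int64).view(np.uint64)

def share(x, rng):
    """Split x into two additive shares modulo 2**64."""
    secret = encode(x)
    share0 = rng.integers(0, 2 ** 64 - 1, size=secret.shape,
                          dtype=np.uint64, endpoint=True)
    share1 = secret - share0  # unsigned arithmetic wraps modulo 2**64
    return share0, share1

def reconstruct(share0, share1):
    """Combine both shares and decode back to floats."""
    total = share0 + share1  # wraps modulo 2**64
    return total.view(np.int64).astype(np.float64) / SCALE

rng = np.random.default_rng()
x = np.array([1.5, -2.25, 3.0])
y = np.array([0.5, 4.0, -1.0])
x0, x1 = share(x, rng)
y0, y1 = share(y, rng)

# Each party adds its own shares locally; no communication is needed.
sum_revealed = reconstruct(x0 + y0, x1 + y1)
```

In this toy scheme a single share is uniformly random on its own, which is why saving or transmitting one share does not expose the plaintext. Operations such as multiplication and comparison need interactive protocols and are not covered by this sketch.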

Routines

One of the main reasons for the widespread use of NumPy is its extensive library of functions. These functions cover a wide range of capabilities, from basic numerical arithmetic to advanced matrix operations. Whether you are performing simple element-wise operations or complex linear algebra computations, NumPy's optimized C and Fortran code ensures that these operations are performed with maximum efficiency.

For example, NumPy supports a variety of mathematical operations including trigonometric, statistical, and algebraic functions. It also provides powerful capabilities for handling arrays. These features make it an indispensable tool for data analysis, scientific computing, and machine learning.

To boost usability, SecureNumpy also offers a series of practical routines, including reshape, stack, sum, argmax, and more. These routines, grouped by functionality, offer an experience similar to using NumPy, making it easy for users to transition between the two libraries. SecureNumpy ensures that the computations are performed securely and efficiently, adhering to best practices in secure computing.

The following are the main modules and supported functions in SecureNumpy:

| Module | Description | Examples |
|---|---|---|
| Array creation | Methods to create SecureArray | ones, zeros, arange |
| Array manipulation | Change array shape, transpose an array, and join arrays | reshape, transpose, repeat |
| Mathematical functions | Some arithmetic operations; exponents and logarithms will be provided in the future | sum, prod, max |
| Linear algebra | Some matrix and vector product functions | dot, inner |
| Sorting, searching and counting | Some sort and search functions | where, argmax, sort |
| Statistics | Statistic functions to calculate order, average and variance | ptp, average, mean |

Usage methodology

Setting up the virtual machine

SecureNumpy is developed based on PETAce Duet. In order to utilize SecureNumpy, the first step involves initializing a Duet virtual machine (VM). Here, each party needs to specify its own party ID. Party identification plays a significant role in establishing and securing network connections. Once this is complete, the two parties can communicate using the specified IP and port.

Since this is a two-party computing library, the following program must be executed on two distinct machines or in two distinct processes, with each one passing its own party ID (0 or 1) as a command-line argument.

import sys
from petace.network import NetParams, NetScheme, NetFactory
from petace.duet import VM

# command-line arguments are strings, so convert before comparing with 0 or 1
party_id = int(sys.argv[1])
host = "127.0.0.1"
port0 = 8090
port1 = 8091

net_params = NetParams()
if party_id == 0:
    net_params.remote_addr = host
    net_params.remote_port = port1
    net_params.local_port = port0
else:
    net_params.remote_addr = host
    net_params.remote_port = port0
    net_params.local_port = port1

# init net and mpc engine
net = NetFactory.get_instance().build(NetScheme.SOCKET, net_params)
vm = VM(net, party_id)
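Because the party ID arrives as a command-line string, it can be worth validating it before it is used to pick ports. The helper below is a hypothetical sketch that uses only the standard library; parse_party_id is not part of PETAce.

```python
def parse_party_id(argv):
    """Return the party ID (0 or 1) from a sys.argv-style list.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    if len(argv) < 2:
        raise ValueError("usage: python script.py <party_id: 0 or 1>")
    try:
        party_id = int(argv[1])
    except ValueError:
        raise ValueError(f"party_id must be an integer, got {argv[1]!r}")
    if party_id not in (0, 1):
        raise ValueError(f"party_id must be 0 or 1, got {party_id}")
    return party_id

example_id = parse_party_id(["script.py", "1"])
```

In the setup script, this would be called as parse_party_id(sys.argv) in place of reading sys.argv[1] directly.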

Create SecureArray and conduct basic operations

In a two-party computing scenario, one party will provide the original plaintext data to secret-share, and the other only needs to enter None. The array function is used to help users share plaintext securely and convert it into a SecureArray object. To indicate where the plaintext data originated, the user must enter a party ID for each operation.

When data is converted into a SecureArray, it can be treated like a local numpy.ndarray. We've overridden all basic operations, enabling users to freely perform mathematical operations on plaintext and ciphertext. Moreover, the module also provides common attributes for code debugging. However, these properties only showcase meta-info about the data, such as its shape and dimensions, and do not directly reveal the original data source.

Once all the MPC calculations are complete, the final result can be restored to a party using the reveal_to method. In this example code, the final result of the calculation is revealed to party 0, who will receive the plaintext result, while party 1 will receive None.

In the example code, the original data source is identified by reading the code. It's important to remember that the original data source should not be created through code, but imported by the user. For example, you can load the data using plain_data0 = np.load("/path/data.npy"), which will only reveal that you provided some data, but not the actual data itself. The data in this example is generated solely for the purpose of making it easier to understand.

Currently, our module only supports data sources that are numpy.ndarray with data types of float64 or bool that have at most two dimensions. However, we are planning to add support for more data types in the future.

import petace.securenumpy as snp
import numpy as np

snp.set_vm(vm)


if party_id == 0:
    plain_data0 = np.array([1., 2., 3.])
    plain_data1 = None
else:
    plain_data0 = None
    plain_data1 = np.array([4., 5., 6.])

# transform a numpy.ndarray to a SecureArray
cipher_data0 = snp.array(plain_data0, 0)
cipher_data1 = snp.array(plain_data1, 1)

# some basic operations
res1 = cipher_data0 + cipher_data1
print(res1.shape) # (3,)
print(res1.dtype) # np.float64

res2 = cipher_data0 * 2
res3 = cipher_data0 / cipher_data1[0]
res4 = cipher_data0 > cipher_data1
print(res4.dtype) # np.bool_

# reveal data
res_plain = res1.reveal_to(0)

Advanced features

SecureNumpy has a comprehensive library of functions designed to enhance the usability of the application. These functions mirror the native functions in the NumPy package, allowing users to seamlessly incorporate them into their workflow. The following example code demonstrates how to use some of the basic functions provided by SecureNumpy, such as creating special arrays, modifying array dimensions, summing array elements, and identifying maximum values.

# create data
data0 = snp.arange(20).reshape((4, 5))
data1 = snp.ones((4, 4))

# manipulation
data = snp.concatenate([data0, data1], axis=1)
print(data.shape) #(4, 9)
data_reshape = snp.reshape(data, (6, 6))

# some math function and statistic
res_sum = snp.sum(data)
res_max = snp.max(data, axis=0)
res_argmax = snp.argmax(data, axis=1)

SecureNumpy's extensive library enables users to build their own functions, even without a thorough knowledge of cryptography. The code below shows that users can implement activation functions such as ReLU and PReLU by taking advantage of the functions we offer. This is a major driving force behind SecureNumpy's development. We strongly believe that all users, regardless of their cryptography skills, can be part of SecureNumpy and contribute effectively to its development.

def ReLU(x):
    # element-wise max(0, x), expressed with where
    return snp.where(x < 0, snp.zeros(x.shape), x)

def PReLU(x, a):
    return snp.where(x<0, a*x, x)
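As a plaintext sanity check, the same where-based selection can be run in plain NumPy and compared with np.maximum. This is a sketch under the assumption that the SecureNumpy functions behave like their NumPy counterparts. The lowercase names relu and prelu are illustrative only.

```python
import numpy as np

def relu(x):
    # Select 0 where x is negative, otherwise keep x.
    return np.where(x < 0, np.zeros(x.shape), x)

def prelu(x, a):
    # Scale negative entries by a, keep the rest unchanged.
    return np.where(x < 0, a * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = relu(x)
prelu_out = prelu(x, 0.1)
```

Both functions evaluate every element the same way regardless of its sign, which is what makes this style suitable for data that must stay hidden.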

Use cases

To clarify how SecureNumpy applies to various scenarios, here are a few hypothetical examples.

Scenario 1: Multi-party data analysis

Imagine two affiliates: Company A, which specializes in selling high-end luxury goods, and Company B, which specializes in private banking. Each affiliate maintains a unique database, with Company A holding data on its customers' purchasing habits and financial profiles, while Company B holds data on its customers' banking history and other financial data. In an effort to more precisely identify and tailor wealth management solutions for high-net-worth customers, Company B wishes to integrate these two datasets for a more comprehensive understanding of its clients' financial situations.

Company A has the customers' luxury purchase history data table data0, including two columns: purchase frequency and purchase amount. Company B has the customers' financial asset data data1, including two columns: account balance and wealth management account amount. High-net-worth customers (HNWC) are defined as follows: purchase frequency is greater than 2, purchase amount is greater than 5,000, and bank account balance is greater than 5,000,000. This logic can be expressed as below:

cond = (data0[:, 0]>2) & (data0[:, 1]>5000) & (data1[:, 0] >5000000)
cond_float = snp.where(cond, snp.ones(cond.shape), snp.zeros(cond.shape))

hnwc_number = snp.sum(cond_float)
hnwc_cost = snp.sum(cond_float * data1[:, 1])
average_cost = hnwc_cost / hnwc_number
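To see what this logic computes, here is the same calculation in plain NumPy on a few made-up rows. The numbers are invented purely for illustration. The sketch assumes that the rows of data0 and data1 refer to the same customers in the same order.

```python
import numpy as np

# Made-up rows for illustration only.
# data0 columns: purchase frequency, purchase amount
data0 = np.array([[3, 6000], [1, 9000], [5, 7000], [4, 3000]], dtype=np.float64)
# data1 columns: account balance, wealth management account amount
data1 = np.array([[6e6, 1e5], [8e6, 2e5], [7e6, 3e5], [9e6, 5e4]])

cond = (data0[:, 0] > 2) & (data0[:, 1] > 5000) & (data1[:, 0] > 5000000)
# Convert the boolean mask to 0/1 floats so it can be summed and multiplied.
cond_float = np.where(cond, np.ones(cond.shape), np.zeros(cond.shape))

hnwc_number = np.sum(cond_float)               # customers meeting all criteria
hnwc_cost = np.sum(cond_float * data1[:, 1])   # their total wealth management amount
average_cost = hnwc_cost / hnwc_number
print(hnwc_number, average_cost)  # 2.0 200000.0
```

Only rows 0 and 2 satisfy all three conditions here. In the secure version, neither party would learn which rows matched, only the revealed aggregate.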

Through multi-party data analysis, Company B gains insights into the investment preferences of high-net-worth customers, enabling the creation of customized marketing strategies specifically designed to cater to their needs and preferences. This process not only improves customer satisfaction but also strengthens their brand loyalty. Additionally, SecureNumpy ensures the privacy and security of data during the analysis process, providing enterprises with a reliable solution that allows for both data sharing and joint analysis.

Scenario 2: Privacy-preserving machine learning

Privacy-preserving machine learning is a complex field that requires a comprehensive understanding and careful consideration of multiple aspects.

  • Multi-party computing protocols: You need to understand and implement complex multi-party computing protocols that ensure that participants do not expose their data during the computation process.
  • Design of basic operators: You need to design and implement basic operators based on multi-party computing protocols, such as matrix multiplication, addition, and more. These operators need to perform efficient calculations under the premise of ensuring data privacy.
  • Data encryption and decryption: To ensure the security of data during transmission and computing, it is necessary to encrypt and decrypt the data.
  • Performance optimization: Multi-party computing typically increases computing and communication overhead, so performance optimization is required to ensure computational efficiency.

Using SecureNumpy makes it easy to implement privacy-preserving machine learning. It simplifies the complexity of multi-party computing (MPC), making it easy even for developers who are new to MPC technology to get started. Here are the main advantages of SecureNumpy:

  • Simplified interface: SecureNumpy provides a similar interface to NumPy, allowing developers to develop in a familiar environment. There is no need to relearn complex multi-party computing protocols.
  • Built-in privacy protection: SecureNumpy implements multi-party computation protocols and basic operators at the bottom, so that data is always encrypted during computing. Developers do not need to manage the underlying encryption and decryption operations.
  • Efficient computing: The optimized multi-party computing algorithm ensures the efficiency of computing and minimizes the performance overhead in traditional multi-party computing.
  • Easy to integrate: Existing NumPy-based code can be converted to a privacy-preserving version with only a few modifications. SecureNumpy is designed to make this conversion process very simple.
  • Rich function library: SecureNumpy inherits the rich function library of NumPy and is optimized for multi-party computation, providing powerful numerical computation and matrix operation functions.

For example, consider that you already have a linear regression model based on a NumPy implementation, and now you want to implement the same model in a multi-party computing environment to protect the data privacy of all parties. With SecureNumpy, you can achieve this with only a few modifications to the original code.

import numpy as np

class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.theta = None

    def fit(self, X, y):
        # Add bias term (intercept) to X
        X_b = np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)

        # Number of training samples and features
        m, n = X_b.shape

        # Initialize weights (theta) to zeros
        self.theta = np.zeros(n)

        # Gradient Descent
        for _ in range(self.n_iterations):
            gradients = (1 / m) * np.dot(X_b.T, np.dot(X_b, self.theta) - y)
            self.theta -= self.learning_rate * gradients

    def predict(self, X):
        # Add bias term (intercept) to X
        X_b = np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)
        return np.dot(X_b, self.theta)

    def get_params(self):
        return self.theta

As illustrated in the code above, the MPC version of linear regression can be conveniently implemented by replacing the initial line of code with import petace.securenumpy as np. This will ensure a privacy-preserving linear regression process.
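Before swapping the import, it can help to confirm that the plaintext gradient descent recovers a known answer. The sketch below restates the fit logic as a standalone function and runs it on synthetic data generated from y = 1 + 2x. The function name and the hyperparameters are illustrative choices.

```python
import numpy as np

def fit_linear_regression(X, y, learning_rate=0.5, n_iterations=1000):
    # Add bias term (intercept) to X
    X_b = np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)
    m, n = X_b.shape
    theta = np.zeros(n)
    for _ in range(n_iterations):
        # Gradient of the mean squared error with respect to theta
        gradients = (1 / m) * np.dot(X_b.T, np.dot(X_b, theta) - y)
        theta = theta - learning_rate * gradients
    return theta

# Synthetic data with a known intercept (1.0) and slope (2.0)
X = np.linspace(0.0, 1.0, 20).reshape((20, 1))
y = 1.0 + 2.0 * X[:, 0]
theta = fit_linear_regression(X, y)
```

The model relies only on concatenate, ones, zeros, dot, transposition, and basic arithmetic, which is why it is a natural candidate for the import swap.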

Example code can be found in the PETAce GitHub repository.

Limitations

Although SecureNumpy is powerful, it still has certain limitations.

Unable to implement if

Using a ciphertext bool in an if statement results in a runtime error. If it did not, the if statement would treat the ciphertext bool as a non-empty object and always evaluate it as True, regardless of its actual value. In addition, encrypted data is highly sensitive: if the if statement were allowed to operate on ciphertext, malicious actors could exploit the short-circuit effect to gain unauthorized access to information. For instance, the plaintext value of a binary variable could easily be deduced through a simple if statement.

if cond:
    res = a
else:
    res = b

While a direct determination of the conditional statement is not feasible, we offer the where function to assist users in selecting branches. The above code may be modified as follows: res = snp.where(cond, a, b), which achieves the intended result and offers additional security.

For more complex conditions, corresponding conversion strategies can be applied. For example, the following code snippet can be translated into the equivalent res = snp.where(cond1, a, snp.where(cond2, b, c)).

if cond1:
    res = a
elif cond2:
    res = b
else:
    res = c
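The equivalence between the branching code and nested where calls can be checked on plaintext data with NumPy. This is a sketch under the assumption that snp.where follows np.where semantics. The arrays are arbitrary illustrative values.

```python
import numpy as np

cond1 = np.array([True, False, False])
cond2 = np.array([True, True, False])
a = np.array([1.0, 1.0, 1.0])
b = np.array([2.0, 2.0, 2.0])
c = np.array([3.0, 3.0, 3.0])

# Element-wise selection: both branches are computed for every element,
# so nothing about cond1 or cond2 is exposed through control flow.
nested = np.where(cond1, a, np.where(cond2, b, c))

# Reference result using ordinary if/elif/else on each element.
expected = np.array([
    a[i] if cond1[i] else (b[i] if cond2[i] else c[i])
    for i in range(len(a))
])
```

Note that the order of nesting mirrors the order of the branches: cond1 takes priority over cond2, just as it does in the if/elif chain.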

Restricted indexing capabilities

We do not currently support ciphertext indexing. Instead, we primarily support plaintext indexing with comprehensive basic indexing capabilities that include integer indexing and slice indexing (which must be contiguous). For example, the slice range specified by start:stop:step must use a step size of 1.

Furthermore, we do not currently support advanced indexing mechanisms such as integer array indexes or boolean indexes. Here's an example of our indexing capabilities:

# Good cases
arr[0]
arr[0][2]
arr[0, 2]
arr[:2]
arr[:-1]
arr[:3, :4]

# Bad cases
arr[[0, 1, 3]]
arr[1:10:2]
arr[np.array([1, 2, 3])]
arr[np.array([True, False, True])]
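The rules above can be summarized as a small checker. The function is_basic_index below is a hypothetical helper written for illustration, not a SecureNumpy API. It encodes only the rules stated in this section: integers and step-1 slices, with at most two dimensions.

```python
import numpy as np

def is_basic_index(obj, max_dims=2):
    """Return True if obj uses only integers and step-1 slices.

    Hypothetical helper for illustration; not part of SecureNumpy.
    """
    def item_ok(item):
        if isinstance(item, (bool, np.bool_)):
            return False                       # boolean indexing is not supported
        if isinstance(item, (int, np.integer)):
            return True                        # single element index
        if isinstance(item, slice):
            return item.step in (None, 1)      # slices must be contiguous
        return False                           # lists and arrays are advanced indexing

    if isinstance(obj, tuple):
        return len(obj) <= max_dims and all(item_ok(item) for item in obj)
    return item_ok(obj)

good = [0, (0, 2), slice(None, 2), slice(None, -1),
        (slice(None, 3), slice(None, 4))]
bad = [[0, 1, 3], slice(1, 10, 2), np.array([1, 2, 3]),
       np.array([True, False, True])]
good_results = [is_basic_index(index) for index in good]
bad_results = [is_basic_index(index) for index in bad]
```

Chained indexing such as arr[0][2] is simply two single-element lookups in a row, so each step passes the check on its own.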

Dimensional limitation

Currently, SecureNumpy limits the system to two-dimensional arrays at most due to practicality and efficiency considerations. As for ciphertext scalars, SecureNumpy treats them simply as zero-dimensional arrays.

Type restriction

The array function is mainly used for secret-sharing plaintext data and converting it into a SecureArray. It is important to note that the input must comply with the following requirements: first, it must be a numpy.ndarray; second, the data must be of type float64 or bool.

No cross-type operation function

It is worth noting that NumPy has incorporated cross-type operation capabilities, such that boolean arrays can be operated directly with floating point arrays. At this point, boolean arrays are treated as 0-1 arrays. Unfortunately, SecureNumpy is unable to implement operations between boolean and floating-point arrays at this stage. However, we are committed to addressing this issue in a future release.

If the necessity for this functionality is critical, it is recommended that the boolean array be manually converted to a floating-point array prior to executing subsequent calculations.

arr_float = snp.where(arr_bool, snp.ones(arr_bool.shape), snp.zeros(arr_bool.shape))

Roadmap

We outline four important future directions for development.

  1. Capabilities expansion: Ongoing work aims to expand SecureNumpy's library functionality to serve broader scientific computing demands. Forthcoming features include the following:
    • Adding new data types: Beyond float and bool, int type and conversion interfaces will be incorporated.
    • Augmenting the function library: The existing 30+ functions are being expanded for enhanced utility.
  2. Enhanced computational precision: While prioritizing data privacy, SecureNumpy strives to boost computational precision, particularly for high-precision tasks. Key enhancements include the following:
    • Numeric stability: Algorithmic optimization reduces rounding errors and loss of precision, ensuring precise results.
    • High-precision data types support: Introducing support for high-precision floating point numbers and fixed points, catering to users' needs in high-precision scenarios.
    • Rigorous verification and testing: Comprehensive testing ensures no additional errors or instabilities with new features and improvements.
  3. Optimized computation performance: To meet the demands of large-scale data analysis and machine learning, SecureNumpy will continually optimize computation performance. Measures include the following:
    • Parallel computing support: Harnessing multi-threading and multi-processing technologies for increased efficiency and reduced processing time.
    • Memory optimization: Streamlining memory management for reduced occupancy and improved big data handling capabilities.
    • Algorithm refinement: Enhancing current algorithms using more efficient data structures and computations for further performance enhancement.
  4. User-friendly documentation and tutorials: Open-source projects should provide intuitive documentation and tutorials to facilitate user adoption, understanding, and effective utilization. We plan to deliver the following:
    • Thorough documentation: Providing comprehensive API documentation and usage guides from basic operations to advanced applications, aiding users in quickly mastering SecureNumpy.
    • Diverse sample code: Offering diverse sample codes covering various practical application scenarios, assisting users in comprehending and utilizing SecureNumpy functionality.

In summary, SecureNumpy's future roadmap revolves around enhancing capabilities, precision, performance, and documentation. By continuously expanding functionalities, boosting precision, and refining performance, we aspire to create a powerful, efficient, and superior privacy-preserving computing tool for our users. Whether you are a data scientist, machine learning engineer, or an industry user requiring multi-party computation, SecureNumpy is a reliable and highly effective solution.
