Multiclient Outsourced Distributed Computation
Secure XGBoost is tailored toward an outsourced computation model – one in which multiple clients with sensitive data want to outsource joint computation on that data to an untrusted cloud. This tutorial provides an example of such a scenario.
In the multiclient setting, Secure XGBoost contains a mechanism for consensus. It requires that all parties agree to invoke a certain API before running that function. For example, if Party A wants to train a model on Party A and Party B’s data, Secure XGBoost requires that both Party A and Party B submit a train command before proceeding to train the model. In a similar fashion, if Party A wants to load its data, Party B must also agree and allow Party A to do so.
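To make the consensus requirement concrete, below is a minimal sketch of such a gate. This is a toy model of the mechanism, not Secure XGBoost's actual implementation; ConsensusGate is a hypothetical name.

from collections import defaultdict
import threading

class ConsensusGate:
    # Toy model: release a command only once every party has submitted it.
    def __init__(self, all_users):
        self.all_users = set(all_users)
        self.requests = defaultdict(set)  # command -> users who have submitted it
        self.lock = threading.Lock()

    def submit(self, user, command):
        # Record the request; return True only when all parties agree on `command`.
        with self.lock:
            self.requests[command].add(user)
            return self.requests[command] == self.all_users

gate = ConsensusGate(["user1", "user2"])
assert not gate.submit("user1", "train")  # user2 has not agreed yet; hold the command
assert gate.submit("user2", "train")      # consensus reached; relay "train" to the enclaves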
In this example, there are five entities: two clients, each of whom owns sensitive data; an RPC orchestrator; and two untrusted servers, each running an enclave, which communicate to perform distributed computation. Each client holds distinct data that the two want to aggregate. The demo consists of the following steps:
1. The orchestrator launches an RPC server and an enclave at each node (untrusted server) in the cluster.
2. The orchestrator starts its own RPC server to listen for client requests.
3. Each client encrypts its data and sends it to the untrusted servers.
4. Each client attests the enclaves to ensure that the proper code has been loaded inside every enclave.
5. To ensure that all parties are ready to commence computation, the orchestrator waits for both parties to make the same request before proceeding. Once both parties have submitted a command, the orchestrator relays it to the enclave cluster.
6. Clients make requests to load data, train a model, and serve encrypted predictions.
7. Each client decrypts the predictions it receives.
The relevant code is located at demo/python/multiclient-cluster-remote-control.
Cluster Setup
First, set up machines that will act as the untrusted servers. We’ll need to start an RPC process on each server to listen for client calls.
On the orchestrator machine, modify demo/python/multiclient-cluster-remote-control/hosts.config to contain the IP addresses of the nodes in your cluster. For example, if the nodes in your cluster have IP addresses of 13.95.157.223 and 40.68.135.193, your hosts.config should look like the following.

13.95.157.223:22
40.68.135.193:22
For distributed computation, Secure XGBoost assumes that the orchestrator has SSH access to all nodes in the cluster. You can grant access by pasting the orchestrator's SSH public key (likely found at ~/.ssh/id_rsa.pub) into each node's ~/.ssh/authorized_keys file. If the orchestrator machine does not yet have an SSH keypair, create it:

ssh-keygen -t rsa -b 4096
From the orchestrator machine, run the following to start the RPC servers on all nodes in the cluster.
secure-xgboost/host/dmlc-core/tracker/dmlc-submit --cluster ssh --host-file hosts.config --num-workers <num_workers_in_cluster> --worker-memory 4g python3 server/enclave_serve.py
Orchestrator Setup
Next, set up the RPC orchestrator.
Modify the nodes argument in the xgb.serve() function in the demo/python/multiclient-cluster-remote-control/start_orchestrator.py script to reflect the IP addresses of the nodes in the cluster. The port argument tells the RPC orchestrator which port to listen on for client commands. Note that start_orchestrator.py contains code that will automatically parse hosts.config for the node IPs, so you may not have to do this step.

xgb.serve(all_users=["user1", "user2"], nodes=["<SERVER_IP_1>", "<SERVER_IP_2>"], port=50052)
Run the script to start the orchestrator.
python3 demo/python/multiclient-cluster-remote-control/orchestrator/start_orchestrator.py
Client 1 Setup
Before doing this setup, ensure that you’ve already set up the cluster and the orchestrator.
This setup will involve encrypting data on client 1, transferring the data to the server, and telling the orchestrator that client 1 is ready.
cd into the demo/python/multiclient-cluster-remote-control/client1 directory to begin setup.
Encrypt data locally.
Use the encrypt.py script to generate a key and encrypt the sample data (demo/data/1_2agaricus.txt.train and demo/data/agaricus.txt.test). It will output three files:

- demo/python/multiclient-cluster-remote-control/client1/key1.txt: the key used to encrypt the data
- demo/python/multiclient-cluster-remote-control/data/c1_train.enc: an encrypted version of client 1's training data
- demo/python/multiclient-cluster-remote-control/data/c1_test.enc: an encrypted version of client 1's test data
Run the following to encrypt.
python3 encrypt.py
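Conceptually, the script generates a symmetric key and encrypts each file with it. Below is a minimal sketch using Secure XGBoost's key utilities; the exact calls and paths in encrypt.py may differ.

import securexgboost as xgb

KEY_FILE = "key1.txt"

# Generate a symmetric key for client 1 and save it to KEY_FILE
xgb.generate_client_key(KEY_FILE)

# Encrypt the plaintext training and test data into the shared data/ directory
xgb.encrypt_file("../../../data/1_2agaricus.txt.train", "../data/c1_train.enc", KEY_FILE)
xgb.encrypt_file("../../../data/agaricus.txt.test", "../data/c1_test.enc", KEY_FILE)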
Send encrypted data to the server
We assume that there will be a mechanism to transfer the encrypted data to the server. For the purposes of this demo, you can simulate this transfer with, for example, scp, as sketched below.
Make client calls
On the client, send commands to the orchestrator by running client1.py. The client1.py script takes in five arguments: the IP address of the orchestrator, the path to the generated symmetric key, the path to the user's private key, the path to the user's certificate, and the port on which the orchestrator is running. We've included a sample private key and certificate for this example.

python3 client1.py --ip-addr <orchestrator-ip> --symmkey key1.txt --privkey ../../../data/userkeys/private_user_1.pem --cert ../../../data/usercrts/user1.crt --port 50052
client1.py takes in five arguments:

- --ip-addr: IP address of the orchestrator
- --symmkey: path to the client's symmetric key
- --privkey: path to the client's private key
- --cert: path to the client's certificate
- --port: port on which the orchestrator is listening
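If you adapt the script to your own client, the interface boils down to standard argparse flags. A minimal sketch of that interface follows; the demo script's internals may differ.

import argparse

# Mirror the command-line interface described above
parser = argparse.ArgumentParser(description="Send commands to the RPC orchestrator")
parser.add_argument("--ip-addr", required=True, help="IP address of the orchestrator")
parser.add_argument("--symmkey", required=True, help="path to the client's symmetric key")
parser.add_argument("--privkey", required=True, help="path to the client's private key")
parser.add_argument("--cert", required=True, help="path to the client's certificate")
parser.add_argument("--port", type=int, default=50052, help="port on which the orchestrator is listening")
args = parser.parse_args()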
For convenience, we added a script run.sh in this directory that runs this command. It takes in one argument: the orchestrator IP.
Note that the servers will not load data, train a model, or serve predictions just yet – the consensus mechanism forces the orchestrator to wait for client 2 to submit commands before relaying them to the cluster.
Client 2 Setup
This setup will involve encrypting data on client 2, transferring the data to the server, and telling the orchestrator that client 2 is ready.
cd into the demo/python/multiclient-cluster-remote-control/client2 directory to begin setup.
Encrypt data locally.
Use the encrypt.py script to generate a key and encrypt the sample data (demo/data/2_2agaricus.txt.train and demo/data/agaricus.txt.test). It will output three files:

- demo/python/multiclient-cluster-remote-control/client2/key2.txt: the key used to encrypt the data
- demo/python/multiclient-cluster-remote-control/data/c2_train.enc: an encrypted version of client 2's training data
- demo/python/multiclient-cluster-remote-control/data/c2_test.enc: an encrypted version of client 2's test data
Run the following to encrypt.
python3 encrypt.py
Send encrypted data to the server
We assume that there will be a mechanism to transfer the encrypted data to the server. As with client 1, you can simulate this transfer with, for example, scp.
Make client calls
On client 2, send commands to the orchestrator by running client2.py. Like client1.py, client2.py takes in five arguments: the IP address of the orchestrator, the path to the generated symmetric key, the path to the user's private key, the path to the user's certificate, and the port on which the orchestrator is running. We've included a sample private key and certificate for this example.

python3 client2.py --ip-addr <orchestrator-ip> --symmkey key2.txt --privkey ../../../data/userkeys/private_user_2.pem --cert ../../../data/usercrts/user2.crt --port 50052
For convenience, we added a script run.sh in this directory that runs this command. It takes in one argument: the orchestrator IP.
Once you have submitted commands from client 2, the orchestrator will relay them to the cluster. The enclave cluster will load both parties' data, train a model over the combined data, and serve encrypted predictions back to each party.