Python API Reference#

This page gives the Python API reference of Secure XGBoost. Please also refer to the Python Package Introduction for more information.

Initialization API#

Functions used to initialize clients and servers in Secure XGBoost.

securexgboost.init_client(config=None, remote_addr=None, user_name=None, client_list=[], sym_key_file=None, priv_key_file=None, cert_file=None)#

Initialize the client. Set up the client’s keys, and specify the IP address of the enclave server.

Parameters
  • config (file) – Configuration file containing the client parameters. If this is provided, the other arguments are ignored.

  • remote_addr (str) – IP address of remote server running the enclave

  • user_name (str) – Current user’s username

  • client_list (list) – List of usernames for all clients in the collaboration

  • sym_key_file (str) – Path to file containing user’s symmetric key used for encrypting data

  • priv_key_file (str) – Path to file containing user’s private key used for signing data

  • cert_file (str) – Path to file containing user’s public key certificate
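
For example, a client might initialize itself as follows; the server address, username, and key paths are hypothetical placeholders:

import securexgboost as xgb

xgb.init_client(remote_addr="127.0.0.1",
                user_name="user1",
                client_list=["user1", "user2"],
                sym_key_file="keys/user1_sym.key",
                priv_key_file="keys/user1.pem",
                cert_file="keys/user1.crt")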

securexgboost.init_server(enclave_image=None, client_list=[], log_verbosity=0)#

Launch the enclave from an image. This API should be invoked only by the servers and not the clients.

Parameters
  • enclave_image (str) – Path to enclave binary

  • client_list (list) – List of usernames (strings) of clients in the collaboration allowed to use the enclaves

  • log_verbosity (int, optional) – Verbosity level for enclave (for enclaves in debug mode)
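
A minimal sketch of launching the enclave on the server; the enclave image path and usernames are hypothetical placeholders:

import securexgboost as xgb

xgb.init_server(enclave_image="/home/server/secure_xgboost_enclave.signed",
                client_list=["user1", "user2"],
                log_verbosity=0)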

securexgboost.attest(verify=True)#

Verify the remote attestation report of the enclave and get its public key. The report and public key are saved as instance attributes.

Parameters

verify (bool) – If True, the client verifies the enclave report

Warning

verify should be set to False only for development and testing in simulation mode
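
A client typically calls attest() immediately after init_client() (with securexgboost imported as xgb); verify=False is reserved for simulation-mode development, as the warning above notes:

xgb.attest()                # hardware mode: verify the enclave report
# xgb.attest(verify=False)  # simulation mode only (development/testing)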

Crypto API#

Functions for clients to encrypt data files.

securexgboost.generate_client_key(keyfile)#

Generate a new key and save it to keyfile

Parameters

keyfile (str) – path to which key will be saved

securexgboost.encrypt_file(input_file, output_file, key_file)#

Encrypt a file

Parameters
  • input_file (str) – path to file to be encrypted

  • output_file (str) – path to which encrypted file will be saved

  • key_file (str) – path to key used to encrypt file
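
For example, a client could generate a symmetric key with generate_client_key() and then encrypt a local training file before transferring it to the server; all paths below are hypothetical:

import securexgboost as xgb

xgb.generate_client_key("keys/user1_sym.key")   # generate a new symmetric key
xgb.encrypt_file("data/train.txt",              # plaintext input
                 "data/train.enc",              # encrypted output
                 "keys/user1_sym.key")          # key generated above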

Core Data Structure#

Core XGBoost Library.

class securexgboost.DMatrix(data_dict, encrypted=True, silent=False, feature_names=None, feature_types=None)#

Bases: object

Data Matrix used in Secure XGBoost.

DMatrix is an internal data structure used by XGBoost that is optimized for both memory efficiency and training speed.

You can load a DMatrix from one or more encrypted files at the enclave server, where each file is encrypted with a particular user’s symmetric key. Each DMatrix in Secure XGBoost is thus associated with one or more data owners.

Parameters
  • data_dict (dict, {str: str}) – The keys are usernames. The values are absolute paths to the training data of the corresponding user in the cloud.

  • encrypted (bool, optional) – Whether data is encrypted

  • silent (bool, optional) – Whether to print messages during construction

  • feature_names (list, optional) – Set names for features.

  • feature_types (list, optional) – Set types for features.
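
For instance, a DMatrix can be built from encrypted files owned by two collaborating users (with securexgboost imported as xgb; the usernames and server-side paths are hypothetical):

dtrain = xgb.DMatrix({"user1": "/home/user1/train.enc",
                      "user2": "/home/user2/train.enc"})
print(dtrain.num_row(), dtrain.num_col())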

property feature_names#

Get feature names (column labels).

Returns

feature_names

Return type

list or None

property feature_types#

Get feature types (column types).

Returns

feature_types

Return type

list or None

num_col()#

Get the number of columns (features) in the DMatrix.

Returns

number of columns

Return type

int

num_row()#

Get the number of rows in the DMatrix.

Returns

number of rows

Return type

int

class securexgboost.Booster(params=None, cache=(), model_file=None)#

Bases: object

A Booster of Secure XGBoost.

Booster is the model of Secure XGBoost, which contains low-level routines for training, prediction, and evaluation.

Parameters
  • params (dict) – Parameters for boosters.

  • cache (list) – List of cache items.

  • model_file (str) – Path to the model file.

copy()#

Copy the booster object.

Returns

booster – a copied booster model

Return type

Booster

decrypt_dump(sarr, length)#

Decrypt the model dump obtained from get_dump()

Parameters
  • sarr (str) – Encrypted string representation of the model obtained from get_dump()

  • length (int) – length of sarr

decrypt_predictions(encrypted_preds, num_preds)#

Decrypt encrypted predictions

Parameters
  • encrypted_preds (c_char_p) – encrypted predictions

  • num_preds (int) – number of predictions

Returns

preds – plaintext predictions

Return type

numpy array

dump_model(fout, fmap='', with_stats=False, dump_format='text')#

Dump model into a text or JSON file.

Parameters
  • fout (str) – Output file name.

  • fmap (str, optional) – Name of the file containing feature map names.

  • with_stats (bool, optional) – Controls whether the split statistics are output.

  • dump_format (str, optional) – Format of model dump file. Can be ‘text’ or ‘json’.

get_dump(fmap='', with_stats=False, dump_format='text', decrypt=True)#

Returns the (encrypted) model dump as a list of strings. The model is encrypted with the user’s symmetric key. If decrypt is True, then the dump is decrypted by the client.

Parameters
  • fmap (str, optional) – Name of the file containing feature map names.

  • with_stats (bool, optional) – Controls whether the split statistics are output.

  • dump_format (str, optional) – Format of model dump. Can be ‘text’ or ‘json’.

  • decrypt (bool) – When this is True, the model dump received from the enclave is decrypted using the user’s symmetric key

Returns

res – A list of string representations of the model dump

Return type

list of str
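
For example, to print each tree of a trained booster bst (assumed to exist) in plaintext:

trees = bst.get_dump(with_stats=True, decrypt=True)
for i, tree in enumerate(trees):
    print("booster[%d]:" % i)
    print(tree)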

get_fscore(fmap='')#

Get feature importance of each feature.

Note

Feature importance is defined only for tree boosters

Feature importance is only defined when the decision tree model is chosen as base learner (booster=gbtree). It is not defined for other base learner types, such as linear learners (booster=gblinear).

Note

Zero-importance features will not be included

Keep in mind that this function does not include zero-importance features, i.e. those features that have not been used in any split conditions.

Parameters

fmap (str (optional)) – The name of feature map file

get_score(fmap='', importance_type='weight')#

Get feature importance of each feature. Importance type can be defined as:

  • ‘weight’: the number of times a feature is used to split the data across all trees.

  • ‘gain’: the average gain across all splits the feature is used in.

  • ‘cover’: the average coverage across all splits the feature is used in.

  • ‘total_gain’: the total gain across all splits the feature is used in.

  • ‘total_cover’: the total coverage across all splits the feature is used in.

Note

Feature importance is defined only for tree boosters

Feature importance is only defined when the decision tree model is chosen as base learner (booster=gbtree). It is not defined for other base learner types, such as linear learners (booster=gblinear).

Parameters
  • fmap (str (optional)) – The name of feature map file.

  • importance_type (str, default 'weight') – One of the importance types defined above.
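
For example, to inspect importance by total gain on a trained tree booster bst (upstream XGBoost returns a dict mapping feature name to score; the same is assumed here):

scores = bst.get_score(importance_type="total_gain")
for name, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, value)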

get_split_value_histogram(feature, fmap='', bins=None, as_pandas=True)#

Get split value histogram of a feature

Parameters
  • feature (str) – The name of the feature.

  • fmap (str (optional)) – The name of feature map file.

  • bins (int, default None) – The maximum number of bins. The number of bins equals the number of unique split values n_unique if bins == None or bins > n_unique.

  • as_pandas (bool, default True) – Return pd.DataFrame when pandas is installed. If False or pandas is not installed, return numpy ndarray.

Returns

a histogram of used splitting values for the specified feature, either as a numpy array or a pandas DataFrame
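
A short sketch, assuming a trained booster bst and a feature named f2 (both hypothetical):

hist = bst.get_split_value_histogram("f2", bins=10, as_pandas=True)
print(hist)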

load_model(fname)#

Load the model from a file.

The model is loaded from an XGBoost internal binary format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded. To preserve all attributes, pickle the Booster object.

Parameters

fname (str or a memory buffer) – Input file name or memory buffer (see also save_raw)

predict(data, output_margin=False, ntree_limit=0, pred_leaf=False, pred_contribs=False, approx_contribs=False, pred_interactions=False, validate_features=True, training=False, decrypt=True)#

Predict with data.

Note

This function is not thread safe.

For each booster object, predict can only be called from one thread. If you want to run prediction using multiple threads, call bst.copy() to make copies of the model object and then call predict().

Note

Using predict() with DART booster

If the booster object is DART type, predict() will perform dropouts, i.e. only some of the trees will be evaluated. This will produce incorrect results if data is not the training data. To obtain correct results on test sets, set ntree_limit to a nonzero value, e.g.

preds = bst.predict(dtest, ntree_limit=num_round)

Parameters
  • data (DMatrix) – The dmatrix storing the input.

  • output_margin (bool) – Whether to output the raw untransformed margin value.

  • ntree_limit (int) – Limit number of trees in the prediction; defaults to 0 (use all trees).

  • pred_leaf (bool) – When this option is on, the output will be a matrix of (nsample, ntrees) with each record indicating the predicted leaf index of each sample in each tree. Note that the leaf index of a tree is unique per tree, so you may find leaf 1 in both tree 1 and tree 0.

  • pred_contribs (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1) with each record indicating the feature contributions (SHAP values) for that prediction. The sum of all feature contributions is equal to the raw untransformed margin value of the prediction. Note the final column is the bias term.

  • approx_contribs (bool) – Approximate the contributions of each feature

  • pred_interactions (bool) – When this is True the output will be a matrix of size (nsample, nfeats + 1, nfeats + 1) indicating the SHAP interaction values for each pair of features. The sum of each row (or column) of the interaction values equals the corresponding SHAP value (from pred_contribs), and the sum of the entire matrix equals the raw untransformed margin value of the prediction. Note the last row and column correspond to the bias term.

  • training (bool) – Whether the prediction value is used for training. This can affect the dart booster, which performs dropouts during training iterations. If the booster object is DART type, predict() will not perform dropouts by default, i.e. all the trees will be evaluated; to obtain results with dropouts, pass training=True.

  • validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.

  • decrypt (bool) – When this is True, the predictions received from the enclave are decrypted using the user’s symmetric key

Returns

  • prediction (list) – List of predictions. Each element in the list is a set of predictions from a different node in the cloud.

  • num_preds (list) – Number of predictions in each element in prediction
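
A hedged sketch of requesting encrypted predictions and decrypting them client-side with decrypt_predictions(); the indexing assumes one encrypted element per node, as described in the Returns section above, and a trained booster bst and test DMatrix dtest are assumed to exist:

enc_preds, lengths = bst.predict(dtest, decrypt=False)
preds = bst.decrypt_predictions(enc_preds[0], lengths[0])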

save_model(fname)#

Save the model to an encrypted file at the server. The file is encrypted with the user’s symmetric key.

The model is saved in an XGBoost internal binary format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved. To preserve all attributes, pickle the Booster object.

Parameters

fname (str) – Absolute path to save the model to
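
A sketch of saving a trained booster bst and later reloading it with load_model() (with securexgboost imported as xgb); the server-side path is a hypothetical placeholder:

bst.save_model("/home/user1/model.model")

# later, load the encrypted model back into a fresh Booster
new_bst = xgb.Booster()
new_bst.load_model("/home/user1/model.model")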

save_raw()#

Save the model to an in-memory buffer representation. The model is encrypted with the user’s symmetric key.

Return type

an in-memory buffer representation of the model

set_param(params, value=None)#

Set parameters into the Booster.

Parameters
  • params (dict/list/str) – list of key-value pairs, dict of key to value, or simply a str key

  • value (optional) – value of the specified parameter, when params is str key
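
For example, all three accepted forms, using standard XGBoost parameters for illustration (bst is an existing Booster):

bst.set_param({"max_depth": 4, "eta": 0.1})      # dict of key to value
bst.set_param([("max_depth", 4), ("eta", 0.1)])  # list of key-value pairs
bst.set_param("max_depth", 4)                    # single str key and value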

trees_to_dataframe(fmap='')#

Parse a boosted tree model text dump into a pandas DataFrame structure.

This feature is only defined when the decision tree model is chosen as base learner (booster in {gbtree, dart}). It is not defined for other base learner types, such as linear learners (booster=gblinear).

Parameters

fmap (str (optional)) – The name of feature map file.

update(dtrain, iteration, fobj=None)#

Update for one iteration, with objective function calculated internally. This function should not be called directly by users.

Parameters
  • dtrain (DMatrix) – Training data.

  • iteration (int) – Current iteration number.

  • fobj (function) – Customized objective function.

Learning API#

Training Library containing training routines.

securexgboost.train(params, dtrain, num_boost_round=10, evals=())#

Train a booster with given parameters.

Parameters
  • params (dict) – Booster params.

  • dtrain (DMatrix) – Data to be trained.

  • num_boost_round (int) – Number of boosting iterations.

  • evals (list of pairs (DMatrix, string)) – List of items to be evaluated during training. This allows the user to watch performance on the validation set.

Returns

booster – a trained booster model

Return type

Booster
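
A hedged end-to-end sketch; the username, server-side paths, and parameter values below are illustrative only:

import securexgboost as xgb

dtrain = xgb.DMatrix({"user1": "/home/user1/train.enc"})
dtest = xgb.DMatrix({"user1": "/home/user1/test.enc"})

params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.3}
bst = xgb.train(params, dtrain, num_boost_round=10,
                evals=[(dtrain, "train"), (dtest, "test")])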

Remote Server API#

Functions to enable remote control of computation.

securexgboost.serve(all_users=[], nodes=[], nodes_port=50051, num_workers=10, port=50051)#

Launch the RPC server.

Parameters
  • all_users (list) – list of usernames participating in the joint computation

  • nodes (list) – list of IP addresses of nodes in the cluster. Passing in this argument means that this RPC server is the RPC orchestrator

  • nodes_port (int) – port of each RPC server in cluster

  • num_workers (int) – number of threads to use

  • port (int) – port on which to start this RPC server
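
A sketch of launching a standalone RPC server, and of launching the RPC orchestrator when nodes is supplied; all usernames, addresses, and ports are hypothetical:

import securexgboost as xgb

# standalone enclave server
xgb.serve(all_users=["user1", "user2"], num_workers=10, port=50051)

# RPC orchestrator in front of a cluster of enclave servers
xgb.serve(all_users=["user1", "user2"],
          nodes=["10.0.0.1", "10.0.0.2"],
          nodes_port=50051,
          port=50052)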