Supported functionalities#
This section lists the functionality that Opaque supports, which is a subset of Spark SQL's. The syntax for these features is identical to Spark SQL; Opaque simply replaces the execution so that it operates on encrypted data.
SQL interface#
Data types#
Out of the existing Spark SQL types, Opaque supports:

- All numeric types. `DecimalType` is supported via conversion into `FloatType`
- `StringType`
- `BinaryType`
- `BooleanType`
- `TimestampType`, `DateType`
- `ArrayType`, `MapType`
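As a brief illustration, the following sketch builds a DataFrame over a few of these types and encrypts it. It assumes the `edu.berkeley.cs.rise.opaque.implicits._` import and `initSQLContext` setup described in Opaque's usage documentation; the data and column names are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import edu.berkeley.cs.rise.opaque.implicits._

val spark = SparkSession.builder.appName("OpaqueTypes").getOrCreate()
import spark.implicits._
edu.berkeley.cs.rise.opaque.Utils.initSQLContext(spark.sqlContext)

// Columns of StringType, a numeric type, BooleanType, and ArrayType
val df = Seq(
  ("foo", 4, true, Array(1.0, 2.0)),
  ("bar", 1, false, Array(3.0, 4.0))
).toDF("word", "count", "flag", "vec")

// From here on, operators over dfEncrypted execute on encrypted data
val dfEncrypted = df.encrypted
```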
Functions#
We currently support a subset of the Spark SQL functions, including both scalar and aggregate-like functions.
- Scalar functions: `case`, `cast`, `concat`, `contains`, `if`, `in`, `like`, `substring`, `upper`
- Aggregate functions: `average`, `count`, `first`, `last`, `max`, `min`, `sum`
UDFs are not supported directly, but one can extend Opaque with additional functions by implementing them in C++ (see "User-Defined Functions (UDFs)" below).
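For instance, once an encrypted DataFrame is registered as a temporary view, a query that stays within this function subset runs unmodified. A sketch, with the table and column names illustrative:

```scala
dfEncrypted.createTempView("words")

// upper is a supported scalar function; sum is a supported aggregate
spark.sql("""
  SELECT upper(word) AS w, sum(`count`) AS total
  FROM words
  GROUP BY upper(word)
""").show()
```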
Operators#
Opaque supports the core SQL operators:
- Projection (e.g., `SELECT` statements)
- Filter
- Global aggregation and grouping aggregation
- Order by, sort by
- All join types except: cross join, full outer join, existence join
- Limit
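Through the DataFrame API, these operators compose as they would in vanilla Spark. A sketch, reusing the illustrative `dfEncrypted` from above:

```scala
import org.apache.spark.sql.functions._

dfEncrypted
  .select($"word", $"count")        // Projection
  .filter($"count" > lit(3))        // Filter
  .groupBy($"word")                 // Grouping aggregation
  .agg(sum($"count").as("total"))
  .sort($"total".desc)              // Order by / sort by
  .limit(10)                        // Limit
  .show()
```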
DataFrame interface#
Because Opaque SQL only replaces physical operators to work with encrypted data, the DataFrame interface is exactly the same as Spark's for both Scala and Python. Opaque SQL is still a work in progress, so not all of these functionalities are currently implemented; see below for a complete list in Scala.
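Concretely, the only Opaque-specific step is encrypting the DataFrame; the query itself is written exactly as in plaintext Spark. A minimal sketch:

```scala
// Plaintext execution
df.filter($"count" > lit(3)).show()

// Encrypted execution: the same query, run with Opaque's physical operators
df.encrypted.filter($"count" > lit(3)).show()
```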
Supported operations#
Actions#
Basic Dataset functions#
Streaming#
Typed transformations#
flatMap[U](f: FlatMapFunction[T, U], encoder: Encoder[U]): Dataset[U]
flatMap[U](func: T => TraversableOnce[U])(implicit evidence: Encoder[U]): Dataset[U]
groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T]
groupByKey[K](func: T => K)(implicit evidence: Encoder[K]): KeyValueGroupedDataset[K, T]
joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]
joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]
map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U]
map[U](func: T => U)(implicit evidence: Encoder[U]): Dataset[U]
mapPartitions[U](f: MapPartitionsFunction[T, U], encoder: Encoder[U]): Dataset[U]
mapPartitions[U](func: Iterator[T] => Iterator[U])(implicit evidence: Encoder[U]): Dataset[U]
randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
randomSplitAsList(weights: Array[Double], seed: Long): List[Dataset[T]]
repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]
select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]
sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]
unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T]
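For example, the typed transformations above compose with a case class encoder as in plain Spark. A sketch (whether each downstream action is supported depends on the lists in this section; the case class is illustrative):

```scala
case class Record(word: String, count: Int)

val grouped = dfEncrypted.as[Record]
  .map(r => Record(r.word.toUpperCase, r.count)) // map (typed)
  .groupByKey(_.word)                            // groupByKey (typed)
  .count()
```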
Untyped transformations#
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
groupBy(col1: String, cols: String*): RelationalGroupedDataset
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
withColumnRenamed(existingName: String, newName: String): DataFrame
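A sketch combining several of these untyped transformations, assuming a second encrypted DataFrame `otherEncrypted` that shares the `word` column:

```scala
val summary = dfEncrypted
  .join(otherEncrypted, Seq("word"), "left_outer") // a supported join type
  .groupBy("word")
  .agg("count" -> "sum")                           // agg via (String, String) pairs
  .withColumnRenamed("sum(count)", "total")
```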
Unsupported operations#
Actions#
Basic Dataset functions#
Typed transformations#
Untyped transformations#
* Cross joins and full outer joins are not supported. Aggregations with more than one distinct aggregate expression are not supported.
User-Defined Functions (UDFs)#
To run a Spark SQL UDF within Opaque enclaves, first name it explicitly and define it in Scala, then reimplement it in C++ against Opaque’s serialized row representation.
For example, suppose we wish to implement a UDF called dot, which computes the dot product of two double arrays (Array[Double]). We [define it in Scala](src/main/scala/edu/berkeley/cs/rise/opaque/expressions/DotProduct.scala) in terms of the Breeze linear algebra library’s implementation. We can then use it in a DataFrame query, such as logistic regression.
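The Scala side might look roughly like the following sketch, modeled on Catalyst's expression API (the authoritative version is the linked DotProduct.scala; names and details here are illustrative and Spark-version dependent):

```scala
import breeze.linalg.DenseVector
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.{DataType, DoubleType}

// Dot product of two Array[Double] columns, evaluated via Breeze
case class DotProduct(left: Expression, right: Expression)
    extends BinaryExpression with CodegenFallback {
  override def dataType: DataType = DoubleType
  override protected def nullSafeEval(input1: Any, input2: Any): Any = {
    val v1 = input1.asInstanceOf[ArrayData].toDoubleArray
    val v2 = input2.asInstanceOf[ArrayData].toDoubleArray
    DenseVector(v1) dot DenseVector(v2)
  }
}

// Column-level wrapper so the UDF can be called in DataFrame queries,
// e.g. data.select(dot($"x", $"y"))
def dot(x: Column, y: Column): Column = new Column(DotProduct(x.expr, y.expr))
```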
Now we can port this UDF to Opaque as follows:
1. Define a corresponding expression using Opaque's expression serialization format by adding the following to [Expr.fbs](src/flatbuffers/Expr.fbs), which indicates that a `DotProduct` expression takes two inputs (the two double arrays):

   ```
   table DotProduct {
     left:Expr;
     right:Expr;
   }
   ```
2. In the same file, add `DotProduct` to the list of expressions in `ExprUnion`.

3. Implement the serialization logic from the Scala `DotProduct` UDF to the Opaque expression that we just defined. In `Utils.flatbuffersSerializeExpression` (from `Utils.scala`), add a case for `DotProduct` as follows:

   ```scala
   case (DotProduct(left, right), Seq(leftOffset, rightOffset)) =>
     tuix.Expr.createExpr(
       builder,
       tuix.ExprUnion.DotProduct,
       tuix.DotProduct.createDotProduct(
         builder, leftOffset, rightOffset))
   ```
4. Finally, implement the UDF in C++. In `FlatbuffersExpressionEvaluator#eval_helper` (from `expression_evaluation.h`), add a case for `tuix::ExprUnion_DotProduct`. Within that case, cast the expression to a `tuix::DotProduct`, recursively evaluate the left and right children, perform the dot product computation on them, and construct a `DoubleField` containing the result.