Supported functionalities

This section lists Opaque's supported functionalities, which are a subset of Spark SQL's. The syntax for these functionalities is the same as in Spark SQL; Opaque simply replaces the execution so that it works with encrypted data.
SQL interface
Data types

Of the existing Spark SQL data types, Opaque supports:

- All numeric types (`DecimalType` is supported via conversion into `FloatType`)
- `StringType`
- `BinaryType`
- `BooleanType`
- `TimestampType`, `DateType`
- `ArrayType`, `MapType`
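As a quick illustration, the following is a minimal sketch that builds a DataFrame whose schema uses only supported types and converts it to an encrypted DataFrame. It assumes the `df.encrypted` implicit and the `Utils.initSQLContext` initialization shown in the Opaque README; the data and column names are our own:

```scala
import org.apache.spark.sql.SparkSession
import edu.berkeley.cs.rise.opaque.implicits._

val spark = SparkSession.builder.appName("OpaqueTypes").getOrCreate()
import spark.implicits._
edu.berkeley.cs.rise.opaque.Utils.initSQLContext(spark.sqlContext)

// word: StringType, count: IntegerType (numeric), flag: BooleanType,
// vec: ArrayType(DoubleType) -- all supported by Opaque.
val df = Seq(
  ("foo", 4, true, Array(1.0, 2.0)),
  ("bar", 1, false, Array(3.0, 4.0))
).toDF("word", "count", "flag", "vec")

// Subsequent operators on dfEncrypted execute over encrypted data.
val dfEncrypted = df.encrypted
```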
Functions

We currently support a subset of the Spark SQL functions, including both scalar and aggregate functions.

Scalar functions: `case`, `cast`, `concat`, `contains`, `if`, `in`, `like`, `substring`, `upper`

Aggregate functions: `average`, `count`, `first`, `last`, `max`, `min`, `sum`

UDFs are not supported directly, but one can extend Opaque with additional functions by writing them in C++ (see "User-Defined Functions (UDFs)" below).
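Continuing from the `dfEncrypted` sketch above, these functions are invoked exactly as in vanilla Spark SQL (a minimal sketch; the columns are the ones from our earlier example):

```scala
import org.apache.spark.sql.functions.{avg, substring, sum, upper}

// Scalar functions (upper, substring) in a projection...
val projected = dfEncrypted.select(
  upper($"word").as("WORD"),
  substring($"word", 0, 2).as("prefix"))

// ...and global aggregate functions (sum, average).
val totals = dfEncrypted.agg(sum($"count"), avg($"count"))
```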
Operators

Opaque supports the core SQL operators:

- Projection (e.g., `SELECT` statements)
- Filter
- Global aggregation and grouping aggregation
- Order by, sort by
- All join types except: cross join, full outer join, existence join
- Limit
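These operators compose on encrypted DataFrames just as they do on plaintext ones. Below is a sketch reusing `dfEncrypted` from above; `other` and all column names are our own illustration:

```scala
import org.apache.spark.sql.functions.sum

val other = Seq(("foo", "noun"), ("bar", "verb")).toDF("word", "pos").encrypted

val result = dfEncrypted
  .filter($"count" > 1)               // filter
  .join(other, Seq("word"))           // inner join (cross and full outer are unsupported)
  .groupBy("pos")                     // grouping aggregation
  .agg(sum($"count").as("total"))
  .orderBy($"total".desc)             // order by
  .limit(10)                          // limit
```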
DataFrame interface

Because Opaque SQL only replaces physical operators to work with encrypted data, the DataFrame interface is exactly the same as Spark's, for both Scala and Python. Opaque SQL is still a work in progress, so not all of these functionalities are currently implemented; see below for a complete list in Scala.
Supported operations

Actions

Basic Dataset functions

Streaming

Typed transformations
flatMap[U](f: FlatMapFunction[T, U], encoder: Encoder[U]): Dataset[U]
flatMap[U](func: T => TraversableOnce[U])(implicit evidence: Encoder[U]): Dataset[U]
groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T]
groupByKey[K](func: T => K)(implicit evidence: Encoder[K]): KeyValueGroupedDataset[K, T]
joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]
joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]
map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U]
map[U](func: T => U)(implicit evidence: Encoder[U]): Dataset[U]
mapPartitions[U](f: MapPartitionsFunction[T, U], encoder: Encoder[U]): Dataset[U]
mapPartitions[U](func: Iterator[T] => Iterator[U])(implicit evidence: Encoder[U]): Dataset[U]
randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
randomSplitAsList(weights: Array[Double], seed: Long): List[Dataset[T]]
repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]
select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]
sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]
unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T]
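A sketch of a few of these typed transformations on an encrypted Dataset. We rename the columns to `_1`/`_2` so that Spark's tuple encoder can resolve them by name; the data is carried over from the earlier examples:

```scala
// View the encrypted DataFrame as a typed Dataset[(String, Int)].
val ds = dfEncrypted
  .select($"word".as("_1"), $"count".as("_2"))
  .as[(String, Int)]

val doubled = ds.map { case (word, count) => (word, count * 2) }
val grouped = ds.groupByKey { case (word, _) => word }
val joined  = ds.joinWith(doubled, ds("_1") === doubled("_1"), "inner")
```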
Untyped transformations
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
groupBy(col1: String, cols: String*): RelationalGroupedDataset
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
withColumnRenamed(existingName: String, newName: String): DataFrame
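For example (a sketch; `other` comes from the operator example earlier):

```scala
// String-based aggregation expressions, a column rename, and an untyped join.
val renamed = dfEncrypted.withColumnRenamed("count", "n")
val grouped = renamed.groupBy("word").agg("n" -> "sum", "n" -> "max")
val joined  = dfEncrypted.join(other, Seq("word"), "left_outer")
```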
Ungrouped
Unsupported operations

Actions

Basic Dataset functions

Typed transformations

Untyped transformations
Note: cross joins and full outer joins are not supported, nor are aggregations with more than one distinct aggregate expression.
User-Defined Functions (UDFs)
To run a Spark SQL UDF within Opaque enclaves, first name it explicitly and define it in Scala, then reimplement it in C++ against Opaque’s serialized row representation.
For example, suppose we wish to implement a UDF called `dot`, which computes the dot product of two double arrays (`Array[Double]`). We [define it in Scala](src/main/scala/edu/berkeley/cs/rise/opaque/expressions/DotProduct.scala) in terms of the Breeze linear algebra library's implementation, after which we can use it in a DataFrame query, such as logistic regression.
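The linked definition is roughly of the following shape (a condensed sketch, not the verbatim source, assuming a Spark 3.1-era Catalyst `Expression` API and Breeze's `DenseVector` dot product):

```scala
import breeze.linalg.DenseVector
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, NullIntolerant}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.{DataType, DoubleType}

object dot {
  // Wrapper so the expression can be used directly in DataFrame queries.
  def apply(v1: Column, v2: Column): Column = new Column(DotProduct(v1.expr, v2.expr))
}

case class DotProduct(left: Expression, right: Expression)
    extends BinaryExpression with NullIntolerant with CodegenFallback {
  override def dataType: DataType = DoubleType

  // Evaluate both children to Array[Double] and delegate to Breeze.
  override protected def nullSafeEval(input1: Any, input2: Any): Any = {
    val v1 = input1.asInstanceOf[ArrayData].toDoubleArray
    val v2 = input2.asInstanceOf[ArrayData].toDoubleArray
    DenseVector(v1) dot DenseVector(v2)
  }
}
```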
Now we can port this UDF to Opaque as follows:
1. Define a corresponding expression using Opaque's expression serialization format by adding the following to [Expr.fbs](src/flatbuffers/Expr.fbs), which indicates that a DotProduct expression takes two inputs (the two double arrays):

   ```
   table DotProduct { left:Expr; right:Expr; }
   ```
2. In the same file, add `DotProduct` to the list of expressions in `ExprUnion`.

3. Implement the serialization logic from the Scala `DotProduct` UDF to the Opaque expression that we just defined. In `Utils.flatbuffersSerializeExpression` (from `Utils.scala`), add a case for `DotProduct` as follows:

   ```scala
   case (DotProduct(left, right), Seq(leftOffset, rightOffset)) =>
     tuix.Expr.createExpr(
       builder,
       tuix.ExprUnion.DotProduct,
       tuix.DotProduct.createDotProduct(
         builder, leftOffset, rightOffset))
   ```
4. Finally, implement the UDF in C++. In `FlatbuffersExpressionEvaluator#eval_helper` (from `expression_evaluation.h`), add a case for `tuix::ExprUnion_DotProduct`. Within that case, cast the expression to a `tuix::DotProduct`, recursively evaluate the left and right children, perform the dot product computation on them, and construct a `DoubleField` containing the result.
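Once both sides are in place, the UDF can be invoked like any built-in function. A hypothetical usage sketch: the import path is inferred from the Scala file location above, and `dfVectors` with its `w` and `x` columns is our own placeholder:

```scala
// Import path assumed from DotProduct.scala's package; adjust if it differs.
import edu.berkeley.cs.rise.opaque.expressions.dot

// dfVectors: a hypothetical encrypted DataFrame with Array[Double] columns
// w and x, e.g. weights and features in a logistic regression step.
val scored = dfVectors.select(dot($"w", $"x").as("wx"))
```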