Supported functionalities#
This section lists Opaque’s supported functionalities, which are a subset of those of Spark SQL. The syntax for these functionalities is the same as in Spark SQL; Opaque simply replaces the execution layer so that it works with encrypted data.
SQL interface#
Data types#
Out of the existing Spark SQL types, Opaque supports:
- All numeric types. DecimalType is supported via conversion into FloatType
- StringType
- BinaryType
- BooleanType
- TimestampType, DateType
- ArrayType, MapType
Functions#
We currently support a subset of the Spark SQL functions, including both scalar and aggregate-like functions.
- Scalar functions: case, cast, concat, contains, if, in, like, substring, upper
- Aggregate functions: average, count, first, last, max, min, sum
UDFs are not supported directly, but one can extend Opaque with additional functions by writing them in C++.
Operators#
Opaque supports the core SQL operators:
- Projection (e.g., SELECT statements)
- Filter 
- Global aggregation and grouping aggregation 
- Order by, sort by 
- All join types except: cross join, full outer join, existence join 
- Limit 
DataFrame interface#
Because Opaque SQL only replaces physical operators to work with encrypted data, the DataFrame interface is exactly the same as Spark’s for both Scala and Python. Opaque SQL is still a work in progress, so not all of these functionalities are currently implemented. See below for a complete list in Scala.
Supported operations#
Actions#
Basic Dataset functions#
Streaming#
Typed transformations#
- flatMap[U](f: FlatMapFunction[T, U], encoder: Encoder[U]): Dataset[U] 
- flatMap[U](func: T => TraversableOnce[U])(implicit evidence: Encoder[U]): Dataset[U]
- groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T] 
- groupByKey[K](func: T => K)(implicit evidence: Encoder[K]): KeyValueGroupedDataset[K, T]
- joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)] 
- joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)] 
- map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U] 
- map[U](func: T => U)(implicit evidence: Encoder[U]): Dataset[U]
- mapPartitions[U](f: MapPartitionsFunction[T, U], encoder: Encoder[U]): Dataset[U] 
- mapPartitions[U](func: Iterator[T] => Iterator[U])(implicit evidence: Encoder[U]): Dataset[U]
- randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]] 
- randomSplitAsList(weights: Array[Double], seed: Long): List[Dataset[T]] 
- repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T] 
- repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T] 
- select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)] 
- sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T] 
- unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T] 
Untyped transformations#
- agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame 
- groupBy(col1: String, cols: String*): RelationalGroupedDataset 
- join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame 
- join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame 
- join(right: Dataset[_], usingColumns: Seq[String]): DataFrame 
- withColumnRenamed(existingName: String, newName: String): DataFrame 
Unsupported operations#
Actions#
Basic Dataset functions#
Typed transformations#
Untyped transformations#
* Cross joins and full outer joins are not supported. Aggregations with more than one distinct aggregate expression are not supported.
User-Defined Functions (UDFs)#
To run a Spark SQL UDF within Opaque enclaves, first name it explicitly and define it in Scala, then reimplement it in C++ against Opaque’s serialized row representation.
For example, suppose we wish to implement a UDF called dot, which computes the dot product of two double arrays (Array[Double]). We [define it in Scala](src/main/scala/edu/berkeley/cs/rise/opaque/expressions/DotProduct.scala) in terms of the Breeze linear algebra library’s implementation. We can then use it in a DataFrame query, such as logistic regression.
Now we can port this UDF to Opaque as follows:
- Define a corresponding expression using Opaque’s expression serialization format by adding the following to [Expr.fbs](src/flatbuffers/Expr.fbs), which indicates that a DotProduct expression takes two inputs (the two double arrays):
    table DotProduct { left:Expr; right:Expr; }
  In the same file, add DotProduct to the list of expressions in ExprUnion.
- Implement the serialization logic from the Scala DotProduct UDF to the Opaque expression that we just defined. In Utils.flatbuffersSerializeExpression (from Utils.scala), add a case for DotProduct as follows:
    case (DotProduct(left, right), Seq(leftOffset, rightOffset)) =>
      tuix.Expr.createExpr(
        builder,
        tuix.ExprUnion.DotProduct,
        tuix.DotProduct.createDotProduct(
          builder, leftOffset, rightOffset))
- Finally, implement the UDF in C++. In FlatbuffersExpressionEvaluator#eval_helper (from expression_evaluation.h), add a case for tuix::ExprUnion_DotProduct. Within that case, cast the expression to a tuix::DotProduct, recursively evaluate the left and right children, perform the dot product computation on them, and construct a DoubleField containing the result.
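The shape of that last step (recursively evaluate both children, check that they are double arrays, compute the dot product, wrap the result in a scalar) can be sketched in standalone C++. This is only an illustration under stated assumptions: Value, Expr, ExprKind, make_scalar, make_array, make_literal, make_dot, and eval are hypothetical stand-ins invented for this sketch, not Opaque's tuix types or its FlatbuffersExpressionEvaluator, and the real implementation would construct a DoubleField rather than return a plain struct.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a field value: either a double or a double array.
struct Value {
  bool is_array = false;
  double scalar = 0.0;
  std::vector<double> array;
};

Value make_scalar(double d) {
  Value v;
  v.scalar = d;
  return v;
}

Value make_array(std::vector<double> xs) {
  Value v;
  v.is_array = true;
  v.array = std::move(xs);
  return v;
}

// Hypothetical stand-in for a serialized expression tree.
enum class ExprKind { Literal, DotProduct };

struct Expr {
  ExprKind kind = ExprKind::Literal;
  Value literal;                // used when kind == Literal
  std::unique_ptr<Expr> left;   // used when kind == DotProduct
  std::unique_ptr<Expr> right;  // used when kind == DotProduct
};

std::unique_ptr<Expr> make_literal(Value v) {
  std::unique_ptr<Expr> e(new Expr());
  e->literal = std::move(v);
  return e;
}

std::unique_ptr<Expr> make_dot(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
  assert(l && r);  // a DotProduct node always has two children
  std::unique_ptr<Expr> e(new Expr());
  e->kind = ExprKind::DotProduct;
  e->left = std::move(l);
  e->right = std::move(r);
  return e;
}

// Evaluate an expression tree; the DotProduct case mirrors the described steps.
Value eval(const Expr &e) {
  switch (e.kind) {
  case ExprKind::Literal:
    return e.literal;
  case ExprKind::DotProduct: {
    // Recursively evaluate the left and right children.
    Value l = eval(*e.left);
    Value r = eval(*e.right);
    if (!l.is_array || !r.is_array || l.array.size() != r.array.size()) {
      throw std::invalid_argument("dot expects two double arrays of equal length");
    }
    // Perform the dot product and wrap the result as a scalar value.
    double sum = 0.0;
    for (std::size_t i = 0; i < l.array.size(); ++i) {
      sum += l.array[i] * r.array[i];
    }
    return make_scalar(sum);
  }
  }
  throw std::logic_error("unknown expression kind");
}
```

Calling eval on make_dot(make_literal(make_array({1.0, 2.0, 3.0})), make_literal(make_array({4.0, 5.0, 6.0}))) yields a scalar Value holding 32.0. The explicit type and length check is a design choice of this sketch; how Opaque itself reports malformed inputs is not specified here.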