Parsing Binary Logical Expressions in Spark SQL

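While reading the Spark SQL source today, I noticed that the parser builds a balanced binary tree when it handles binary logical expressions, to avoid the performance degradation caused by left-recursive trees. The grammar rule for boolean expressions is defined in SqlBase.g4 at v2.4.4: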

booleanExpression
    : NOT booleanExpression                                        #logicalNot
    | EXISTS '(' query ')'                                         #exists
    | valueExpression predicate?                                   #predicated
    | left=booleanExpression operator=AND right=booleanExpression  #logicalBinary
    | left=booleanExpression operator=OR right=booleanExpression   #logicalBinary
    ;

Binary logical expressions correspond to the logicalBinary rule, which is handled by visitLogicalBinary in AstBuilder.scala at v2.4.4:

/**
 * Combine a number of boolean expressions into a balanced expression tree. These expressions are
 * either combined by a logical [[And]] or a logical [[Or]].
 *
 * A balanced binary tree is created because regular left recursive trees cause considerable
 * performance degradations and can cause stack overflows.
 */
override def visitLogicalBinary(ctx: LogicalBinaryContext): Expression = withOrigin(ctx) {
  val expressionType = ctx.operator.getType
  val expressionCombiner = expressionType match {
    case SqlBaseParser.AND => And.apply _
    case SqlBaseParser.OR => Or.apply _
  }

  // Collect all similar left hand contexts.
  val contexts = ArrayBuffer(ctx.right)
  var current = ctx.left
  def collectContexts: Boolean = current match {
    case lbc: LogicalBinaryContext if lbc.operator.getType == expressionType =>
      contexts += lbc.right
      current = lbc.left
      true
    case _ =>
      contexts += current
      false
  }
  while (collectContexts) {
    // No body - all updates take place in the collectContexts.
  }

  // Reverse the contexts to have them in the same sequence as in the SQL statement & turn them
  // into expressions.
  val expressions = contexts.reverseMap(expression)

  // Create a balanced tree.
  def reduceToExpressionTree(low: Int, high: Int): Expression = high - low match {
    case 0 =>
      expressions(low)
    case 1 =>
      expressionCombiner(expressions(low), expressions(high))
    case x =>
      val mid = low + x / 2
      expressionCombiner(
        reduceToExpressionTree(low, mid),
        reduceToExpressionTree(mid + 1, high))
  }
  reduceToExpressionTree(0, expressions.size - 1)
}

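The core logic is the final part: reduceToExpressionTree builds the tree top-down from the expression list, splitting the index range at its midpoint on every call so that the tree stays as shallow as possible. For n operands this gives a height of roughly log2(n), whereas a left-deep tree grows linearly with n. case 0 and case 1 are the base cases of the recursion. It is a rather neat piece of code.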
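
To make the height difference concrete, here is a minimal, self-contained sketch. It does not use Catalyst: Expr, Leaf, AndNode and BalancedTreeSketch are hypothetical stand-ins invented for illustration, and only the build function mirrors the shape of reduceToExpressionTree above.

```scala
// Hypothetical stand-ins for Catalyst expressions (not Spark's real classes).
sealed trait Expr
final case class Leaf(name: String) extends Expr
final case class AndNode(left: Expr, right: Expr) extends Expr

object BalancedTreeSketch {
  // Height of a tree, counting a single leaf as height 1.
  def height(e: Expr): Int = e match {
    case Leaf(_)       => 1
    case AndNode(l, r) => 1 + math.max(height(l), height(r))
  }

  // Leaf names from left to right, to check that operand order is preserved.
  def leafNames(e: Expr): List[String] = e match {
    case Leaf(n)       => List(n)
    case AndNode(l, r) => leafNames(l) ++ leafNames(r)
  }

  // Left-deep tree, the shape a left-associative parse produces:
  // ((c1 AND c2) AND c3) AND c4 ...
  def leftDeep(exprs: IndexedSeq[Expr]): Expr =
    exprs.reduceLeft[Expr](AndNode(_, _))

  // Same divide-and-conquer shape as reduceToExpressionTree.
  def balanced(exprs: IndexedSeq[Expr]): Expr = {
    def build(low: Int, high: Int): Expr = high - low match {
      case 0 => exprs(low)                        // one operand left
      case 1 => AndNode(exprs(low), exprs(high))  // two operands left
      case x =>
        val mid = low + x / 2
        AndNode(build(low, mid), build(mid + 1, high))
    }
    build(0, exprs.size - 1)
  }

  def main(args: Array[String]): Unit = {
    val leaves: IndexedSeq[Expr] = (1 to 1000).map(i => Leaf(s"c$i"))
    println(height(leftDeep(leaves))) // 1000
    println(height(balanced(leaves))) // 11
  }
}
```

Both trees list the same operands in the same left-to-right order; only the nesting differs. Since AND and OR are associative, regrouping them this way does not change the logical result, and any later recursive pass over the tree descends about 11 levels instead of 1000.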