Hive #1611

在每日的数据同步任务中，会将在线业务库的数据同步至 Hive，在刚上线的一段时间内，对 Hive 的 SQL 操作每隔十来天会出现 socket read timeout 的异常，该问题导致我经常凌晨收到告警然后起床修复该问题，因为该问题是偶现，不便复现，经过查阅 Hive JDBC 的源码，定位问题为 Hive JDBC 使用了 java.sql.DriverManager 中的静态变量 loginTimeout 的值，而我们的工程中接入了八九种类型的 JDBC 驱动，部分类型还使用了连接池，先看看 java.sql.DriverManager 中的部分源码：

public class DriverManager {
    // List of registered JDBC drivers
    private final static CopyOnWriteArrayList<DriverInfo> registeredDrivers = new CopyOnWriteArrayList<>();
    // 该变量被声明为 static, 导致其他类型的数据源设置了 DriverManager 中的该变量后，Hive JDBC 将会使用到其他类型数据源设置的超时时间
    private static volatile int loginTimeout = 0;
    private static volatile java.io.PrintWriter logWriter = null;
    private static volatile java.io.PrintStream logStream = null;
    // Used in println() to synchronize logWriter
    private final static  Object logSync = new Object();

    /* Prevent the DriverManager class from being instantiated. */
    private DriverManager(){}

    /**
     * Sets the maximum time in seconds that a driver will wait
     * while attempting to connect to a database once the driver has
     * been identified.
     *
     * @param seconds the login time limit in seconds; zero means there is no limit
     * @see #getLoginTimeout
     */
    public static void setLoginTimeout(int seconds) {
        loginTimeout = seconds;
    }

    /**
     * Gets the maximum time in seconds that a driver can wait
     * when attempting to log in to a database.
     *
     * @return the driver login time limit in seconds
     * @see #setLoginTimeout
     */
    public static int getLoginTimeout() {
        return (loginTimeout);
    }

    // ...

可以看出 loginTimeout 变量被声明为了 static，即类级别，我们工程中 hive-jdbc 的依赖版本使用的是 1.2.1，查询该版本的 Hive 连接获取相关代码位于 HiveConnection.java 的 129 行 setupLoginTimeout(); 方法：

// copy loginTimeout from driver manager. Thrift timeout needs to be in millis
private void setupLoginTimeout() {
  long timeOut = TimeUnit.SECONDS.toMillis(DriverManager.getLoginTimeout());
  if (timeOut > Integer.MAX_VALUE) {
    loginTimeout = Integer.MAX_VALUE;
  } else {
    loginTimeout = (int) timeOut;
  }
}

可以看出，Hive JDBC 连接创建时调用了 DriverManager.getLoginTimeout() 方法使用了 java.sql.DriverManager 中的静态变量 loginTimeout 的值，然后继续跟踪调用链，可以看到最终将该超时时间设置到了底层 socket 的 SO_TIMEOUT 上，详见 apache thrift 0.9.2 版本中的 TSocket.java:

public TSocket(String host, int port, int timeout) {
    this.socket_ = null;
    this.host_ = null;
    this.port_ = 0;
    this.timeout_ = 0;
    this.host_ = host;
    this.port_ = port;
    this.timeout_ = timeout;
    this.initSocket();
}

/**
 * Initializes the socket object
 */
private void initSocket() {
  socket_ = new Socket();
  try {
    socket_.setSoLinger(false, 0);
    socket_.setTcpNoDelay(true);
    socket_.setKeepAlive(true);
    socket_.setSoTimeout(timeout_);
  } catch (SocketException sx) {
    LOGGER.error("Could not configure socket.", sx);
  }
}

再看看 Java Socket.java 中的代码：

/**
 *  Enable/disable {@link SocketOptions#SO_TIMEOUT SO_TIMEOUT}
 *  with the specified timeout, in milliseconds. With this option set
 *  to a non-zero timeout, a read() call on the InputStream associated with
 *  this Socket will block for only this amount of time.  If the timeout
 *  expires, a <B>java.net.SocketTimeoutException</B> is raised, though the
 *  Socket is still valid. The option <B>must</B> be enabled
 *  prior to entering the blocking operation to have effect. The
 *  timeout must be {@code > 0}.
 *  A timeout of zero is interpreted as an infinite timeout.
 *
 * @param timeout the specified timeout, in milliseconds.
 * @exception SocketException if there is an error
 * in the underlying protocol, such as a TCP error.
 * @since   1.1
 * @see #getSoTimeout()
 */
public synchronized void setSoTimeout(int timeout) throws SocketException {
    if (isClosed())
        throw new SocketException("Socket is closed");
    if (timeout < 0)
      throw new IllegalArgumentException("timeout can't be negative");

    getImpl().setOption(SocketOptions.SO_TIMEOUT, timeout);
}

其实该问题首次于 2019 年 9 月 11 日被报告，详见该 issue: Hive-22196 Socket timeouts happen when other drivers set DriverManager.loginTimeout，但是在我遇到该问题时依然未修复，于是我在 2020 年 7 月 24 日，clone 下 Hive 的仓库对超时部分的代码进行了调整并自行编译了一个 hive-jdbc 的驱动供公司内部使用以处理偶现的超时问题：

其中具体的变更如下，就是移除掉了从 DriverManager 中读取超时设置并应用于底层 socket 超时设定的代码：

在今天记录该问题时，我又去查看了下该 issue，发现该问题已经于 2021 年 2 月 17 日被修复并合并至 master 分支，详情可查看该 PR: HIVE-12371 Adding a timeout connection parameter for JDBC #1611。