Poison

Hive #1611

在每日的数据同步任务中,会将在线业务库的数据同步至 Hive,在刚上线的一段时间内,对 Hive 的 SQL 操作每隔十来天会出现 socket read timeout 的异常,该问题导致我经常凌晨收到告警然后起床修复该问题,因为该问题是偶现,不便复现,经过查阅 Hive JDBC 的源码,定位问题为 Hive JDBC 使用了 java.sql.DriverManager 中的静态变量 loginTimeout 的值,而我们的工程中接入了八九种类型的 JDBC 驱动,部分类型还使用了连接池,先看看 java.sql.DriverManager 中的部分源码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
public class DriverManager {
// List of registered JDBC drivers
private final static CopyOnWriteArrayList<DriverInfo> registeredDrivers = new CopyOnWriteArrayList<>();
// 该变量被声明为 static, 导致其他类型的数据源设置了 DriverManager 中的该变量后,Hive JDBC 将会使用到其他类型数据源设置的超时时间
private static volatile int loginTimeout = 0;
private static volatile java.io.PrintWriter logWriter = null;
private static volatile java.io.PrintStream logStream = null;
// Used in println() to synchronize logWriter
private final static Object logSync = new Object();

/* Prevent the DriverManager class from being instantiated. */
private DriverManager(){}

/**
* Sets the maximum time in seconds that a driver will wait
* while attempting to connect to a database once the driver has
* been identified.
*
* @param seconds the login time limit in seconds; zero means there is no limit
* @see #getLoginTimeout
*/
public static void setLoginTimeout(int seconds) {
loginTimeout = seconds;
}

/**
* Gets the maximum time in seconds that a driver can wait
* when attempting to log in to a database.
*
* @return the driver login time limit in seconds
* @see #setLoginTimeout
*/
public static int getLoginTimeout() {
return (loginTimeout);
}

// ...

可以看出 loginTimeout 变量被声明为了 static,即类级别,我们工程中 hive-jdbc 的依赖版本使用的是 1.2.1,查询该版本的 Hive 连接获取相关代码位于 HiveConnection.java 的 129 行 setupLoginTimeout(); 方法:

1
2
3
4
5
6
7
8
9
// copy loginTimeout from driver manager. Thrift timeout needs to be in millis
private void setupLoginTimeout() {
long timeOut = TimeUnit.SECONDS.toMillis(DriverManager.getLoginTimeout());
if (timeOut > Integer.MAX_VALUE) {
loginTimeout = Integer.MAX_VALUE;
} else {
loginTimeout = (int) timeOut;
}
}

可以看出,Hive JDBC 连接创建时调用了 DriverManager.getLoginTimeout() 方法使用了 java.sql.DriverManager 中的静态变量 loginTimeout 的值,然后继续跟踪调用链,可以看到最终将该超时时间设置到了底层 socket 的 SO_TIMEOUT 上,详见 apache thrift 0.9.2 版本中的 TSocket.java:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
public TSocket(String host, int port, int timeout) {
this.socket_ = null;
this.host_ = null;
this.port_ = 0;
this.timeout_ = 0;
this.host_ = host;
this.port_ = port;
this.timeout_ = timeout;
this.initSocket();
}

/**
* Initializes the socket object
*/
private void initSocket() {
socket_ = new Socket();
try {
socket_.setSoLinger(false, 0);
socket_.setTcpNoDelay(true);
socket_.setKeepAlive(true);
socket_.setSoTimeout(timeout_);
} catch (SocketException sx) {
LOGGER.error("Could not configure socket.", sx);
}
}

再看看 Java Socket.java 中的代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
/**
* Enable/disable {@link SocketOptions#SO_TIMEOUT SO_TIMEOUT}
* with the specified timeout, in milliseconds. With this option set
* to a non-zero timeout, a read() call on the InputStream associated with
* this Socket will block for only this amount of time. If the timeout
* expires, a <B>java.net.SocketTimeoutException</B> is raised, though the
* Socket is still valid. The option <B>must</B> be enabled
* prior to entering the blocking operation to have effect. The
* timeout must be {@code > 0}.
* A timeout of zero is interpreted as an infinite timeout.
*
* @param timeout the specified timeout, in milliseconds.
* @exception SocketException if there is an error
* in the underlying protocol, such as a TCP error.
* @since 1.1
* @see #getSoTimeout()
*/
public synchronized void setSoTimeout(int timeout) throws SocketException {
if (isClosed())
throw new SocketException("Socket is closed");
if (timeout < 0)
throw new IllegalArgumentException("timeout can't be negative");

getImpl().setOption(SocketOptions.SO_TIMEOUT, timeout);
}

其实该问题首次于 2019 年 9 月 11 日被报告,详见该 issue: Hive-22196 Socket timeouts happen when other drivers set DriverManager.loginTimeout,但是在我遇到该问题时依然未修复,于是我在 2020 年 7 月 24 日,clone 下 Hive 的仓库对超时部分的代码进行了调整并自行编译了一个 hive-jdbc 的驱动供公司内部使用以处理偶现的超时问题:
Git commit

其中具体的变更如下,就是移除掉了从 DriverManager 中读取超时设置并应用于底层 socket 超时设定的代码:
Git changed file

在今天记录该问题时,我又去查看了下该 issue,发现该问题已经于 2021 年 2 月 17 日被修复并合并至 master 分支,详情可查看该 PR: HIVE-12371 Adding a timeout connection parameter for JDBC #1611