Poison

On OkHttp3's Implementation of Idle Connection Cleanup

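Prometheus metrics showed a high thread count in one of our JVM applications, which triggered our automatic thread dump analysis. The analysis reported several hundred threads named OkHttp ConnectionPool, almost all of them in the TIMED_WAITING state. In the thread dump captured at the time of the incident, every Object.wait() call came from RealConnectionPool.java:62, so I looked up the source of the OkHttp3 version running in production, 3.14.2: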

/** The maximum number of idle connections for each address. */
private final int maxIdleConnections;
private final long keepAliveDurationNs;
private final Runnable cleanupRunnable = () -> {
  while (true) {
    long waitNanos = cleanup(System.nanoTime());
    if (waitNanos == -1) return;
    if (waitNanos > 0) {
      long waitMillis = waitNanos / 1000000L;
      waitNanos -= (waitMillis * 1000000L);
      synchronized (RealConnectionPool.this) {
        try {
          // line 62:
          RealConnectionPool.this.wait(waitMillis, (int) waitNanos);
        } catch (InterruptedException ignored) {
        }
      }
    }
  }
};

/**
 * Performs maintenance on this pool, evicting the connection that has been idle the longest if
 * either it has exceeded the keep alive limit or the idle connections limit.
 *
 * <p>Returns the duration in nanos to sleep until the next scheduled call to this method. Returns
 * -1 if no further cleanups are required.
 */
long cleanup(long now) {
  int inUseConnectionCount = 0;
  int idleConnectionCount = 0;
  RealConnection longestIdleConnection = null;
  long longestIdleDurationNs = Long.MIN_VALUE;

  // Find either a connection to evict, or the time that the next eviction is due.
  synchronized (this) {
    for (Iterator<RealConnection> i = connections.iterator(); i.hasNext(); ) {
      RealConnection connection = i.next();

      // If the connection is in use, keep searching.
      if (pruneAndGetAllocationCount(connection, now) > 0) {
        inUseConnectionCount++;
        continue;
      }

      idleConnectionCount++;

      // If the connection is ready to be evicted, we're done.
      long idleDurationNs = now - connection.idleAtNanos;
      if (idleDurationNs > longestIdleDurationNs) {
        longestIdleDurationNs = idleDurationNs;
        longestIdleConnection = connection;
      }
    }

    if (longestIdleDurationNs >= this.keepAliveDurationNs
        || idleConnectionCount > this.maxIdleConnections) {
      // We've found a connection to evict. Remove it from the list, then close it below (outside
      // of the synchronized block).
      connections.remove(longestIdleConnection);
    } else if (idleConnectionCount > 0) {
      // A connection will be ready to evict soon.
      return keepAliveDurationNs - longestIdleDurationNs;
    } else if (inUseConnectionCount > 0) {
      // All connections are in use. It'll be at least the keep alive duration 'til we run again.
      return keepAliveDurationNs;
    } else {
      // No connections, idle or in use.
      cleanupRunning = false;
      return -1;
    }
  }

  closeQuietly(longestIdleConnection.socket());

  // Cleanup again immediately.
  return 0;
}
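
To make the return values of cleanup() easier to follow, here is a minimal sketch that models only its decision logic. The class CleanupModel and the method nextWaitNanos are hypothetical names introduced for illustration; they are not part of OkHttp, and the real method also prunes leaked allocations and closes sockets, which is omitted here. The main method applies the same millisecond/nanosecond split that the cleanup Runnable performs before calling wait().

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class CleanupModel {
  /**
   * Hypothetical, simplified model of the decision made in cleanup().
   * Returns 0 when a connection should be evicted (cleanup reruns immediately),
   * a positive number of nanos to wait, or -1 when the pool is empty.
   */
  static long nextWaitNanos(List<Long> idleDurationsNs, int inUseCount,
      int maxIdleConnections, long keepAliveDurationNs) {
    long longestIdleNs = Long.MIN_VALUE;
    for (long idleNs : idleDurationsNs) {
      longestIdleNs = Math.max(longestIdleNs, idleNs);
    }
    if (longestIdleNs >= keepAliveDurationNs
        || idleDurationsNs.size() > maxIdleConnections) {
      return 0; // Something to evict now.
    } else if (!idleDurationsNs.isEmpty()) {
      return keepAliveDurationNs - longestIdleNs; // Wait until the oldest idle one expires.
    } else if (inUseCount > 0) {
      return keepAliveDurationNs; // Everything is in use; check again after a full keep-alive.
    }
    return -1; // No connections at all; the cleanup thread can exit.
  }

  public static void main(String[] args) {
    long keepAliveNs = TimeUnit.MINUTES.toNanos(5); // Default keep-alive: 5 minutes.
    long oneMinuteIdle = TimeUnit.MINUTES.toNanos(1);
    long waitNanos =
        nextWaitNanos(Collections.singletonList(oneMinuteIdle), 0, 5, keepAliveNs);

    // Same split the cleanup Runnable performs before wait(millis, nanos).
    long waitMillis = waitNanos / 1_000_000L;
    int remainderNanos = (int) (waitNanos - waitMillis * 1_000_000L);
    System.out.println(waitMillis + " ms + " + remainderNanos + " ns");
  }
}
```

With the default 5-minute keep-alive, a pool holding a single connection that has been idle for one minute keeps its cleanup thread in a timed wait for another four minutes, which matches the TIMED_WAITING state seen in the thread dump.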

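The Runnable is submitted in put(), just before the connection is added to the pool. The cleanupRunning flag limits this to one cleanup thread per pool, not per connection, so seeing hundreds of these threads suggests that hundreds of pools held connections at the same time; in that situation each new connection that cannot be reused effectively brings its own cleanup thread: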

/**
 * Background threads are used to cleanup expired connections. There will be at most a single
 * thread running per connection pool. The thread pool executor permits the pool itself to be
 * garbage collected.
 */
private static final Executor executor = new ThreadPoolExecutor(0 /* corePoolSize */,
    Integer.MAX_VALUE /* maximumPoolSize */, 60L /* keepAliveTime */, TimeUnit.SECONDS,
    new SynchronousQueue<>(), Util.threadFactory("OkHttp ConnectionPool", true));

void put(RealConnection connection) {
  assert (Thread.holdsLock(this));
  if (!cleanupRunning) {
    cleanupRunning = true;
    executor.execute(cleanupRunnable);
  }
  connections.add(connection);
}
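
The executor configuration above (corePoolSize 0, maximumPoolSize Integer.MAX_VALUE, a SynchronousQueue) places no upper bound on the number of threads: a task submitted while every existing worker is busy always gets a new thread. The following is a minimal sketch using only JDK classes that shows this behavior. The class name UnboundedExecutorDemo and its helper methods are assumptions made for illustration, and the default thread factory is used instead of OkHttp's Util.threadFactory.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UnboundedExecutorDemo {
  /** Submits {@code tasks} blocking tasks and returns how many worker threads were created. */
  static int threadsSpawned(int tasks) {
    // Same sizing parameters as the executor in RealConnectionPool.
    ThreadPoolExecutor executor = new ThreadPoolExecutor(0, Integer.MAX_VALUE,
        60L, TimeUnit.SECONDS, new SynchronousQueue<>());
    CountDownLatch started = new CountDownLatch(tasks);
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < tasks; i++) {
        executor.execute(() -> {
          started.countDown();
          awaitQuietly(release); // Stay busy, like a cleanup thread parked in wait().
        });
      }
      awaitQuietly(started); // Every task is now running on its own worker.
      return executor.getPoolSize();
    } finally {
      release.countDown(); // Let the workers finish so the JVM can exit.
      executor.shutdown();
    }
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println(threadsSpawned(50)); // One worker thread per blocked task.
  }
}
```

Because a SynchronousQueue has no capacity, a task is handed to an existing worker only if that worker is idle and polling; otherwise the executor starts a new thread. Each cleanup Runnable occupies its worker for as long as its pool holds connections, so the thread count follows the number of non-empty pools.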

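The comments in the source explain the details. The key parameters of the cleanup method are maxIdleConnections and keepAliveDurationNs. The business team never set either of them explicitly, so the defaults applied. The default initialization looks like this: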

/**
 * Manages reuse of HTTP and HTTP/2 connections for reduced network latency. HTTP requests that
 * share the same {@link Address} may share a {@link Connection}. This class implements the policy
 * of which connections to keep open for future use.
 */
public final class ConnectionPool {
  final RealConnectionPool delegate;

  /**
   * Create a new connection pool with tuning parameters appropriate for a single-user application.
   * The tuning parameters in this pool are subject to change in future OkHttp releases. Currently
   * this pool holds up to 5 idle connections which will be evicted after 5 minutes of inactivity.
   */
  public ConnectionPool() {
    this(5, 5, TimeUnit.MINUTES);
  }

  public ConnectionPool(int maxIdleConnections, long keepAliveDuration, TimeUnit timeUnit) {
    this.delegate = new RealConnectionPool(maxIdleConnections, keepAliveDuration, timeUnit);
  }

  /** Returns the number of idle connections in the pool. */
  public int idleConnectionCount() {
    return delegate.idleConnectionCount();
  }

  /** Returns total number of connections in the pool. */
  public int connectionCount() {
    return delegate.connectionCount();
  }

  /** Close and remove all idle connections in the pool. */
  public void evictAll() {
    delegate.evictAll();
  }
}

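The Javadoc on the no-argument constructor states the defaults clearly: up to 5 idle connections, evicted after 5 minutes of inactivity. Our Prometheus monitoring showed that the thread count spikes at certain moments, with the number of threads named OkHttp ConnectionPool growing past 1000. After reading the source above, my personal view is that OkHttp3's idle connection cleanup is not a good design. Quite a few users have reported this problem in the community, but the current implementation has not been changed. To bring the peak thread count down, for now I can only lower keepAliveDurationNs, and hope that a later OkHttp3 release improves the idle connection cleanup rather than parking a dedicated cleanup thread for every pool that holds connections. Although most of these threads sit in TIMED_WAITING, a very large number of threads costs CPU time in context switches and risks triggering a JVM OOM.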
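
The mitigation of lowering the keep-alive duration can be sketched as follows. This is a minimal sketch, not the exact production change: the class name SharedClient and the 30-second keep-alive are assumptions chosen for illustration. It relies on the public ConnectionPool(int, long, TimeUnit) constructor shown earlier and on OkHttpClient.Builder.connectionPool(...) to install the pool, and it requires the OkHttp library on the classpath.

```java
import java.util.concurrent.TimeUnit;
import okhttp3.ConnectionPool;
import okhttp3.OkHttpClient;

public final class SharedClient {
  // Assumed values for illustration: at most 5 idle connections, evicted after 30 seconds.
  private static final ConnectionPool POOL = new ConnectionPool(5, 30, TimeUnit.SECONDS);

  // A single client for the whole application, so all requests share one pool.
  private static final OkHttpClient CLIENT = new OkHttpClient.Builder()
      .connectionPool(POOL)
      .build();

  private SharedClient() {
  }

  public static OkHttpClient get() {
    return CLIENT;
  }
}
```

With a shorter keep-alive, idle connections are evicted sooner, cleanup() reaches its "no connections" branch earlier, and the cleanup Runnable returns instead of waiting. Sharing one client also matters: per the comment on the executor, there is at most one cleanup thread per connection pool, so fewer pools means fewer cleanup threads.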

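As an illustration of the OOM risk, the following code keeps starting sleeping threads until thread creation fails: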

package me.tianshuang;

public class Test {

    public static void main(String[] args) {
        for (int i = 1; i < Integer.MAX_VALUE; i++) {
            System.out.println(i);
            new Thread(new Runnable() {
                @Override
                public void run() {
                    try {
                        Thread.sleep(Integer.MAX_VALUE);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }, "Thread: " + i).start();
        }
    }

}

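In my tests, the thread count at which OOM is triggered differs from machine to machine: 4072 on a 16 GB MacBook Pro; 287681 and 106820 on two 16 GB Windows 10 machines; 25346 on a 32 GB Alibaba Cloud Linux ECS instance; and 8166 on one of my Hackintosh machines, which then froze... The vendor said they did not know the exact cause either, so the "extremely stable" claim does not seem very reliable...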
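
Since the limit varies this much between machines, it can be more useful to watch the number of cleanup threads directly than to rely on a fixed threshold. The following is a minimal sketch using only JDK APIs; the class name ThreadNameCounter and its methods are hypothetical and are not the thread dump analysis tool mentioned at the beginning. Note that Thread.getAllStackTraces() also captures stack traces, so it may be expensive when there are many threads.

```java
public class ThreadNameCounter {
  /** Counts live threads whose name starts with the given prefix. */
  static long countByPrefix(String prefix) {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().startsWith(prefix))
        .count();
  }

  /** Counts live threads with the given name prefix that are currently in the given state. */
  static long countByPrefixAndState(String prefix, Thread.State state) {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().startsWith(prefix))
        .filter(t -> t.getState() == state)
        .count();
  }

  public static void main(String[] args) {
    String prefix = "OkHttp ConnectionPool"; // Thread name used by the cleanup executor.
    System.out.println(prefix + " threads: " + countByPrefix(prefix));
    System.out.println("  in TIMED_WAITING: "
        + countByPrefixAndState(prefix, Thread.State.TIMED_WAITING));
  }
}
```

A value like this could be exported as a custom metric next to the overall JVM thread count, so that a spike in cleanup threads is visible before the process gets close to its limit.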

References

ConnectionPool ThreadPool why use Integer.MAX_VALUE maximumPoolSize param ?
About the “okhttp3” thread pool issue🙉
How many threads can a Java VM support?