Poison

Recv-Q

While helping a colleague troubleshoot an issue, I noticed that after the thread dispatching requests had been killed by an OOM error, the Recv-Q value reported by netstat did not match the configured backlog exactly; instead, Recv-Q = backlog + 1. For example, running netstat -tulnp produced the following output:

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.11:43043        0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:8005          0.0.0.0:*               LISTEN      1/java
tcp        0      0 0.0.0.0:2021            0.0.0.0:*               LISTEN      1/java
tcp        0      0 0.0.0.0:22222           0.0.0.0:*               LISTEN      1/java
tcp        0      0 0.0.0.0:8719            0.0.0.0:*               LISTEN      1/java
tcp      101      0 0.0.0.0:80              0.0.0.0:*               LISTEN      1/java
tcp       51      0 0.0.0.0:1234            0.0.0.0:*               LISTEN      1/java
tcp        0      0 0.0.0.0:20891           0.0.0.0:*               LISTEN      1/java
udp        0      0 127.0.0.11:55285        0.0.0.0:*                           -

Port 80 is the port the Tomcat server listens on. For this socket we used the default backlog configuration, whose value is 100; the source is in AbstractEndpoint.java at 8.5.59:

/**
 * Allows the server developer to specify the acceptCount (backlog) that
 * should be used for server sockets. By default, this value
 * is 100.
 */
private int acceptCount = 100;
public void setAcceptCount(int acceptCount) { if (acceptCount > 0) this.acceptCount = acceptCount; }
public int getAcceptCount() { return acceptCount; }
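
Since acceptCount ends up as the backlog of Tomcat's listening socket, a longer accept queue is just a configuration change away. Below is a minimal sketch using embedded Tomcat; the port and the value 200 are made up for illustration, and a standalone Tomcat would instead set acceptCount on the Connector in server.xml:

import org.apache.catalina.LifecycleException;
import org.apache.catalina.startup.Tomcat;

public class EmbeddedTomcatBacklog {
    public static void main(String[] args) throws LifecycleException {
        Tomcat tomcat = new Tomcat();
        tomcat.setPort(8080);
        // acceptCount is forwarded to AbstractEndpoint.setAcceptCount(), i.e. the
        // backlog of the listening socket; 100 is used when it is left unset.
        tomcat.getConnector().setProperty("acceptCount", "200");
        tomcat.start();
        tomcat.getServer().await();
    }
}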

Port 1234, on the other hand, belongs to the HTTP server started by our agent to expose metrics for Prometheus to scrape. At the time it was created with the following code:

HttpServer.create(new InetSocketAddress(PROMETHEUS_SERVER_PORT), 0);

While creating the HttpServer, this code ends up in ServerSocket.java at jdk8-b120, where the backlog is reset to 50 before being applied to the underlying socket:

/**
 *
 * Binds the {@code ServerSocket} to a specific address
 * (IP address and port number).
 * <p>
 * If the address is {@code null}, then the system will pick up
 * an ephemeral port and a valid local address to bind the socket.
 * <P>
 * The {@code backlog} argument is the requested maximum number of
 * pending connections on the socket. Its exact semantics are implementation
 * specific. In particular, an implementation may impose a maximum length
 * or may choose to ignore the parameter altogther. The value provided
 * should be greater than {@code 0}. If it is less than or equal to
 * {@code 0}, then an implementation specific default will be used.
 * @param   endpoint        The IP address and port number to bind to.
 * @param   backlog         requested maximum length of the queue of
 *                          incoming connections.
 * @throws  IOException if the bind operation fails, or if the socket
 *                      is already bound.
 * @throws  SecurityException if a {@code SecurityManager} is present and
 *          its {@code checkListen} method doesn't allow the operation.
 * @throws  IllegalArgumentException if endpoint is a
 *          SocketAddress subclass not supported by this socket
 * @since 1.4
 */
public void bind(SocketAddress endpoint, int backlog) throws IOException {
    if (isClosed())
        throw new SocketException("Socket is closed");
    if (!oldImpl && isBound())
        throw new SocketException("Already bound");
    if (endpoint == null)
        endpoint = new InetSocketAddress(0);
    if (!(endpoint instanceof InetSocketAddress))
        throw new IllegalArgumentException("Unsupported address type");
    InetSocketAddress epoint = (InetSocketAddress) endpoint;
    if (epoint.isUnresolved())
        throw new SocketException("Unresolved address");
    if (backlog < 1)
        backlog = 50;
    try {
        SecurityManager security = System.getSecurityManager();
        if (security != null)
            security.checkListen(epoint.getPort());
        getImpl().bind(epoint.getAddress(), epoint.getPort());
        getImpl().listen(backlog);
        bound = true;
    } catch (SecurityException e) {
        bound = false;
        throw e;
    } catch (IOException e) {
        bound = false;
        throw e;
    }
}
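
In our case the second argument passed to HttpServer.create() was 0, which is exactly what triggers this fallback to 50. A minimal sketch of the straightforward fix, assuming the port constant and the backlog value of 128 are just illustrative choices, is to pass an explicit positive backlog:

import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;

public class PrometheusServerBootstrap {
    // Hypothetical constants; the real port and backlog depend on the deployment.
    private static final int PROMETHEUS_SERVER_PORT = 1234;
    private static final int PROMETHEUS_SERVER_BACKLOG = 128;

    public static void main(String[] args) throws IOException {
        // A positive backlog skips the "backlog < 1 -> 50" fallback inside
        // ServerSocket.bind(), so the kernel listen() receives our value.
        HttpServer server = HttpServer.create(
                new InetSocketAddress(PROMETHEUS_SERVER_PORT), PROMETHEUS_SERVER_BACKLOG);
        server.start();
    }
}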

According to the netstat(8) - Linux manual page, when a socket is in the LISTEN state the Recv-Q column shows the SYN backlog. The original wording is:

Listening: Since Kernel 2.6.18 this column contains the current syn backlog.

But why is the Recv-Q shown here backlog + 1? Google did not give me an answer, so I turned to the Linux source code and found that the logic deciding whether the accept queue exceeds the backlog has been changed back and forth three or four times. The earliest version on GitHub was committed on 2005-04-17, with the following source:

static inline int sk_acceptq_is_full(struct sock *sk)
{
        return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
}

In other words, before a new connection is accepted, the accept queue is considered full only when its current backlog is greater than the configured backlog, and in that case the SYN packet is dropped. Later, on 2007-03-03, Wei Dong modified this function in the commit NET: Fix bugs in "Whether sock accept queue is full" checking · torvalds/linux@8488df8 · GitHub, with the following commit message:

when I use linux TCP socket, and find there is a bug in function
sk_acceptq_is_full().

When a new SYN comes, TCP module first checks its validation. If valid,
send SYN,ACK to the client and add the sock to the syn hash table. Next
time if received the valid ACK for SYN,ACK from the client. server will
accept this connection and increase the sk->sk_ack_backlog -- which is
done in function tcp_check_req().We check wether acceptq is full in
function tcp_v4_syn_recv_sock().

Consider an example:

After listen(sockfd, 1) system call, sk->sk_max_ack_backlog is set to
1. As we know, sk->sk_ack_backlog is initialized to 0. Assuming accept()
system call is not invoked now.

1. 1st connection comes. invoke sk_acceptq_is_full(). sk-
>sk_ack_backlog=0 sk->sk_max_ack_backlog=1, function return 0 accept
this connection. Increase the sk->sk_ack_backlog
2. 2nd connection comes. invoke sk_acceptq_is_full(). sk-
>sk_ack_backlog=1 sk->sk_max_ack_backlog=1, function return 0 accept
this connection. Increase the sk->sk_ack_backlog
3. 3rd connection comes. invoke sk_acceptq_is_full(). sk-
>sk_ack_backlog=2 sk->sk_max_ack_backlog=1, function return 1. Refuse
this connection.

I think it has bugs. after listen system call. sk->sk_max_ack_backlog=1
but now it can accept 2 connections.

Signed-off-by: Wei Dong <weid@np.css.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

That commit changed the implementation of the function to:

static inline int sk_acceptq_is_full(struct sock *sk)
{
        return sk->sk_ack_backlog >= sk->sk_max_ack_backlog;
}

That is, before a connection is accepted, the queue is now considered full when its current backlog is greater than or equal to the configured backlog. A similar commit is AF_UNIX: Test against sk_max_ack_backlog properly. · torvalds/linux@248f067 · GitHub. Shortly afterwards, on 2007-03-07, David S. Miller reverted these changes in NET: Revert incorrect accept queue backlog changes. · torvalds/linux@64a1465 · GitHub, with the following commit message:

This reverts two changes:

8488df8
248f067

A backlog value of N really does mean allow "N + 1" connections
to queue to a listening socket. This allows one to specify
"0" as the backlog and still get 1 connection.

Noticed by Gerrit Renker and Rick Jones.

Signed-off-by: David S. Miller <davem@davemloft.net>

As the commit message explains, a backlog of N really means that N + 1 connections are allowed to queue. Counterintuitive as that may be, this is how it is defined. Perhaps it is simply a difference in intuition between East and West? Fourteen years later, on 2021-03-13, liuyacan submitted a commit that once again changed the greater-than to greater-than-or-equal to bring it back in line with intuition: net: correct sk_acceptq_is_full() · torvalds/linux@f211ac1 · GitHub, with the following commit message:

The "backlog" argument in listen() specifies
the maximom length of pending connections,
so the accept queue should be considered full
if there are exactly "backlog" elements.

Signed-off-by: liuyacan <yacanliu@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Unfortunately, on 2021-04-01 that commit was reverted as well. The revert is Revert "net: correct sk_acceptq_is_full()" · torvalds/linux@c609e6a · GitHub, with the following commit message:

This reverts commit f211ac1.

We had similar attempt in the past, and we reverted it.

History:

64a1465 [NET]: Revert incorrect accept queue backlog changes.
8488df8 [NET]: Fix bugs in "Whether sock accept queue is full" checking

I am adding a fat comment so that future attempts will
be much harder.

Fixes: f211ac1 ("net: correct sk_acceptq_is_full()")
Cc: iuyacan <yacanliu@163.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

This time the change was not only reverted, a comment was deliberately added to the code, presumably to warn future developers not to touch it again. The annotated source reads:

/* Note: If you think the test should be:
 * return READ_ONCE(sk->sk_ack_backlog) >= READ_ONCE(sk->sk_max_ack_backlog);
 * Then please take a look at commit 64a146513f8f ("[NET]: Revert incorrect accept queue backlog changes.")
 */
static inline bool sk_acceptq_is_full(const struct sock *sk)
{
        return READ_ONCE(sk->sk_ack_backlog) > READ_ONCE(sk->sk_max_ack_backlog);
}

To this day, the master branch still uses the greater-than version: the queue is considered full only when the number of connections already queued is greater than the configured backlog. In other words, the queue can hold at most backlog + 1 connections, which explains the Recv-Q = backlog + 1 values I observed.
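
To make the off-by-one concrete, here is a small Java model of the kernel check (my own illustration, not kernel code): with a configured backlog of 3, queue lengths 0 through 3 all pass the "not full" test, so four connections end up queued before new SYNs get dropped.

public class AcceptQueueModel {
    // Mirrors the kernel's sk_acceptq_is_full(): "full" only when the current
    // queue length is strictly greater than the configured backlog.
    static boolean acceptQueueIsFull(int ackBacklog, int maxAckBacklog) {
        return ackBacklog > maxAckBacklog;
    }

    public static void main(String[] args) {
        int backlog = 3;  // value passed to listen()
        int queued = 0;   // connections waiting to be accept()ed
        while (!acceptQueueIsFull(queued, backlog)) {
            queued++;     // another handshake completes and is queued
        }
        // Prints 4, i.e. backlog + 1 -- matching the Recv-Q observation above.
        System.out.println("Queued connections before SYNs are dropped: " + queued);
    }
}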

Reproduce

You can use the following program to verify the backlog behavior on different Linux kernels yourself:

import java.io.IOException;
import java.net.ServerSocket;
import java.util.concurrent.CountDownLatch;

public class App {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Listen with a backlog of 3 and never call accept(), so completed
        // handshakes pile up in the accept queue.
        ServerSocket serverSocket = new ServerSocket(1234, 3);
        System.out.println("I'm listening on port 1234");

        // Block forever so the listening socket stays open.
        new CountDownLatch(1).await();
    }
}

Compile it with javac App.java, run it with java App, and then try connecting to port 1234 repeatedly with telnet to observe the behavior.
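
If you prefer to drive the test from code rather than telnet, a client sketch like the following (my own helper, not part of the original setup) opens more connections than the queue can hold; on a kernel using the greater-than check, the first backlog + 1 = 4 connects complete quickly, while the later ones receive no SYN,ACK and eventually hit the connect timeout:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class Client {
    public static void main(String[] args) {
        for (int i = 1; i <= 6; i++) {
            Socket socket = new Socket();
            try {
                // A short connect timeout makes the dropped handshakes visible
                // instead of waiting for the default TCP retransmission timeout.
                socket.connect(new InetSocketAddress("127.0.0.1", 1234), 3000);
                System.out.println("connection " + i + ": established");
            } catch (IOException e) {
                System.out.println("connection " + i + ": failed (" + e.getMessage() + ")");
            }
            // The sockets are deliberately left open so they stay in the accept queue.
        }
    }
}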

Remark

In TCP/IP Illustrated, Volume 1: The Protocols, section 13.7.4 on the incoming connection queue, a demonstration with backlog = 1 shows the FreeBSD server accepting two connections, while subsequent connections receive no response and eventually time out on the client side.

Reference

502 Bad Gateway - HTTP | MDN
How TCP backlog works in Linux
TCP Flags: PSH and URG - PacketLife.net
TCP half-open - Wikipedia
Detection of Half-Open (Dropped) Connections
TCP/IP Illustrated, Volume 1: The Protocols - Douban Books
listen() ignores the backlog argument? - Stack Overflow
SYN packet handling in the wild