A problem with the skyrocketing of threads in the Go program system

A problem with the skyrocketing of threads in the Go program system

Recently, a problem that the number of threads in the Go program system has skyrocketed has been fixed. The number of threads has remained at 2 to 30,000, and sometimes even more. This situation is obviously not in line with Go's concurrency principle. The first time I found that the number of threads is so large because the program suddenly crashed. Since the maximum number of threads available for the program is set, the number of threads will crash if there are too many threads.

This program is actually the very hot Swarm now. The mode of the Swarm program is to establish a connection to tens of thousands of docker daemon servers as a client, maintain a long connection state, and then periodically read data from these docker daemons. After investigation, it is found that the number of long connections is basically the same as the number of threads. It is completely impossible to understand this phenomenon.

The initial suspicion was DNS query and cgo mode, which caused threads to be created continuously, but constantly looking at the code, it was confirmed that it was the IP used to build the connection, not the domain name. Finally, in order to eliminate the influence of DNS cgo query, I did it again The configuration forced the use of pure Go to do DNS resolution, but it did not reduce the number of threads. To force the use of a pure Go DNS resolver, you only need to set the following environment variables.

Then, continue to investigate on the cgo call road, and then start to locate whether there is any cgo call in the code in the code, etc., and did not find any possibility of cgo call, and finally forcibly shutting down cgo is still useless.

At this time, I also analyzed the pprof data and found no doubts. There was originally a threadcreate in pprof, and I feel that this tool is very useful, but I don t know why I always get only the number of threads, and there is no stack information introduced by the document. If you know the correct posture for using the pprof/threadcreate tool, please tell me, thank you.

I have nothing to do. I tried to use the runtime.Stack() method to print out the call stack at the location where the goroutine scheduler created the thread, but because the array used to store the stack would escape to the heap, this method failed. It should be my posture. . Finally, simply turn on the status information of the scheduler, you can see that many threads are indeed created, and these threads are not in the idle state, that is to say, they are all working, which is extremely incomprehensible, and it is still not clear why so many are created. Thread.

Open scheduler status information method:

export GODEBUG=scheddetail=1,schedtrace=1000

In this way, the scheduling status is output to the standard output every 1000 milliseconds.

After a colleague reminded me to use pstack to see what each thread is doing, it turns out that most of the threads are calling the read system, and the thread stack is as follows:

At this time, strace tracks this thread and the following phenomena occur:

You can see that this thread has been blocked on the read call for a long time. Of course, I can t believe my eyes. I went to further verify that fd 16 is indeed a network connection. Network io cannot be blocked in the Go world. Otherwise, Everything won't work anymore. At this time, I seriously suspected the problem of the strace tool tracking, so I added a log to the Go net library to confirm whether it would exit the read system call, recompile Go and swarm, and ran it on the line and found that the read system call was no data. It does not return, which means it is blocked. Look at the thread stack after adding logs:

Add start/stop log information before and after the read system call, you can clearly see that the last read is blocked. At this time, I can explain why the number of threads has skyrocketed and the number of connections is basically the same, but the Go underlying net library does set the nonblock attribute for each connection, but it turns into a block in the end. This situation is either a kernel bug, or the connection attribute is finally destroyed; I am definitely more willing to believe the latter possibility, and I suddenly remembered that the custom Dial() method in the code has the option to set the TCP connection for the connection.

//TCP_USER_TIMEOUT is a relatively new feature to detect dead peer from sender side.
//Linux supports it since kernel 2.6.37. It's among Golang experimental under
//golang.org/x/sys/unix but it doesn't support all Linux platforms yet.
//we explicitly define it here until it becomes official in golang.
//TODO: replace it with proper package when TCP_USER_TIMEOUT is supported in golang.

const tcpUserTimeout = 0x12
syscall.SetsockoptInt(int(f.Fd()), syscall.IPPROTO_TCP, tcpUserTimeout, msecs)

Remove this TCP option, recompile and put it on the line for a run. Sure enough, everything is right, and the number of threads has dropped to the normal dozens.

The TCP_USER_TIMEOUT option was introduced in the 2.6.37 kernel. It may be that the kernel we are using does not support this option, which leads to a problem with the mandatory setting. Why does the setting of TCP_USER_TIMEOUT cause the connection to change from nonblock to block? It is worth further investigation. [ Note: This conclusion is actually not completely correct, the update is as follows ]

The real reason

This is the code for setting the TCP_USER_TIMEOUT option. I commented this function call before, and I thought it lightly that it was a problem with setting this option, and finally found that the conn.File() call actually caused the block. The underlying code is like this:

//File sets the underlying os.File to blocking mode and returns a copy.
//It is the caller's responsibility to close f when finished.
//Closing c does not affect f, and closing f does not affect c.
//The returned os.File's file descriptor is different from the connection's.
//Attempting to change properties of the original using this duplicate
//may or may not have the desired effect.
func (c *conn) File() (f *os.File, err error) {
	f, err = c.fd.dup()
	if err != nil {
		err = &OpError{Op: "file", Net: c.fd.net, Source: c.fd.laddr, Addr: c.fd.raddr, Err: err}

func (fd *netFD) dup() (f *os.File, err error) {
	ns, err := dupCloseOnExec(fd.sysfd)
	if err != nil {
		return nil, err

	//We want blocking mode for the new fd, hence the double negative.
	//This also puts the old fd into blocking mode, meaning that
	//I/O will block the thread instead of letting us use the epoll server.
	//Everything will still work, just with more threads.
	if err = syscall.SetNonblock(ns, false); err != nil {
		return nil, os.NewSyscallError("setnonblock", err)

	return os.NewFile(uintptr(ns), fd.name()), nil

You don't need to look at the code, just look at the comments. So the posture of setting the TCP option here is wrong.

I found the following comments in the swarm code a long time ago:

//Swarm runnable threads could be large when the number of nodes is large
//or under request bursts. Most threads are occupied by network connections.
//Increase max thread count from 10k default to 50k to accommodate it.

const maxThreadCount int = 50 * 1000

The person who wrote this code discovered the huge number of threads at the beginning, but thought that a large number of connections and concurrent requests would consume so many threads. From this point of view, there may not be a deep understanding of Go's concurrency principle and event-driven on Linux.

It took me a lot of time to troubleshoot this problem. The main reason is that the phenomenon of this problem is completely inconsistent with my own understanding of the "world". It is difficult for me to think that this is a Go bug, and indeed it is not. Where to start, I can only keep going back to the runtime code to find details such as the conditions under which the thread is created.

Some blog posts I wrote before: www.skoo.me