Why did Tencent donate its self-developed trillion-scale messaging middleware TubeMQ to Apache?

Introduction | The Cloud+ Community technology salon "Tencent Open Source Technology" recently came to a successful close. The salon invited a number of Tencent technical experts to discuss Tencent open source with developers and take an in-depth look at the open-source projects TencentOS tiny, TubeMQ, Kona JDK, TARS, and MedicalNet. This article is compiled from Zhang Guocheng's talk.
Key points of this article:
  • Principles and characteristics of Message Queue;
  • Implementation principles and usage of TubeMQ;
  • Follow-up development of TubeMQ and discussion.

1. Introduction to Message Queue

Message Queue (hereinafter MQ) is defined on Wikipedia as a method of communication between different processes, or between different threads of the same process.

Why adopt MQ? The answer lies in its characteristics: first, it can integrate multiple different systems so they work together; second, it decouples data transfer from data processing; third, it can buffer traffic peaks. The systems we usually encounter, such as Kafka, RocketMQ, and Pulsar, all share these characteristics.

What does an MQ need in a big data scenario? In my view: high throughput and low latency, a system that is as stable as possible, cost that is as low as possible, a protocol that is not overly complex, and, above all, horizontal scalability that is as strong as possible.
Massive data here means tens of billions, hundreds of billions, even trillions of messages per day. In our own production environment, for example, the volume may double within a month or a year. Without horizontal scalability, the system very quickly runs into all kinds of problems.

2. TubeMQ implementation principle and introduction

1. Features of TubeMQ

So, what are the characteristics of Tencent's TubeMQ? TubeMQ is a trillion-scale distributed messaging middleware focused on data transmission and storage for massive data, with distinct advantages in performance, reliability, and cost.

For big data scenarios we have published a test plan (the details can be found in the docs directory of the open-source TubeMQ project). We first define a realistic application scenario, then collect and organize system metrics under that definition. The result: throughput reaches 140,000 TPS, with message latency under 5 ms.

Some people may be curious: many research reports show that comparable distributed publish-subscribe messaging systems such as Kafka can reach millions of TPS. By comparison, doesn't our figure look rather poor?

There is a premise here: these figures were measured with 1,000 topics, each with 10 partitions. Kafka can indeed reach million-level TPS, and can perhaps push latency very low, but under our big data scenario it is nowhere near that magnitude.

Our system has been running stably in production for seven years. Its architecture uses a thin client with most control on the server side, and these technical characteristics determine its application scenarios: real-time ad recommendation, massive data reporting, metrics and monitoring, stream processing, data reporting in IoT scenarios, and so on.
In these scenarios a small amount of data loss is acceptable, because at worst the data can simply be reported again. Given the sheer volume of data, we trade a little reliability for high-performance data access.

2. TubeMQ system architecture

What does TubeMQ's system architecture look like? As shown in the figure below, it interacts with the outside world through an SDK; it is also possible to connect directly using the TCP protocol we define.
It is worth noting that TubeMQ is written in pure Java. It has a Master coordination node with HA, uses ZooKeeper only weakly (just for managing offsets), improves the message storage scheme, and adjusts the data reliability scheme: instead of multiple copies across multiple Broker nodes, it uses disk RAID10 plus fast consumption, and its metadata is self-managed.
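To make the "interacts through an SDK" part concrete, here is a minimal producer sketch in Java based on the usage pattern documented in the open-source TubeMQ project. Note this is only an illustration: the package names changed after the donation to Apache, and class names such as TubeClientConfig, TubeSingleSessionFactory, and MessageProducer, as well as the exact method signatures, may differ between versions.

```java
import java.nio.charset.StandardCharsets;

import org.apache.tubemq.client.config.TubeClientConfig;
import org.apache.tubemq.client.factory.MessageSessionFactory;
import org.apache.tubemq.client.factory.TubeSingleSessionFactory;
import org.apache.tubemq.client.producer.MessageProducer;
import org.apache.tubemq.client.producer.MessageSentResult;
import org.apache.tubemq.corebase.Message;

public class TubeProducerDemo {
    public static void main(String[] args) throws Throwable {
        // Address of the Master coordination node(s): "host:port[,host:port...]"
        final String masterAddr = "127.0.0.1:8715";
        final String topic = "demo_topic";

        // The client only needs the Master address; the Master tells the SDK
        // which Brokers currently serve the topic (thin client, server-side control).
        TubeClientConfig clientConfig = new TubeClientConfig(masterAddr);
        MessageSessionFactory sessionFactory = new TubeSingleSessionFactory(clientConfig);
        MessageProducer producer = sessionFactory.createProducer();

        // Declare which topic this producer will publish to.
        producer.publish(topic);

        // Build and send one message; the result indicates whether the Broker accepted it.
        Message message = new Message(topic, "hello tubemq".getBytes(StandardCharsets.UTF_8));
        MessageSentResult result = producer.sendMessage(message);
        System.out.println("send success = " + result.isSuccess());

        producer.shutdown();
        sessionFactory.shutdown();
    }
}
```

The design point this illustrates is the thin-client model mentioned above: the application code never addresses Brokers directly, so the server side stays free to rebalance them.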

3. The development history of TubeMQ

As shown in the figure below, since we began keeping records we have gone through four stages: adopting an existing system, improving it, starting self-development, and now innovating on our own. Daily message volume has grown from 20 billion in June 2013 to 35 trillion in November 2019, and is expected to reach 40 trillion in 2020.

This growth is also an important reason why we chose to build our own system. When the data volume is small, say on the order of a billion messages a day or less, almost any MQ will do. But once it reaches tens of billions, hundreds of billions, trillions, or more, restrictive factors appear one after another: system stability, performance, machine cost, operations cost, and so on.

Our production TubeMQ deployment has 1,500 machines, and those 1,500 machines need only about one person's worth of operations effort (two people, neither full-time). Our Broker storage servers can stay online for four months without a restart. These are the improvements and innovations we have made on top of the original design.

4. Horizontal comparison between TubeMQ and other MQs

The table in the figure below compares TubeMQ with other MQs. The project is open source, so anyone interested can verify the data directly. In general, TubeMQ suits scenarios that demand high performance and low cost and can tolerate data loss under extreme conditions, and it has stood the test of production use.

3. TubeMQ's storage mode and control measures

The core of an MQ is its storage scheme. As shown in the figure below, the storage schemes listed on the right were provided to us by a user named Chen Dabai; the scheme on the left is TubeMQ's.

TubeMQ organizes its storage instances per Topic as memory plus files. Data is first written to a primary memory block; when that block is full, writes switch to a standby block, and the full block is asynchronously flushed to the file to complete the write. Depending on how far a consumer lags, data is served from the primary memory block, the standby block, or the files. This reduces the load on the system and increases its storage capacity.
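The following is a deliberately simplified sketch of that idea, not TubeMQ's actual code: writes go to an active in-memory block, the block is rotated to standby when full and flushed to the data file asynchronously, and a read is served from the active block, the standby block, or the file depending on how far the requested offset lags behind. All names and sizes here are invented for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy model of a per-topic store: active memory block + standby block + one data file. */
public class TopicStoreSketch {
    private static final int BLOCK_SIZE = 4 * 1024 * 1024;   // size of one memory block

    private ByteBuffer active = ByteBuffer.allocate(BLOCK_SIZE);
    private ByteBuffer standby = null;                        // full block awaiting flush
    private long flushedBytes = 0;                            // bytes already persisted to the file
    private final RandomAccessFile dataFile;
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();

    public TopicStoreSketch(String path) throws IOException {
        this.dataFile = new RandomAccessFile(path, "rw");
    }

    /** Append a message (assumed smaller than a block); rotate blocks when the active one is full. */
    public synchronized void append(byte[] msg) {
        if (active.remaining() < msg.length) {
            rotate();
        }
        active.put(msg);
    }

    /** Turn the full active block into the standby block and flush it to disk asynchronously. */
    private void rotate() {
        final ByteBuffer full = active;
        standby = full;
        active = ByteBuffer.allocate(BLOCK_SIZE);
        flusher.submit(() -> {
            try {
                ByteBuffer copy = full.duplicate();
                copy.flip();
                synchronized (dataFile) {
                    dataFile.seek(dataFile.length());
                    dataFile.write(copy.array(), 0, copy.limit());
                }
                synchronized (this) {
                    flushedBytes += copy.limit();
                    if (standby == full) {
                        standby = null;               // the block is now safely on disk
                    }
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }

    /** Decide where a consumer at the given byte offset should read from. */
    public synchronized String readSourceFor(long consumerOffset) {
        if (consumerOffset < flushedBytes) {
            return "file";          // consumer lags behind what is already on disk
        }
        if (standby != null) {
            return "standby-block"; // data not yet flushed, still in the standby block
        }
        return "active-block";      // consumer is almost caught up with the writer
    }
}
```

Because writers only ever touch the active block and the flush thread only touches the standby block, read/write contention during the switch stays low, which is the point made again in the Q&A at the end of this article.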

From the storage diagrams on the right you can see that Kafka, RocketMQ, and the JMQ that JD.com released in 2019 do not differ all that much. Note, however, that each storage scheme has different resource requirements, and its performance indicators and the scale it can reach differ accordingly.

What we provide is a lossy service. What does that mean? Data is lost only if the machine loses power before data has been flushed to disk, or if a disk failure occurs that RAID10 cannot recover from. Apart from these two situations, data is not lost.

Since those two kinds of failures can happen at any time, TubeMQ is not suitable for scenarios where persistence and repeated consumption require complete data consistency. So why do it this way? Couldn't we just keep multiple copies? It is not that simple.

The issue is cost. By doing it this way, and comparing horizontally, do you know how many machines we save, and how much that is worth in money?

Here is some data: on November 8, 2019, LinkedIn, which open-sourced Kafka, published an article saying it used 4,000 machines for a daily volume of more than 7 trillion messages; that information can be found online. Another example is a domestic company doing big data and related applications that uses native Kafka for data ingestion: by the end of 2018 it had also reached a volume of 7 or 8 trillion messages a day, using 1,500 gigabit-network machines.

So how many machines does our approach need? We handle 35 trillion messages a day with 1,500 machines. Compared with external MQs under the same conditions, we use only about a quarter to a fifth as many machines. In money terms, a commodity server costs roughly 100,000 RMB, so in machine cost alone we save several hundred million RMB. That is why we adopted this scheme.

Compared with Kafka's asynchronous inter-node replication, we need only about a quarter of the machines. Even with a single copy, our performance is considerably better than Kafka's, which in itself saves many machines; see our test report for the relevant data.

All of TubeMQ's control logic, including its API, is built around this storage: Topic configuration, flow control, data queries, and so on. The figure below shows the core API definitions of TubeMQ, divided into four sections. If you just want to use the system, the control console is enough; if you want to tune the system finely, you need to understand the API definitions.
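As an illustration of the query side of that API, the sketch below issues an HTTP request against the Master's web port to list a Topic's configuration. The webapi.htm path, the admin_query_topic_info method name, the topicName parameter, and the default port 8080 are recalled from the open-source project's HTTP API documentation and may differ by version; check the project's docs directory for the authoritative definitions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TopicInfoQueryDemo {
    public static void main(String[] args) throws Exception {
        // Master web address is assumed; adjust host/port to your deployment.
        String query = "http://127.0.0.1:8080/webapi.htm"
                + "?method=admin_query_topic_info&topicName=demo_topic";

        HttpURLConnection conn = (HttpURLConnection) new URL(query).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(3000);
        conn.setReadTimeout(3000);

        // The Master answers with a JSON document describing the topic's
        // configuration and the Brokers currently hosting it.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
            System.out.println(body);
        } finally {
            conn.disconnect();
        }
    }
}
```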

TubeMQ's control module, the Master, uses the embedded BDB database to manage the cluster's Broker nodes and to store the Topic information configured on each Broker. For any operation in the operation column marked in red, a status field tells the operator what stage the operation is in and whether the target is currently read-only, write-only, or readable and writable; the same page can also be used for queries. The system can even be run on Windows, and everyone is welcome to try it out.

TubeMQ's authentication and authorization design also differs from the traditional approach, because we redefined its authentication mechanism, as shown in the figure below.

4. Why choose open source?

First, the company's open-source policy calls for open-source collaboration internally and technical influence externally, so we chose open source. Second, from the information we have, we believe that open-sourcing TubeMQ in this field delivers practical value to those who need it. Third, we feel that open source breaks down barriers.

In different corners of the world many people are working on this same problem, like parallel universes: everyone studies and analyzes within their own universe, with little communication between them. We believe some of them must do certain things better than we do and have something worth learning from, so we open our work up so that everyone can see it and learn from each other, which also pushes us forward. Based on these three points, we chose open source.

Why donate to Apache when the project is already open source? I understand that many developers hesitate to use certain open-source projects, because many companies open-source a project and then, once people have adopted it, stop maintaining it.

To solve this problem, we wanted to donate the project to a neutral foundation and let the foundation's documented, standardized processes, including its documentation requirements and multiple entry conditions, turn it into a mature project that everyone can accept. Even if the original team eventually steps away, someone else can take over and the project can keep improving.

So we donated it to a foundation. Why Apache? Because TubeMQ is an MQ focused on big data scenarios, Apache is the best-known community for the big data ecosystem, and we ourselves benefit from that ecosystem, so naturally we want to give back to the community and donate the project to Apache. Some time ago TubeMQ became an Apache incubator project.

5. Discussion on the follow-up development of TubeMQ

In the first half of 2020, with open source driving collaboration, more and more business data will be connected internally, and we believe the average daily volume will soon exceed 40 trillion messages.

Our machines will also be upgraded from the previous TS60 model to BX2. What changes will that bring? On the old machines, CPU ran at 99% and disk I/O at 30-40%; according to the latest test data, on BX2 it becomes CPU at 30-40% and disk I/O at 99%. This means we need to reduce disk I/O as much as possible, or choose other, more suitable machines; that still needs to be studied.

In addition, now that we have open-sourced, how to cultivate the community is another key point. Our current approach is to build on a collaborative mechanism: contributors inside and outside the company work together to make the project bigger, and we will keep strengthening it in the areas where we have expertise and where our project is needed.

At the same time, if you find imperfections while using it, we hope you will contribute fixes through the community, so that together we can make this a good project.

In fact we do more than the MQ itself: we also build the aggregation layer and the collection layer, with a management layer above them. Our hope is to stabilize the MQ part first and then open-source the whole stack. That system will be able to plug in different MQs, choosing among them based on their characteristics, while the external businesses it serves remain unaware of which MQ is underneath.

6. Q&A

Q: Mr. Zhang, you just compared TubeMQ with Kafka and introduced TubeMQ's internal storage structure, but it looks almost the same as Kafka's storage, except that you have one extra standby cache. Why does that one extra memory block let you leave Kafka so far behind?

A: Kafka's storage is organized by Partition: each Partition has its own file block. TubeMQ's storage is organized by Topic, and a Partition is only a logical concept. The second difference is that our memory works in active/standby mode. Why does one more memory block make it faster? It is generally agreed that writing to memory is faster; once a block is full we rotate it out as the standby block and asynchronously flush it to the file while switching writes to the other block, so there is far less read-write contention during the switch. Overall it is faster.

Why did we change to this storage structure? With 1,000 topics and 10 partitions each, Kafka's I/O degenerates into random reads and writes, its metrics are poor when running, and it is unstable. RocketMQ puts all data into a single file and builds an additional file per partition; the problem is that the single data file becomes a write bottleneck, and the whole system's metrics cannot keep up as traffic grows.

JMQ defines its data files by Topic, with an additional file per Partition. It spreads data a bit more broadly than RocketMQ: data is not concentrated into a single file but organized by Topic, which solves some of RocketMQ's problems.
And TubeMQ? TubeMQ uses one data file per Topic; different topics have different files, and there is no physical Partition. The storage unit is defined per Topic as one data file plus one index file. You can analyze these schemes yourself: each has its own characteristics, and their performance differs from scenario to scenario, including with different hardware; in fact the differences are still large.

About the speaker

Zhang Guocheng, a senior engineer at Tencent, has been responsible for the research and development of the TubeMQ project since 2015. He has been through the system's optimization and transformation from trillions of messages a day to 35 trillion, and has rich hands-on project experience in the field of massive data ingestion.