The architect of Jingdong tells: The evolution of distributed database system

The architect of Jingdong tells: The evolution of distributed database system

Regarding the use of databases, there are several trends in In the early days, SqlServer and Oracle were mainly used in and a small amount of MySQL was used. With the accumulation of business development technology and use cost factors, many businesses began to use MySQL, including early use of SqlServer and Oracle Many core businesses of the company have gradually begun to migrate to MySQL. The stand-alone MySQL often cannot support this type of business. It is necessary to consider a distributed solution. In addition, the business that originally used MySQL will also encounter with the increase in data volume and access volume. The bottleneck will eventually consider the use of distributed solutions. The overall development trend of JD is shown in Figure 1.

Figure 1 Evolution trend of business use database

There are many kinds of distributed database solutions, which are also very common in various Internet companies. In essence, the data is separated and stored on multiple nodes to relieve the pressure of a single node. The business level can also be carried out according to the characteristics of the business. Split, as shown in Figure 2. Suppose there is a user table with ID as the split key. If split into two, the simplest is that data with odd IDs falls on a storage node, and data with even IDs falls on On another storage node, the actual deployment diagram is shown in Figure 3.

In addition to splitting the business level, you can also consider adopting some more general solutions, which are mainly divided into two categories. One is the client solution. This solution is to introduce a specific client package into the business application and pass the client The end package completes operations such as data splitting, querying and result aggregation. This solution is somewhat intrusive to the business. As the number of business application instances deployed is larger, the database end may face the problem of greater pressure on the number of connections. Another version Upgrading is also more difficult. The advantage is that the link is shorter, from the application instance directly to the database.

Figure 2 Schematic diagram of data split

The other is a middleware solution. This solution is to provide a proxy compatible with database transfer protocols and syntax specifications. When connecting to the middleware, the business can directly use traditional JDBC and other clients, thus greatly reducing the level of business development. The disadvantage is that the difficulty of developing middleware is slightly higher than that of the client solution. In addition, the network transmission link takes an extra period of time, which theoretically has a slight impact on performance. In the actual use environment, these systems are all accessed from the computer room intranet. , The impact of this kind of network is completely negligible.

Figure 3 Schematic diagram of system deployment

According to the above analysis, in order to better support JD s large-scale data business, we have developed a middleware solution for a distributed database compatible with the MySQL protocol. We call it JProxy. This solution has gone through many times. The evolution of finally completed and supported the mission of going to Oracle/Sqlserver for the entire group.

The first version of JProxy is shown in Figure 4. Each JProxy will have a configuration file. We will configure the database table split information and routing information of the corresponding business in the configuration file. After JProxy receives the SQL, it will parse the SQL again. According to the routing information, determine whether the SQL needs to be rewritten and to which nodes it should be sent. After the results of each node are returned, the results are summarized and returned to the application according to the MySQL transmission protocol.

Combining the above example, when the user queries the user table, suppose the SQL statement is select * from user where id = 1 or id = 2. After receiving this SQL, JProxy will split the SQL into select * from user where id=1 and select * from user where id = 2, then send these two sql statements to the back-end node respectively, and finally return the two records obtained on the two nodes to the application together.

This solution is feasible when the business database tables are relatively small. With the development of the business, the number of database tables may continue to increase, especially for businesses that go to Oracle. When switching databases, it may switch several tables at a time. Switching to several other tables at a time requires frequent modification of the configuration file. In addition, JProxy needs to deploy at least two or more copies when deploying, as shown in Figure 5. At this time, a problem is how to ensure that all configuration files are completely consistent during the continuous modification process. In the early operation and maintenance process, we manually modify a configuration file, and then copy the corresponding configuration file to other JProxy to ensure that the content of the JProxy configuration file is consistent. This process is mentally burdened and error-prone.

Figure 4 Version One

Figure 5 Configuration file

In later versions, we introduced the JManager module, which is responsible for managing the routing meta information in the configuration file, as shown in Figure 6. JProxy routing meta information is obtained uniformly through JManager. We only need to add and modify routing metadata to the meta database through JManager. After the operation is completed, notify each JProxy to dynamically load the routing information to ensure that the routing information of each JProxy is completely consistent. , So as to solve the pain points of maintaining the consistency of routing meta-information.

Figure 6 Version Two

One of the issues that must be considered when referring to distributed database solutions is the issue of capacity expansion. There are two ways to expand capacity. One is called re-sharding. Simply put, one piece is split into two pieces, and two pieces are split into four pieces. As shown in Figure 7, there was originally only one MySQL instance and one shard, then it was split into two shards, shard1 and shard2, and then a new MySQL instance was added, and shard1 was split into two shards, shard11 and shard12. Shard2 is split into two shards, shard21 and shard22, and placed on another newly added MySQL instance. This expansion method is the most ideal, but it will be slightly troublesome when implemented. We chose another bias in the short term. Be conservative. We call the expansion model that is sufficient to support business development under the premise of reasonable estimation. We call it the pre-sharding scheme. This scheme is to split the number of shards that are sufficient for a certain period of time in advance. When the amount of data in the early stage is small These shards can be placed on one or a small number of MySQL instances. After the later data volume increases, new MySQL instances can be added to the cluster, and the original shards can be migrated to the newly added MySQL instance, as shown in Figure 8. As shown, we split into four shards, shard1, shard2, shard3, and shard4 at the beginning. These four shards were originally on a MySQL instance. After the amount of data increases, we can add new MySQL instances. Migrating shard3 and shard4 to a new MySQL instance, the number of shards in the entire cluster has not changed but the capacity has doubled.

Figure 7 Re-sharding scheme

Figure 8 pre-sharding scheme

The pre-sharding solution is equivalent to achieving expansion through migration. The change of sharding position involves a series of changes such as data migration verification and routing metadata changes, so we introduced the JTransfer system, as shown in Figure 9. JTransfer can perform online seamless migration. You only need to submit a migration plan when migrating and expanding, specifying which source instance to migrate a shard from to which target instance, and you can specify when to start the migration task. When the time comes, the system will The migration starts automatically. The entire migration process involves the migration of the basic full data and the incremental data generated by business access during the migration process. At the beginning, the basic full data will be dumped from the source instance to the target instance to restore. After confirming that the data is correct, start to catch up with the incremental data. When the incremental data catches up to a certain level, the system estimates that it can catch up quickly. We will do a short lock operation to ensure that all the last increments are caught up. This lock time is also a time that can be specified when submitting the migration task. Parameters, for example, can only be locked for 20s at most. If the final remaining increment fails to catch up within 20s due to a sudden increase in the amount of access at this time, the entire migration task will be abandoned to ensure that the impact on online access is minimized. After the migration is completed, the routing meta-information will be modified, and the routing meta-information will be pushed to all JProxy at the same time, and finally unlocked, the access will be sent to the new location of the shard according to the route.

Figure 9 Version Three

When the system is used in a production environment, in addition to the above introduction, many deployment and operation and maintenance issues need to be considered. The first thing to consider is how the system survives, the self-protection ability of the system needs to be considered, and the stability of the system must be considered. , To achieve performance that can meet business needs.

Inside JProxy, we adopted an event-driven network IO model while taking into account the characteristics of multi-core, etc., to maximize the performance of the entire system. During the stress test, the performance of JProxy showed almost linear growth with the increase of MySQL instances. Trends, and there is no pressure on the machine where JProxy is located during the whole process.

To ensure performance is not enough, you also need to consider the number of control connections, control system memory, etc. The number of connections is mainly to control the number of connections. This is a better understanding. Controlling memory mainly refers to the amount of memory required by the control system during use, such as When data is collected, the sql statement is a full query like select * from table. At this time, all the MySQL data in the back-end will send data to the middleware concurrently through multiple connections. There is only one connection from the middleware to the application. Failure to control the memory will result in middleware OOM. In the specific implementation, we control the network flow rate of the front and back ends of the middleware by pressing the data in the TCP stack to ensure that the memory of the entire system is in a controllable range. Inside.

In addition, permissions need to be considered. Which IP can access and which IP cannot be accessed need to be precisely controlled. Specific to a certain table, you also need to control the permissions for adding, deleting, modifying, and checking. We recommend that the business should be split as much as possible when writing SQL. The field guarantees that SQL can fall on a certain shard to ensure that the entire access is sufficiently simple and controllable. We provide fine access control for it, which can achieve table-level addition, deletion, modification, and check permissions, including whether to have split Fields, to achieve maximum control of SQL, to ensure that SQL written in the test phase that does not meet expectations can be found in time, which greatly reduces the risk of later online operations.

In addition to basic stability, a high-availability solution for services needs to be considered in the overall system. JProxy is stateless. A business must deploy at least two JProxy in the same computer room and must cross racks to ensure that JProxy is highly available in the same computer room. Two JProxy will be deployed in another computer room to achieve high availability across computer rooms. In addition to the high availability of the middleware itself, it is also necessary to ensure the high availability of the database. The high availability of the entire link is the real high availability. The database level will be deployed in the same machine room according to one master and one slave, and another backup will be deployed in the backup machine room, as shown in Figure 10. JProxy accesses MySQL through the domain name. If the MySQL master is abnormal, the database will perform the corresponding master-slave switch operation. JProxy can access the new master after the switch. If the database of the entire computer room is abnormal, you can directly switch the domain name of the data to the backup computer room. , To ensure that JProxy can access the database in the standby computer room. Business access to JProxy is also through domain name access. If the JProxy in a computer room is abnormal, the domain name of the JProxy front-end is directly switched to the standby computer room similar to the database, so as to ensure that the business can always access JProxy normally.

High reliability of data is also a very critical point. We will regularly back up the data in the database and store the backup data in the corresponding storage system to ensure that the data in the database can still be restored even if it is deleted.

Figure 10 Schematic diagram of deployment

Monitoring and alarming when the system is running online is extremely important. The monitoring can be divided into multiple levels, as shown in Figure 11, from the information of the host and operating system to the information of the application system to the monitoring of specific information within a specific system, etc., for operations System and host monitoring. Jingdong has a MJDOS system that can incorporate various information such as system memory/cpu/magnetic/network card/machine load into the monitoring system. The basic information of these operating systems is critical to the diagnosis of system abnormalities, such as network loss. Service exceptions caused by packages can be found in this monitoring system in time.

Jingdong also has a unified monitoring and alarm system UMP. This monitoring system mainly serves all application systems. All application systems expose interfaces according to certain rules. After registering in the UMP system, the UMP system can provide a complete set of monitoring and alarm services. , The most basic such as the survival monitoring of the system and whether there are slow queries and so on.

In addition to these two basic monitoring systems, we have also developed a customized monitoring system JMonitor for the entire middleware system. The reason for developing this monitoring system is that we need to collect more customized monitoring information. When the system is abnormal It can locate the problem in the first time. For example, when the business finds that the TP99 declines, it is often accompanied by slow SQL. The application passes through JProxy to MySQL and then from MySQL to JProxy and then back to the application from sending SQL to receiving the result. Any link on the link may be slow. No matter which stage is time-consuming, we need to refine the record of this slow SQL to the level of time spent in each stage, so that it can be fast and accurate when slow SQL occurs. Find the source of the problem and quickly solve the problem.

In addition, when going to Oracle/SqlServer with the business, we do not recommend using cross-database transactions, but there will be a situation where the SQL in the same transaction has split fields, and each SQL is a single node. There are multiple pieces of such SQL in the same transaction, but it turns out that this transaction is cross-database. We will have detailed records of this kind of transaction. The business side can directly find this kind of transaction through JMonitor for better further improvement. In addition to this, in the test environment, the SQL written at the beginning of the business system did not consider too much optimization. There may be more slow SQL. We will collect these slow SQLs in a unified manner on the JMonitor system for analysis and processing to help the business party to iterate quickly. Adjust the SQL statement.

Figure 11 Monitoring system

Businesses should try to avoid cross-database SQL when using this system. One very important reason is that when cross-database SQL appears, it will consume more MySQL connections. As shown in Figure 12, one does not have split fields. The SQL will be sent to all shards. If there are 64 shards on a MySQL instance, then one such SQL will consume 64 connections on the MySQL instance. This resource consumption is very considerable. If you can Controlling SQL to fall on a single shard can greatly reduce the connection pressure on the MySQL instance.

Figure 12 Number of connections

Cross-database distributed transactions should be avoided as much as possible. One is that MySQL-based distributed database middleware cannot guarantee strict distributed transaction semantics. The other is to avoid strict distributed transaction semantics even if it can support strict distributed transaction semantics. Broken database transactions, the deadlock of multiple cross-database distributed transactions on one shard will cause the transactions on other shards to be unable to continue causing a large area of deadlock, even for transactions on a single node. Try to control the transaction as small as possible to reduce the probability of deadlock.

Businesses with different specific routing strategies can be treated specially. Take the Jingdong sorting center as an example. The size of each sorting center is very different. The sorting center in large cities such as Beijing and Shanghai has a large amount of data. Smaller, we will customize the routing strategy for this feature, so that the data of the large sorting center falls on a specific MySQL instance with better performance, and the data of other small sorting centers can be split according to ordinary Processing in points.

At the JProxy system level, we can support the multi-tenant mode, but considering that the business to Oracle/SqlServer is often a very important business with a huge amount of data, our system is a set of independent deployments for different businesses, avoiding it at the deployment level Interaction between various businesses. Considering that independent deployment will cause some waste of resources, we have introduced a container system to isolate operating system resources through containers to ensure full utilization of system resources. Many problems do not have to be solved at the code level. Things that are troublesome to solve at the code level or cannot be 100% controlled can be solved at the architecture level. Things that are not solved at the architecture level can be solved at the deployment level. The things that are not easy to solve can be solved at the product level. There are various ways to solve the problem. It needs to be considered from the overall system perspective. To quote Deng Gong s sentence, "No matter the black cat or the white cat, it is good to catch mice. Cat", the same reasoning is that a system that can support business development is a good system.

In addition, I will briefly discuss why the MySQL-based distributed database middleware system cannot guarantee strict distributed transaction semantic support. The so-called distributed transaction semantics is essentially the semantics of the transaction, including ACID attributes, which are atomicity, consistency, durability, and isolation.

Atomicity means that a transaction either succeeds or fails, and there can be no intermediate state. Persistence means that once a transaction is submitted successfully, it is still successful to recover after the system crashes. Isolation means that each concurrent transaction is isolated and invisible, and there may be many isolation levels in the specific implementation of the database. The consistency of transactions means to ensure that the system is in a consistent state. For example, if 500 yuan is transferred from account A to account B, then the total amount of the system has not changed from the perspective of the overall system, and the account of A cannot appear If 500 yuan is subtracted, but account B does not increase 500 yuan.

Figure 13 Serializable scheduling

When transactions are executed in the database system, there is a problem of serializable scheduling. Assuming there are three transactions of T1, T2, and T3, the effect of the execution of these three transactions should be the same as the effect of the serial execution of the three transactions. That is, the final effect should be one of {T1/T2/T3, T1/T3/T2, T2/T1/T3, T2/T3/T1, T3/T1/T2, T3/T2/T1} when it comes to When it comes to distributed transactions, the scheduling between each sub-transaction must be consistent with the scheduling sequence of the global distributed transaction to meet the requirements of serializable scheduling. As shown in Figure 13, the three distributed T1/T2/T3 For transactions, the scheduling sequence in one library is T1/T2/T3 consistent with the global scheduling sequence, and the scheduling sequence in the other library becomes T3/T2/T1, which breaks from a global perspective. In order to achieve serializable scheduling, serializable scheduling ensures the realization of isolation. When serializable scheduling is broken, natural isolation will also be broken. In the implementation of MySQL-based distributed middleware solutions, because The transaction ID of each sub-transaction of the same distributed transaction is generated on each MySQL, and no global transaction ID is provided to ensure that the scheduling order of each sub-transaction is consistent with the global distributed transaction, resulting in isolation that cannot be guaranteed , So the current MySQL-based distributed transaction cannot guarantee strict distributed transaction semantic support. Of course, with the introduction of GR into MySQL, the strong consistency in the CAP theory can be achieved, and it is also possible to support strict distributed transactions by strengthening the related functions of middleware and customizing MySQL related functions.