In-depth understanding of Kafka in distributed systems

Kafka is a distributed publish-subscribe messaging system. Originally developed at LinkedIn, it later became part of the Apache project. Kafka is a distributed, partitioned, replicated, persistent log service, used mainly to process active streaming data.

A common problem in big data systems is that the whole system is composed of many subsystems, and data must flow continuously between them with high performance and low latency. Traditional enterprise messaging systems are not well suited to large-scale data processing. Kafka appeared in order to handle online applications (messages) and offline applications (data files, logs) at the same time. It plays two roles:

  1. Reduce the complexity of system networking.

  2. Reduce programming complexity: subsystems no longer negotiate interfaces with each other; each subsystem plugs into Kafka the way a plug goes into a socket, and Kafka acts as a high-speed data bus.


The main features of Kafka:

  1. Provides high throughput for both publishing and subscribing. Reportedly, Kafka can produce about 250,000 messages (50 MB) per second and process about 550,000 messages (110 MB) per second.

  2. Supports persistence. Messages are persisted to disk, so they can serve both batch consumption (such as ETL) and real-time applications. Data loss is prevented by persisting data to disk and by replication.

  3. A distributed system that scales out easily. There can be many producers, brokers, and consumers, all distributed, and machines can be added without downtime.

  4. The state of message consumption is maintained on the consumer side, not on the server side, and consumption rebalances automatically on failure.

  5. Support online and offline scenarios.

Kafka's architecture

The overall architecture of Kafka is very simple. It is an explicitly distributed architecture: there can be multiple producers, brokers (Kafka itself), and consumers. Producers and consumers implement Kafka's registered interfaces. Data is sent from producers to brokers, and the broker plays the role of intermediate cache and dispatcher, distributing messages to the consumers registered with the system. A broker acts much like a cache between active data and an offline processing system. Communication between clients and servers is based on a simple, high-performance TCP protocol that is independent of programming language. Several basic concepts:

  1. Topic: the category of a feed of messages handled by Kafka; messages of different categories go to different topics.

  2. Partition: the physical grouping of a topic. A topic can be divided into multiple partitions; each partition is an ordered queue, and each message in a partition is assigned an ordered id called the offset. (A topic-creation sketch follows this list.)

  3. Message: the basic unit of communication. Each producer can publish messages to a topic.

  4. Producers: producers of messages and data. A process that publishes messages to a Kafka topic is called a producer.

  5. Consumers: consumers of messages and data. A process that subscribes to topics and processes the published messages is called a consumer.

  6. Broker: caching proxy. The one or more servers in a Kafka cluster are collectively called brokers.
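
To make topics and partitions concrete, here is a minimal sketch using the Java AdminClient (assuming a modern Kafka client library is on the classpath); the topic name, partition count, replication factor, and broker address are illustrative, not from the article:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic "page-views": 3 partitions, each replicated
            // to 2 brokers; every partition is an ordered, offset-indexed queue
            NewTopic topic = new NewTopic("page-views", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```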

Message sending process:

  1. The producer publishes messages to partitions of the specified topic according to the chosen partitioning method (round-robin, hash, etc.). (A producer sketch follows this list.)

  2. After the Kafka cluster receives messages from a producer, it persists them to disk and retains them for a configurable duration, regardless of whether they have been consumed.

  3. The consumer pulls data from the Kafka cluster and itself controls the offset from which messages are read.
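
A minimal producer sketch of steps 1–3 using the Java client; the topic name, key, and broker address are hypothetical. Records with the same key hash to the same partition, one of the partitioning methods mentioned in step 1:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key land on the same partition;
            // records with a null key are spread across partitions
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("page-views", "user-42", "clicked /home");
            producer.send(record, (metadata, e) -> {
                if (e == null) {
                    System.out.printf("partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records
    }
}
```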


Kafka design

1. Throughput

High throughput is one of the core goals Kafka must achieve. To that end, Kafka makes the following design choices:

  1. Disk persistence of data: messages are not held in an application-level cache but written straight to disk, taking full advantage of the disk's sequential read/write performance.

  2. Zero-copy: reduces the number of I/O steps.

  3. Batched sending of data.

  4. Data compression (batching and compression are shown in the producer-config sketch after this list).

  5. Topics are divided into multiple partitions to improve parallelism.
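
Points 3 and 4 correspond directly to producer settings. A minimal sketch, assuming the standard Java producer configuration keys; the values are illustrative:

```java
import java.util.Properties;

public class ThroughputConfigSketch {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        // Batch sending: accumulate up to 64 KB per partition, or wait
        // up to 10 ms, before sending a request (illustrative values)
        props.put("batch.size", 65536);
        props.put("linger.ms", 10);
        // Compression: whole batches are compressed on the producer side
        props.put("compression.type", "lz4");
        return props;
    }
}
```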

2. Load balancing

  1. The producer sends each message to a partition chosen by a user-specified algorithm (see the partitioner sketch after this list).

  2. A topic has multiple partitions, each partition has its own replicas, and the replicas are distributed across different broker nodes.

  3. Among a partition's replicas a leader must be elected; the leader is responsible for reads and writes, and ZooKeeper takes care of failover.

  4. ZooKeeper manages the dynamic joining and leaving of brokers and consumers.
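
One way to supply the user-specified algorithm from point 1 is to implement the Java client's Partitioner interface. A minimal sketch; the key-hashing scheme shown is only an example:

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

import java.util.Map;

// A hypothetical partitioner: route by a hash of the key,
// falling back to partition 0 for keyless records
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0;
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

Such a class would be registered through the producer's partitioner.class setting.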

3. Pull system

Because the Kafka broker persists data and is under no memory pressure, consumers are well suited to consuming data in pull mode, which has the following advantages:

  1. Simplifies Kafka's design.

  2. Consumers control the pull rate themselves, according to their own consumption capacity.

  3. Consumers choose their own consumption mode as their situation requires, such as batch consumption, repeated consumption, or consuming from the end of the log. (A consumer sketch follows this list.)
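
A minimal pull-mode consumer sketch with the Java client; the group id, topic, and broker address are hypothetical. Note that the consumer decides when to poll and when to commit offsets:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "demo-group");               // hypothetical group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // The consumer chooses when and how often to pull
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
                consumer.commitSync(); // the consumer controls its own offsets
            }
        }
    }
}
```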

4. Scalability

When a broker node needs to be added, the new broker registers itself with ZooKeeper; producers and consumers perceive the change through the watchers they have registered on ZooKeeper and adjust in time.

Application scenarios of Kafka

1. Message queue

Compared with most messaging systems, Kafka offers better throughput plus built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale messaging applications. Messaging workloads usually need only modest throughput but demand low end-to-end latency and strong durability guarantees, which Kafka provides. In this area, Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

2. Behavior tracking

Another application scenario for Kafka is tracking user behavior such as browsing and searching, recording it in real time to the corresponding topics in a publish-subscribe fashion. Once subscribers receive these events, they can process them further in real time, monitor them in real time, or load them into Hadoop or an offline data warehouse for processing.

3. Meta-information monitoring

Kafka can serve as the monitoring module for operational records, collecting and recording operational information; think of it as data monitoring for operations and maintenance purposes.

4. Log collection

For log collection there are many open-source products, including Scribe and Apache Flume, and many people use Kafka as a substitute for them. Log aggregation usually means collecting log files from servers and putting them in a central place (a file server or HDFS) for processing. Kafka, however, abstracts away the details of files and treats logs more cleanly as a stream of messages or events. This gives Kafka lower processing latency and makes it easier to support multiple data sources and distributed data consumption. Compared with log-centric systems such as Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees thanks to replication, and lower end-to-end latency.

5. Stream processing

There are many possible scenarios here, and they are easy to picture: collect and store stream data, then hand it to Storm or another stream-computing framework for processing. Many users process the data from an original topic in stages, aggregating, enriching, or otherwise transforming it into new topics before continuing. For example, an article-recommendation flow might first fetch article content from RSS feeds and throw it into a topic called "articles"; later stages might clean that content, for instance normalizing the data or removing duplicates, before finally returning the matched content to the user. Beyond the single original topic, this creates a series of real-time data-processing stages. Storm and Samza are well-known frameworks that implement this kind of transformation.
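
The staged article pipeline above could be sketched with Kafka Streams, one possible framework choice; the output topic name and the cleanup logic are hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ArticlePipelineSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "article-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Stage 1: read raw article content, normalize it, and write the
        // result to a new topic for the next stage to consume
        KStream<String, String> raw = builder.stream("articles");
        raw.mapValues(text -> text.trim().toLowerCase())
           .to("articles-clean");

        new KafkaStreams(builder.build(), props).start();
    }
}
```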

6. Event sourcing

Event sourcing is a style of application design in which state changes are recorded as a time-ordered sequence of records. Kafka's ability to store very large amounts of log data makes it an excellent backend for applications built this way, such as a news feed.

7. Persistent log (commit log)

Kafka can serve as an external persistent log for a distributed system. Such a log replicates data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. Kafka's log compaction feature supports this usage, in which Kafka resembles the Apache BookKeeper project.
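
In Kafka, log compaction is enabled per topic via the cleanup.policy setting. A minimal sketch of creating such a topic with the Java AdminClient; the topic name and sizing are illustrative:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CommitLogTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // A compacted topic retains at least the latest record per key,
            // which fits commit-log style state replication
            NewTopic commitLog = new NewTopic("node-state", 1, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singletonList(commitLog)).all().get();
        }
    }
}
```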

Design points of Kafka

1. Directly uses the Linux file-system cache (the page cache) to cache data efficiently.

2. Uses Linux zero-copy (the sendfile system call) to improve send performance. Traditional data transfer requires four context switches; with sendfile, the data is exchanged directly in kernel space, cutting the context switches to two.
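
For illustration, Java exposes this mechanism through FileChannel.transferTo, which delegates to sendfile on Linux. A minimal sketch with a hypothetical file path and destination:

```java
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySendSketch {
    public static void main(String[] args) throws Exception {
        // "/tmp/segment.log" and port 9000 are placeholders
        try (FileChannel file = new FileInputStream("/tmp/segment.log").getChannel();
             SocketChannel socket =
                     SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // Bytes flow from the page cache to the socket in kernel
                // space, without being copied into user space
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```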

3. The cost of data access on disk is O(1). Kafka manages messages by topic. Each topic contains multiple partitions; each partition corresponds to a logical log and is made up of multiple segments, and each segment stores multiple messages. A message's id is determined by its logical position, so the id maps directly to the message's storage location, avoiding any extra id-to-location mapping. Each partition keeps an in-memory index that records the offset of the first message in every segment. Messages published to a topic are distributed evenly across its partitions (randomly, or via a user-specified callback). The broker appends a received message to the last segment of the corresponding partition. When the number of messages on a segment reaches a configured value, or the time since publication exceeds a threshold, the segment is flushed to disk; only messages flushed to disk are visible to subscribers. Once a segment reaches a certain size, no more data is written to it, and the broker creates a new segment.
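
A simplified sketch (not Kafka's actual code) of the offset-to-location mapping described above, assuming segments are keyed by the base offset of their first message:

```java
import java.util.Map;
import java.util.TreeMap;

// The in-memory index keeps each segment's base offset; a floor lookup
// finds the segment that contains a given message offset
public class SegmentIndexSketch {
    private final TreeMap<Long, String> segmentsByBaseOffset = new TreeMap<>();

    void addSegment(long baseOffset, String fileName) {
        segmentsByBaseOffset.put(baseOffset, fileName);
    }

    // Assumes offset >= the first segment's base offset
    String locate(long offset) {
        Map.Entry<Long, String> segment = segmentsByBaseOffset.floorEntry(offset);
        // The position inside the segment is just offset minus base offset,
        // so no global id-to-location table is needed
        return segment.getValue() + " at relative offset " + (offset - segment.getKey());
    }
}
```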

4. Explicitly distributed: there are multiple producers, brokers, and consumers, all of them distributed. There is no load-balancing mechanism between producers and brokers; load balancing between brokers and consumers goes through ZooKeeper. All brokers and consumers register themselves in ZooKeeper, which keeps some of their metadata, and when a broker or consumer changes, all the others are notified.
