Learning Kafka from Scratch: A Guide to Kafka (Part 2)
By Lv Renqi.
If you haven’t already, you’ll want to check out the first part of this two part series before you continue any further. You can find this article here.
In this part, we’re just going to continue to jump right into all the technical stuff.
Build Data Channels
Take into account the following factors when building a data channel: timeliness, reliability, throughput, security (such as channel security and auditing), online compatibility of data formats, along with extract, transform, and load (ETL) or extract, load, and transform (ELT), and unified or exclusive mode (for example, GoldenGate is Oracle proprietary, with strong coupling). Select Kafka Connect first.
Understanding Kafka Connect
The Connector API is implemented for connectors and consists of two parts. The “divide and rule” principle is observed. Connectors are equivalent to splitters, and tasks are equivalent to executors after splitting.
1. A connector provides the following functions:
- Determines how many tasks to run.
- Splits data replication by task.
- Fetches task configuration from the worker process and passes it on.
2. A task is used to transfer data to or remove data from Kafka.
Compared with the Publisher API and Consumer API of Kafka that are directly used to implement the incoming and outgoing data logic, Kafka Connect has done much fundamental work for the abstract definition of the Connector API, rich connector implementation in the ecosystem, and the functions of the worker process, including a RESTful API call, configuration management, reliability, high availability, scalability, and load balancing. Experienced developers knew that it takes one to two days to read data from Kafka and insert the data into a database during coding, but several months to handle issues such as configuration, exceptions, RESTful APIs, monitoring, deployment, scaling, and failure. A connector handles many complex issues when it is used to replicate data. Connector development is simplified by the offset tracking mechanism of Kafka. An offset indicates not only the message consumption progress in Kafka but also the data sourcing progress. Behavior consistency is guaranteed when multiple connectors are used.
The source connector returns a record that contains the partitions and offsets of the source system. The worker process sends the record to Kafka. The worker process saves the offsets if Kafka returns an acknowledgement message to indicate that the record is saved successfully. The offset storage mechanism is pluggable, and offsets are stored in a topic, which, in my opinion, is a waste of resources. If a connector crashes and restarts, it can continue to process data from the most recent offset.
Cross-cluster Data Image
The distributed cluster that Alibaba Cloud built is based on the High-Speed Service Framework (HSF) is a large and unified cluster, with data centers distributed in Shenzhen, Zhangjiakou, and other regions, but with intra-data center clusters that do not converge with each other. Millions of clusters are supported by ConfigServer and other fundamental services. Local on-premises is a non-default behavior. From the perspective of service governance, I think it is more reasonable to build a cluster that spans the entire city. Service integration and calling across data centers require special control and management.
Scenarios of Cross-cluster Images
- Regional and central clusters: data synchronization across data centers
- Disaster recovery: a feature intended for clusters
- Cloud migration: cluster synchronization between on-premises data centers and cloud services in hybrid cloud
The following architectural principles must be followed considering actual limits, such as high latency-the biggest technical issues-limited bandwidth, and the high costs of public network broadband.
- Each data center must have at least one cluster.
- Data replication between every two data centers must ensure that each event is replicated only once, unless retry is required to solve an error.
- Data should be read from rather than written to a remote data center if possible.
Hub and Spoke Architecture
The hub and spoke architecture is applicable to the scenario where a central Kafka cluster corresponds to multiple local Kafka clusters, as shown in the following figure. Data is generated only in local data centers, and the data of each data center is mirrored to the central data center only once. An application that processes the data of only one data center can be deployed in the local data center, whereas an application that processes the data of multiple data centers must be deployed in the central data center. The hub and spoke architecture is easy to deploy, configure, and monitor because data replication is unidirectional, and consumers always read data from the same cluster. However, applications in one data center cannot access data in another data center.
The active-active architecture provides services for nearby users by leveraging its performance benefits. It does not compromise functionality for data availability (functional compromise exists in the hub and spoke architecture). The active-active architecture supports redundancy and elasticity. Each data center has complete functions, which allows it to redirect users to another data center when it fails. Such redirection is network-based and provides the simplest and most transparent failover method. This architecture needs to solve the major issue of how to avoid conflicts when data is asynchronously read and updated in multiple locations. Consider the issue of how to stop the endless mirroring of the same data record. Data consistency is more critical.
We recommend that you use the active-active architecture if you can properly solve the conflict that occurs when data is asynchronously read and updated in multiple locations. This architecture is the most scalable, elastic, flexible, and cost-effective solution we have ever known about. It is worth our effort to find ways to avoid circular replication, to paste requests of the same user to the same data center, and to solve conflicts. Active-active mirroring is bidirectional mirroring between every two data centers, especially when more than two data centers exist. If 5 data centers exist, at least 20 and sometimes up to 40 mirroring processes need to be maintained, and each process must be configured with redundancy for high availability.
The active-standby architecture is easy to implement, but a cluster is wasted.
MirrorMaker is completely stateless, and its receive-and-forward mode is similar to the proxy mode, with performance determined by the threading model design.
Try to make MirrorMaker run in the target data center. That is to say, if you want to send data from New York City to San Francisco, deploy MirrorMaker in the data center located in San Francisco. Long-distance external networks are less reliable than the internal networks of data centers. If network segmentation occurs and inter-data center connection is interrupted, a consumer that cannot connect to a cluster has much fewer security threats than a producer failed cluster connection. A consumer that fails to connect to a cluster cannot read data from the cluster, but the data is still retained in the Kafka cluster for a long time, without the risk of being lost. Data may be lost if MirrorMaker has read data before network segmentation occurs but fails to generate the data to the target cluster. Therefore, remote reading is more secure than remote writing.
Note: If traffic across data centers needs to be encrypted, we recommend that you deploy MirrorMaker in the source data center to allow it to read local non-encrypted data and generate the data to a remote data center over SSL. The performance problem is mitigated because SSL is used by the producer. Ensure that MirrorMaker does not commit any offset until it receives a valid replica acknowledgement message from the target broker and immediately stops mirroring when the maximum number of retries is exceeded or the producer buffer overruns.
What Is Streaming Data or Data Stream?
- Borderless: Data streams are abstract representations of borderless data or event sets. Borderless means infinite and sustained growth.
- Orderly: Events occur in order.
- Immutable: An event cannot be changed once it occurs, just as time cannot be reversed. Each event is equivalent to a database transaction (ACID).
- Replayable: It is a valuable property of event streams. You can easily find unreplayable streams, such as the TCP data packets that pass through sockets. However, most services require to replay the raw event streams that happened several months or even years ago. The purpose may be to correct past errors by using a new analytical method or to perform auditing. This is why we believe that Kafka can support effective streaming for modern services by capturing and replaying event streams. Without this capability, streaming is at best an unpractical experiment in data science labs.
Let us look at the concept of “prefer event over command.” Here, “event” is a definite fact that has occurred.
Basics of Streaming
- Time: This is the core and fundamental concept. For example, the fund system of Ant Financial that I used to work on uses diverse types of time. Streaming also includes many time types, such as the event occurrence time, log append time, and processing time. Pay attention to the time zone.
- State: This specifies the information between events. States are typically stored in a local variable of an application. For example, a mobile counter is stored in the local hash table. However, such storage is unreliable because states may be lost if the application is closed, causing result changes. This is not what we expect. Therefore, be careful to apply persistency to recent states, and restore the states if the application restarts.
- Stream-table duality: You can understand this concept by looking at the architectural ideas and practices of event sourcing and Command Query Responsibility Segregation (CQRS). To convert a table to a stream, you need to capture all changes in the table and store the Insert, Update, and Delete events in the stream. This process is also called change data capture (CDC). To convert a stream to a table, you need to apply all changes in the stream. This process is also called stream materialization. Then, create a table in the memory, internal state storage, or external database, traverse all events in the stream, and modify states one by one. Then, a table is generated to indicate the state at a time point, that is, a materialized view.
- Time window: This includes the window size, change window, and updatable time length. I think about the implementation of transactions per second (TPS) throttling rules in the High-Speed Service Framework (HSF). TPS throttling is sensitive to the time window. I can better understand new scenarios by referring to past similar scenarios. The time window is divided into the scrolled window and hopping window. The TPS throttling rules of HSF are used to refresh the number of tokens by means of rotation.
- Single-event processing: It is stateless and only implements the computing logic. Examples: filter, map, and execute. Event-driven microservices are a type of message driver based on the publish/subscribe pattern and also are the simplest and commonest streaming form. I used to think that the enterprise service bus (ESB) was prone to single point failure due to its dependency on one message center and lags behind the service-oriented architecture (SOA) of HSF. In retrospect, this opinion was biased. I further thought about Alibaba Data Mid-End. I once participated in a star ring test with complex details and meaningless abstraction. Second-level domains have order placement journal systems to control state transition for services. Additional information: After understanding the transactional semantics of Kafka, I identified great differences between the transactions of Kafka and the transactional messages of RocketMQ and Notify. Kafka transactions guarantee the atomicity of messages that are processed in a consume-transform-produce workflow during streaming or are sent across partitions, whereas the transactional messages of RocketMQ and Notify guarantee message reliability for services as definite events. RocketMQ is more suitable for message integration in transaction systems. If you need to use Kafka for transactional messages in transaction scenarios, implement a service-specific database to commit services in the form of transactions. Kafka drives downstream systems by consuming database binlogs through
- Use the local state: Each operation is merged by group, such as merging based on each stock code rather than the stock market. A Kafka partitioner is used to ensure that events with the same stock code are always written to the same partition. Each instance of an application fetches events from its allocated partitions. This is the consumer guarantee offered by Kafka. That is to say, each instance of the application maintains the state of a stock code subset. See the following figure. The following issues must be solved:
- Memory usage: Application instances must have available memory to save the local state.
- Persistence: Make sure that states are not lost when applications are closed and that states are restored after applications restart or switch to another application instance. These issues can be solved by Streams, which uses the embedded RocksDB to store local states in the memory and also persistently store them in a disk so that they are restored after applications restart. Changes in local states are sent to Kafka topics. Local states are not lost if the Streams node crashes. You can re-create local states by reading events from Kafka topics. For example, if the local state contains “the current minimum price of IBM is 167.19” and is stored in Kafka, you can re-create the local cache by reading the data later. These Kafka topics use compressed logs to ensure that they do not increase infinitely and to facilitate state reconstruction.
- Rebalancing: Sometimes, partitions are reassigned to different consumers. In this case, the instance that loses a partition must save the final state, while the instance that obtains a partition must know how to restore to the correct state.
- Multi-phase processing and repartitioning: MapReduce coders may be familiar with multi-phase processing because they often need to use multiple Reduce steps. The applications that process various Reduce steps must be isolated from each other. Unlike MapReduce, most streaming frameworks place multiple steps in the same application and determine the application or worker for processing each step. I think about the processing mode “split -> execute -> merge -> finalize” for fund file development, which has the additional “merge” procedure compared with the streaming mode.
- Stream-table combination: This may cause serious latency ranging from 5 ms to 15 ms during external searches. It is unfeasible in many cases. Stream-table combination also causes extra load that is unacceptable by external data storage. A streaming system can process 100,000 to 500,000 events per second, while a database can process only 10,000 events per second. A more scalable solution is required. You can cache database information in a streaming application for better performance and scalability. However, managing this cache is a daunting task. For example, you need to figure out a way to maintain up-to-date data in the cache. Frequent refreshing places a burden on the database and makes the cache ineffective. Delayed refreshing may cause the data that is used in streaming to be obsolete. You can refresh the cache by using Kafka Connect CDC, as shown in the following figure.
- Stream interconnection based on the time window: Take into account the time window.
- Reprocessing: Consider whether to reset states or activate a new application group.
- Benefits: The Kafka Stream API is deployed in the form of a class library together with service applications and requires no extra dependencies or scheduling services that exist in Spark, Flink, and Storm. It is similar to the three-layer scheduling framework of Ant Financial, which is compact, simple, and effective, able to solve most batch processing issues.
- The Kafka Stream API features scalability and fault tolerance based on the partition characteristic of Kafka and the automatic load balancing capability of consumers, naturally giving Kafka the inherent foundation of a data stream platform.
- The Kafka Stream API is used similarly to the Stream API of Java Development Kit 8.