Learning Kafka from Scratch: A Guide to Kafka (Part 2)

Build Data Channels

Some Considerations

Take into account the following factors when building a data channel: timeliness, reliability, throughput, security (such as channel security and auditing), online compatibility of data formats, along with extract, transform, and load (ETL) or extract, load, and transform (ELT), and unified or exclusive mode (for example, GoldenGate is Oracle proprietary, with strong coupling). Select Kafka Connect first.

Understanding Kafka Connect

The Connector API is implemented for connectors and consists of two parts. The “divide and rule” principle is observed. Connectors are equivalent to splitters, and tasks are equivalent to executors after splitting.

  • Splits data replication by task.
  • Fetches task configuration from the worker process and passes it on.

Cross-cluster Data Image

The distributed cluster that Alibaba Cloud built is based on the High-Speed Service Framework (HSF) is a large and unified cluster, with data centers distributed in Shenzhen, Zhangjiakou, and other regions, but with intra-data center clusters that do not converge with each other. Millions of clusters are supported by ConfigServer and other fundamental services. Local on-premises is a non-default behavior. From the perspective of service governance, I think it is more reasonable to build a cluster that spans the entire city. Service integration and calling across data centers require special control and management.

Scenarios of Cross-cluster Images

  • Regional and central clusters: data synchronization across data centers
  • Disaster recovery: a feature intended for clusters
  • Cloud migration: cluster synchronization between on-premises data centers and cloud services in hybrid cloud

Multi-cluster Architecture

The following architectural principles must be followed considering actual limits, such as high latency-the biggest technical issues-limited bandwidth, and the high costs of public network broadband.

  • Data replication between every two data centers must ensure that each event is replicated only once, unless retry is required to solve an error.
  • Data should be read from rather than written to a remote data center if possible.

Hub and Spoke Architecture

The hub and spoke architecture is applicable to the scenario where a central Kafka cluster corresponds to multiple local Kafka clusters, as shown in the following figure. Data is generated only in local data centers, and the data of each data center is mirrored to the central data center only once. An application that processes the data of only one data center can be deployed in the local data center, whereas an application that processes the data of multiple data centers must be deployed in the central data center. The hub and spoke architecture is easy to deploy, configure, and monitor because data replication is unidirectional, and consumers always read data from the same cluster. However, applications in one data center cannot access data in another data center.

Active-Active Architecture

Active-Standby Architecture


MirrorMaker is completely stateless, and its receive-and-forward mode is similar to the proxy mode, with performance determined by the threading model design.


What Is Streaming Data or Data Stream?

  • Borderless: Data streams are abstract representations of borderless data or event sets. Borderless means infinite and sustained growth.
  • Orderly: Events occur in order.
  • Immutable: An event cannot be changed once it occurs, just as time cannot be reversed. Each event is equivalent to a database transaction (ACID).
  • Replayable: It is a valuable property of event streams. You can easily find unreplayable streams, such as the TCP data packets that pass through sockets. However, most services require to replay the raw event streams that happened several months or even years ago. The purpose may be to correct past errors by using a new analytical method or to perform auditing. This is why we believe that Kafka can support effective streaming for modern services by capturing and replaying event streams. Without this capability, streaming is at best an unpractical experiment in data science labs.

Basics of Streaming

  • Time: This is the core and fundamental concept. For example, the fund system of Ant Financial that I used to work on uses diverse types of time. Streaming also includes many time types, such as the event occurrence time, log append time, and processing time. Pay attention to the time zone.
  • State: This specifies the information between events. States are typically stored in a local variable of an application. For example, a mobile counter is stored in the local hash table. However, such storage is unreliable because states may be lost if the application is closed, causing result changes. This is not what we expect. Therefore, be careful to apply persistency to recent states, and restore the states if the application restarts.
  • Stream-table duality: You can understand this concept by looking at the architectural ideas and practices of event sourcing and Command Query Responsibility Segregation (CQRS). To convert a table to a stream, you need to capture all changes in the table and store the Insert, Update, and Delete events in the stream. This process is also called change data capture (CDC). To convert a stream to a table, you need to apply all changes in the stream. This process is also called stream materialization. Then, create a table in the memory, internal state storage, or external database, traverse all events in the stream, and modify states one by one. Then, a table is generated to indicate the state at a time point, that is, a materialized view.
  • Time window: This includes the window size, change window, and updatable time length. I think about the implementation of transactions per second (TPS) throttling rules in the High-Speed Service Framework (HSF). TPS throttling is sensitive to the time window. I can better understand new scenarios by referring to past similar scenarios. The time window is divided into the scrolled window and hopping window. The TPS throttling rules of HSF are used to refresh the number of tokens by means of rotation.

Streaming Design

  • Single-event processing: It is stateless and only implements the computing logic. Examples: filter, map, and execute. Event-driven microservices are a type of message driver based on the publish/subscribe pattern and also are the simplest and commonest streaming form. I used to think that the enterprise service bus (ESB) was prone to single point failure due to its dependency on one message center and lags behind the service-oriented architecture (SOA) of HSF. In retrospect, this opinion was biased. I further thought about Alibaba Data Mid-End. I once participated in a star ring test with complex details and meaningless abstraction. Second-level domains have order placement journal systems to control state transition for services. Additional information: After understanding the transactional semantics of Kafka, I identified great differences between the transactions of Kafka and the transactional messages of RocketMQ and Notify. Kafka transactions guarantee the atomicity of messages that are processed in a consume-transform-produce workflow during streaming or are sent across partitions, whereas the transactional messages of RocketMQ and Notify guarantee message reliability for services as definite events. RocketMQ is more suitable for message integration in transaction systems. If you need to use Kafka for transactional messages in transaction scenarios, implement a service-specific database to commit services in the form of transactions. Kafka drives downstream systems by consuming database binlogs through Connect+Kafka transactional semantics.
  • Use the local state: Each operation is merged by group, such as merging based on each stock code rather than the stock market. A Kafka partitioner is used to ensure that events with the same stock code are always written to the same partition. Each instance of an application fetches events from its allocated partitions. This is the consumer guarantee offered by Kafka. That is to say, each instance of the application maintains the state of a stock code subset. See the following figure. The following issues must be solved:
  • Memory usage: Application instances must have available memory to save the local state.
  • Persistence: Make sure that states are not lost when applications are closed and that states are restored after applications restart or switch to another application instance. These issues can be solved by Streams, which uses the embedded RocksDB to store local states in the memory and also persistently store them in a disk so that they are restored after applications restart. Changes in local states are sent to Kafka topics. Local states are not lost if the Streams node crashes. You can re-create local states by reading events from Kafka topics. For example, if the local state contains “the current minimum price of IBM is 167.19” and is stored in Kafka, you can re-create the local cache by reading the data later. These Kafka topics use compressed logs to ensure that they do not increase infinitely and to facilitate state reconstruction.
  • Rebalancing: Sometimes, partitions are reassigned to different consumers. In this case, the instance that loses a partition must save the final state, while the instance that obtains a partition must know how to restore to the correct state.
  • Reprocessing: Consider whether to reset states or activate a new application group.

Kafka Stream

  • Benefits: The Kafka Stream API is deployed in the form of a class library together with service applications and requires no extra dependencies or scheduling services that exist in Spark, Flink, and Storm. It is similar to the three-layer scheduling framework of Ant Financial, which is compact, simple, and effective, able to solve most batch processing issues.
  • The Kafka Stream API features scalability and fault tolerance based on the partition characteristic of Kafka and the automatic load balancing capability of consumers, naturally giving Kafka the inherent foundation of a data stream platform.
  • The Kafka Stream API is used similarly to the Stream API of Java Development Kit 8.

Original Source



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store