Learning Kafka from Scratch: A Guide to Kafka (Part 2)

By Lv Renqi.

If you haven’t already, you’ll want to check out the first part of this two-part series before you continue any further. You can find that article here.

In this part, we’ll jump right back into the technical details.

Build Data Channels

Some Considerations

Understanding Kafka Connect

1. A connector provides the following functions:

  • Determines how many tasks to run.
  • Splits data replication by task.
  • Fetches task configuration from the worker process and passes it on.

2. A task is responsible for moving data into or out of Kafka. (Minimal sketches of a connector and a task follow below.)
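
To make the connector/task split concrete, here is a minimal, hypothetical sketch written against the Kafka Connect Connector API. The class name, the configuration keys (files, file), and the LineSourceTask it references (sketched further below) are illustrative assumptions, not something from the original article; only the SourceConnector API itself is standard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;

// Hypothetical connector: splits a comma-separated list of files across tasks.
public class FileListSourceConnector extends SourceConnector {

    private List<String> files;

    @Override
    public void start(Map<String, String> props) {
        // Connector-level configuration arrives from the worker process.
        files = Arrays.asList(props.get("files").split(","));
    }

    @Override
    public Class<? extends Task> taskClass() {
        return LineSourceTask.class; // the task sketched below
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Decide how many tasks to run and split the replication work among them
        // (here: one file per task, capped at maxTasks -- a deliberate simplification).
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < Math.min(maxTasks, files.size()); i++) {
            configs.add(Collections.singletonMap("file", files.get(i)));
        }
        return configs;
    }

    @Override
    public void stop() { }

    @Override
    public ConfigDef config() {
        return new ConfigDef(); // a real connector would declare and validate its settings here
    }

    @Override
    public String version() {
        return "0.1";
    }
}
```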

Compared with using Kafka's Producer API and Consumer API directly to implement the incoming and outgoing data logic, Kafka Connect has already done much of the foundational work: the abstract Connector API, a rich ecosystem of connector implementations, and the worker process features, including a RESTful API, configuration management, reliability, high availability, scalability, and load balancing. Experienced developers know that writing the code to read data from Kafka and insert it into a database takes one or two days, but handling configuration, exceptions, RESTful APIs, monitoring, deployment, scaling, and failures can take several months. A connector takes care of many of these complex issues when it is used to replicate data. Connector development is further simplified by Kafka's offset tracking mechanism: an offset indicates not only the consumption progress in Kafka but also the progress of reading from the source system, which keeps behavior consistent when multiple connectors are used.

The source connector returns records that contain the partitions and offsets of the source system. The worker process sends these records to Kafka, and once Kafka acknowledges that a record has been saved successfully, the worker process saves its offset. The offset storage mechanism is pluggable, and offsets are stored in a topic, which, in my opinion, is a waste of resources. If a connector crashes and restarts, it can continue processing data from the most recent offset.
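
The task side of the same hypothetical example shows where these source partitions and offsets come from. The file/position offset scheme and the readNextLine helper are assumptions made for illustration; only SourceTask, SourceRecord, and offsetStorageReader() are standard Kafka Connect APIs.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical task: reads lines from one file and hands them to the worker process.
public class LineSourceTask extends SourceTask {

    private String filename;
    private long position;

    @Override
    public String version() {
        return "0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        filename = props.get("file");
        // On restart, the worker hands back the last offset it committed for this
        // source partition, so the task can resume where it left off.
        Map<String, Object> offset = context.offsetStorageReader()
                .offset(Collections.singletonMap("file", filename));
        position = (offset == null) ? 0L : (Long) offset.get("position");
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        String line = readNextLine(filename, position); // hypothetical helper
        if (line == null) {
            Thread.sleep(100);
            return Collections.emptyList();
        }
        position += line.length() + 1;
        // The source partition and source offset attached here are what the worker
        // persists (to an offsets topic) after Kafka acknowledges the record.
        return Collections.singletonList(new SourceRecord(
                Collections.singletonMap("file", filename),      // source partition
                Collections.singletonMap("position", position),  // source offset
                "lines-topic", Schema.STRING_SCHEMA, line));
    }

    @Override
    public void stop() { }

    private String readNextLine(String file, long pos) {
        return null; // file-reading details omitted in this sketch
    }
}
```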

Cross-cluster Data Image

Scenarios of Cross-cluster Images

  • Disaster recovery: a backup cluster that can take over when the primary cluster fails
  • Cloud migration: cluster synchronization between on-premises data centers and cloud services in hybrid cloud

Multi-cluster Architecture

  • Each data center must have at least one cluster.
  • Data replication between every two data centers should ensure that each event is replicated only once, unless a retry is needed to recover from an error.
  • Data should be read from rather than written to a remote data center if possible.

Hub and Spoke Architecture

Active-Active Architecture

The active-active architecture serves nearby users and benefits from that locality, and it does not sacrifice functionality for data availability (a functional compromise that does exist in the hub-and-spoke architecture). It also provides redundancy and elasticity: each data center has full functionality, so when one fails, users can be redirected to another. Such redirection is network-based and is the simplest and most transparent failover method. The major issue this architecture must solve is how to avoid conflicts when the same data is asynchronously read and updated in multiple locations, and how to stop the same record from being mirrored between clusters endlessly. Data consistency becomes the critical concern.

We recommend the active-active architecture if you can properly resolve the conflicts that occur when data is asynchronously read and updated in multiple locations. It is the most scalable, elastic, flexible, and cost-effective solution we know of, so it is worth the effort to find ways to avoid circular replication, to stick requests from the same user to the same data center, and to resolve conflicts. Active-active mirroring requires bidirectional mirroring between every two data centers, which adds up quickly when more than two data centers exist: with 5 data centers, at least 20 and sometimes up to 40 mirroring processes need to be maintained, and each process must be configured with redundancy for high availability.

Active-Standby Architecture

The active-standby architecture is easy to implement, but it leaves one cluster sitting idle.

MirrorMaker

Try to run MirrorMaker in the target data center. That is, if you want to send data from New York to San Francisco, deploy MirrorMaker in the data center located in San Francisco. Long-distance external networks are less reliable than the internal networks of a data center. If a network partition occurs and the connection between data centers is interrupted, a consumer that cannot connect to a cluster is much safer than a producer that cannot connect. A consumer that fails to connect simply cannot read data, but the data is still retained in the Kafka cluster for a long time and is not at risk of being lost. By contrast, data may be lost if MirrorMaker has already read data before the partition occurred but fails to produce it to the target cluster. Therefore, remote reading is safer than remote writing.

Note: If traffic across data centers needs to be encrypted, we recommend that you deploy MirrorMaker in the source data center so that it reads unencrypted data locally and produces it to the remote data center over SSL. The performance impact is smaller when SSL is used only on the producer side. Make sure that MirrorMaker does not commit any offset until it receives a valid replica acknowledgement from the target broker, and that it stops mirroring immediately when the maximum number of retries is exceeded or the producer buffer overflows.
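
The "no data loss" behavior described in this note comes down to producer-side settings. The following is a minimal sketch of the relevant Kafka producer configuration; the broker address is hypothetical, and the specific values (timeouts, retries) are illustrative rather than the article's recommendation.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

// Producer-side settings for loss-averse mirroring to a remote cluster over SSL.
public class MirroringProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "target-dc-broker:9092"); // hypothetical
        props.put("security.protocol", "SSL");                       // truststore settings omitted
        props.put(ProducerConfig.ACKS_CONFIG, "all");                 // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);  // keep retrying transient errors
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");       // fail instead of blocking forever
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");  // avoid duplicates on retry
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // The mirroring loop (consume from the source cluster, send records here) is omitted.
        }
    }
}
```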

Streaming

What Is Streaming Data or Data Stream?

  • Ordered: Events occur in a definite order.
  • Immutable: An event cannot be changed once it occurs, just as time cannot be reversed. Each event is equivalent to a database transaction (ACID).
  • Replayable: This is a valuable property of event streams. It is easy to find streams that cannot be replayed, such as TCP packets passing through a socket, but most services need to replay raw event streams that occurred months or even years ago, whether to correct past errors, to apply a new analytical method, or to perform an audit. This is why we believe Kafka can support effective streaming for modern services: it can capture and replay event streams. Without this capability, streaming would be, at best, an impractical experiment in a data science lab. (A small replay sketch follows this list.)
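
As a small illustration of replayability, the sketch below rewinds a plain Kafka consumer to the offsets closest to a timestamp in the past. The topic name, group ID, and the 90-day window are assumptions, and this only works if the topic's retention actually covers that period.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

// Hypothetical replay: rewind a consumer to the offsets closest to a past timestamp.
public class ReplayFrom {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "reprocessing-app-v2");   // a new group, so the live app is untouched
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("page-views", 0); // hypothetical topic
            consumer.assign(Collections.singletonList(tp));

            // Ask the broker which offset corresponds to "90 days ago", then seek to it.
            long threeMonthsAgo = System.currentTimeMillis() - 90L * 24 * 60 * 60 * 1000;
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, threeMonthsAgo));
            OffsetAndTimestamp oat = offsets.get(tp);
            consumer.seek(tp, oat == null ? 0L : oat.offset());

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```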

Let us also look at the idea of "prefer event over command." Here, an "event" is a definite fact that has already occurred.

Basics of Streaming

  • State: the information carried between events. States are typically stored in local variables of the application, for example, a moving counter stored in a local hash table. However, such storage is unreliable, because the state is lost if the application shuts down, which changes the results. This is not what we want. Therefore, take care to persist the most recent state and restore it when the application restarts.
  • Stream-table duality: You can understand this concept by looking at the architectural ideas and practices of event sourcing and Command Query Responsibility Segregation (CQRS). To convert a table to a stream, you capture all changes to the table and store the Insert, Update, and Delete events in the stream; this process is also called change data capture (CDC). To convert a stream to a table, you apply all the changes in the stream; this process is also called stream materialization: create a table in memory, in an internal state store, or in an external database, traverse all events in the stream, and apply the state changes one by one. The result is a table that represents the state at a point in time, that is, a materialized view. (A minimal materialization sketch follows this list.)
  • Time window: This involves the window size, how often the window advances, and how long the window remains updatable. It reminds me of the transactions-per-second (TPS) throttling rules in the High-Speed Service Framework (HSF): TPS throttling is sensitive to the time window, and referring to a familiar scenario like this helps me understand new ones. Time windows are divided into tumbling windows and hopping windows. The TPS throttling rules of HSF refresh the number of tokens by rotation.
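
Here is a minimal sketch of stream materialization with the Kafka Streams DSL: a change stream of stock prices is reduced into a table that always holds the latest price per symbol, and the table's changelog is turned back into a stream. The topic names and the store name are assumptions.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class StreamToTable {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // A stream of price updates keyed by stock symbol (hypothetical topic).
        KStream<String, Double> priceEvents =
                builder.stream("stock-prices", Consumed.with(Serdes.String(), Serdes.Double()));

        // Materialization: apply every change, keeping only the latest value per key,
        // which yields a table backed by a local state store -- the stream-table duality.
        KTable<String, Double> latestPrice = priceEvents
                .groupByKey()
                .reduce((oldPrice, newPrice) -> newPrice,
                        Materialized.as("latest-price-store"));

        // The other direction: the table's changelog can be turned back into a stream.
        latestPrice.toStream().to("latest-prices");

        System.out.println(builder.build().describe());
    }
}
```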

Streaming Design

  • Use the local state: each aggregation is performed per group, for example per stock symbol rather than over the entire stock market. A Kafka partitioner ensures that events with the same stock symbol are always written to the same partition, and each instance of the application fetches events from the partitions assigned to it; this is a guarantee that the Kafka consumer provides. In other words, each application instance maintains the state for a subset of stock symbols. See the following figure and the sketch after this list. The following issues must be solved:
  • Memory usage: Application instances must have available memory to save the local state.
  • Persistence: Make sure that states are not lost when the application shuts down, and that they can be restored after the application restarts or the work moves to another application instance. Kafka Streams solves these issues by using embedded RocksDB to keep the local state in memory and also persist it to disk, so that it can be restored after a restart. Changes to the local state are also sent to a Kafka topic, so the local state is not lost even if a Streams node crashes: it can be re-created by reading the events back from that topic. For example, if the local state contains "the current minimum price of IBM is 167.19" and that fact is stored in Kafka, the local cache can be re-created later by reading the data back. These Kafka topics use log compaction to ensure that they do not grow indefinitely and to make state reconstruction easier.
  • Rebalancing: Sometimes, partitions are reassigned to different consumers. In this case, the instance that loses a partition must save the final state, while the instance that obtains a partition must know how to restore to the correct state.
  • Multi-phase processing and repartitioning: MapReduce developers may be familiar with multi-phase processing because they often need multiple reduce steps, and the applications that handle the different reduce steps must be isolated from each other. Unlike MapReduce, most streaming frameworks place multiple steps in the same application, and the framework decides which application instance or worker processes each step. This reminds me of the "split -> execute -> merge -> finalize" processing mode used for fund file development, which has an additional "merge" step compared with the streaming mode.
  • Stream-table join: an external lookup may introduce serious latency, typically 5 ms to 15 ms, which is unacceptable in many cases. It also places an extra load on the external data store that the store often cannot absorb: a streaming system can process 100,000 to 500,000 events per second, while a database can process only about 10,000 requests per second, so a more scalable solution is required. You can cache the database information inside the streaming application for better performance and scalability, but managing this cache is a daunting task. For example, you need to figure out how to keep the cached data up to date: refreshing too frequently puts pressure on the database and makes the cache pointless, while refreshing too slowly means streaming works on stale data. One option is to keep the cache up to date through change data capture with Kafka Connect, as shown in the following figure.
  • Stream-to-stream join based on a time window: events from the two streams must be matched within the same time window.
  • Reprocessing: Consider whether to reset the existing application's state or to start a new instance of the application under a new consumer group and replay the events from the beginning.
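
Referring back to the "use the local state" item above, here is a minimal Kafka Streams sketch that keeps a per-symbol minimum price over a tumbling time window; the state lives in a local RocksDB-backed store and is changelogged to Kafka, as described above. The topic names, application ID, and window size are assumptions, and TimeWindows.of reflects the older-style API (newer releases use ofSizeWithNoGrace).

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

public class MinPricePerWindow {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "min-price-app");   // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Double> trades =
                builder.stream("stock-trades", Consumed.with(Serdes.String(), Serdes.Double()));

        trades.groupByKey()                                          // same symbol -> same partition -> same instance
              .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))     // tumbling 5-minute window
              .reduce(Math::min, Materialized.as("min-price-store")) // local RocksDB store + changelog topic
              .toStream()
              .foreach((windowedSymbol, minPrice) ->
                      System.out.println(windowedSymbol.key() + " -> " + minPrice));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```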

Kafka Streams

  • The Kafka Streams API is scalable and fault-tolerant thanks to Kafka's partitioning model and the automatic load balancing of consumers, which naturally gives Kafka the foundation of a data stream platform.
  • The Kafka Streams API is used much like the Stream API introduced in Java Development Kit 8 (see the sketch below).
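
To illustrate the similarity, here is a sketch of the same filter-and-transform pipeline written first with the JDK 8 Stream API over an in-memory list and then with the Kafka Streams DSL over a topic; the topic names are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class StreamApiComparison {
    public static void main(String[] args) {
        // JDK 8 Stream API: filter and transform an in-memory collection.
        List<String> upper = Arrays.asList("kafka", "connect", "streams").stream()
                .filter(word -> word.length() > 5)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
        System.out.println(upper);

        // Kafka Streams DSL: the same shape of pipeline over an unbounded topic.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("words", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, word) -> word.length() > 5)
               .mapValues(String::toUpperCase)
               .to("long-words-upper", Produced.with(Serdes.String(), Serdes.String()));

        System.out.println(builder.build().describe());
    }
}
```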

Original Source

Follow me to keep abreast of the latest technology news, industry insights, and developer trends.