Using Flink Connectors Correctly

Section 1: Flink Streaming Connectors

  • Predefined sources and sinks
  • Built-in bundled connectors
  • Connectors provided by third-party project Apache Bahir
  • Async I/O

Method 1: Predefined Sources and Sinks

  • File-based sources and sinks, for example env.readFile(fileInputFormat, path)
  • Socket-based sources and sinks
  • Sources based on collections and iterators in memory
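The predefined sources above can be sketched in a few lines of the DataStream Java API. This is a minimal illustration, not a production job; the file path, host, port, and job name are placeholders.

```java
import java.util.Arrays;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PredefinedSourcesDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // File-based source: read a text file line by line
        DataStream<String> fromFile = env.readTextFile("/tmp/input.txt");

        // Socket-based source: one record per newline-delimited message
        DataStream<String> fromSocket = env.socketTextStream("localhost", 9999);

        // Collection-based source: handy for local testing
        DataStream<Integer> fromCollection = env.fromCollection(Arrays.asList(1, 2, 3));

        // Predefined sink: print records to stdout
        fromCollection.print();

        env.execute("predefined-sources-demo");
    }
}
```

Because these sources run inside a Flink job, the snippet needs a Flink runtime on the classpath; it is shown here only to make the bullet list concrete.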

Method 2: Bundled Connectors

Method 3: Connectors Provided by Apache Bahir

Method 4: Async I/O

Section 2: Flink Kafka Connectors

1 Flink Kafka Consumer

  • Data Deserialization
  • Consumer Start Offset Setup
  • setStartFromGroupOffsets is the default policy. It reads data from the group offset, which refers to the last consumer offset of a group recorded by a Kafka broker. If the broker has no offset recorded for the group (for example, the group is consuming for the first time), the start offset is determined by the Kafka parameter auto.offset.reset.
  • setStartFromEarliest reads data starting from the earliest offset of Kafka.
  • setStartFromLatest reads data starting from the latest offset of Kafka.
  • setStartFromTimestamp(long) reads data starting from the first record whose timestamp is greater than or equal to the specified timestamp. The Kafka timestamp is the one Kafka attaches to each message; depending on the topic configuration, it is either the time the producer created the message or the time the message entered the Kafka broker.
  • setStartFromSpecificOffsets reads data starting from explicitly specified offsets for specified partitions, so we must provide a collection mapping partitions to offsets. If the consumer is assigned a partition that has no offset in the provided collection, it falls back to the default group-offset behavior (setStartFromGroupOffsets) for that partition.
  • Dynamic Discovery of Topics and Partitions
  • Commit Offsets
  • Timestamp Extraction/Watermark Generation
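The start-offset policies above can be sketched against the FlinkKafkaConsumer API this article describes. The topic name, group ID, broker address, and the offset/timestamp values are placeholders; exactly one policy should be active at a time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;

public class ConsumerStartOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "demo-group");

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("demo-topic", new SimpleStringSchema(), props);

        // Default policy: committed group offsets, falling back to auto.offset.reset
        consumer.setStartFromGroupOffsets();

        // Alternatives (uncomment one instead of the line above):
        // consumer.setStartFromEarliest();                // earliest offset, ignores committed offsets
        // consumer.setStartFromLatest();                  // latest offset, ignores committed offsets
        // consumer.setStartFromTimestamp(1600000000000L); // first record with timestamp >= value (epoch ms)

        // Per-partition offsets; partitions missing from the map fall back to group offsets
        Map<KafkaTopicPartition, Long> specificOffsets = new HashMap<>();
        specificOffsets.put(new KafkaTopicPartition("demo-topic", 0), 42L);
        // consumer.setStartFromSpecificOffsets(specificOffsets);
    }
}
```

The consumer is then added to a job with env.addSource(consumer); running it requires a Flink runtime and the Kafka connector dependency, so the snippet is a configuration sketch rather than a runnable program.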

2 Flink Kafka Producer

  • When we use FlinkKafkaProducer to write data to Kafka and do not set a partitioner explicitly, FlinkFixedPartitioner is used by default. This partitioner maps each parallel sink instance to a partition by taking the instance ID modulo the number of partitions: parallelInstanceId % partitions.length.
  • If there are more sink tasks than partitions, several tasks share a partition: with four sinks and one partition, all four tasks write to that single partition. Conversely, if there are fewer sink tasks than partitions, some partitions receive no data: with two sink tasks and four partitions, only the first two partitions get data.
  • If we set the partitioner to null when building the FlinkKafkaProducer, Kafka's default round-robin partitioner is used instead. In this case, each sink task writes to all downstream partitions in turn. The advantage is that data is spread evenly across all partitions. The drawback is that each task must maintain a connection to the broker of every partition it writes to, so a large number of partitions means a large number of network connections.
  • Fault Tolerance
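The fixed-partitioner mapping described above reduces to a single modulo rule. The sketch below is a standalone illustration of that rule (it is not Flink's actual FlinkFixedPartitioner class) and reproduces both scenarios from the bullets: more sinks than partitions, and fewer sinks than partitions.

```java
// Core rule of the fixed partitioner: each parallel sink instance
// writes to partitions[parallelInstanceId % partitions.length].
public class FixedPartitionerDemo {
    static int targetPartition(int parallelInstanceId, int[] partitions) {
        return partitions[parallelInstanceId % partitions.length];
    }

    public static void main(String[] args) {
        int[] onePartition = {0};
        // Four sink tasks, one partition: every task hits partition 0
        for (int task = 0; task < 4; task++) {
            System.out.println("task " + task + " -> partition "
                    + targetPartition(task, onePartition));
        }

        int[] fourPartitions = {0, 1, 2, 3};
        // Two sink tasks, four partitions: partitions 2 and 3 never receive data
        for (int task = 0; task < 2; task++) {
            System.out.println("task " + task + " -> partition "
                    + targetPartition(task, fourPartitions));
        }
    }
}
```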

Section 3: Frequently Asked Questions
