First Impressions on Apache Flink
At $DAYJOB, we use Apache Flink heavily for all our stream processing needs. I've used Apache Spark at previous employers, found a lot of things about Flink very interesting, and figured that writing down my thoughts might help me work out how I feel about it 😄.
What is Flink?
As their website says:
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
Typically, stream processing engines are used to build near-real-time systems that consume data published to streams (like Kafka) and "do" stuff with said data - aggregate it, enrich it, persist it, etc.
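As a toy illustration of that "consume and aggregate" pattern - plain Java, no Flink, with a made-up Event type - a stream processor is essentially maintaining this kind of fold continuously as records arrive:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyAggregator {
    // A hypothetical event: who did something, and a count attached to it.
    record Event(String user, int clicks) {}

    // Consume a batch of events and fold them into a running per-user total -
    // the kind of aggregation a stream processor keeps updating forever.
    static Map<String, Integer> aggregate(List<Event> events) {
        Map<String, Integer> totals = new HashMap<>();
        for (Event e : events) {
            totals.merge(e.user(), e.clicks(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Event> batch = List.of(
            new Event("alice", 3), new Event("bob", 1), new Event("alice", 2));
        System.out.println(aggregate(batch)); // per-user click totals
    }
}
```

The difference in a real engine is that the input never ends and the state has to survive failures - which is exactly the "stateful computations over unbounded streams" part of Flink's pitch.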
Things I like about Flink
Flink is built from the ground up to process streams. The API that it exposes, from sources to sinks, is quite intuitive.
At the core of the API is the concept of a DataStream. A DataStream represents an immutable collection of data that can contain duplicates. It can be bounded (e.g. if reading from a table) or unbounded (e.g. if reading from a stream).
DataStreams can be created via Sources (and there's a rich collection of predefined sources already).
KafkaSource<String> source = KafkaSource.<String>builder()
.setBootstrapServers("localhost:9092")
.setTopics("source")
.setGroupId("my-group")
.setStartingOffsets(OffsetsInitializer.latest())
.setValueOnlyDeserializer(new SimpleStringSchema())
.build();
DataStream<String> sourceStream = env.fromSource(
source, WatermarkStrategy.forMonotonousTimestamps(), "Kafka Source")
.uid("kafkasourceuid");Which brings me to the first thing I really like about Flink - DataStreams are strongly typed. Which means your code is generally much more legible on the contents of your stream, when, e.g. you're keying your streams to create "Keyed" Streams:
KeyedStream<Event, String> keyedStream = inputStream.keyBy(event -> event.logKey());
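The `Event` type isn't defined anywhere above, but a minimal sketch of what `keyBy(event -> event.logKey())` assumes could look like this (plain Java; the field names are my invention):

```java
// A hypothetical event type for the keyed stream above. Because the stream
// is typed as DataStream<Event>, the compiler verifies that logKey() exists
// and that its return type matches the key type the KeyedStream declares
// (String, here) - no casting or guessing at runtime.
public record Event(String logKey, long timestamp, String payload) {}
```

If you renamed or retyped `logKey`, every key extractor touching the stream would fail to compile, which is precisely the legibility win over stringly-typed APIs.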
Joining streams is also super easy. In fact, it's kind of absurd that there's such an easy way to express such a complicated operation.
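Concretely, an interval join pairs up elements from the two streams whose keys match and whose timestamps fall within a window relative to each other. The matching condition from the snippet below can be sketched in plain Java (exclusive bounds on both sides; the helper is mine, not a Flink API):

```java
public class IntervalJoinCondition {
    // Returns true when a left element and a right element would be joined:
    // same key, and rightTs strictly inside (leftTs - 2, leftTs + 2).
    static boolean joins(String leftKey, long leftTs, String rightKey, long rightTs) {
        boolean sameKey = leftKey.equals(rightKey);
        boolean inWindow = rightTs > leftTs - 2 && rightTs < leftTs + 2; // both bounds exclusive
        return sameKey && inWindow;
    }

    public static void main(String[] args) {
        System.out.println(joins("k", 100, "k", 101)); // true: inside the window
        System.out.println(joins("k", 100, "k", 102)); // false: upper bound is exclusive
        System.out.println(joins("a", 100, "b", 100)); // false: keys differ
    }
}
```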
// this will join the two streams so that
// key1 == key2 && leftTs - 2 < rightTs < leftTs + 2
keyedStream.intervalJoin(otherKeyedStream)
.between(Duration.ofMillis(-2), Duration.ofMillis(2)) // lower and upper bound
.upperBoundExclusive(true) // optional
.lowerBoundExclusive(true) // optional
.process(new ProcessJoinFunction<Event, Event, String>() {...});

What could be better:
Generally these are more nits than dealbreakers, but a quick list:
- Fine-Grained Resource Management remains hard to configure and grok, even though it's pretty important at scale. Tuning Flink apps that use it gets rough as they grow.
- The API is most definitely designed for streams and isn't great for batch use cases. This is kind of the mirror image of Spark, which is a batch API that is not well suited to streaming use cases.
- We're currently deploying Flink using the Kubernetes Operator, which makes deploying apps easy but doesn't implement blue-green deployments yet. This is on their roadmap, but in the meantime the gap means extended downtime when you're dealing with apps with thousands of Task Managers.