Netflix: Evolving Keystone to an Open Collaborative Real-time ETL Platform
Netflix is committed to providing joy to its subscribers and focuses constantly on improving the user experience and providing high-quality content. In recent years, Netflix has invested a great deal in its technology-driven studio and content production. In the process, it has encountered many unique and interesting challenges in the field of real-time data platforms. For example, in a microservices model, domain objects are distributed across different applications and their stateful stores, which makes low-latency, highly consistent real-time reporting and object search and discovery particularly challenging.
During the real-time data warehouse session of the Flink Forward Asia conference held in Beijing on November 28, 2019, Xu Zhenzhong, a senior software engineer from Netflix, shared some interesting cases, challenges, and solutions involving distributed systems. He also discussed his company’s achievements in development and in operations and maintenance (O&M), its vision of open self-service real-time data platforms, and new thoughts on basic real-time extract, transform, load (ETL) platforms.
This article briefly introduces the data platform team and its product, Keystone. Keystone helps teams track events, build agents, publish events, and collect event information across all microservices. It stores the information in different data stores, such as Hive or Elasticsearch, and helps users compute and analyze data in real time.
What Is Keystone?
For users, Keystone is a self-contained platform on which multiple users can easily declare and create pipelines through its user interfaces (UIs). As a platform, Keystone handles the hard problems of the underlying distributed systems, such as container orchestration and workflow management, and keeps them invisible to users. As a product, Keystone helps users move data from edge devices to data warehouses and compute on data in real time. In terms of data, Keystone is essential for Netflix. Any developer who deals with data uses Keystone, which means that thousands of Netflix employees use it. Keystone runs 100 Apache Kafka clusters that handle about 10 PB of data every day.
The architecture of Keystone is divided into two layers. At the lower layer, Apache Kafka and Apache Flink work as the underlying engines. This layer abstracts away the hard technical problems of the distributed systems so that they are invisible to users. Applications are created at the upper layer, where the service layer provides abstract services and UIs that are simple and easy to use. Users do not need to pay attention to how the lower layer is implemented.
Now, let’s look at the development of Keystone over the past four or five years. Netflix’s initial motivation was to collect data from all devices and store the data in data warehouses. At that time, the team used Apache Kafka because it allowed them to move data easily. This was essentially a high-concurrency data movement problem. Later, users put forward a new demand: to conveniently process data, for example by filtering it, while it is being moved. They also needed projection, and Keystone added features for both. After a time, users wanted more complex ETL operations, such as streaming joins. Therefore, the Keystone team decided to expose lower-level APIs to users while still abstracting the underlying distributed systems, so that users can focus on the logic at the upper layer.
This section describes the features developed for two Netflix superheroes, Elliot and Charlie. Elliot is a data scientist from the data science and engineering team. She searches for response patterns in massive data to improve the user experience. Charlie is an application developer from the studio team. He helps developers produce higher-quality products by developing a series of applications. Elliot’s data analysis results allow Netflix to provide better recommendations and customization to improve the user experience, while Charlie’s work helps developers raise their efficiency.
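The filtering and projection described above can be illustrated with a minimal Python sketch. This is not Keystone code: the `route` function, the event fields, and the sample events are hypothetical and only show what it means to filter and project records while they are being moved.

```python
def route(events, predicate, fields):
    """Yield a projected copy of every event that passes the predicate."""
    for event in events:
        if predicate(event):  # filtering: drop events the consumer does not want
            # projection: keep only the requested fields
            yield {name: event[name] for name in fields if name in event}


# Hypothetical device events, used only for illustration.
events = [
    {"type": "play", "device": "tv", "title_id": 1, "debug": "x"},
    {"type": "pause", "device": "phone", "title_id": 2, "debug": "y"},
    {"type": "play", "device": "phone", "title_id": 3, "debug": "z"},
]

plays = list(route(events, lambda e: e["type"] == "play", ["device", "title_id"]))
```

Because `route` is a generator, events are processed one at a time as they arrive, which is the same shape a pipeline stage between a source and a sink would have.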
Recommendation and Customization
As a data scientist, Elliot needs a simple and easy-to-use real-time ETL platform. She does not want to write complex code but needs low latency across the entire pipeline. Her work and related demands are as follows:
- Recommendation and customization: The same videos can be presented to users in different ways based on their characteristics. The videos can be divided into multiple rows, with each row representing a category, and users can modify the rows based on their personal preferences. In addition, each title is presented with artwork, and users in different countries and regions may have different artwork preferences. Algorithms are used to select and customize suitable artwork for each user.
- A/B testing: Netflix offers a 28-day free trial to new users. Users who find videos they enjoy are more likely to subscribe to the Netflix service. Therefore, an A/B test must be completed within 28 days. Errors may occur during A/B testing, so Elliot wants to discover problems before the test ends.
When watching a Netflix video on a device, a user interacts with the gateway through requests, and then the gateway distributes these requests to the backend microservices. For example, when the user taps the play, pause, forward, or rewind button on the device, different microservices process different requests, and related data needs to be collected for further processing. The Keystone platform team needs to collect and store data generated in different microservices. Elliot needs to integrate different data to solve her issues.
Stream processing is used for real-time reporting, real-time alerts, fast machine learning model training, and resource efficiency. Of these, fast model training and resource efficiency matter most. With batch processing, each daily run during a 28-day A/B test reprocesses the data from up to 27 previous days together with the new day’s data, which repeats much of the same work. Stream processing avoids this repeated work and improves overall resource efficiency.
Keystone provides command-line tools, so a user only needs to enter the relevant commands to get started. The tools ask the user some simple questions, for example, which repository is needed, and then generate a template based on the answers. The user can then start developing from the template. Keystone also provides a series of simple SDKs, including the Hive, Iceberg, Apache Kafka, and Elasticsearch SDKs. Iceberg is a major table format at Netflix and will replace Hive in the future. It provides many special features to help users with optimization. Keystone also provides simple APIs that help users directly generate sources and sinks.
After completing some work, Elliot can submit the code to the repository. A CI/CD pipeline then starts automatically at the backend and packs all source code and artifacts into Docker images, ensuring version consistency. Elliot only needs to select the version to be deployed on the UI and then click the deploy button to deploy the JAR to the production environment. Keystone solves the difficult problems of the underlying distributed system, such as how to orchestrate containers, at the backend. Containers are now orchestrated based on resources and will be deployed in the Kubernetes environment in the future. A JobManager cluster and a TaskManager cluster are deployed for each job package. Therefore, each job is completely independent from the perspective of users.
Keystone provides default configuration options and allows users to modify and override the configuration on the platform UIs without rewriting code. Elliot needs to read data from different topics during stream processing. When a problem occurs, she may need to read from Apache Kafka or from a data warehouse, so she needs to switch between sources without modifying the code. The platform UIs meet these needs easily. In addition, the platform helps users select the resources required to run jobs during deployment.
Many users depend on existing artifacts, such as schemas, when switching from batch processing to stream processing. The platform helps users integrate these artifacts easily.
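The resource argument can be made concrete with a small Python sketch that computes the same A/B metric in two ways: by updating counters once per event, as a streaming job would, and by rescanning the full history, as a daily batch job would. The metric (a conversion rate per test cell), the class and function names, and the sample events are hypothetical.

```python
from collections import defaultdict


class RunningConversion:
    """Keeps per-cell counters that are updated once per event (streaming style)."""

    def __init__(self):
        self.exposed = defaultdict(int)
        self.converted = defaultdict(int)

    def update(self, cell, did_convert):
        self.exposed[cell] += 1
        if did_convert:
            self.converted[cell] += 1

    def rate(self, cell):
        seen = self.exposed[cell]
        return self.converted[cell] / seen if seen else 0.0


def batch_rate(history, cell):
    """Recomputes the rate by rescanning the full history (batch style)."""
    outcomes = [did_convert for event_cell, did_convert in history if event_cell == cell]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical (cell, converted) events from an A/B test.
history = [("A", True), ("B", False), ("A", False), ("B", True), ("A", True)]

running = RunningConversion()
for cell, did_convert in history:
    running.update(cell, did_convert)
```

Both approaches return the same rate, but the running version touches each event once, while the batch version reads every earlier event again on each run.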
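One way to picture switching sources through configuration rather than code is the following Python sketch. The configuration keys, source names, and functions are assumptions made for illustration and are not Keystone’s actual options or APIs.

```python
# Hypothetical platform defaults for a job.
DEFAULTS = {
    "source.type": "kafka",
    "source.name": "playback_events",
    "parallelism": 4,
}


def effective_config(defaults, overrides):
    """Merge user overrides (for example, from a UI form) over the defaults."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown configuration keys: {sorted(unknown)}")
    return {**defaults, **overrides}


def read_from_kafka(name):
    return f"reading Kafka topic {name}"


def read_from_warehouse(name):
    return f"reading warehouse table {name}"


# The job code looks up the source by configuration and never names one directly.
SOURCES = {"kafka": read_from_kafka, "warehouse": read_from_warehouse}


def open_source(config):
    return SOURCES[config["source.type"]](config["source.name"])


live = open_source(effective_config(DEFAULTS, {}))
backfill = open_source(effective_config(
    DEFAULTS,
    {"source.type": "warehouse", "source.name": "playback_events_history"},
))
```

The job logic is identical in both runs. Only the overrides differ, which is what lets a user move between a topic and a warehouse table without touching code.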
Many users want to write ETL projects on the platform. The platform must be scalable as the number of users grows. Therefore, the platform adopts a series of patterns to solve this problem, including the extractor pattern, join pattern, and enrichment pattern.
Content production is a process of predicting costs of video production, developing a program, reaching a deal, producing videos, post-processing videos, releasing videos, and providing financial reports.
Charlie’s studio department develops applications to support content production. Each application is developed and deployed based on the microservices model, and each microservice application has its own responsibilities. For example, some microservice applications only manage movie titles, and others only manage deals and contracts. With so many microservice applications, Charlie needs to join data from different places when performing a real-time search, such as when searching for an actor. In addition, the volume of data grows every day, and different microservices may use different databases. These characteristics of distributed microservice systems make it difficult to keep data that is updated in real time consistent.
Three common solutions are used to solve the problem:
1) Dual writes: When developers know that the data needs to be stored in the primary database and in another database, they can write the data to the two databases separately. However, this operation is not fault-tolerant. If one of the writes fails, the data may become inconsistent.
2) Change data table: This solution requires a database that supports transactions. Every change made by a database operation is recorded in a separate change table as part of the same transaction. Then, you can query the change table to obtain the changes and synchronize them to other tables.
3) Distributed transaction: Distributed transactions are complex to implement in an environment with multiple datastores.
Charlie needs to copy all movies from the movie datastore to a movie search index backed by Elasticsearch. Data is pulled and replicated through a polling system, and data consistency is ensured by the change data table solution. The disadvantage of this solution is that it only supports periodic data pulling. In addition, the polling system is tightly coupled to the data source. Once the schema of the movie datastore changes, the polling system must be modified. Therefore, an event-driven mechanism is introduced to the microservices model to read all committed transactions in the database and pass them to the next job through stream processing. To make this solution broadly applicable, Change Data Capture (CDC) is implemented on the source side for different databases, including those commonly used at Netflix, such as MySQL, PostgreSQL, and Cassandra. The change events are processed through Keystone pipelines.
In this process, Netflix faced the following challenges and developed corresponding solutions:
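The change data table approach can be sketched with Python’s built-in `sqlite3` module. The table names, columns, and functions below are hypothetical. The point is only that the data row and its change record are written in one transaction, unlike dual writes, where the two writes can succeed or fail independently.

```python
import json
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE change_log ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, op TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def upsert_movie(conn, movie_id, title):
    # "with conn" commits both statements together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO movies (id, title) VALUES (?, ?)",
            (movie_id, title),
        )
        conn.execute(
            "INSERT INTO change_log (op, payload) VALUES (?, ?)",
            ("upsert", json.dumps({"id": movie_id, "title": title})),
        )


def read_changes(conn, after_seq=0):
    """Return changes newer than after_seq, in commit order, for a downstream sync."""
    rows = conn.execute(
        "SELECT seq, op, payload FROM change_log WHERE seq > ? ORDER BY seq",
        (after_seq,),
    )
    return [(seq, op, json.loads(payload)) for seq, op, payload in rows]


conn = open_store()
upsert_movie(conn, 1, "Title A")
upsert_movie(conn, 1, "Title B")
changes = read_changes(conn)
```

A downstream synchronizer would remember the last `seq` it processed and call `read_changes` with that value, so that each poll picks up only new changes.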
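As a rough illustration of the event-driven alternative, the Python sketch below applies change events to a search index as they arrive. A plain dictionary stands in for the Elasticsearch index, and the event shape (`op`, `key`, `after`) is an assumption, not the format Netflix uses.

```python
def apply_change(index, event):
    """Apply one change event to an in-memory stand-in for a search index."""
    op = event["op"]
    key = event["key"]
    if op in ("create", "update"):
        index[key] = event["after"]  # the event carries the new document state
    elif op == "delete":
        index.pop(key, None)  # deleting a missing key is a no-op
    else:
        raise ValueError(f"unknown operation: {op}")


# Hypothetical change events captured from a movie datastore.
change_events = [
    {"op": "create", "key": 1, "after": {"title": "Title A"}},
    {"op": "create", "key": 2, "after": {"title": "Title B"}},
    {"op": "update", "key": 1, "after": {"title": "Title A (Director's Cut)"}},
    {"op": "delete", "key": 2},
]

search_index = {}
for change in change_events:
    apply_change(search_index, change)
```

Because the sink reacts to events instead of querying the source tables, it does not need to know how the source database lays out its data, only the shape of the change events.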
- Ordering Semantics
For data change events, the event order must be preserved. For example, the create, update, and delete operations on a record must be delivered in that order, and the consumer must receive operation events that strictly follow it. One solution is to control the order through Apache Kafka. Another solution is to ensure that the order of the events captured in a distributed system is consistent with the order in which the data is read from the database. In this solution, once all change events are captured, data may be duplicated or out of order, so Flink deduplicates and re-sorts the data.
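The deduplicate-and-reorder step can be sketched in plain Python over a bounded batch of events. This is a simplification and not Flink code: a streaming job would typically keep this state per key and decide when it is safe to emit. The `key` and `seq` fields are assumed to identify a record and the position of a change in its history.

```python
def dedupe_and_order(events):
    """Drop repeated (key, seq) events and return the rest ordered per key."""
    seen = set()
    unique = []
    for event in events:
        identity = (event["key"], event["seq"])
        if identity not in seen:  # keep only the first copy of each change
            seen.add(identity)
            unique.append(event)
    # Order by key, then by the sequence number assigned at the source.
    return sorted(unique, key=lambda e: (e["key"], e["seq"]))


# Hypothetical captured events: duplicated and out of order.
captured = [
    {"key": "movie-1", "seq": 2, "op": "update"},
    {"key": "movie-1", "seq": 1, "op": "create"},
    {"key": "movie-1", "seq": 2, "op": "update"},
    {"key": "movie-1", "seq": 3, "op": "delete"},
]

ordered_ops = [e["op"] for e in dedupe_and_order(captured)]
```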
- Processing Contracts
In most cases, the schema details are unknown during stream processing. Therefore, a contract must be defined in the message, including the wire format and schema-related information defined at different levels, such as the infrastructure and platform levels. A processor contract aims to help users combine different processors based on their metadata and to minimize the need to write duplicate code. For example, if Charlie wants to be promptly notified of a new deal, the platform combines the relevant components, such as the DB connector and a filter. By defining contracts, the platform helps Charlie build on an open and composable stream data platform.
Most ETL projects are suitable for data engineers or data scientists. However, experience shows that the entire ETL process could be more widely used. The earliest version of Keystone was easy to use but less flexible. Although flexibility was improved later, complexity also increased. Therefore, the team plans to perform further optimization and launch an open, collaborative, composable, and configurable ETL platform to help users solve problems quickly.
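The idea of combining processors through a contract can be sketched in Python as follows. The `Processor` type, its `requires` and `provides` field sets, and the sample events are all hypothetical. They illustrate one possible shape of such a contract, not Keystone’s actual design.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Iterator


@dataclass(frozen=True)
class Processor:
    name: str
    requires: FrozenSet[str]  # fields this processor reads
    provides: FrozenSet[str]  # fields this processor adds to the stream
    apply: Callable[[Iterable[dict]], Iterator[dict]]


def compose(*processors):
    """Check that each processor's required fields are available, then chain them."""
    available = frozenset()
    for processor in processors:
        missing = processor.requires - available
        if missing:
            raise TypeError(f"{processor.name} is missing fields: {sorted(missing)}")
        available |= processor.provides

    def run(events=()):
        stream = iter(events)
        for processor in processors:
            stream = processor.apply(stream)
        return list(stream)

    return run


def emit_changes(_events):
    # Stand-in for a DB connector: emits fixed, made-up change events.
    yield {"table": "deals", "op": "create", "id": 7}
    yield {"table": "titles", "op": "update", "id": 3}
    yield {"table": "deals", "op": "update", "id": 7}


def keep_new_deals(events):
    return (e for e in events if e["table"] == "deals" and e["op"] == "create")


db_connector = Processor(
    "db_connector", frozenset(), frozenset({"table", "op", "id"}), emit_changes
)
new_deal_filter = Processor(
    "new_deal_filter", frozenset({"table", "op"}), frozenset(), keep_new_deals
)

notify_new_deals = compose(db_connector, new_deal_filter)
new_deals = notify_new_deals()
```

Because the contract is checked when the pipeline is composed, a filter placed before any source that provides its fields is rejected before any data flows.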