Hybrid Big Data Architecture in Practice: MaxCompute + Hadoop in a Hybrid Cloud Environment

During the session on enterprise-grade big data service at the 2019 Apsara Conference held in Hangzhou, Zhang Long, Senior Big Data Engineer from Douyu, delivered a speech titled “Hybrid Big Data Architecture in Practice: MaxCompute + Hadoop in a Hybrid Cloud Environment.” This article describes the development process of Douyu’s big data architecture from Apache Hadoop to Cloudera CDH. It outlines the problems and challenges encountered by Douyu during cloud migration, including data security, data synchronization, and migration tasks. It also concludes with the observation that the hybrid cloud model has improved resource efficiency and reduced resource costs for Douyu.

The following introduces highlights from Zhang Long’s lecture.

Development History of Douyu’s Big Data Architecture

In mid-2014, Douyu began to use big data, starting with a simple architecture built on HBase and Hadoop. In 2015, Douyu adopted CDH to operate and maintain its big data clusters, mainly to make O&M visual. In the second half of 2017, Douyu became more familiar with Alibaba Cloud’s big data products and compared them with competing offerings. In the end, we chose Alibaba Cloud’s MaxCompute.

In the early days, with a small user base and simple business scenarios, only a few components and clusters were required. We therefore enjoyed flexible operations, easy O&M, free open-source software, and rapid skill development among our engineers. However, as the business grew and our workforce expanded, we ran into obstacles: more components, higher O&M costs, complex cluster scaling, inconvenient physical machine operations, and stricter data and environment security requirements.

Why did Douyu choose Cloudera CDH? First of all, it could meet our business development needs, reduce multi-component O&M costs, and facilitate cluster scaling, data security, and environment security. Second, CDH is widely used by Chinese companies. The most important reason was that the Douyu staff included CDH professionals.

Cloudera CDH was very beneficial to Douyu. It supported a wide variety of components without requiring us to worry about compatibility between them. It enabled unified, web-based management through Cloudera Manager (CM), with a Chinese-language interface. In addition, it supported security management and Kerberos authentication.

On-premises clusters ran into development bottlenecks related to resource efficiency and costs. Resource efficiency challenges included slow approval of resource budgets, long machine purchase cycles, and low deployment efficiency in on-premises data centers. Resource cost challenges included high machine costs, costly and unstable on-premises data centers, and considerable idle resources during slow periods.

Challenges in Big Data Cloud Migration

The major challenge of cloud migration is to ensure data security. Data is the core asset of an enterprise, so its security is crucial. The second challenge is keeping data synchronized, because massive amounts of data sit both on-premises and off-premises. Finally, it is difficult to securely migrate a large body of existing business workloads from on-premises clusters to the cloud.

To address the risk of data loss, Alibaba backs up raw data, which is crucial. The chance of a core data breach is very small because the potential risks far outweigh any competitive advantage that could be gained that way. To secure access in a cloud environment that faces the public network, we added an IP whitelist for account access and an auditing feature, so that access is allowed only from inside the company.
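
The whitelist-plus-audit idea can be illustrated with a small sketch. This is not Douyu's or Alibaba Cloud's actual mechanism; the network ranges, the `check_access` function, and the in-memory `audit_log` are all hypothetical stand-ins chosen for illustration.

```python
import ipaddress
from datetime import datetime, timezone

# Hypothetical whitelist: private ranges standing in for the company's networks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.10.0/24"),
]

# Assumption: a plain list stands in for durable audit storage.
audit_log = []


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any whitelisted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


def check_access(account: str, client_ip: str) -> bool:
    """Apply the whitelist and record the decision for later auditing."""
    allowed = is_allowed(client_ip)
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "account": account,
        "ip": client_ip,
        "allowed": allowed,
    })
    return allowed
```

Every decision is logged, including denials, so that rejected attempts from outside the company remain visible during an audit.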

We hold petabytes of historical data, and terabytes of incremental data are generated every day. To sync data quickly and accurately, we use data synchronization tools, most of which are based on improvements to DataX. In addition, we improved our leased line capacity by adding multiple leased lines, enabling automatic failover and isolating this traffic from business traffic in the cloud. We use data verification tools to verify data synchronization tasks and data volumes.
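
One simple form of data volume verification is to compare per-partition row counts on both sides after a sync task finishes. The sketch below assumes the counts have already been collected from each environment; the function name, the partition naming, and the sample numbers are made up for illustration and do not come from the talk.

```python
from typing import Dict, List


def find_mismatched_partitions(
    source_counts: Dict[str, int],
    target_counts: Dict[str, int],
) -> List[str]:
    """Return partitions whose row counts differ or that exist on one side only."""
    mismatched = []
    for partition in sorted(set(source_counts) | set(target_counts)):
        # .get() returns None for a missing partition, which never equals a count.
        if source_counts.get(partition) != target_counts.get(partition):
            mismatched.append(partition)
    return mismatched


# Illustrative counts only.
on_prem = {"dt=20190901": 1200, "dt=20190902": 1350, "dt=20190903": 1280}
cloud = {"dt=20190901": 1200, "dt=20190902": 1349}
print(find_mismatched_partitions(on_prem, cloud))
# prints ['dt=20190902', 'dt=20190903']
```

Row counts catch missing or partially synced partitions cheaply; they would not catch corrupted values, for which a checksum or sampled comparison would be needed.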

Three requirements must be met for secure business migration: 1. No failure occurs and verification is performed to ensure migration feasibility. 2. Migration costs are reasonable and the impact on the business is minimized. 3. Data must be able to be moved to and from the cloud to ensure the consistency of on-premises and off-premises operations.

To prevent failures, three tasks must be accomplished. Business scenario tests are conducted to ensure that all business scenarios are covered and identify the business scenarios that can be migrated. Data quality checks are performed to ensure the consistency of on-premises and off-premises data for the same business. Data efficiency verification is performed to confirm the off-premises task data output duration without disrupting the business.
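
The data quality and data efficiency checks described above could be combined into one per-task gate, roughly as follows. This is a sketch under assumptions: `TaskRun`, `migration_problems`, and the idea of comparing named metrics are hypothetical, not the tooling Douyu actually used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRun:
    """Outcome of running one business task in one environment."""
    result: Dict[str, float]   # metric name -> value produced by the task
    duration_seconds: float


def migration_problems(
    on_prem: TaskRun,
    cloud: TaskRun,
    max_duration_seconds: float,
    rel_tolerance: float = 0.0,
) -> List[str]:
    """Return a list of problems; an empty list means the task passed both checks."""
    problems = []
    # Data quality: every metric must match between the two environments.
    for metric in sorted(set(on_prem.result) | set(cloud.result)):
        a = on_prem.result.get(metric)
        b = cloud.result.get(metric)
        if a is None or b is None:
            problems.append(f"{metric}: missing in one environment")
        elif abs(a - b) > rel_tolerance * max(abs(a), abs(b)):
            problems.append(f"{metric}: on-prem={a}, cloud={b}")
    # Data efficiency: the cloud run must finish within the allowed window.
    if cloud.duration_seconds > max_duration_seconds:
        problems.append(
            f"cloud run took {cloud.duration_seconds}s, "
            f"limit is {max_duration_seconds}s"
        )
    return problems
```

With the default `rel_tolerance` of 0.0, any difference in a metric is reported; a small positive tolerance may be appropriate for floating-point aggregates.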

Douyu runs two types of tasks in the on-premises data center. The first is Java tasks, which account for a small proportion of all tasks. These tasks perform queries and calculations through encapsulated HiveClient tools. The second is XML configuration tasks, which are defined in custom XML files and can export results to other storage after the HiveSQL statistics are computed. Douyu adapted each type according to its characteristics. For Java tasks, we encapsulated an OdpsClient: we replaced HiveClient with OdpsClient and pointed the Hive URL at the cloud environment. For XML configuration tasks, we added templates and modified URLs: we introduced the MaxCompute parameter model and pointed the Hive URL at the cloud environment.
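
To make the XML configuration change concrete, here is a minimal sketch of retargeting a task definition from Hive to the cloud environment. The talk does not describe Douyu's actual XML format, so the element names, attributes, table names, and URLs below are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical task file; the real XML schema is not given in the talk.
TASK_XML = """
<task name="daily_room_stats">
  <source engine="hive" url="jdbc:hive2://onprem-host:10000/default"/>
  <sql>SELECT room_id, COUNT(*) AS views FROM room_view GROUP BY room_id</sql>
  <export target="mysql" table="room_stats"/>
</task>
""".strip()


def retarget_to_cloud(task_xml: str, cloud_url: str) -> str:
    """Return a copy of the task definition that points at the cloud engine."""
    root = ET.fromstring(task_xml)
    source = root.find("source")
    if source is None:
        raise ValueError("task has no <source> element")
    # Only the engine and URL change; the SQL and export settings are kept.
    source.set("engine", "maxcompute")
    source.set("url", cloud_url)
    return ET.tostring(root, encoding="unicode")


# Placeholder endpoint, not a real MaxCompute address.
print(retarget_to_cloud(TASK_XML, "https://maxcompute.example.invalid/api"))
```

Keeping the SQL and export sections untouched is what lets the same task definition run in either environment, provided the SQL itself is compatible with both engines.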

To ensure a business can be migrated to and from the cloud, first, data must be able to be moved to and from the cloud. This involves the data synchronization center mentioned earlier. Second, a full kit of tools must be prepared to ensure the transparent use of the on-premises and off-premises environments. Third, common features are used wherever possible to cover most scenarios through user-defined functions (UDFs) in SQL.
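
One way to picture "transparent use" of the two environments is a single entry point that hides which side executes a query. The sketch below is an assumption about how such a tool might look, not a description of Douyu's toolkit; `QueryRouter` and the stub runners are hypothetical, and real runners would wrap the on-premises Hive client and the cloud MaxCompute client.

```python
from typing import Callable, Dict, List, Optional

# A runner takes SQL text and returns rows.
SqlRunner = Callable[[str], List[tuple]]


class QueryRouter:
    """Single entry point that hides which environment executes a query."""

    def __init__(self, runners: Dict[str, SqlRunner], default_env: str):
        if default_env not in runners:
            raise ValueError(f"unknown environment: {default_env}")
        self._runners = runners
        self._default_env = default_env

    def run(self, sql: str, env: Optional[str] = None) -> List[tuple]:
        """Run sql in the given environment, or in the default one."""
        target = env if env is not None else self._default_env
        if target not in self._runners:
            raise ValueError(f"unknown environment: {target}")
        return self._runners[target](sql)


# Stub runners that only echo where the query would have run.
router = QueryRouter(
    runners={
        "on_prem": lambda sql: [("on_prem", sql)],
        "cloud": lambda sql: [("cloud", sql)],
    },
    default_env="cloud",
)
print(router.run("SELECT 1"))                  # prints [('cloud', 'SELECT 1')]
print(router.run("SELECT 1", env="on_prem"))   # prints [('on_prem', 'SELECT 1')]
```

Because callers never name a concrete client, moving a task back on-premises is a configuration change instead of a code change.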

Changes Introduced by the Hybrid Cloud Model

The hybrid cloud model primarily brought improvements in two problem areas: low resource efficiency, which had encumbered business development, and high resource costs, which had placed financial pressure on the enterprise. In terms of resource efficiency, the transition from on-premises clusters to MaxCompute brought about the following changes. In the past, we had to make budget proposals six months or a year in advance, but now we are billed on a pay-as-you-go basis. It used to take one to three months to purchase resources, but now resources are available on demand with virtually no limit. It used to take a week or more to set up and configure physical machines in the on-premises data center, but now we no longer carry that burden.

In terms of cost, compared with on-premises clusters, MaxCompute reduces costs by about RMB 10 million per year while ensuring zero cluster failures. There are additional benefits too, including Alibaba Cloud’s professional services. When we encounter technical problems, we can call on Alibaba experts to help solve them. In addition, because computing resources are quantified, we can easily see how our money has been spent. We can also talk to Alibaba experts for assistance with business challenges.

While operating its on-premises data centers, Douyu had to build its own data development tooling. The following figure shows data development, including Hue-based query computing on-premises and DataStudio-based data development in the cloud. The Hue API and DataStudio API are then integrated to form Douyu’s open big data platform, which can be used by professionals from the data department and analysts from business departments.

Image for post

In practice, Douyu has built what we call a multi-active data center, as shown in the following figure. By giving on-premises data and Alibaba Cloud data clearly defined roles across the two data centers, the multi-active architecture allows Douyu to support greater business volumes.

Image for post

In summary, resource costs and resource efficiency are the two biggest improvements introduced by the hybrid cloud system. In addition, we benefit from the quantifiable costs, value-added services, and additional professional services. These advantages are open to both our own staff and the staff of external business departments, who can directly see their usage costs. This is all I want to share today. Thank you.
