By Qiyang Duan, Data Scientist at Alibaba Cloud
Over the last ten years, we have witnessed a dramatic technology shift from traditional databases to diversified, single-purpose, massive-scale big data platforms. In this wave, Hadoop technologies (including MapReduce, HDFS, Spark, and Hive) quickly spread to almost every enterprise data center, but then gave way to managed Hadoop services on the cloud. Database technologies have also been reinvented to meet big data demands.
Alibaba built advanced database technologies during its massive “de-IOE” movement, a decade-long evolution of its databases. The internet giants in the USA went through a very similar process to upgrade their stacks for ever-growing big data requirements.
One of the most mysterious outputs of Alibaba’s big data journey is its Data Middle Office. The concept is rather unique: it lives only in the Chinese technology world, yet it is as popular there as de-IOE. Another concept equally unique to China is Mini-Programs in the mobile domain.
In September 2019, Alibaba announced its middle office strategy at its investor day meeting as three middle offices: Business, Data, and AI. Here I focus only on the data middle office.
The middle office originated in the pursuit of business agility, much like the middle office concept [3,4] in investment banks. In Alibaba Group, the term middle office can refer to two different things:
- Inside Alibaba Group, thousands of data engineers and scientists support the daily data operations of its core e-commerce and other internet businesses. They have developed tools to solve their own problems. This organization is called the middle office.
- Alibaba Cloud packages those internal tools and sells them on the cloud. Here the solution is named “Data Middle Office”. You may also hear the terms “Data Mid End” or “Data Middle End”. They all refer to the same cloud solution.
So, what does middle office mean to the big data world? It includes two parts:
- First of all, it should be a comprehensive and powerful big data platform on the cloud to enable all the essential data analytics and machine learning functions.
- Secondly, it is a DataOps framework to assemble all those independent tools into one integrated environment.
Big Data on the Cloud
People have been dreaming about a single source of truth for a long time. However, it was never really accomplished in any large organization, due to both technical difficulties and the natural political struggle inside the organization.
Many companies can build a complete big data system when they have enough budget, and the data may even be collected into a single place. However, simply because the system is on-premises, the agility needed to keep up with the variety and velocity of big data becomes a luxury. Common problems include how to get new APIs, new libraries, and new server capacity. Life is short; let’s not waste time on those basic, tedious things.
Alibaba offers mainstream big data tools to support all big data usage scenarios. You can use those tools to build up your own solution:
- Large-Scale Batch Processing: Alibaba has MaxCompute for petabyte-scale data warehousing, AnalyticDB for online queries that demand an MPP database, EMR for the Hadoop ecosystem, etc.
- Real-Time Processing: Alibaba acquired data Artisans, the company founded by the creators of Apache Flink, and offers Flink as a managed service.
- Machine Learning: PAI Platform. Personally, I am more in favor of the notebook service DSW.
- Data Visualization: QuickBI, DataV.
Leveraging recent hardware advances (RDMA, SSD, etc.), Alibaba has built those products on a state-of-the-art architecture. As a result, Alibaba AnalyticDB recently set a new record on the TPC-DS benchmark, beating the previous record held by Alibaba’s own E-MapReduce (EMR).
In artificial intelligence, Alibaba also set a new record on the DAWN Deep Learning Benchmark (DAWNBench), running on its own cloud. Apart from this, PAI offers a full set of tools covering a drag-and-drop GUI, notebooks, traditional algorithms (RF, SVM, etc.), and deep learning frameworks.
Flink and DataV are more specialized, each addressing its own specific problem: real-time processing and large-screen dashboards (like the one shown during 11.11), respectively.
Although those technologies deliver better performance than the offerings of other cloud vendors, in terms of functionality you can still find counterparts at most of them. The overall technology stack does not yet look very different from the rest. So what is the special ingredient that differentiates the Data Middle Office from traditional names like Data Warehouse, Data Lake, and Big Data Platform? The answer is data governance and DataOps.
Data Governance and DataOps Framework
Data governance is a broad topic, including data quality, security, lineage, etc. You can find a list of vendors in Gartner’s 2019 Magic Quadrant. Data governance tools normally come from third parties rather than from data platform vendors, whether traditional ones like Teradata and Oracle or, more recently, the cloud vendors.
When you have your big data development team working in one environment and data governance team in another, it simply won’t work. Under this governance setting, people tend to believe their systems look like this:
In fact, under the hood, it often looks like this:
Do you see similarities between those cables and your complex data relationships?
Alibaba’s answer to this problem is a fully integrated DataOps environment: DataWorks. It offers native tools to deal with typical problems in developing and operating an enterprise big data platform. To name a few:
- How do you manage data assets so that everyone can see exactly what they should see?
- How do you orchestrate thousands, if not millions, of daily ETL jobs? And more importantly, who is responsible for which part when something goes wrong?
- How do you ensure good data quality?
DataWorks was born out of Alibaba’s own daily big data development and operations, and it is used by the “Middle Office” organizations inside Alibaba Group. On the cloud, DataWorks enables DataOps by integrating with the different big data engines on Alibaba Cloud, including MaxCompute, EMR, etc.
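The orchestration, ownership, and data quality questions listed above can be made concrete with a small sketch. This is not the DataWorks API. It is a plain Python illustration under stated assumptions: the job names, the team names, and the helpers `run_order`, `impacted_by`, and `null_rate` are all hypothetical. It shows the kind of bookkeeping an integrated DataOps tool has to do: order jobs by dependency, trace who is affected when a job fails, and compute a basic quality metric.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical daily ETL jobs: name -> (owner, upstream dependencies).
JOBS = {
    "ingest_orders":   ("team-ingest", []),
    "ingest_users":    ("team-ingest", []),
    "clean_orders":    ("team-dw", ["ingest_orders"]),
    "user_order_fact": ("team-dw", ["clean_orders", "ingest_users"]),
    "sales_dashboard": ("team-bi", ["user_order_fact"]),
}


def run_order(jobs):
    """Return one valid execution order where every job follows its upstreams."""
    graph = {name: set(deps) for name, (_, deps) in jobs.items()}
    return list(TopologicalSorter(graph).static_order())


def impacted_by(jobs, failed):
    """Return {job: owner} for the failed job and everything downstream of it."""
    impacted = {failed}
    for name in run_order(jobs):  # upstream jobs are always visited first
        _, deps = jobs[name]
        if impacted.intersection(deps):
            impacted.add(name)
    return {name: jobs[name][0] for name in impacted}


def null_rate(rows, column):
    """A basic data quality metric: the fraction of rows missing a column value."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


if __name__ == "__main__":
    print(run_order(JOBS))
    print(impacted_by(JOBS, "clean_orders"))
    print(null_rate([{"id": 1}, {"id": None}], "id"))
```

When lineage, ownership, and quality checks live in the same place as the job definitions, a failure in `clean_orders` immediately tells you which downstream jobs to hold back and which teams to notify. That is the practical argument for an integrated environment over separate development and governance tools.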
With the DataOps framework and a full spectrum of big data tools, Alibaba Cloud can help customers build a big data platform that supports business innovation through an agile development process. By adopting this framework, you also inherit Alibaba’s internal development best practices and avoid many of the pitfalls Alibaba encountered along its big data journey.