By Xu Ri (Huanbo), head of the Alibaba Cloud DataWorks modeling engine team, and Li Qiping (Shouyi), Freshippo’s data R&D leader
Released by DataWorks Team
In this article, we discuss how new retail enterprises can build a data mid-end based on Alibaba Cloud DataWorks. We also introduce best practices for the data mid-end, including business model selection, architecture design, and product selection. Finally, we introduce how to use the data mid-end to feed value back into the business and to support AI-assisted decision-making.
New Retail Business Model
It is very important for a new retail enterprise to understand its business before establishing a data mid-end because, in my opinion, data and business are closely related. A colleague once told me that it was difficult to build a data mid-end. That is why, when building a data mid-end, we must first have a deep understanding of the business.
New retail enterprises have a variety of business forms, such as online e-commerce platforms, offline stores, official apps, distribution channels, and supply chains. Collecting the data of all channels from the very beginning, unifying it, and calling the result a data mid-end is not the right starting point. What we need to know first is the business model of the whole enterprise. Based on the business model, we define the business forms to be built. Only then do we begin to plan the construction of the new retail data mid-end for the enterprise.
For example, many new retail enterprises used to focus on offline stores, but now they need online apps or an e-commerce business. The problem is that their online inventory is not synchronized with the offline inventory, or the products sold online differ from those sold offline. This means that they are actually traditional retail businesses that have simply opened another online business while maintaining a traditional business model. An effective data mid-end first needs to break the original business model of an enterprise and design a business form that truly integrates the online and offline aspects of the business. Therefore, we often say that the data mid-end is a top-level project of an enterprise.
After the business model has been determined, there are many other factors for new retail enterprises to consider. For example, a fresh food business may need to deliver food across a wide area within minutes. In this case, enterprises with offline stores direct offline traffic to their online stores and at the same time use the offline stores as warehouses for the online stores. Some enterprises also allow customers to pick up goods at offline stores after making an online purchase. With these challenges in mind, let us now talk about how a data mid-end can support these businesses and complete a closed loop of the entire business model through data integration.
Product Technical Architecture Design for New Retail Enterprises
After the business model is determined, we need to decide on the product technical architecture design. At this stage, many retail enterprises feel confused because they can find ready-made software systems from traditional software vendors, such as Enterprise Resource Planning (ERP) and Warehouse Management System (WMS) products. So, do enterprises only need to buy a software system?
Some of the traditional ERP software or logistics software has also been digitized. However, the important difference is that the digitization of the data mid-end is not only for the purpose of simply digitizing and structuring the data, but for the purpose of providing strong support for the upper policy layer so that the data mid-end can intelligently support traffic, logistics fulfillment, process optimization, and financial policy.
Let us discuss the case of large retailers and supermarkets with offline stores. These enterprises also run online apps, but their online and offline inventory is isolated. If they have a total of 100 units of a product, they allocate them in advance: only 10 units are sold online, and once those are gone, nothing more is available online. With a data mid-end, all 100 units are available online and offline at the same time. Meanwhile, retailers can use algorithms for inventory warnings, discounts, cross-selling, supply chain adjustments, and so on. Compared with roughly dividing goods into two pools, this strategy lets the data mid-end integrate online and offline data and goods, and it also reshapes some business forms. Therefore, the data mid-end does more than simply structure data.
If an enterprise has sufficient technical capabilities, we suggest that it develop all core business systems itself, because new retail enterprises need to fully digitize many traditional businesses, including transactions, stores, warehousing, transportation, distribution, procurement, supply chain, labor, and so on. If the enterprise purchases systems externally, it must make sure that, based on the business model, the systems form a closed loop across transactions, stores, warehousing, freight, procurement, supply chain, labor, and so on. You need to make sure that apps, stores, and e-commerce channels do not run on different systems. If the systems are different, the barriers between the data are already very high by the time you build the data mid-end.
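The difference between pre-allocated and shared inventory can be illustrated with a minimal Python sketch. The class names, the channel names, and the low-stock threshold below are assumptions made for illustration only; they are not part of any Alibaba Cloud product.

```python
class SplitInventory:
    """Stock is allocated to each channel in advance (the traditional model)."""

    def __init__(self, online: int, offline: int):
        self.stock = {"online": online, "offline": offline}

    def sell(self, channel: str, qty: int) -> bool:
        # A channel can only sell from its own pre-allocated pool.
        if qty > self.stock[channel]:
            return False
        self.stock[channel] -= qty
        return True


class SharedInventory:
    """One pool serves every channel (the integrated model)."""

    def __init__(self, total: int, warn_below: int = 10):
        self.total = total
        self.warn_below = warn_below  # hypothetical warning threshold
        self.sold_by_channel = {}

    def sell(self, channel: str, qty: int) -> bool:
        if qty > self.total:
            return False
        self.total -= qty
        self.sold_by_channel[channel] = self.sold_by_channel.get(channel, 0) + qty
        return True

    def low_stock(self) -> bool:
        # A simple stand-in for an algorithmic inventory warning.
        return self.total < self.warn_below


split = SplitInventory(online=10, offline=90)
shared = SharedInventory(total=100)
```

With the split model, the eleventh online order is rejected even though 90 units sit in stores. With the shared model, the same order succeeds, and the sales-by-channel record remains available for later analysis.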
An important part of the closed loop is the data layer on the right. In addition to the design of the business system, if there is no unified data platform construction, it is difficult to support the whole enterprise project. This is also the part that will be highlighted today.
In our view, data mid-end is not only a solution, but also a function of a team. An independent data mid-end team is required to support businesses. Data, like commodities, members, and devices, is a very important asset for enterprises. Data mid-end teams serve as asset builders, managers, and operators. Through these assets, the whole retail supply chain can be driven to upgrade intelligently. By collecting, managing, and constructing data, data can be better applied to businesses.
The preceding figure shows the overall architecture of a generic data mid-end, which has both particularity and generality.
First, I would like to talk about generality. The whole infrastructure is built on Alibaba Cloud. The combination of DataWorks and MaxCompute on Alibaba Cloud has supported the construction of the data mid-end of Alibaba Group for 11 years. Among the data layers, the source data layer basically comes from the business systems, and the access layer is relatively complex. Many enterprises now talk about omni-channel coverage, including apps and offline business. Some enterprises even have their own delivery staff and electric vehicles, Internet of Things (IoT) device data in stores, and human resources data. Therefore, there is a lot of structured and unstructured data. The data processing layer processes this data to form a critical data asset layer.
After the data asset layer is built, the data has a certain business meaning and can be used directly by the business. However, on top of the data asset layer, we set up a data service layer to make the data more convenient to use, out of the box. The service layer may be invisible. The business side hopes that users can use data directly instead of looking it up in many tables. Therefore, at the data application layer above the data service layer, the data mid-end team can build many data products and deliver data to the business side in product form. There are also many product forms on different ends, including PC, DingTalk, Palm, and many small IoT devices. Data may even be shown on a small black-and-white screen. The data mid-end has a management system on the rightmost part, which allows the operations and O&M of the entire enterprise to be carried out effectively. This architecture diagram is what we understand as a business-oriented hierarchical architecture of the data mid-end.
Based on the business-oriented hierarchical architecture, we need to continue by designing a technical architecture for the data mid-end. If you have worked on big data, you may often have encountered this problem when collecting data: what should we do if offline and real-time computing are needed at the same time?
We recommend Alibaba Cloud MaxCompute for offline computing. Almost all of Alibaba’s offline data is stored in MaxCompute, which processed data at the 1.7 EB level every day during Double 11 in 2020. For real-time computing, we recommend Realtime Compute for Apache Flink, which processed a peak of 4 billion messages per second. Its computing performance is also very powerful.
In addition to computing, data storage is also required. For example, after the data in Flink is collected and processed, it can be stored in the Hologres shared cluster of MaxCompute to build our real-time data warehouse. Hologres of MaxCompute supports writes of 596 million data records at peak time and can respond to queries over petabytes of data in less than a second. These stores become data services. Data services contain metric details, as well as features and tags.
The data can be extended to the most commonly used devices, operation platforms, DingTalk Suit, and intelligent management. These belong more to the runtime layer. At the data mart operation layer, there are also metadata, data quality, disaster recovery and control, and data governance. This technical architecture diagram is more like a technical requirement architecture diagram. It shows what the technical team of a new retail enterprise needs to do when developing the data mid-end.
Data Mid-End Solution for New Retail Based on DataWorks
After analyzing the business model of an enterprise, the technical architecture of its business products, and the technical requirements of the data mid-end, we need to conduct a technical selection and technical survey of the data mid-end. We need to select products and systems to support the entire technical architecture of new retail enterprises.
As mentioned earlier, we recommend that enterprises develop their own business systems, but the entire data mid-end technology may not be self-developed, because Alibaba Cloud provides a mature product system that allows new retailers to build their own data mid-end. As mentioned earlier, we can select a big data computing engine. We can select MaxCompute for offline data warehouse and select Flink and Hologres for real-time data warehouses. These three products can also be seamlessly combined to build a complete set of real-time and offline integrated data warehouses.
DataWorks can be selected as a data development and governance tool for building a data mid-end. DataWorks serves almost all business departments of Alibaba Group. Every day, DataWorks is used by tens of thousands of operations personnel, product managers, data engineers, algorithm engineers, and R&D engineers in the group. DataWorks also serves a large number of Alibaba Cloud users. The following figure shows the overall architecture of DataWorks:
Data integration is the first step in building a data mid-end. DataWorks provides external data integration capabilities. It offers batch, incremental, real-time, and whole-database data integration, and can support the multiple and complex data sources of enterprises. Currently, DataWorks supports over 50 types of data sources for offline synchronization and over 10 types of data sources for real-time synchronization. Regardless of whether the data sources are on the public network, in an Internet Data Center (IDC), in a Virtual Private Cloud (VPC), or elsewhere, DataWorks can integrate data in a secure, stable, flexible, and fast manner.
DataWorks also provides a centralized metadata management service to support unified task scheduling and a wide range of one-stop data development tools, covering the entire lifecycle of data development. It can greatly improve the data development efficiency of enterprises. The upper layer also includes data governance, data services, and it provides an important open platform.
For most enterprises, their business systems may be self-developed or purchased products. DataWorks APIs allow secondary processing of many functions and integration of various self-developed and project systems. For example, alert notifications can be sent to the monitoring and alerting systems of enterprises. DataWorks provides more than 100 APIs to make it easy for enterprises to meet these requirements.
When we compare this data mid-end technical requirement diagram with DataWorks, the data collection part corresponds to the data integration provided by DataWorks. DataWorks can meet the data synchronization requirements shown on the left. At the data development layer, DataWorks can complete offline, online, and real-time data development for enterprises through its DataStudio, HoloStudio, and StreamStudio.
DataWorks also provides data services and open interface capabilities. It can integrate with existing enterprise systems and products through APIs. Crucially, DataWorks provides data map and data governance capabilities, which seem to be edge functions. However, these functions are very essential in building the data mid-end for the entire enterprise, which will be introduced later.
The preceding sections can be seen as the preparation process for the data mid-end. After understanding the business of the enterprise, designing the product system, and making a technical selection, we need to determine the objectives of enterprise data mid-end construction. Objectives are not Key Performance Indicators (KPIs); they may instead be the mission or the original intention of the enterprise. The objective of data mid-end construction is to establish a middle layer that has rich data, covers the full link, is multi-dimensional, and is of reliable quality (that is, the standards and results should be accurate). It should also be stable, timely, and fault-free. Many people will say that this is a data mart, but that does not matter, because it is a middle layer.
The data mid-end needs to provide reliable data services, data products, and business applications for the upper-layer businesses. This means that it is not a simple data warehouse or a simple data mart, but a data mid-end that businesses can use continuously. If an enterprise only synchronizes data to MaxCompute, open-source Hadoop, or a database, the result is only a data warehouse. The data mid-end as we define it can be used directly by businesses and is even designed to bring them business value.
After defining such an objective, we need to break it down step by step. When some business teams raise requirements, they only tell the data team that they want sales volume data. However, this sales volume still has constraints. What is the time range? Does it include refunds? Is it restricted to certain regions? Therefore, if you want to develop a data mid-end, you first need to design a metric system, and this metric system should be productized by the mid-end team. Second, because the business does not work with the raw fields of a table, a data model design is needed to make the data more standard. Third, based on the designed model, we need to develop data processing tasks. Finally, we need to share the data through data services, which are not limited to tables, APIs, and reports, and can even be a product or anything else.
The above is a hierarchical graph of the data model or data mart that you often see on the Internet: Operational Data Store (ODS), Data Warehouse Detail (DWD), Data Warehouse Service (DWS), and Application Data Store (ADS). Although these concepts are common, everyone has a different understanding of the layers. We need a very strict and clear definition of them. Each layer should have its own characteristics and responsibilities. In our view, they can be briefly introduced as follows:
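The point about an underspecified "sales volume" can be made concrete with a small Python sketch of a metric definition that forces the time range, refund handling, and region scope to be stated explicitly. The `Order` and `SalesMetric` types and their field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Order:
    order_date: date
    region: str
    amount: float
    refunded: bool


@dataclass(frozen=True)
class SalesMetric:
    """A sales metric whose constraints are part of its definition."""
    name: str
    start: date                           # inclusive
    end: date                             # inclusive
    include_refunds: bool
    regions: Optional[frozenset] = None   # None means all regions

    def compute(self, orders: Iterable[Order]) -> float:
        total = 0.0
        for order in orders:
            if not (self.start <= order.order_date <= self.end):
                continue
            if order.refunded and not self.include_refunds:
                continue
            if self.regions is not None and order.region not in self.regions:
                continue
            total += order.amount
        return total


orders = [
    Order(date(2021, 1, 5), "Shanghai", 100.0, refunded=False),
    Order(date(2021, 1, 9), "Shanghai", 50.0, refunded=True),
    Order(date(2021, 1, 12), "Beijing", 70.0, refunded=False),
    Order(date(2021, 2, 1), "Shanghai", 30.0, refunded=False),
]

net_sales_sh_jan = SalesMetric(
    name="net_sales_shanghai_2021_01",
    start=date(2021, 1, 1),
    end=date(2021, 1, 31),
    include_refunds=False,
    regions=frozenset({"Shanghai"}),
)
gross_sales_jan = SalesMetric(
    name="gross_sales_2021_01",
    start=date(2021, 1, 1),
    end=date(2021, 1, 31),
    include_refunds=True,
)
```

Both metrics would be called "sales volume" in casual conversation, yet they return different numbers over the same orders. Giving each variant its own name and definition is one way to make a metric system unambiguous.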
ADS is certainly business-oriented rather than development-oriented. This data can be understood or even directly used by businesses in the shortest possible time.
DWS must consist of metrics, and it is the carrier of the metric system mentioned earlier. The data in DWS basically exists to support ADS.
DWD is the data warehouse detail layer. How is the DWD layer created? We recommend the dimensional modeling method. An enterprise has dimension tables and fact tables. Dimension tables come in many forms, such as hierarchical dimensions and enumeration dimensions, and fact tables include periodic snapshots. One point to stress here is that every DWD field must be directly understandable and unambiguous. Once there is ambiguity, problems appear when DWS uses the data, and these problems propagate to all upper-layer applications.
Basically, we all agree that the business data in ODS should be synchronized directly. However, some architectures have evolved, and some teams like to perform preliminary Extract-Transform-Load (ETL) processing in ODS, which results in data inconsistency between ODS and the enterprise business data. We recommend not doing this, for a simple reason: ODS must be consistent with the business database so that when a problem occurs, we can quickly locate its cause. Once ETL is introduced, a bug in the ETL process may lead to inconsistent data between the two sides. Therefore, if an enterprise strictly requires that no logical processing be performed from the business database to ODS, a problem can only be caused by the middleware or by storage, and not by business logic.
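If ODS is a straight copy of the business database, consistency between the two can be checked mechanically. The following Python sketch compares per-row fingerprints on both sides. It assumes rows are available as dictionaries with a primary key named `id`; both the function names and that key name are illustrative assumptions rather than a DataWorks feature.

```python
import hashlib


def row_fingerprint(row: dict) -> str:
    """Hash a row in a column-order-independent way."""
    canonical = "|".join(f"{key}={row[key]!r}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_source_vs_ods(source_rows, ods_rows, key="id"):
    """Report rows that are missing, extra, or changed in ODS."""
    src = {row[key]: row_fingerprint(row) for row in source_rows}
    ods = {row[key]: row_fingerprint(row) for row in ods_rows}
    return {
        "missing_in_ods": sorted(src.keys() - ods.keys()),
        "extra_in_ods": sorted(ods.keys() - src.keys()),
        "changed": sorted(k for k in src.keys() & ods.keys() if src[k] != ods[k]),
    }


source = [{"id": 1, "status": 0}, {"id": 2, "status": 2}, {"id": 3, "status": 0}]
# Here ODS has been "helpfully" transformed for id 2 and has lost id 3.
ods = [{"id": 1, "status": 0}, {"id": 2, "status": "paid"}]

report = diff_source_vs_ods(source, ods)
```

Any ETL applied before ODS, such as converting status code 2 into a label, shows up as a changed row in this comparison, which is exactly why such a check stops being useful once logic is added at this layer.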
Build a Data Mid-End for New Retail Using DataWorks
Some ideas, designs, architectures, objectives, and requirements for building a data mid-end were described earlier. Next, I will talk about how to use the DataWorks platform to build a data mid-end. The DataWorks platform not only serves customers on Alibaba Cloud, but has also served almost all business departments of Alibaba Group since 2009. Therefore, most of its product design tends to be open, universal, and flexible. As a result, a series of problems can occur when using DataWorks precisely because it is so flexible and imposes no standards. The following content shares some of our experience with you.
Data synchronization is the first step in data mid-end construction. If data cannot be stored in the data warehouse, a data mid-end cannot be built. When we do data synchronization, we have several requirements. For example, all the business data of the enterprise is uniformly synchronized to one project, and only one copy is synchronized. Repeated synchronization is not allowed and this facilitates management, reduces costs, and ensures no ambiguity in data. When a data source error occurs, the subsequent data is faulty. Therefore, the data mid-end must ensure that the data source is 100% correct.
For data backtracking and auditing, the data lifecycle is set to permanent storage. Even if the business system archives or deletes some data because of load on certain online databases, the historical data can be restored through the ODS layer when it is needed again.
The second part is data development, which tests personal abilities. Basically, everyone uses Structured Query Language (SQL). What we have learned in data development is that the data processing is the implementation of the business logic. We must ensure that the business logic is correct, and that the data output is stable, timely, and reasonable. In addition to providing good coding capabilities, the DataWorks data development editor also provides some visual processing methods to help enterprises review some code and even partially verify it. This function is very helpful in our daily use.
Every Java developer knows that each kind of programming has its own paradigm. Data development is no different, and it involves several abstract steps.
We have code conversion in development. For example, we convert enumeration to something that we can understand. What is 0? What is 2? Or what is a? There is also a format conversion. Enterprises have some business systems, which are difficult to standardize. For example, some use timestamp, some store strings, and some store yymm. Although all these represent time, they have different formats. In the process of building a data mart, it requires that the data format in the data mart must be consistent. We will change the non-standard data format into a standard format through format conversion.
The second is business judgment, which basically obtains a business result through conditions. For example, young people will definitely not have a field or business logic called “young people” in the business system. If there is age data, when sorting out, we can judge that people under the age of 30 are called young people. This is what we call business judgment.
The third is the data connection, which is basically very simple, that is, a table join to fill the data.
The fourth is the data aggregation. Enterprises frequently use data aggregation in DWS.
The fifth is the data filtering. We often encounter some invalid data. We process these invalid data by using data filtering.
The sixth is condition selection, which is also expressed with CASE WHEN-style logic and is slightly similar to data filtering.
The last is business parsing. It is very commonly needed because NoSQL databases and ApsaraDB for MySQL now support large semi-structured fields, and some business teams even use MongoDB. Many business attributes are packed into such a large field. In recent years, when we build the DWD layer of a data mart, we parse all JSON fields or map fields into fixed columns, because, as mentioned earlier, the content must be unambiguous so that users can read it directly. I would like to share one lesson here: close off business logic at the data detail layer as much as possible to ensure data consistency and simplify downstream use. When the source changes, we can absorb the change through code conversion or format conversion to keep the DWD layer structure stable and avoid pushing more changes downstream. A good model also requires collaboration with the upstream business system. First, the business system must have a reasonable design. Second, changes to it must be perceived in a timely manner. Therefore, the construction of the data mid-end is not a matter for the data team alone. The data team and the business team need to build a common data mid-end together.
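The processing steps listed above can be sketched in a few lines of Python. In practice this logic would usually be written in SQL; Python is used here only to keep the example self-contained. The status codes, the store lookup table, the raw field names, and the three accepted time formats are all hypothetical.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

ORDER_STATUS = {0: "created", 2: "paid"}   # hypothetical enumeration
STORE_NAMES = {"s1": "Store A"}            # hypothetical dimension table


def normalize_time(value) -> str:
    """Format conversion: epoch seconds, 'YYYY-MM-DD', or 'yymm' to 'YYYY-MM-DD'."""
    if isinstance(value, int):
        return datetime.fromtimestamp(value, tz=timezone.utc).date().isoformat()
    if len(value) == 4 and value.isdigit():
        # 'yymm' carries no day, so it maps to the first day of the month.
        return datetime.strptime(value, "%y%m").date().isoformat()
    return datetime.strptime(value, "%Y-%m-%d").date().isoformat()


def to_dwd(raw: dict) -> dict:
    attrs = json.loads(raw["attrs"])       # business parsing of a JSON field
    store_id = attrs.get("store_id")
    return {
        "order_id": raw["id"],
        "order_date": normalize_time(raw["created"]),
        "status": ORDER_STATUS.get(raw["status"], "unknown"),    # code conversion
        "age_group": "young" if raw["age"] < 30 else "other",    # business judgment
        "store_id": store_id,
        "store_name": STORE_NAMES.get(store_id, "unknown"),      # data connection
        "amount": float(attrs.get("amount", 0)),
    }


def sales_by_store(dwd_rows) -> dict:
    totals = defaultdict(float)
    for row in dwd_rows:
        if row["status"] != "paid":        # data filtering
            continue
        totals[row["store_id"]] += row["amount"]   # data aggregation
    return dict(totals)


raw_rows = [
    {"id": 1, "created": 1609459200, "status": 2, "age": 25,
     "attrs": '{"store_id": "s1", "amount": 30}'},
    {"id": 2, "created": "2021-01-02", "status": 0, "age": 40,
     "attrs": '{"store_id": "s1", "amount": 99}'},
    {"id": 3, "created": "2101", "status": 2, "age": 31,
     "attrs": '{"store_id": "s1", "amount": 12.5}'},
]
dwd_rows = [to_dwd(row) for row in raw_rows]
```

Because every conversion happens in `to_dwd`, a change in the source encoding only touches that one function, and the aggregate in `sales_by_store` keeps reading the same fixed columns.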
The parts above mostly concern the development stage. If DataWorks covered only these parts, we would regard it as an Integrated Development Environment (IDE). However, DataWorks is a one-stop big data development and governance platform, and its core purpose is to keep the platform running. How can we ensure that the enterprise’s data development code runs? Through task scheduling in DataWorks.
The new retail business of an enterprise is very complicated. Fresh food can be delivered in 30 minutes, e-commerce orders can arrive the next day or within three days, and there are also pre-sales and pre-purchases. Simple scheduling systems may not be able to support these tasks. However, DataWorks provides a very flexible choice of task scheduling cycles, such as months, weeks, and days. It also supported the stable scheduling of 15 million tasks per day during Double 11, so it performs well in both scheduling flexibility and stability.
At the beginning, we designed the new retail business of an enterprise to be a closed loop, where each business is correlated. On the other hand, the data tasks of the enterprise are also correlated. At this time, the entire task scheduling link is very complex.
Throughout this process, we have made many attempts and innovations, and we have run into many pitfalls. Here we share our experience with you. In DataWorks, data may be missing or wrong when task nodes are not started or are started at the wrong time. Therefore, we need to make sure that data developers address every issue of every online task in a timely manner, because each of them can cause a data problem. An appropriate scheduling policy ensures both the correctness and the timeliness of data output. If we want data output by day, we should not schedule the task by hour. If the cycle is three days, we can set up a three-day schedule.
Under normal circumstances, a project or requirement is completed through these steps, and we consider that the data development engineer has finished the task.
However, this is generally not the case. Because the data mid-end is business-oriented, the impact is particularly great once something goes wrong. The group has its core systems, departments have theirs, and each business line has core and non-core systems. Different systems need different levels of assurance, and fault levels are defined as P1, P2, P3, and P4. The same is true for data businesses. Unlike normal business systems, data businesses rely on DataWorks to ensure the stability of the entire online big data business. DataWorks provides an important module for this: data quality monitoring. Data quality monitoring allows enterprises to find problems in a more timely manner.
When the business is affected, data quality monitoring ensures that we know about it immediately. (Because business use of the data is sometimes delayed, the data team often does not learn of a problem until the business side reports it.) Data quality monitoring aims to ensure the correctness of data output. Its monitoring range must be comprehensive: it is not limited to changes in table size, but also covers function results, enumerated field values, primary key conflicts, and even invalid formats. Abnormal values trigger alarms or interrupt the data processing, and the personnel on duty should then intervene immediately.
The preceding section describes monitoring. However, as monitoring grows, alerts may get out of hand and many early warnings may be generated. DataWorks therefore provides another capability: task baseline management. As mentioned earlier, businesses are graded, and the data business of an enterprise likewise has important and unimportant tasks. We use baselines to isolate these tasks from one another.
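The kinds of checks described above can be sketched as a small rule function in Python. This is not the DataWorks data quality API; the field names, the allowed status values, and the 50% fluctuation threshold are assumptions chosen for illustration.

```python
import re

ALLOWED_STATUS = {"created", "paid", "refunded"}   # hypothetical enumeration
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_quality(rows, previous_count, max_change_ratio=0.5):
    """Return the names of the rules that the current partition violates."""
    issues = []

    # Rule 1: the row count should not swing too far from the previous run.
    if previous_count > 0:
        change = abs(len(rows) - previous_count) / previous_count
        if change > max_change_ratio:
            issues.append("row_count_fluctuation")

    # Rule 2: the primary key must be unique.
    ids = [row["order_id"] for row in rows]
    if len(ids) != len(set(ids)):
        issues.append("duplicate_primary_key")

    # Rule 3: enumerated fields may only hold known values.
    if any(row["status"] not in ALLOWED_STATUS for row in rows):
        issues.append("unexpected_enum_value")

    # Rule 4: dates must follow the standard format.
    if any(not DATE_RE.fullmatch(row["order_date"]) for row in rows):
        issues.append("invalid_date_format")

    return issues


good = [
    {"order_id": 1, "status": "paid", "order_date": "2021-01-01"},
    {"order_id": 2, "status": "created", "order_date": "2021-01-02"},
]
bad = [
    {"order_id": 1, "status": "paid", "order_date": "2021-01-01"},
    {"order_id": 1, "status": "x", "order_date": "2021/01/02"},
]
```

A scheduler could treat some rules as blocking, stopping downstream tasks when they fail, and others as alert-only, which mirrors the choice between interrupting processing and raising an alarm.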
Our experience is that a baseline ensures the timely output of data assets. Priorities determine how system hardware resources are guaranteed and how on-duty operations staff respond. You should configure a level-8 baseline for your most important business, which ensures that the data of your most important tasks is generated in a timely manner.
In addition, DataWorks provides an excellent feature: a refresh tool. When a baseline problem occurs or a baseline is broken, you can use the refresh tool to quickly refresh data back. If you have set up DataWorks’ intelligent monitoring, this feature can help you predict in advance whether there is a risk of breaking the baseline through the current task status and historical running time under some baselines. For example, a data item is normally generated at 12 pm, but before that, a data item should be generated at 6 pm.
After intelligent monitoring is set up, if the task that should generate data at 6:00 p.m. has not completed by 7:00 p.m., and the system determines through its algorithm that the downstream data can no longer be generated by 12:00 p.m. as normal, intelligent monitoring sends an alert at 7:00 p.m. This allows engineers to intervene in advance instead of waiting for the data output to actually be delayed. This intelligent monitoring and risk estimation is very useful for the stability of an enterprise’s business.
Data quality monitoring and baselines basically ensure the stable and normal operation of enterprise big data tasks and businesses. The next topic is the governance of data assets. Alibaba advocates data, and one major milestone in its transformation was that the hardware costs of data storage and computing exceeded those of its business systems. As a result, Alibaba treats data asset governance as one of its core tasks.
DataWorks is the largest, and perhaps the only, platform of its kind in Alibaba Group, and it handles the most data. It also provides a data asset module called Universal Decentralized Asset Platform (UDAP). In this module, we can check current overall resource usage from multiple dimensions, from projects to tables to individuals. This module also gives users a health score, which provides a comprehensive view of how each individual in each business department ranks.
The simplest way to implement governance is to start with the lowest health scores and raise them, and overall resource consumption then comes down. In addition, UDAP provides many data visualization tools, which allow you to quickly see the effect of governance. I also have some ideas to share with you.
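The idea of predicting a baseline break from the current task status and historical run times can be sketched as follows. This is a deliberately simple average-based estimate written for illustration; it is not the algorithm that DataWorks uses, and the task names, state labels, and durations are hypothetical.

```python
from statistics import mean


def estimate_finish(now_min, chain, history):
    """Estimate when the last task in a dependency chain will finish.

    chain: list of (task_name, state, elapsed_min) in dependency order,
           where state is "done", "running", or "pending".
    history: task_name -> list of past run times in minutes.
    Times are minutes counted from a common reference point.
    """
    remaining = 0.0
    for name, state, elapsed in chain:
        if state == "done":
            continue
        expected = mean(history[name])
        if state == "running":
            remaining += max(expected - elapsed, 0.0)
        else:
            remaining += expected
    return now_min + remaining


def baseline_at_risk(now_min, chain, history, deadline_min, margin_min=0.0):
    """True if the estimated finish time, plus a safety margin, misses the deadline."""
    return estimate_finish(now_min, chain, history) + margin_min > deadline_min


history = {"sync": [60, 60], "dwd": [120, 120], "dws": [90, 90]}
slow_history = {"sync": [60, 60], "dwd": [180, 180], "dws": [90, 90]}

# The upstream task is one hour late and nothing has started yet.
chain = [("sync", "pending", 0), ("dwd", "pending", 0), ("dws", "pending", 0)]
now, deadline = 60, 360   # one hour after the expected start; deadline six hours after it
```

With the first history, the chain still needs 270 minutes and is expected to finish 30 minutes before the deadline, so no alert is needed. With the slower history, the estimate overshoots the deadline, and an alert can be raised hours before the delay actually happens.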
First, the main goal is to optimize storage and computing, reduce costs, and improve resource utilization. The technical team will build many project spaces on their own, and the data mid-end team needs to work with the technical team to build these spaces to complete data governance. Some useful methods are to bring useless applications offline, manage table lifecycles, and manage repeated computing.
More importantly, brute-force scans that consume computing resources need to be strictly prohibited. Currently, some functions of UDAP are available in the resource optimization module of DataWorks, such as the management of duplicate tables and of duplicate data development and data integration tasks.
After finishing the above, we think that the data mid-end has completed its mission. One final point is data security management. With the development of the Internet, China has introduced a relevant law almost every year, such as the E-Commerce Law, the Cybersecurity Law, and the recently drafted Data Security Law. For an enterprise, compliance with the law is particularly important.
As the unified entry and exit point for Alibaba big data, DataWorks provides many data security management methods. It can control access at the engine level, the project level, the table level, or even the field level. At the field level, each field has a security level. For example, some high-level fields can only be used after approval by the department head or the enterprise president.
Even if access to some fields is approved, there may still be high-risk data, such as ID card numbers and mobile phone numbers. The data security guard of DataWorks provides a technology called data masking. Sensitive, high-risk data is masked when it is retrieved, which does not affect users’ statistics or analysis, but users cannot see the original data.
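Two common masking styles can be sketched in Python. This is a generic illustration rather than the masking implementation in DataWorks; the choice of which digits to keep and the use of a keyed hash are assumptions.

```python
import hashlib
import hmac


def mask_phone(phone: str) -> str:
    """Partial masking: keep the first 3 and last 4 characters."""
    if len(phone) < 7:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def pseudonymize(value: str, secret: bytes) -> str:
    """Keyed hashing: the same input always yields the same token,
    so counts and joins still work, but the original value is hidden."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


SECRET = b"demo-key-not-for-production"   # hypothetical key, managed elsewhere in practice

masked = mask_phone("13812345678")
token_a = pseudonymize("13812345678", SECRET)
token_b = pseudonymize("13812345678", SECRET)
token_c = pseudonymize("13900000000", SECRET)
```

Partial masking suits display in reports, while the keyed hash suits analysis, because distinct-user counts and table joins give the same results on tokens as on the raw values.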
Alibaba Group has a unified set of data management methods, which are interconnected with the organizational structure. When an employee leaves or transfers jobs, his or her permissions will be automatically withdrawn. In any enterprise, including Alibaba, personnel changes are frequent. With such features and systems, enterprises can ensure that data is better applied while ensuring data security.
Value of Building a Data Mid-End Based on DataWorks
In the previous section, we talked about building a new retail data mid-end based on DataWorks. We mentioned that the data mid-end must serve the business. Now, I will introduce some ways in which the data mid-end serves the business. The process of using data for an enterprise is from shallow to deep.
At first, everyone starts the same way: we simply look at the data and see what data we have. Then, we look for problems through the data and use it to assist manual decisions. However, many new retail businesses expand very fast, opening more than 100 stores a year or covering more than 200 cities across the country. Once the business form has changed, simple data reports and data visualization can no longer support a business that opens more than 100 stores a year. Therefore, enterprises can introduce more refined control at this stage, such as category diagnosis and inventory health checks, which tell the business side what problems it has now instead of leaving it to find problems in reports.
Later come prediction scenarios. Fresh food businesses differ from e-commerce businesses in one important way: they are particularly affected by external factors, such as weather or holidays. Even a traffic accident can affect a fresh food business, because the resulting inventory problem may lead to spoiled goods.
To address this situation, enterprises can build many prediction applications on the data mid-end, such as sales prediction. Fresh food sales prediction can be configured at hourly granularity, with iterations every hour, and simulation systems can even be built. For example, when there is a sudden change in the weather, we can use the simulation system to predict or perceive the resulting risk and make adjustments. Another scenario is daily fresh goods (goods that must be sold on the same day).
Every operator and salesperson has a lot to do every day. With so many stores carrying so many kinds of daily fresh goods, people cannot efficiently perceive problems and make adjustments on their own. If we eliminate hundreds of reports, we can centralize the scenarios in which problems had to be found through human observation into the business system. For example, the data mid-end may find that daily fresh goods are not selling and that only three hours remain before the store closes, so a discount is required.
At this time, without any manual participation, the discount is automatically triggered by the prediction and algorithm of the data in the data mid-end. Then, the goods will be sold. The integration of Business Intelligence (BI) and Artificial Intelligence (AI) can make the data mid-end truly valuable. Enterprises can design different data applications based on different data application stages so that the data can truly empower businesses.
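The automatic discount described above can be sketched as a simple rule that combines current stock with a sales prediction. The tier thresholds and discount rates here are invented for illustration, and a real system would replace the `predicted_sales` input with the output of a forecasting model.

```python
def discount_rate(stock: int, predicted_sales: float,
                  hours_to_close: float, trigger_hours: float = 3) -> float:
    """Return the discount to apply to a daily fresh item (0.0 means none).

    stock: units currently on the shelf.
    predicted_sales: units the model expects to sell before closing.
    """
    if hours_to_close > trigger_hours or stock <= 0:
        return 0.0                      # too early, or nothing left to sell
    surplus = stock - predicted_sales
    if surplus <= 0:
        return 0.0                      # expected to sell out at full price
    surplus_ratio = surplus / stock
    # Hypothetical tiers: the larger the expected leftover, the deeper the discount.
    if surplus_ratio > 0.5:
        return 0.5
    if surplus_ratio > 0.2:
        return 0.3
    return 0.1


def discounted_price(price: float, rate: float) -> float:
    return round(price * (1 - rate), 2)
```

For instance, with 100 units on the shelf, 30 predicted sales, and three hours to closing, the rule returns a 50% discount, while the same stock five hours before closing returns no discount at all.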
That’s all for this sharing. If you have any questions about building a new retail data mid-end based on DataWorks, you are welcome to communicate with us. Thank you!
About the Authors
Xu Ri (Huanbo) transferred to the Freshippo business department in 2016, in the early days of Freshippo, as the R&D leader of its online data platform. Currently, he is the leader of the Alibaba Cloud DataWorks modeling engine team.
Li Qiping (Shouyi) has been Freshippo’s data R&D leader from its start-up days to the present. He has extensive experience in data warehouse and data mid-end construction. He is the former leader of Alibaba’s international business data warehouse.