A cloud data warehouse is a new-generation data warehouse solution built on the cloud. Many enterprise managers want to know how to choose a cloud data warehouse that meets their enterprise's needs and which issues to consider during selection. This article draws on research reports from TDWI and Forrester to outline the criteria for selecting a cloud data warehouse.
What Is a Cloud Data Warehouse?
The cloud data warehouse has changed the traditional way of building data platforms. You can create and start using a data warehouse service within a few minutes, without guidance from platform experts. Enterprise data analysts and other non-technical personnel can access and process large-scale data to quickly gain business insight. Enterprises can focus on business issues at a lower cost without worrying too much about complicated platform technologies. In addition, modern cloud data warehouse services can meet more analysis requirements, such as ETL for massive data, interactive query, machine learning, and unstructured data processing. More and more enterprises are considering a cloud data warehouse as the basis of their data analysis platforms.
Forrester, an authoritative market research institution, defines a cloud data warehouse as a secure and scalable self-service data warehouse that is available on demand. The solution accelerates the data analysis process through automated deployment, management, optimization, backup, and recovery, and minimizes the requirements for technical support.
Choosing a suitable cloud data warehouse, and knowing which factors matter most, is therefore a concern for many decision makers. Drawing on TDWI’s research report, this article shares some best practices for using a cloud data warehouse.
1. Select a Data Warehouse Platform Suitable for Your Analysis Purposes
In practice, data warehouses take on more and more functions over time. These functions include fixed reports for managers, interactive exploration and analysis for analysts, and predictive analysis for data scientists. Different applications place different requirements on the system in terms of data access methods, computing models, and algorithm support.
An effective strategy is to treat the data warehouse as a single system that continuously supports mixed workloads, rather than as a tool for one specific service need. For example, periodic reports require data to be cleaned and converted, and star or snowflake models are used to create datasets for reporting tools. Interactive queries require parallel processing of massive data to enable low-latency data exploration. Predictive analysis must support different development languages and algorithm models and cope with iterative computing over massive data.
Cloud-based data warehouses meet such requirements. With the flexibility of services provided by cloud computing, users can focus more on analysis and results, rather than building systems.
In addition, projects deployed on the cloud usually require flexibility and agility, for example, self-service analysis within a short period of time, or a prototype analysis system built to quickly verify a service concept. For such projects, a cloud-based data warehouse is especially beneficial because you do not have to design, develop, and deploy platforms and data management frameworks. The solution can also reduce startup costs, accelerate analysis, and reduce or even eliminate maintenance costs.
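The periodic-report case described in this section (clean the data, then organize it in a star model for reporting tools) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module only as a local stand-in for a warehouse; the table names, column names, and sample rows are hypothetical.

```python
import sqlite3

def build_star_schema(conn):
    """Create one dimension table and one fact table (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            category   TEXT NOT NULL
        );
        CREATE TABLE fact_sales (
            sale_id    INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
            amount     REAL NOT NULL
        );
    """)

def load_cleaned_rows(conn, raw_sales):
    """Clean raw rows, then load them into the dimension and fact tables."""
    product_ids = {}
    for row in raw_sales:
        if row["amount"] is None:
            continue  # cleaning: skip rows with a missing amount
        category = row["category"].strip().lower()  # conversion: normalize text
        product_id = product_ids.setdefault(row["product"], len(product_ids) + 1)
        conn.execute(
            "INSERT OR IGNORE INTO dim_product VALUES (?, ?)",
            (product_id, category),
        )
        conn.execute(
            "INSERT INTO fact_sales (product_id, amount) VALUES (?, ?)",
            (product_id, row["amount"]),
        )

def sales_by_category(conn):
    """Report query: join the fact table to its dimension and aggregate."""
    return conn.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_id = f.product_id
        GROUP BY d.category
        ORDER BY d.category
    """).fetchall()

# Hypothetical raw input with inconsistent text and one missing value.
RAW_SALES = [
    {"product": "laptop", "category": " Hardware ", "amount": 1200.0},
    {"product": "mouse", "category": "hardware", "amount": 25.0},
    {"product": "license", "category": "Software", "amount": 300.0},
    {"product": "laptop", "category": "Hardware", "amount": None},
]

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
load_cleaned_rows(conn, RAW_SALES)
report = sales_by_category(conn)
print(report)
```

A real cloud data warehouse would replace the in-memory database, but the shape of the work (clean, load into fact and dimension tables, aggregate for the report) stays the same.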
2. Use a Cost Model to Estimate Costs Accurately
Before an investment in a data warehouse pays off, we must consider its costs. However, most practitioners in the data industry are not aware of all the components of the total cost over the lifecycle of a data warehouse system, which may include:
- Procurement costs: For evaluation and purchase of hardware, storage, software, and network communication devices;
- Deployment costs: For project planning, project management, system design, development, configuration, testing, and implementation;
- Data development and management costs: For design and development of data extraction and integration applications, and data warehouse models;
- Business opportunity costs: For mitigating the impact of system release delays on services;
- O&M costs: For data center power supply, cooling, data center space, and operator network maintenance;
- Regular costs: For software license maintenance, system upgrades, data archiving, data backup/recovery, and disaster planning.
Different organizations have different tolerance for different types of costs. For mature services, organizations may be willing to invest in infrastructure because they can predict that the benefits will exceed the startup costs. Small or startup enterprises may not have enough budget for regular costs and want to become profitable quickly.
In this case, a cost model is needed to determine when it is necessary and valuable to use the cloud data warehouse. In some cases, an agile cloud data warehouse solution can shorten the time-to-market for your services and bring business revenue earlier. The increased revenue may exceed or offset system investment.
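As a rough illustration of such a cost model, the sketch below compares cumulative on-premises and cloud costs month by month and finds the first month in which the cloud option becomes the more expensive one. All figures are hypothetical placeholders, and the two-parameter model (one upfront cost, one flat monthly cost) is a deliberate simplification of the cost components listed above.

```python
def cumulative_costs(upfront, monthly, months):
    """Return the cumulative cost at the end of each month."""
    return [upfront + monthly * m for m in range(1, months + 1)]

def break_even_month(on_prem, cloud, months):
    """Return the first month in which cumulative cloud cost exceeds
    cumulative on-premises cost, or None if that never happens."""
    on_prem_curve = cumulative_costs(on_prem["upfront"], on_prem["monthly"], months)
    cloud_curve = cumulative_costs(cloud["upfront"], cloud["monthly"], months)
    for month, (on_prem_cost, cloud_cost) in enumerate(
        zip(on_prem_curve, cloud_curve), start=1
    ):
        if cloud_cost > on_prem_cost:
            return month
    return None

# Hypothetical figures in arbitrary currency units.
# upfront: procurement and deployment; monthly: O&M and regular costs.
ON_PREM = {"upfront": 500_000, "monthly": 10_000}
# upfront: migration effort; monthly: subscription fees.
CLOUD = {"upfront": 20_000, "monthly": 30_000}

month = break_even_month(ON_PREM, CLOUD, months=60)
print(month)
```

With these made-up numbers the cloud option is cheaper for the first two years. A fuller model would also add the revenue gained from a shorter time-to-market, which this sketch leaves out.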
3. Simplify the Application Deployment Process to Accelerate Value Realization
Cloud-based data warehouses greatly simplify deployment. First, service vendors prepare the infrastructure and software in advance, so users do not have to worry about the complex underlying technical work. Second, users benefit from vendor-provided tools that support the whole data processing flow, including data access, analysis, conversion, loading, reporting, and query. These tools and demos can simplify data development. Third, cloud data warehouse vendors provide value-added services by integrating rich functions such as data management, visualization tools, and predictive analysis.
With infrastructure-level tasks eliminated, users can focus on data analysis. A standard data development and deployment process includes at least the following tasks:
- Service objectives: Clarify the enterprise’s data analysis objectives and provide the datasets to specific users;
- Data requirement evaluation: Determine which datasets need to be loaded into the data warehouse;
- Information modeling: Consider how to organize and express data in the data warehouse;
- Data integration: Developers and implementers integrate required data to the data warehouse;
- Data conversion: Use data preparation tools (in the ETL phase) to process and convert data;
- Service-driven analysis: Determine the analysis tasks and deliver the expected results based on service requirements.
Fortunately, cloud data warehouse service providers support these tasks. For example, they provide data integration tools for data access, ETL tools or in-warehouse data processing and conversion, and job scheduling tools to orchestrate and periodically run data processing logic. Deploying cloud-based BI and analytics projects through standard processes therefore greatly improves processing flexibility and the accessibility of analysis results.
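To make the orchestration idea concrete, here is a minimal sketch of how a scheduler might derive a valid run order from task dependencies. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the task names and the dependency graph are hypothetical and do not describe any specific vendor's scheduling tool.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Task names are hypothetical.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_orders": {"extract_orders"},
    "load_warehouse": {"transform_orders", "extract_customers"},
    "refresh_report": {"load_warehouse"},
}

def execution_order(pipeline):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())

order = execution_order(PIPELINE)
for step, task in enumerate(order, start=1):
    print(step, task)
```

A managed scheduling service would add periodic triggers, retries, and monitoring on top of this ordering step.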
4. Find a Cloud-based System That Integrates Advanced Analysis Functions
Traditional business intelligence analysis is mature, but some cloud data warehouse vendors are rapidly integrating advanced analysis functions, including but not limited to:
- Clustering, a grouping (for example, customer grouping) method based on characteristics and behaviors;
- Segmentation, a method for distinguishing entities (such as vendors) based on previously created clustering models;
- Classification, which uses iterative algorithms to classify an individual to a predefined category, such as “Best Customers”, “Good Customers”, “Medium Customers”, and “Undesired Customers”;
- Decision trees, a method for reaching the best decision by comparing the alternative choices at each branch;
- Association analysis, which iteratively checks the relationships between events in a dataset to explore potential associations.
In the past, you might have needed a separate advanced analysis and computing platform for these functions. Now, new data warehouses support them directly. For example:
- The new architecture supports mixed workloads, covering traditional query and report analysis as well as advanced analysis;
- In-memory computing can significantly accelerate iterative computing in both traditional query analysis and advanced analysis.
Therefore, look for a cloud data warehouse service that supports richer computing functions to meet your current data analysis needs. In addition, service providers can keep innovating in their designs to meet the needs of different users.
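As one small example of the advanced functions listed in this section, the sketch below shows the core of association analysis: counting how often pairs of items occur together and deriving support and confidence for each pair. The transactions and the threshold are hypothetical, and production systems use more scalable algorithms than this brute-force pair count.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support):
    """Return (a, b, support, confidence of a -> b) for frequent item pairs."""
    total = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # de-duplicate and fix the pair order
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / total  # share of baskets containing both items
        if support >= min_support:
            confidence = count / item_counts[a]  # share of a-baskets that also hold b
            rules.append((a, b, support, confidence))
    return sorted(rules)

# Hypothetical purchase baskets.
TRANSACTIONS = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

rules = pair_rules(TRANSACTIONS, min_support=0.5)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```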
5. Ensure that the Cloud Platform Delivers Stable and Consistent Performance
One risk of hosting an application is that the provider typically deploys it in a virtualized environment. This may reduce customers’ overall operating costs. However, applications may be redeployed on different infrastructure at any time and may share it with other applications, whose execution can affect performance.
In most organizations, if data users cannot obtain analysis and results quickly, the wide adoption of data services and the success of the project will suffer. If your organization requires predictable performance, specify the performance requirements and acceptable levels, and evaluate the vendor’s methods for guaranteeing or improving performance. Ask the following questions:
- Does the cloud data warehouse vendor provide a performance benchmark that accurately reflects how applications run?
- Does the vendor provide options for deploying projects on the “bare metal” cloud platform instead of the virtualized platform?
- Does the platform use columnar storage, data compression, and in-memory computing to accelerate query execution?
Confirm with the service vendor that your performance requirements can be met.
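One way to make a performance requirement explicit is to state it as a latency percentile and check measured query times against it. The sketch below assumes you have already collected latencies from repeated runs of a representative query; the sample values and the 500 ms limit are hypothetical.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def meets_requirement(latencies_ms, pct, limit_ms):
    """True if the given latency percentile is within the agreed limit."""
    return percentile(latencies_ms, pct) <= limit_ms

# Hypothetical latencies (milliseconds) from ten runs of one report query.
# The single slow run could come from contention with another tenant.
LATENCIES_MS = [120, 135, 128, 140, 610, 125, 131, 138, 122, 129]

p50 = percentile(LATENCIES_MS, 50)
p95 = percentile(LATENCIES_MS, 95)
ok = meets_requirement(LATENCIES_MS, 95, limit_ms=500)
print(p50, p95, ok)
```

In this made-up sample the median looks healthy, but the tail does not meet the limit. Tail percentiles are where inconsistent performance in a shared environment tends to show up, which is why an average alone is a weak acceptance criterion.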
6. Actively Manage Data Access and Integration
If you are considering cloud-based BI and analysis, ensure that you can easily move the data to be analyzed into the cloud environment. Note the complexity of integrating various types of data sources, including flat files, data in relational databases accessed through SQL, data managed in newer NoSQL environments, geospatial data, and multi-source heterogeneous data such as HDFS files on Hadoop.
To actively manage data source access and data integration, consider the following factors:
- Network connectivity: Consider the accessibility of the network connection between each data source you need and the cloud data warehouse.
- Data movement: When the volume of data to load exceeds the capacity of a standard network connection, you may need an alternative method, such as a faster connection with greater bandwidth.
- Data profiling and analysis: Evaluate potential anomalies in data sources, discover the metadata, and check data availability and integrity.
- Data standardization and conversion: Standardize and convert data based on service rules as part of data preparation.
- Data collection: Collect data through replication and real-time change data capture (CDC) to reduce overhead on the data warehouse.
- Data compression: Compress data to reduce the time required to move it from the data source to the cloud data warehouse.
In the face of increasing data from different sources, users need more sophisticated and efficient data integration solutions. When selecting a cloud data warehouse, look for a data integration service that offers data profiling and discovery, compression, transmission, data preparation, and efficient data loading.
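Two of the factors above, data profiling and compression before transfer, can be sketched with the Python standard library alone. The column names and generated rows are hypothetical, and gzip stands in for whatever compression a real integration service applies.

```python
import csv
import gzip
import io

def profile(rows, columns):
    """Count missing values and distinct non-missing values per column."""
    report = {}
    for column in columns:
        values = [row[column] for row in rows]
        present = [value for value in values if value not in ("", None)]
        report[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report

def to_compressed_csv(rows, columns):
    """Serialize rows as CSV and return (raw bytes, gzip-compressed bytes)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    raw = buffer.getvalue().encode("utf-8")
    return raw, gzip.compress(raw)

# Hypothetical source extract: every 50th row is missing its amount.
COLUMNS = ["order_id", "region", "amount"]
ROWS = [
    {
        "order_id": str(i),
        "region": "north" if i % 2 else "south",
        "amount": "" if i % 50 == 0 else "19.99",
    }
    for i in range(1, 1001)
]

quality = profile(ROWS, COLUMNS)
raw, packed = to_compressed_csv(ROWS, COLUMNS)
print(quality["amount"], len(raw), len(packed))
```

Profiling before loading surfaces gaps such as the missing amounts early, and the size difference between the raw and compressed payloads gives a first estimate of how much transfer time compression could save.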
7. Meet Security and Data Protection Requirements
Another risk of using hosted or cloud-based data warehouses is data security. Ensuring access security and data protection is difficult for two reasons. First, in some cases, the multi-tenant architecture allows multiple customer applications to run in the same environment, leading to the risk of data leakage across application boundaries. Second, storage on a virtual platform can be distributed across multiple physical machines, so users may worry that residual data left behind during migration could be captured by other applications.
Obviously, your enterprise must assess security and data privacy protection needs, and make sure that vendors can meet these needs. Cloud-based data warehouse vendors may provide the following methods:
- User authentication and authorization to prevent unauthorized data access
- Fine-grained data access control to prevent the exposure of protected data attributes
- Data masking to hide the real values of protected data attributes from users who do not need them
- Data encryption, which can be applied to data at rest as well as to data in transit when it is accessed and transmitted to a user portal
- Data erasure, which completely overwrites the hard drive to prevent malicious recovery
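Two of these methods, fine-grained access control and data masking, can be sketched as a per-role view over a record. The roles, column names, and masking rules below are hypothetical illustrations and do not describe any particular vendor's feature.

```python
def mask_email(email):
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    if not domain:
        return "***"
    return local[:1] + "***@" + domain

def mask_card(number):
    """Hide all but the last four characters."""
    return "*" * (len(number) - 4) + number[-4:]

# Hypothetical policy: which columns each role may see at all.
ROLE_COLUMNS = {
    "analyst": {"customer_id", "region", "email"},
    "support": {"customer_id", "email", "card_number"},
    "auditor": {"customer_id", "region", "email", "card_number"},
}
# Columns that are masked unless the role is explicitly exempt.
MASKERS = {"email": mask_email, "card_number": mask_card}
UNMASKED_ROLES = {"auditor"}

def view_for(role, record):
    """Return the subset of a record that a role may see, masked as needed."""
    allowed = ROLE_COLUMNS.get(role, set())  # unknown roles see nothing
    visible = {}
    for column, value in record.items():
        if column not in allowed:
            continue  # column-level access control
        if role not in UNMASKED_ROLES and column in MASKERS:
            value = MASKERS[column](value)  # masking of protected attributes
        visible[column] = value
    return visible

RECORD = {
    "customer_id": 42,
    "region": "emea",
    "email": "alice@example.com",
    "card_number": "4111111111111111",
}

analyst_view = view_for("analyst", RECORD)
support_view = view_for("support", RECORD)
print(analyst_view)
print(support_view)
```

Denying by default (an unknown role gets an empty view) is a deliberate choice in this sketch, since an omission in the policy then hides data rather than exposing it.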
The suggestions above help you determine whether a cloud data warehouse is suitable for your organization. Once you decide to move your data warehouse and BI applications to the cloud, make sure to select a suitable service vendor. In short, the criteria described here for evaluating cloud data warehouse services focus on how the products and services can improve your BI and analysis projects. A suitable service should:
- Reduce the overall cost of development and operation
- Accelerate value realization
- Reduce dependency on internal IT resources
- Simplify data receiving, integration, and loading
- Expand data user groups by improving ease of use
- Support your elastic and scalable needs
- Enable service continuity through fault tolerance and hosted failover
- Establish trust in system security and private information protection
Once you have chosen a trusted cloud data warehouse vendor, we recommend that you establish a good working relationship with it. This is very important for three reasons:
- Sustainable environment: A trusted partner ensures that the environment meets all your service analysis requirements in every phase of the data warehouse lifecycle, as well as requirements for elasticity, scalability, security, and overall performance throughout the project lifecycle.
- Response capability: You can trust a valuable service vendor to solve any problem in a timely and reliable manner.
- Cooperation: Find a vendor that can help you quickly build a data platform and cooperate with you and your data consumers to further improve your BI/analysis applications.
Cloud data warehouse vendors can organize their implementation experience to align with customers’ short-, medium-, and long-term strategies.