The Diversified Machine Learning Applications In Big Data

Machine Learning and How to Use It on Alibaba Cloud

Machine learning is an important way to use data to create value for customers. This article discusses what machine learning is and how you can use it on Alibaba Cloud.

We live in an era of Big Data where there is an abundance of data and companies across several different industries are coming up with new and unique ways of using this data to create value for their customers. For all of this, machine learning is an important piece of the puzzle.

What Is Machine Learning?

Machine learning (abbreviated ML) can be described as a mechanism whereby a machine learns a pattern from data sets so that it can make predictions about future data. The major types of machine learning algorithms are supervised, semi-supervised, unsupervised, and reinforcement learning. A machine learning pipeline uses training data, some sort of model for that data, and a training algorithm. After initial training, a test dataset is applied to the model to check the accuracy of the predictions made by the pipeline.

Machine learning pipelines typically have the following steps:

  1. Data source
  2. Data preprocessing
  3. Feature engineering
  4. Training and prediction
  5. Evaluation
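
The five steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how any particular platform implements its pipeline: the toy data (two Gaussian blobs), the nearest-centroid model, and every function name are assumptions chosen to keep the example self-contained.

```python
import random
import statistics

def load_data(n=200, seed=0):
    """Step 1, data source: hypothetical toy data, two Gaussian blobs."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 if label == 1 else -2.0
        data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return data

def fit_scaler(rows):
    """Step 2a: learn per-column mean and standard deviation from training rows."""
    columns = list(zip(*[x for x, _ in rows]))
    return [(statistics.mean(c), statistics.pstdev(c)) for c in columns]

def preprocess(rows, scaler):
    """Step 2b: standardize every column with the learned statistics."""
    return [([(v - m) / s for v, (m, s) in zip(x, scaler)], y) for x, y in rows]

def add_features(rows):
    """Step 3, feature engineering: append the sum of the two inputs."""
    return [(x + [x[0] + x[1]], y) for x, y in rows]

def train(rows):
    """Step 4a, training: store one centroid (mean vector) per class."""
    centroids = {}
    for label in {y for _, y in rows}:
        members = [x for x, y in rows if y == label]
        centroids[label] = [statistics.mean(col) for col in zip(*members)]
    return centroids

def predict(centroids, x):
    """Step 4b, prediction: pick the class whose centroid is closest."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))

def evaluate(centroids, rows):
    """Step 5, evaluation: fraction of held-out rows predicted correctly."""
    correct = sum(predict(centroids, x) == y for x, y in rows)
    return correct / len(rows)

data = load_data()
split = int(0.75 * len(data))               # 75% training, 25% test
train_rows, test_rows = data[:split], data[split:]
scaler = fit_scaler(train_rows)             # statistics come from training data only
train_rows = add_features(preprocess(train_rows, scaler))
test_rows = add_features(preprocess(test_rows, scaler))
model = train(train_rows)
accuracy = evaluate(model, test_rows)
print(f"test accuracy: {accuracy:.2f}")
```

Note that the scaler is fitted on the training rows only and then reused on the test rows, so the test set stays unseen until the evaluation step.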

Machine learning can be broadly considered a subfield of Artificial Intelligence (AI), and it is in turn the umbrella category under which you find other types of algorithms, such as deep learning algorithms.

The computing power of the machine on which these algorithms are deployed also plays a big role in how powerful the algorithm can be. In the cloud, these algorithms are instrumental to many of the services provided, and they rely on the computing power of cloud servers.

What Can Machine Learning Be Used for?

One common application of machine learning in recent years is recommendation systems. These systems use data about users to provide them with recommendations. One example of these systems is the one used by Netflix.

Netflix uses a state-of-the-art recommendation system that can provide accurate recommendations. The algorithm used takes input such as the user’s viewing history, user ratings, the data of other users with similar tastes, and the time of the day the user watched the content.

This recommendation system is important as about two thirds of movies watched on Netflix are recommended ones. In other similar services provided by Amazon and Google, the story is very similar. For Amazon, 35 percent of sales on their ecommerce platform come from recommendations, and on Google, news recommendations improved click-through rates by 38 percent.
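
To make the idea of "other users with similar tastes" concrete, here is a minimal sketch of user-based collaborative filtering. It is not the algorithm used by Netflix, Amazon, or Google; the user names, titles, and ratings are made up, and the scoring rule (similarity-weighted sum of ratings) is just one simple choice among many.

```python
import math

# Hypothetical ratings on a 1-5 scale; names and titles are invented.
ratings = {
    "ann":  {"Drama A": 5, "Comedy B": 1, "Thriller C": 4},
    "bob":  {"Drama A": 4, "Comedy B": 1, "Thriller C": 5, "Drama D": 5},
    "cara": {"Drama A": 1, "Comedy B": 5, "Comedy E": 5},
}

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("ann", ratings))
```

In this toy data, "ann" rates titles much like "bob" does, so the title that only "bob" has rated highly ranks above the one that only "cara" has rated.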

How Does Machine Learning Work on Alibaba Cloud?
This section takes you on a step-by-step tutorial of how to use machine learning on Alibaba Cloud. In this tutorial, you will create a basic machine learning pipeline to create a binary classification algorithm.

First, procure the data. To do this, find a data source you want to work with. Some datasets are already available in the console. This example uses breast cancer data.
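
For readers who want to see what a binary classification algorithm does under the hood, the sketch below trains a logistic regression model with plain gradient descent. It does not reproduce the Alibaba Cloud console workflow, and it does not use the breast cancer dataset: the six two-feature rows are hypothetical stand-ins, and the learning rate and epoch count are arbitrary assumptions.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this row: predicted probability minus label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical measurements: low values labeled 0, high values labeled 1.
X = [[1.0, 1.2], [1.5, 0.8], [0.9, 1.1], [3.0, 3.2], [3.4, 2.9], [2.8, 3.5]]
y = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(X, y)
print([predict(w, b, x) for x in X])
```

A real experiment would also hold out a test set and report accuracy on it, as in the pipeline steps described earlier.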

Related Blogs

Although conceptually similar, the terms AI, machine learning, and deep learning are not interchangeable.


On November 9, 2015, Google released an open source Artificial Intelligence (AI) system known as TensorFlow. Ever since the launch of TensorFlow, the growth of AI and machine learning has been immense. Machine learning, as a type of AI, enables software to elaborate or predict future events based on a large volume of data. Today, leading technology giants are all making substantial investments in machine learning, including Facebook, Apple, Microsoft, and even China’s leading search engine — Baidu.

In 2016, Google DeepMind’s AlphaGo project defeated South Korean player Lee Se-dol at the game of Go. The media used the terms AI, machine learning, and deep learning to explain the reasons for DeepMind’s victory, causing widespread confusion about these terms among the public.

Differences and Similarities

Although conceptually similar, the terms AI, machine learning, and deep learning are not interchangeable. Referencing the interpretations from Michael Copeland of NVIDIA, this article unveils the concepts of AI, machine learning, and deep learning. To understand the relationship between the three, let us look at the figure below:

As shown in the figure, machine learning and deep learning are subcategories of AI. The concept of AI appeared in the 1950s, while machine learning and deep learning are relatively newer topics.

AI: From Irrelevance to Global Adoption

Since 1956, when computer scientists coined the term AI at the Dartmouth Conference, there has been an endless stream of creative ideas about AI. AI was one of the hottest topics of research because many perceived it as the key to a bright future for human civilization. However, the idea of AI was quickly discarded for being too pretentious and whimsical.

In the past few years, especially after 2015, AI has experienced a new surge. A large contributor to this growth is the widespread use of graphics processors (GPUs), which make parallel processing faster, cheaper, and more powerful. Additionally, the emergence of almost unlimited storage and massive amounts of data (the big data movement) has also benefited the development of AI. These technologies allow access to all kinds of data, including images, text, transaction data, and map data.

Next, we will look at AI, machine learning, and deep learning one by one from their development processes.

In this post, we look at 6 key automated machine learning (AutoML) platforms that can help data scientists accelerate machine learning development.

1. What Is AutoML?

AutoML (automated machine learning) refers to the automated end-to-end process of applying machine learning in real and practical scenarios.

A typical machine learning model involves the following four steps: data reading, pre-processing, optimization, and result prediction. Traditionally, each step is controlled and performed manually. AutoML focuses on two main aspects: data collection and prediction. Any other intermediate steps can be easily automated. In addition, AutoML provides models that have been optimized and are ready for prediction.

Currently, AutoML mainly falls into three categories:

  1. AutoML for automated parameter tuning (a relatively basic type)
  2. AutoML for non-deep learning, for example, AutoSKlearn. This type is mainly applied in data pre-processing, automated feature analysis, automated feature detection, automated feature selection, and automated model selection.
  3. AutoML for deep learning/neural networks, including NAS and ENAS as well as Auto-Keras for frameworks.

From the application perspective, the demand for machine learning systems has soared over the past few years. ML has been adopted in a wide range of applications. However, although it is proven that machine learning can provide better support for some enterprises, many enterprises are still struggling to implement ML model deployment.

Theoretically, one goal of AI is to replace a portion of manpower. Specifically, a large part of AI design work can itself be carried out by suitable algorithms. Take parameter tuning for example: algorithms such as Bayesian optimization, NAS, and evolutionary programming can take over the tuning process, trading manual effort for additional computing power.
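
As a rough illustration of automated parameter tuning, the sketch below uses random search, which is simpler than the Bayesian, NAS, or evolutionary approaches mentioned above and is swapped in here only to show the general loop of "propose parameters, measure, keep the best". The `validation_loss` function is a hypothetical stand-in for training a model and measuring its validation loss, and the parameter names and ranges are assumptions.

```python
import math
import random

def validation_loss(learning_rate, l2_penalty):
    """Hypothetical stand-in for 'train a model and return its validation loss'.

    By construction the loss is lowest near learning_rate=1e-2, l2_penalty=1e-3.
    """
    return (math.log10(learning_rate) + 2) ** 2 + (math.log10(l2_penalty) + 3) ** 2

def random_search(objective, space, n_trials=50, seed=0):
    """Sample each parameter log-uniformly and keep the best trial seen."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # space maps a parameter name to (low, high) exponents of 10.
        params = {name: 10 ** rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

space = {"learning_rate": (-4, 0), "l2_penalty": (-6, 0)}
best_params, best_loss = random_search(validation_loss, space)
print(best_params, best_loss)
```

Each trial here is cheap, but in practice every call to the objective is a full training run, which is why the more sample-efficient methods named above are attractive.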

To deploy AI models, an enterprise first needs to have a team of experienced data scientists, who expect high salaries. Even if an enterprise does have an excellent team, usually more experience rather than AI knowledge is needed to decide which model best fits the enterprise. The success of machine learning in a variety of applications leads to an increasingly higher demand for machine learning systems, which are supposed to be easy-to-use even for non-experts. AutoML tends to automate as many steps as possible in ML pipelines and retain good model performance with minimum manpower.

This article discusses how and where you can find public data to use in machine learning pipelines that you can then use in a variety of applications.

The goal of the article is to help you find a dataset from public data that you can use for your machine learning pipeline, whether it be for a machine learning demo, proof-of-concept, or research project. It may not always be possible to collect your own data, but by using public data, you can create machine learning pipelines that can be useful for a large number of applications.

Machine Learning Requires Data

Machine learning requires data. Without data you cannot be sure a machine learning model works. However, the data you need may not always be readily available.

Data may not have been collected or labeled yet, or it may not be readily available for machine learning model development because of technological, budgetary, privacy, or security concerns. Especially in business contexts, stakeholders want to see how a machine learning system will work before investing the time and money in collecting, labeling, and moving data into such a system. This makes finding substitute data necessary.

This article aims to shed some light on how to find and use public data for machine learning applications such as demos, proofs-of-concept, or research projects. Specifically, it looks at where you can find data for almost any use case, the problems with synthetic data, and the potential issues with using public data. In this article, the term “public data” refers to any data posted openly on the Internet and available for use by anyone who complies with the licensing terms of the data. This definition goes beyond the typical scope of “open data”, which usually refers only to government-released data.

Be Careful When It Comes to Synthetic Data

One solution to these data needs is to generate synthetic data, or fake data to use layman’s terms. Sometimes this is safe. But synthetic data is usually inappropriate for machine learning use cases because most datasets are too complex to fake correctly. More to the point, using synthetic data can also lead to misunderstandings during the development phase about how your machine learning model will perform with the intended data as you move onwards.

In a professional context using synthetic data is especially risky. If a model trained with synthetic data has worse performance than a model trained with the intended data, stakeholders may dismiss your work even though the model would have met their needs, in reality. If a model trained with synthetic data performs better than a model trained with the intended data, you create unrealistic expectations. Generally, you rarely know how the performance of your model will change when it is trained with a different dataset until you train it with that dataset.

Thus, using synthetic data creates a burden to communicate that any discussions of model performance are purely speculative. Model performance on substitute data is speculative as well, of course, but a model trained on a well-chosen substitute dataset will perform closer to the actual model trained on the intended data than a model trained on synthetic data will.

If you feel you understand the intended data well enough to generate an essentially perfect synthetic dataset, then it is pointless to use machine learning, since you can already predict the outcomes. That is, the data you use for training should be random, and you should use it to see what outcomes are possible, not to confirm what you already clearly know.

This article shows you how to set up a machine learning platform with Alibaba Cloud Machine Learning Platform for AI to analyze census data.

A census is an official survey of a population that records the details of individuals in various aspects. Through census data, we can measure the correlation of certain characteristics of the population, such as the impact of education on income level. This assessment can be made based on other attributes such as age, geographical location, and gender. In this article, we will show you how to set up the Alibaba Cloud Machine Learning Platform for AI product to perform a similar experiment using census data.
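
Before building anything on a platform, the kind of question asked here (how does education relate to income level?) can be explored with a simple group-by. The sketch below is only an illustration: the eight records are made up and are not rows from any real census dataset, and the field values are assumptions about how such data might be labeled.

```python
from collections import defaultdict

# Hypothetical (education level, income bracket) pairs; not real census rows.
records = [
    ("HS-grad", "<=50K"), ("HS-grad", "<=50K"), ("HS-grad", ">50K"),
    ("Bachelors", ">50K"), ("Bachelors", "<=50K"), ("Bachelors", ">50K"),
    ("Masters", ">50K"), ("Masters", ">50K"),
]

def high_income_rate(records):
    """Return the share of records in the higher income bracket per education level."""
    counts = defaultdict(lambda: [0, 0])  # level -> [high-income count, total count]
    for education, income in records:
        if income == ">50K":
            counts[education][0] += 1
        counts[education][1] += 1
    return {level: high / total for level, (high, total) in counts.items()}

rates = high_income_rate(records)
for level, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{level}: {rate:.0%} in the higher income bracket")
```

A table like this only shows an association. Other attributes such as age, location, and gender would have to be taken into account, as the text notes, before drawing conclusions about the effect of education.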

Dataset Introduction
Data source: the UCI open source dataset Adult, a census result for a certain region in the United States with a total of 32,561 instances. The detailed fields are as follows:

Related Courses

The Machine Learning Career Path teaches users about end-to-end machine learning services, including data processing, feature engineering, model training, model prediction, and model evaluation. Alibaba Cloud’s Machine Learning Platform combines all of these services to make AI more accessible than ever. Learn how to use this platform through the completion of this learning path.

This course is the first class of the Alibaba Cloud Machine Learning Algorithm QuickStart series. It introduces the basic concepts and algorithmic principles of the linear regression model as well as model evaluation metrics, explains and demonstrates the complete process of building a linear regression analysis and prediction model in PAI, and prepares you for subsequent machine learning courses.

Internet of Things (IoT) and Big Data are closely connected and have a significant impact on many vertical industries, opening up new opportunities for innovation and process optimization.

Related Market Products

How to use Alibaba Cloud advanced machine learning platform for AI (PAI) to quickly apply the linear regression model in machine learning to properly solve business-related prediction problems.

This session introduces how to use Alibaba Cloud Machine Learning Platform For AI to create a heart disease prediction model based on the data collected from heart disease patients.

How to apply decision tree and random forest model to solve classification problems with PAI.

Related Documentation

What is Machine Learning Platform for AI?

Machine learning refers to the practice of instructing machines to discover regular patterns from accumulated data to help users make predictions and decisions.

Alibaba Cloud Machine Learning Platform for AI provides an all-in-one machine learning service that demands little technical skill from users while delivering high performance. On Machine Learning Platform for AI, you can quickly set up and deploy machine learning experiments to achieve seamless integration between algorithms and your business. Machine Learning Platform for AI is built on the full-fledged algorithm application system of Alibaba Group and now serves tens of thousands of developers and enterprise users. You can quickly build services such as product recommendation, financial risk control, image identification, and voice recognition based on Machine Learning Platform for AI to implement artificial intelligence.

Sina Weibo is a leading social media platform in China. Sina Weibo has 165 million daily active users (DAUs) and 376 million monthly active users (MAUs). Mobile MAUs account for as much as 92% of all MAUs. The data operation scenarios involved include, but are not limited to:

Users generate a huge amount of data on the platform every day. After data processing, tens of billions of features and hundreds of billions of sample entries may be generated. Computing and analyzing such a huge amount of data poses a great challenge for the underlying computing engine.

Developed in the mid-1990s, Support Vector Machine (SVM) is a machine learning method based on statistical learning theory. It seeks to improve the learning machine’s generalization ability through structural risk minimization, minimizing both the empirical risk and the confidence range. As a result, good statistical results can be obtained even from small sample sizes. For more information about SVM, see wiki.

This linear SVM is not implemented using a kernel function. For details about the implementation, see section “6 Trust Region Method for L2-SVM” in the original reference. This algorithm only supports binary classification models.
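
To give a feel for what a kernel-free, binary L2-SVM optimizes, here is a minimal sketch that minimizes the squared hinge loss plus an L2 penalty on the weights. Note the substitution: it uses plain gradient descent rather than the trust region method mentioned above, so it illustrates the objective, not the platform's implementation. The toy data, `C`, learning rate, and epoch count are all assumptions.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def train_l2_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize 0.5*||w||^2 + C * sum(max(0, 1 - y*(w.x + b))^2) by gradient descent.

    Labels must be -1 or +1 (binary classification only).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)          # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            slack = 1.0 - yi * (dot(w, xi) + b)
            if slack > 0:         # only rows inside the margin contribute
                for j, xj in enumerate(xi):
                    grad_w[j] -= 2.0 * C * slack * yi * xj
                grad_b -= 2.0 * C * slack * yi
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Classify by the sign of the linear decision function."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical, linearly separable toy data with labels -1 and +1.
X = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5], [1.0, 2.0], [2.0, 1.0], [1.5, 1.5]]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_l2_svm(X, y)
print(w, b, [predict(w, b, x) for x in X])
```

Because the squared hinge loss is differentiable, ordinary gradient steps work here. Second-order methods such as the trust region approach typically need far fewer iterations on large datasets.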

Related Products

Machine Learning Platform for AI provides end-to-end machine learning services, including data processing, feature engineering, model training, model prediction, and model evaluation. Machine Learning Platform for AI combines all of these services to make AI more accessible than ever.

Alibaba Cloud Image Search is an intelligent image search service that helps users find similar or identical images. Based on machine learning and deep learning, the product enables end-users to take a screenshot or upload an image to search and find desired products and fulfill other search requests.

Original Source:
