Using Machine Learning for Automatic Label Classification

Oct 31, 2018

The annual Alibaba Double Eleven (Singles’ Day) shopping festival is just around the corner, and all kinds of popular shopping lists have recently appeared on Taobao and Tmall. Online shopping experts know that, to make an item easier to find, sellers usually attach many labels that describe the product along multiple dimensions. For example, instead of labeling a pair of shoes as “female leather boots”, it can be described as: “British style women’s marten’s boots dull polish genuine leather thick platform”. Not surprisingly, the same applies to product listings in Chinese. For example, the product description of a bag might be “天天特价包包2018新款秋冬斜挎包韩版手提包流苏贝壳包女包单肩包”, which roughly translates to “discount 2018 fall winter women’s messenger bag Korean style tassel shoulder bag”.

For a consumer, this approach makes sense. Each product is described in multiple dimensions, such as origin, style, and brand. For e-commerce platforms, however, classifying tens of thousands of products according to specific dimensions is often the biggest challenge. The problem is working out which dimension each label belongs to. Location-related labels (e.g. Japan, India, and Korea) can be identified fairly easily, but many other labels are not immediately obvious.

For Alibaba, we believe the best way to do this is through artificial intelligence. Instead of manually filtering out these labels, we can automatically learn them through algorithms, and quickly establish a label classification system. This article will discuss the usage of Alibaba Cloud Machine Learning Platform’s (PAI) text analysis function to implement a simple automatic label classification system. We will be using real examples from the 2016 Double Eleven festival to illustrate this concept.

Data Description

The data from the 2016 Double Eleven festival is directly downloaded and sorted from the Internet. The labels are in Chinese, but the same principles apply to labels of any language. There are more than 2,000 product descriptions, and each line is a label aggregation for a product, as shown below:

We import this data into PAI for processing. The specific data upload method can be found in the official PAI documentation: https://www.alibabacloud.com/help/product/30347.htm

Test Description

After the data is uploaded, the following experiment workflow can be built by dragging components in PAI. The function of each step is marked in the diagram:

The function of each step is described in detail below:

1. Upload Data and Break Words

The data is uploaded, with shopping_data representing the underlying data storage. The word breaking (word segmentation) component then splits each product description into individual words. Word segmentation is a basic NLP operation and is not covered in detail here.
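The source does not describe how PAI's word breaking component works internally. As a rough illustration of what word segmentation does for Chinese text, here is a minimal sketch of one classic dictionary-based technique, forward maximum matching. The function name, the mini-dictionary, and the maximum word length are all assumptions made for this example.

```python
def segment(text, vocabulary, max_word_len=4):
    """Forward maximum matching: at each position, take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical mini-dictionary, for illustration only.
VOCAB = {"新款", "秋冬", "斜挎包", "韩版", "手提包", "女包", "单肩包"}

tokens = segment("新款秋冬斜挎包韩版手提包", VOCAB)
print(tokens)  # ['新款', '秋冬', '斜挎包', '韩版', '手提包']
```

Production segmenters typically combine a much larger dictionary with statistical models, so this sketch only conveys the idea of turning one long string into a list of words.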

2. Add a Serial Number Column

Because the uploaded data has only one field, we add a serial number column to serve as the primary key, which makes the subsequent computation easier. The processed data is as follows:
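Outside PAI, the same idea can be sketched in a few lines of Python. The column names `append_id` and `content` are hypothetical and are not taken from the PAI component.

```python
def add_serial_number(rows, start=0):
    """Attach an incrementing id to each row so it can act as a primary key."""
    return [{"append_id": i, "content": row} for i, row in enumerate(rows, start)]


# Two made-up product descriptions, already segmented into words.
products = ["Korean style shoulder bag", "Japan imported nuts"]
table = add_serial_number(products)
print(table[1])  # {'append_id': 1, 'content': 'Japan imported nuts'}
```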

3. Count Word Frequency

This is a straightforward step: for each item, we count how many times each word appears.
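A minimal sketch of this counting step, assuming each product has already been segmented and given an id. The output layout of (id, word, count) triples is an assumption chosen for illustration, not the documented output of the PAI component.

```python
from collections import Counter


def count_word_frequency(segmented_rows):
    """Return (id, word, count) triples, one per distinct word in each product."""
    triples = []
    for row_id, words in segmented_rows:
        for word, count in Counter(words).items():
            triples.append((row_id, word, count))
    return triples


# Made-up input: (id, list of words) per product.
rows = [(0, ["Korean", "style", "bag", "bag"]), (1, ["Japan", "nuts"])]
triples = count_word_frequency(rows)
print(triples)
```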

4. Generate Word Vectors

The word2vec algorithm is used to map each word to a vector based on its meaning. The resulting word vectors have two useful properties:

Two words whose vectors are close together have similar meanings. For example, in the data, “Singapore” and “Japan” both indicate the product origin, so their vectors are close to each other.
The difference between two word vectors is also meaningful. For example, “Beijing” is the capital of “China” and “Paris” is the capital of “France”. With sufficient training: vector(China) - vector(Beijing) ≈ vector(France) - vector(Paris)
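Both properties can be illustrated with a toy example. The three-dimensional vectors below are hand-made for this sketch and are not output from a trained model; real word vectors only satisfy these relations approximately.

```python
import math


def cosine(u, v):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def subtract(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]


# Hand-made toy vectors (assumption, NOT trained word2vec output).
VEC = {
    "Japan": [1.0, 0.0, 0.2],
    "Singapore": [1.0, 0.0, 0.4],
    "nuts": [-0.5, -0.5, 0.0],
    "China": [1.0, 0.0, 1.0],
    "Beijing": [0.0, 1.0, 1.0],
    "France": [1.0, 0.0, -1.0],
    "Paris": [0.0, 1.0, -1.0],
}

# Property 1: words with similar meanings have similar vectors.
print(cosine(VEC["Japan"], VEC["Singapore"]) > cosine(VEC["Japan"], VEC["nuts"]))

# Property 2: the country-to-capital offset is the same for both pairs.
print(subtract(VEC["China"], VEC["Beijing"]) == subtract(VEC["France"], VEC["Paris"]))
```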

Through word2vec, each word is mapped to a 100-dimensional space, and the result is shown in the following figure:
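The source does not say how the PAI word2vec component is configured. As background, word2vec learns from which words appear near each other. The sketch below generates the (center word, context word) training pairs used by the skip-gram variant of word2vec. The window size and the function name are assumptions, and the neural network that turns these pairs into vectors is omitted.

```python
def skipgram_pairs(words, window=2):
    """List every (center, context) pair within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, words[j]))
    return pairs


# Made-up segmented product description.
pairs = skipgram_pairs(["Korean", "style", "shoulder", "bag"], window=1)
print(pairs)
```

Labels that tend to share the same neighbors, such as different places of origin, produce similar training pairs, which is why their vectors end up close together.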

5. Word Vector Clustering

Now that the word vectors have been generated, we only need to compute which words have vectors that lie close together, so that the label words can be grouped by meaning. Here the k-means algorithm is used for automatic clustering. The clustering result shows which cluster each word belongs to:
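To make the clustering step concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). The two-dimensional vectors and the initial centroids are made up for illustration; the real experiment clusters the word2vec output inside PAI, whose implementation details the source does not give.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = [list(c) for c in init_centroids]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hand-made 2-D toy vectors (assumption, not real word2vec output).
words = ["Japan", "Korea", "Xinjiang", "boots", "bag", "underwear"]
vectors = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.0), (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]

assignment, _ = kmeans(vectors, init_centroids=[vectors[0], vectors[3]])
labels = dict(zip(words, assignment))
print(labels)
```

In practice the initial centroids are usually chosen at random, so results can differ between runs. Fixed starting points are used here only to keep the example reproducible.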

Result Verification

Finally, through the SQL component, a category is randomly selected from the clusters to check whether the labels of the same category are automatically classified. Here, cluster 10 is selected.

Here is the result of cluster 10:

The labels “Japan”, “Russia”, “Korea”, “Yunnan”, “Xinjiang”, and “Taiwan” in the result show that the system automatically grouped geo-related labels together. However, the result also contains the labels “men’s underwear” and “nuts”, which clearly do not belong in this category. This is probably due to an insufficient number of training samples. With more training samples, the label clustering result would likely be more accurate.
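The verification query described above can be sketched as follows. SQLite (from the Python standard library) stands in for the PAI SQL component here, and the table name, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical clustering output: one row per word with its cluster index.
conn.execute("CREATE TABLE cluster_result (word TEXT, cluster_index INTEGER)")
conn.executemany(
    "INSERT INTO cluster_result VALUES (?, ?)",
    [("Japan", 10), ("Korea", 10), ("nuts", 10), ("boots", 3)],
)

# Pull every word assigned to the cluster under inspection.
rows = conn.execute(
    "SELECT word FROM cluster_result WHERE cluster_index = ? ORDER BY word",
    (10,),
).fetchall()
cluster_words = [row[0] for row in rows]
print(cluster_words)
conn.close()
```

Reading through the words of a single cluster in this way is a quick manual check of whether the cluster corresponds to one coherent label dimension.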

Conclusion

Machine Learning Platform (PAI) is built on Alibaba Cloud’s computing and big data architecture. The product comes with several ready-to-use templates, including weather forecasting and disease prediction tools for your convenience.