Alibaba Cloud Machine Learning Platform for AI: News Classification Case

By Garvin Li

News classification is a common scenario in text mining. At present, many media companies and content producers classify news text by manual tagging, which consumes a lot of human resources. This article classifies news text with text mining algorithms, so the whole process is carried out by the machine without any manual tagging.

In this article, automatic news classification is implemented with the PLDA algorithm and by clustering the resulting topic weights. The process includes word segmentation, word type conversion, stop-word filtering, topic mining, and clustering. We will do all of this on the Alibaba Cloud Machine Learning Platform.

Note: The data in this article is fictitious and is only used for experimental purposes.

Dataset Introduction

The data screenshot is shown below.

Image for post

The detailed fields are as follows:

Image for post

Data Exploration Procedure

The experiment flow chart is as follows.

Image for post

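The experiment is roughly divided into the following 5 steps:

  1. Add a serial number column
  2. Segment the text and analyze word frequency
  3. Filter stop words
  4. Mine text topics
  5. Analyze and evaluate the results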

1. Add a Serial Number Column

Each record in the data source of this experiment is a single news unit. An ID column must be added as a unique identifier for each news unit, so that the algorithms in the later steps can refer to individual articles.
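The platform does this with a visual component, but the idea can be sketched in a few lines of Python. This is only an illustration: the sample rows are made up, `add_serial_number` is a hypothetical helper, and numbering from 0 is an assumption. Only the column name `append_id` comes from the experiment itself.

```python
# Hypothetical sample rows; the real table's contents are not reproduced here.
news_rows = [
    {"content": "The central bank cut interest rates again."},
    {"content": "The home team won the championship final."},
]


def add_serial_number(rows, id_column="append_id"):
    """Return copies of the rows with a unique integer ID added to each.

    Numbering from 0 is an assumption made for this sketch.
    """
    return [{id_column: i, **row} for i, row in enumerate(rows)]


rows_with_id = add_serial_number(news_rows)
for row in rows_with_id:
    print(row["append_id"], row["content"])
```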

2. Word Segmentation and Word Frequency Analysis

Word segmentation and word frequency analysis are among the most common steps in text mining.

The word splitting component is first used to segment the content field (the news content). After the filtered words are removed (these are generally punctuation and auxiliary words), the word frequency is analyzed. The results are shown in the following figure.

Image for post
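As a rough illustration of these two steps outside the platform, the sketch below segments text and counts words using only the Python standard library. The `segment` function is a simple stand-in that splits English text on non-alphanumeric characters; it is not the platform's word splitting component, and the sample documents are invented.

```python
import re
from collections import Counter


def segment(text):
    """Lowercase the text and split it into word tokens.

    A stand-in for a real segmenter; punctuation is dropped as a side effect.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def word_frequencies(docs):
    """Map each document ID to a Counter of its word counts."""
    return {doc_id: Counter(segment(text)) for doc_id, text in docs.items()}


# Hypothetical documents keyed by their serial number.
docs = {
    0: "Stocks rise as the bank cuts rates, and the bank signals more cuts.",
    1: "The team wins the final; the team celebrates.",
}
freqs = word_frequencies(docs)
print(freqs[0].most_common(3))
```

Words such as "the" dominate the raw counts, which is why a separate filtering step follows.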

3. Stop-Word Filter

The stop-word filter component removes every word that appears in the supplied stop-word lexicon. These are generally punctuation marks and auxiliary words that contribute little to the meaning of an article.
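The filtering itself is a simple set lookup. The sketch below assumes a tiny, made-up lexicon (`STOP_WORDS`) and a hypothetical helper `filter_stop_words`; a real lexicon would be much larger and language specific.

```python
# Hypothetical lexicon used only for this illustration.
STOP_WORDS = {"the", "a", "an", "and", "as", "of", "to", "in"}


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the lexicon, preserving order."""
    return [token for token in tokens if token not in stop_words]


tokens = ["the", "bank", "cuts", "rates", "and", "the", "bank", "signals", "more", "cuts"]
filtered = filter_stop_words(tokens)
print(filtered)
```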

4. Text Topic Mining

Before the PLDA text mining component can be used, the text must be converted to triple form (text to numbers), as shown in the following figure.

Image for post

append_id is the unique identifier for each news unit.

In the key_value field, the number before each colon is the numeric identifier assigned to a word, and the number after the colon is the frequency with which that word appears in the article.
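The conversion can be sketched as follows. Every word is mapped to an integer ID, each article becomes a set of (append_id, word ID, count) triples, and the triples are then packed into one `id:count` string per article. The helper names, the ID assignment order, and the space used to separate pairs are assumptions for illustration; the platform's exact format may differ.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign each distinct word an integer ID in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs.values():
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_triples(tokenized_docs, vocab):
    """Return (append_id, word_id, count) triples for every article."""
    triples = []
    for doc_id, tokens in tokenized_docs.items():
        counts = Counter(vocab[token] for token in tokens)
        triples.extend((doc_id, word_id, n) for word_id, n in sorted(counts.items()))
    return triples


def to_key_value(triples, pair_sep=" "):
    """Pack the triples into one 'word_id:count' string per article.

    The pair separator is an assumption made for this sketch.
    """
    rows = {}
    for doc_id, word_id, n in triples:
        rows.setdefault(doc_id, []).append(f"{word_id}:{n}")
    return {doc_id: pair_sep.join(pairs) for doc_id, pairs in rows.items()}


# Hypothetical tokenized articles keyed by append_id.
tokenized = {
    0: ["bank", "cuts", "rates", "bank"],
    1: ["team", "wins", "final", "team"],
}
vocab = build_vocabulary(tokenized)
triples = to_triples(tokenized, vocab)
key_value = to_key_value(triples)
print(key_value)
```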

Next, run the PLDA algorithm on the data.

PLDA, a parallel implementation of Latent Dirichlet Allocation (LDA), is a topic model: it can locate the words that represent the topic of each article. This experiment sets 50 topics. The PLDA component has 6 output ports, and the 5th output port outputs the probability of each topic for each article, as shown in the following figure.

Image for post
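The article does not describe how the PLDA component works internally. To give a feel for what a topic model computes, here is a minimal single-machine sketch of LDA trained by collapsed Gibbs sampling, which is one common way to fit such a model and is used here only as a stand-in. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy data are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iterations=200,
              alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of articles, each a list of integer word IDs.
    Returns one topic-probability vector per article.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is in topic k
    topic_total = [0] * num_topics                              # tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic probabilities for each article.
    return [
        [(row[t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for row, doc in zip(doc_topic, docs)
    ]


# Toy corpus: word IDs 0-2 stand for finance words, 3-5 for sports words.
toy_docs = [[0, 1, 2, 0, 1], [0, 2, 1, 2, 0], [3, 4, 5, 3, 4], [5, 3, 4, 5, 3]]
theta = lda_gibbs(toy_docs, num_topics=2, vocab_size=6)
for doc_id, probabilities in enumerate(theta):
    print(doc_id, [round(p, 3) for p in probabilities])
```

Each returned row plays the same role as the per-article topic probabilities described above: one number per topic, summing to 1. The experiment uses 50 topics, whereas this toy run uses 2 to keep the output readable.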

5. Result Analysis and Evaluation

The steps above represent each article as a vector in topic space, with one dimension per topic.

Articles can then be classified by clustering these vectors according to the distances between them. The classification results of the K-means clustering component are shown in the figure below.

Image for post

cluster_index indicates the name of each class.
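For readers who want to see the clustering idea in code, the following is a minimal K-means sketch in plain Python, not the platform's component. The function names, the random choice of initial centroids, and the toy topic vectors are assumptions; the returned labels correspond to what the experiment calls cluster_index.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=100, seed=0):
    """Cluster the vectors into k groups; return (labels, centroids)."""
    rng = random.Random(seed)
    # Pick k distinct input vectors as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical topic-probability vectors for four articles.
topic_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
cluster_index, centroids = kmeans(topic_vectors, k=2)
print(cluster_index)
```

K-means is sensitive to the number of clusters and to the initial centroids, so results on real data usually benefit from several runs with different settings.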

The four articles numbered 115, 292, 248, and 166 are queried through the filtering and mapping component. The results are shown in the following figure.

Image for post

The experimental result is not perfect. Most of the articles are classified correctly, but in the figure above a financial news unit, a technology news unit, and two sports news units have been grouped into the same class.

The main reasons are as follows:

  1. There is no detailed optimization.

To learn more about Alibaba Cloud Machine Learning Platform for Artificial Intelligence (PAI), visit www.alibabacloud.com/product/machine-learning

Reference: https://www.alibabacloud.com/blog/alibaba-cloud-machine-learning-platform-for-ai-news-classification-case_594401?spm=a2c41.12532010.0.0
