Alibaba Cloud Machine Learning Platform for AI: News Classification Case
By Garvin Li
News classification is a common scenario in text mining. Many media outlets and content producers still classify news text by manual tagging, which consumes considerable human resources. This article classifies news texts with text mining algorithms, entirely by machine and without any manual tagging.
In this article, automatic news classification is implemented with the PLDA algorithm and by clustering topic weights. The process includes word segmentation, word type conversion, disabled-word (stop-word) filtering, topic mining, and clustering, all carried out on the Alibaba Cloud Machine Learning Platform.
Note: The data in this article is fictitious and is only used for experimental purposes.
The data screenshot is shown below.
The detailed fields are as follows:
Data Exploration Procedure
The experiment flow chart is as follows.
The experiment is roughly divided into the following 5 steps:
- Add a serial number column
- Word segmentation and word frequency analysis
- Disabled-word filter
- Text topic mining
- Result analysis and evaluation
1. Add a Serial Number Column
Each record in the data source of this experiment is a single news unit. An ID column is added as a unique identifier for each news unit so that the algorithms in the following steps can refer to individual articles.
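As a minimal sketch of this step outside the platform, the snippet below attaches a sequential ID to each record. The column name `append_id` is taken from the triple output described later in this article; the `category` field and the sample rows are hypothetical and only serve the illustration.

```python
def append_id(rows, id_field="append_id"):
    """Return copies of the rows with a sequential integer ID added."""
    return [{id_field: i, **row} for i, row in enumerate(rows)]


# Hypothetical sample records; the real experiment data is not reproduced here.
news = [
    {"category": "sports", "content": "The team won the final match."},
    {"category": "finance", "content": "The stock market closed higher."},
]
numbered = append_id(news)
```

The original rows are left unchanged, so the ID can be added without modifying the source table.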
2. Word Segmentation and Word Frequency Analysis
These two steps are among the most common practices in text mining.
The word splitting component is first used to segment the content field (the news content) into words. After filtered words are removed (these are generally punctuation marks and auxiliary words), the word frequency is analyzed. The results are shown in the following figure.
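The idea can be sketched in plain Python. This is not the platform's word splitting component: the regular-expression tokenizer below is a stand-in that only works for space-delimited text such as English, and the sample documents are made up. Languages without word delimiters, such as Chinese, need a dictionary-based or model-based segmenter instead.

```python
import re
from collections import Counter


def segment(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequency(docs):
    """Map each document ID to a Counter of its word frequencies."""
    return {doc_id: Counter(segment(text)) for doc_id, text in docs.items()}


# Hypothetical documents keyed by their ID.
docs = {
    0: "The team won the match. The fans cheered!",
    1: "Stocks rose as the market rallied.",
}
freq = word_frequency(docs)
```

Here `freq[0]` counts "the" three times, which illustrates why frequent function words are filtered before topic mining.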
3. Disabled-Word Filter
The disabled-word filter component removes the words listed in an input disabled-word lexicon (such words are commonly called stop words). The lexicon generally contains punctuation marks and auxiliary words that have little influence on the meaning of the article.
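A minimal sketch of such a filter follows. The lexicon `STOP_WORDS` is a hypothetical example; a real lexicon would be supplied as an input table.

```python
# Hypothetical lexicon of words to filter out.
STOP_WORDS = {"the", "a", "an", "of", "as", "and", "to"}


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the lexicon, preserving order."""
    return [t for t in tokens if t not in stop_words]


tokens = ["the", "team", "won", "the", "match"]
kept = filter_stop_words(tokens)
```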
4. Text Topic Mining
Before the PLDA text mining component can be used, the text must be converted to a triple (ternary) form in which each word is replaced by a number, as shown in the following figure.
append_id is the unique identifier for each news unit.
In the key_value field, the number before the colon is the numeric identifier assigned to a word, and the number after the colon is the frequency with which that word appears in the article.
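The conversion can be sketched as follows. The ID assignment (order of first appearance) and the comma used to separate pairs are assumptions made for this illustration; only the `word_id:frequency` layout comes from the description above.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign each distinct word an integer ID in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs.values():
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_key_value(tokens, vocab):
    """Encode one document as 'word_id:frequency' pairs (comma separator assumed)."""
    counts = Counter(vocab[t] for t in tokens)
    return ",".join(f"{wid}:{n}" for wid, n in sorted(counts.items()))


# Hypothetical tokenized documents keyed by append_id.
tokenized = {0: ["team", "won", "match", "team"], 1: ["market", "won"]}
vocab = build_vocabulary(tokenized)
triples = {doc_id: to_key_value(toks, vocab) for doc_id, toks in tokenized.items()}
```

With this input, document 0 is encoded as `0:2,1:1,2:1` because "team" (ID 0) appears twice.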
Then run the PLDA algorithm on the data.
PLDA is a topic model: it identifies the words that represent the topic of each article. This experiment sets the number of topics to 50. The PLDA component has 6 output ports, and the 5th output port gives the probability of each topic for each article, as shown in the following figure.
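To make the per-article topic probabilities concrete, here is a small sketch of a topic model. It is not the PLDA component: assuming PLDA refers to a parallel implementation of latent Dirichlet allocation (LDA), the code swaps in plain single-machine LDA trained with collapsed Gibbs sampling. The hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions for illustration.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iterations=200,
              alpha=0.1, beta=0.01, seed=0):
    """Plain (non-parallel) LDA via collapsed Gibbs sampling.

    docs: list of documents, each a list of integer word IDs.
    Returns one topic-probability vector per document.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens of word w in topic k
    topic_total = [0] * num_topics                              # all tokens in topic k
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional distribution over topics for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic probabilities; each row sums to 1.
    return [
        [(n + alpha) / (len(doc) + alpha * num_topics) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


# Toy corpus: word IDs 0-2 and 3-5 never co-occur in the same document.
toy_docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 3, 5]]
theta = lda_gibbs(toy_docs, num_topics=2, vocab_size=6)
```

Each row of `theta` plays the role of one article's topic probabilities. With 50 topics, as in this experiment, each article would be described by a 50-dimensional vector.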
5. Result Analysis and Evaluation
The steps above represent each article as a vector whose dimensions are the topics.
Articles can then be classified by clustering these vectors according to the distances between them. The classification results of the K-means clustering component are shown in the figure below.
- cluster_index identifies the class that each article is assigned to.
- Class 0 contains a total of 4 articles, with the docid values 115, 292, 248, and 166.
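The clustering step can be sketched with a small K-means implementation (Lloyd's algorithm with Euclidean distance). This is an illustration rather than the platform's K-means component; the two-dimensional topic vectors and the document IDs below are hypothetical, and the initial centers are simply sampled from the data.

```python
import math
import random


def assign(vectors, centers):
    """Return the index of the nearest center for every vector."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(v, centers[c]))
        for v in vectors
    ]


def kmeans(vectors, k, iterations=50, seed=0):
    """Lloyd's K-means; returns one cluster index per vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = assign(vectors, centers)
    for _ in range(iterations):
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = assign(vectors, centers)
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels


# Hypothetical topic-probability vectors for four articles.
doc_ids = [1, 2, 3, 4]
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = kmeans(vectors, k=2)

# Group document IDs by cluster index, mirroring the cluster_index column.
clusters = {}
for doc_id, lab in zip(doc_ids, labels):
    clusters.setdefault(lab, []).append(doc_id)
```

Grouping the document IDs by label gives the same kind of view as listing all articles that share one cluster_index value.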
The 4 articles 115, 292, 248, and 166 are queried through the filtering and mapping component. The results are shown in the following figure.
The experimental result is not perfect. In the above figure, one financial news unit, one technology news unit, and two sports news units are grouped together, although most of the articles are classified correctly.
The main reasons are as follows:
- No detailed tuning was performed.
- No feature engineering was applied to the data.
- The data volume is too small.
To learn more about Alibaba Cloud Machine Learning Platform for Artificial Intelligence (PAI), visit www.alibabacloud.com/product/machine-learning