Alibaba Cloud Machine Learning Platform for AI: Student Exam Score Prediction

This article uses data on middle school students and machine learning algorithms to identify the key factors that affect their academic performance. The data includes information such as parents’ occupation, parents’ education, and Internet connectivity at home. A logistic regression algorithm is used to train an offline model that predicts students’ final examination results and to generate an evaluation report on the academic indicators. The trained offline model is then exposed through an online prediction API so that it can be applied in online scenarios.

We will be building our predictor using the Alibaba Cloud Machine Learning Platform for Artificial Intelligence (PAI) service.

Dataset Introduction

The dataset consists of 25 feature columns and 1 target column. The detailed fields are as follows.

The following is a screenshot of the data.

Offline Training

The following diagram shows the experiment process.

The data flows through the experiment from top to bottom, passing through preprocessing, normalization, splitting, training, prediction, and evaluation in turn.

1. Data Preprocessing

The SQL script is provided as follows.

Use the SQL script component to convert text fields into numeric values.

  1. For example, if a source field contains “yes” and “no”, it can be encoded numerically using 0 for yes and 1 for no.
  2. Some multi-value text fields can be abstracted to fit the scenario. For example, for the field “Mjob”, 1 can indicate a teacher and 0 can indicate a non-teacher. After abstraction, this feature indicates whether the job is related to education.
  3. The target column is encoded so that 1 indicates a score of more than 18 points and 0 indicates any other score. The goal is to train a model that predicts which of these two classes a student's final score falls into.
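The three encoding rules above can be sketched in Python to make them concrete. This is only an illustration, not the SQL script used in the experiment: the column names "internet" and "G3", the output name "finalScore", and the sample rows are assumptions; only "Mjob" appears in the article.

```python
# Illustrative sketch; "internet", "G3", "finalScore" and the sample rows are assumed.
def encode_row(row):
    """Turn one raw record (a dict of strings) into numeric features."""
    return {
        # yes/no field, following the article's example: 0 for yes, 1 for no
        "internet": 0 if row["internet"] == "yes" else 1,
        # multi-value field abstracted to "is the job related to education?"
        "Mjob": 1 if row["Mjob"] == "teacher" else 0,
        # target column: 1 for more than 18 points, 0 otherwise
        "finalScore": 1 if int(row["G3"]) > 18 else 0,
    }

rows = [
    {"internet": "yes", "Mjob": "teacher", "G3": "19"},
    {"internet": "no", "Mjob": "services", "G3": "11"},
]
encoded = [encode_row(r) for r in rows]
```

Whichever value is assigned to "yes", the same convention has to be applied at prediction time, otherwise the learned weights are read with the wrong sign.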

2. Normalization

The purpose of the normalization component is to remove the units of measurement and rescale every field to the range 0 to 1, which eliminates the effect of fields having very different value ranges. The result is shown in the figure below.
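As a rough illustration of what rescaling to the range 0 to 1 means, here is a minimal min-max scaling sketch. It assumes min-max scaling is the method applied; the function name and the sample values are made up for the example.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [15, 16, 18, 22]  # made-up sample column
scaled = min_max_normalize(ages)
```

After scaling, a field measured in years and a field that is already 0 or 1 share the same range, so their model weights can be compared more fairly.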

3. Splitting

The dataset is split at a ratio of 8:2: 80% is used for model training, and 20% is used for prediction.
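A minimal sketch of an 8:2 split is shown below. The shuffling step and the fixed seed are assumptions made so the example is reproducible; the article does not say how the split component selects rows.

```python
import random

def split_dataset(rows, train_ratio=0.8, seed=42):
    """Shuffle the rows and split them into a training part and a held-out part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global random state is untouched
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, held_out = split_dataset(range(100))
```

With 100 rows this yields 80 training rows and 20 held-out rows, and no row appears in both parts.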

4. Logistic Regression

The offline model is trained with the logistic regression algorithm. If you are new to this algorithm, you can read more about logistic regression on Wikipedia.
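For readers who want to see the idea in code, the following is a small from-scratch sketch of binary logistic regression trained with batch gradient descent. It is not the implementation behind the PAI component, whose solver and settings are not described here; the learning rate, epoch count, and toy data are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this row: predicted probability minus label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the average gradient.
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, x):
    """Return class 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Toy data: one normalized feature, larger values belong to class 1.
X = [[0.0], [0.2], [0.8], [1.0]]
y = [0, 0, 1, 1]
w, b = train_logistic_regression(X, y)
```

The model outputs a probability between 0 and 1, which is thresholded at 0.5 to obtain the 0/1 class used for the target column.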

5. Result Analysis and Evaluation

View the accuracy of model predictions through the confusion matrix. As can be seen from the figure below, the prediction accuracy of this experiment is 82.911%.
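To show how an accuracy figure is derived from a confusion matrix, here is a short sketch. The labels below are made up and are unrelated to the experiment's 82.911% result.

```python
def confusion_matrix(actual, predicted):
    """Count true positives, false positives, false negatives, and true negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

actual = [1, 0, 1, 1, 0, 0, 1, 0]      # made-up labels
predicted = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up predictions
tp, fp, fn, tn = confusion_matrix(actual, predicted)
acc = accuracy(tp, fp, fn, tn)
```

Accuracy counts both kinds of correct prediction (true positives and true negatives) against the total, so it can look high when one class dominates; the individual matrix cells are worth checking as well.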

According to the characteristics of the logistic regression algorithm, some valuable information can be mined through the model coefficients. Right click on the Binary Logistic Regression component to view the model. The results are shown below.

In a logistic regression model, the larger the absolute value of a weight, the greater the impact of that feature on the result; the weights are comparable here because the features were normalized. A positive weight indicates a positive correlation with the result 1 (a high score in the final exam), and a negative weight indicates a negative correlation. Several features with large weights are analyzed in the following table.
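Reading the weights this way can be sketched as a simple ranking by absolute value. The feature names and weight values below are placeholders, not the coefficients from the experiment.

```python
def rank_features(weights):
    """Sort features by absolute weight, largest first, and label the direction of each."""
    ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [
        (name, w, "positive" if w > 0 else "negative")
        for name, w in ranked
    ]

# Placeholder weights for illustration only.
weights = {"feature_a": 0.42, "feature_b": -1.30, "feature_c": 0.05}
ranking = rank_features(weights)
```

Here "feature_b" ranks first even though its weight is negative, because the ranking depends on magnitude while the sign only gives the direction of the effect.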

Due to the small dataset in this experiment, the above analysis results are not necessarily accurate and are for reference only.

Online Prediction Deployment

Once generated, the offline model can be deployed online, and online predictions can then be obtained by calling a RESTful API.
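Calling such an API from a client could look roughly like the sketch below. The endpoint URL, the token header, and the JSON payload layout are all placeholders invented for illustration; they are not PAI's actual interface, so consult the service documentation for the real request format.

```python
import json
import urllib.request

def build_prediction_request(endpoint, token, features):
    """Build (but do not send) a JSON POST request carrying one feature record."""
    body = json.dumps({"inputs": [features]}).encode("utf-8")  # hypothetical payload layout
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )

request = build_prediction_request(
    "https://example.invalid/predict",   # placeholder endpoint
    "PLACEHOLDER_TOKEN",                 # placeholder credential
    {"Mjob": 1, "internet": 0},          # features encoded as in preprocessing
)
# urllib.request.urlopen(request) would send the request to a real endpoint.
```

The features sent online need to be encoded and normalized exactly as they were for offline training, otherwise the model's predictions are not meaningful.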

To learn more about Alibaba Cloud Machine Learning Platform for Artificial Intelligence (PAI), visit

