New Retail Data Analysis: How to Merge an SKU Table and an SPU Table into One Table

Image for post
Image for post

By Weisong

1) Background: Data Analysis in New Retail

In the new retail industry, granularity to store order transaction data tables in the data warehouse is the most common issue. There are three common granularities: (1) store order data by Stock Keeping Unit (SKU); (2) store order data by Standard Product Unit (SPU); (3) store order data by transaction order. The third mode stores order data by transaction order, which is more suitable for transaction detail data tables than for data warehouse storage. Therefore, this article mainly describes how to merge an SKU table and an SPU table into one table.

Why do the data tables store data using two different granularities? Generally, it is caused by two determinants. The first determinant is related to the characteristics of the data analysis business for the retail industry. The second is related to data volume.

1) Characteristics of the Data Analysis Business for the Retail Industry

  • To analyze the recent sales of commodities, decision-makers often need to see the sales of commodities at the finest granularity level, which is SKU. For the retail industry, SKU = SPU + specification/color. Which SPU is the best seller? Which specification/color of an SPU is the best seller? What styles, decorations, and colors are popular this year? The analysis of these commodity sales data needs to be refined to the granularity of SKU.
  • To analyze commodity sales in the last five or ten years, decision-makers usually do not need sales data to be refined to SKU. Generally, SPUs are sufficient to meet analysis requirements.

2) Data Volume

  • Analyze recent commodity sales, such as the sales of the last 15 days, the last month, or the last three months. Results show that the number of sales data records by SKU is usually less than one million. You can use common BI tools to quickly get the analysis results.
  • Analyze the cumulative sales in recent years or in history. If the data is stored by SKU, the data volume often exceeds hundreds of millions or even billions, and it is difficult to quickly obtain analysis results by using conventional BI tools.

From the perspectives of industry characteristics and data volume, the preceding sections analyze why a company often has SKU and SPU sales data tables. In addition, there are many situations where SPU is used for inventory, SKU for sales, SPU for unified distribution, and SKU for delivery.

However, in the field of data warehouse analysis, it is often necessary to merge data tables using different granularities to obtain integrated analysis results. For example, make an integrated analysis of the current year’s commodity sales data and historical cumulative commodity sales data, or make a correlation analysis of the current year’s commodity sales data and the current year’s inventory data. In these cases, integrated analysis or correlation analysis is often performed on the SKU table and SPU table.

This article introduces a simple data preprocessing method that merge an SKU table and an SPU table into one table, as well as provides some ideas for easy reference.

2) Method for Merging SKU Data and SPU Data

Consider two data tables, sku_table and spu_table, storing commodity sales data by SKU and SPU respectively.

2.1) View the Table Structure

--Statements to create tables
create table if not exists test_project.sku_table (
shopno string comment 'shop number',
prodno string comment 'product number',
sizeno string comment 'size number',
rtam_lastyy Double comment 'the current year’s sales of a commodity using SKU granularity'
) comment 'SKU table' PARTITIONED BY (ds string comment 'partition date') lifecycle 200;
create table if not exists test_project.spu_table (
shopno string comment 'shop number',
prodno string comment 'product number',
rtam_lishileiji Double comment 'historical cumulative sales of a commodity using SPU granularity'
) comment ' SPU table' PARTITIONED BY (ds string comment 'partition date') lifecycle 200;

Note: In the preceding data table, prodno is the SPU. The combination of prodno and sizeno becomes SKU.

For sku_table, this table uses shopno + prodno + sizeno as the primary key to indicate one data record. One piece of commodity sales data is recorded with “store + SPU + specification/color” of one sales record. This is common in case of new retail industry’s data warehouse.

For spu_table, this table uses shopno + prodno as the composite primary key to indicate one piece of data. One piece of commodity sales data is recorded with “store + SPU” of one sales record, which is common in the data warehouse of the new retail industry.

2.2) View Data

select * from test_project.sku_table where ds = '20200417';
Image for post
Image for post
select * from test_project.spu_table where ds = '20200417';
Image for post
Image for post

2.3) Merge sku_table and spu_table into One Table

First, create a data table to store the merged data as shown below.

create table if not exists test_project.sku_spu_merge_table (
shopno string comment 'shop number',
prodno string comment 'product number',
sizeno string comment 'size number',
rtam_lastyy Double comment 'the current year’s sales of a commodity using SKU granularity',
rtam_lishileiji Double comment 'historical cumulative sales of a commodity using SPU granularity'
) comment 'merge SKU table (the current year’s sales) and SPU table (historical cumulative sales) into one table' PARTITIONED BY (ds string comment 'partition date') lifecycle 200;

Now, import the two tables to the test_project.sku_spu_merge_table table as shown below.

insert overwrite table 
test_project.sku_spu_merge_table partition (ds = '20200417')
select
COALESCE (v1.shopno, v2.shopno) as shopno,
COALESCE (v1.prodno, v2.prodno) as prodno,
v1.sizeno,
v1.rtam_lastyy,
v2.rtam_lishileiji
from
(
select
shopno,
prodno,
sizeno,
rtam_lastyy,
row_number() over ( partition by shopno,prodno order by sizeno) size_order
from
test_project.sku_table
where ds = '20200417'
) v1
full outer join
(
select
shopno,
prodno,
rtam_lishileiji,
1 join_order
from
test_project.spu_table
where ds = '20200417'
) v2
on (v1.shopno = v2.shopno) and (v1.prodno = v2.prodno) and (v1.size_order = v2.join_order);

Let’s describe the SQL statement for more clarity.

1) First, use the row_number() over (partition by shopno,prodno order by sizeno) function to group sku_table by shopno,prodno (that is, store + SPU), then sort by sizeno, and assign the sorting result to size_order.
2) Next, add the last column join_order to sku_table and assign the fixed value 1.
3) Finally, perform a full outer join to sku_table and spu_table, with (v1.shopno = v2.shopno) and (v1.prodno = v2.prodno) and (v1.size_order = v2.join_order) as the join condition. In particular, the condition v1.size_order = v2.join_order is to associate the first row (that is, size_order=1) of sku_table after it is grouped by shopno,prodno with spu_table (that is, join_order=1).

The following figure shows the final result.

Image for post
Image for post

Finally, sku_table and spu_table are merged into one table.

  • The sku_table with the original SKU granularity is retained.
  • For spu_table by SPU, when (shopno + prodno) is the same as sku_table (the same store + the same SPU), then first sizeno displays the real historical cumulative sales data, and the subsequent historical cumulative sales data is set to null.

The advantage of the preceding method is that when integrated data analysis is performed, such as summation and averaging, the system does not encounter calculation errors due to repeated calculation of SPU data.

3) Further Analysis

The data integration method for SKU and SPU data presented in this article is only a reference. You may build on the ideas mentioned in this article using other methods. For example, SPU sales data is calculated according to the number of SKUs under the SPU and then assigned to the merged data table. Another idea is to perform correlation analysis for the data table by using SPU granularity and the order list to obtain the sales data of each SKU under the SPU and then merge both the data tables using SKU granularity. Just decide which scheme to use based on the actual business situation and project requirements.

The views expressed herein are for reference only and don’t necessarily represent the official views of Alibaba Cloud.

Original Source:

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store