Analyzing Hot and Cold Tables with MaxCompute

Image for post
Image for post

By Liu Jianwei.

In the operation and management processes of actual data platforms, data tables often grow very large with the inflow of more business data and the construction of data applications. Data management personnel often want to use metadata analysis to better understand the usage of different data tables in order to optimize their data models.

In this article, we are going to show you one way how you can use metadata Analysis with Alibaba Cloud MaxCompute. Specifically, this article shows how you can analyze hot and cold tables through MaxCompute metadata.

Solution Design

MaxCompute InformationSchema provides tables that contains the full table metadata for a project and tasks_history that contains detailed data for jobs that access tables. By summarizing the number of times each table is accessed by a job, you can see how often different tables are used.

  • For hot data, obtain the details of the input_tables field in the tasks_history table, and then use count to count the number of times each table is used in a certain time window.
  • For cold data, associate and sort by using the total number of jobs in the input_tables in tables and tasks_history. This allows you to count the number of times each table is used within the specified time window and list the tables in ascending order.

Procedure

To implement the solution we discussed above, follow the steps given below:

1. Obtain the details of the input_tables field in the tasks_history table with the following command:

select 
inst_id ,
input_tables,
output_tables,
start_time,
end_time
from information_schema.tasks_history
where ds='20190902'limit 100;

For the above command, you can adjust the time condition to suit your specific needs. The results should look something like this:

Image for post
Image for post

The format of the input_tables field in the tasks_history table is ["lightning.customer","lightning.orders_delta"]. Therefore, you need to separate the fields with commas (,) when compiling statistics.

Function processing:

select
--Remove the brackets ([]) at the beginning and end of the input_tables field
trans_array(1,",",inst_id,replace(replace(input_tables,"[",""),"]","")) as (inst_id,input_table)
from information_schema.tasks_history
--The date can be defined based on actual requirements, in which case 20190902 is used as an example
where ds='20190902' limit 100;

The results should look something like this:

Image for post
Image for post

2. Compile an SQL statement to collect statistics on hot tables:

select 
--Calculate statistics by table name
input_table
,count(distinct inst_id) table_read_num
from
(
select
--Remove the brackets ([]) at the beginning and end of the input_tables field
trans_array(1,",",inst_id,replace(replace(input_tables,"[",""),"]","")) as (inst_id,input_table)
from information_schema.tasks_history
where ds='20190902'
) t
group by input_table
order by table_read_num desc
limit 1000;

The results should look something like this:

Image for post
Image for post

3. Compile an SQL statement to collect statistics on cold tables:

Associate and sort by using the total number of jobs in the input_tables in tables and tasks_history. This allows you to count the number of times each table is used within the specified time window and list the tables in ascending order.

select
t1.table_schema,
t1.table_name,
--Join two tables
if(t2.table_read_num is null,0,table_read_num) as table_read_num
FROM information_schema.tables t1
left join(
select
--Remove the quotation marks ("") before and after the table name
regexp_replace(t.input_table,""","") as input_table
,count(distinct t.inst_id) table_read_num
from
(
select
--Remove the brackets ([]) at the beginning and end of the input_tables field
trans_array(1,",",inst_id,replace(replace(input_tables,"[",""),"]","")) as (inst_id,input_table)
from information_schema.tasks_history
where ds='20190902' )t
group by input_table
)t2
--Match by using join conditions
on concat('your_project_name.',t1.table_name)=t2.input_table
order by table_read_num desc
limit 1000;

In the SQL statement, your_project_name. is the table name prefix. You need to modify the name based on your actual data. The results should look something like this:

Image for post
Image for post
Image for post
Image for post

Now, all the tables are sorted by the number of times they have been used. This means you can obtain the sorting information based on the number of times each table is used.

Original Source:

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store