Analyzing Hot and Cold Tables with MaxCompute
By Liu Jianwei.
In the operation and management processes of actual data platforms, data tables often grow very large with the inflow of more business data and the construction of data applications. Data management personnel often want to use metadata analysis to better understand the usage of different data tables in order to optimize their data models.
In this article, we are going to show you one way how you can use metadata Analysis with Alibaba Cloud MaxCompute. Specifically, this article shows how you can analyze hot and cold tables through MaxCompute metadata.
Solution Design
MaxCompute InformationSchema
provides tables that contains the full table metadata for a project and tasks_history
that contains detailed data for jobs that access tables. By summarizing the number of times each table is accessed by a job, you can see how often different tables are used.
The detailed process is explained as follows:
- For hot data, obtain the details of the
input_tables
field in thetasks_history
table, and then usecount
to count the number of times each table is used in a certain time window. - For cold data, associate and sort by using the total number of jobs in the
input_tables
in tables andtasks_history
. This allows you to count the number of times each table is used within the specified time window and list the tables in ascending order.
Procedure
To implement the solution we discussed above, follow the steps given below:
1. Obtain the details of the input_tables
field in the tasks_history
table with the following command:
select
inst_id ,
input_tables,
output_tables,
start_time,
end_time
from information_schema.tasks_history
where ds='20190902'limit 100;
For the above command, you can adjust the time condition to suit your specific needs. The results should look something like this:
The format of the input_tables field in the tasks_history
table is ["lightning.customer","lightning.orders_delta"]
. Therefore, you need to separate the fields with commas (,) when compiling statistics.
Function processing:
select
--Remove the brackets ([]) at the beginning and end of the input_tables field
trans_array(1,",",inst_id,replace(replace(input_tables,"[",""),"]","")) as (inst_id,input_table)
from information_schema.tasks_history
--The date can be defined based on actual requirements, in which case 20190902 is used as an example
where ds='20190902' limit 100;
The results should look something like this:
2. Compile an SQL statement to collect statistics on hot tables:
select
--Calculate statistics by table name
input_table
,count(distinct inst_id) table_read_num
from
(
select
--Remove the brackets ([]) at the beginning and end of the input_tables field
trans_array(1,",",inst_id,replace(replace(input_tables,"[",""),"]","")) as (inst_id,input_table)
from information_schema.tasks_history
where ds='20190902'
) t
group by input_table
order by table_read_num desc
limit 1000;
The results should look something like this:
3. Compile an SQL statement to collect statistics on cold tables:
Associate and sort by using the total number of jobs in the input_tables
in tables and tasks_history
. This allows you to count the number of times each table is used within the specified time window and list the tables in ascending order.
select
t1.table_schema,
t1.table_name,
--Join two tables
if(t2.table_read_num is null,0,table_read_num) as table_read_num
FROM information_schema.tables t1
left join(
select
--Remove the quotation marks ("") before and after the table name
regexp_replace(t.input_table,""","") as input_table
,count(distinct t.inst_id) table_read_num
from
(
select
--Remove the brackets ([]) at the beginning and end of the input_tables field
trans_array(1,",",inst_id,replace(replace(input_tables,"[",""),"]","")) as (inst_id,input_table)
from information_schema.tasks_history
where ds='20190902' )t
group by input_table
)t2
--Match by using join conditions
on concat('your_project_name.',t1.table_name)=t2.input_table
order by table_read_num desc
limit 1000;
In the SQL statement, your_project_name
. is the table name prefix. You need to modify the name based on your actual data. The results should look something like this:
Now, all the tables are sorted by the number of times they have been used. This means you can obtain the sorting information based on the number of times each table is used.