Optimizing MaxCompute Table to Cope with Small Files


A distributed file system stores data by blocks. The size of a block is 64 MB by default and a file smaller than a block is called a small file. Having too many small files on MaxCompute can be problematic, which we will discuss in further detail in this article.

To determine whether there are too many small files, run the following command to check the number of files:

desc extended <table_name>;

Guideline for Number of Small Files in a Table

  1. Non-partitioned table: the number of files reaches 1,000 and the average file size is less than 64 MB.
  2. Partitioned table:
  • The number of files in a single partition reaches 1,000 and the average file size is less than 64 MB.
  • The number of partitions in the entire table reaches 50,000 (the system limit is 60,000).
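As an illustrative sketch (not part of MaxCompute itself), the guideline above can be expressed as a small check. The function name and thresholds are assumptions for demonstration only:

```python
# Illustrative helper (not a MaxCompute API): flags a table or partition
# that matches the small-file guideline above.
BLOCK_SIZE_MB = 64            # default block size of the distributed file system
FILE_COUNT_THRESHOLD = 1000   # guideline threshold for one table/partition

def has_too_many_small_files(file_count: int, total_size_mb: float) -> bool:
    """True when the file count reaches 1,000 and the average file size
    is below the 64 MB block size."""
    if file_count < FILE_COUNT_THRESHOLD:
        return False
    return (total_size_mb / file_count) < BLOCK_SIZE_MB

# Example: 2,000 files totalling 4,000 MB -> average 2 MB per file
print(has_too_many_small_files(2000, 4000))   # True
```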

Causes of Having a Large Number of Small Files

  1. When data integration tools, such as Tunnel, DataHub, and Console, are used to upload data, the data is frequently committed and written to the table (table partitions). As a result, the number of files in a partition may reach hundreds of thousands, and most of these files are small files with a size of several KB.
  2. When “insert into” is used to write data, only a few data records are written at a time and such operations are performed frequently.
  3. A large number of small files are generated in the reduce processes.
  4. Too many temporary files are generated in the job execution processes, and there are too many expired files in the recycle bin.

Note: Although MaxCompute automatically merges small files for optimization purposes, you still need to use the appropriate table partition design and data upload method to avoid small files generated due to causes 1, 2, and 3.

Consequences of Having an Excessively Large Number of Small Files

An excessive number of small files increases metadata overhead and forces jobs to launch more instances to read the same amount of data, which degrades both storage efficiency and query performance.

Solution 1: Merging Small Files

set odps.merge.max.filenumber.per.job=50000;

This parameter sets the maximum number of files that can be merged in a single job. The default value is 50,000. If the number of small files is greater than 50,000, you can raise the value to at most 1,000,000. If the number of small files exceeds 1,000,000, run the merge command multiple times:

ALTER TABLE <table_name> [PARTITION (<partition_spec>)] MERGE SMALLFILES;

Partition Table

  1. Periodically run the command for merging small files.
  2. If the table is partitioned by date, run “insert overwrite” to overwrite the partition data of the previous day on a daily basis, which rewrites the partition into fewer, larger files.

Example:

insert overwrite table tableA partition (ds='20181220')
select * from tableA where ds='20181220';

Non-Partition Table

  1. Create a new partitioned table. We recommend that you partition the table by date and set the lifecycle properly to facilitate recycling of historical data.
  2. Import the data of the original non-partitioned table into the new partitioned table. We recommend that you temporarily stop the real-time write service of the original table first. Example:

create table sale_detail_partition like sale_detail;
alter table sale_detail_partition add partition (sale_date='20181220', region='china');
insert overwrite table sale_detail_partition partition (sale_date='20181220', region='china')
select * from sale_detail;

  3. Modify the upstream and downstream services: change the database import program to write data to the new partitioned table, and change the query jobs to read from the new partitioned table.
  4. Delete the original non-partitioned table after the data is migrated to the new table and verified.

Note: If you use “insert overwrite” to rewrite the full data in order to merge small files, do not run “insert into” on the same table at the same time; otherwise, data may be lost.

Solution 2: Avoiding Small Files

Optimize the Table Design

To learn more about optimal table design, read MaxCompute Table Design Specification and Best Practices for MaxCompute Table Design.

Avoid Tools that Generate Small Files

Tunnel -> MaxCompute

Avoid frequent commit operations when using Tunnel to upload data. Whenever possible, ensure that the size of data submitted each time is greater than 64 MB. For more information, see https://www.alibabacloud.com/help/doc-detail/27833.htm
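The batching idea behind this advice can be sketched as follows. This is an illustrative buffer only, not the real Tunnel SDK: `commit_block` is a hypothetical callback standing in for an actual upload call, and the class name is made up for demonstration.

```python
# Illustrative sketch only: batches records in memory and commits a block
# only once it exceeds the 64 MB threshold, instead of committing per record.
from typing import Callable, List

BLOCK_THRESHOLD_BYTES = 64 * 1024 * 1024  # 64 MB, the default block size

class BufferedUploader:
    def __init__(self, commit_block: Callable[[List[bytes]], None]):
        self._commit_block = commit_block  # hypothetical upload callback
        self._buffer: List[bytes] = []
        self._buffered_bytes = 0

    def write(self, record: bytes) -> None:
        self._buffer.append(record)
        self._buffered_bytes += len(record)
        # Commit only once enough data has accumulated to fill a block.
        if self._buffered_bytes >= BLOCK_THRESHOLD_BYTES:
            self.flush()

    def flush(self) -> None:
        """Commit one large block instead of many small files."""
        if self._buffer:
            self._commit_block(self._buffer)
            self._buffer = []
            self._buffered_bytes = 0
```

With this pattern, frequent small writes from the application still result in infrequent, large commits on the MaxCompute side.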

DataHub -> MaxCompute

If you are using DataHub to generate small files, we recommend that you apply for shards properly and merge shards based on the topic throughput to reduce the number of shards. You can observe the data traffic changes based on the topic throughput and appropriately increase the interval between data write operations.

The policy for applying for the number of DataHub shards is as follows (too many DataHub shards will result in an excessively large number of small files):

  1. The default throughput of a single shard is 1 MB/s. Based on this value, you can apply for the actual number of shards required (plus a few extra).
  2. The MaxCompute synchronization logic gives each shard an independent task, and data is committed every 5 minutes or for every 64 MB of data, whichever comes first. The interval is set to 5 minutes by default so that data can be found in MaxCompute as soon as possible. If you create partitions by hour, each shard creates 12 files per hour. If the amount of data is small but the number of shards is large, a lot of small files (number of shards x 12 per hour) accumulate in MaxCompute. Therefore, allocate only as many shards as needed to prevent an excessive number of shards.

Recommendation: If the traffic is 5 MB/s, apply for five shards. To reserve a buffer of 20% to cope with traffic peaks, you can apply for six shards.
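The sizing rule above amounts to a quick calculation, sketched here in Python. The function names are illustrative, not part of any SDK:

```python
import math

SHARD_THROUGHPUT_MBPS = 1.0   # default throughput of a single DataHub shard
COMMITS_PER_HOUR = 12         # one commit every 5 minutes per shard

def shards_needed(traffic_mbps: float, buffer_ratio: float = 0.2) -> int:
    """Shards required for the given traffic, plus a safety buffer
    (20% by default, as in the recommendation above)."""
    return math.ceil(traffic_mbps * (1 + buffer_ratio) / SHARD_THROUGHPUT_MBPS)

def small_files_per_hour(shard_count: int) -> int:
    """Worst-case files written per hour with hourly partitions."""
    return shard_count * COMMITS_PER_HOUR

print(shards_needed(5.0))        # 6 shards for 5 MB/s with a 20% buffer
print(small_files_per_hour(6))   # up to 72 files per hour
```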

DataX -> MaxCompute

DataX also encapsulates the Tunnel SDK to write data to MaxCompute. Therefore, we recommend that you set blockSizeInMB to a value greater than 64 MB when configuring ODPSWriter.
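As a minimal sketch, the relevant fragment of a DataX job configuration might look like the following, expressed here as a Python dict for illustration. Only blockSizeInMB comes from the text above; the surrounding structure follows the usual DataX writer layout and should be checked against your DataX version.

```python
# Illustrative fragment of a DataX ODPSWriter config (as a Python dict).
# Field names other than blockSizeInMB are assumptions for demonstration.
odps_writer = {
    "name": "odpswriter",
    "parameter": {
        # Write blocks larger than the 64 MB file-system block size
        # so each committed block becomes one reasonably sized file.
        "blockSizeInMB": 128,
        # ... connection and table settings omitted ...
    },
}

assert odps_writer["parameter"]["blockSizeInMB"] > 64
```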

Reference: https://www.alibabacloud.com/blog/optimizing-maxcompute-table-to-cope-with-small-files_594464?spm=a2c41.12560556.0.0
