Join us at the Alibaba Cloud ACtivate Online Conference on March 5–6 to challenge assumptions, exchange ideas, and explore what is possible through digital transformation.
A distributed file system stores data by blocks. The size of a block is 64 MB by default and a file smaller than a block is called a small file. Having too many small files on MaxCompute can be problematic, which we will discuss in further detail in this article.
To determine whether there are too many small files, run the following command to check the number of files:
desc extended + Table name
Guideline for Number of Small Files in a Table
We know that having too many small files in a table can be suboptimal for our MaxCompute table, but how do we determine the threshold for the number of files? Let’s look at the criteria for determining whether there are too many small files in a table:
- Non-Partition Table: The number of table files reaches 1000, and the average file size is less than 64 MB.
- Partition Table:
- The number of files in a partition reaches 1000, and the average file size is less than 64 MB.
- The number of partitions of the entire non-partition table reaches 50,000 (the system limit is 60,000).
Causes of Having a Large Number of Small Files
- The table design is not reasonable. More partitions lead to more files. For example, if a table is partitioned by day, hour, and business unit (assume that there are six business units), the number of partitions will reach 365246 = 52,560 by the end of the year.
- When data integration tools, such as Tunnel, DataHub, and Console, are used to upload data, the data is frequently committed and written to the table (table partitions). As a result, the number of files in a partition may reach hundreds of thousands, and most of these files are small files with a size of several KB.
- When “insert into” is used to write data, only a few data records are written at a time and such operations are performed frequently.
- A large number of small files are generated in the reduce processes.
- Too many temporary files are generated in the job execution processes, and there are too many expired files in the recycle bin.
Note: Although MaxCompute automatically merges small files for optimization purposes, you still need to use the appropriate table partition design and data upload method to avoid small files generated due to causes 1, 2, and 3.
Consequences of Having an Excessively Large Number of Small Files
It takes less time for MaxCompute to process a single large file than multiple small files. Too many small files affect the overall performance of MaxCompute, increase the pressure on the file system, and decrease the space utilization. The number of small files that MaxCompute can process for a Fuxi instance is limited to 120. An excessively large number of files affects the number of Fuxi instances and the overall performance of MaxCompute.
Solution 1: Merging Small Files
If small files are unavoidable, you can consider merging them. To do this, you can run the following command:
The default value is 50000. If the number of partitions is greater than 50000, adjust maximum number of small files to 1000000. If the number of small files exceeds 1000000, merge small files multiple times using
ALTER TABLE Table Name [partition] MERGE SMALLFILES;
If your table is already a partition table, check whether the partition fields can be converged. Too many partitions also affect the computing performance of MaxCompute. We recommend that you partition the table by date.
- Periodically run the command for merging small files.
- If the table is partitioned by date, run “insert override” to overwrite the partition data of the previous day on a daily basis.
insert overwrite table tableA partition (ds='20181220')
select * from tableA where ds='20181220';
If your table is a non-partition table, periodically run the command for merging small files. However, we recommend that you design the table as a partition table:
- Create a new partition table. We recommend that you partition the table by date and set the lifecycle properly to facilitate recycling of historical data.
- Import the data of the original non-partition table into the new partition table. We recommend that you temporarily stop the real-time write service of the original non-partition table first.
create table sale_detail_patition like sale_detail; alter table sale_detail_insert add partition(sale_date='201812120', region='china'); insert overwrite table sale_detail_patition partition (sale_date='20181220', region='china') select * from sale_detail;
- Modify the upstream and downstream services. Modify the database import program to write data to the new partition table instead. Modify the query job to query data in the new partition table instead.
- Delete the original partition table after the data is migrated to the new partition table and verified.
Note: If you use “insert overwrite” to re-write the full data to the merged small files, “insert overwrite” and “insert into” cannot coexist; otherwise, the data may be lost.
Solution 2: Avoiding Small Files
Optimize the Table Design
Design the table partition properly. Whenever possible, design partition fields that can be converged or managed. An excessive number of partitions also affects the computing performance of MaxCompute. We recommend that you partition the table by date and set the lifecycle properly to facilitate the recycling of historical data and control of your storage costs.
Avoid Tools that Generate Small Files
Avoid using various data integration tools that generate small files.
Tunnel -> MaxCompute
Avoid frequent commit operations when using Tunnel to upload data. Whenever possible, ensure that the size of data submitted each time is greater than 64 MB. For more information, see https://www.alibabacloud.com/help/doc-detail/27833.htm
DataHub -> MaxCompute
If you are using DataHub to generate small files, we recommend that you apply for shards properly and merge shards based on the topic throughput to reduce the number of shards. You can observe the data traffic changes based on the topic throughput and appropriately increase the interval between data write operations.
The policy for applying for the number of DataHub shards is as follows (too many DataHub shards will result in an excessively large number of small files):
- The default throughput of a single shard is 1 MB/s. Based on this value, you can apply for the actual number of shards required (plus a few extra).
- The MaxCompute synchronization logic is that each shard has an independent task and data is committed every 5 minutes or for every 64 MB of data. The interval is set to 5 minutes by default so that data can be located in MaxCompute as soon as possible. If you create partitions by hour, one shard creates 12 files per hour. If the amount of data is small, but the number of shards is large, so a lot of small files (number of shards x 12/hour) exist in MaxCompute. Therefore, allocate shards as needed to prevent an excessive number of shards.
Recommendation: If the traffic is 5 MB/s, apply for five shards. To reserve a buffer of 20% to cope with traffic peaks, you can apply for six shards.
DataX -> MaxCompute
DataX also encapsulates the Tunnel SDK to write data to MaxCompute. Therefore, we recommend that you set blockSizeInMB to a value greater than 64 MB when configuring ODPSWriter.