Best Practices for Batch Processing Massive OSS Files through Serverless Workflow and Function Compute

By Chang Shuai

Background

Batch processing of massive OSS files typically comes up in scenarios such as the following:

  • Replication of massive OSS files (within a bucket or across buckets), with the storage class changed from Standard to Archive to reduce costs.
  • Concurrent restoration of archived OSS files so that applications can use the backed-up files.
  • Event-driven decompression of oversized files. In this scenario, GB-level packages and packages with more than 100,000 files are automatically decompressed to a new OSS path after they are uploaded.

The preceding three scenarios share some common challenges:

  1. Long total processing time: Even with highly concurrent access to OSS, processing hundreds of millions of OSS files takes days or longer.
  2. Exception handling across a large number of remote calls: OSS APIs are generally designed to process a single file, so processing millions to tens of millions of files requires the same number of remote calls. In a distributed system, any of these calls can fail, and you need to handle those failures.
  3. State persistence: A checkpoint-like mechanism is required so that a partial failure does not force everything to be reprocessed, which shortens the overall processing time. For example, a rerun should be able to skip the first 10 million keys that were already processed.

This article will introduce a serverless best practice based on Serverless Workflow and Function Compute (FC) to address the preceding three scenarios.

Scenario 1: Replicate and Archive Massive OSS Files

Suppose you need to copy hundreds of millions of OSS files from one bucket to another bucket in the same region and convert them from Standard storage to Archive storage. For this case, the oss-batch-copy example provides a workflow application template that backs up all the files listed in your index file by calling the OSS CopyObject operation on them in sequence. The index file contains the metadata of the OSS objects to be processed. For example:

Image for post

The index for hundreds of millions of OSS files can be hundreds of GB in size. Therefore, we need to read the index file with range reads and process only part of the OSS files at a time. This calls for control logic similar to while hasMore {} to ensure that the index file is fully processed. Serverless Workflow implements this logic with the following steps:

  1. copy_files task step: Read size bytes of the index file starting at the offset position, extract the files to be processed from that range, and call the OSS CopyObject operation through FC.
  2. has_more_files choice step: After a batch of files is processed, a conditional comparison checks whether the index file is fully processed. If yes, proceed to the success step. If no, pass the (offset, size) value of the next page to copy_files and loop.
  3. start_sub_flow_execution task step: The execution of a single workflow is limited by the number of history events, so this step checks the event ID of the current execution. If the number of events exceeds a threshold, a new, identical workflow execution is started as a sub-process, and the current execution continues after the sub-process ends. A sub-process can also start its own sub-process, which ensures that the entire job can be completed regardless of the number of OSS files.
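
The three steps above can be sketched as a small local simulation. This is only an illustration, not the template's actual code: it assumes the index file holds one object key per line, stands in for OSS range reads with a byte slice, and uses hypothetical names (`read_page`, `run_flow`, `max_events`, `events_per_page`) that do not come from the original template.

```python
from typing import Callable, List, Tuple


def read_page(index: bytes, offset: int, size: int) -> Tuple[List[str], int]:
    """Read up to `size` bytes at `offset`; return whole lines and the next offset."""
    chunk = index[offset:offset + size]  # stands in for a ranged read of the index
    if offset + len(chunk) >= len(index):
        consumed = len(chunk)  # last page: the final line may lack a newline
    else:
        cut = chunk.rfind(b"\n")
        if cut < 0:
            raise ValueError("page size is smaller than one index line")
        consumed = cut + 1  # stop after the last complete line in this page
    lines = chunk[:consumed].decode("utf-8").splitlines()
    return [line for line in lines if line], offset + consumed


def run_flow(index: bytes, offset: int, size: int,
             copy_object: Callable[[str], None],
             max_events: int, events_per_page: int = 2) -> int:
    """Simulate one workflow execution; return how many executions were used."""
    if max_events < events_per_page:
        raise ValueError("max_events must allow at least one page")
    events = 0
    while offset < len(index):  # has_more_files
        if events + events_per_page > max_events:
            # start_sub_flow_execution: continue from the current offset
            # in a fresh execution with an empty event history.
            return 1 + run_flow(index, offset, size, copy_object,
                                max_events, events_per_page)
        keys, offset = read_page(index, offset, size)  # copy_files
        for key in keys:
            copy_object(key)
        events += events_per_page
    return 1


# Hypothetical five-key index, read in 8-byte pages.
index_file = b"a.txt\nb.txt\nc.txt\nd.txt\ne.txt"
copied: List[str] = []
executions = run_flow(index_file, 0, 8, copied.append, max_events=4)
```

With an 8-byte page and a budget of 4 events per execution, the five keys are copied in order across three chained executions. Only complete lines are consumed from each page, so a key that straddles a page boundary is picked up by the next read.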

Image for post

Using the workflow for batch processing provides the following guarantees:

  1. A single request can process any number of files for an almost arbitrarily long time: A workflow execution can run for up to one year.
  2. No maintenance or operations, and no need to implement high availability on your own: Serverless Workflow and FC are highly available serverless cloud services.
  3. No need to implement checkpoints and state maintenance: If the process fails for any reason, you can resume it from the last successful offset. You do not need any database or queue for this.
  4. Configurable retry upon failure: Most transient remote call errors can be handled by configuring retries with exponential backoff.
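
To make the last point concrete, here is a minimal sketch of what retrying with exponential backoff does. In Serverless Workflow this behavior is configured on the step rather than written by hand, so the code below is only an illustration; `TransientError`, `call_with_backoff`, and all parameter names and defaults are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, throttling)."""


def call_with_backoff(fn: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      multiplier: float = 2.0,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, waiting longer after each transient failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the error to the caller
            sleep(min(delay, max_delay))
            delay *= multiplier
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Errors that are not transient, such as a missing object, are deliberately not caught and fail immediately.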

Scenario 2: Restore OSS Files at High Concurrency and in Batches

Restoring archived files differs from copying them in two ways:

  1. Unlike CopyObject, the Restore operation is asynchronous. That is, after the operation is triggered, you must poll the object status until the object is restored before you can use the file.
  2. Restoring a single object takes minutes, and the duration may vary with the object size. This means that the overall process needs higher concurrency to restore the files within the required time.

With logic similar to oss-batch-copy, this example lists the files with ListObjects and restores them in batches. Restoring one batch of files is a sub-process. In each sub-process, a foreach parallel loop step restores the OSS objects at high concurrency, with a maximum of 100 OSS objects restored concurrently. Because Restore is asynchronous, each Restore call for an object must be followed by polling the object status until the object is restored. Restoring and polling are done in the same concurrent branch.
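
The restore-then-poll branch and the bounded concurrency can be sketched locally as follows. This is an illustration under stated assumptions, not the example's actual code: `FakeArchiveStore` is a hypothetical in-memory stand-in for OSS, a thread pool stands in for the foreach step, and `restore_and_wait` and `restore_batch` are names invented here.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class FakeArchiveStore:
    """Hypothetical stand-in for OSS: an object reports as restored
    only after `pending_polls` unsuccessful status checks."""

    def __init__(self, pending_polls: int = 3) -> None:
        self._lock = threading.Lock()
        self._pending_polls = pending_polls
        self._remaining: Dict[str, int] = {}

    def restore(self, key: str) -> None:
        with self._lock:
            self._remaining.setdefault(key, self._pending_polls)

    def is_restored(self, key: str) -> bool:
        with self._lock:
            left = self._remaining[key]
            if left > 0:
                self._remaining[key] = left - 1
            return left == 0


def restore_and_wait(store: FakeArchiveStore, key: str,
                     poll_interval: float, max_polls: int = 1000) -> str:
    """One concurrent branch: trigger the restore, then poll until done."""
    store.restore(key)
    for _ in range(max_polls):
        if store.is_restored(key):
            return key
        time.sleep(poll_interval)
    raise TimeoutError(f"{key} was not restored after {max_polls} polls")


def restore_batch(store: FakeArchiveStore, keys: List[str],
                  max_concurrency: int = 100,
                  poll_interval: float = 0.001) -> List[str]:
    """Restore one batch with at most `max_concurrency` branches in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        branches = pool.map(
            lambda key: restore_and_wait(store, key, poll_interval), keys)
        return list(branches)
```

Because each branch returns only after its object reports as restored, the batch as a whole finishes only when every object in it is usable.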

Image for post

Restoring files in batches by using Serverless Workflow and FC has the following features:

  1. Objects can be restored at high concurrency, reducing the overall recovery duration.
  2. Status-based polling ensures that all objects are restored at the end of the process.

Scenario 3: Decompress Large OSS Files upon Event Triggering

Decompressing a package inside a single function that is triggered by an OSS event has the following limitations:

  1. 10-minute execution time limit for a single function: Decompression is prone to timeout failures for GB-level packages or packages that contain a large number of small files.
  2. Low fault tolerance: OSS calls FC asynchronously, and access to OSS from within the function may fail transiently. When the function call fails, FC retries the asynchronous call at most three times. After that, the message is discarded and the decompression fails.
  3. Insufficient flexibility: After decompression, many users also want to send notifications to a message service, send SMS messages, or delete the original package. It is difficult for a single function to meet all these demands.

To support long-running execution and custom retries, this example introduces Serverless Workflow to schedule the FC tasks. An OSS event triggers FC, which starts a Serverless Workflow execution. The workflow uses the metadata of the zip package to read, unzip, and upload its files to the OSS target path as a stream. When the execution time of a function exceeds a threshold, the function returns the current marker. Serverless Workflow then checks whether the marker indicates that all files have been processed. If yes, the process ends. If no, the streaming decompression continues from the current marker until the end of the package.
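
The marker-based loop can be sketched as follows. This is a simplified local model, not the example's implementation: it holds the zip package in memory instead of reading it from OSS, treats the marker as the position of the next entry in the archive listing (an assumption), and uses hypothetical names (`unzip_step`, `unzip_all`, `upload`, `budget_seconds`).

```python
import io
import time
import zipfile
from typing import BinaryIO, Callable, Optional


def unzip_step(zip_bytes: bytes, marker: int,
               upload: Callable[[str, BinaryIO], None],
               budget_seconds: float,
               clock: Callable[[], float] = time.monotonic) -> Optional[int]:
    """One function invocation: unzip entries starting at `marker`.

    Returns the next marker when the time budget is used up,
    or None when every entry has been processed.
    """
    started = clock()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        entries = [info for info in archive.infolist() if not info.is_dir()]
        position = marker
        while position < len(entries):
            # Always process at least one entry so each step makes progress.
            if position > marker and clock() - started >= budget_seconds:
                return position
            info = entries[position]
            with archive.open(info) as member:
                upload(info.filename, member)  # hand over a readable stream
            position += 1
    return None


def unzip_all(zip_bytes: bytes,
              upload: Callable[[str, BinaryIO], None],
              budget_seconds: float,
              clock: Callable[[], float] = time.monotonic) -> int:
    """The workflow loop: call the function until no marker is returned."""
    marker: Optional[int] = 0
    steps = 0
    while marker is not None:
        marker = unzip_step(zip_bytes, marker, upload, budget_seconds, clock)
        steps += 1
    return steps
```

Each call to `unzip_step` is short enough to fit within a single function invocation, while the loop in `unzip_all`, which the workflow plays in the real setup, carries the marker from one call to the next.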

Image for post

Adding Serverless Workflow removes the 10-minute limit that applies to a single function call. Moreover, built-in state management and custom retries ensure that GB-level packages and packages with more than 100,000 files can be decompressed reliably. Because Serverless Workflow supports a maximum execution time of one year, zip packages of almost any size can be decompressed as a stream.

Image for post

Thanks to Serverless Workflow, the decompression process can be customized flexibly. The following figure shows a flow that sends a notification to an MNS queue after decompression and then deletes the original package in the next step.

Image for post

Takeaways

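Across the three scenarios, Serverless Workflow contributes the following capabilities:

  1. Long-running processes that can run for up to one year without interruption
  2. State maintenance that is unaffected by system failover
  3. Improved tolerance of transient errors
  4. Highly flexible customization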

The batch processing of large amounts of OSS files involves more than the three scenarios mentioned in this article. We look forward to discussing more scenarios and requirements with you at a later date.
