MaxCompute Tunnel Offline Batch Data Channel FAQs

Tunnel is an offline batch data channel service provided by Alibaba Cloud MaxCompute. It handles the uploading and downloading of large batches of offline data, and is suited only to scenarios where each batch contains at least 64 MB of data. MaxCompute Tunnel SDKs are available for Java and C++.

You can upload and download only table data (excluding view data) with MaxCompute Tunnel. It allows multiple clients to upload to the same table at the same time. For small-batch streaming scenarios, use the DataHub real-time data channel for better performance and experience.

Best Practices for SDK Upload

Keep the following PartitionSpec details in mind when using the SDK for Tunnel uploads.

Constructor

PartitionSpec(String spec): Constructs a PartitionSpec object from a string.

Parameters

spec: The definition string of the partition, such as pt='1',ds='2'.

Therefore, the program should be configured like this: private static String partition = "pt='XXX',ds='XXX'";
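As a minimal sketch of how such a spec string can be assembled, the hypothetical helper below joins partition columns and values in the col='value' form described above. PartitionSpecStrings is an illustrative name and not part of the Tunnel SDK. The sketch uses only the JDK and assumes that partition values contain no quote characters.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper for illustration; not part of the Tunnel SDK. */
final class PartitionSpecStrings {
    private PartitionSpecStrings() {}

    /**
     * Joins column/value pairs as col1='v1',col2='v2'.
     * Pass an ordered map (such as LinkedHashMap) so that the partition
     * levels keep the order in which they were added.
     */
    static String build(Map<String, String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one partition column is required");
        }
        StringJoiner joiner = new StringJoiner(",");
        for (Map.Entry<String, String> entry : columns.entrySet()) {
            // Assumption: values contain no single quotes, so no escaping is done.
            joiner.add(entry.getKey() + "='" + entry.getValue() + "'");
        }
        return joiner.toString();
    }
}
```

The resulting string can then be passed to the PartitionSpec(String spec) constructor.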

Frequently Asked Questions about MaxCompute Tunnel

Can block IDs be repeated?

Each block ID in an upload session must be unique. That is, within the same UploadSession, you open a RecordWriter with one blockId, write a batch of data, and then call close().

Once the write has succeeded and the commit is complete, you cannot open another RecordWriter to write data with the same blockId again. A maximum of 20,000 blocks are supported, with block IDs ranging from 0 to 19999.
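One way to guarantee uniqueness on the client side is to hand out block IDs from a single counter. The sketch below is a hypothetical helper (BlockIdAllocator is not an SDK class) that issues each ID once and refuses to go past the 0 to 19999 range stated above. It is thread-safe, so several uploading threads that share one session could draw from it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper for illustration; not part of the Tunnel SDK. */
final class BlockIdAllocator {
    /** Limit taken from the text: block IDs range from 0 to 19999. */
    static final long MAX_BLOCKS = 20_000L;

    private final AtomicLong next = new AtomicLong(0);

    /** Returns the next unused block ID, or fails once the range is exhausted. */
    long nextBlockId() {
        long id = next.getAndIncrement();
        if (id >= MAX_BLOCKS) {
            throw new IllegalStateException("all " + MAX_BLOCKS + " block IDs have been used");
        }
        return id;
    }
}
```

Each ID returned by nextBlockId() would be used to open exactly one RecordWriter in the session.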

Is there a restriction on block size?

The maximum size of a block is 100 GB. We strongly recommend that you write 64 MB or more data into each block. Each block corresponds to one file, and a file smaller than 64 MB counts as a small file. Too many small files degrade performance.

The BufferedWriter object in newer versions of the Tunnel SDK simplifies uploading and avoids problems such as too many small files.

Can a session be shared? Does a session have a lifecycle?

Each session has a 24-hour lifecycle on the server. It can be used within 24 hours after being created, and can be shared across processes or threads on the condition that no BlockId is used more than once. Distributed uploading can be done as follows:

Create a session -> evaluate the data size -> assign blocks (for example, thread 1 uses 0 to 99 and thread 2 uses 100 to 199) -> prepare data -> upload data -> commit all blocks.
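The block assignment step can be sketched as a simple range split. The hypothetical helper below (BlockRanges and Range are illustrative names, not SDK classes) divides a number of blocks into contiguous, non-overlapping ID ranges, one per worker, so that no block ID is used twice within the shared session.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Tunnel SDK. */
final class BlockRanges {
    private BlockRanges() {}

    /** Inclusive range of block IDs owned by one worker. */
    static final class Range {
        final long first;
        final long last;

        Range(long first, long last) {
            this.first = first;
            this.last = last;
        }
    }

    /** Splits block IDs 0..totalBlocks-1 into at most {@code workers} disjoint ranges. */
    static List<Range> split(long totalBlocks, int workers) {
        if (totalBlocks < 1 || totalBlocks > 20_000L) {
            throw new IllegalArgumentException("totalBlocks must be between 1 and 20000");
        }
        if (workers < 1) {
            throw new IllegalArgumentException("workers must be at least 1");
        }
        List<Range> ranges = new ArrayList<>();
        long base = totalBlocks / workers;   // blocks every worker gets
        long extra = totalBlocks % workers;  // first 'extra' workers get one more
        long start = 0;
        for (int w = 0; w < workers; w++) {
            long size = base + (w < extra ? 1 : 0);
            if (size == 0) {
                break; // more workers than blocks: remaining workers get nothing
            }
            ranges.add(new Range(start, start + size - 1));
            start += size;
        }
        return ranges;
    }
}
```

For 200 blocks and two workers, this yields the ranges 0 to 99 and 100 to 199. Each worker then writes only to the block IDs in its own range before all blocks are committed together.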

If a session is created but not used, does it consume system resources?

Upon creation, each session generates two file directories. If a large number of sessions are created and then left unused, these temporary directories accumulate and place an extra burden on the system. Therefore, you should avoid creating too many sessions and instead use shared sessions whenever possible.

How can I process Write/Read timeout or I/O exceptions?

While uploading data, a Writer triggers a network action each time it has written 8 KB of data. If no network action is triggered within 120 seconds, the server closes the connection. At this point, the Writer becomes unavailable, and you need to open a new Writer to write data.

We recommend that you use the Tunnel SDK BufferedWriter interface to upload data. This interface hides blockId details from the user, has an internal data buffer, and automatically retries on failure.

When downloading data, the Reader has a similar mechanism. If no network I/O occurs for a long period of time, the connection is closed. We recommend that you read data continuously, without interleaving calls to other systems' interfaces between reads.
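If you manage Writers yourself, one client-side precaution is to track how long a connection has been idle and to open a new Writer before the 120-second limit is reached. The sketch below is a hypothetical helper (IdleTracker is not an SDK class), and the safety margin is an assumption of this example. It only approximates the server's view, because the client cannot observe exactly when the SDK sends data.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not part of the Tunnel SDK. */
final class IdleTracker {
    /** Idle limit taken from the text: the server closes after 120 seconds. */
    static final Duration SERVER_IDLE_LIMIT = Duration.ofSeconds(120);

    private final long thresholdNanos;
    private long lastActivityNanos;

    /**
     * @param margin   assumed safety margin subtracted from the server limit
     * @param nowNanos current time, for example from System.nanoTime()
     */
    IdleTracker(Duration margin, long nowNanos) {
        if (margin.isNegative() || margin.compareTo(SERVER_IDLE_LIMIT) >= 0) {
            throw new IllegalArgumentException("margin must be in [0 s, 120 s)");
        }
        this.thresholdNanos = SERVER_IDLE_LIMIT.minus(margin).toNanos();
        this.lastActivityNanos = nowNanos;
    }

    /** Call after each write (or read) that is expected to reach the network. */
    void recordActivity(long nowNanos) {
        lastActivityNanos = nowNanos;
    }

    /** True when the connection has been idle long enough that it should be reopened. */
    boolean shouldReopen(long nowNanos) {
        return nowNanos - lastActivityNanos >= thresholdNanos;
    }
}
```

A caller would check shouldReopen(System.nanoTime()) before resuming writes after a pause, and open a new Writer with a fresh block ID if it returns true.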

Is MaxCompute Tunnel suitable for batch uploading or stream uploading?

MaxCompute Tunnel is designed for batch uploading rather than stream uploading. For stream uploading, use DataHub, the high-speed streaming data channel, which writes data with only milliseconds of latency.

Are partitions required for data uploading through MaxCompute Tunnel?

Yes. MaxCompute Tunnel does not create partitions automatically, so the target partition must exist before you upload data to it.

What is the relationship between Dship and MaxCompute Tunnel?

Dship is a tool that uploads and downloads data through MaxCompute Tunnel.

Does data uploaded with Tunnel append to or overwrite existing data on a file?

The uploaded data appends to the file.

What is the routing function of MaxCompute Tunnel?

The routing function allows the Tunnel SDK to obtain the Tunnel endpoint from the MaxCompute endpoint. That is, the Tunnel SDK runs properly once you have set only the endpoint of MaxCompute.

How much data in a block is preferred when uploading data with MaxCompute Tunnel?

There is no absolute answer to this question. It depends on a variety of factors, such as network performance, real-time requirements, how the data will be used, and the number of small files in the cluster. Generally, we recommend that you keep each block between 64 MB and 256 MB if the data volume is relatively large and data needs to be uploaded continuously.

However, if only one batch of data is uploaded per day, you can raise the block size to around 1 GB.
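To see how a chosen block size interacts with the 20,000-block limit per session, the arithmetic can be sketched as below. BlockPlan is a hypothetical helper, not an SDK class. It computes how many blocks a given data volume needs at a target block size and rejects plans that would exceed the limit.

```java
/** Hypothetical helper for illustration; not part of the Tunnel SDK. */
final class BlockPlan {
    static final long MB = 1024L * 1024L;
    /** Limit taken from the text: at most 20,000 blocks per upload session. */
    static final long MAX_BLOCKS = 20_000L;

    private BlockPlan() {}

    /** Returns ceil(totalBytes / targetBlockBytes), or fails if that exceeds the block limit. */
    static long blockCount(long totalBytes, long targetBlockBytes) {
        if (totalBytes <= 0 || targetBlockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // Ceiling division written so that it cannot overflow for positive inputs.
        long blocks = (totalBytes - 1) / targetBlockBytes + 1;
        if (blocks > MAX_BLOCKS) {
            throw new IllegalArgumentException(
                    blocks + " blocks needed; use a larger block size or more sessions");
        }
        return blocks;
    }
}
```

For example, 1 GB of data at 256 MB per block needs 4 blocks, while more than 20,000 blocks' worth of data at 64 MB per block would not fit in a single session.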

Why do I keep getting a timeout prompt when using MaxCompute Tunnel?

This usually happens due to endpoint errors. Please check the endpoint configuration. A simple method is to check the network connectivity by using tools like telnet.
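The same connectivity check can be done from Java, which is convenient when telnet is not installed on the machine that runs the upload. The sketch below is a generic TCP probe using only the JDK. EndpointProbe is a hypothetical name, and the host and port you pass in are whatever your endpoint configuration specifies.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper for illustration; not part of the Tunnel SDK. */
final class EndpointProbe {
    private EndpointProbe() {}

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers unresolved hosts, refused connections, and timeouts.
            return false;
        }
    }
}
```

A false result only shows that the host and port are unreachable from this machine. A true result confirms network reachability but not that the endpoint is the correct one for your project.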

Why do I receive the exception, “You have NO privilege ‘odps:Select’ on {acs:odps:*:projects/XXX/tables/XXX}. project ‘XXX’ is protected” when I use Tunnel to download data?

The data protection function has been enabled for the project. Only the project owner has the right to transfer data from one project to another if the project data is protected.

Why do I receive the exception, “ErrorCode=FlowExceeded, ErrorMessage=Your flow quota is exceeded” when I use Tunnel to upload data?

The maximum number of concurrent requests is exceeded. By default, MaxCompute Tunnel allows a maximum of 2,000 concurrent upload and download requests (quota). Each request, once it is sent, occupies one quota unit until it ends. Try the following solutions:

  1. Have the client sleep for a while, and then try again.
  2. Increase the tunnel concurrency quota for the project. We recommend that you contact the administrator to evaluate the traffic flow.
  3. Report the exception to the project owner to identify and control the top concurrency quota consumers.
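The first solution (sleep, then retry) can be sketched as a small retry loop with exponential backoff. BackoffRetry is a hypothetical helper, not an SDK class. The doubling delay, the 60-second cap, and matching on the text "FlowExceeded" in the exception message are assumptions of this example.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper for illustration; not part of the Tunnel SDK. */
final class BackoffRetry {
    private BackoffRetry() {}

    /**
     * Runs the task, sleeping and retrying while it fails with a retryable error.
     * The delay doubles after each failed attempt, capped at 60 seconds (assumed values).
     */
    static <T> T call(Callable<T> task, Predicate<Exception> retryable,
                      int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                // Give up on the last attempt or on errors that retrying cannot fix.
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;
                }
                Thread.sleep(delay);
                delay = Math.min(delay * 2, 60_000L);
            }
        }
    }

    /** Assumption: the quota error surfaces with "FlowExceeded" in the exception message. */
    static boolean isFlowExceeded(Exception e) {
        return e.getMessage() != null && e.getMessage().contains("FlowExceeded");
    }
}
```

An upload step could then be wrapped as BackoffRetry.call(task, BackoffRetry::isFlowExceeded, 5, 1000L). Retrying only reduces pressure temporarily, so persistent FlowExceeded errors still call for the second or third solution.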

To learn more about Alibaba Cloud MaxCompute, visit https://www.alibabacloud.com/product/maxcompute

Reference: https://www.alibabacloud.com/blog/maxcompute-tunnel-offline-batch-data-channel-faqs_594417?spm=a2c41.12532361.0.0
