MaxCompute Table Design Specification

MaxCompute Table Restrictions

Table Design Specification — Main Goals of Table Design

  1. Reduce storage costs. Appropriate table design reduces redundant storage in a hierarchical data design, and therefore the volume of intermediate tables. Correctly managing the lifecycle of table data further cuts the amount of data kept in storage and directly lowers storage costs.
  2. Reduce computing costs. Standardized table design lets subsequent queries and computations read the data optimally, avoiding redundant reads, writes, and computation; this lowers costs while improving computing performance.
  3. Reduce maintenance complexity. A standardized table hierarchy directly reflects the characteristics of the business. For example, by optimizing the data collection method in the data channel and standardizing the table design, the small-file problem in the distributed system, as well as the number of tables and partitions to maintain, can be reduced, thereby reducing complexity.

Impact of Table Design

  1. Query and compute immediately after import.
  2. Import multiple times, then query and compute periodically.
  3. Generate intermediate tables from the imported data for further computation.
  4. Reasonable table design and management of the data integration cycle can reduce the cost of data in storage. MaxCompute is positioned first and foremost for batch data integration, and computation is organized by business logic, such as by partition.
  5. When querying and computing immediately after import, consider the amount of data imported each time and avoid streaming in small amounts of data (see the sketch after this list).
  6. Unreasonable data import and storage (small files) affect overall storage performance, computing performance, and operational stability.
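
As a concrete illustration of batch rather than streaming writes, the following MaxCompute SQL sketch accumulates data in a staging table and writes a whole day's partition in one batch; the table and column names (ods_log_staging, ods_log, event_time) are hypothetical, not from the original:

    -- Write one day's data into its target partition in a single batch,
    -- instead of many small streaming inserts that create small files.
    INSERT OVERWRITE TABLE ods_log PARTITION (dt='20240101')
    SELECT user_id, event, event_time
    FROM   ods_log_staging
    WHERE  TO_CHAR(event_time, 'yyyymmdd') = '20240101';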

Steps of Table Design

  1. Determine the project to which the table belongs. Plan the table type and determine which data level it belongs to according to the business process.
  2. Define table descriptions, including permission definitions and Owner definitions.
  3. Define a partitioned table or a non-partitioned table based on the amount of data and the characteristics of data integration.
  4. Define the fields and the partition fields.
  5. Create the table, or transform an existing table.
  6. Identify the relevant factors of the imported data scenario (including batch data writing/streaming data writing/striped data insertion).
  7. Define the lifecycle of table and partition data.
Note:
  1. After the table is created, the table schema can be modified as the business changes, for example by setting the lifecycle or Range Clustering (see the sketch after this list).
  2. During the design phase, pay special attention to distinguishing the data scenario (batch data writing / streaming data writing / periodic striped data insertion).
  3. Use non-partitioned and partitioned tables appropriately. Partitioned tables, partitioned by time, are recommended for log tables, fact tables, raw collection tables, and so on.
  4. Take note of the restrictions on tables and partitions.
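
A minimal MaxCompute SQL sketch of steps 3 to 7 and note 1, assuming a hypothetical daily log table ods_app_log; the 30-day lifecycle, the 90-day extension, and the bucket count are illustrative choices, not values from the original:

    -- Create a time-partitioned table with an explicit lifecycle (in days).
    CREATE TABLE IF NOT EXISTS ods_app_log (
        user_id    BIGINT,
        event      STRING,
        event_time DATETIME
    )
    PARTITIONED BY (dt STRING)
    LIFECYCLE 30;

    -- Transform the table later as the business changes:
    ALTER TABLE ods_app_log SET LIFECYCLE 90;      -- extend retention

    -- Apply Range Clustering to the existing table.
    ALTER TABLE ods_app_log
        RANGE CLUSTERED BY (user_id) SORTED BY (user_id) INTO 1024 BUCKETS;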

Table Data Storage Specification

Standardize Data Lifecycle by Data Hierarchy

Standardize Data Preservation by Data Change and History
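
For example, retention typically varies by warehouse layer. A minimal sketch, assuming a conventional ODS/DWD/ADS layering with hypothetical table names; the day counts are illustrative, not prescribed by the original:

    -- Raw collection data expires sooner; aggregated layers live longer.
    ALTER TABLE ods_app_log    SET LIFECYCLE 30;    -- operational data store
    ALTER TABLE dwd_user_event SET LIFECYCLE 180;   -- detail/warehouse layer
    ALTER TABLE ads_user_stats SET LIFECYCLE 365;   -- application layer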

Data Import Channel and Table Design

  1. DataHub: plan the relationship between the partitions being written and the write traffic, and commit once every 64 MB.
  2. Data Integration or DataX: plan the frequency of writing to table partitions, commit once every 64 MB, and avoid committing empty directories.
  3. DTS: plan the relationship between the table's existing-data partition and its incremental partitions, and set the commit frequency accordingly.
  4. Console (Run SQL or Tunnel upload): avoid high-frequency inserts or uploads of small amounts of data.
  5. INSERT INTO through the SDK (Run SQL): after inserting into a table or partition, organize the resulting small files; avoid inserting into a partitioned or non-partitioned table many times without merging afterward (a merge sketch follows the notes below).
Note:
  1. Data can be imported into MaxCompute only through the Tunnel SDK (and the channels built on it) or by executing INSERT INTO in SQL; avoid streaming insertion.
  2. Each of the channels above has its own logic for streaming data writing, batch data writing, and periodic scheduled writing.
  3. When writing to a table or partition, the data channel should keep the amount of data written per commit at a reasonable size, such as 64 MB or more.
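
If small files do accumulate after repeated inserts, MaxCompute can merge them in place. A minimal sketch, reusing the hypothetical ods_app_log table from earlier:

    -- Merge the small files in one partition after repeated inserts.
    ALTER TABLE ods_app_log PARTITION (dt='20240101') MERGE SMALLFILES;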

Relationship between Partition Design and Logical Storage

  1. The number of partitions has an upper limit.
  2. The amount of data in each partition must not be too small.
  3. Query and compute according to the partitioning criteria.
  4. Avoid writing data to each partition more than one time.

Basic Rules for Table and Partition Design

  1. The business type of the table needs to be distinguishable.
  2. Distinguish whether the table is a “fact table”, “dimension table”, “log table”, or “limit storage table” (a feature to be released).
  3. The entity information of the table needs to be distinguishable.
  4. Fields with the same business meaning in different tables should be defined using a uniform data type.
  5. Avoid unnecessary type conversions.
Note:
  1. Adding partitions is supported; adding partition columns is not.
  2. A single table supports at most 60,000 partitions.
  3. To add a partition to a table with multi-level partitions, specify the partition values at every level.
  4. The name of a partition column cannot be modified; only the values in that column can be altered. To change the values of one or more levels in a multi-level partition, specify values for the partitions at each level (see the sketch after this list).
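
A minimal sketch of these rules in MaxCompute SQL, assuming a hypothetical two-level partitioned table sale_detail:

    -- Multi-level partitions: all levels must be specified when adding.
    CREATE TABLE IF NOT EXISTS sale_detail (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (sale_date STRING, region STRING);

    ALTER TABLE sale_detail
        ADD IF NOT EXISTS PARTITION (sale_date='20240101', region='hangzhou');

    -- Partition values can be changed, but partition column names cannot;
    -- every level must again be specified.
    ALTER TABLE sale_detail
        PARTITION (sale_date='20240101', region='hangzhou')
        RENAME TO PARTITION (sale_date='20240102', region='hangzhou');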

Partition Design

Selection of Partition Field and Normal Field

  1. Fields that delimit the data management scope or the commonly used data scanning range can be set as partition fields.
  2. Fields whose values are irregular or exceed 10,000 distinct values, and that are rarely used as query criteria, should be set as normal fields (a quick cardinality check is sketched below).
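
A minimal sketch for checking a candidate field's cardinality before promoting it to a partition field; the table and column names are hypothetical:

    -- A large distinct count (e.g. more than 10,000) suggests the field
    -- should stay a normal column rather than become a partition column.
    SELECT COUNT(DISTINCT city) AS city_cardinality
    FROM   ods_app_log;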

Definition Basis of Partition Field

  1. Take time fully into account when selecting partition columns, and try to avoid updates to existing partitions.
  2. When multiple fact tables (excluding dimension tables) are joined, choose the columns that appear in the WHERE clause as partition columns.
  3. Choose columns referenced in GROUP BY or DISTINCT as partition columns.
  4. Choose a column whose values are evenly distributed, rather than skewed, as the partition column.
  5. If a column is commonly referenced in the “=” and “in” conditions of SQL statements, choose that column as the partition column, for example:
SELECT ... FROM table WHERE id = 123 AND ...;
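
When the partition columns themselves appear in such conditions, MaxCompute prunes the non-matching partitions instead of scanning the whole table. A minimal sketch, reusing the hypothetical sale_detail table from earlier:

    -- Only the matching partitions are read (partition pruning).
    SELECT order_id, amount
    FROM   sale_detail
    WHERE  sale_date = '20240101' AND region IN ('hangzhou', 'shanghai');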

Definition Basis of the Number of Partitions

  1. Time partition: partition by day or by month. If you also partition by hour, the average number of second-level partitions should not exceed 8.
  2. Geographic partition: partition by province, city, or county, and consider multi-level partitioning. For example: 23 provinces, 5 autonomous regions, 4 municipalities, and 2 special administrative regions; 50 regions (prefectures, leagues); 661 cities, including 4 municipalities, 283 prefecture-level cities, and 374 county-level cities; and 1,636 counties (autonomous counties, banners, autonomous banners, special zones, and forest areas). After partitioning down to the finest county level, do not partition further by hour (see the sketch below).
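
A minimal sketch of a two-level time/geographic design with hypothetical names; adding city or county as further levels would multiply the partition count quickly:

    -- Day as the first partition level, province as the second.
    CREATE TABLE IF NOT EXISTS dwd_orders (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (dt STRING, province STRING);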

Recommendations on the Number of Partitions and the Amount of Data

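One way to sanity-check a design against these recommendations is to inspect partition counts and per-partition sizes directly. A minimal sketch, reusing the hypothetical dwd_orders table:

    -- List all partitions; the count must stay well under the
    -- 60,000-partition limit noted earlier.
    SHOW PARTITIONS dwd_orders;

    -- Inspect a single partition; individual partitions should not
    -- hold too little data.
    DESC dwd_orders PARTITION (dt='20240101', province='zhejiang');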
