AliORC: A Combination of MaxCompute and Apache ORC

About the Author

Wu Gang is a senior technical expert in the Alibaba Computing Platform Department and a PMC member of Apache ORC, a top-level Apache open-source project. He is mainly responsible for storage-related work on the MaxCompute platform. Previously, he worked at Uber headquarters on Spark and Hive.

Introduction to the Apache ORC Project and Alibaba’s Contribution

Apache ORC Project

As described on the official Apache ORC project website, Apache ORC is the fastest and smallest column-based storage file format in the Hadoop ecosystem. The three main features of Apache ORC include support for ACID (that is, support for transactions), support for built-in indexes, and support for various complex types.

ORC Adopters

Apache ORC has many adopters, including well-known open-source software such as Spark, Presto, Hive, and Hadoop. In 2017, the Alibaba MaxCompute technical team also began participating in the Apache ORC project and adopted ORC as one of the built-in file storage formats of MaxCompute.


The following figure shows the general development history of the Apache ORC project. In early 2013, Hortonworks began developing ORC to replace the RCFile file format. After two release iterations, ORC became a top-level Apache project and was separated from Hive into an independent project. In January 2017, the Alibaba Cloud MaxCompute team began contributing code to the ORC community on an ongoing basis and made ORC one of the built-in MaxCompute file formats.

Contribution from Alibaba

The Alibaba MaxCompute technical team has made significant contributions to the Apache ORC project, such as developing a complete C++ ORC Writer, fixing several critical bugs, and greatly improving ORC performance. The team has submitted more than 30 patches to the Apache ORC project, totaling over 15,000 lines of code, and Alibaba continues to contribute code to ORC. The team includes three ORC project contributors, among them one PMC member and one Committer. At the Hadoop Summit in 2017, Owen O’Malley of the ORC project dedicated a slide of his presentation to acknowledging Alibaba's contribution to ORC.

Why Did Alibaba Cloud MaxCompute Choose ORC?

Row-Based vs. Column-Based Storage

Two mainstream methods are available for file storage: row-based storage and column-based storage. Row-based storage stores the rows one after another: the first row, then the second row, and so on. Column-based storage stores the columns one after another: all values of the first column, then all values of the second column, and so on. In big data scenarios, a query usually needs only some of the columns. With column-based storage, only those columns have to be read, which saves a great deal of disk and network I/O. In addition, values in the same column have similar attributes and high redundancy, so column-based storage achieves a higher compression ratio and saves disk space. For these reasons, MaxCompute chose column-based storage.
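The difference between the two layouts can be sketched with a toy in-memory table. This is only an illustration: the table, its column names, and the helper functions are made up for the example and have nothing to do with how ORC or MaxCompute lay out bytes on disk.

```python
# Toy table: three rows, three columns (all names and values are illustrative).
rows = [
    {"user_id": 1, "city": "Hangzhou", "amount": 10.5},
    {"user_id": 2, "city": "Hangzhou", "amount": 7.0},
    {"user_id": 3, "city": "Beijing", "amount": 3.25},
]


def to_row_layout(rows):
    """Row-based storage: the values of one row sit next to each other."""
    return [value for row in rows for value in row.values()]


def to_column_layout(rows):
    """Column-based storage: the values of one column sit next to each other."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


row_layout = to_row_layout(rows)
column_layout = to_column_layout(rows)

# A query such as SUM(amount) has to walk over every value in the row layout,
# but only over one of the three column chunks in the column layout.
values_scanned_row = len(row_layout)
values_scanned_column = len(column_layout["amount"])
total_amount = sum(column_layout["amount"])
```

In this sketch the row layout forces the query to pass over nine values, while the column layout touches only the three values of the amount column. The city column also shows why columns compress well: neighboring values repeat.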

A Quick Look at ORC

ORC models its type system as a tree. A complex type such as Struct has one or more child nodes, the Map type has two child nodes (key and value), the List type has a single child node, and all other common types are leaf nodes. As shown in the following figure, the table structure on the left converts directly into the tree structure on the right, which is simple and intuitive.
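A minimal sketch of such a type tree follows. The table schema, the TypeNode class, and the helper names are hypothetical and chosen only for illustration. The pre-order numbering of nodes reflects how ORC is generally understood to assign column ids, but treat that detail as an assumption here and not as a statement about any specific ORC release.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TypeNode:
    kind: str                      # e.g. "struct", "map", "list", "int", "string"
    name: str = ""                 # field name inside the parent ("" for the root)
    children: List["TypeNode"] = field(default_factory=list)
    column_id: int = -1            # filled in by assign_ids


def assign_ids(node, next_id=0):
    """Number the nodes in pre-order and return the next unused id."""
    node.column_id = next_id
    next_id += 1
    for child in node.children:
        next_id = assign_ids(child, next_id)
    return next_id


def flatten(node, prefix=""):
    """Return (column_id, dotted path, kind) for every node in the tree."""
    label = prefix + node.name if node.name else "<root>"
    out = [(node.column_id, label, node.kind)]
    child_prefix = label + "." if node.name else ""
    for child in node.children:
        out.extend(flatten(child, child_prefix))
    return out


# Hypothetical table: struct<id:int, tags:list<string>, attrs:map<string,int>>
schema = TypeNode("struct", children=[
    TypeNode("int", "id"),
    TypeNode("list", "tags", [TypeNode("string", "element")]),
    TypeNode("map", "attrs", [TypeNode("string", "key"), TypeNode("int", "value")]),
])
node_count = assign_ids(schema)
columns = flatten(schema)
```

The root Struct has three children, the List has one child, the Map has two, and the primitive types are leaves, matching the description above.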

How About Apache Parquet?

In the open-source software field, Apache Parquet is the main counterpart of Apache ORC. Parquet was jointly developed by Cloudera and Twitter and was inspired by the Dremel paper published by Google. The concept of Parquet is very similar to that of ORC: it also splits files into blocks of similar size and uses column-based storage within each block. Its support in open-source systems such as Spark and Presto is almost the same as that of ORC. It also uses common compression and encoding algorithms and provides lightweight indexes and statistics.

Benchmark: ORC vs. Parquet


Using GitHub log data and New York City taxi data, the Hadoop open-source community has compared the performance of ORC and Parquet and published some statistics.

Storage Cost

The following figure compares the storage size of file formats such as ORC, Parquet, and JSON. As the Taxi Size chart shows, Parquet and ORC produce files of very similar size.

Full Table Scan

The following figure compares the full table scan performance of ORC and Parquet on the two datasets. In general, ORC is faster than Parquet. Based on this comparison, MaxCompute chose ORC because of its simpler design and higher table reading performance.

AliORC = Alibaba ORC

As the benchmark above shows, MaxCompute chose ORC mainly for performance reasons. ORC also has other advantages over Parquet, such as a simpler design, better code quality, language independence, and efficient support for multiple open-source projects, as mentioned earlier. In addition, the ORC R&D team is relatively concentrated and the founder has strong control over the project, so the demands and ideas raised by Alibaba receive quick responses and strong support, which has enabled Alibaba to become a leader in the community.

What Is the Difference Between AliORC and Open-Source ORC?

AliORC Is More than Apache ORC

AliORC is a deeply optimized file format based on the open-source Apache ORC. Its primary goal is to remain fully compatible with open-source ORC, so that it is convenient for users to adopt. AliORC improves on open-source ORC in two ways. First, it provides extended features, such as support for Clustered Index, C++ Arrow, and predicate push-down. Second, it optimizes performance through Async Prefetch, I/O mode management, and adaptive dictionary encoding.

AliORC Optimization #1: Async Prefetch

This section describes several optimizations that AliORC makes over open-source ORC. The first is Async Prefetch. The traditional way of reading a file is to fetch the raw data from the underlying file system first, and then decompress and decode it. The first step is I/O-intensive and the second is CPU-intensive, but the two steps do not run in parallel, so the end-to-end time is longer than necessary and resources are wasted. AliORC runs the file system reads in parallel with decompression and decoding by making all disk reads asynchronous. All read requests are issued in advance, and when the data is actually needed, the reader checks whether the corresponding asynchronous request has completed. If it has, decompression and decoding start immediately without waiting for the disk, which greatly improves parallelism and reduces the time required to read files.
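The idea can be sketched with a thread pool: issue every read up front, then decompress each chunk as soon as its read has finished. This is only an illustration of the pattern, not AliORC code. The STORAGE dictionary stands in for a file system, and read_chunk, decode_chunk, and read_with_prefetch are hypothetical names.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a file system: chunk id -> compressed bytes.
STORAGE = {i: zlib.compress(f"chunk-{i}".encode() * 100) for i in range(4)}


def read_chunk(chunk_id):
    """I/O-bound step: fetch the raw compressed bytes of one chunk."""
    return STORAGE[chunk_id]


def decode_chunk(raw):
    """CPU-bound step: decompress one chunk (decoding is omitted here)."""
    return zlib.decompress(raw)


def read_sequential(chunk_ids):
    """Traditional path: each read is followed by its decompression."""
    return [decode_chunk(read_chunk(c)) for c in chunk_ids]


def read_with_prefetch(chunk_ids, max_workers=4):
    """Prefetch path: all reads are submitted before any chunk is decoded."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read_chunk, c) for c in chunk_ids]
        # result() returns at once if the read already completed and
        # blocks only when the data has not arrived yet.
        return [decode_chunk(f.result()) for f in futures]
```

Both functions return the same data. With real storage latency, the prefetch variant lets later reads proceed while earlier chunks are being decompressed.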

AliORC Optimization #2: Small I/O Elimination

The second optimization of AliORC is the elimination of small I/O. In an ORC file, different columns have very different sizes, but data is read from disk one column at a time. As a result, for columns with small data volumes, the relative network I/O overhead of each read is very large. To eliminate this overhead, the AliORC Writer sorts columns by data volume and places the small columns next to each other, so that the Reader can fetch them together as one large I/O block. This reduces the number of small I/O requests and also greatly improves parallelism.
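A rough sketch of this idea follows, under stated assumptions: the column sizes are invented, the writer simply orders columns by ascending size, and the reader merges contiguous ranges as long as the merged request stays within a hypothetical max_merged limit. The actual AliORC layout and merge policy are not described in this article.

```python
def layout_columns(column_sizes):
    """Writer side: place columns in ascending size order.

    Returns {column name: (offset, length)} for a single contiguous region.
    """
    offset = 0
    layout = {}
    for name, size in sorted(column_sizes.items(), key=lambda kv: kv[1]):
        layout[name] = (offset, size)
        offset += size
    return layout


def plan_reads(layout, wanted, max_merged=4096):
    """Reader side: merge contiguous ranges into requests of at most max_merged bytes."""
    requests = []
    for offset, length in sorted(layout[name] for name in wanted):
        if requests:
            last_offset, last_length = requests[-1]
            contiguous = last_offset + last_length == offset
            if contiguous and last_length + length <= max_merged:
                requests[-1] = (last_offset, last_length + length)
                continue
        requests.append((offset, length))
    return requests


# Hypothetical column sizes in bytes.
sizes = {"payload": 200000, "flag": 100, "comment": 50000, "age": 300, "id": 800}
layout = layout_columns(sizes)
all_requests = plan_reads(layout, list(sizes))
```

With these sizes, the three small columns end up adjacent and are fetched in one request, so five column reads become three I/O requests.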

AliORC Optimization #3: Memory Management for Streams in Each Column

The third optimization of AliORC is memory management. In the open-source ORC implementation, the Writer uses a large buffer, 1 MB by default, to store the compressed data of each column. The reasoning is that a larger buffer yields a higher compression ratio. However, as mentioned above, different columns have different data volumes, and some columns never fill a 1 MB buffer, which wastes a significant amount of memory. A simple way to avoid this waste is to start with a small buffer and allocate more as needed. If more data has to be written, a larger buffer is provided through a Resize method similar to that of C++ std::vector. In the original implementation, each Resize is an O(N) operation that copies the data from the old buffer to the new one, which is unacceptable for performance. Therefore, AliORC developed a new memory management structure that allocates 64 KB blocks that are not contiguous with each other. This required a lot of code changes, but the changes were worthwhile. In many scenarios, the original Resize approach consumes so much memory that the task runs out of memory and cannot complete, whereas the new approach greatly reduces peak memory usage.
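The block-based structure can be sketched as follows. ChunkedBuffer is a hypothetical name and this Python class only illustrates the shape of the data structure: fixed-size blocks are appended on demand and existing data is never copied when the buffer grows. Python's own memory management hides the real allocation costs, so the sketch says nothing about actual savings.

```python
BLOCK_SIZE = 64 * 1024  # 64 KB, the block size mentioned above


class ChunkedBuffer:
    """Growable byte buffer made of fixed-size, non-contiguous blocks."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = []   # each entry is a preallocated bytearray
        self.size = 0      # total number of bytes written so far

    def write(self, data):
        view = memoryview(data)
        while len(view) > 0:
            # Allocate a new block only when every existing block is full.
            if self.size == len(self.blocks) * self.block_size:
                self.blocks.append(bytearray(self.block_size))
            used = self.size % self.block_size
            n = min(len(view), self.block_size - used)
            self.blocks[-1][used:used + n] = view[:n]
            self.size += n
            view = view[n:]

    def capacity(self):
        """Bytes currently allocated, always a multiple of block_size."""
        return len(self.blocks) * self.block_size

    def to_bytes(self):
        """Concatenate the written part of all blocks."""
        out = bytearray()
        remaining = self.size
        for block in self.blocks:
            n = min(remaining, self.block_size)
            out += block[:n]
            remaining -= n
        return bytes(out)
```

A column that writes only 100 bytes holds a single 64 KB block in this sketch, and a column that keeps growing adds blocks one at a time without copying what was already written.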

AliORC Optimization #4: Seek Read

The fourth optimization of AliORC is the Seek Read optimization. This part is somewhat complicated, so an example is used to explain it. The original problem with Seek Read is that compression blocks are large and each compression block may contain several row groups. In the figure, every 10,000 rows of data form a row group. In a Seek Read scenario, the reader may need to seek to a position in the middle of the file, which may fall in the middle of a compression block. For example, in the figure, row group 7 is contained in block 2. A normal Seek operation jumps to the head of block 2, decompresses the data before row group 7, and only then reaches row group 7. However, the data in the green part of the figure is not needed, so it is decompressed in vain, wasting a lot of computing resources. The idea of AliORC is therefore to align the boundaries of compression blocks with row groups when writing the file, so that seeking to any row group requires no unnecessary decompression.
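The cost of an unaligned seek can be sketched with a small calculation. All numbers below are made up for illustration (300-byte row groups, 1,000-byte compression blocks), and wasted_bytes is a hypothetical helper. Offsets refer to positions in the uncompressed stream.

```python
from bisect import bisect_right


def wasted_bytes(block_starts, target_offset):
    """Bytes decompressed and discarded before target_offset is reached.

    block_starts is the sorted list of uncompressed offsets at which
    compression blocks begin; a seek must start at one of them.
    """
    i = bisect_right(block_starts, target_offset) - 1
    return target_offset - block_starts[i]


# Ten hypothetical row groups of 300 bytes each.
row_group_starts = [i * 300 for i in range(10)]

# Unaligned writer: a new compression block every 1000 bytes.
unaligned_blocks = [0, 1000, 2000]

# Aligned writer: a compression block starts at every row group boundary.
aligned_blocks = row_group_starts

target = row_group_starts[7]  # row group 7 starts at offset 2100
waste_unaligned = wasted_bytes(unaligned_blocks, target)
waste_aligned = wasted_bytes(aligned_blocks, target)
```

In this toy setup, row group 7 falls inside block 2, so 100 bytes are decompressed and thrown away. With aligned boundaries the waste drops to zero.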

AliORC Optimization #5: Adaptive Dictionary Encoding

Dictionary encoding first builds a dictionary for a field with many repeated values, and then replaces each original value with its index in the dictionary. This effectively converts the encoding of string data into the encoding of integer data, which can greatly reduce the data volume. However, dictionary encoding in open-source ORC has some problems. First, not all strings are suitable for dictionary encoding. In the original implementation, dictionary encoding is enabled for every column by default, and whether a column is actually suitable is determined only at the end of the file. If it is not, the Writer falls back to non-dictionary encoding. The fallback amounts to rewriting the string data, so its overhead is very high. AliORC instead uses an adaptive algorithm to decide in advance whether a column needs dictionary encoding, which saves a lot of computing resources. Second, open-source ORC uses std::unordered_map from the standard library to implement dictionary encoding, but that implementation is not well suited to MaxCompute data. Google's open-source dense_hash_map library improves write performance by 10%, so AliORC adopts it. Finally, the open-source ORC standard requires the dictionary to be sorted, which is not actually necessary. Removing this restriction improves Writer performance by 3%.
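A minimal sketch of dictionary encoding and of an up-front suitability check follows. The article does not describe the adaptive algorithm that AliORC uses, so should_use_dictionary below is a purely hypothetical heuristic (sample the first values and compare the share of distinct values with a threshold), and sample_size and max_distinct_ratio are invented parameters. The dictionary keeps first-seen order and is not sorted.

```python
def dictionary_encode(values):
    """Replace each value with its index in an unsorted, first-seen dictionary."""
    index_of = {}
    ids = []
    for value in values:
        # setdefault assigns the next free index the first time a value is seen.
        ids.append(index_of.setdefault(value, len(index_of)))
    entries = list(index_of)  # insertion order equals index order
    return entries, ids


def dictionary_decode(entries, ids):
    """Inverse of dictionary_encode."""
    return [entries[i] for i in ids]


def should_use_dictionary(values, sample_size=1000, max_distinct_ratio=0.5):
    """Hypothetical heuristic: decide from a leading sample, before encoding."""
    sample = values[:sample_size]
    if not sample:
        return False
    return len(set(sample)) / len(sample) <= max_distinct_ratio


countries = ["cn", "us", "cn", "cn", "us"]
entries, ids = dictionary_encode(countries)
```

A column of country codes repeats heavily and passes the check, while a column of unique identifiers fails it and would be written without a dictionary from the start, avoiding the costly fallback.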

AliORC Optimization #6: Range Alignment for Range Partition

This optimization concerns range partitioning. As shown in the DDL on the right of the following figure, a table is RANGE CLUSTERED by certain columns, and the data is sorted by those columns. For example, the data is stored in 4 buckets that hold the key ranges 0–1, 2–3, 4–8, and 9 to infinity, respectively. In the implementation, each bucket is stored as one ORC file, and an index similar to a B+Tree is stored at the end of the file. When a query filters on the range key, the index can be used directly to exclude data that does not need to be read, which greatly reduces the amount of data to be retrieved.
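The pruning step can be sketched with a binary search over the bucket boundaries from the example above. This is a simplification under stated assumptions: keys are integers, the sorted list of lower bounds stands in for the B+Tree-like index, and buckets_for_range is a hypothetical helper that works at bucket granularity only.

```python
from bisect import bisect_right

# Lower bounds of the four buckets in the example: 0-1, 2-3, 4-8, 9 to infinity.
BUCKET_LOWER_BOUNDS = [0, 2, 4, 9]


def buckets_for_range(low, high, lower_bounds=BUCKET_LOWER_BOUNDS):
    """Return the indexes of buckets that may hold keys in [low, high]."""
    first = max(bisect_right(lower_bounds, low) - 1, 0)
    last = bisect_right(lower_bounds, high) - 1
    return list(range(first, last + 1))


# WHERE key BETWEEN 3 AND 5 only needs the second and third buckets.
needed = buckets_for_range(3, 5)
```

For a filter such as key BETWEEN 3 AND 5, only two of the four bucket files have to be opened, and the other two are skipped without reading any data.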

Value of AliORC for End Users

The following figure compares the read time of AliORC, open-source ORC C++, and open-source ORC Java in an Alibaba internal test. As shown in the figure, AliORC reads about twice as fast as open-source ORC.

Advantages of Alibaba Cloud MaxCompute Compared with Similar Products

First, MaxCompute works out of the box: users can start the MaxCompute service and run tasks on it directly, without additional setup. By contrast, users of open-source software such as Hive or Spark may run into many bugs that are extremely difficult to troubleshoot, and the bug-fixing cycle of the open-source community can be very long. When users encounter problems with MaxCompute, they receive feedback quickly and bugs are fixed promptly.

Why Did I Join the MaxCompute Team?

Personally, I am optimistic about the big data field. Although the prime time of a technology usually lasts only about 10 years, and big data technology has already existed for 10 years, I believe it will not decline. Especially with the rise of artificial intelligence (AI), big data technology still has many problems to solve and is far from perfect. In addition, the Alibaba MaxCompute team is full of talent. The teams in Beijing, Hangzhou, and Seattle are technically strong, and I can learn a lot from them. Finally, open-source big data products are mostly developed abroad, while MaxCompute is a platform developed entirely in China. Joining the MaxCompute team gives me the opportunity to contribute to domestic software, which makes me very proud.

How Did I Take the Road of Big Data Technology?

I came to big data technology by coincidence. What I learned in school had nothing to do with big data, and my first job was in video coding. I switched to a big data position later, at Uber. When I joined, the Hadoop group at Uber was still being formed. Almost no one was actually using Hadoop, and everyone built their own services to run tasks. After joining the Hadoop group, I learned Scala and Spark from scratch along with the team, moving from learning how to use Spark to studying the Spark source code, and then gradually built up a big data platform. That is how I entered the big data field. After I joined Alibaba, MaxCompute let me experience every stage of a big data product, including requirements, design, development, testing, and final optimization, which has also been a valuable experience.

Working Experience at Alibaba U.S. Office

In fact, working at Alibaba's U.S. office is not very different from working at Alibaba in China. The Seattle office may not have many people, but it plays a vital role in the development of MaxCompute. The members of each business unit in the Seattle office are excellent, and I can interact with colleagues working in different technical directions and come up with new ideas. In addition, Alibaba's U.S. office offers many opportunities for external exchange every year and organizes a lot of open-source sharing sessions.

How Did I Become the First Chinese ORC PMC?

It started because the MaxCompute team needed ORC, but at that time the open-source ORC C++ library had only a Reader and no Writer, so we had to develop our own Writer for ORC C++. After finishing the development, we hoped to draw on the power of open source to improve the C++ Writer further, so we contributed the code back to the open-source community, where it was well received. Based on this work, the ORC open-source community gave the MaxCompute team two Committer positions. After I became a Committer, my responsibilities grew. I not only had to write my own code, but also had to grow along with the community, review the code of other members, and discuss short-term and long-term issues. The ORC community recognized the work of the MaxCompute team and me, and therefore granted me a seat on the PMC. In my view, the ORC work also reflects Alibaba's attitude towards open source: contributions need to be sufficient in quantity and also high in quality.

Concluding Remarks

As long as you are interested in open source and willing to contribute continuously, no matter what kind of background and foundation you have, all your efforts will eventually be recognized.
