What’s New with Mars — Alibaba’s Distributed Scientific Computing Engine

Image for post
Image for post

By Ji Sheng

In March 2020, Mars 0.4.0b1, 0.4.0b2, 0.3.2, and 0.3.3 were released. You can click the different links to view the release notes. Two versions were released this month due to a special circumstance. In V0.4.0b2, the urgent problems in V0.4.0b1 were fixed.

Mars Project Release Cycle

Let’s look at the Mars project release cycle. Generally, the pre-release and official versions of Mars are released at the same time every month. The pre-release version contains more radical functions or changes, which may be unstable, while the stable functions or enhancements are synchronized to the official version.

Click the GitHub project’s milestones to view the latest pre-release and official versions.

Click the GitHub project to view the categorized issues and Pull Requests (PRs).

The v0.4 release contains the ongoing issues and PRs, which are archived by version. In other releases, issues and PRs are categorized by module.

Features in the New Version

We spent a lot of time improving the DataFrame API in the new version. Some common APIs in pandas are supported in this version.

Better Aggregation and Group Aggregation

  • #1030 enabled Groupby.aggregate to support multiple aggregation functions.
  • #1054 added support for DataFrame.aggregate and Series.aggregate.
  • #1019 and #1069 added support for cumulative computing such as cummax.

For example, in pandas, we can perform the following operations on MovieLens data:

We can aggregate the data according to movie IDs to obtain the maximum, minimum, and average values and the standard deviation of user reviews.

When Mars is used:

The code is almost the same, except that Mars needs to be started by execute().

The size of the ratings.csv file is more than 500 MB. We can accelerate it several times over by running Mars on a notebook. As the data size increases, Mars provides better acceleration performance. If a single server cannot meet the requirements, we can use Mars to accelerate the execution by distributing consistent code on multiple servers.

Sorting

  • #1053 added support for sort_index.
  • #1046 added support for sort_values.

We will use MovieLens data as an example again:

and attempt to sort the films in the dataset in descending order by average score.

The code is still the same in Mars.

Mars uses Parallel Sorting by Regular Sampling (PSRS.)

Improved Index Support

The earlier versions of Mars supported iloc but the latest version also supports other indexing methods.

  • #1042 added support for loc.
  • #1101 added support for at and iat.
  • #1073 added support for the md.date_range method.

By using loc, we can search for index-based data more conveniently.

Custom Functions, Strings, and Time Processing

  • #1038 added support for apply.
  • #1063 added support for md.Series.str and md.Series.dt to process strings and time columns.

We can use apply to calculate the distance from each city (dataset) to Hangzhou (120°12' E and 30°16' N).

Moving Window Functions

  • #1045 added support for rolling moving windows.

Moving Window Functions are used frequently in the financial field. Rolling means to perform some aggregation calculations at a fixed length (or a fixed time interval). The following provides an example:

Plans for Upcoming Versions

The next versions of Mars will be V0.4.0rc1 and V0.3.4. We will continue to concentrate on the coverage and performance of DataFrame API, improving stability, and adding documents.

Original Source:

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store