What’s New with Mars — Alibaba’s Distributed Scientific Computing Engine

Mars Project Release Cycle

Features in the New Version

Better Aggregation and Group Aggregation

In [1]: import pandas as pd

In [2]: %%time
   ...: df = pd.read_csv('Downloads/ml-20m/ratings.csv')
   ...: df.groupby('movieId').agg({'rating': ['max', 'min', 'mean', 'std']})
   ...:
   ...:
CPU times: user 5.41 s, sys: 1.28 s, total: 6.7 s
Wall time: 4.3 s
Out[2]:
        rating
           max  min      mean       std
movieId
1          5.0  0.5  3.921240  0.889012
2          5.0  0.5  3.211977  0.951150
3          5.0  0.5  3.151040  1.006642
4          5.0  0.5  2.861393  1.095702
5          5.0  0.5  3.064592  0.982140
...        ...  ...       ...       ...
131254     4.0  4.0  4.000000       NaN
131256     4.0  4.0  4.000000       NaN
131258     2.5  2.5  2.500000       NaN
131260     3.0  3.0  3.000000       NaN
131262     4.0  4.0  4.000000       NaN

[26744 rows x 4 columns]

In [1]: import mars.dataframe as md

In [2]: %%time
   ...: df = md.read_csv('Downloads/ml-20m/ratings.csv')
   ...: df.groupby('movieId').agg({'rating': ['max', 'min', 'mean', 'std']}).execute()
   ...:
   ...:
CPU times: user 5.81 s, sys: 6.9 s, total: 12.7 s
Wall time: 1.54 s
Out[2]:
        rating
           max  min      mean       std
movieId
1          5.0  0.5  3.921240  0.889012
2          5.0  0.5  3.211977  0.951150
3          5.0  0.5  3.151040  1.006642
4          5.0  0.5  2.861393  1.095702
5          5.0  0.5  3.064592  0.982140
...        ...  ...       ...       ...
131254     4.0  4.0  4.000000       NaN
131256     4.0  4.0  4.000000       NaN
131258     2.5  2.5  2.500000       NaN
131260     3.0  3.0  3.000000       NaN
131262     4.0  4.0  4.000000       NaN

[26744 rows x 4 columns]
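One general way a chunked engine can compute max, min, mean, and std per group is to build partial aggregates for each chunk and merge them afterwards. The sketch below uses only the standard library and illustrates that general technique; it is not taken from the Mars source, and the function names (partial_agg, combine, finalize) and the sample data are hypothetical. It also shows why a group with a single rating gets NaN for std: the sample standard deviation divides by n - 1.

```python
import math


def partial_agg(chunk):
    """Map step: per-key partial state [n, sum, sum of squares, min, max]."""
    state = {}
    for key, value in chunk:
        s = state.get(key)
        if s is None:
            state[key] = [1, value, value * value, value, value]
        else:
            s[0] += 1
            s[1] += value
            s[2] += value * value
            s[3] = min(s[3], value)
            s[4] = max(s[4], value)
    return state


def combine(states):
    """Combine step: merge the partial states produced by every chunk."""
    merged = {}
    for state in states:
        for key, (n, total, sumsq, lo, hi) in state.items():
            m = merged.get(key)
            if m is None:
                merged[key] = [n, total, sumsq, lo, hi]
            else:
                m[0] += n
                m[1] += total
                m[2] += sumsq
                m[3] = min(m[3], lo)
                m[4] = max(m[4], hi)
    return merged


def finalize(merged):
    """Reduce step: turn merged partial states into the final statistics."""
    result = {}
    for key, (n, total, sumsq, lo, hi) in merged.items():
        if n > 1:
            # Sample variance (ddof=1); clamp tiny negative rounding errors.
            var = max((sumsq - total * total / n) / (n - 1), 0.0)
            std = math.sqrt(var)
        else:
            std = float("nan")  # a single observation has no sample std
        result[key] = {"max": hi, "min": lo, "mean": total / n, "std": std}
    return result


# Hypothetical (movieId, rating) pairs split across two chunks.
chunks = [
    [(1, 5.0), (2, 3.0), (1, 4.0)],
    [(1, 0.5), (3, 4.0), (2, 3.5)],
]
stats = finalize(combine(partial_agg(c) for c in chunks))
print(stats)
```

The sum-of-squares formula keeps the sketch short; a production implementation would more likely use a numerically stabler merge of per-chunk means and variances.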

Sorting

In [1]: import pandas as pd

In [2]: %%time
   ...: ratings = pd.read_csv('Downloads/ml-20m/ratings.csv')
   ...: movies = pd.read_csv('Downloads/ml-20m/movies.csv')
   ...: movie_rating = ratings.groupby('movieId', as_index=False).agg({'rating': 'mean'})
   ...: result = movie_rating.merge(movies[['movieId', 'title']], on='movieId')
   ...: result.sort_values(by='rating', ascending=False)
   ...:
   ...:
CPU times: user 5.17 s, sys: 1.13 s, total: 6.3 s
Wall time: 4.05 s
Out[2]:
       movieId  rating  title
19152    95517     5.0  Barchester Chronicles, The (1982)
21842   105846     5.0  Only Daughter (2013)
17703    89133     5.0  Boys (Drenge) (1977)
21656   105187     5.0  Linotype: The Film (2012)
21658   105191     5.0  Rocaterrania (2009)
...        ...     ...  ...
26465   129784     0.5  Xuxa in Crystal Moon (1990)
18534    92479     0.5  Kisses for My President (1964)
26475   129834     0.5  Tom and Jerry: The Lost Dragon (2014)
24207   115631     0.5  Alone for Christmas (2013)
25043   119909     0.5  Sharpe's Eagle (1993)

[26744 rows x 3 columns]

In [1]: import mars.dataframe as md

In [2]: %%time
   ...: ratings = md.read_csv('Downloads/ml-20m/ratings.csv')
   ...: movies = md.read_csv('Downloads/ml-20m/movies.csv')
   ...: movie_rating = ratings.groupby('movieId', as_index=False).agg({'rating': 'mean'})
   ...: result = movie_rating.merge(movies[['movieId', 'title']], on='movieId')
   ...: result.sort_values(by='rating', ascending=False).execute()
   ...:
   ...:
CPU times: user 4.97 s, sys: 6.01 s, total: 11 s
Wall time: 1.39 s
Out[2]:
       movieId  rating  title
19152    95517     5.0  Barchester Chronicles, The (1982)
21842   105846     5.0  Only Daughter (2013)
17703    89133     5.0  Boys (Drenge) (1977)
21656   105187     5.0  Linotype: The Film (2012)
21658   105191     5.0  Rocaterrania (2009)
...        ...     ...  ...
26465   129784     0.5  Xuxa in Crystal Moon (1990)
18534    92479     0.5  Kisses for My President (1964)
26475   129834     0.5  Tom and Jerry: The Lost Dragon (2014)
24207   115631     0.5  Alone for Christmas (2013)
25043   119909     0.5  Sharpe's Eagle (1993)

[26744 rows x 3 columns]
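To give a feel for how a descending sort can be split across chunks, here is a minimal standard-library sketch: each chunk is sorted on its own, and the sorted chunks are then merged. This is only one simple approach and is not claimed to be the algorithm Mars uses; the function name sort_chunks_descending and the sample (title, rating) rows are hypothetical.

```python
import heapq
from operator import itemgetter


def sort_chunks_descending(chunks, key):
    """Sort each chunk independently, then merge the sorted chunks."""
    # Each chunk could be sorted by a different worker.
    sorted_chunks = [sorted(chunk, key=key, reverse=True) for chunk in chunks]
    # heapq.merge expects inputs already ordered largest-first when reverse=True.
    return list(heapq.merge(*sorted_chunks, key=key, reverse=True))


# Hypothetical (title, mean rating) rows split across two chunks.
chunks = [
    [("A", 3.5), ("B", 5.0)],
    [("C", 0.5), ("D", 4.0)],
]
ranked = sort_chunks_descending(chunks, key=itemgetter(1))
print(ranked)
```

The final merge here runs in a single process, so it only parallelizes the per-chunk sorting step.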

Improved Index Support

In [1]: import mars.dataframe as md

In [3]: import mars.tensor as mt

In [8]: df = md.DataFrame(mt.random.rand(10000, 10), index=md.date_range('2000-1-1', periods=10000))

In [9]: df.loc['2020-3-25'].execute()
Out[9]:
0    0.372354
1    0.139235
2    0.511007
3    0.102200
4    0.908454
5    0.144455
6    0.290627
7    0.248334
8    0.912666
9    0.830526
Name: 2020-03-25 00:00:00, dtype: float64
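As a quick sanity check on the label lookup above, the following standard-library sketch works out which row position the label '2020-3-25' corresponds to. It assumes md.date_range produces one label per day by default, as pandas.date_range does; the variable names are illustrative only.

```python
from datetime import date, timedelta

start = date(2000, 1, 1)    # first index label in the session above
periods = 10000             # number of rows in the DataFrame
target = date(2020, 3, 25)  # label passed to df.loc

# With a daily index, the row position is the number of days since the start.
position = (target - start).days
last_label = start + timedelta(days=periods - 1)

print(position, 0 <= position < periods, last_label)
```

Because the label falls inside the generated range, the lookup returns a single row, which is why the result is a Series of the ten column values.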

Custom Functions, Strings, and Time Processing

In [1]: import numpy as np

In [2]: def haversine(lat1, lon1, lat2, lon2):
   ...:     dlon = np.radians(lon2 - lon1)
   ...:     dlat = np.radians(lat2 - lat1)
   ...:     a = np.sin(dlat / 2) ** 2 + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.sin(dlon / 2) ** 2
   ...:     c = 2 * np.arcsin(np.sqrt(a))
   ...:     r = 6371
   ...:     return c * r
   ...:

In [4]: import mars.dataframe as md

In [5]: df = md.read_csv('Downloads/world-cities-database/worldcitiespop.csv', chunk_bytes='16M', dtype={'Region': object}
   ...: )

In [6]: df.execute(fetch=False)

In [8]: df.apply(lambda r: haversine(r['Latitude'], r['Longitude'], 30.25, 120.17), result_type='reduce', axis=1).execute()
Out[8]:
0          9789.135208
1          9788.270528
2          9788.270528
3          9788.270528
4          9789.307210
              ...
248061    10899.720735
248062    11220.703197
248063    10912.645753
248064    11318.038981
248065    11141.080171
Length: 3173958, dtype: float64
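Before applying a custom function to millions of rows, it can help to check it on inputs whose answer is known. The sketch below restates the same haversine formula with the standard math module (the name haversine_km and the radius_km parameter are introduced here for illustration) and tests two easy cases: the distance from the reference point to itself, and one degree of longitude along the equator.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    dlon = math.radians(lon2 - lon1)
    dlat = math.radians(lat2 - lat1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return 2 * math.asin(math.sqrt(a)) * radius_km


# Distance from the reference point used in the session above to itself.
same_point = haversine_km(30.25, 120.17, 30.25, 120.17)

# One degree of longitude on the equator is 1/360 of the circumference.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
expected_one_degree = 2 * math.pi * 6371 / 360

print(same_point, one_degree, expected_one_degree)
```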

Moving Window Functions

In [1]: import pandas_datareader.data as web

In [2]: data = web.DataReader("^TWII", "yahoo", "2000-01-01","2020-03-25")

In [3]: import mars.dataframe as md

In [4]: df = md.DataFrame(data)

In [5]: df.rolling(10, min_periods=1).mean().execute()
Out[5]:
                    High           Low          Open         Close     Volume     Adj Close
Date
2000-01-04   8803.610352   8642.500000   8644.910156   8756.549805        0.0   8756.517578
2000-01-05   8835.645020   8655.259766   8667.754883   8803.209961        0.0   8803.177734
2000-01-06   8898.426758   8714.809896   8745.356445   8842.816732        0.0   8842.784180
2000-01-07   8909.012451   8720.964844   8772.374756   8844.580078        0.0   8844.547607
2000-01-10   8952.413867   8755.129883   8806.285742   8896.183984        0.0   8896.151172
...                  ...           ...           ...           ...        ...           ...
2020-03-19  10423.317090  10083.132910  10370.730078  10180.533887  4149640.0  10180.533887
2020-03-20  10202.623047   9833.786914  10105.280078   9971.761914  4366130.0   9971.761914
2020-03-23   9983.399023   9611.036914   9885.659082   9763.000977  3990040.0   9763.000977
2020-03-24   9821.716016   9436.392969   9703.275098   9591.208984  3927690.0   9591.208984
2020-03-25   9685.129980   9290.444922   9543.636035   9466.308984  4003760.0   9466.308984

[4974 rows x 6 columns]
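To make the semantics of rolling(10, min_periods=1).mean() concrete, here is a small standard-library sketch of a trailing-window mean for a single column. The function name rolling_mean and the sample values are made up for illustration, and the sketch describes the pandas-style behaviour rather than how Mars computes it internally.

```python
from collections import deque


def rolling_mean(values, window, min_periods=1):
    """Mean of the last `window` values; NaN until `min_periods` are seen."""
    buf = deque(maxlen=window)  # drops the oldest value automatically
    out = []
    for v in values:
        buf.append(v)
        if len(buf) >= min_periods:
            out.append(sum(buf) / len(buf))
        else:
            out.append(float("nan"))
    return out


# Made-up values with a window of 3 instead of 10 to keep the example short.
means = rolling_mean([1.0, 2.0, 3.0, 4.0], window=3)
print(means)
```

With min_periods=1 the first output equals the first input, which is why the first row of the table above is simply the first trading day's values.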

Plans for Upcoming Versions

Original Source:

Follow me to keep abreast of the latest technology news, industry insights, and developer trends. Alibaba Cloud website: https://www.alibabacloud.com
