GIS-based Grid Operations of New Retail Merchants

Background

Jack Ma once said,

“The era of pure E-commerce will soon come to an end. In the next ten or twenty years, there won’t be E-commerce, just new retail. That is to say that online and offline business has to be integrated with logistics to create a new retail model.”

Here, online business refers to cloud platforms and offline business refers to stores or manufacturers. New logistics will eliminate inventories to reduce the volume of stock-ups.

The disappearance of E-commerce platforms means the decentralization of existing E-commerce platforms, where merchants will possess their own independent platforms and no longer settle their business on large-scale platforms like Tmall, JD.com, and Amazon. For example, if everyone owns a store on an E-commerce platform for centralized sales, the sales are restricted.

Data must be used to its greatest value if online and offline business are to be integrated and inventories eliminated. For example, it can be used to predict sales volume ahead of time. Connecting online and offline business also brings new challenges to data operation, and leads to gridded operation based on geographic locations.

Requirements

1. Support quick data search based on geographic locations (GIS)

2. Support analysis and exploration of massive sales data

Architecture design

1. Sales data usually enters the Alibaba Cloud HybridDB for PostgreSQL database concurrently by means of OSS.

2. The RDS PostgreSQL is responsible for online transaction processing and selecting merchants through gridded operation.

3. The ETL program is responsible for data scheduling.

4. The BI application is connected to HDB for PG and PG. Its drive and syntax are also compatible with PostgreSQL.

5. HybridDB for PostgreSQL provides basic high-availability and backup functions, and also provides a function for one-click storage resizing. Users do not need to worry about performance stress from future data increases.

6. HDB PG and RDS PG can transparently access (read/write) OSS data using an OSS_EXT external table plugin. The OSS provides massive shared storage, which allows RDS PG and HDB PG to share data. The OSS can also be used as an external source of massive data and is concurrently imported to an Express Connect of HDB PG. The OSS can also be used as cold data storage for RDS PG and HDB PG.

DEMO and Performance

Gridded Merchant Search

Built-in Geometry Type, Gridded Merchant Search, Testing

Using built-in polygon, box, circle and point types, GiST spatial indexes, KNN sorting operator <->, and operator @>.

1. Construct 100 million pieces of merchant geographic location data

2. Create spatial indexes

3. Create a query optimization function

Input a polygon and return the merchants that fall into its bounds.

If a spatial query with other conditions is required, a spatial composite partial index can be used, for example

4. Spatial index performance verification, where a grid-based query of 100 million pieces of data takes about 0.8 ms.

PostGIS Spatial Database, Gridded Merchant Search, Testing

In practice, longitude and latitude values are usually stored in a PostGIS spacial database. We will use the built-in polygon types during testing for the sake of convenience.

There are two gridded merchant search functions that require the use of PostGIS

http://postgis.net/docs/manual-2.3/ST_Within.html

1. ST_within

ST_Within — Returns true if the geometry A is completely inside geometry B

boolean ST_Within(geometry A, geometry B);

Returns TRUE if geometry A is completely inside geometry B. For this function to make sense, the source geometries must both be of the same coordinate projection, having the same SRID. It is a given that if ST_Within(A,B) is true and ST_Within(B,A) is true, then the two geometries are considered spatially equal.

This function call will automatically include a bounding box comparison that will make use of any indexes that are available on the geometries. To avoid index use, use the function _ST_Within.

2. ST_Contains

ST_Contains — Returns true if and only if no points of B lie in the exterior of A, and at least one point of the interior of B lies in the interior of A.

boolean ST_Contains(geometry geomA, geometry geomB);

Returns TRUE if geometry B is completely inside geometry A. For this function to make sense, the source geometries must both be of the same coordinate projection, having the same SRID. ST_Contains is the inverse of ST_Within. So ST_Contains(A,B) implies ST_Within(B,A) except in the case of invalid geometries where the result is always false regardless or not defined.

This function call will automatically include a bounding box comparison that will make use of any indexes that are available on the geometries. To avoid index use, use the function _ST_Contains.

Meanwhile, we often also need GiST spatial indexes. KNN sorting operator <->, functions to determine the center of an excircle, and a data construction function ST_PointFromText may also be necessary. For details, see the postgis manual

http://postgis.net/docs/manual-2.3/reference.html

1. Create a table and spatial indexes

2. Construct 100 million pieces of test data

3. Gridded merchant search and query

Execute the plan that uses spatial indexes and also involves partial filtering (a process similar to one that I will mention later. Essentially an excircle is generated and results are output according to distance. Points that fall outside the polygon are filtered out.)

4. Query breakdown

4.1 Find the smallest possible circle that contains the polygon (in fact a polygon contains multiple segments)

4.2 Find the center of the smallest circle containing the polygon

4.3 Solve the center and radius of the smallest circle containing the polygon (function introduced in PostGIS 2.3)

4.4 The earlier version can be used to find the radius of the circle. The following are the steps to breaking it down

4.5 Distance sorting, the size of the cut-off radius, and filtering out points not within the polygon

Compared to original methods

The quantity of records is slightly different due to distortion caused by the fact that when a circle is generated through conversion, what it creates is actually a polygon with 48 segment lines. For details, see the ST_MinimumBoundingCircle function

PostGIS Test for HybridDB for PostgreSQL

GPDB does not currently support KNN SORT and KNN MERGE SORT for GIST indexes.

So we can see an extra row of external nodes.

But it doesn’t matter as long as there aren’t too many results, otherwise the sort will reach a bottleneck, because st_within still can use the spatial indexes.

Data Analysis Performance

How is data analysis performance?

Here is a set of standalone for 1 terabyte of TPC-H test data. HybridDB for PostgreSQL is an MPP distributed database and can linearly improve the performance by increasing nodes.

Technical Points

1. Spatial indexes: GiST indexes are spatial indexes that are unique to PostgreSQL. They support precise distance index searches, and can return results with distance sorting. It has also become the first choice of many scientific research institutions and spatial services due to its excellent performance, .

2. KNN query: output records according to distance from near to far

3. OSS external table: a function added to Alibaba Cloud RDS PG and HDB PG, which is connected to a cloud massive OSS and can transparently read/write files in the OSS in the form of an external table. Each thread can achieve a read/write bandwidth of 30 MB/s. This allows use to increase throughput by utilizing concurrency.

4. ETL: an ETL program on the cloud or owned by the use Any program can be used as long as it supports OSS object connections using a protocol supported by PostgreSQL.

5. MADlib: an open source machine learning library that supports most learning libraries and is capable of implementing machine learning using RDS PG and HDB PG SQL interfaces.
MADlib supports many mining modes such as Classification, Regression, Clustering, Topic Modeling, Association Rule Mining, Descriptive Statistics, and Validation.

http://madlib.incubator.apache.org/product.html

Madlib Manual

6. Geometric knowledge

Incircle of a polygon, circle(polygon)

BOX and excircle of a polygon, circle(box(polygon))

The geometric functions of PostgreSQL are as follows

https://www.postgresql.org/docs/9.6/static/functions-geometry.html

The geometric functions of PostGIS are as follows

http://postgis.net/docs/manual-2.3/reference.html

7. The polygon search mentioned in the performance testing above is a PG 10 test. If you encounter performance problems concerning spatial indexes in older versions, you can use the following methods to optimize the operation.

First, convert the polygon into a BOX, then calculate the excircle of the BOX, return records according to KNN index order, and meanwhile filter data contained in the polygon.

Cloud Products

Alibaba Cloud RDS PostgreSQL

Alibaba Cloud HybridDB for PostgreSQL

Alibaba Cloud OSS

Summary

In the new retail industry, data must be used to its greatest value if online and offline business are to be integrated and inventories eliminated. For example, it can be used to predict sales volume ahead of time. Connecting online and offline business also brings new challenges to data operation, and leads to gridded operation based on geographic locations.

Databases are required to be equipped with:

1. Support for quick data search based on geographic locations (GIS)

2. Support for analysis and exploration of massive sales data

Through Alibaba Cloud RDS PostgreSQL, HybridDB for PostgreSQL, and OSS, a response can be returned in 1 ms for 100 million pieces of geographic location data, and requirements can be met for every step in the data chain, from analysis to mining.

1. Sales data usually enters the Alibaba Cloud HybridDB for PostgreSQL database concurrently by means of OSS.

2. The RDS PostgreSQL is responsible for online transaction processing and selecting merchants through gridded operation.

3. The ETL program is responsible for data scheduling.

4. The BI application is connected to HDB for PG and PG. Its drive and syntax are also compatible with PostgreSQL.

5. HybridDB for PostgreSQL provides basic high-availability and backup functions, and also provides a function for one-click storage resizing. Users do not need to worry about performance stress from future data increases.

6. HDB PG and RDS PG can transparently access (read/write) OSS data using an OSS_EXT external table plugin. The OSS provides massive shared storage, which allows RDS PG and HDB PG to share data. The OSS can also be used as an external source of massive data and is concurrently imported to an Express Connect of HDB PG. The OSS can also be used as cold data storage for RDS PG and HDB PG.

7. For spatial data, in addition to using the GiST index, we also provided a more cost-effective index, the BRIN index, which is excellent for filtering data structured by st_geohash.

References

https://www.postgresql.org/docs/9.6/static/functions-geometry.html

http://postgis.net/docs/manual-2.3/reference.html

Original Source

https://www.alibabacloud.com/blog/gis-based-grid-operations-of-new-retail-merchants_594986?spm=a2c41.13103979.0.0

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store