Applying PostgreSQL Graph Database to Social Scenarios

By Digoal


Humans are social animals. As the population grows, the methods of communication become more and more borderless. A huge network of relations has been formed between people-people, people-event, and people-time.

Many scenarios are based on this huge network of relations. For example:

1. Headhunting

IT personnel, headhunters and HR are no strangers to LinkedIn. LinkedIn is actually a website that maintains interpersonal relations.

By searching your 1-depth contacts, you can find people who are directly related to you, and by searching your 2-depth contacts, you can find people who are indirectly related to you.

Of course, you can continue to search for N-depth contacts, but those may not be so relevant to you.

If you knew that you are only a few depths away from actor Fan Bingbing, would you get a little a bit excited?

In fact, since ancient times, this kind of Social Relation already existed, as well as this kind of specialized occupation. Buying official positions, selling official positions, and other acts actually all depend on networking contacts. After reading the Dream of Red Mansions, you may be amazed at how many relatives this family has!

2. Criminal investigation

Public Security criminal investigation is also an application related to networking contacts. However, the relations and behaviors are getting more and more complex nowadays, as is the relations between people. In ancient times, the range of people that can be reached basically depends on two legs, plus a horse at most.

Now, mobile phones, computers, ATM machines, supermarkets, cameras, cars, and so on are all connected through the road network and the Internet.

The relations generated by someone’s behavior are more complicated. The difficulty of criminal investigation becomes more and more complicated if relation analysis is only performed manually.

3. Financial risk control

For example, when a bank reviews loan qualifications, it usually needs to check whether the applicant has the ability to repay, whether any false information was provided, as well as the behaviors & habits, assets, circle of friends and so on. It also involves complex analysis of relations between people, relations of human behaviors, and so on.

Pictures are from the Internet

Such human-centered and event-related businesses gave rise to Graph Databases

Currently, graph databases, such as neo4j, are commonly used.

For more information, see

PostgreSQL is a full-featured database. And, PostgreSQL is used in the background of some graph databases, such as OpenCog, and Cayley.

In addition, PostgreSQL is also very mature in relation query and relation management.

This article reveals how PostgreSQL meets the requirements of financial risk control, criminal investigation, social relations and human relations.

Graph Data Model

In the design of many graph databases, events or people are used as nodes, and if a relation exists between events or people, the relation is established.

In PostgreSQL, we can use two columns to represent this relation. Each column represents an event or a person. If the two have a relation, a record is created.

This representation is adopted to keep consistent with the graph data.

Of course, optimization methods, such as arrays, can also be used. In PostgreSQL, other events or people associated with an event are stored as arrays, which can improve retrieval efficiency.

Can PostgreSQL Recursive Queries Meet the Needs of Relation Derivation?

PostgreSQL recursive query is an excellent feature.

Relations can be circular. Therefore, due to the problem of loops, all relations found before should be excluded each time in derivation, which is currently supported for the PostgreSQL recursion.

For example, I need such a syntax to support the exclusion of loops

Currently, this is not supported.

Recursion can only exclude the set of work tables, not the set of the whole recursive process. If a loop exists in the relation, it leads to an infinite loop.


As shown in the legend, many points together form a loop. For example, A and I are good friends, B and I are also good friends, and A and B are also good friends.

In the derivation, A and B are derived from me, then B is derived from A, and A is derived from A, and so on… which is an infinite loop.

PostgreSQL UDF to Solve the Issue of Relation Loops

Using UDFs, the problem of infinite recursive loops caused by data looping can be solved very well.


Create 100 million entries of relation data, and assume a total of 10 million people. For ease of testing, other attributes of people or events (such as Time, Location, and Intimacy) are omitted from the table here.

Temporary tables used by the functions

Function 1: Output A-Centered and N-Level Relation Data

Function 2: Output A-Centered and N-Level Relation Data

A path is generated at the same time. Note: This path is a complete path, and a node may have multiple paths.

Function 3: Find the shortest relation path between two people and two events

Here, the optimization algorithm is to be tested, involving which point to start the search, or both sides simultaneously conducting radial search, and so on.


1. Find the 3-level relation network of the user with ID 1


2. Find the shortest relation path between ID 1 and ID 386886


Very fast.

UDF Optimization Idea 1 — Using a Cursor to Receive Streams

We can see that the previous UDF starts to output only after all the data is searched. In fact, the output can be started after the first record is found, so that the client can continuously receive data and reduce latency.

This can be achieved by using cursors.

Cursor usage reference


Return level-1 relations

Return level-2 relations

And the like


Because cursors in this example return relations with the context, make sure to receive data from the next cursor after a current cursor is completely received. No data will be received if you start to receive later cursors beforehand.

After a cursor is used, it is ready to receive streams by relation level.

UDF Optimization Idea 2 — Functions and Stream Receiving

Currently insert into … returning* in PostgreSQL will not return the first record until the insert operation is completed. Therefore, the deeper the level is, the more time it takes to insert data and the slower it is to receive data of a deeper level. Can we use a better optimization method?

In addition to altering cores and letting the cursor that supports insert into … returning * support stream returning, another method can be used to stream return data by using functions.

Create 100 Million Users and Consider Every 50 Thousand Users as a Group of Associated Users. Each User Is Associated with 1000 Close Users, Forming a Network of 100 Billion Relations.

About 520 GB

Test Scenario

Deduce User A’s N-level relation network.

Currently PL/pgSQL doesn’t support stream returning, even if return next or return query is used

Currently, using the c function to implement the preceding logic can allow stream receiving of functions.

Test Scenario 2

Extract 10 level relations of a user from 100 billion pieces of data

UDF Optimization Idea 3 — Asynchronous Messages

Asynchronous messages can be used to implement the same effect.

PostgreSQL is more useful after asynchronous messages are supported.


Enable a listener in session A

Start query in session B

Session A can read asynchronous messages

Design Optimization Method 1 — Array

When storing relations, we use one-to-one storage. This storage method is simple, but it has some disadvantages. For example, to search for a user’s direct relation network, it is required to search for N records. More search levels will lead to several-fold increases in records.

Assume that each user has 100 direct relations on average. To search for five levels, we have to search and return 100¹+100²+100³+… 100⁵ records.

However, if array storage is used, the number of scanned records can be reduced to 100⁰+100¹+100²+… 100⁴, and the overload for each level is 100 times smaller.

The use of array storage is also very simple and thus not explained in this article. Arrays support features such as = any(array) query and the foreach loop in functions.

Design Optimization Method 2 — Pgrouting

We all know that a relation has different levels of closeness. Similarly, events also have different levels of association. Therefore, simply using association or no association is far from enough to indicate event relationships in real life.

For example, although Ming, Shao Cong, and PostgreSQL are all my close friends, the closeness level may be totally different. This is particularly true in the case of showing or analyzing relationships among people.

Then how can we indicate the different association levels? The answer is very simple: adding a weight field.

Path calculation is no longer as simple as using the first received path. Instead, path calculation will not stop until the weight of each path is found to be greater than existing paths.

Assume that many paths can lead from A to B. We cannot stop path calculation when the first path is found, because the total weight of other paths may be smaller. Therefore path calculation will not stop until the weight of each path (including paths that have not arrived at B) exceeds the shortest among already calculated paths.

This is what pgRouting is good at. pgRouting can be completely applied in the graph database field.

pgRouting naturally supports path weight (for example, speed limits on roads, upward/downward slopes, number of road curves, number of highway lanes, and road congestion levels can all be dynamic weight values). pgRouting also supports multiple path planning algorithms and custom planning algorithms.

For more information, see

Other Graph Databases

Neo4j is also a good graph database used by many people.

However, some research papers compare Neo4j with PostgreSQL and find that Neo4j also has some disadvantages. For example, the memory cannot hold all data and the access speed is unstable.

Neo4j allows creating indexes for groups only by label to perform data search ( similar to partitioned indexes and not global indexes). To perform global search, each user has to contain the same label.



With a growing number of communication media available and increasingly lower communication cost, people, things, and events become more correlated and social relations are more and more complex.

In a relation system, a node is used to represent points (people, things, events, and specific time and space properties). If two nodes are related themselves, they will be associated.

Relation data will continue to see explosive growth. Nodes can be in trillions, and each node can have relations with hundreds of thousands of other nodes, forming a network of thousands of trillions of relations.

However, generally systems do not have such a large volume of data. Only very large companies can have a network of trillions of relations at the same time point.

PostgreSQL is such a comprehensive database that can be used almost in any scenarios. This article uses some features of PostgreSQL to design a DEMO in the relation network scenario that can meet common requirements such as relation deduction and path search between two nodes.

Example Features Used in This Article

1. Arrays for storing positive relations

2. PL/pgSQL for writing logic such as deduction logic and path search logic

3. Cursors for stream returning

4. Asynchronous messages for stream data returning

5. Aggregation queries

6. pgRouting for retrieving the shortest path

7. Recursive queries

8. Computing methods similar to the PostgreSQL rum plug-in can be used for closeness analysis, for example, calculating the score of the event overlapping degree. (Currently RUM has supported the output and sorting of approximate values of type tsvector and therefore can be used for calculating closeness.)

RUM reference

[Full-text Search with PostgreSQL Is Way Too Fast — RUM Index Interface (Pandora Box)]()

Closeness relation level deduction example

Create 100 million users again and consider every 50 thousand users as a group of associated users. Each user is associated with 100 close users, forming a network of 10 billion relations.

Use rum_tsvector_ops of RUM to sort results by closeness and obtain closeness values at the same time.

The smaller the value is, the closer the relation is.

Iterative computation

Perform deduction per level. Of course you can add a real-time closeness field so that you do not have to perform queries by using the RUM index each time.

Postgresql Improvements to Be Made

1. Although the insert into…returning… cursor supports stream returning, currently results are not returned until the insert operation is completed. The insert and return operations cannot be performed at the same time.

2. Currently the with recursive query can only support querying the intermediate status of work tables. It is recommended to add support for full work table queries.

3. It would be better if the with recursive query supported LOOPth variables so that users can know which loop they are currently in.

4. return next and return query in PL/pgSQL supports stream returning. (Currently only C interfaces can be written to return stream data in functions. As shown in the manual, PL/pgSQL may add support for stream returning later.)

5. Index interfaces in PostgreSQL are completely open. Users can customize how to organize index storage and retrieve data.

For the application of graph databases, many other factors can be improved in terms of efficiency. For example, currently the closeness refers to the closeness scores of the relations separated by one degree. (This scoring can be quickly implemented by using the similarity sorting of RUM.)

But how can we efficiently score the closeness of relations separated by two degrees or higher degrees?

The method for one degree of relations certainly cannot enable high multi-round retrieval and query efficiency.

In this case, open PostgreSQL interfaces show its advantages. Users can customize more efficient index organizations and structures based on the actual scenarios.

For more information, refer to how the RUM and bloom indexes are written.

Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.