Implementation of Message Push and Storage Architectures of Modern IM Systems
By Zhaofeng Zhou (Muluo)
Preface
IM stands for instant messaging. In the mobile Internet era, IM products have become a daily necessity. The most well-known IM products in China are DingTalk, WeChat, and QQ. Among them, WeChat has developed into an ecosystem, but its core function is still IM. IM is also an indispensable module in applications whose core business is not IM, most typically online games and social networking applications. Any application with social attributes needs an IM feature.
The IM system came into being in the early days of the Internet. Its basic technology architecture has been updated many times over the last dozen or so years, from the early client-server (CS) and P2P architectures to today's distributed backend systems. It touches many areas of technology, including mobile clients, networking, security, and storage. The number of daily active users served by a single IM product has grown from a small number in the early days to as many as 900 million, as WeChat recently announced.
The core of an IM system is the messaging system, and the core function of the messaging system is the synchronization and storage of messages:
- Message synchronization: Transmitting complete messages from the sender to the recipient quickly. The most important metrics of a message synchronization system are the timeliness and integrity of transmitted messages, and the message scale that can be supported. In terms of functionality, an IM system must at least support online and offline message push. Advanced IM systems also support “multi-terminal synchronization”.
- Message storage: The persistent storage of messages. This does not mean local storage on the client side, but storage on the cloud. This is the so-called “message roaming” function. The advantage of “message roaming” is that you can log on to your account from any terminal and view all historical messages. This is also one of the distinguishing features of an advanced IM system.
This article mainly describes the messaging system architecture in the IM system. We will introduce the implementation of a Table Store-based message synchronization and storage system architecture. It supports advanced features of the messaging system, such as “multi-terminal synchronization” and “message roaming”. In terms of performance and scale, it supports full message storage on the cloud, and message synchronization with millions of TPS and millisecond delays.
Architecture Design
This section mainly describes the architecture design of Table Store-based modern IM systems. Before describing the architecture design in detail, we will introduce the timeline logic model to abstract and simplify the understanding of the IM synchronization and storage models. After understanding the timeline model, we will talk about the modeling methods for the synchronization and storage of messages based on the timeline model. We also have to make technical trade-offs in various aspects when implementing message synchronization and storage. For example, we have to compare and choose common pull and push modes for message synchronization, and choose the underlying database based on the characteristics of the timeline model.
Comparison between Conventional and Modern Architectures
The above diagram is a simple comparison between the conventional and modern architectures of a messaging system.
Under a conventional architecture, messages are synchronized first and stored afterward. For online users, messages are synchronized directly to the recipients in real time, and a message that is synchronized successfully is not persisted. For offline users, or for messages that cannot be synchronized in real time, the messages are persisted to an offline database. The recipient pulls all unread messages from the offline database after reconnecting. A message is deleted from the offline database once it has been synchronized to the recipient. The main tasks of the server in a conventional messaging system are to maintain the connections between senders and recipients, and to provide online message synchronization and offline message caching. This design ensures that messages can be delivered from the sender to the recipient under all circumstances. However, the server does not keep messages permanently, so it does not support message roaming.
Under the modern architecture, messages are stored first and then synchronized. The advantage of storing before synchronizing is that by the time a recipient receives a message, the message has already been saved on the cloud. Messages are saved in two databases: the message storage database and the message synchronization database. The message storage database stores the messages of all sessions to support message roaming. The message synchronization database is mainly used for multi-terminal synchronization to the recipient. A message sent by the sender is relayed through the server, which first saves it to the message storage database and the message synchronization database. Once the message is persisted, if the recipient is online, the message is pushed to the recipient directly. However, online push is only the preferred path, not the only one. For messages that fail to be pushed online, or when the recipient is offline, there is a unified fallback: the recipient actively pulls all unsynchronized messages from the server. The server does not know when the recipient will synchronize, or from which terminal the synchronization request will come, so it must save all messages that still need to be synchronized to the recipient. This is what the message synchronization database is designed for. Users of an IM product may also need message roaming when they switch to a new device. The message storage database is designed to meet that need: from it, you can pull all historical messages of any session.
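The store-first, synchronize-second flow described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the article's implementation: the class name `ModernMessageServer`, its methods, and the use of plain dictionaries in place of the two databases are all assumptions made for the example. For brevity it also keeps one synchronization list per recipient rather than per terminal.

```python
from collections import defaultdict


class ModernMessageServer:
    """Hypothetical store-first server; dicts stand in for the two databases."""

    def __init__(self):
        self.storage_db = defaultdict(list)  # session_id -> all messages (roaming)
        self.sync_db = defaultdict(list)     # recipient -> messages awaiting sync
        self.online = {}                     # recipient -> callable used for online push

    def send(self, session_id, sender, recipient, text):
        message = {"session": session_id, "from": sender, "text": text}
        # 1. Persist first: both databases are written before any delivery attempt.
        self.storage_db[session_id].append(message)
        self.sync_db[recipient].append(message)
        # 2. Online push is preferred but optional; a failed push is tolerated.
        push = self.online.get(recipient)
        if push is not None:
            try:
                push(message)
            except ConnectionError:
                pass  # the recipient catches up later through pull_unsynced()
        return message

    def pull_unsynced(self, recipient, already_synced):
        """Unified fallback: everything after the first `already_synced` messages."""
        return self.sync_db[recipient][already_synced:]

    def roam(self, session_id):
        """Full history of one session, as a new device would request it."""
        return list(self.storage_db[session_id])


server = ModernMessageServer()
server.send("A:B", "A", "B", "hello")        # B is offline; the message is still saved
pushed_to_b = []
server.online["B"] = pushed_to_b.append      # B comes online
server.send("A:B", "A", "B", "are you there?")
```

Because persistence happens before the push attempt, a message that B receives online is guaranteed to be present in both stores, and the first message, which was never pushed, remains available through `pull_unsynced`.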
The above is a simple comparison between the conventional and modern IM system architectures. The modern architecture supports multi-terminal synchronization and message roaming without making the overall message synchronization and storage process much more complicated. The core of the modern architecture is the two message databases: the “message synchronization database” and the “message storage database”. They are the foundation of message synchronization and storage. The rest of this article mainly describes the design and implementation of these two databases.
Timeline Model
Before analyzing the design and implementation of the “message synchronization database” and “message storage database”, we will first introduce the timeline logic model. Understanding the timeline model makes the message synchronization and storage models easier to understand. The design and implementation of a message database is also based on the characteristics and requirements of the timeline model.
The above diagram is an abstract representation of the timeline model. The timeline can be simply understood as a message queue that has the following characteristics:
- Each message has a sequence ID (SeqId), and the SeqId of a message nearer the tail of the queue is always greater than that of a message nearer the head. The SeqId therefore always increases over time, but successive SeqIds do not have to be contiguous: gaps are allowed.
- New messages are always added to the end of a queue, ensuring that the SeqId of the new message is always greater than that of existing messages.
- We can either read a specific message based on the SeqId, or read all messages within a given range.
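The three characteristics above map naturally onto a small data structure. The following Python sketch is illustrative only: the class `Timeline`, its method names, and the `step` parameter (used purely to show that SeqIds may have gaps as long as they keep growing) are assumptions of this example and do not come from any particular library.

```python
import bisect


class Timeline:
    """In-memory timeline: an append-only queue ordered by SeqId."""

    def __init__(self):
        self._seq_ids = []   # always increasing
        self._messages = []  # parallel to _seq_ids

    def append(self, message, step=1):
        """Add a message at the tail and return the SeqId assigned to it."""
        if step < 1:
            raise ValueError("step must be >= 1 so that SeqIds keep increasing")
        last = self._seq_ids[-1] if self._seq_ids else 0
        seq_id = last + step
        self._seq_ids.append(seq_id)
        self._messages.append(message)
        return seq_id

    def get(self, seq_id):
        """Read one message by its exact SeqId, or None if no such SeqId exists."""
        i = bisect.bisect_left(self._seq_ids, seq_id)
        if i < len(self._seq_ids) and self._seq_ids[i] == seq_id:
            return self._messages[i]
        return None

    def scan(self, after_seq_id=0, limit=None):
        """Read (seq_id, message) pairs with SeqId > after_seq_id, in order."""
        start = bisect.bisect_right(self._seq_ids, after_seq_id)
        end = len(self._seq_ids) if limit is None else start + limit
        return list(zip(self._seq_ids[start:end], self._messages[start:end]))
```

Keeping the SeqIds sorted means both the point read and the range read reduce to a binary search, which is the access pattern a backing database also has to serve efficiently.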
With these characteristics, the synchronization of messages can be easily implemented with a timeline. In the above diagram, A is the message sender and B is the recipient. B has multiple receiving terminals: B1, B2, and B3. When A sends a message to B, the message needs to be synchronized to all of B's terminals. The messages to be synchronized are exchanged through a timeline. All messages sent by A to B are saved in this timeline, and each receiving terminal of B pulls messages from it independently. After a terminal has synchronized its messages, it records the SeqId of the last synchronized message locally. This SeqId is used as the starting checkpoint of the next synchronization. The server does not have to record the synchronization status of each receiving terminal, and each terminal can pull messages from any point in the timeline.
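The checkpoint mechanism can be illustrated with a short Python sketch. Everything here is hypothetical: the `Terminal` class, the representation of the timeline as a list of `(SeqId, message)` pairs, and the linear scan (a real store would use a range read starting at the checkpoint).

```python
class Terminal:
    """A receiving device that keeps its own checkpoint (last synced SeqId)."""

    def __init__(self, name):
        self.name = name
        self.checkpoint = 0  # stored on the device, not on the server
        self.inbox = []

    def sync(self, timeline):
        """Pull every message newer than the local checkpoint; return the count."""
        new = [(seq_id, msg) for seq_id, msg in timeline if seq_id > self.checkpoint]
        for seq_id, msg in new:
            self.inbox.append(msg)
            self.checkpoint = seq_id  # advance only after the message is received
        return len(new)


# Shared timeline holding messages for recipient B, as ordered (SeqId, message) pairs.
timeline_b = [(1, "hi"), (2, "lunch?")]

b1, b2 = Terminal("B1"), Terminal("B2")
b1.sync(timeline_b)                    # B1 catches up right away
timeline_b.append((3, "12:30 ok?"))
b1.sync(timeline_b)                    # B1 pulls only SeqId 3
b2.sync(timeline_b)                    # B2 was offline and pulls all three
```

The server-side timeline carries no per-terminal state: B1 and B2 reach the same inbox contents at different times using only their own checkpoints.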
Message roaming is implemented based on timeline, too. The only difference between message synchronization and message roaming is that message roaming requires the server to persist all data in the timeline.
Based on the timeline logic model, we can easily understand how to implement message synchronization and storage on the server side, and how to implement advanced functions such as multi-terminal synchronization and message roaming. The main implementation challenges are: How to map the logical model to the physical model? What are the database requirements for implementing the timeline model? Which database should we choose? These are the topics that will be discussed next.
Message Storage Model
The above diagram illustrates a timeline-based message storage model. Message storage requires each session to have a separate timeline. As shown in the example, A has sessions with B, C, D, E, and F, and each session has its own timeline. All messages of a session are held in the corresponding timeline. The server persists every timeline. Because the server persists the full set of messages in all session timelines, it is able to support message roaming.
Message Synchronization Model
The message synchronization model is slightly more complicated than the message storage model. Message synchronization is generally implemented in one of two modes, pull and push, which correspond to different physical timeline models.
The above diagram illustrates the timeline models of two synchronization modes: pull and push. As shown in the diagram, the message recipient A simultaneously has sessions with B, C, D, E, and F. All new messages in these sessions need to be synchronized to one of A’s terminals. Let’s take a look at how messages are synchronized in both the pull and push modes.
- Pull mode: In the message storage model, all messages of a session are saved in the timeline of the session. In pull mode message synchronization, new messages generated in each session only need to be written once to the storage timeline to allow the receiving terminal to pull such messages from the timeline. The advantage is that a message only needs to be written once. This greatly reduces the number of message writes in comparison with push mode, especially in the case of group chat messages. However, its drawback is also obvious — the logic for a receiving terminal to pull messages could be relatively complicated and inefficient. The receiving terminal must pull messages from every session to get all messages. The reads are amplified. This also introduces a lot of invalid reads because not every session has new messages.
- Push mode: In push mode, message synchronization uses additional timelines that are separate from the session storage timelines. Usually, each receiving terminal has an independent synchronization timeline that stores all messages to be synchronized to that terminal. Each message in a session is therefore written to both the message storage timeline and the message synchronization timelines. In a one-to-one chat, a message incurs two additional writes: apart from being written to the storage timeline of the session, it must be written to the synchronization timelines of both participants. In a group chat, the writes are amplified further: if a group has N participants, each message must be written N+1 times. The outstanding advantage of push mode is that the synchronization logic at the receiving terminal is very simple. The terminal only needs to pull messages from a single synchronization timeline, which greatly reduces the read pressure of message synchronization. The drawback is write amplification, especially for group chats.
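The trade-off between the two modes comes down to simple counting, which the following Python sketch makes explicit. The function names are hypothetical, and the push-mode count assumes one synchronization timeline per participant, matching the N+1 figure given above.

```python
def writes_per_message(mode, participants):
    """Timeline writes caused by one message in a session with `participants` members."""
    if mode == "pull":
        return 1                 # only the session's storage timeline
    if mode == "push":
        return participants + 1  # storage timeline + one sync timeline per participant
    raise ValueError(f"unknown mode: {mode}")


def timelines_read_per_sync(mode, sessions):
    """Timelines a terminal must read to synchronize a user who has `sessions` sessions."""
    if mode == "pull":
        return sessions          # every session timeline, even those with no news
    if mode == "push":
        return 1                 # the terminal's own synchronization timeline
    raise ValueError(f"unknown mode: {mode}")


# One-to-one chat (2 participants) versus a 500-member group, for a user in 50 sessions.
for mode in ("pull", "push"):
    print(mode,
          writes_per_message(mode, 2),
          writes_per_message(mode, 500),
          timelines_read_per_sync(mode, 50))
```

Pull mode keeps writes constant and pays on every read, while push mode pays once per participant at write time so that each synchronization touches a single timeline.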
IM systems usually choose push mode for message synchronization. In the IM scenario, a message is generated only once in a session, but it is read multiple times. It is a typical read-heavy scenario: the read/write ratio of messages is about 10:1. With pull mode synchronization, the read/write ratio of the IM system would be amplified to 100:1. A well-optimized system must balance read and write pressure and avoid a bottleneck on either side. Therefore, IM systems usually use push mode to balance reads and writes, which can bring the read/write ratio from 100:1 to 30:30. Of course, push mode also has to deal with extreme scenarios, such as a group chat with over ten thousand participants. For such extreme scenarios, pull mode may be used instead. A simple IM system usually prevents the creation of such large groups at the product level, whereas an advanced IM system usually blends the pull and push modes to handle them.
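One way to blend the two modes is to fan out on write for ordinary sessions and fall back to reading the storage timeline directly for very large groups. The Python sketch below shows the idea under stated assumptions: the class `HybridSync`, the `fanout_limit` threshold and its default value, and the in-memory dictionaries are all hypothetical, and checkpoints are omitted to keep the example short.

```python
from collections import defaultdict

PUSH_FANOUT_LIMIT = 10_000  # hypothetical threshold above which a group uses pull mode


class HybridSync:
    def __init__(self, fanout_limit=PUSH_FANOUT_LIMIT):
        self.fanout_limit = fanout_limit
        self.storage = defaultdict(list)       # session_id -> messages
        self.sync = defaultdict(list)          # user -> (session_id, message) pushed to user
        self.members = {}                      # session_id -> list of users
        self.pull_sessions = defaultdict(set)  # user -> large sessions read directly

    def create_session(self, session_id, users):
        self.members[session_id] = list(users)
        if len(self.members[session_id]) > self.fanout_limit:
            for user in self.members[session_id]:
                self.pull_sessions[user].add(session_id)

    def send(self, session_id, text):
        self.storage[session_id].append(text)  # the storage timeline is always written
        users = self.members[session_id]
        if len(users) <= self.fanout_limit:    # push mode: fan out on write
            for user in users:
                self.sync[user].append((session_id, text))

    def sync_user(self, user):
        """Pushed messages plus the contents of the user's pull-mode sessions."""
        result = list(self.sync[user])
        for session_id in sorted(self.pull_sessions[user]):
            result.extend((session_id, text) for text in self.storage[session_id])
        return result
```

With this split, a message to a huge group costs a single write, and only members of such groups pay the extra reads, so the common case keeps the simple one-timeline synchronization of push mode.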
Message Database Design
Based on the timeline model and the application of the timeline model in message storage and synchronization, let’s take a look at the design of the message synchronization database and the message storage database.
The above illustrates the design of a timeline-based message database.
- Message synchronization database: The message synchronization database stores all message synchronization timelines. Each timeline corresponds to one receiving terminal and is mainly used for push mode message synchronization. This database does not have to keep the messages permanently, because the lifecycle of a message ends once it has been synchronized to all terminals, and it could then be deleted immediately. However, as mentioned before, a simple multi-terminal message synchronization system does not store the synchronization status of the receiving terminals on the server; synchronization is initiated by the terminals. In this case, the server does not know when a message can be deleted. The common practice is to set a fixed lifecycle for messages stored in this database, for example one week or one month, and to delete a message when its lifecycle ends.
- Message storage database: The message storage database stores timelines of all sessions, and each timeline holds all messages of a session. This database is mainly used to pull all historical messages of a session during message roaming. It can also be used in the pull mode message synchronization.
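The fixed-lifecycle policy for the synchronization database can be sketched as a time-to-live (TTL) check on every row. This Python example is an in-memory stand-in under stated assumptions: the class `ExpiringSyncStore`, its method names, and the injectable `clock` parameter (which makes the behavior easy to exercise without waiting) are hypothetical and do not represent any database's API.

```python
import time


class ExpiringSyncStore:
    """Synchronization timelines whose rows expire after a fixed lifecycle."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.rows = {}  # timeline_id -> list of (written_at, seq_id, message)

    def write(self, timeline_id, seq_id, message):
        self.rows.setdefault(timeline_id, []).append((self.clock(), seq_id, message))

    def read_after(self, timeline_id, checkpoint):
        """Unexpired (seq_id, message) pairs with SeqId greater than `checkpoint`."""
        cutoff = self.clock() - self.ttl
        return [(seq_id, message)
                for written_at, seq_id, message in self.rows.get(timeline_id, [])
                if written_at >= cutoff and seq_id > checkpoint]

    def purge(self):
        """Physically drop expired rows and return how many were removed."""
        cutoff = self.clock() - self.ttl
        removed = 0
        for timeline_id, rows in self.rows.items():
            kept = [row for row in rows if row[0] >= cutoff]
            removed += len(rows) - len(kept)
            self.rows[timeline_id] = kept
        return removed
```

Because expiry depends only on the write time, the server never needs to know which terminals have already synchronized a message. A terminal that stays offline longer than the lifecycle would recover older messages from the message storage database instead.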
The message synchronization database and the message storage database have different database requirements. Next, we will discuss database selection.
Database Selection
The message synchronization database and the message storage database are the core databases of the message system. They have different database requirements:
To sum up, the database requirements are:
1. The schema design must be able to meet the functional requirements of the timeline model: it does not have to be a relational model, but it should be able to implement a queue model to enable the generation of auto-incrementing SeqIds.
2. Able to support highly concurrent writes and range reads, with a capacity of 100,000+ TPS.
3. Able to store massive amounts of data, measured in hundreds of TB.
4. Able to define data lifecycle.
Alibaba Cloud Table Store is a distributed NoSQL database based on an LSM storage engine. It supports highly concurrent reads and writes at millions of TPS, PB-level data storage, and TTL. It satisfies all of the above requirements and also supports auto-increment, which makes it well suited to both the design and the physical model of the timeline.
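To make the mapping from the logical timeline to a physical table concrete, the following Python sketch stores many timelines in one sorted table keyed by `(timeline_id, seq_id)`, with the table itself assigning the SeqId. It is an in-memory stand-in and not the Table Store API: the class `TimelineTable` and the methods `put` and `get_range` are hypothetical names chosen for this example.

```python
import bisect


class TimelineTable:
    """One sorted table holding many timelines, keyed by (timeline_id, seq_id)."""

    def __init__(self):
        self._keys = []      # sorted list of (timeline_id, seq_id)
        self._values = {}    # (timeline_id, seq_id) -> message
        self._next_seq = {}  # timeline_id -> next SeqId to assign

    def put(self, timeline_id, message):
        """Append a row; the table, not the caller, assigns the SeqId."""
        seq_id = self._next_seq.get(timeline_id, 1)
        self._next_seq[timeline_id] = seq_id + 1
        key = (timeline_id, seq_id)
        bisect.insort(self._keys, key)
        self._values[key] = message
        return seq_id

    def get_range(self, timeline_id, start_exclusive, limit):
        """Rows of one timeline with SeqId > start_exclusive, at most `limit` of them."""
        i = bisect.bisect_right(self._keys, (timeline_id, start_exclusive))
        rows = []
        while i < len(self._keys) and len(rows) < limit:
            tid, seq_id = self._keys[i]
            if tid != timeline_id:
                break  # reached the next timeline's rows
            rows.append((seq_id, self._values[(tid, seq_id)]))
            i += 1
        return rows
```

Putting the timeline identifier first in the key keeps all rows of one timeline adjacent, so a synchronization or roaming request becomes a single ordered range read starting just after the caller's checkpoint.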
Architecture Implementation
Let the code speak for itself. For the detailed sample code, click here.
Postscript
This article mainly describes the implementation of the message push and storage architectures in a modern IM system. Based on the timeline logic model, we can clearly understand the message synchronization and storage architectures. Table Store is a perfect implementation of the timeline model. Its auto-incrementing feature solves the most critical problem of the timeline model — the auto-incrementing SeqId.
Table Store is a professional distributed NoSQL database independently developed by Alibaba Cloud. It is a high-performance, low-cost, scalable, and fully managed semi-structured data storage platform based on shared storage. It supports efficient calculation and analysis of Internet and IoT data. The message push and storage scenario of the IM system is one of the most important applications of Table Store in the social networking field.
The timeline-based message storage and push model can be applied in many scenarios besides IM messaging, such as feed streams, real-time message synchronization, and bullet screens in live broadcasts. We have also done in-depth studies of the feed stream field and of other scenarios.
We have been constantly improving Table Store to meet the high-availability and high-reliability data requirements of the social networking scenario: