“The King of Goods” on Taobao: What Is Behind the Low Latency of Live Streaming?

Jan 5, 2021

By Chang Gaowei (Chang Shan)
Collated by LiveVideoStack
From Taobao Technology Department, Alibaba New Retail

Editor’s note: This speech was given by Chang Gaowei, a Technical Expert from the Taobao Technology Department, at LiveVideoStack 2019 Shenzhen. It is aimed at practitioners in the live streaming industry and technicians who are interested in low-latency live streaming and WebRTC technologies. This article introduces Taobao’s exploration of low-latency live streaming technologies, how to achieve live streaming with a latency under one second based on WebRTC, and the business value of low-latency streaming for e-commerce live streaming.

Author: Chang Gaowei (Changshan), Technical Expert from Taobao Technology Department. This article introduces how Taobao live streaming reduces latency in e-commerce live streaming. Low-latency live streaming is a new trend in live streaming technology, and the following focuses on our practices in this area. We will look at each of the aspects shown in the preceding figure.

Background of Taobao Live Streaming

Today, Taobao live streaming is becoming increasingly familiar to all of us. Since its launch in 2016, the turnover of Taobao live streaming has set new highs in recent years. In 2018, its turnover exceeded RMB 100 billion. During the last Double 11 Global Shopping Festival, the turnover reached RMB 20 billion.

E-commerce live streaming and powerful interaction are the two major strengths of Taobao live streaming.

E-Commerce Live Streaming: While watching the live broadcast, the audience can purchase the product online and interact with the anchor. During the broadcast, the anchor introduces some products, and the audience can ask questions about the products. Compared with traditional online shopping, this approach provides more information about products, yielding a higher conversion rate.
Powerful Interaction: Compared with traditional TV shopping, the anchor and the audience can interact in various ways during the broadcast, including commenting, giving likes, and following the anchor through social media. As reflected in practice, commenting is an important way for the audience to interact with the anchor and is crucial to reaching a deal.

Traditional live streaming suffers from high latency. Generally, it takes more than 10 seconds for the anchor to respond to the audience’s comments. Our initial motivation for researching low-latency live streaming was to use multimedia technologies to reduce this latency and make interaction between anchors and their fans more efficient.

Low-Latency Technology Selection

The Latency of Current Live Streaming Technology

After analyzing current mainstream live broadcasting technologies, we found that HLS and RTMP/HTTP-FLV were two major protocols used in this scenario before the emergence of low-latency live streaming.

For HLS, latency mainly comes from encoding and decoding, network transmission, and content delivery network (CDN) distribution. Because HLS is a sliced protocol, its protocol-specific latency has two parts: the server buffers data until a slice is complete, and the player buffers slices to absorb jitter. Both the size and the number of slices therefore affect HLS latency, which generally exceeds 10 seconds.
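As a rough illustration of why slice size and slice count dominate HLS latency, the two buffering stages above can be put into a back-of-the-envelope model. This is only a sketch under simplifying assumptions: the function name, the default of one slice held on the server, and the sample slice sizes are hypothetical, and encoding and network delay are ignored.

```python
def hls_slicing_latency(slice_seconds: float,
                        player_slices: int,
                        server_slices: int = 1) -> float:
    """Estimate the latency (in seconds) added by slicing alone.

    server_slices: slices the server must finish before publishing (assumed 1)
    player_slices: slices the player buffers before it starts playback
    """
    return slice_seconds * (server_slices + player_slices)


if __name__ == "__main__":
    # Hypothetical slice sizes; the player is assumed to buffer 3 slices.
    for slice_seconds in (2.0, 4.0, 6.0):
        total = hls_slicing_latency(slice_seconds, player_slices=3)
        print(f"{slice_seconds:.0f} s slices -> about {total:.0f} s of slicing latency")
```

Even with this simplified model, a few multi-second slices are enough to push the latency past 10 seconds.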

For RTMP/HTTP-FLV, most vendors in China use RTMP, whose server side is better optimized for latency than HLS. The RTMP server does not slice the stream but forwards each frame individually, which minimizes CDN delivery latency. RTMP latency mainly results from the anti-jitter buffer of the player: to keep playback smooth when the network jitters, a buffering latency of 5 to 10 seconds is introduced.

Both protocols are based on TCP. Chinese vendors have already pushed RTMP over TCP to its minimum latency, so it would be hard for any TCP-based protocol to achieve lower latency than the current RTMP.

The inherent features of TCP make it unsuitable for low-latency live streaming, for the following reasons.

Slow Retransmission: With the ACK mechanism of TCP, a lost packet is retransmitted after the sending end times out. The timeout period is usually 200 milliseconds, which causes jitter on the receiving end.
Inaccurate Determination of Congestion: The congestion control algorithm based on packet loss cannot accurately determine the congestion situation, because packet loss is not equivalent to congestion. As a result, this causes bufferbloat in the sending process, increases the round-trip time (RTT) of the link, and finally increases the latency.
Poor Flexibility: This is the biggest disadvantage of TCP. The TCP congestion control algorithm is implemented at the kernel layer of the operating system, which requires high optimization costs. Therefore, mobile terminals can only use the existing optimization methods of the system.

Due to the preceding reasons, we chose a UDP-based solution instead.

Selection of Low-Latency Live Streaming Technologies

The preceding figure compares two UDP-based solutions. The first is a proven UDP solution, such as QUIC, and the other is an RTC solution, such as WebRTC.

We compared the two solutions from five dimensions:

Transmission Method: In essence, QUIC provides reliable transmission, whereas RTC provides semi-reliable transmission. In certain situations, RTC can transmit audio and video lossily, which effectively reduces latency.
Complexity: QUIC has low complexity, because a TCP interface can easily be converted to a QUIC interface. Conversely, the RTC solution is very complex and involves a set of protocol designs and quality-of-service (QoS) protection mechanisms.
Audio and Video Friendliness: QUIC does not differentiate the content it transmits and carries audio and video data transparently. RTC is friendlier to audio and video data and can be customized and optimized for it.
Completeness: In terms of solution completeness, QUIC is optimized for the transport layer, whereas WebRTC can provide end-to-end optimization solutions.
Theoretical Latency: According to our lab tests and online data analysis results, the WebRTC latency can be less than 1 second.

To sum up, the greatest advantage of the QUIC solution lies in its low complexity. However, achieving a lower latency with it would make the solution more complex. In the end, we chose WebRTC as our low-latency solution due to its technical sophistication. We also believe that the integration of RTC and live streaming technologies will be a trend in audio and video transmission in the future.

Alibaba Cloud RTS

RTS is a low-latency live streaming system co-built by Alibaba Cloud and Taobao Live. This system is divided into two parts:

Uplink Access: supports three access methods. The first is H5 terminals, which use standard WebRTC to ingest streams to the RTS system. The second is traditional RTMP stream ingest software programs, such as OBS, which use the RTMP protocol to ingest streams to the RTS system. The third is low-latency stream ingest clients, which allow us to use extended private protocols based on RTP/RTCP to ingest streams to the RTS system.
Downlink Distribution: provides two types of low-latency distribution: standard WebRTC distribution and distribution over extended private protocols based on WebRTC. Currently, the most widely used method for Taobao live streaming is private protocol-based distribution.

Low-Latency Live Streaming Technologies

Next, I will introduce low-latency live streaming technologies in a logical sequence from process protocol to terminal solution. Also, I will answer these common questions:

How to connect to the standard WebRTC terminal
How to connect to a native terminal for a better user experience
How to design a low-latency live streaming solution based on WebRTC
How to modify the player to support low-latency live streaming

Standard WebRTC Connection Process

Generally, the playback process is:

The player sends a connection request: An AccessRequest, which carries the playback URL and offerSDP, is sent through HTTP.
After receiving the AccessRequest, RTS records the offerSDP and URL and then creates an answerSDP. Next, RTS generates a session token, sets it in the ufrag field of SDP, and sends the response to the client through HTTP.
The client sets an answerSDP and sends a Binding Request in which the USERNAME field carries the ufrag (the token issued by RTS) value in the answerSDP.
After receiving the Binding Request, RTS retrieves the information of the previously sent HTTP request based on the token in USERNAME and records the user’s IP address and port.

Because RTS is a single-port solution, the token is passed in the USERNAME field of the Binding Request, and the token in the UDP request identifies exactly which user sent it. Conventionally, WebRTC differentiates users by port, but allocating a port for each user would incur huge O&M costs for RTS.
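The single-port lookup described above can be sketched as a session table keyed by the token. This is a minimal illustration, not the RTS implementation: the class and method names are hypothetical, the token format is an assumption, and the code relies only on the standard ICE rule that the USERNAME of a connectivity check is the receiver's ufrag, a colon, and the sender's ufrag.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Address = Tuple[str, int]


@dataclass
class Session:
    url: str
    offer_sdp: str
    remote_addr: Optional[Address] = None  # learned from the Binding Request


class SessionTable:
    """Maps server-issued ufrag tokens to pending playback sessions."""

    def __init__(self) -> None:
        self._by_token: Dict[str, Session] = {}

    def on_access_request(self, url: str, offer_sdp: str) -> str:
        # Issue a token that will be placed in the ufrag of the answerSDP.
        token = secrets.token_hex(8)
        self._by_token[token] = Session(url, offer_sdp)
        return token

    def on_binding_request(self, username: str, source: Address) -> Optional[Session]:
        # USERNAME is "<server ufrag>:<client ufrag>"; the first part is our token.
        token = username.split(":", 1)[0]
        session = self._by_token.get(token)
        if session is not None:
            session.remote_addr = source  # record the user's IP address and port
        return session
```

Because every user is identified by the token rather than by a dedicated port, all sessions can share one UDP socket.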

In addition, the standard WebRTC connection process has the following restrictions.

It does not support AAC or the 44.1K sample rate for audio, both of which are commonly used in live streaming.
It does not support video coding features such as B-frames and H265, and multi-slice encoding causes blurred pictures in weak network conditions.
Establishing a WebRTC connection takes a long time, which degrades the quick-startup experience.

Based on the preceding problems, we designed a private protocol-based connection method with better efficiency and compatibility.

Private Protocol Connection Process

The playback process (four-way handshake) is:

The client sends a PlayRequest through UDP, which carries the playback URL.
After receiving the PlayRequest, RTS immediately returns a provisional response and starts back-to-origin.
After the back-to-origin succeeds, RTS sends the final response with relevant media descriptions, such as encoding and decoding information.
The client sends the final acknowledgment (ACK) message to notify the server of the successful information reception.

RTS sends media data through RTP/RTCP, and the connection is established. In this process, note the following points:

The PlayRequest notifies the CDN of the URL and, at the same time, performs UDP hole punching.
Signaling and media are transmitted over the same UDP channel.

The four-way handshake process is designed to ensure the connection speed and the reliable delivery of important information.

The whole connection process takes only one RTT period, and therefore the connection speed is fast.

Design of the Private Protocol State Machine

The preceding figure shows the process state machine of a private protocol.

In the initial state, the client sends a playback request and then enters the pending provisional response state.

In the pending provisional response state, the client starts a millisecond-level quick retransmission timer. If the timer expires before a provisional response arrives, the playback request is retransmitted immediately to keep connection establishment fast. After receiving the provisional response, the client enters the pending final response state and starts a second-level timer.

After the final response is received, the connection is successfully established.

Provisional responses can be lost or arrive out of order, so the final response may arrive before the provisional response. In that case, the client moves directly from the pending provisional response state to the final state.
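The transitions described above can be sketched as a small event-driven state machine. This is an illustration only: the state and event names, the retry limit, and the move to a failed state when a timer finally gives up are assumptions that the talk does not specify.

```python
from enum import Enum, auto
from typing import List


class State(Enum):
    INIT = auto()
    WAIT_PROVISIONAL = auto()
    WAIT_FINAL = auto()
    ESTABLISHED = auto()
    FAILED = auto()  # assumed outcome when retries or the slow timer run out


class Event(Enum):
    SEND_REQUEST = auto()
    PROVISIONAL = auto()
    FINAL = auto()
    FAST_TIMEOUT = auto()  # millisecond-level retransmission timer
    SLOW_TIMEOUT = auto()  # second-level timer


class PlayHandshake:
    def __init__(self, max_retries: int = 5) -> None:
        self.state = State.INIT
        self.retries = 0
        self.max_retries = max_retries  # hypothetical limit
        self.actions: List[str] = []    # actions the caller should perform

    def handle(self, event: Event) -> State:
        s = self.state
        if s is State.INIT and event is Event.SEND_REQUEST:
            self.actions.append("send PlayRequest; start fast timer")
            self.state = State.WAIT_PROVISIONAL
        elif s is State.WAIT_PROVISIONAL and event is Event.FAST_TIMEOUT:
            if self.retries < self.max_retries:
                self.retries += 1
                self.actions.append("resend PlayRequest; restart fast timer")
            else:
                self.state = State.FAILED
        elif s is State.WAIT_PROVISIONAL and event is Event.PROVISIONAL:
            self.actions.append("start slow timer")
            self.state = State.WAIT_FINAL
        elif s in (State.WAIT_PROVISIONAL, State.WAIT_FINAL) and event is Event.FINAL:
            # A final response may overtake a lost provisional response.
            self.actions.append("send final ACK")
            self.state = State.ESTABLISHED
        elif s is State.WAIT_FINAL and event is Event.SLOW_TIMEOUT:
            self.state = State.FAILED
        # Any other (state, event) pair, such as a duplicate response, is ignored.
        return self.state
```

Accepting the final response in either waiting state is what lets a lost provisional response cost no extra round trip.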

Design of Private Protocol Signaling

For private signaling, we chose the RTCP protocol. The reason is that RFC3550 defines four functions of RTCP, of which the fourth optional function is described like this: RTCP can be used in a “loosely controlled” system to transmit minimal session control information. For example, the standard defines a BYE message to notify that the source is no longer active. On this basis, we extended the signaling messages of connection establishment, including request, provisional response, final response, and final response ACK messages.

At the same time, we chose APP messages among RTCP messages to carry private signaling messages. APP is the official extension mechanism that RTCP provides for new applications and features, and the application layer can customize message types and content. In addition, this protocol was chosen based on the following considerations:

RTCP-APP can be used to distinguish private protocols from standard RTP/RTCP. As mentioned earlier, media and signaling share the same channel. Based on the packet formats defined in RFC3550, the server can tell private protocol packets, RTP packets, and native RTCP packets apart after receiving them.
With this protocol, existing packet analysis tools can be used directly to capture and parse the packets.
RTCP-related definitions, such as the payload type, subtype, and ssrc, can be reused.
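To illustrate how signaling and media can share one UDP channel, the sketch below classifies an incoming datagram by its second byte. It is not the RTS code: the 192 to 223 range for RTCP packet types follows the common RTP/RTCP multiplexing convention (RFC 5761), which the talk does not mention, and the application name `DEMO` is a hypothetical placeholder.

```python
RTCP_APP = 204  # RTCP APP packet type defined in RFC 3550


def classify_packet(datagram: bytes, app_name: bytes = b"DEMO") -> str:
    """Return 'private-signaling', 'rtcp', 'rtp', or 'unknown'."""
    # Both RTP and RTCP start with a 2-bit version field that must equal 2.
    if len(datagram) < 4 or datagram[0] >> 6 != 2:
        return "unknown"
    packet_type = datagram[1]
    if 192 <= packet_type <= 223:
        # An APP packet carries SSRC at bytes 4..7 and the 4-byte name at 8..11.
        if (packet_type == RTCP_APP and len(datagram) >= 12
                and datagram[8:12] == app_name):
            return "private-signaling"
        return "rtcp"
    return "rtp"
```

A server loop would call `classify_packet` on every datagram and route it to the signaling handler, the RTCP handler, or the media path.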

RTCP-APP Messages

The preceding figure shows the RTCP-APP message header, which contains the following key fields.

1. subtype

The message subtype can be used to define private application extension messages. Requests, provisional responses, and final responses of private signaling are distinguished by the subtype. The subtype value ranges from 0 to 31, where 31 is reserved for future message type extension.

2. payload type

The payload type of APP is fixed to 204. It can be used to distinguish APP messages from other RTP and RTCP messages.

3. SSRC

SSRC is the identifier of an RTCP sender.

4. name

“name” indicates the application name, which is used to distinguish messages from different applications. In RFC3550, two fields are used to distinguish message types. “name” is used to identify the application type, and “subtype” is used to determine the message type. Multiple subtypes are allowed for the same name.

5. application-dependent data

To extend the message content at the application layer, the type, length, and value (TLV) format is used. It is a flexible and highly scalable binary data format with low space consumption.
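The header fields listed above can be put together in a short encoder and decoder. This is a sketch under stated assumptions: the header layout follows the RTCP APP packet in RFC 3550, but the TLV layout (1-byte type, 2-byte length, type 0 reserved for padding), the name `DEMO`, and the field numbers are hypothetical, since the talk does not give the actual encoding.

```python
import struct
from typing import Dict, Tuple

RTCP_APP = 204  # fixed payload type of APP messages


def pack_tlvs(fields: Dict[int, bytes]) -> bytes:
    # Assumed TLV layout: 1-byte type, 2-byte big-endian length, then the value.
    return b"".join(struct.pack("!BH", t, len(v)) + v for t, v in fields.items())


def parse_tlvs(data: bytes) -> Dict[int, bytes]:
    fields: Dict[int, bytes] = {}
    pos = 0
    while pos + 3 <= len(data):
        ftype, length = struct.unpack_from("!BH", data, pos)
        if ftype == 0:  # type 0 is treated as padding in this sketch
            break
        pos += 3
        if pos + length > len(data):
            raise ValueError("truncated TLV value")
        fields[ftype] = data[pos:pos + length]
        pos += length
    return fields


def pack_app(subtype: int, ssrc: int, name: bytes, fields: Dict[int, bytes]) -> bytes:
    if not 0 <= subtype <= 31:
        raise ValueError("subtype must fit in 5 bits")
    if len(name) != 4:
        raise ValueError("name must be exactly 4 bytes")
    body = pack_tlvs(fields)
    body += b"\x00" * (-len(body) % 4)        # pad to a 32-bit boundary
    length_words = (12 + len(body)) // 4 - 1  # RTCP length: 32-bit words minus one
    first = (2 << 6) | subtype                # version 2, no padding bit, subtype
    return struct.pack("!BBHI4s", first, RTCP_APP, length_words, ssrc, name) + body


def parse_app(packet: bytes) -> Tuple[int, int, bytes, Dict[int, bytes]]:
    first, ptype, length_words, ssrc, name = struct.unpack_from("!BBHI4s", packet)
    if first >> 6 != 2 or ptype != RTCP_APP:
        raise ValueError("not an RTCP APP packet")
    end = (length_words + 1) * 4
    return first & 0x1F, ssrc, name, parse_tlvs(packet[12:end])
```

For example, `pack_app(1, 0xDEADBEEF, b"DEMO", {1: b"live/stream1"})` could stand for a play request whose field 1 carries the playback URL, and `parse_app` recovers the same subtype, SSRC, name, and fields on the receiving side.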

Design of Private Protocol Media

The protocol body for media complies with the RTP/RTCP and WebRTC specifications. More specifically, H264 follows RFC6184 and H265 follows RFC7798.

For the RTP extension, the standard RTP extension method is used. For compatibility with WebRTC, the definition of the standard RTP extension header follows RFC5285.

Comparison of Two Connection Methods

Standard WebRTC connection has the following advantages:

Standard WebRTC connection conforms to WebRTC specifications except for HTTP requests for connection establishment.
Standard terminals can be easily integrated.
Rapid prototyping is supported.

Standard WebRTC connection has the following disadvantages:

Establishing a connection takes a long time. With HTTP requests, connection establishment takes 5 RTT periods. With HTTPS requests, it takes even longer.
The media must be encrypted for transmission.
The transmission of audio and video is subject to certain rules, and therefore they must be transcoded on the server.

Private protocol connection has the following advantages:

Private protocol connection is based on standard extension signaling and media protocols, and therefore its implementation conforms to standard protocols.
The connection speed is fast, and the instant-start experience (the first screen appearing within one second) is good.
The live streaming technology stack is supported, which enhances media compatibility and reduces server-end transcoding costs.

Private protocol connection has the following disadvantages:

Although the private protocol connection is based on standard extension methods, it still introduces some proprietary elements.
The use of private protocols increases the complexity of the solution.

In the final Taobao live streaming solution, to obtain a better user experience, we use private protocols to connect native terminals, and this method has been widely adopted online.

Principles of Process Protocol Design

There are three principles for designing a process protocol:

Always follow standards whenever possible, including WebRTC-related specifications. By following this principle, we can easily communicate with standard WebRTC.
Prioritize user experience if standards and user experience conflict.
Extend based on standard protocols and use standard extension methods when the extension is required.

In practice, we did not discard the RTP/RTCP protocols in favor of completely new private protocols, for two reasons. The first is the workload: redesigning protocols requires more work than using standard ones. The second is that the RTP/RTCP protocol design is streamlined and proven, whereas a new design may not take all potential problems into account. For these reasons, we chose standard-based extension rather than redesign.

Terminal Connection Plan

Connection Solution Based on the Complete Set of WebRTC Modules

The connection solution based on the complete set of WebRTC modules uses all WebRTC modules and modifies some of them to implement low-latency live streaming capabilities.

This solution has both pros and cons.

Advantages:

After years of development, this solution has become very mature and stable. At the same time, it provides a complete solution, including NACK, JitterBuffer, and NetEQ, which can be directly used for low-latency live streaming.

Disadvantages:

The preceding figure shows the overall WebRTC architecture. It is a complete end-to-end solution that covers data collection, rendering, encoding, and decoding, as well as network transmission. Therefore, integrating it is highly invasive and complex for existing stream ingestion terminals and players.

The RTC technology stack is different from the live streaming technology stack. The former does not support features like B-frames and H265. In terms of the QoS policy, the native application scenario of WebRTC is phone calls. The basic QoS policy of WebRTC is to prioritize latency over picture quality, which is not necessarily true for live streaming.

As for the package size, packing all WebRTC modules into the app increases the package size by at least 3 MB.

Connection Solution Based on the WebRTC Transport Layer

The current overall connection solution for terminals is based on WebRTC, as shown in the preceding figure. However, we only use several core transmission-related modules of WebRTC, including RTP/RTCP, FEC, NACK, JitterBuffer, audio and video synchronization, and congestion control.

We encapsulated these basic modules as an FFmpeg plug-in and injected it into FFmpeg. The player can then directly call the FFmpeg method to open the URL and access our private low-latency live streaming protocol. This dramatically reduces the modifications needed in the player and stream ingestion terminal and makes low-latency live streaming far less intrusive to the original system. In addition, using only the basic modules greatly reduces the resulting package size.

Modifications of the Player for Low-Latency Live Streaming

The preceding figure shows the architecture of a common player. The player uses FFmpeg to access the network connection, reads audio and video frames, and caches them in the player buffer. Then, these frames will be decoded, audio-video synchronized, and rendered.

The following figure shows the overall architecture of the low-latency live streaming system. As you can see, a low-latency live streaming plug-in is added to FFmpeg to support our private protocols. The buffer duration of the player is set to 0 seconds, and the audio and video frames output by FFmpeg are passed directly to the decoder, then synchronized and rendered. In this architecture, we moved the original player buffer into the transport-layer SDK, where JitterBuffer dynamically adjusts the buffer size based on network conditions.

This solution requires minor modifications to the player and does not affect the original player logic.
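To make the idea of a buffer that follows network conditions more concrete, the sketch below derives a target buffering delay from measured jitter. It is not the WebRTC JitterBuffer algorithm: it uses the simpler interarrival jitter estimate from RFC 3550, and the class name, the multiplier, and the minimum and maximum delays are hypothetical tuning values.

```python
from typing import Optional


class AdaptiveJitterDelay:
    """Tracks network jitter and suggests how long to buffer before decoding."""

    def __init__(self, min_ms: float = 20.0, max_ms: float = 1000.0,
                 multiplier: float = 4.0) -> None:
        self.min_ms = min_ms          # assumed floor for the buffer
        self.max_ms = max_ms          # assumed cap, keeps latency bounded
        self.multiplier = multiplier  # assumed safety factor over the jitter
        self.jitter_ms = 0.0
        self._prev_transit: Optional[float] = None

    def on_packet(self, send_ms: float, recv_ms: float) -> float:
        """Update the estimate with one packet and return the new target delay."""
        transit = recv_ms - send_ms
        if self._prev_transit is not None:
            d = abs(transit - self._prev_transit)
            # RFC 3550 smoothing: J = J + (|D| - J) / 16
            self.jitter_ms += (d - self.jitter_ms) / 16.0
        self._prev_transit = transit
        return self.target_delay_ms()

    def target_delay_ms(self) -> float:
        wanted = self.multiplier * self.jitter_ms
        return min(self.max_ms, max(self.min_ms, wanted))
```

On a stable network the target stays at the minimum, so almost no latency is added. When arrival times start to vary, the target grows, trading a little latency for smooth playback.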

Business Value of Low-Latency Live Streaming

Low-latency live streaming technologies have been widely applied in Taobao live streaming. They reduce the latency of Taobao live streaming and improve the user interaction experience, which is of great value for Alibaba Taobao.

Ultimately, technical optimizations are meant to yield business value, and any improvement in the user experience should help drive business development. From online tests, we found that low-latency live streaming can significantly promote the transactions of e-commerce live streaming, with the unique visitor (UV) conversion rate increased by 4% and the gross merchandise volume (GMV) by 5%.

Additionally, low-latency live streaming technologies support various business scenarios, such as live auction and live customer service.

Future Prospects

We see three prospects for low-latency live streaming technologies:

The current WebRTC open-source software program does not support live streaming very well. We hope that the future standard WebRTC program can do this better so that we can easily conduct low-latency live streaming on browsers.
With the advent of 5G, the network condition will get better, and low-latency live streaming technologies will become a new future trend in the live streaming industry.
Most low-latency live streaming protocols that different vendors are using are private. For users, the cost of moving from one vendor to another is high. So, it is important to unify and standardize low-latency live streaming protocols for the live streaming industry. With the popularization of low-latency live streaming technologies, I believe that low-latency live streaming protocols will become more unified and standardized in the future. I also hope that technology vendors in China can make their own voices heard and contribute their own efforts to the process of promoting the standardization of low-latency live streaming.