The Technologies Behind Taobao Live Streaming

8 min readOct 29, 2020

Step up the digitalization of your business with Alibaba Cloud 2020 Double 11 Big Sale! Get new user coupons and explore over 16 free trials, 30+ bestselling products, and 6+ solutions for all your needs!

Live stream shopping is a rapidly growing e-commerce trend. However, audio-visual technology may be unfamiliar territory for many front-end engineers. This article explores the world of Taobao live broadcast and uncovers a variety of information about streaming media technology, such as text, graphics, images, audio, and video. It also introduces players, web media technology, and mainstream frameworks. This article envisions opening the door to the realm of frontend multimedia.

Basics of Audio-visual Technology

Video

Terms

Video Container Formats

Popular container formats include MP4, AVI, FLV, TS, M3U8, WebM, OGV, and MOV.

Audio

Terms

Audio Container Formats

Popular audio formats include WAV, AIFF, AMR, MP3, and OGG.

Audio Encoding Formats

Live Streaming Technology

Streaming Media Protocols

Over the Internet, audio and video media data is transmitted through network protocols, which are distributed at the session layer, presentation layer, and application layer.

Commonly used protocols include the Real-time Messaging Protocol (RTMP), Real-time Transport Protocol (RTP), Real-time Control Protocol (RTCP), Real-time Streaming Protocol (RTSP), HTTP-Flash Video (FLV) protocol, HTTP Live Streaming (HLS) protocol, and Dynamic Adaptive Streaming over HTTP (DASH) protocol. Each protocol has its own advantages and disadvantages.

Stream Ingestion and Stream Pulling

After a live shopping host starts a live stream, a recording device collects the host’s voice and image and ingests them into a streaming media server through the corresponding protocol. Then, streaming data is pulled from the streaming media server through a stream pulling protocol to play the streams for viewers.

Players

This section introduces player-related technologies and how players process pulled streams.

Stream Pulling

Video streams must be pulled before they are played back by a player.

For example, video streaming data in FLV format is pulled by using the Fetch API and Stream API provided by a web browser.

Definition of Demux

The pulled streaming data is then demultiplexed. Images, sounds, and subtitles (if any) are separated from the pulled streaming data by a demultiplexer. This process is called Demux.

The demultiplexed images, sounds, and subtitles are called basic streams, which are decoded by a decoder.

Demux Process

The demultiplexed elementary bitstreams are decoded into data that is played back by audio and video players.

The decoded data is of various types, some of which are introduced in the following sections.

SPS and PPS

The Sequence Parameter Set (SPS) and Picture Parameter Set (PPS) jointly determine the maximum video resolution, frame rate, and a series of video playback parameters. The PPS and SPS are typically stored in the starting position of a bitstream.

The SPS and PPS store a set of global parameters for an encoded video sequence. If these parameters are lost, the decoding may fail.

I-frame, B-frame, and P-frame

The I-frame is a keyframe. The data of an I-frame can be independently decoded through intra-frame prediction. The I-frame is typically the first frame in a Group of Pictures (GOP). GOP is a video compression technology used by MPEG. The I-frame is appropriately compressed into a random access reference point and used as a static image.
The B-frame is a forward pre-encoded frame. This frame is predicted based on a previous I-frame or P-frame and a subsequent I-frame or P-frame. During decoding, the data of the current B-frame is superimposed with both the previously cached picture and the decoded subsequent picture to generate the final picture.
The P-frame is a forward predictive frame. During image encoding, the data of a P-frame to be transmitted is compressed by removing all the redundant time information about encoded frames in an image sequence.

SEI

Supplemental enhancement information (SEI) is optional when video encoders output video bit streams. For example, in live Q&A streams, much Q&A-related information is carried by SEI. This improves the synchronicity between question display and audio-visual presentation to viewers.

PTS and DTS

Decoding Timestamp (DTS) is used to tell a player when to decode the data of a frame.
Presentation Timestamp (PTS) is used to tell a player when to display the data of a frame.

The DTS and PTS may directly determine the synchronicity of audio and video playback.

The decoding process generates a variety of products. For more information, see the last section of this article.

Remux

Remux is the opposite of Demux. Remux is the process of combining an audio elementary stream (ES), video ES, and subtitle ES into a piece of complete multimedia.

Both Remux and Demux are required to change the multiplexing and encoding formats of a video file. The specific processes are not described here.

Rendering

Rendering refers to the playback of decoded data on PC hardware, such as monitors and speakers. The module responsible for rendering is called a renderer. Popular video renderers include the Enhanced Video Renderer (EVR) and Madshi Video Renderer (madVR). Renderers are typically built into web players by using video tags.

Custom rendering: For example, the H.265 player uses the APIs provided by a web browser to create a simulated video tag and uses the canvas and audio elements for rendering.

Web Media Technologies

WebRTC

Web real-time communication (WebRTC) allows network applications and websites to establish peer-to-peer (P2P) connections between web browsers without using intermediate media. These connections support the fast transmission of audio streams, video streams, and other types of data.

WebRTC consists of a video engine, voice engine, session management, iSAC (designed for voice compression), VP8 (a video codec developed by Google’s WebM project), and APIs, such as the native C++ APIs and web APIs.

MSE

Media Source Extensions (MSE) is a technology specification for players. The MSE API provides a web-based streaming media feature without using any plug-ins. MSE allows JavaScript to create media streams and playback the streams by using the audio and video elements.

MSE significantly extends the media playback features of web browsers and allows JavaScript to create media streams. MSE is applicable to adaptive streaming and has evolved to support live streaming and other scenarios.

WebXR

Extended reality (XR) includes virtual reality (VR), augmented reality (AR), and mixed reality (MR). WebXR supports devices enabled with all types of realities. WebXR allows creating immersive content that runs on all VR and AR devices. This delivers a web-based VR and AR experience.

WebGL

Web Graphics Library (WebGL) is a 3D drawing standard that supports user interaction. WebGL supports the combination of JavaScript and OpenGL ES 2.0. Add a JavaScript binding of OpenGL ES 2.0 to allow WebGL to accelerate the 3D rendering of hardware that uses HTML5 Canvas. During web-based development, use a system graphics card to display 3D scenes and models in a web browser more fluently and create complex navigation and visual data.

WebGL uses a canvas for rendering. In the “Players” section, we learned that players can use a canvas for image rendering. WebGL enhances the playback fluency and other capabilities of players.

WebAssembly

WebAssembly (Wasm) is a new portable and web-compatible format that features small sizes and fast loading. Wasm is a new specification developed by the W3C community, which consists of mainstream web browser manufacturers.

For more information, visit https://webassembly.org/

Wasm allows players to be integrated with FFmpeg enabling players to decode H.265 videos that aren’t recognized by web browsers.

Open-source Products and Frameworks in Use

The following sections introduce the most popular open-source products and frameworks at present.

flv.js

flv.js is an open-source HTML5 FLV player developed by www.bilibili.com. Based on the HTTP-FLV protocol, flv.js transmutes FLV data through Pure JavaScript to ensure that FLV files are played back in web mode.

Official GitHub Page: https://github.com/bilibili/flv.js

hls.js

hls.js is a JavaScript player library developed based on the HLS protocol. hls.js uses MSE to play HLS files in web mode.

The HLS protocol proposed by Apple Inc. is widely supported by mobile clients, so it is widely used in live streaming scenarios.

Official GitHub Page: https://github.com/video-dev/hls.js/

video.js

video.js is an HTML5-based player that playbacks H5 and Flash files and provides more than 100 plug-ins. video.js playbacks the files in HLS and DASH formats and allows creating custom themes and subtitle extensions. video.js is applicable to many scenarios worldwide.

Official GitHub Page: https://github.com/videojs/video.js

Website: https://videojs.com/

FFmpeg

FFmpeg is a leading multimedia framework and an open-source, cross-platform multimedia solution. FFmpeg provides a range of features, such as audio and video encoding, decoding, transcoding, multiplexing, demultiplexing, streaming media, filters, and playback.

FFmpeg is used by the following frontend components:

JavaScript Players: FFmpeg is integrated with Wasm to implement JavaScript players in web browsers or to extend other audio-visual capabilities of web browsers.
Node Module Fluent-FFmpeg: This is a useful module in Node.js, which streamlines the complex commands of FFmpeg. Use this module to upload files and process video streams. For more information, visit the fluent-ffmpeg website.

OBS

Open Broadcaster Software (OBS) is flexible, open-source software for video recording and live streaming. Written in C and C++, OBS provides a range of features, such as real-time source and device capture, scenario composition, encoding, recording, and broadcast. OBS uses RTMP to transmit data to RTMP-enabled destinations, such as YouTube, Twitch.tv, Instagram, Facebook, and other streaming media websites.

OBS encodes video streams into H.264, MPEG-4 AVC, H.265, and HEVC formats using the x264 free software library, Intel Quick Sync Video, NVIDIA NVENC, and AMD video encoding engine. OBS also encodes audio streams sing MP3 and AAC encoders. If you are familiar with audio-visual encoding and decoding, use the codecs and containers provided by the libavcodec and libavformat libraries and output streams to custom FFmpeg URLs.

MLT