The Technologies Behind Taobao Live Streaming
Live stream shopping is a rapidly growing e-commerce trend, but audio-visual technology may be unfamiliar territory for many front-end engineers. This article explores Taobao Live and surveys streaming media technology across text, graphics, images, audio, and video. It also introduces players, web media technologies, and mainstream frameworks, opening the door to the realm of frontend multimedia.
Basics of Audio-visual Technology
Video Container Formats
Popular container formats include MP4, AVI, FLV, TS, M3U8, WebM, OGV, and MOV.
Audio Container Formats
Popular audio formats include WAV, AIFF, AMR, MP3, and OGG.
Audio Encoding Formats
Popular audio encoding formats include AAC, MP3, AC-3, and Opus.
Live Streaming Technology
Streaming Media Protocols
Over the Internet, audio and video data is transmitted using network protocols that operate at the session, presentation, and application layers.
Commonly used protocols include the Real-Time Messaging Protocol (RTMP), Real-time Transport Protocol (RTP), RTP Control Protocol (RTCP), Real Time Streaming Protocol (RTSP), HTTP-FLV (Flash Video over HTTP), HTTP Live Streaming (HLS), and Dynamic Adaptive Streaming over HTTP (DASH). Each protocol has its own advantages and disadvantages.
Stream Ingestion and Stream Pulling
After a live shopping host starts a live stream, a recording device collects the host’s voice and image and ingests them into a streaming media server through the corresponding protocol. Then, streaming data is pulled from the streaming media server through a stream pulling protocol to play the streams for viewers.
This section introduces player-related technologies and how players process pulled streams.
Video streams must be pulled before they are played back by a player.
For example, video streaming data in FLV format is pulled by using the Fetch API and Stream API provided by a web browser.
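As a rough sketch, and assuming a generic HTTP-FLV endpoint (the function names below are illustrative, not Taobao's actual player code), the pull loop built on the Fetch and Streams APIs looks like this:

```javascript
// Consume a ReadableStream chunk by chunk, handing each Uint8Array to a callback.
async function consumeStream(stream, onChunk) {
  const reader = stream.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    onChunk(value); // in a real player, each chunk goes to the FLV demuxer
  }
}

// Pull a live FLV stream over HTTP and feed its chunks to the consumer.
async function pullFlv(url, onChunk) {
  const response = await fetch(url);
  await consumeStream(response.body, onChunk);
}
```

Because `response.body` is a `ReadableStream`, the player can begin demultiplexing as soon as the first chunk arrives, instead of waiting for a complete download.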
Definition of Demux
The pulled streaming data is then demultiplexed. Images, sounds, and subtitles (if any) are separated from the pulled streaming data by a demultiplexer. This process is called Demux.
The demultiplexed images, sounds, and subtitles are called elementary streams (ES), which are passed to a decoder.
The decoder converts these elementary bitstreams into raw data that audio and video players can render.
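To make this concrete, here is a minimal sketch of the very first step of an FLV demuxer: parsing the 9-byte file header. The field layout comes from the FLV specification; the function and property names are illustrative.

```javascript
// Parse the 9-byte FLV file header from the start of a pulled stream.
function parseFlvHeader(bytes) {
  if (bytes[0] !== 0x46 || bytes[1] !== 0x4c || bytes[2] !== 0x56) {
    throw new Error('not an FLV stream'); // signature must be "FLV"
  }
  const flags = bytes[4];
  return {
    version: bytes[3],
    hasAudio: (flags & 0x04) !== 0, // bit 2: an audio ES is present
    hasVideo: (flags & 0x01) !== 0, // bit 0: a video ES is present
    // 32-bit big-endian offset to the first tag (9 for FLV version 1)
    dataOffset: (bytes[5] << 24) | (bytes[6] << 16) | (bytes[7] << 8) | bytes[8],
  };
}
```

After the header, the demuxer walks a sequence of tags, each marked as audio, video, or script data, which is how the separate elementary streams are recovered.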
The decoded data is of various types, some of which are introduced in the following sections.
SPS and PPS
The Sequence Parameter Set (SPS) and Picture Parameter Set (PPS) jointly determine the maximum resolution, frame rate, and a series of other playback parameters of a video. They are typically stored at the start of a bitstream.
The SPS and PPS hold a set of global parameters for an encoded video sequence. If these parameters are lost, decoding may fail.
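In an H.264 Annex B bitstream, the SPS and PPS are ordinary NAL units identified by their type field. As a sketch (the function name is illustrative), a player can locate them like this:

```javascript
// Scan an H.264 Annex B bitstream for NAL units. After each 0x000001 start
// code, the low 5 bits of the next byte give nal_unit_type:
// 7 = SPS, 8 = PPS, 5 = IDR (keyframe) slice.
function findNalTypes(bytes) {
  const types = [];
  for (let i = 0; i + 3 < bytes.length; i++) {
    if (bytes[i] === 0 && bytes[i + 1] === 0 && bytes[i + 2] === 1) {
      types.push(bytes[i + 3] & 0x1f);
      i += 3; // skip past the start code and NAL header byte
    }
  }
  return types;
}
```

This is why the decoder must see the SPS and PPS (types 7 and 8) before the first picture slice: they carry the global parameters every later frame depends on.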
I-frame, B-frame, and P-frame
- The I-frame is a keyframe. Its data can be decoded independently through intra-frame prediction, so it is typically the first frame in a Group of Pictures (GOP), the basic unit of the video compression scheme used by MPEG codecs. An I-frame is compressed as a complete still image and serves as a random access point.
- The B-frame is a bidirectionally predicted frame. It is predicted from both a preceding I-frame or P-frame and a subsequent I-frame or P-frame. During decoding, the data of the current B-frame is combined with both the previously cached picture and the decoded subsequent picture to generate the final picture.
- The P-frame is a forward-predicted frame. During encoding, a P-frame is compressed by removing the information that is temporally redundant with previously encoded frames in the image sequence.
Supplemental Enhancement Information (SEI) is optional data that video encoders may insert into the output bitstream. For example, in live Q&A streams, much of the Q&A-related information is carried in SEI, which keeps question display synchronized with the audio-visual presentation for viewers.
PTS and DTS
- Decoding Timestamp (DTS) is used to tell a player when to decode the data of a frame.
- Presentation Timestamp (PTS) is used to tell a player when to display the data of a frame.
The DTS and PTS may directly determine the synchronicity of audio and video playback.
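The difference between the two timestamps only matters when B-frames are present, because a B-frame must be decoded after the future frame it references but displayed before it. A small sketch with illustrative values:

```javascript
// Decode order vs. display order for a small GOP containing B-frames.
// The player decodes frames in DTS order but presents them in PTS order.
const packets = [
  { type: 'I', dts: 0, pts: 0 },
  { type: 'P', dts: 1, pts: 3 }, // decoded early: the B-frames reference it
  { type: 'B', dts: 2, pts: 1 },
  { type: 'B', dts: 3, pts: 2 },
];

const decodeOrder = [...packets].sort((a, b) => a.dts - b.dts).map(p => p.type);
const displayOrder = [...packets].sort((a, b) => a.pts - b.pts).map(p => p.type);
// decodeOrder  → ['I', 'P', 'B', 'B']
// displayOrder → ['I', 'B', 'B', 'P']
```

Without B-frames, DTS and PTS are identical; with them, the player must buffer decoded pictures and release them by PTS.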
Decoding produces several kinds of output. For more information, see the last section of this article.
Remux is the opposite of Demux: it combines an audio elementary stream (ES), video ES, and subtitle ES into a complete multimedia file.
Both Remux and Demux are required to change the multiplexing and encoding formats of a video file. The specific processes are not described here.
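The core of any muxer is interleaving: packets from the separate elementary streams are merged into one timestamp-ordered sequence before the container framing is written around them. A minimal sketch (container headers omitted, names illustrative):

```javascript
// Interleave video and audio elementary-stream packets by timestamp,
// the ordering a muxer needs before writing container framing.
function interleave(videoPackets, audioPackets) {
  return [...videoPackets, ...audioPackets].sort((a, b) => a.dts - b.dts);
}
```

Real remuxing tools such as FFmpeg do this per container format, rewriting tag or box headers without touching the compressed payloads.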
Rendering refers to playing decoded data on PC hardware, such as monitors and speakers. The module responsible for rendering is called a renderer; popular video renderers include the Enhanced Video Renderer (EVR) and Madshi Video Renderer (madVR). On the web, rendering is typically delegated to the browser's built-in player through the video tag.
Custom rendering: For example, an H.265 player can use browser APIs to simulate a video tag, rendering pictures with the canvas element and sound with the audio element.
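Decoders usually emit pictures in a YUV color space, so a canvas-based renderer must convert each pixel to RGB before calling `putImageData()`. A minimal sketch of that per-pixel step, assuming BT.601 full-range coefficients (the function name is illustrative):

```javascript
// Convert one YUV pixel (BT.601, full range) to RGB for canvas rendering.
function yuvToRgb(y, u, v) {
  const clamp = x => Math.max(0, Math.min(255, Math.round(x)));
  return {
    r: clamp(y + 1.402 * (v - 128)),
    g: clamp(y - 0.344 * (u - 128) - 0.714 * (v - 128)),
    b: clamp(y + 1.772 * (u - 128)),
  };
}
```

In practice this conversion runs per frame over millions of pixels, which is why players often offload it to a WebGL shader instead of JavaScript.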
Web Media Technologies
Web real-time communication (WebRTC) allows network applications and websites to establish peer-to-peer (P2P) connections between web browsers without using intermediate media. These connections support the fast transmission of audio streams, video streams, and other types of data.
WebRTC consists of a video engine, voice engine, session management, iSAC (designed for voice compression), VP8 (a video codec developed by Google’s WebM project), and APIs, such as the native C++ APIs and web APIs.
Extended reality (XR) encompasses virtual reality (VR), augmented reality (AR), and mixed reality (MR). WebXR supports devices across all of these realities, allowing developers to create immersive content that runs on any VR or AR device and delivering a web-based VR and AR experience.
WebGL uses a canvas for rendering. In the “Players” section, we learned that players can use a canvas for image rendering. WebGL enhances the playback fluency and other capabilities of players.
WebAssembly (Wasm) is a new portable and web-compatible format that features small sizes and fast loading. Wasm is a new specification developed by the W3C community, which consists of mainstream web browser manufacturers.
For more information, visit https://webassembly.org/
Wasm allows players to be integrated with FFmpeg, enabling them to decode H.265 videos that web browsers cannot natively play.
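The loading mechanism such players rely on is the WebAssembly JavaScript API. The sketch below instantiates a module and calls an export; the byte array is a stand-in (a minimal hand-written module exporting `add(a, b)`), not a real decoder, and `decodeFrame` in the comment is a hypothetical export name.

```javascript
// Instantiate a Wasm module from raw bytes and call one of its exports.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic "\0asm" + version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: i32.add
]);
const { exports } = new WebAssembly.Instance(new WebAssembly.Module(wasmBytes));
// An FFmpeg-based player would instead call something like
// exports.decodeFrame(ptr, len) on a module compiled from C.
```

An FFmpeg decoder compiled to Wasm is loaded the same way, just with a much larger byte buffer (typically fetched from the network) and exports that accept pointers into the module's linear memory.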
Open-source Products and Frameworks in Use
The following sections introduce the most popular open-source products and frameworks at present.
flv.js is an HTML5 FLV player written in pure JavaScript and developed by Bilibili. It works by transmuxing FLV streams into fragmented MP4 segments and feeding them to the browser through the Media Source Extensions API.
Official GitHub Page: https://github.com/bilibili/flv.js
hls.js is a JavaScript library that plays streams in the HLS format on top of HTML5 video and the Media Source Extensions API. The HLS protocol, proposed by Apple Inc., is widely supported on mobile clients, so it is widely used in live streaming scenarios.
Official GitHub Page: https://github.com/video-dev/hls.js/
video.js is an HTML5-based player that plays back HTML5 and Flash video and offers more than 100 plug-ins. It supports the HLS and DASH formats and allows custom themes and subtitle extensions, making it applicable to many scenarios worldwide.
Official GitHub Page: https://github.com/videojs/video.js
FFmpeg is a leading multimedia framework and an open-source, cross-platform multimedia solution. FFmpeg provides a range of features, such as audio and video encoding, decoding, transcoding, multiplexing, demultiplexing, streaming media, filters, and playback.
FFmpeg is used by the following frontend components:
- Node Module Fluent-FFmpeg: This is a useful Node.js module that streamlines FFmpeg's complex command lines. Use this module to upload files and process video streams. For more information, visit the fluent-ffmpeg website.
Open Broadcaster Software (OBS) is flexible, open-source software for video recording and live streaming. Written in C and C++, OBS provides a range of features, such as real-time source and device capture, scene composition, encoding, recording, and broadcasting. OBS uses RTMP to transmit data to RTMP-enabled destinations, such as YouTube, Twitch.tv, Instagram, Facebook, and other streaming media websites.
OBS encodes video streams into the H.264/MPEG-4 AVC and H.265/HEVC formats using the x264 free software library, Intel Quick Sync Video, NVIDIA NVENC, and the AMD video encoding engine. OBS also encodes audio streams using MP3 and AAC encoders. If you are familiar with audio-visual encoding and decoding, you can use the codecs and containers provided by the libavcodec and libavformat libraries and output streams to custom FFmpeg URLs.
MLT is an open-source non-linear video editing engine that is applicable to various types of apps, including desktop apps and apps that run on Android and iOS.
Official GitHub Page: https://github.com/mltframework/mlt/