By AliENT and Danjiang
Did you know that when you watch Street Dance of China Season 3 on Youku, you can find your idol easily with the Free Viewpoint Video (FVV) feature?
Speaking of classic movies: in The Matrix (1999), do you still remember the scene where Neo dodges bullets?
This simulated variable-speed effect was achieved with specialized photography and became known as "bullet time".
The production of this kind of scene was relatively complicated: a row of cameras was used for filming, a picture was taken with each camera, and the pictures were superposed to generate a video. In essence, "bullet time" is a continuous series of images from different angles played in rapid succession to create the illusion of time slowing down.
Over the years, “bullet time” has undergone many iterations, and leaps from “fixed frame viewing” to “video viewing” have been achieved. With the popularization and implementation of 5G, the interaction technology of free viewpoint has gradually become popular.
What Is “Free Viewpoint”?
We can start with six degrees of freedom (6DoF), a common concept in VR. An object moving in space has six degrees of freedom, which fall into two types. The first type is the translational degrees of freedom: front and back, left and right, and up and down. The second type is the rotational degrees of freedom: nodding, shaking the head, and tilting the head (pitch, yaw, and roll).
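To make the concept concrete, a 6DoF pose is commonly represented as three translation components plus three rotation angles. The sketch below is illustrative only; the class and field names are assumptions, not part of Youku's implementation:

```python
import math
from dataclasses import dataclass

@dataclass
class Pose6DoF:
    """A viewer pose with six degrees of freedom."""
    # Translational degrees of freedom (metres)
    x: float = 0.0      # left / right
    y: float = 0.0      # up / down
    z: float = 0.0      # front / back
    # Rotational degrees of freedom (radians)
    pitch: float = 0.0  # nodding (around x)
    yaw: float = 0.0    # shaking the head (around y)
    roll: float = 0.0   # tilting the head (around z)

    def translate(self, dx: float, dy: float, dz: float) -> "Pose6DoF":
        return Pose6DoF(self.x + dx, self.y + dy, self.z + dz,
                        self.pitch, self.yaw, self.roll)

    def rotate(self, dpitch: float, dyaw: float, droll: float) -> "Pose6DoF":
        # Keep each angle in (-pi, pi] for a canonical representation.
        wrap = lambda a: math.atan2(math.sin(a), math.cos(a))
        return Pose6DoF(self.x, self.y, self.z,
                        wrap(self.pitch + dpitch),
                        wrap(self.yaw + dyaw),
                        wrap(self.roll + droll))

# A viewer steps 0.5 m forward and turns the head 90 degrees to the left.
pose = Pose6DoF().translate(0, 0, 0.5).rotate(0, math.pi / 2, 0)
```

The translation methods cover "the feet can move" and the rotation methods cover "the head can turn"; a 6DoF video player updates such a pose from the user's gestures and re-renders the scene from it.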
Strictly speaking, bullet time is still 2D video, with the effect relying heavily on post-production. In a 6DoF video, users can drag with their fingers to select the viewing angle and position. Although the cameras are arranged along a line, the viewpoint can move up and down and forward and backward without being tied to any original camera position, for example, to capture a close-up of a character or a panoramic view of the scene.
In 2019, for the first time, Youku applied 6DoF video technology to the live broadcast of Chinese sporting events, such as the CBA opening game. Youku froze the pictures and presented the athletes’ relative positions and actions from multiple perspectives, generating a more on-the-spot viewing experience.
Compared with traditional video interaction, 6DoF videos have more advantages. Firstly, "the feet can move": users can move the viewing position virtually. Secondly, "the hands can move": users can change the video content through certain gesture operations.
Now, watch the GIF again. When you touch the screen, the whole scene freezes, giving you a more refined viewing experience.
In this year’s variety show Street Dance of China Season 3, the Youku app also introduced a new interaction feature called FVV. Users can slide freely on the screen with their fingers to watch more details of the contestant’s performance from different angles.
What is the conceptual difference between free viewpoint technology and 6DoF video? In short, a 6DoF video is an "inside-out" video that is viewed in a user-centered manner and offers both translational and rotational degrees of freedom. Free viewpoint technology can be understood as an "outside-in" viewing method, which at first glance feels more like controlling the camera in a 3D game. This makes free viewpoint technology well suited to scenarios such as variety shows, sports, films, and television, where it can create a free and immersive three-dimensional interactive experience.
After this highly interactive feature was launched, Alibaba MoKu Lab was the first to introduce a complete set of solutions underpinning free viewpoint technology. The production pipeline covers software and hardware manufacturing, cloud-based 3D reconstruction, video compression and transmission, client-side viewpoint reconstruction, and video standard construction.
Street Dance of China Season 3 is the first show in China to employ free viewpoint technology. Judging from the on-site photos, more than 40 special cameras were installed around the stage to capture pictures; the cameras record every action and synchronize in milliseconds. Free viewpoint technology can show the street dancers from all directions. At the same time, it sets a higher bar for the dancers, because everyone's facial expressions and movements are always being watched.
It is worth mentioning that this technology will also be applied to the test competition of the Beijing Winter Olympics early next year.
According to the Ali Entertainment MoKu Lab, the layout of the free viewpoint technology is mainly divided into the following parts: content filming, production, and user device interaction.
On-Site Filming: Software and Hardware Solutions
For free viewpoint technology, filming is the part that is most difficult to standardize. As the first step of the technical procedure, filming plays a crucial role in content presentation and subsequent algorithm optimization. Compared with single-camera filming, the on-site control of camera arrays is far more demanding.
To this end, Ali Entertainment MoKu Lab has designed a set of on-site software and hardware solutions that provide high stability and ease of use. Unlike the traditional method of dynamically cutting streams from a dense camera array, this technology can present a good multi-view effect in the user's application with sparse cameras, making it the most cost-effective solution for filming in a single scene. It also addresses problems at the transmission layer, such as the limited interaction range caused by the large amount of data a dense camera array produces.
Currently, the filming system's camera array can synchronize videos recorded by over 50 cameras within milliseconds. All cameras in the array can have parameters set and effects verified through unified remote control, which greatly accelerates on-site deployment and reduces debugging time.
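The article does not describe the synchronization mechanism itself, but one common approach is to timestamp every frame against a shared clock and then, for each master tick, align the closest frame from every camera within a millisecond-level tolerance. A simplified sketch of such alignment (hypothetical, not MoKu Lab's actual protocol):

```python
def align_frames(frames_by_camera, tick_ms, tolerance_ms=2.0):
    """For one master tick, pick from each camera the frame whose
    timestamp is closest to the tick; reject the tick if any camera's
    best frame falls outside the tolerance.

    frames_by_camera: dict mapping camera id -> sorted list of frame
    timestamps in milliseconds (on the shared clock).
    """
    selected = {}
    for cam, timestamps in frames_by_camera.items():
        best = min(timestamps, key=lambda t: abs(t - tick_ms))
        if abs(best - tick_ms) > tolerance_ms:
            return None  # this tick cannot be covered by all cameras
        selected[cam] = best
    return selected

# Three cameras at ~30 fps whose shutters drift slightly apart.
frames = {
    "cam01": [0.0, 33.3, 66.7, 100.0],
    "cam02": [0.4, 33.8, 67.1, 100.3],
    "cam03": [1.1, 34.1, 67.5, 100.9],
}
aligned = align_frames(frames, tick_ms=66.7)
```

A real deployment would also need hardware genlock or PTP-style clock distribution so that the timestamps themselves are trustworthy; the alignment step above only works once the clocks agree.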
In addition to the hardware, Ali Entertainment MoKu Lab also developed a software system with a complete graphical interface to manage on-site solutions. The stability of the system has reached the level of commercial applications: it has run stably in more than 70 CBA games and Youku's self-produced variety shows. With this system, non-specialist technicians can control and debug the complex camera array system on-site.
The hardware and software solution also includes a set of on-site software and communication protocols for cloud computing. The on-site software pulls streams in real time from the cameras, and the resulting videos and images can be uploaded directly to the cloud, where 3D video effects are produced and verified. On-site personnel can see the effect immediately and make timely adjustments, closing the quality-management loop with high availability.
3D Reconstruction System on the Cloud
To produce high-quality interactive 3D videos, the 3D reconstruction algorithms and large-scale production systems that meet the business requirements for timeliness are crucial.
In cultural and entertainment applications, the difficulty in implementing and commercializing 3D reconstruction technology lies in weighing the choice of algorithm together with the end-to-end implementation path. For example, when selecting a three-dimensional representation, there are two main schemes: point cloud and depth. Point clouds involve large volumes of data and lack mature codec standards and hardware decoding support. In addition, point clouds cannot model scenes such as variety shows well, where faithful restoration of lighting matters more. Therefore, Ali Entertainment MoKu Lab adopted depth-based three-dimensional representations in its technical procedure.
The depth-based route has its own difficulties. Variety show scenes are very complex: each has different dancers, lighting designs, and dance movements. To make the algorithm robust across scenarios, end-to-end systematic considerations must be introduced.
From the perspective of algorithm processing, 3D reconstruction relies on images collected by an on-site camera array, so many problems must be considered to meet the quality requirements of reconstruction. How is the camera array deployed on-site? How can we design camera density? How can we decide the distance between the camera and the scene? What is the relationship between the height and angle of the camera and the stage? How can we reduce light interference during collection? These pre-considerations will affect the quality of the reconstruction algorithm. We need to explore a set of optimal solutions in practice.
After obtaining the collected images, the algorithm also needs to be optimized in all aspects. Small objects, complex occlusion areas, illumination changes, motion blur from fast movement, and temporal stability are all major challenges in 3D reconstruction. The reconstruction algorithm in this scheme combines traditional matching algorithms, image segmentation, cross-correction of multi-view 3D information, analysis of stable reconstruction regions, multi-resolution reconstruction fusion, deep learning, and other strategies, which greatly mitigates these problems.
Beyond the reconstruction algorithm itself, the bandwidth of compressed transmission and the performance of client-side rendering must also be considered. Through extensive experimental analysis, Alibaba Entertainment MoKu Lab adopts a downsampling policy for depth information and a customized encoding policy for depth maps to compress the information to a level that current user bandwidths can support, achieving a good display effect on user devices. Patents have been filed for all of these technologies.
Continuous business practice has proven that the reconstruction algorithm is now commercially viable in terms of performance, production timeliness, and stability. Ali Entertainment MoKu Lab tailors it to different business scenarios, including sports such as the CBA and Youku variety shows such as Street Dance of China and Dunk of China, constantly polishing and customizing algorithm policies to achieve optimal results in each scenario. At the same time, to control the end-to-end algorithm effect of this complex system, the entire user-device interaction procedure is simulated on the cloud, and an algorithm simulation and verification platform covering the complete procedure has been built. Through this platform, the effects seen by users can be fully verified by cloud-side simulation, guaranteeing the image quality end users see.
The 3D reconstruction algorithm requires a lot of computing power. Therefore, Ali Entertainment MoKu Lab has deployed more than 30 GPU computing clusters on the cloud to reconstruct and produce videos for variety shows and sports scenes with high concurrency. In a typical time-critical application such as a CBA basketball game, 3D video effects need to be released immediately after a goal. Currently, the cloud-based concurrent processing system can produce in near real time with a delay of about 10 seconds, which meets the broadcasting requirements of sports scenes. The effects have been applied to live broadcasts of CBA games on many occasions.
To solve the problem of reconstruction stability in the time-space domain of interactive 3D video and pursue the optimal algorithm, Ali Entertainment MoKu Lab has also incorporated more complex time-space stability enhancement strategies. The team is currently experimenting with deep learning models to obtain more stable reconstruction results in the time-space domain and integrating them with traditional algorithms, and it is also exploring algorithms with better time-space stability constraints.
Video Compression and Transmission
For the compressed transmission of free viewpoint videos, the main difficulty lies in ensuring the reconstruction of high-quality images on devices while taking into account the decoding capability and transmission bandwidth of current devices. This is a technical issue that requires in-depth optimization. Ali Entertainment MoKu Lab has adopted a 3D scene representation based on splicing texture and depth to fit existing video compression standards. Given the particularities of depth map compression, in-depth customization and optimization were carried out.
There are two difficulties in the compression of depth maps. First, depth maps carry a lot of data, and a reconstructed depth map has the same resolution as its texture map, so it is necessary to consider how to reduce the depth map's resolution without causing obvious loss in viewpoint reconstruction on devices. Second, depth maps are sensitive to compression loss. Depth values generally change dramatically, especially at the edges of objects, and general-purpose compression parameters are prone to quantization loss at those edges, which seriously affects the image quality of viewpoint reconstruction on devices.
To solve the first problem, Ali Entertainment MoKu Lab has proposed an algorithm that downsamples depth maps on the cloud and upsamples them on the terminal. The resolution of a depth map can be reduced to as little as 1/16 of the texture map before cloud-side compression; the terminal then upsamples it back to the texture map's resolution before viewpoint reconstruction. Continuous algorithm optimization has kept the quality of the reconstructed image free of obvious loss. To solve the second problem, Ali Entertainment MoKu Lab proposed ROI-based coding of depth map regions, which effectively controls quantization loss in the coding process without significantly increasing the bitrate.
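The downsampling idea behind the first solution can be illustrated in a few lines. This sketch reduces a depth map by 4x in each dimension (keeping 1/16 of the samples, matching the ratio mentioned above) and restores the resolution with nearest-neighbor upsampling on the "terminal" side; the real pipeline would of course use much more careful, edge-aware filtering:

```python
def downsample(depth, factor=4):
    """Keep every `factor`-th sample in both dimensions (cloud side)."""
    return [row[::factor] for row in depth[::factor]]

def upsample_nearest(depth, factor=4):
    """Nearest-neighbor upsampling back to full resolution (terminal side)."""
    out = []
    for row in depth:
        expanded = [v for v in row for _ in range(factor)]
        out.extend([expanded] * factor)
    return out

# An 8x8 synthetic depth map: a near object (depth 1.0) fills the
# top-left quadrant on a far background (depth 5.0).
full = [[1.0 if r < 4 and c < 4 else 5.0 for c in range(8)]
        for r in range(8)]
small = downsample(full, factor=4)            # 2x2 -> 1/16 of the samples
restored = upsample_nearest(small, factor=4)  # back to 8x8
```

On this toy map, whose depth is constant inside each 4x4 block, the round trip is lossless; on real footage the error concentrates exactly at object edges, which is why the edge-focused ROI coding of the second solution complements the downsampling of the first.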
In terms of video transmission, Ali Entertainment MoKu Lab has developed a set of cloud-device integrated video transmission protocols driven by the flexibility and universality the business requires. The protocols can be implemented on both the cloud and devices, and they support different numbers and layouts of on-site cameras, different terminal interaction range designs, and interactive 3D videos at different resolutions. This proprietary transmission standard ensures that, once the protocol is parsed, videos can be played and interacted with.
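The exact wire format is proprietary, but the kind of metadata such a protocol must carry can be sketched as a small binary header that the client parses before playback. Every field name and the layout below are hypothetical, chosen only to mirror the capabilities listed above (camera count and layout, interaction range, resolution):

```python
import struct

# Hypothetical header: magic, version, camera count, layout id,
# interaction range in degrees, video width and height.
HEADER_FMT = ">4sBBBHHH"  # big-endian, fixed-size fields

def pack_header(cameras, layout, range_deg, width, height):
    return struct.pack(HEADER_FMT, b"FVV0", 1, cameras, layout,
                       range_deg, width, height)

def parse_header(data):
    size = struct.calcsize(HEADER_FMT)
    magic, version, cameras, layout, range_deg, w, h = struct.unpack(
        HEADER_FMT, data[:size])
    if magic != b"FVV0":
        raise ValueError("not a free-viewpoint stream")
    return {"version": version, "cameras": cameras, "layout": layout,
            "range_deg": range_deg, "resolution": (w, h)}

# A stage ringed by 40 cameras, offering a 150-degree interaction range.
header = pack_header(cameras=40, layout=2, range_deg=150,
                     width=1920, height=1080)
info = parse_header(header)
```

Putting this negotiation in a compact, versioned header is what lets the same client code handle shows with different camera counts and interaction ranges without redeployment.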
Realizing “Freedom of Viewpoint”: Born for 5G
Currently, Youku provides two different experiences for high-end and low-end mobile phones. Some viewers of Street Dance of China may find that Youku's FVV feature only supports a 70-degree interaction range. This is because the technology still places certain demands on phone performance and network conditions.
Take viewpoint reconstruction on clients as an example: every viewpoint reconstruction triggered by user interaction must be completed on the terminal, so the timeliness and power consumption of the algorithms are important factors. Generally speaking, even on low-end mobile phones, interaction must remain smooth and heating must be minimized.
Current work shows that, after deep optimization of the mobile viewpoint reconstruction algorithm, the current version can cover the mainstream phone models on the market. However, to experience the 150-degree interaction range, you may need a 5G mobile phone.
With the wide application of 5G, free viewpoint technology will enter more video programs and play a greater role in how people watch and interact with films and TV series. Financial freedom may be difficult to achieve, but "freedom of viewpoint" is just around the corner. In the near future, everyone will be able to stay at home yet be immersed in virtual scenes from their favorite films and TV series.