Interview with iDST Deputy Managing Director Hua Xiansheng: City Brain — Comprehensive Urban Cognition
Editor’s Note: From October 11 to 14, 2017, The Computing Conference will be held once again in Hangzhou’s Yunqi township. As one of the world’s most influential technology expos, the conference will include brilliant lectures by many Alibaba Group experts and industry leaders. Starting today, the Yunqi Community will interview a series of conference guests.
The first guest we interviewed was Alibaba iDST Deputy Managing Director Hua Xiansheng. During the October Computing Conference, he will discuss the latest trends in the computer vision field and the latest progress of the City Brain.
Hua Xiansheng is a leading international expert in the field of visual recognition and search, and has previously served as the program committee chair for the ACM Multimedia Conference and other organizations. Dr. Hua is also a Thousand Talents Program expert, IEEE Fellow, ACM Distinguished Scientist, and MIT TR35 Young Innovator Award recipient.
In 2015, Dr. Hua left the Microsoft Research Institute for Alibaba. In the search business department, he was responsible for optimizing image-based product search technology and his team developed Pailitao, the image search function for the Taobao app. In April 2016, Dr. Hua also joined Alibaba’s artificial intelligence research institute iDST, where he directed the research work of the visual computing team. At present, the City Brain project is one of the projects under his charge.
At the Conference on Computer Vision and Pattern Recognition (CVPR 2017) held at the end of July, Dr. Hua, as the director of iDST’s visual computing team, delivered a keynote speech titled “Practices of Large-Scale Target Re-Identification”, which brought up the City Brain project.
Finding Value in Heterogeneous City Data
The City Brain project was publicly announced at the 2016 Hangzhou Computing Conference. Wang Jian, then chairman of the Alibaba Group’s technical committee, introduced City Brain with the following words: “City Brain has Alibaba Cloud’s ET artificial intelligence technology at its core to perform a comprehensive real-time analysis across the city, automate public resource allocation, and fix problems as they arise during city operation. City Brain will evolve into a super artificial intelligence for city governance.”
Today, one year has passed, but City Brain remains a mysterious project to outsiders. If you want to use a plain and dated term to define it, you could call it a smart city. However, City Brain is actually far more advanced than what we usually refer to as a smart city.
In the words of Dr. Hua, at its core, City Brain uses big data and big computing to mine valuable information from large volumes of heterogeneous city data.
What is heterogeneous city data? It has two main features:
First, city data is a combination of visual data, public transport data, GPS data, and other heterogeneous data. Naturally, visual data makes up the largest and most important part of it. Second, city data volumes are huge. For example, a city may have hundreds of thousands of cameras, which produce massive amounts of data around the clock. Therefore, the inherent advantage of city data is its massive volume. The mission of City Brain is to find a way to extract valuable information from the data.
According to Dr. Hua, “in the past, the value of this data was not fully explored and the deployment and O&M costs for this many devices were very high. However, the value of such data goes far beyond traditional applications such as license plate identification and traffic fines.”
City Brain is creating cities with data intelligence. By providing comprehensive, real-time, and complete awareness, it can recognize vehicle shapes, models, trajectories, and speeds, or perceive pedestrians and cyclists. On such a basis, the project can improve decision-making, make forecasts, and intervene. At present, the value of city data is gradually becoming more apparent.
Dr. Hua used traffic conditions as an example: When an emergency arises, City Brain can immediately find the relevant data, such as suspect vehicles, cars involved in accidents, and even criminal suspects. After analyzing the relevant data, it can also optimize traffic for the entire city. Going one step further, City Brain can even predict such a situation before it happens. For instance, it can tell you where traffic jams will occur in the next 10 minutes. City Brain is also capable of making predictions much earlier and deploying police and medical resources in advance. It can even prevent traffic accidents by instituting preemptive traffic control and policing.
Dr. Hua added that the comprehensive perception of city data is possible due to two main technologies. First, improved computing power, such as cloud computing, GPUs, and FPGAs, allows us to compute massive volumes of data. For example, we can simultaneously process video feeds from thousands, tens of thousands, or even more roads in real time. Second, deep learning algorithms are critical to the progress in the field of computer vision.
Dr. Hua’s team has already made many algorithmic breakthroughs. On the server end, they are using more optimized algorithms for vehicle detection and license plate recognition with greater precision. At the same time, they can monitor accidents in real time and predict traffic conditions. City Brain has been deployed and used in the Hangzhou and Xiaoshan metropolitan areas for quite some time.
“We can perform large-scale video processing, but efficiency and stability both pose major challenges. Over the better part of this year, as a result of ongoing iteration and optimization efforts in the project, its overall processing speed has increased by a factor of 20.”
From Perception to Search
Without a doubt, computer vision is both the most important and the most challenging aspect of the City Brain project. Dr. Hua stated that visual data is the core of heterogeneous city data. It is more comprehensive than other data. Therefore, the City Brain project invests the most time and energy in visual technology.
“From the coverage perspective, GPS data prevails over visual data, because GPS data is essentially cross-section data. However, visual data is more comprehensive and can give us complete details of what is happening at any given intersection.”
However, besides the fundamental aspects of visual perception and recognition, City Brain must also deal with issues related to the structure of visual data, such as search.
Just like Taobao’s image search feature, City Brain must index images in real time. One of the major breakthroughs of this project is indexing and searching visual data feeds from cameras across a city.
According to Dr. Hua, from the technical perspective, the overall approach to city image search is similar to Taobao’s image search feature. First you need to know where your target is and detect it. Then, you need to identify the vehicle, person, or other moving target and the target’s properties. Finally, you need to extract a feature, a high-dimensional vector representing the essential characteristics of this target.
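The final step, comparing high-dimensional feature vectors, can be illustrated with a minimal sketch. It is not City Brain's implementation: the FeatureIndex class, the camera/target identifiers, and the 4-dimensional vectors are all hypothetical, and the detection and feature-extraction steps (which would be done by a trained model) are assumed to have already produced the vectors. Cosine similarity with a brute-force scan is used here only as one common way to compare such features; a system indexing city-wide camera feeds would presumably need an approximate nearest-neighbor index instead.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class FeatureIndex:
    """Hypothetical brute-force index over target feature vectors."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    def add(self, target_id, feature):
        # Store the normalized feature alongside its identifier.
        self._ids.append(target_id)
        self._vectors.append(l2_normalize(feature))

    def search(self, query, top_k=3):
        # Score every stored vector against the query and keep the best top_k.
        q = l2_normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), target_id)
            for target_id, vec in zip(self._ids, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(target_id, score) for score, target_id in scored[:top_k]]


# Made-up features: the same vehicle seen by two cameras, plus a different one.
index = FeatureIndex()
index.add("cam12/vehicle_a", [0.90, 0.10, 0.00, 0.20])
index.add("cam07/vehicle_b", [0.10, 0.80, 0.30, 0.00])
index.add("cam31/vehicle_a", [0.85, 0.15, 0.05, 0.25])

# Feature of a newly detected target to look up.
query = [0.88, 0.12, 0.02, 0.22]
results = index.search(query, top_k=2)
for target_id, score in results:
    print(f"{target_id}: similarity {score:.3f}")
```

In this toy data, the two sightings of vehicle_a score far higher against the query than vehicle_b does, which is the behavior a re-identification search relies on.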
However, city image searches are much more complex than product searches. As far as the customer is concerned, different instances of the same product are essentially identical. However, cars of the same model owned by different people cannot be considered identical. In addition, human feature description and search is another major challenge. If a person’s facial image is not clear, this issue becomes even trickier. These are the real challenges that need to be overcome in actual applications.
Of course, the iDST visual team is already at the forefront of the industry. Their results on open test sets have already greatly exceeded the best publicly available results.
Commercialization of AI
With artificial intelligence development in full swing, the past few years have seen the emergence of many AI startups, both in China and abroad. Successful commercialization is the best standard for measuring the strength of these companies.
Dr. Hua believes that successful AI commercialization must meet five criteria:
First, competent algorithms serve as a foundation.
Second, related data must be available.
Third, there must be a large enough user base.
Fourth, there also needs to be a platform with powerful computing capabilities and a sound system architecture (of course, cloud computing has already lowered the barrier to entry for many startups).
Fifth, there must be a good business model.
At present, most artificial intelligence companies focus on visual applications. It would be no exaggeration to say that the field of computer vision is already a “red ocean”. It is undeniable that, among the numerous artificial intelligence technologies, computer vision is being commercialized the fastest. Dr. Hua predicts that there will be five main visual application trends in the future:
The first is transportation security, which is also a main focus of City Brain.
Then, there is rich media, the use of visual methods to find valuable information in large volumes of video or image data.
The third trend will be medical imaging. Although adoption of such technologies in the medical community may take longer, they will certainly be an important area in the future.
The fourth trend is industrial vision. In the future, cameras will be able to replace manual visual inspections and judgements in most scenarios. This is a field to be further explored in the future.
In addition, the field of terminal-based visual intelligence is quite promising, including chips and some visual-based applications.
It is not hard to see that the fields described above are exactly the R&D focuses of Alibaba Cloud’s City Brain, Medical Brain, and Industrial Brain. However, the differences between the different fields are also quite obvious. During the interview, Dr. Hua repeatedly stressed the importance of in-depth study of each industry. Artificial intelligence is gradually penetrating into different industries and sectors. However, to realize the full potential of this technology, in addition to laying the foundation with data and algorithms, in-depth research into specific application scenarios is also of critical importance.
Below we have attached the transcript of our interview with Dr. Hua:
Yunqi: What are the limitations of deep learning when applied to computer vision? In the future, will it be superseded by new technologies?
Dr. Hua: In fact, there are many limitations. Deep learning looks wonderful, but there are still many issues that need to be addressed. For example, facial recognition works great on a small scale, and its results are passable when dealing with thousands of individuals. However, any further expansion of the scale is very difficult to achieve. Also, video quality, resolution, and obstructions all limit the effectiveness of recognition. In these aspects, machines still cannot compete with humans. Deep learning is highly reliant on data. Deep learning applications using small data need to be further explored.
In recent years, deep learning has been gaining momentum. However, in the future, there will surely be new technologies to challenge its position.
Yunqi: One of our papers, entitled “Video to Shop: Matching Clothes in Videos to Online Shopping Images”, was included in last month’s CVPR. Can you talk about the innovative ideas behind this application?
Dr. Hua: This application uses cutting-edge clothing detection and tracking technologies. To address the multiple angle, multiple scenario, and obstruction challenges in detection of the clothing worn by celebrities, we came up with a Reconfigurable Deep Tree structure. It relies on similarity matching between multiple frames to deal with obstructions, fuzziness, and other problems in individual frames. This structure can be considered an extension of the existing attention model and can be used to solve the problem of multi-model fusion.
Yunqi: In your opinion, what future changes can be predicted in the computer vision field?
Dr. Hua: It depends on which level you want to talk about. If we are talking about technology, I think the evolution of deep learning itself will be an important change. For example, GANs may be used in more scenarios. Large-scale video mining will be another important direction. From a higher level, if we look at the field from the perspective of intelligent applications, I think that more in-depth research into specific industries will truly jump-start commercialization of artificial intelligence, or the so-called visual intelligence. Then this technology will realize its true impact and potential. Practice and exploration in this area will in turn promote the further development of visual technologies. Only by putting this technology into practice can we discover what challenges remain to be addressed. After all, the real-world competition can be very cruel.
Yunqi: What do you plan to share with attendees during this Computing Conference? Can you give us a preview of the topics you will discuss and tell us why you chose them?
Dr. Hua: I will introduce some of the applications of visual technology in various fields and the challenges they face, with a special focus on the technologies and applications in the City Brain project. Our previous discussions only touched upon the City Brain project. This time, I want to take a deeper dive. For example, I want to discuss the technical details of City Brain and how we can manifest its value.