The New Architecture of Search Engines: Stories with SQL
Catch the replay of the Apsara Conference 2020 at this link!
By ELK Geek and Luo Tao, Senior Technical Expert of Alibaba Group
Search Engines with HA3 Architecture
1. There are two parts of HA3 architecture:
- The online part is a traditional service architecture with two layers called QRS and search. QRS accepts user requests and processes them. After that, QRS sends these requests to nodes in the search layer, which load indexes and complete retrievals. Then, results are collected by QRS and returned to users.
- The offline part is divided into two stages. The first stage is data preprocessing. Its core work is to process business and algorithm data into a large, index-friendly wide table. The second stage is index construction. Its main challenges are to support large-scale index updates and to ensure that indexes are updated in real time.
2. There are three main features of HA3 architecture:
- The first feature is the high performance of service architecture.
- The second feature is the various capabilities of indexing.
- The third feature is the pyramid algorithm framework.
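The online request path described above (QRS fanning a request out to search nodes and merging their results) can be sketched as follows. This is a minimal illustration, not HA3 code: the `PARTITIONS` data, the `search_node` and `qrs` functions, and the precomputed relevance scores are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical index partitions: doc_id -> relevance score for one query.
# In a real engine each search node would load its own index shard.
PARTITIONS = [
    {"doc1": 0.9, "doc4": 0.2},
    {"doc2": 0.7, "doc5": 0.4},
    {"doc3": 0.8},
]

def search_node(partition, top_k):
    """Search layer: return the local top-k (doc_id, score) pairs of one partition."""
    return heapq.nlargest(top_k, partition.items(), key=lambda kv: kv[1])

def qrs(top_k):
    """QRS layer: send the request to every search node, then merge partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: search_node(p, top_k), PARTITIONS))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(top_k, merged, key=lambda kv: kv[1])

top_hits = qrs(top_k=3)
print(top_hits)
```

Each node only returns its local top-k, so the merge step in QRS handles at most `top_k` hits per partition regardless of index size.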
As the business of Alibaba Group has grown over the past few years, this architecture, once viewed as an advantage, has gradually turned into a barrier to further development.
Core Challenges Faced by HA3 Architecture
The main challenges are the extension of deep learning and the expansion of data dimensions.
1. The Extension of Deep Learning
The application scope of deep learning has extended from high-precision sorting in the early days to rough sorting and retrieval, such as vector index recall. The introduction of deep learning causes two problems. First, the network structures of deep learning models are usually complex and place high demands on the execution process and model size. As a result, the traditional pipeline work mode can no longer meet these demands. Second, models and feature data must be updated in real time, which challenges indexing capabilities with tens of billions of online updates.
2. The Expansion of Data Dimensions
In the e-commerce field, the main data dimensions used to be buyers and sellers. Now, data on locations, distribution, stores, and fulfillment is also taken into consideration. Take distribution as an example: there are distributions within 3 kilometers and 5 kilometers, as well as intra-city and cross-city distributions. In the offline workflow of the search engine, the data of each dimension is converted into a large wide table, which results in data expansion in the form of a Cartesian product. Therefore, it is difficult to meet the scale and timeliness requirements of updates in new scenarios.
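The expansion described above can be made concrete with a small count. The sketch below assumes the worst case, in which every combination of dimensions appears in the wide table; the dimension values are made up for illustration.

```python
from itertools import product

# Hypothetical dimension tables.
products = ["noodles", "rice"]
stores = ["store_a", "store_b", "store_c"]
delivery_ranges = ["3km", "5km", "intra-city", "cross-city"]

# Wide table: one row per combination of dimensions (Cartesian product).
wide_table = list(product(products, stores, delivery_ranges))
wide_rows = len(wide_table)

# Separate tables: one row per entity in each dimension.
split_rows = len(products) + len(stores) + len(delivery_ranges)

# A status change in one store rewrites every wide-table row that embeds it,
# but only a single row in a dedicated store table.
rows_touched_wide = sum(1 for row in wide_table if row[1] == "store_a")
rows_touched_split = 1

print(wide_rows, split_rows, rows_touched_wide, rows_touched_split)
```

Row counts multiply in the wide table and add in the split layout, which is why each new dimension makes offline updates harder to keep timely.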
Solutions for Traditional Search Engines
The traditional solution splits the engine into different instances according to the dimensions of the business data. The business layer then obtains results by querying the different engine instances. For example, the search engine of Eleme has data in the store and product dimensions. To reduce the impact of real-time changes in store status on the index, two search engine instances can be deployed: one to search for appropriate stores and the other to search for appropriate products. The business side queries the store engine first and the product engine second. However, this solution has an obvious disadvantage. When many stores match the user's intention, the store data must be serialized in the store engine and sent to the product engine by the business side. Serialization is very costly, so the number of stores sent from the store engine usually has to be limited. Yet stores beyond this limit may well match the user's intention better, which hurts business performance. This is a big loss for both users and sellers, especially in popular business areas.
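The truncation problem can be illustrated with a toy version of the two-engine flow. Everything here is hypothetical: `STORE_LIMIT`, the relevance numbers, and the two query functions only stand in for the store engine and product engine.

```python
STORE_LIMIT = 2  # hypothetical cap on stores passed between the two engines

# (store_id, store_relevance)
stores = [("s1", 0.9), ("s2", 0.8), ("s3", 0.7)]
# (product_id, store_id, product_relevance)
products = [("p1", "s1", 0.3), ("p2", "s2", 0.4), ("p3", "s3", 0.95)]

def query_store_engine(limit):
    """First query: return the ids of the top `limit` stores."""
    ranked = sorted(stores, key=lambda s: s[1], reverse=True)
    return [store_id for store_id, _ in ranked[:limit]]

def query_product_engine(store_ids):
    """Second query: return the best product among the given stores."""
    allowed = set(store_ids)
    hits = [p for p in products if p[1] in allowed]
    return max(hits, key=lambda p: p[2])

best_with_limit = query_product_engine(query_store_engine(STORE_LIMIT))
best_without_limit = query_product_engine(query_store_engine(len(stores)))
print(best_with_limit, best_without_limit)
```

With the cap in place, store `s3` is dropped before the product engine sees it, so the best-matching product `p3` can never be returned.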
Solutions for HA3 in Search Engines with SQL
The following three key points describe how to reshape the search process using SQL:
- The original large wide table is split into multiple tables, and each table is independent in index loading, updating, and switching. Offline join operations are converted into online join operations.
- The original pipeline work model is replaced and the Directed Acyclic Graph (DAG) work model is adopted. In addition, the search functions are divided into independent abstract operators, which are unified with the execution engine of deep learning.
- Users can easily express the DAG-based query process in SQL and reuse some basic functions of the SQL ecosystem. For example, the personalized search technology of e-commerce puts products, personalized recommendations, and deep models into different tables with flexible index formats, such as inverted index, forward index, and KV index. In addition, the execution engine supports parallel execution, asynchronous execution, and compilation optimization. As a result, both memory and CPU resources can be used effectively to solve various business problems.
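The idea of keeping dimensions in separate tables and joining them at query time can be sketched with Python's built-in `sqlite3` module. SQLite is used here only as a stand-in to show the query shape; it is not HA3, and the table names, columns, and data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, title TEXT, category TEXT);
    CREATE TABLE user_pref (user_id INTEGER, category TEXT, weight REAL);
    INSERT INTO product VALUES
        (1, 'running shoes', 'sports'),
        (2, 'leather shoes', 'formal'),
        (3, 'shoe rack', 'home');
    INSERT INTO user_pref VALUES (42, 'sports', 0.9), (42, 'formal', 0.2);
""")

# A preference update touches only the small user_pref table;
# the product table is not rebuilt.
conn.execute(
    "UPDATE user_pref SET weight = 0.95 WHERE user_id = 42 AND category = 'formal'"
)

# The join that a wide table would precompute offline now runs at query time.
rows = conn.execute(
    """
    SELECT p.title, u.weight
    FROM product AS p
    JOIN user_pref AS u ON u.category = p.category
    WHERE u.user_id = ? AND p.title LIKE '%shoes%'
    ORDER BY u.weight DESC
    """,
    (42,),
).fetchall()
print(rows)
```

Because the update and the query operate on different tables, the new preference weight is visible to the very next query without reprocessing product data.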
New HA3 Architecture
The new architecture is divided into three layers:
- The bottom layer is the searchRuntime framework, whose core responsibilities are index management and service scheduling. Index management mainly covers index loading policies and query interfaces, such as support for the separation of computing and storage and for real-time indexing. Service scheduling mainly handles process failover and service updates through two-layer scheduling oriented to final states. Its main functions are restarting processes, updating programs, and performing unified gray releases.
- The middle layer is the DAG engine layer, which has two core components: the execution engine and the operators. The execution engine is capable of graph execution within a single machine, distributed communication, and deep learning. Through the interconnection of operators, the query process of the search engine can be easily connected to deep learning, so deep learning can reach every search stage, such as vector search, rough sorting, and high-precision sorting. Operator abstraction is the most important part of the architectural abstraction. The original process-oriented development is changed to the development of independent functions. This requires the functions of operators to be as cohesive as possible. Managing functions at the operator level is also more conducive to their reuse and release.
- The top layer is the SQL query layer, which has two parts: SQL parsing and query optimization. Because the DAG process can be customized at will, the key issue is how to make it easier for users to construct graphs and coordinate operators. Simplicity and universality are the two features that must be considered, and they are the reasons why SQL is the first choice. Another reason is that SQL executors in the industry usually have a logical optimization step and a physical optimization step. Both provide a good abstraction for the execution of complex DAGs. Many detailed optimizations, including graph transformation, operator merging, and compilation optimization, are also implemented through these two steps.
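To illustrate how a DAG of independent operators differs from a fixed pipeline, the sketch below runs a tiny operator graph in topological order using the standard library's `graphlib` (Python 3.9+). The operators, the `SCORES` table standing in for a model, and the `run_dag` helper are all hypothetical; a real execution engine would also handle parallelism and distribution.

```python
from graphlib import TopologicalSorter

SCORES = {1: 0.2, 2: 0.9, 3: 0.5}  # hypothetical model scores per document id

def keyword_recall(_inputs):
    return {1, 2}  # documents matched through an inverted index

def vector_recall(_inputs):
    return {2, 3}  # documents matched through a vector index

def merge(inputs):
    return inputs[0] | inputs[1]  # union of both recall branches

def rank(inputs):
    return sorted(inputs[0], key=lambda doc: SCORES[doc], reverse=True)

# operator name -> (function, names of upstream operators)
OPERATORS = {
    "keyword_recall": (keyword_recall, []),
    "vector_recall": (vector_recall, []),
    "merge": (merge, ["keyword_recall", "vector_recall"]),
    "rank": (rank, ["merge"]),
}

def run_dag(operators, sink):
    """Run every operator after its upstream operators and return the sink's output."""
    graph = {name: set(deps) for name, (_, deps) in operators.items()}
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = operators[name]
        outputs[name] = func([outputs[dep] for dep in deps])
    return outputs[sink]

result = run_dag(OPERATORS, "rank")
print(result)
```

The two recall operators have no dependency on each other, so an engine is free to run them in parallel or replace either one without touching the rest of the graph.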
Application Scenarios
1. Takeaway Search at Eleme
In a takeaway search scenario, imagine that a user enters the keyword "beef noodles" in the search box. The search engine works in three steps in the background. First, it finds stores that are currently open and sell beef noodles. Then, it selects the stores with the best-matching product listings. Lastly, it returns the stores and products that best meet the user's needs. In this case, data on store business status, distribution capacity, and product inventory needs to be updated in real time. Matching queries against massive data also requires various indexing technologies, such as spatial indexing, inverted indexing, and vector indexing. The sorting of stores and products relies on deep models as well, since user preferences, discount information, and distance are all important. In the past, Elasticsearch was used to query tables in the store dimension and the product dimension, but it had problems with the limit on query results and with introducing deep learning. These problems can be solved easily within the HA3 architecture. After the migration to HA3, the long-tail problem of the service disappeared, and performance improved substantially. Moreover, the HA3 architecture leaves room for subsequent iterations of algorithms. The performance improvements mainly come from the index structure and query optimization.
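The three steps of the takeaway query can be sketched in a few lines. This is only an illustration under simplifying assumptions: distances and text-match scores are precomputed, `MAX_DISTANCE_KM` and the distance penalty are invented values, and the linear `score` function stands in for the deep models mentioned above.

```python
MAX_DISTANCE_KM = 3.0     # hypothetical delivery radius
DISTANCE_PENALTY = 0.1    # hypothetical weight; a real system would use a model

STORES = [
    {"id": "s1", "is_open": True, "distance_km": 1.0},
    {"id": "s2", "is_open": False, "distance_km": 0.5},
    {"id": "s3", "is_open": True, "distance_km": 2.5},
    {"id": "s4", "is_open": True, "distance_km": 8.0},
]
# "match" is a precomputed text-match score for the query "beef noodles".
PRODUCTS = [
    {"store": "s1", "title": "beef noodles", "match": 0.7},
    {"store": "s2", "title": "beef noodles", "match": 0.9},
    {"store": "s3", "title": "spicy beef noodles", "match": 0.9},
    {"store": "s3", "title": "beef rice", "match": 0.3},
    {"store": "s4", "title": "beef noodles", "match": 0.95},
]

def search():
    # Step 1: keep stores that are open and within delivery range.
    candidates = {
        s["id"]: s
        for s in STORES
        if s["is_open"] and s["distance_km"] <= MAX_DISTANCE_KM
    }
    # Step 2: keep the best-matching product of each candidate store.
    best = {}
    for p in PRODUCTS:
        store = p["store"]
        if store in candidates and (store not in best or p["match"] > best[store]["match"]):
            best[store] = p
    # Step 3: rank store/product pairs by match score minus a distance penalty.
    def score(p):
        return p["match"] - DISTANCE_PENALTY * candidates[p["store"]]["distance_km"]
    ranked = sorted(best.values(), key=score, reverse=True)
    return [(p["store"], p["title"]) for p in ranked]

results = search()
print(results)
```

Store `s2` is closed and `s4` is out of range, so neither appears even though both have high text-match scores.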
2. Taobao’s Services for Local Life
Taobao’s core demand is to introduce local services, such as Tmall Supermarket and Freshippo’s one-hour delivery, into its search business. By splitting the data into store and product dimensions, the update capability has been greatly improved, and many search functions of Eleme are reused.
3. Search on DingTalk Disk
The permission management of documents on DingTalk Disk is difficult to support with a traditional search engine because the data in the document and permission dimensions is large in scale and updated frequently. This problem is solved through the real-time, low-latency local join operation of HA3’s SQL.
4. Internal Monitoring System
This system was previously built on Druid but can no longer meet the demands of the business scale, and errors frequently have to be resolved manually. Therefore, time series indexes are added on top of HA3, and the parallel operation capability of SQL is also applied. As a result, latency has decreased and stability has improved substantially.