The New Architecture of Search Engines: Stories with SQL

Catch the replay of the Apsara Conference 2020 at this link!

By ELK Geek and Luo Tao, Senior Technical Expert of Alibaba Group

Search Engines with HA3 Architecture

  • The online part is a traditional two-layer service architecture consisting of QRS and search. QRS accepts and processes user requests, then sends them to the nodes in the search layer, which load indexes and perform retrieval. QRS then collects the results and returns them to users.
  • The offline part is divided into two stages. The first stage is data preprocessing. Its core work is to process business and algorithm data into a large, index-friendly wide table. The second stage is index construction. Its main challenges are supporting large-scale index updates and ensuring that indexes are updated in real time.
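
The request flow of the online part can be sketched as a scatter-gather pattern. This is only a minimal illustration, not HA3 code: the shard contents, the term-count scoring, and the names `SHARDS`, `search_node`, and `qrs` are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "index" shards: each search node holds part of the documents.
SHARDS = [
    {"doc1": "red running shoes", "doc2": "blue jeans"},
    {"doc3": "running jacket", "doc4": "red scarf"},
]


def search_node(shard, term):
    """Search-layer node: scan its own shard and return (score, doc_id) hits."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)  # toy relevance: term frequency
        if score > 0:
            hits.append((score, doc_id))
    return hits


def qrs(term, top_k=10):
    """QRS layer: fan the request out to every search node, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda shard: search_node(shard, term), SHARDS))
    merged = [hit for part in partials for hit in part]
    # Highest score first; ties are broken by doc_id for a deterministic order.
    ranked = sorted(merged, key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in ranked[:top_k]]


print(qrs("running"))
```

In a real deployment the search nodes would be separate processes that load prebuilt indexes, so the thread pool here only stands in for the network fan-out.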

2. The HA3 architecture has three main features:

  • The first feature is a high-performance service architecture.
  • The second feature is a rich set of indexing capabilities.
  • The third feature is the pyramid algorithm framework.

As Alibaba Group's business has grown over the past few years, this architecture, once viewed as an advantage, has gradually become a barrier to further development.

Core Challenges Faced by HA3 Architecture

1. The Extension of Deep Learning

The application scope of deep learning has extended from high-precision sorting in the early days to rough sorting and retrieval, such as vector index recall. The introduction of deep learning causes two problems. First, the network structures of deep learning models are usually complex and place high demands on the execution process and model size. As a result, the traditional pipeline work mode can no longer meet these demands. Second, real-time updates of models and feature data challenge the indexing capabilities, with tens of billions of updates online.

2. The Expansion of Data Dimensions

In the e-commerce field, the main data dimensions used to be buyers and sellers. Now, data on locations, distribution, stores, and fulfillment is also taken into consideration. Taking distribution as an example, there are distributions within 3 kilometers and 5 kilometers, as well as intra-city and cross-city distributions. In the offline workflow of the search engine, the data of each dimension is merged into one large wide table, which expands the data in the form of a Cartesian product. Therefore, it is difficult to meet the scale and timeliness requirements of updates in new scenarios.
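
A small calculation makes the Cartesian-product expansion concrete. The row counts below are hypothetical, and the sketch assumes the worst case in which every combination of dimensions is materialized in the wide table.

```python
from math import prod

# Hypothetical row counts per dimension, for illustration only.
DIMENSIONS = {"items": 1_000_000, "stores": 50, "delivery_ranges": 4}


def wide_table_rows(dims):
    """Rows in one wide table when every combination of dimensions is materialized."""
    return prod(dims.values())


def separate_table_rows(dims):
    """Total rows when each dimension is kept in its own table."""
    return sum(dims.values())


def wide_rows_touched_by_update(dims, changed):
    """Wide-table rows that must be rewritten when one row of `changed` is updated."""
    return prod(count for name, count in dims.items() if name != changed)


print(wide_table_rows(DIMENSIONS))                        # 200000000
print(separate_table_rows(DIMENSIONS))                    # 1000054
print(wide_rows_touched_by_update(DIMENSIONS, "stores"))  # 4000000
```

Under these assumptions, a single store update rewrites four million wide-table rows, whereas it touches one row if stores live in their own table.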

Solutions for Traditional Search Engines

Solutions for HA3 in Search Engines with SQL

  1. The original large wide table is split into multiple tables, each of which is loaded, updated, and switched independently. Offline join operations are converted into online join operations.
  2. The original pipeline work model is replaced by a Directed Acyclic Graph (DAG) work model. In addition, the search functions are divided into independent abstract operators, which are unified with the execution engine of deep learning.
  3. Users can easily express the DAG-based query process in SQL and reuse basic functions of the SQL ecosystem. For example, the personalized search technology of e-commerce puts products, personalized recommendations, and deep models into different tables with flexible index formats, such as inverted indexes, forward indexes, and KV indexes. In addition, the execution engine supports parallel execution, asynchronous execution, and compilation optimization. As a result, both memory and CPU resources can be used effectively to solve various business problems.
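
The move from an offline wide table to online joins can be sketched with Python's built-in `sqlite3` module. SQLite is only a stand-in here, not the HA3 SQL dialect, and the `items` and `stores` tables, their columns, and the `search` helper are assumptions made for the example.

```python
import sqlite3

# Two independent tables instead of one wide table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items  (item_id INTEGER PRIMARY KEY, title TEXT, store_id INTEGER);
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, city TEXT, delivery_km INTEGER);
    INSERT INTO items  VALUES (1, 'noodles', 10), (2, 'noodle soup', 11), (3, 'pizza', 10);
    INSERT INTO stores VALUES (10, 'Hangzhou', 3), (11, 'Hangzhou', 5);
""")


def search(keyword, max_km):
    """Join items with stores at query time rather than in an offline wide table."""
    return conn.execute(
        """
        SELECT i.title, s.delivery_km
        FROM items AS i
        JOIN stores AS s ON s.store_id = i.store_id
        WHERE i.title LIKE ? AND s.delivery_km <= ?
        ORDER BY i.item_id
        """,
        (f"%{keyword}%", max_km),
    ).fetchall()


print(search("noodle", 3))

# Updating one store touches a single row; no item rows are rewritten.
conn.execute("UPDATE stores SET delivery_km = 3 WHERE store_id = 11")
print(search("noodle", 3))
```

The design point is that each table can be updated on its own schedule, and the cost of combining them moves to query time.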

New HA3 Architecture

  • The bottom layer is the searchRuntime framework, whose core responsibilities are index management and service scheduling. Index management mainly covers loading policies and query interfaces, including support for the separation of computing and storage and for real-time indexing. Service scheduling mainly handles process failover and service updates through two-layer scheduling oriented to final states. Its main functions are restarting processes, updating programs, and providing a unified gray release.
  • The middle layer is the DAG engine layer, which has two core components: the execution engine and the operators. The execution engine is capable of graph execution within a single machine, distributed communication, and deep learning. By connecting operators, the query process of the search engine can be easily linked to deep learning, so deep learning can be applied at every search stage, such as vector search, rough sorting, and high-precision sorting. The abstraction of operators is the most important part of the architectural abstraction. The original process-oriented development is changed to the development of independent functions. To achieve this, the functions of operators must be as cohesive as possible. Management at the operator level also makes functions easier to reuse and release.
  • The top layer is the SQL query layer, which has two parts: SQL parsing and query optimization. Because the DAG process can be customized at will, the key issue is how to make it easier for users to construct graphs and coordinate operators. Simplicity and universality are two requirements that must be considered, and they are the reasons why SQL is the first choice. Another reason is that SQL executors in the industry usually have a logical optimization step and a physical optimization step. Both provide a good abstraction for the execution of complex DAGs. Many detailed optimizations, including graph transformation, operator merging, and compilation optimization, are also implemented through these two steps.
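
The idea of running independent operators as a DAG can be sketched with the standard-library `graphlib` module (Python 3.9 or later). This is a toy single-machine executor under stated assumptions, not the HA3 engine: the operator names, the sample rows, and the `run_dag` helper are all hypothetical.

```python
from graphlib import TopologicalSorter

# Each operator is an independent function that receives the outputs of its
# upstream operators as a dict keyed by node name.


def scan_items(inputs):
    return [
        {"item_id": 1, "store_id": 10, "score": 0.4},
        {"item_id": 2, "store_id": 11, "score": 0.9},
        {"item_id": 3, "store_id": 12, "score": 0.7},
    ]


def scan_stores(inputs):
    return [
        {"store_id": 10, "open": True},
        {"store_id": 11, "open": False},
        {"store_id": 12, "open": True},
    ]


def join_open_stores(inputs):
    open_ids = {s["store_id"] for s in inputs["scan_stores"] if s["open"]}
    return [i for i in inputs["scan_items"] if i["store_id"] in open_ids]


def rank(inputs):
    return sorted(inputs["join"], key=lambda row: row["score"], reverse=True)


# Node name -> (operator, names of upstream nodes).
GRAPH = {
    "scan_items": (scan_items, []),
    "scan_stores": (scan_stores, []),
    "join": (join_open_stores, ["scan_items", "scan_stores"]),
    "rank": (rank, ["join"]),
}


def run_dag(graph, output):
    """Run operators in dependency order and return the output of one node."""
    deps = {name: upstream for name, (_, upstream) in graph.items()}
    results = {}
    for name in TopologicalSorter(deps).static_order():
        op, upstream = graph[name]
        results[name] = op({u: results[u] for u in upstream})
    return results[output]


print([row["item_id"] for row in run_dag(GRAPH, "rank")])
```

Because the two scan operators do not depend on each other, an executor is free to run them in parallel, and a query optimizer could rewrite the graph (for example, by merging operators) before execution.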

Practices

1. Eleme

2. Taobao’s Services for Local Life

3. Search on DingTalk Disk

4. Internal Monitoring System
