Alibaba Cloud Makes Efficient Genetic Detection of Pathogens Possible


By Hanchao from ApsaraDB


Figure 1. Pathogen detection process

In bioanalytical testing, a pathogen detection run typically generates about 500 million 75-bp gene fragments. After some sample sequences are filtered out, about 100 million gene fragments remain to be queried. The nBlast[1] tool is normally used to compare the sequence information. This step is very time-consuming: it takes about 2 to 3 hours to complete. Alibaba Cloud AnalyticDB Vector Edition provides an efficient gene sequence retrieval system that completes the entire pathogen query and detection process in tens of minutes, which substantially improves the performance of gene analysis.

Applications of Gene Retrieval

Gene Retrieval Functions

Figure 2. Demo of nucleic acid sequence query

As shown in Figure 3, the user entered a sequence segment of SARS-CoV-2 in the search box. As you can see, the SARS-CoV-2 sequence was listed first among all returned sequences. AnalyticDB currently provides an efficient method for vector similarity search, allowing the system to retrieve similar gene fragments in milliseconds.
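To make the idea of vector similarity search concrete, here is a minimal sketch in Python. It assumes a simple k-mer count embedding and a brute-force cosine similarity scan; both are illustrative assumptions, since this article does not describe how AnalyticDB encodes sequences or indexes vectors. The function names `kmer_vector`, `cosine`, and `top_matches` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k=3):
    """Count overlapping k-mers in a DNA sequence (an assumed, simplistic embedding)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(count * b[kmer] for kmer, count in a.items() if kmer in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(query, database, k=3, top=2):
    """Return the names of the `top` database sequences most similar to the query.

    Brute-force scan for illustration only; a production system would use
    an approximate nearest-neighbor index instead.
    """
    query_vec = kmer_vector(query, k)
    scored = [(cosine(query_vec, kmer_vector(seq, k)), name)
              for name, seq in database.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top]]


if __name__ == "__main__":
    db = {
        "a": "ACGTACGTACGT",
        "b": "TTTTTTTTTTTT",
        "c": "ACGTACGAACGT",
    }
    print(top_matches("ACGTACGT", db))
```

A scan like this is linear in the database size, so it only illustrates the ranking step; the millisecond-level latency described above depends on the database's own vector index.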

Figure 3. Retrieval results

Sequence-specific Detection of Different Strains

Figure 4. Matching genomes

The result showed that the system identified and returned three matching genomes, with accession numbers NC_045512.2, NC_019843.3, and NC_038294.1. NC_045512.2 (65%) is the genome of SARS-CoV-2 (formerly called "Wuhan seafood market pneumonia virus"), NC_019843.3 (20%) is the genome of MERS-CoV, and NC_038294.1 (13%) is the genome of betacoronavirus England 1, which is an isolate of MERS-CoV[8]. According to the analysis result in the preceding figure, the mixture contained SARS-CoV-2 and MERS-CoV.

Overall Application Architecture Design

Figure 5. Gene retrieval framework

Training and Query Models

Genetic Query Process

Figure 6. Convert DNA sequences into vectors

Precision Evaluation

Experiment 1 (see Table 1): One 75-bp fragment was randomly selected from the database. Because we knew which genome and which fragment of that genome it was extracted from, we searched for this 75-bp fragment in the database to check whether the correct genome fragment appeared among the top n results. The resulting values (in percent) are shown in Table 1. The precision of the top n results, or Precision(n), was calculated from the following formula.
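Written out from the symbols defined in this section, and assuming the indicator is averaged over all u queries, the formula takes this form:

$$
\mathrm{Precision}(n) = \frac{1}{u} \sum_{i=1}^{u} \left| s_i \in \mathrm{Top}(i, n) \right|
$$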

Here, n indicates the length of the list returned by the query, and u indicates the number of queries. In this experiment, u is 1,000. The expression |si ∈ Top(i,n)| indicates whether sequence si is among the top-n items returned for query i. It takes one of two possible values: "1" for yes and "0" for no. The higher the precision at a small value of n, the better the retrieval performance of the system. As Table 1 shows, the precision values of the top-20 ranked result sets were over 99% for both models, which qualifies our gene sequence retrieval system for identifying fragmented genome sequences.
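The definition above can be sketched in a few lines of Python. This is an illustrative helper, not the evaluation code used in the experiment; the name `precision_at_n` and the input layout (one ranked result list per query, plus the known source of each query) are assumptions.

```python
def precision_at_n(ranked_results, expected, n):
    """Fraction of queries whose known source appears in the top-n results.

    ranked_results: one ranked list of returned identifiers per query.
    expected: the identifier each query fragment was actually taken from.
    """
    if len(ranked_results) != len(expected):
        raise ValueError("need exactly one expected identifier per query")
    u = len(expected)
    if u == 0:
        raise ValueError("at least one query is required")
    # Each query contributes 1 if its source is in the top n, otherwise 0.
    hits = sum(1 for ranked, truth in zip(ranked_results, expected)
               if truth in ranked[:n])
    return hits / u


if __name__ == "__main__":
    results = [
        ["g1", "g2", "g3"],
        ["g2", "g1", "g3"],
        ["g3", "g2", "g1"],
        ["g2", "g3", "g4"],
    ]
    truth = ["g1", "g1", "g1", "g1"]
    for n in (1, 2, 3):
        print(n, precision_at_n(results, truth, n))
```

Because each query contributes at most 1, the value can never decrease as n grows, which is why a system that reaches high precision at small n is the stronger one.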

Table 1. Precision values from genetic sequence detection

Experiment 2 (see Table 2): One 75-bp fragment was randomly selected from the database, and random mutations were introduced into 2% of its sequence. This rate is higher than what is typically seen in nature, where mutation rates are quite low. For example, each newborn baby carries roughly 30 mutations among the 3 billion base pairs of the human genome, and although RNA viruses have relatively high mutation rates, these are generally lower than 1%. We then searched for the mutated fragment in the database to check whether the correct genome fragment appeared among the top n results. As the values in Table 2 show, although precision decreased after the mutation, the precision of the top-20 ranked result sets still reached 0.99.
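As a rough illustration of how such a test fragment could be produced, the following Python sketch applies random substitutions to a sequence. The article does not say how the mutations were generated, so the details here are assumptions: only substitutions are used (no insertions or deletions), and the number of mutated positions is the rate times the length, rounded up. The function name `mutate` is hypothetical.

```python
import math
import random

BASES = "ACGT"


def mutate(seq, rate=0.02, seed=None):
    """Return a copy of seq with random substitutions at distinct positions.

    Assumption: the number of mutated positions is ceil(len(seq) * rate),
    so a 75-bp fragment at a 2% rate gets 2 substitutions.
    """
    rng = random.Random(seed)
    n_mut = math.ceil(len(seq) * rate)
    positions = rng.sample(range(len(seq)), n_mut)
    out = list(seq)
    for pos in positions:
        # Always pick a base that differs from the original one.
        out[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(out)


if __name__ == "__main__":
    fragment = "ACGT" * 18 + "ACG"  # 75 bp
    mutated = mutate(fragment, rate=0.02, seed=1)
    changed = sum(1 for x, y in zip(fragment, mutated) if x != y)
    print(len(mutated), changed)
```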

Table 2. Precision values from mutated sequence detection

Experiment 3 (see Table 3): We compared the retrieval speed of BLAST with that of the algorithm used in our system. A total of 9.7 GB of viral sequences, fungal genome sequences, and some plant genome sequences[7] were downloaded and imported into both the AnalyticDB database and the BLAST database. We performed 100 different queries and averaged the results. On average, BLAST took 3.22 seconds to return results for a query. Our algorithm took 0.257 seconds to complete a query (including the conversion of genes into vectors, coarse-grained ranking of vectors, and fine-grained ranking with the Needleman-Wunsch algorithm) with a precision of 0.95 among the top 30 results, which is about 12.5 times faster than BLAST.
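The fine-grained ranking step mentioned above relies on the Needleman-Wunsch global alignment algorithm. The Python sketch below computes the alignment score with the standard dynamic program and uses it to reorder a short candidate list. The scoring values (match +1, mismatch -1, gap -1) and the helper names `needleman_wunsch_score` and `rerank` are assumptions for illustration; the article does not state the scoring scheme used in the system.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (scoring values are assumed)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def rerank(query, candidates, top=30):
    """Order coarse-ranked candidates by alignment score, best first."""
    ranked = sorted(candidates,
                    key=lambda seq: needleman_wunsch_score(query, seq),
                    reverse=True)
    return ranked[:top]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATTACA"))
    print(rerank("GATTACA", ["GATTTCA", "GATTACA", "CCCCCCC"], top=2))
```

The alignment costs time proportional to the product of the two sequence lengths, so it is practical only on a short candidate list, which is consistent with running it after the coarse-grained vector ranking.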

Table 3. Retrieval time

