Cloud-based mNGS Analysis: Virus Sequence Comparison in 60 Seconds

By Eric Lee, nicknamed Zhuang Huai, from Alibaba Container Service.

Metagenomic next-generation sequencing (mNGS) for pan-pathogen detection, applied to upper or lower respiratory tract samples within weeks of onset, has contributed to the early discovery and accurate sequencing of SARS-CoV-2, the virus that causes COVID-19. Alibaba Cloud Genomics Service (AGS) lets researchers quickly align mNGS metagenomic sequencing data. Using only an Alibaba Cloud Object Storage Service (OSS) bucket and the AGS command line tool, you can process 3.2 Gbase (22 million reads) of metagenomic data from alveolar sample sequencing and compare it with known pathogen genomes, including SARS-CoV-2 and 39 BetaCoV reference sequences, all within 60 seconds. You can also upload custom virus libraries for comparison.

For COVID-19, nucleic acid detection uses a multiplex fluorescence RT-PCR kit, which primarily detects the ORF1ab, N, and E gene targets of SARS-CoV-2. Antibody detection looks for the IgM and IgG antibodies that the body's immune system produces when exposed to the virus. If such antibodies are detected, this indicates a current or past infection. In addition, mNGS provides metagenomic data, obtained through deep sequencing of diseased tissue, to detect and investigate various pathogens.

Although many RT-PCR kits are available for COVID-19, they can produce a large number of false negatives, depending on the concentration of the virus in the sample and the quality of the kit. As a result, doctors and patients often need to repeat the test many times, extending the time spent waiting for results.

The advantage of mNGS is that a single run can check for all known pathogens. This avoids the difficulties that repeated sampling brings for both doctors and patients, as well as the large number of samples that PCR testing requires. Because mNGS analysis is based on nucleic acid sequence alignment, once the genome of a pathogen is known, researchers only need to update the reference database to efficiently and accurately detect subsequent cases.

AGS lets researchers quickly align mNGS metagenomic sequencing data: 3.2 Gbase (22 million reads) of metagenomic data from alveolar sample sequencing can be compared with known pathogen genomes, including SARS-CoV-2 and 39 BetaCoV reference sequences, within 60 seconds, and custom virus libraries can be uploaded for comparison. Using only an OSS bucket and the AGS command line, the Chinese Center for Disease Control and Prevention, hospitals, and labs can complete the entire comparison process and produce high-quality read-mapping data and preliminary quality reports. This provides fast and accurate data support for the detection of various pathogens and for further protein and variation research on SARS-CoV-2.

Working with the community to combat the epidemic, AGS has opened up its mNGS RNA comparison computing capabilities to gene sequencing vendors, the Chinese Center for Disease Control and Prevention, hospitals, schools, and pharmaceutical companies.

Preparation

1. Download and install the AGS command line interface.

2. Download and install ossutil.

3. Prepare an Alibaba Cloud account and an OSS bucket to store mNGS sequencing data, such as oss://my-test-shenzhen.

4. Configure bucket access permissions for AGS. For example: ags config oss my-test-shenzhen.

5. Upload mNGS data to the bucket.

For example:
ossutil cp ICU6G_S2_L001_R1_001.fastq.gz oss://my-test-shenzhen/cov2-samples/
ossutil cp ICU6G_S2_L001_R2_001.fastq.gz oss://my-test-shenzhen/cov2-samples/

6. Run the comparison task to compare the mNGS data with known RNA sequences and sequence databases. Repeat steps 5 and 6 to compare different samples.

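Usage:
ags remote run rna-mapping \ # <rna-mapping>: RNA sequence alignment task
--region cn-shenzhen \ # <cn-shenzhen|cn-beijing|...>: region ID; Shenzhen and Beijing are currently supported
--bucket my-test-shenzhen \ # <bucket_name>: name of the OSS bucket
--fastq1 cov2-samples/ICU6G_S2_L001_R1_001.fastq.gz \ # relative path of the paired-end sequencing data file fq1
--fastq2 cov2-samples/ICU6G_S2_L001_R2_001.fastq.gz \ # relative path of the paired-end sequencing data file fq2
--output-bam bam/ICU6G_S2.bam \ # output path of the BAM alignment result; the report is written to the same location, with a .txt suffix
--reference [sars-cov-2 | betacov-ncbi-39 | <path of RNA library reference in specified bucket >] # reference sequence: presets are sars-cov-2 (the novel coronavirus) and the 39 currently known betacoronaviruses; a custom virus sequence library can also be specified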
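When many samples need to be compared, it can help to assemble the command from parameters instead of typing it each time. The following is a minimal Python sketch, not part of AGS itself: the helper name build_rna_mapping_command is hypothetical, and it only uses the flags listed in the usage text above.

```python
import shlex


def build_rna_mapping_command(region, bucket, fastq1, fastq2, output_bam, reference):
    """Return the argument list for an `ags remote run rna-mapping` call.

    Hypothetical helper: it only arranges the flags documented in the
    usage text and does not validate them against the AGS service.
    """
    return [
        "ags", "remote", "run", "rna-mapping",
        "--region", region,
        "--bucket", bucket,
        "--fastq1", fastq1,
        "--fastq2", fastq2,
        "--output-bam", output_bam,
        "--reference", reference,
    ]


cmd = build_rna_mapping_command(
    region="cn-shenzhen",
    bucket="my-test-shenzhen",
    fastq1="cov2-samples/ICU6G_S2_L001_R1_001.fastq.gz",
    fastq2="cov2-samples/ICU6G_S2_L001_R2_001.fastq.gz",
    output_bam="bam/ICU6G_S2.bam",
    reference="sars-cov-2",
)

# Print the command for review; nothing is executed here.
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run(cmd, check=True) once the AGS command line tool is installed and configured.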

COVID-19 Comparison

1. Submit a Comparison Task to Compare the Similarity Between the ICU6G_S2_L001 Sequence Sample and SARS-CoV-2.

ags remote run rna-mapping \
--region cn-shenzhen \
--fastq1 cov2-samples/ICU6G_S2_L001_R1_001.fastq.gz \
--fastq2 cov2-samples/ICU6G_S2_L001_R2_001.fastq.gz \
--bucket my-test-shenzhen \
--output-bam bam/ICU6G_S2.bam \
--reference sars-cov-2
INFO[0002] {"JobName":"rna-mapping-gpu-2ms6w"}
INFO[0002] Job submit succeed

2. Check the Comparison Task and Comparison Results.

In this comparison task, 10 million reads (1.4 Gbase) were aligned against the SARS-CoV-2 reference sequence MN908947.3 in 43 seconds. The alignment produced 3,629 high-quality mapped reads and 480 matched reads in the orf1ab feature range, 404 of which have an alignment score (AS) greater than 120. This indicates that SARS-CoV-2 RNA sequences can be accurately detected in the sequencing data of this sample.

High Quality Mapped Reads is: 3629
Matched reads in orf1ab range is: 480
Matched reads in orf1ab range with alignment score (AS) is greater than 120: 404
feature sequence of ICU6G_S2_L001 is similar to SARS-CoV-2 with very high mappQ and AS reads: True
ags remote get rna-mapping-gpu-2ms6w --show
+-----------------------+------------------+-----------+-------------------------------+----------+-------------------------------+-------------+-------------+
| JOB NAME | JOB NAMESPACE | STATUS | CREATE TIME | DURATION | FINISH TIME | TOTAL READS | TOTAL BASES |
+-----------------------+------------------+-----------+-------------------------------+----------+-------------------------------+-------------+-------------+
| rna-mapping-gpu-2ms6w | XXXXXXXXXXXX | Succeeded | 2020-03-04 16:40:30 +0800 CST | 43s | 2020-03-04 16:41:13 +0800 CST | 10369818 | 1456539874 |
+-----------------------+------------------+-----------+-------------------------------+----------+-------------------------------+-------------+-------------+
+---------------------------------+--------------------------------------------+
| JOB DETAIL | |
+---------------------------------+--------------------------------------------+
| rna_matached_reads | 480 |
| rna_is_sars_cov2 | True |
| rna_mapping_oss_region | cn-shenzhen |
| rna_mapping_fastq_second_name | cov2-samples/ICU6G_S2_L001_R2_001.fastq.gz |
| rna_mapping_no_unmapped | |
| rna_mapping_service | s |
| rna_matached_reads_alignment | 404 |
| rna_high_quality_mapped | 3629 |
| rna_mapping_fastq_first_name | cov2-samples/ICU6G_S2_L001_R1_001.fastq.gz |
| rna_mapping_mark_dup | |
| rna_mapping_reference_file_name | sars-cov-2 |
| rna_cov_detail_file | bam/ICU6G_S2.bam.cov.txt |
| rna_mapping_bam_file_name | bam/ICU6G_S2.bam |
| rna_mapping_bucket_name | my-test-shenzhen |
+---------------------------------+--------------------------------------------+
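To use these values in a script, the JOB DETAIL table can be turned into a dictionary. The sketch below assumes the two-column layout printed above and works on captured text; the function name parse_job_detail is hypothetical, and the format of the table may differ in other AGS versions.

```python
def parse_job_detail(table_text):
    """Parse `| key | value |` rows of the JOB DETAIL table into a dict.

    Border lines (+---+), the header row, and rows that do not have
    exactly two cells (such as the wider job status table) are skipped.
    """
    detail = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0] == "JOB DETAIL":
            continue
        detail[cells[0]] = cells[1]
    return detail


# A few rows copied from the output above.
sample_table = """
+---------------------------------+------------------+
| JOB DETAIL | |
+---------------------------------+------------------+
| rna_matached_reads | 480 |
| rna_is_sars_cov2 | True |
| rna_mapping_no_unmapped | |
| rna_matached_reads_alignment | 404 |
| rna_high_quality_mapped | 3629 |
+---------------------------------+------------------+
"""

detail = parse_job_detail(sample_table)
print(detail["rna_is_sars_cov2"], int(detail["rna_high_quality_mapped"]))
```

All values are kept as strings, so numeric fields such as rna_matached_reads need an explicit int() conversion.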

3. Download the Comparison Data BAM File and Report.

ossutil ls oss://my-test-shenzhen/bam/ICU6G_S2.bam
LastModifiedTime Size(B) StorageClass ETAG ObjectName
2020-03-04 16:41:11 +0800 CST 356320 Standard 9596D012A30438A0073A2A0B38F5D578 oss://my-test-shenzhen/bam/ICU6G_S2.bam
2020-03-04 16:41:11 +0800 CST 2889 Standard 63175E7180D110BA9D3BAB34F4313C59 oss://my-test-shenzhen/bam/ICU6G_S2.bam.cov.txt
2020-03-04 16:41:11 +0800 CST 396 Standard 940D51FF7ECFF60B5E5A41D1F635180D oss://my-test-shenzhen/bam/ICU6G_S2.bam.summary.json
ossutil cp oss://my-test-shenzhen/bam/ICU6G_S2.bam.summary.json .
ossutil cp oss://my-test-shenzhen/bam/ICU6G_S2.bam.cov.txt .
ossutil cp oss://my-test-shenzhen/bam/ICU6G_S2.bam .

Below is an example of the report when SARS-CoV-2 RNA is detected.

cat bam/ICU6G_S2.bam.cov.txt
Summary:
High Quality Mapped Reads is: 3629
Matched reads in orf1ab range is: 480
Matched reads in orf1ab range with alignment score (AS) is greater than 120: 404
/data/cov2-samples_ICU6G_S2_L001_R1_001.fastq.gz-output/ICU6G_S2.bam is similar to SARS-CoV-2 with very high mappQ and AS reads: True
21571 21581 21591 21601 21611 21621 21631
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
ATGTT GTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGT CAATTACCCCCTGC
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGA CCCCCTGC
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
ATGTTTGTTTTTCTTGTTTTATTGCCA agtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
atgtttgtttttcttgttttattgcca AGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
TGTTTGTTTTTCTTGTTTT CACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
tgtttgtttttcttgtttt
TTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGC
gtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgc
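The summary lines at the top of the report can be extracted with a few regular expressions. This is a minimal sketch that assumes the report wording shown above; the function name parse_cov_report and the dictionary keys are hypothetical names chosen for illustration.

```python
import re

# One pattern per counter line, anchored on the wording of the report.
COUNT_PATTERNS = {
    "high_quality_mapped": re.compile(r"^High Quality Mapped Reads is:\s*(\d+)"),
    "orf1ab_matched": re.compile(r"^Matched reads in orf1ab range is:\s*(\d+)"),
    "orf1ab_matched_as_gt_120": re.compile(
        r"^Matched reads in orf1ab range with alignment score \(AS\) "
        r"is greater than 120:\s*(\d+)"
    ),
}
VERDICT_PATTERN = re.compile(r"is similar to SARS-CoV-2 .*:\s*(True|False)\s*$")


def parse_cov_report(text):
    """Extract the read counters and the True/False verdict from a report."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        for key, pattern in COUNT_PATTERNS.items():
            match = pattern.match(line)
            if match:
                result[key] = int(match.group(1))
        verdict = VERDICT_PATTERN.search(line)
        if verdict:
            result["similar_to_sars_cov2"] = verdict.group(1) == "True"
    return result


# Summary lines copied from the report above.
report_text = """Summary:
High Quality Mapped Reads is: 3629
Matched reads in orf1ab range is: 480
Matched reads in orf1ab range with alignment score (AS) is greater than 120: 404
/data/cov2-samples_ICU6G_S2_L001_R1_001.fastq.gz-output/ICU6G_S2.bam is similar to SARS-CoV-2 with very high mappQ and AS reads: True
"""

report = parse_cov_report(report_text)
high_score_share = report["orf1ab_matched_as_gt_120"] / report["orf1ab_matched"]
print(report, f"{high_score_share:.1%}")
```

For this sample, 404 of the 480 reads in the orf1ab range pass the alignment score threshold.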

Further Analysis of the Comparison Data

You can use samtools stats, plot-bamstats, and similar tools on the BAM output to further analyze coverage and depth. You can then use the BAM data for protein composition and variation analysis.
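As a sketch of one way to post-process this output, the snippet below reads the summary numbers (the tab-separated lines prefixed with SN) from text produced by samtools stats. The function name parse_samtools_stats_summary is hypothetical, and the input is a toy example with made-up values, not the statistics of the sample in this article.

```python
def parse_samtools_stats_summary(stats_text):
    """Collect the `SN` (summary numbers) lines of `samtools stats` output.

    Each such line has the form: SN<TAB>name:<TAB>value[<TAB># comment].
    Values are converted to int or float where possible.
    """
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        key = fields[1].rstrip(":")
        raw = fields[2].strip()
        try:
            value = int(raw)
        except ValueError:
            try:
                value = float(raw)
            except ValueError:
                value = raw
        summary[key] = value
    return summary


# Toy input with made-up values, for illustration only.
toy_stats = (
    "# toy input, not real output\n"
    "SN\traw total sequences:\t1000\n"
    "SN\treads mapped:\t250\n"
    "SN\terror rate:\t1.5e-03\t# mismatches / bases mapped (cigar)\n"
    "COV\t[1-1]\t1\t42\n"
)

stats = parse_samtools_stats_summary(toy_stats)
mapped_share = stats["reads mapped"] / stats["raw total sequences"]
print(stats, mapped_share)
```

The input text would typically come from running samtools stats on the downloaded BAM file and redirecting the output to a file.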

Figure: Example stats

Figure: Coverage analysis

Comparison with 39 Known Beta Coronaviruses

ags remote run rna-mapping \
--region cn-shenzhen \
--fastq1 cov2-samples/ICU6G_S2_L001_R1_001.fastq.gz \
--fastq2 cov2-samples/ICU6G_S2_L001_R2_001.fastq.gz \
--bucket my-test-shenzhen \
--output-bam bam/ICU6G_S2_virus.bam \
--reference betacov-ncbi-39
INFO[0011] {"JobName":"rna-mapping-gpu-6mpcc"}
INFO[0011] Job submit succeed
ags remote get rna-mapping-gpu-6mpcc --show
+-----------------------+------------------+-----------+-------------------------------+----------+-------------------------------+-------------+-------------+
| JOB NAME | JOB NAMESPACE | STATUS | CREATE TIME | DURATION | FINISH TIME | TOTAL READS | TOTAL BASES |
+-----------------------+------------------+-----------+-------------------------------+----------+-------------------------------+-------------+-------------+
| rna-mapping-gpu-6mpcc | XXXXXXXXX | Succeeded | 2020-03-04 17:36:21 +0800 CST | 40s | 2020-03-04 17:37:01 +0800 CST | 10369818 | 1456539874 |
+-----------------------+------------------+-----------+-------------------------------+----------+-------------------------------+-------------+-------------+
# 2014 mapped reads detected, but no mapped reads found in range
+---------------------------------+--------------------------------------------+
| JOB DETAIL | |
+---------------------------------+--------------------------------------------+
| rna_mapping_reference_file_name | betacov-ncbi-39 |
| rna_matached_reads_alignment | 0 |
| rna_mapping_bam_file_name | bam/ICU6G_S2_virus.bam |
| rna_mapping_fastq_first_name | cov2-samples/ICU6G_S2_L001_R1_001.fastq.gz |
| rna_mapping_oss_region | cn-shenzhen |
| rna_cov_detail_file | bam/ICU6G_S2_virus.bam.cov.txt |
| rna_mapping_no_unmapped | |
| rna_matached_reads | 0 |
| rna_mapping_mark_dup | |
| rna_mapping_service | s |
| rna_high_quality_mapped | 2014 |
| rna_mapping_bucket_name | my-test-shenzhen |
| rna_mapping_fastq_second_name | cov2-samples/ICU6G_S2_L001_R2_001.fastq.gz |
| rna_is_sars_cov2 | False |
+---------------------------------+--------------------------------------------+

Using Custom Virus Databases for Comparison

1. Download Reference Sequences from NCBI GenBank and Merge Them into a Multi-contig Reference Sequence. For example, search for all nucleotide reference sequences containing betacov and download them.
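If the sequences are downloaded as separate FASTA files, merging them amounts to concatenating the records while keeping each sequence identifier unique. The following is a minimal sketch under that assumption; the function name merge_fasta and the toy records are hypothetical, and AGS may impose further requirements on the reference that are not covered here.

```python
def merge_fasta(texts):
    """Concatenate FASTA texts into one multi-contig FASTA string.

    Blank lines are dropped and a ValueError is raised if the same
    sequence identifier (first word after '>') appears twice.
    """
    seen_ids = set()
    merged_lines = []
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                fields = line[1:].split()
                if not fields:
                    raise ValueError("FASTA header without an identifier")
                seq_id = fields[0]
                if seq_id in seen_ids:
                    raise ValueError(f"duplicate sequence identifier: {seq_id}")
                seen_ids.add(seq_id)
            merged_lines.append(line)
    return "\n".join(merged_lines) + "\n"


# Toy records for illustration; real input would be the GenBank downloads.
first = ">seqA toy record\nACGT\nACGT\n"
second = ">seqB\nGGCC"  # no trailing newline, handled by the merge

merged = merge_fasta([first, second])
print(merged, end="")
```

Writing the returned string to a file such as betacov-ncbi-test.fa gives a single reference that can be uploaded in the next steps.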

2. Rename the Downloaded sequence.fa File to betacov-ncbi-test.fa.

3. Upload the reference to the OSS bucket.

ossutil cp betacov-ncbi-test.fa oss://my-test-shenzhen/ref/

4. Submit the Comparison Task and Specify the Reference Path.

ags remote run rna-mapping \
--region cn-shenzhen \
--fastq1 cov2-samples/ICU6G_S2_L001_R1_001.fastq.gz \
--fastq2 cov2-samples/ICU6G_S2_L001_R2_001.fastq.gz \
--bucket my-test-shenzhen \
--output-bam bam/ICU6G_S2_virus.bam \
--reference ref/betacov-ncbi-test.fa
INFO[0002] {"JobName":"rna-mapping-gpu-69mwb"}
INFO[0002] Job submit succeed

5. View the Comparison Report and Obtain the Matched Data.

ags remote get rna-mapping-gpu-69mwb --show
+-----------------------+------------------+-----------+-------------------------------+----------+-------------------------------+-------------+-------------+
| JOB NAME | JOB NAMESPACE | STATUS | CREATE TIME | DURATION | FINISH TIME | TOTAL READS | TOTAL BASES |
+-----------------------+------------------+-----------+-------------------------------+----------+-------------------------------+-------------+-------------+
| rna-mapping-gpu-69mwb | 1365606736606053 | Succeeded | 2020-03-04 17:47:00 +0800 CST | 40s | 2020-03-04 17:47:40 +0800 CST | 10369818 | 1456539874 |
+-----------------------+------------------+-----------+-------------------------------+----------+-------------------------------+-------------+-------------+
+---------------------------------+--------------------------------------------+
| JOB DETAIL | |
+---------------------------------+--------------------------------------------+
| rna_mapping_fastq_first_name | cov2-samples/ICU6G_S2_L001_R1_001.fastq.gz |
| rna_mapping_fastq_second_name | cov2-samples/ICU6G_S2_L001_R2_001.fastq.gz |
| rna_mapping_mark_dup | |
| rna_mapping_oss_region | cn-shenzhen |
| rna_cov_detail_file | bam/ICU6G_S2_virus.bam.cov.txt |
| rna_is_sars_cov2 | False |
| rna_mapping_bam_file_name | bam/ICU6G_S2_virus.bam |
| rna_mapping_service | s |
| rna_matached_reads_alignment | 0 |
| rna_high_quality_mapped | 2014 |
| rna_mapping_bucket_name | my-test-shenzhen |
| rna_mapping_no_unmapped | |
| rna_mapping_reference_file_name | ref/betacov-ncbi-test.fa |
| rna_matached_reads | 0 |
+---------------------------------+--------------------------------------------+

6. Download the Matched Data for Further Analysis.

ossutil ls oss://my-test-shenzhen/bam/ICU6G_S2_virus.bam
LastModifiedTime Size(B) StorageClass ETAG ObjectName
2020-03-04 17:47:38 +0800 CST 753458 Standard DF7B1A6CA5AF5DE6BF4FFDBB6DEF71C3 oss://my-test-shenzhen/bam/ICU6G_S2_virus.bam
2020-03-04 17:47:38 +0800 CST 1474 Standard 9D7968A779A0DE7C1993CC2A8D0E5A56 oss://my-test-shenzhen/bam/ICU6G_S2_virus.bam.cov.txt
2020-03-04 17:47:38 +0800 CST 397 Standard 81170E30BAAFEB947A2238E015171A51 oss://my-test-shenzhen/bam/ICU6G_S2_virus.bam.summary.json
Object Number is: 3
ossutil cp oss://my-test-shenzhen/bam/ICU6G_S2_virus.bam.summary.json .
cat bam/ICU6G_S2_virus.bam.summary.json
{
"total_reads":10369818,
"total_bases":1456539874,
"pass_vendor_filter_reads":10369818,
"mapped_reads":6736,
"pair_reads":6680,
"properly_paired_reads":6520,
"mapq_40_to_inf_reads":2030,
"mapq_30_to_40_reads":0,
"mapq_20_to_30_reads":1,
"mapq_10_to_20_reads":3,
"mapq_0_to_10_reads":23,
"mapq_0_reads":10367761,
"GC":"46.499%",
"total_alignment":2057,
"supplementary_alignment":0
}
ossutil cp oss://my-test-shenzhen/bam/ICU6G_S2_virus.bam .
samtools view bam/ICU6G_S2_virus.bam
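The summary JSON is convenient for quick sanity checks before opening the BAM file. The sketch below loads the values shown above and derives a few ratios; the variable names are hypothetical, and the interpretation of each field is an assumption based only on its name.

```python
import json

# Contents of ICU6G_S2_virus.bam.summary.json as shown above.
summary = json.loads("""
{
  "total_reads": 10369818,
  "total_bases": 1456539874,
  "pass_vendor_filter_reads": 10369818,
  "mapped_reads": 6736,
  "pair_reads": 6680,
  "properly_paired_reads": 6520,
  "mapq_40_to_inf_reads": 2030,
  "mapq_30_to_40_reads": 0,
  "mapq_20_to_30_reads": 1,
  "mapq_10_to_20_reads": 3,
  "mapq_0_to_10_reads": 23,
  "mapq_0_reads": 10367761,
  "GC": "46.499%",
  "total_alignment": 2057,
  "supplementary_alignment": 0
}
""")

# Collect the mapping-quality counters by their field name prefix.
mapq_buckets = {k: v for k, v in summary.items() if k.startswith("mapq_")}

mapped_fraction = summary["mapped_reads"] / summary["total_reads"]
properly_paired_fraction = summary["properly_paired_reads"] / summary["mapped_reads"]
nonzero_mapq_reads = sum(v for k, v in mapq_buckets.items() if k != "mapq_0_reads")

print(f"mapped fraction:          {mapped_fraction:.6f}")
print(f"properly paired fraction: {properly_paired_fraction:.4f}")
print(f"reads with MAPQ > 0:      {nonzero_mapq_reads}")
```

In this sample the mapping-quality counters add up to total_reads, and the reads outside the mapq_0_reads counter add up to total_alignment, which is a useful consistency check when processing many reports.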

While continuing to wage war against the worldwide outbreak, Alibaba Cloud will play its part and will do all it can to help others in their battles with the coronavirus. Learn how we can support your business continuity at https://www.alibabacloud.com/campaign/fight-coronavirus-covid-19