See How a Professional Translation Platform Is Helping Experts Fight the Coronavirus
By DAMO Academy Translation Platform Team.
The current pandemic poses a grave threat to human life and health across the world. The World Health Organization has called on the international community to regard the COVID-19 virus as “public enemy number one,” so fighting the epidemic is everyone’s priority. Around the world, researchers are studying the virus itself and possible treatments, and progress is being made toward a vaccine. As of February 4, at least 77 research papers related to COVID-19 had been published in English. To break down language barriers and help share the latest research on the virus between China and the rest of the world, the Alibaba DAMO Academy’s translation team launched the Professional Translation Platform for COVID-19. This platform provides free Chinese-to-English and English-to-Chinese translation services for professional documents in the medical field to help medical professionals around the world work together to fight the virus.
DAMO’s Professional Medical Translation Engine
The engine that powers this translation platform was built by iteratively optimizing an industry-leading bidirectional Chinese-English translation model. We used advanced automatic corpus filtering technology to collect a large amount of high-quality medical-domain data, and refined the parameters of the model on this data while also incorporating some general-domain data. This enables the new translation engine to adapt well to medical tasks while retaining good translation performance in other fields. In addition, we used the latest intervention technology developed by the DAMO Academy to integrate the latest bilingual knowledge base terms related to the epidemic. This ensures that medical terms are translated accurately. On a public test set taken from the medical field, the overall translation performance of the new engine was evaluated as a 7% improvement over that of the original translation model.
Core Machine Translation Algorithms
At present, Alibaba Translate adopts a translation model based on deep neural networks. We use the seq-to-seq translation model framework, take individual strings as input, and use subwords as the minimum translation unit to generate translations sentence by sentence. We adopt a state-of-the-art Deep Transformer network architecture and use deep neural networks and self-attention to improve our modeling capabilities. Together with a highly parallel network structure design, this significantly accelerates the training convergence of the model. In addition, by incorporating more a priori linguistic knowledge, we have improved the quality of the translation system. Because syntax, part-of-speech, affix, and other information is integrated into the translation model, the output translation better follows grammatical rules and lexical norms.
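To illustrate what using subwords as the minimum translation unit means, here is a minimal sketch of greedy longest-match subword segmentation. The article does not say which subword algorithm Alibaba Translate uses, so the WordPiece-style matching, the `##` continuation marker, the function name `segment`, and the toy vocabulary are all assumptions made for illustration only.

```python
def segment(word, vocab, unk="<unk>"):
    """Split one word into subwords by greedy longest-match-first lookup.

    Hypothetical helper: pieces that continue a word carry a "##" prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No known piece covers this position: give up on the word.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
TOY_VOCAB = {"corona", "##virus", "##es", "anti", "##viral"}

if __name__ == "__main__":
    print(segment("coronaviruses", TOY_VOCAB))
    print(segment("antiviral", TOY_VOCAB))
```

With a subword vocabulary like this, a rare medical term that never appeared as a whole word in the training data can still be represented as a sequence of known pieces rather than a single unknown token.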
Bilingual Corpus Scoring Technology
By automatically evaluating the quality of the collected bilingual corpus, we can filter high-quality domain data from a large volume of noisy data to better adapt the model to a specific domain. The following figure shows the overall architecture of the model.
The right side of the figure shows the main part of the model. This is the pre-trained bilingual expert model, which is very similar to the Transformer NMT model. However, as we are not performing generation tasks, we changed the target side to a bidirectional Transformer model. This model can effectively extract bidirectional language features from the original text and the translation. Training it in this way produces a powerful bilingual language model.
The left side of the figure shows a quality evaluation model based on Bi-LSTM, which integrates the features obtained from the bilingual expert model and some word distribution matching features. These features can effectively predict the quality of the corpus.
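The filtering step that such a quality model enables can be sketched as follows. This is not the bilingual expert or Bi-LSTM model described above: as a stand-in, the score here combines two simple hand-written features, a length ratio and a lexicon match rate. The function names, weights, threshold, and toy lexicon are all hypothetical.

```python
def length_ratio_score(src_tokens, tgt_tokens):
    """Ratio of the shorter to the longer side, in [0, 1]."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    shorter = min(len(src_tokens), len(tgt_tokens))
    longer = max(len(src_tokens), len(tgt_tokens))
    return shorter / longer


def lexicon_match_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of lexicon-covered source tokens whose translation is in the target."""
    known = [tok for tok in src_tokens if tok in lexicon]
    if not known:
        return 0.0
    tgt_set = set(tgt_tokens)
    hits = sum(1 for tok in known if lexicon[tok] in tgt_set)
    return hits / len(known)


def quality_score(src, tgt, lexicon, w_len=0.3, w_lex=0.7):
    """Stand-in quality score; the weights are arbitrary illustration values."""
    src_tokens = src.lower().split()
    tgt_tokens = tgt.lower().split()
    return (w_len * length_ratio_score(src_tokens, tgt_tokens)
            + w_lex * lexicon_match_score(src_tokens, tgt_tokens, lexicon))


def filter_corpus(pairs, lexicon, threshold=0.6):
    """Keep only sentence pairs whose score reaches the threshold."""
    return [(s, t) for s, t in pairs if quality_score(s, t, lexicon) >= threshold]


# Toy data, invented for this example (Chinese side is pre-tokenized with spaces).
TOY_LEXICON = {"virus": "病毒", "infects": "感染", "lungs": "肺部"}
TOY_PAIRS = [
    ("the virus infects lungs", "病毒 感染 肺部"),   # plausible translation
    ("the virus infects lungs", "今天 天气 很 好"),  # unrelated noise
]

if __name__ == "__main__":
    for src, tgt in filter_corpus(TOY_PAIRS, TOY_LEXICON):
        print(src, "|", tgt)
```

In a real pipeline, `quality_score` would be replaced by the prediction of the trained quality evaluation model, while the thresholding logic in `filter_corpus` stays the same.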
Machine Translation Intervention Technology
Our proprietary neural network translation intervention technology makes effective use of external a priori knowledge to make translations more professional and to correct translation errors quickly. It can repair bad cases online and satisfy custom translation needs. As part of this, we have implemented an online translation intervention module, which can smoothly perform whole-sentence and fragment intervention. This technology is widely used in e-commerce, voice, and communication translation scenarios. In medical scenarios, Alibaba Translate uses this technology to integrate up-to-date bilingual knowledge bases for the medical field and so ensure the accuracy of medical terminology. The same terminology technology also allows users to customize terms.
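The article does not describe how the intervention module works internally. One common way to implement fragment-level term intervention is placeholder substitution, sketched below under that assumption: known source terms are masked before translation and the approved target terms are restored afterwards. The function `translate_with_terms`, the placeholder format, the term base, and the stub translator are all hypothetical.

```python
def translate_with_terms(text, term_base, translate):
    """Translate text while forcing approved translations for known terms.

    term_base maps source terms to target terms; translate is any callable
    that maps a source string to a target string and leaves placeholders intact.
    """
    protected = {}
    # Mask longer terms first so a short term cannot split a longer one.
    for i, src_term in enumerate(sorted(term_base, key=len, reverse=True)):
        if src_term in text:
            placeholder = f"__TERM{i}__"
            text = text.replace(src_term, placeholder)
            protected[placeholder] = term_base[src_term]
    translated = translate(text)
    # Restore the approved target-language term for each placeholder.
    for placeholder, tgt_term in protected.items():
        translated = translated.replace(placeholder, tgt_term)
    return translated


def stub_translate(text):
    """Stand-in for a real MT engine; it only knows one phrase."""
    return text.replace("患者", " patients")


# Toy term base, invented for this example.
TOY_TERMS = {"新型冠状病毒肺炎": "COVID-19", "冠状病毒": "coronavirus"}

if __name__ == "__main__":
    print(translate_with_terms("新型冠状病毒肺炎患者", TOY_TERMS, stub_translate))
```

Because the term base is plain data, new or user-defined terms can take effect immediately without retraining the underlying model, which fits the quick online repair described above.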
Professional Epidemic Dictionary
Medical texts use a great deal of specialized terminology, which can make it difficult for researchers and clinicians to read medical texts in a foreign language. In addition, since medicine involves a wide range of professional fields and many sub-disciplines, even doctors who can use English with confidence in their own fields may be unable to read content in English that deals with other departments, fields, or disciplines. To address this pain point, this public welfare platform has gathered more than 500,000 medical terminology dictionary entries that cover multiple subfields, including clinical care, biology, and pharmaceuticals. It also records new terms and translations related to the current epidemic in real time so that users can look them up. Users can also add new term translations and share the latest epidemic terms with others in real time.
Epidemic-related Professional Literature Sharing
Since the outbreak of COVID-19, frontline researchers and clinicians have paid close attention to research on the virus itself, the epidemic situation, and prevention and control measures, both in their own countries and abroad. At present, the public welfare platform includes content from authoritative journals, such as the New England Journal of Medicine, the Lancet, Nature, and Science, as well as the Journal of Medical Virology and the Journal of Clinical Medicine. This content includes nearly 20 papers from epidemiology, virology, clinical medicine, and other fields. Both the original English documents and the Chinese translations are provided free of charge. Users can read and download the papers, allowing them to promptly and conveniently keep up with the latest research worldwide. The platform also allows users to upload their own papers and automatically generate translations, and it provides a document sharing mechanism to make it easier for users to gather and search for relevant materials.
Since most of the documents are shared and disseminated in PDF format, Alibaba Translate is optimized for PDF document translation. The specific optimizations include the following.
- PDF text parsing: This process generally uses optical character recognition (OCR) or directly parses PDF documents. After investigating and comparing different approaches, Alibaba Translate found that converting PDF documents into Microsoft Word .docx files makes it easier to lay out and restore the translated documents. Therefore, PDF files are converted into .docx files, which are then parsed for document translation.
- Layout retention: The layout of the original PDF document must be retained in the translation so that the source file and the translated file can be viewed and compared side by side. This improves the reading experience. When extracting the text content of a docx file, Alibaba Translate retains the text locations and other similar information. Then, after the translation is done, the platform generates the translated files while applying the retained information so they have the same layout as the source files.
- Web and mobile preview: By using Alibaba Cloud Intelligent Media Management (IMM) to provide a document preview function, we also optimized the document viewing experience.
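The layout retention idea in the list above can be sketched as follows: each extracted text fragment keeps its position and style, only the text is replaced by its translation, and the output is rebuilt from the same positional records. The `TextBlock` structure, its fields, and the stub translator are hypothetical. The article does not describe the actual data model or the .docx parsing code, so no real document library is used here.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class TextBlock:
    """One extracted text fragment plus the layout information to retain."""
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed units
    style: str
    text: str


def translate_blocks(blocks: List[TextBlock],
                     translate: Callable[[str], str]) -> List[TextBlock]:
    """Return new blocks with translated text and unchanged layout fields."""
    result = []
    for block in blocks:
        if block.text.strip():
            result.append(replace(block, text=translate(block.text)))
        else:
            # Keep empty or whitespace-only blocks untouched to preserve spacing.
            result.append(block)
    return result


def stub_translate(text: str) -> str:
    """Stand-in for a real MT engine, using a tiny invented lookup table."""
    table = {"Abstract": "摘要", "Methods": "方法"}
    return table.get(text, text)


SOURCE_BLOCKS = [
    TextBlock(page=1, bbox=(72.0, 90.0, 300.0, 110.0), style="Heading1", text="Abstract"),
    TextBlock(page=1, bbox=(72.0, 120.0, 300.0, 125.0), style="Normal", text="   "),
    TextBlock(page=2, bbox=(72.0, 90.0, 300.0, 110.0), style="Heading1", text="Methods"),
]

if __name__ == "__main__":
    for block in translate_blocks(SOURCE_BLOCKS, stub_translate):
        print(block.page, block.bbox, block.style, repr(block.text))
```

Because source and translated blocks share the same page and position fields, the two documents can be rendered side by side with matching layouts.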
You can go directly to the Professional Translation Platform for the COVID-19 Field portal at this link: https://medtrans.damo.alibaba.com/medtrans.htm
While continuing to wage war against the worldwide outbreak, Alibaba Cloud will play its part and will do all it can to help others in their battles with the coronavirus. Learn how we can support your business continuity at https://www.alibabacloud.com/campaign/fight-coronavirus-covid-19