How Yuque, Alibaba’s Work Collaboration Software, Has Evolved Over Time
By He Yiyu, nicknamed Busi at Alibaba. He Yiyu is a technical director for Yuque at Ant Financial. This article was compiled from He’s presentation at the SEE Conference in 2019.
Originally developed as an internal collaboration tool, Yuque is a powerful professional-grade platform for the sharing, editing, and organization of files on the cloud. Roughly translated as “messenger sparrow,” Yuque has already been used as the cloud knowledge base and virtual workspace of 100,000 employees at Alibaba Group.
Starting this month, Yuque goes on the market in China as an enterprise-level work collaboration tool that complements Alibaba’s mobile workspace application DingTalk. At the same time, the tool will be opened up to charity organizations, startups, and public education institutions free of charge.
The Evolution of Yuque’s Technical Architecture
Yuque in the Early Stages of Its Development
Yuque was first created in 2016, when Ant Financial needed a tool to host its documents. At that time, technical staff at Ant Financial used their spare time to build the documentation tool. In the early stage of the project, no personnel or resource support was available. So, to be able to quickly verify the prototype, the team chose the least costly technical solution. The underlying services were completely based on the BaaS service and container-hosting platform provided by the Technology Experience Department in Alibaba Group:
- Object service: This is a MongoDB-like data storage service.
- File service: This is a file storage service encapsulated on Alibaba Cloud Object Storage Service (OSS).
- DockerLab: This is a container-hosting platform.
These services and platforms were built on Node.js and were dedicated to internal applications. Using them reduced overall research and development costs at Alibaba and gave engineers an environment more conducive to research and innovation. The application-layer server used the Node.js web framework Egg in the form of Chair, Ant Financial’s internal encapsulation of it. Egg was later made open source by the Technology Experience Department. The server was implemented on this framework as a single web application. The application-layer client used a React technology stack in combination with an internal antd. It also used CodeMirror to implement an online markdown editor with superior functionality and an elegant experience.
This can be regarded as the “prototype stage” of Yuque. At that time, Yuque was merely a project that was created in the spare time of engineers at Alibaba and used the internal backend-as-a-service (BaaS) services as well as a series of open-source technical solutions dedicated to innovative applications. Later, the team verified the prototype of the online documentation tool.
An Internal Service
As the team saw the potential of this online documentation tool grow, the goal of Yuque evolved from simply providing a documentation tool for Ant Financial to offering an internal solution that could replace competing products such as Confluence. It then went on to become an important knowledge management platform at Alibaba. Yuque is oriented towards technical innovators, team leaders, and knowledge-base creators. However, there were still hiccups. The major problem was that a markdown editor alone was not enough for non-technical personnel to use Yuque efficiently.
Although many of us at Alibaba grew to love markdown, we could not overlook the need for a rich text editor. Rather than imitating Word and other traditional rich text editors, we chose a more web-based route and added special functions such as formulas, text graphs, and mind maps to enhance it.
As our team continued to explore the growing and exciting world of knowledge management, a three-layer knowledge management structure, which consists of the relevant team, their knowledge base, and their documentation, began to take shape with Yuque. On top of this, features such as collaboration, sharing, search, and message dynamics were growing in complexity as Yuque grew. With this new torrent of evolutionary change, BaaS services were no longer a feasible backbone for Yuque. So, to cope with these challenges, we needed to make several adjustments.
Although BaaS services are easy to use and relatively cost-effective, their functions were insufficient to keep up with Yuque’s rapid development. This underlying infrastructure also provided less than satisfactory stability. Acknowledging these faults, we replaced the BaaS architecture with Alibaba Cloud’s internal Infrastructure as a Service (IaaS) services, including our database, storage, caching, search, and other services.
Besides this, the web layer still used Node.js and the Egg framework. However, the business layer became a large-scale standalone application based on the practices of the Rails community. The data model layer was built by introducing an ORM to clarify the code hierarchy.
The frontend editor was migrated from CodeMirror to Slate. To better implement the functions of the Yuque editor, we forked Slate internally for more in-depth development. At the same time, we customized an independent content storage format to achieve more efficient data processing and better compatibility.
While providing service internally, Yuque evolved into a formal product, just like any other Ant Financial product. This product was polished and perfected during its use within Alibaba.
Becoming an External Product
With the increasing internal influence of Yuque, some Alibaba alumni who had left the company began to ask Yubo: “Yuque is very useful. Have you ever considered releasing it as a product, so external companies can use it?” After less than 6 months of preparation and refactoring, Yuque became an official product of Alibaba in 2018.
When an application moves out of a single company into the commercial environment, the technical challenges are soon magnified. With this transition, some of the core knowledge and management functions became increasingly complex. With the addition of new formats such as tables and mind maps, the requirements of multi-person real-time collaboration posed an even greater challenge to our team. To better serve enterprise and individual users, the Yuque team also had to work hard to provide better enterprise and member services. As the business in China has continued to grow steadily, Yuque’s commercial services have in turn come to require a higher level of quality, security, and stability.
To keep up with this rapid business development, the Yuque architecture had to evolve with it. To do this, we migrated all the underlying dependencies of Yuque to Alibaba Cloud from their original data center systems. As Alibaba’s vendor of public cloud services, Alibaba Cloud not only provides several fundamental storage and computing capabilities, but also offers several richer and more advanced service offerings. At the same time, Alibaba Cloud’s services come with a guarantee of an extremely high level of service availability and reliability.
Alibaba Cloud’s wide range of basic cloud computing services has helped us ensure that Yuque’s servers can select the storage, queue, search engine, and other basic services that are most suitable for day-to-day business operations. Moreover, the artificial intelligence and machine learning services of Alibaba Cloud have created further possibilities for Yuque products, including optical character recognition (OCR) for images and real-time translation, among other things. Ultimately, these functions became some of the unique features and assets that make Yuque different from its competitors.
At the application layer, Yuque servers still use large-scale Node.js web applications based on the Egg framework. However, as their functions increased, relatively independent services began to be decoupled from the primary service. These services can be divided into the following items:
- Microservices: For example, because the multi-person real-time collaboration service was a relatively independent persistent connection service that was not suitable for frequent release, we extracted it as an independent microservice to maintain its stability.
- Task service: For example, the preview service for massive local files provided by Yuque consumes many resources and has complex dependencies. We extracted it from the primary service to avoid the impact of uncontrollable dependencies and resource consumption on the primary service.
- Function Compute: Tasks such as plantuml preview and mermaid preview are not sensitive to response time, and their dependencies can be packaged into Alibaba Cloud Function Compute. So, we run these tasks in Function Compute to reduce costs and ensure security.
As Yuque’s rich text editor became increasingly complex, more and more problems occurred during Slate-based development. Finally, Yuque chose to develop different editor solutions in-house, each designed for a specific purpose: a rich text editor that uses the browser’s contenteditable, a table editor based on canvas, and a mind map editor based on SVG.
In summary, the underlying services of Yuque are fully migrated to the cloud, and cloud services are leveraged to create unique Yuque features. Yuque also provides knowledge creation and management tools for enterprise users and individual knowledge workers.
Product engineers are also experts in a particular technical field. Many of them are experts in server, testing, frontend development, or CSS. They can improve product R&D efficiency by applying their specialized knowledge to optimize the R&D toolchain.
In Yuque, the product R&D process conducted by product engineers is as follows. In the product design phase, product engineers participate in the discussion and finally produce a final design draft. Since product engineers participate in all preliminary discussions, there are no technical problems caused by a disconnection between the product design draft and the subsequent R&D process.
Next, we perform a documented system analysis and design process in Yuque. Asynchronous review is also initiated on Yuque. Then, experts from other fields review the major technical solutions to ensure that all technical difficulties are clearly identified and organized appropriately.
After a clear system design is formulated, the R&D phase begins. During this phase, automated testing must cover all code. A full-coverage unit test is required for all newly added code and the modified business logic, and an end-to-end test is also required for key functions. After all the code is written, automated testing is mandatory before code review.
Asynchronous code review starts after phased function development and testing are completed. Relevant business leaders and experts in certain fields are invited to review the code. In this stage, the code is reviewed for business logic correctness, security, and maintainability.
When publishing a product, we must ensure that phased release, emergency response, and monitoring are possible. This prevents the risk that function changes will cause problems for a large number of users.
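One common way to make a phased release possible is to gate a feature on a stable per-user bucket. The sketch below is only an illustration of that idea, not Yuque’s actual release mechanism. The names `rolloutBucket` and `isFeatureEnabled` and the simple rolling hash are assumptions made for the example.

```typescript
// Hypothetical helper: maps a user ID to a stable bucket in [0, 100).
// The same user always lands in the same bucket, so raising the
// percentage only ever adds users to the release.
function rolloutBucket(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    // Simple 32-bit rolling hash; any stable hash function would do.
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

// A feature is on for a user when the user's bucket falls below the
// current rollout percentage (0 disables it, 100 enables it for everyone).
function isFeatureEnabled(userId: string, rolloutPercent: number): boolean {
  return rolloutBucket(userId) < rolloutPercent;
}
```

Starting at a small percentage and increasing it while watching the monitoring data limits how many users a faulty change can reach, and setting the percentage back to 0 acts as an emergency switch.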
Finally, full stack R&D gives engineers the opportunity to fully participate in the entire product R&D process, allowing them to spontaneously come up with new optimization ideas and use technical means to improve the product performance. For example, the OCR image search function recently launched by Yuque was spontaneously developed by full stack engineers who did everything from preliminary technical research to product implementation.
Specifically, the system’s input and output are defined by “ports”, and external systems connect to the ports exposed by Yuque through “adapters”. As long as an implementation follows the port definitions, external systems can be replaced easily.
In this model, the Controller is the HTTP adapter exposed to the user interface by Yuque. In the Controller, we verify and convert the format of user request parameters, check user permissions, and format the output.
We have defined a method, or more often a series of methods, for each interaction between Yuque and third-party platforms and services. Adapters encapsulate the different services in different environments behind these unified methods, so a service can be called the same way everywhere. Call logs are also generated during calls.
The data model layer provides a model for the data layer. For example, the metadata of the Doc model is stored in MySQL, whereas document body data is encrypted and stored in OSS. The core business logic of Yuque has no idea where the underlying storage is located. Furthermore, as long as Yuque uses SQL to interact with databases, the underlying data can be seamlessly migrated to databases that fully support SQL syntax, such as OceanBase. Even if minor modifications are needed, they can be encapsulated at the model layer.
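The relationship between a port, an adapter, and call logging can be sketched as follows. This is a minimal illustration under stated assumptions, not Yuque’s code: `NotifierPort`, `InMemoryWebhook`, `withCallLog`, and `notifyDocPublished` are hypothetical names, and the adapter only records messages where a real one would call an external service.

```typescript
// Port: the only contract the core logic knows about (hypothetical name).
interface NotifierPort {
  send(target: string, message: string): boolean;
}

// Adapter: stands in for a real webhook client; it only records messages.
class InMemoryWebhook implements NotifierPort {
  readonly sent: string[] = [];

  send(target: string, message: string): boolean {
    this.sent.push(`${target}: ${message}`);
    return true;
  }
}

// Wrapper that adds call logging to any adapter without changing it.
function withCallLog(name: string, inner: NotifierPort, log: string[]): NotifierPort {
  return {
    send(target: string, message: string): boolean {
      const ok = inner.send(target, message);
      log.push(`${name}.send(${target}) -> ${ok ? "ok" : "failed"}`);
      return ok;
    },
  };
}

// Core logic: depends on the port, never on a concrete service.
function notifyDocPublished(notifier: NotifierPort, title: string): boolean {
  return notifier.send("team-channel", `Document published: ${title}`);
}
```

Because `notifyDocPublished` sees only the port, swapping the in-memory adapter for one that talks to a real service would not touch the core logic, and the logging wrapper applies to either.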
Lastly, let’s look at a document release example. When you call the HTTP interface to interact with Yuque, data is written to the storage, including MySQL and OSS, through the model layer, and the document cache is updated. Sending asynchronous messages to other systems triggers the DingTalk WebHook and synchronizes the data to the search engine. These interactions with external systems can be performed after the adapters are encapsulated, allowing each system to perform its functions such as parameter conversion, permission verification, and logging. This not only ensures that the core logic is concise, but also makes it easier to trace system call routes.
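The release flow described above might look roughly like the sketch below. All interface and function names (`MetaStore`, `BodyStore`, `DocCache`, `MessageBus`, `DocModel`, `publishDoc`) are assumptions made for illustration. Encryption of the document body and asynchronous delivery are omitted to keep the example short.

```typescript
interface DocMeta { id: number; title: string }

// Ports for the external systems involved in a release (hypothetical).
interface MetaStore { save(meta: DocMeta): void }               // e.g. MySQL
interface BodyStore { put(key: string, body: string): void }    // e.g. OSS
interface DocCache { set(docId: number, meta: DocMeta): void }
interface MessageBus { publish(topic: string, docId: number): void }

// Model layer: hides where metadata and body are physically stored.
class DocModel {
  private metaStore: MetaStore;
  private bodyStore: BodyStore;

  constructor(metaStore: MetaStore, bodyStore: BodyStore) {
    this.metaStore = metaStore;
    this.bodyStore = bodyStore;
  }

  save(meta: DocMeta, body: string): void {
    this.bodyStore.put(`docs/${meta.id}`, body);
    this.metaStore.save(meta);
  }
}

interface PublishDeps { model: DocModel; cache: DocCache; bus: MessageBus }

// Core logic: write through the model, refresh the cache, then notify
// other systems (webhooks, search indexing) through a message.
function publishDoc(deps: PublishDeps, meta: DocMeta, body: string): void {
  deps.model.save(meta, body);
  deps.cache.set(meta.id, meta);
  deps.bus.publish("doc.published", meta.id);
}
```

The core function reads as three short steps because parameter conversion, permission checks, and logging live in the adapters behind each port.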
Hybrid Application Architecture
When the system grows to a certain size, should we continue to add functions to large standalone applications or split them into microservices? The coexistence of these two architectures proves that they have their own pros and cons. Your architecture selection should be determined based on your current business scale and team distribution. Following this precise logic, the technical architecture of Yuque became a hybrid architecture along with our evolving business format.
Yuque’s primary service is a large Node.js service that integrates all the application business logic. In addition to the primary service, there are some other services in different formats.
- Microservice: Some independent and stable functional modules or services that require additional architecture deployment are deployed independently as microservices. Provisionally, we use HTTP interfaces for interaction between systems. For example, the real-time collaboration service is deployed as an independent microservice because it is an independent and stable persistent connection service that cannot be released and restarted frequently.
- Task cluster: Some CPU-intensive tasks or services with complex third-party dependencies are placed in an independent task cluster. For example, various file preview services may depend on other services and account for a large amount of computing costs. Therefore, it is best to place these services into a task cluster and use queues to keep concurrency under control.
- Serverless Function Compute: As far as possible, we try to migrate services that are less sensitive to response time and can be expressed as functions to Alibaba Cloud Function Compute. Such services include plantuml, mermaid, and other text drawing services.
Let’s look at rendering by mermaid as an example. When you enter mermaid code in Yuque, Yuque calls a function deployed in Alibaba Cloud Function Compute, which runs puppeteer to render the code into SVG and returns the result.
However, why are Serverless tasks separated out here? As mentioned earlier, Node.js is single-threaded and unsuitable for CPU-intensive tasks. Based on a serverless architecture, we can migrate tasks that carry security risks or consume a large amount of CPU resources to Function Compute.
In this way, such tasks run in a sandbox environment, so malicious code cannot cause security risks for the primary service. This approach also removes these CPU-intensive tasks from the primary service so that they do not block it during concurrent operations.
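The idea of using a queue to cap how many heavy tasks run at once can be sketched in a few lines. This is an in-process illustration only. A real task cluster would use a separate queue service and worker machines, and the `TaskQueue` class and its callback-style `Task` type are hypothetical.

```typescript
// A task receives a `done` callback and must call it when it finishes.
type Task = (done: () => void) => void;

// Hypothetical in-process queue that never runs more than `limit` tasks at once.
class TaskQueue {
  private running = 0;
  private pending: Task[] = [];
  private limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  push(task: Task): void {
    this.pending.push(task);
    this.drain();
  }

  // Start queued tasks while there is spare capacity.
  private drain(): void {
    while (this.running < this.limit) {
      const task = this.pending.shift();
      if (!task) break;
      this.running++;
      task(() => {
        this.running--;
        this.drain();
      });
    }
  }
}
```

With a limit of 2, a burst of preview requests is worked through two at a time, and the rest wait in the queue.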
The pay-as-you-go billing method can significantly reduce costs because you do not have to deploy a resident service for low-frequency function scenarios. Therefore, we try our best to migrate such services to Serverless services, such as Alibaba Cloud Function Compute.
Common Concerns Beyond the Programming Language
In addition to the programming language, other aspects need to be considered in any commercial system. Among them, the two most important aspects are security and stability.
Various security risks exist due to a system’s dependencies on the frontend, servers, and underlying infrastructure:
- Front-end security risks: cross-site scripting (XSS), phishing, and cross-site request forgery (CSRF)
- Server security risks: horizontal permission issues, unauthorized access, sensitive information leakage, Server-Side Request Forgery (SSRF), and SQL injection
- Cloud service security risks: SMS or email bombing, data leakage, and content security
There is no easy way to solve all these security problems. They can only be handled individually, but there are some basic principles that can be followed. Do not trust any user input:
- XSS must be prevented anywhere rich text is rendered, because the content may not have been entered through the editor.
- When running user code on the server, put it in a sandbox.
- When requesting resources transmitted by users from the server, the request must be filtered for SSRF.
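As an illustration of the last point, a first-pass SSRF filter can reject URLs that point at local or private addresses before the server fetches them. The sketch below is an assumption-laden example, not Yuque’s filter: `isPrivateIPv4` and `isUrlAllowed` are hypothetical names, and only literal hosts are checked. A production filter would also need to resolve DNS names and re-check every redirect target.

```typescript
// Returns true for loopback, private, link-local, and "this network" IPv4 literals.
function isPrivateIPv4(host: string): boolean {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!match) return false;
  const a = Number(match[1]);
  const b = Number(match[2]);
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

// Hypothetical pre-flight check for a user-supplied URL.
function isUrlAllowed(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return false;
  if (host.startsWith("[")) return false; // IPv6 literals are rejected outright here
  return !isPrivateIPv4(host);
}
```

Rejecting everything that is not plain `http` or `https` also blocks schemes such as `file:`, which are a common route to reading local resources.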
Develop a standard coding paradigm to handle security risks, and pay special attention to the following during code review:
- All interfaces must have a permission verification mechanism.
- Use the response serialization method to filter sensitive information.
- SQL statements must not be built by splicing strings together; pass values as query parameters instead.
Yuque has been working with the security team since its commercialization, focusing on internal security awareness training, internal security team testing, internal red team versus blue team defense drills, and external white-hat penetration testing.
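Filtering sensitive information at serialization time usually means building the response from an explicit allowlist of fields instead of returning the stored record as-is. The following is a minimal sketch of that approach. The `UserRecord` shape and the `pick` and `serializeUser` helpers are assumptions made for the example.

```typescript
// Hypothetical stored record; some fields must never leave the server.
interface UserRecord {
  id: number;
  name: string;
  email: string;
  passwordHash: string;
}

// Copies only the listed keys into a new object.
function pick<T extends object, K extends keyof T>(source: T, keys: readonly K[]): Pick<T, K> {
  const out = {} as Pick<T, K>;
  for (const key of keys) {
    out[key] = source[key];
  }
  return out;
}

// Allowlist of fields that may appear in an API response.
const PUBLIC_USER_FIELDS = ["id", "name"] as const;

function serializeUser(user: UserRecord): Pick<UserRecord, "id" | "name"> {
  return pick(user, PUBLIC_USER_FIELDS);
}
```

With an allowlist, a column added to the table later stays private until someone deliberately exposes it, which makes the rule easy to verify in code review.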
To ensure the stability of Yuque, we have done a lot of work on the frontend, servers, and cloud services. Just like security, stability is another long-term project that involves all aspects of the system. For Yuque, stability assurance is done in two main areas:
- Service availability assurance: In the architecture design, we must eliminate single points of failure (SPOFs). Disaster recovery and backup are required for underlying data, and services must be deployed in multiple units and zones. Avoid introducing unnecessary strong dependencies.
- Abnormality monitoring and tracing: This includes the monitoring of frontend business tracking logs and exception logs, end-to-end tracking and collection of logs on the server, and the monitoring and analysis of system performance. Eventually, we will be able to promptly detect and track exceptions and locate and analyze performance problems.
Then, how can we avoid unnecessary strong dependencies?
To give an example from Yuque, MySQL is a strong dependency that cannot be removed, whereas the cache should not be one. However, at the beginning, Yuque sessions were stored in the cache. This meant that, if the Redis cluster failed, user data could not be obtained and users could not log on. The cache was therefore a strong dependency.
To address this problem, we moved session storage to MySQL, so Redis became a weak dependency and the system could continue to function in the case of a Redis failure.
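The difference between a strong and a weak dependency can be made concrete with a small sketch. It assumes, purely for illustration, that a cache may still sit in front of the database as an optional accelerator. The article only states that session storage moved to MySQL, and the `SessionStore` interface and `loadSession` function are hypothetical.

```typescript
interface Session { userId: number }

// Hypothetical store interface; `get` may throw when the backend is down.
interface SessionStore {
  get(sessionId: string): Session | undefined;
}

// The database is the source of truth. The cache is only tried first,
// and a cache outage is treated as a miss rather than as an error.
function loadSession(sessionId: string, cache: SessionStore, db: SessionStore): Session | undefined {
  try {
    const hit = cache.get(sessionId);
    if (hit) return hit;
  } catch {
    // Cache unavailable: fall through to the database.
  }
  return db.get(sessionId);
}
```

If the cache throws, users can still log on because the lookup simply continues against the database.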
Another example is the multi-person real-time collaborative editing feature recently launched by Yuque. Before this feature was launched, a document locking method was used to prevent multiple people from editing the same document at the same time. However, with the introduction of the multi-person real-time collaboration service, once the service fails, users cannot edit documents. This meant that this service was a strong dependency of the Yuque system. To solve this problem, when users fail to connect to the collaboration service, the system automatically fails over to the old lock mode. As such, the collaborative service becomes a weak dependency of Yuque.
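The failover described above can be sketched as a single decision when the editor opens. The names below (`CollabService`, `LockService`, `openEditor`) are assumptions for the example, and the real services are certainly asynchronous. Synchronous calls are used here only to keep the sketch short.

```typescript
type EditMode = "realtime" | "lock";

// Hypothetical ports; `connect` throws when the collaboration service is unreachable.
interface CollabService { connect(docId: number): void }
interface LockService { acquire(docId: number, userId: number): boolean }

interface EditorSession { mode: EditMode; editable: boolean }

// Prefer real-time collaboration, but fall back to the older
// document-lock mode when the collaboration service cannot be reached.
function openEditor(docId: number, userId: number, collab: CollabService, locks: LockService): EditorSession {
  try {
    collab.connect(docId);
    return { mode: "realtime", editable: true };
  } catch {
    return { mode: "lock", editable: locks.acquire(docId, userId) };
  }
}
```

In lock mode, only the user who holds the lock can edit, which is the behavior the product had before real-time collaboration was introduced.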
How Did Yuque Choose an Appropriate Technology Stack?
Over the past few years, the technology behind Yuque has been evolving, but it has always followed several principles. The technology stack must be selected to match the product development stage. Products have different technical requirements at different stages. Earlier stages have higher requirements for iteration efficiency. After reaching a commercial scale, products require better stability and performance. It is not necessary to use the most advanced technical solutions as soon as they are released, but we must consider them in combination with the product stage.
Most importantly, you must consider the security, stability, maintainability, and scalability of the platform. The specific technology you end up choosing is secondary. The language and services of your choice may change, but basic security awareness, stability awareness, and code maintainability are the fundamental factors that determine whether a project can survive over the long term.