Intelligently Generate Frontend Code from Design Files: imgcook
By Miaojing and Boben
Frontend intelligence is one of the four major technical directions of the Frontend Committee of Alibaba. People may wonder what frontend development has to do with AI, how AI can be applied to frontend development, and whether it will heavily impact the whole industry.
Taking automatic code generation from design documents as its theme, this article examines these questions through background analysis, competitive product analysis, and problem resolution.
Machine learning is trending in the industry, and there is a broad consensus that AI is the future. Kai-Fu Lee also pointed out in “AI future” that artificial intelligence will replace nearly 50% of human work within 15 years, especially simple and repetitive tasks. Moreover, white-collar work will be easier to replace than blue-collar work: replacing blue-collar workers may require breakthroughs in robotics and related technologies in both software and hardware, whereas replacing white-collar workers requires breakthroughs in software only. Will our frontend “white-collar” work be replaced? When, and how much of it?
Looking back to 2010, software was affecting almost all industries, which made the software industry prosper in the following years. By 2019, AI had begun to affect the software industry itself. In the DBA field, for instance, Question-to-SQL can automatically generate SQL statements from questions you ask about a given domain. Meanwhile, TabNine, a machine-learning-powered source code analysis tool, assists in code generation. In the design industry, an intelligent designer named “Luban” was launched. What about frontend development?
We have to mention a familiar question: How to generate code automatically from a design document (Design2Code, referred to as D2C). The Frontend Committee of Alibaba focuses on the direction of intelligence, and the current stage is to improve web development efficiency. We will try to put an end to simple and repetitive work, enabling web developers to focus on more challenging work.
Competitive Product Analysis
In 2017, Pix2Code, a paper on image-to-code generation, attracted the industry’s attention. It describes generating source code directly from a design image with deep learning. Subsequently, similar ideas regularly emerged in the community. For instance, Microsoft AI Lab launched Sketch2Code in 2018, an open source tool for converting hand-drawn sketches into code. At the end of the same year, Yotako drew attention as a platform for turning design drafts into code. Since then, machine learning has firmly held the attention of frontend developers.
From the analysis of competitive products, we drew the following insights:
1) Currently, the object detection capability of deep learning in images is suitable for reusable material identification (module identification, basic component identification, and business component identification) with larger granularity.
2) The complete end-to-end model that generates code directly from images is highly complex, and the generated code is unreliable. We need several sub-networks to work together in order to achieve higher quality.
3) When the model cannot provide the expected accuracy, hard rules agreed on in the design document can be used to intervene. On the one hand, manual intervention helps users get the desired results. On the other hand, these manual rule protocols are also high-quality samples that can be used as training data to improve the model’s recognition accuracy.
The goal of generating code from the design document is to enable web developers to improve work efficiency and eliminate repetitive work. The general daily workflow is as follows for regular frontend developers, especially client-side developers.
The general workload of web development mainly focuses on view code, logical code, and frontend/backend integration. Next, we break down the goals and analyze them one by one.
In view code development, HTML and CSS code is generally written based on a design document. How can efficiency be improved here? Faced with the repetitive work of UI view development, it is natural to think of packaging and reusing materials such as components and modules. Various UI libraries have emerged from this approach, along with higher-level encapsulations in the form of platforms for building websites visually. However, reused materials cannot cover all scenarios, and many business scenarios need personalized views. Going back to the problem itself, is it possible to generate reliable HTML and CSS code directly?
To sum up, we are facing the following problems:
- Reasonable Layout: Converting absolute positions to relative positions, deleting redundant nodes, grouping reasonably, detecting loops, etc.
- Element Self-adaption: Extensibility of the element itself, alignment between elements, and fault tolerance for an element’s maximum width and height.
- Semantic: Multi-level semantics of class names.
- CSS expression: The background color, rounded corners, lines, etc.
The industry has been working in this direction for a long time. The basic information of elements in a design document can be exported through a design tool’s plugin. However, two problems remain: the requirements on the design document are high, and the generated code is hard to maintain.
We are building an expert rule system for the layout algorithm. This part is better suited to a rule system at the current stage: for users, the layout algorithm needs to be usable close to 100% of the time, and most of the problems involved are combinations of numerous attributes and values. For now, rules are more controllable.
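As one illustration of what a layout rule of this kind could look like, the sketch below groups absolutely positioned layers into rows by vertical overlap and then orders each row from left to right, which is the structure a Flex row layout needs. The `Box` type, the `group_into_rows` function, and the overlap rule itself are hypothetical; this is not imgcook's actual layout algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical layer with an absolute position taken from a design file."""
    name: str
    x: int
    y: int
    width: int
    height: int


def overlaps_vertically(a: Box, b: Box) -> bool:
    """Two boxes can share a row if their vertical extents intersect."""
    return a.y < b.y + b.height and b.y < a.y + a.height


def group_into_rows(boxes: List[Box]) -> List[List[Box]]:
    """Rule: walk the boxes top to bottom and merge vertically overlapping ones into a row."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if rows and any(overlaps_vertically(box, other) for other in rows[-1]):
            rows[-1].append(box)
        else:
            rows.append([box])
    # Inside a row, order children left to right (flex-direction: row).
    return [sorted(row, key=lambda b: b.x) for row in rows]


layers = [
    Box("title", x=120, y=10, width=200, height=30),
    Box("avatar", x=10, y=0, width=100, height=100),
    Box("button", x=10, y=120, width=310, height=40),
]
rows = group_into_rows(layers)
row_names = [[box.name for box in row] for row in rows]
print(row_names)  # [['avatar', 'title'], ['button']]
```

A rule like this is easy to inspect and adjust when it misfires, which fits the controllability argument above. A real system would need many more rules for nesting, spacing, and alignment.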
However, when rules struggle to solve a problem, we can use models instead. For instance, in some cases we need to recognize groups and loops. In addition, web developers often build the UI from existing UI libraries, so it is important to recognize basic components in the design documents. For these problems, we use Pipcook to build an object detection pipeline that trains our models. Moreover, semantic recognition that takes the context across elements into account is required, which is a key problem that deep learning addresses. For example, to recognize what an image in the design draft means, or why a certain piece of text is used in a certain place, we need image classification and text classification models, which are also built with Pipcook on top of tfjs-node.
Usually, web development also includes logic code, including data binding, dynamic effects, and business logic. The part that can be improved is the reuse of dynamic effect and business logic code, which can be abstracted into basic components.
- Data field binding: This is quite feasible. You can determine the candidate field based on the design document’s text or image. But the cost-performance ratio is not high because this is more about business logic, which is not general logic.
- Dynamic effect: The input of this part is a design document. Dynamic effects are delivered in various forms: some as animated GIF demonstrations, others as text descriptions or even verbally. Dynamic effect code is better suited to visual generation tools. There is no precedent for generating it directly and intelligently, and given the input-output ratio, it is not a problem to tackle in the short term.
- Business logic: This part of development is mainly based on the PRD, or even on logic that exists only in the product manager’s mind. Generating this logic code intelligently would require too much input. We need to see which specific problems intelligence can solve in this sub-field.
Logical Code Generation
The ideal plan is to learn from historical data, as has been done in artistic fields such as poetry, painting, and music, and to generate new logic code directly from the PRD as input. But can the generated code run directly without errors?
At present, although AI is developing rapidly, the problems it can solve are still limited. It is necessary to frame problems as types that AI solves well. Reinforcement learning is good at strategy optimization, and deep learning is better at computer vision tasks such as classification and object detection.
For business logic code, the first thing that comes to mind is to use an LSTM (long short-term memory) network, an NLP technique, to obtain the semantics of function blocks. The intelligent code suggestions in VS Code and TabNine follow a similar strategy of learning from existing code.
In addition, we found that intelligence can also help identify the location (timing) of logical points in the view and guess the logical semantics based on the view.
Let’s summarize the advantages of intelligence at this stage:
- Analyzing and guessing the semantics of high-frequency function blocks (logical blocks) based on historical source code. In this way, we can recommend code blocks while editing code.
- We can guess some of the reusable logical points from the design draft. For instance, to bind the image or text data to view, we can use NLP classification or image classification to recognize the elements’ contents.
Therefore, the problems that can currently be solved in business logic generation are relatively limited. In particular, when new business logic points appear with new logic orchestration, the only references are the PRD or the product manager’s mind. The current strategies for business logic generation are as follows:
- Field binding: Use deep learning to intelligently identify the semantic classification of text and images in the design draft, especially the text part.
- Reusable business logic points: These are identified intelligently based on views. They include small logic points (one line of expression, or several lines of code that are generally insufficient to be encapsulated into components), basic components, and business components.
- New business logic that cannot be reused: Structured (visualized) collection of PRD requirements is a difficult task and is still being tried.
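The field binding strategy can be sketched as follows: classify each text layer into a semantic category, then map the category to a candidate data field. In this minimal sketch, keyword and pattern rules stand in for the trained text classification model, and the categories and field names (`item.price`, `item.title`, `item.buttonText`) are hypothetical.

```python
import re
from typing import Optional

# Hypothetical semantic categories mapped to hypothetical backend field names.
FIELD_BY_CATEGORY = {
    "price": "item.price",
    "title": "item.title",
    "action": "item.buttonText",
}


def classify_text(text: str) -> Optional[str]:
    """Stand-in for a text classification model: returns a semantic category."""
    stripped = text.strip()
    if re.fullmatch(r"\$\s?\d+(\.\d{1,2})?", stripped):
        return "price"
    if stripped.lower() in {"buy now", "add to cart", "shop now"}:
        return "action"
    if len(stripped.split()) >= 3:
        return "title"
    return None  # Unknown: leave the text static instead of guessing.


def bind_field(text: str) -> Optional[str]:
    """Map a text layer to a candidate data field, or None if no binding applies."""
    category = classify_text(text)
    return FIELD_BY_CATEGORY.get(category) if category else None


samples = ["$199.00", "Buy Now", "Wireless Noise Cancelling Headphones", "New"]
bindings = {text: bind_field(text) for text in samples}
for text, field_name in bindings.items():
    print(f"{text!r} -> {field_name}")
```

Returning `None` for unrecognized text keeps the layer static, so a wrong guess never silently binds the wrong field.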
The analysis above describes strategies for intelligently generating HTML, CSS, part of the JS, and part of the data. This is the primary process of D2C (Design2Code). The product we developed from this idea is imgcook. In recent years, third-party plugins for popular design tools (Sketch, PS, XD, etc.) have matured, and deep learning has developed rapidly, even outperforming humans in some recognition tasks. This is the vital background for D2C’s birth and continuous evolution.
Object detection 2014–2019 paper
Based on the general analysis of the frontend intelligent development mentioned above, we have made an overview and architecture of the existing D2C intelligent technology system, which is mainly divided into the following three parts:
- Recognition capability: The ability to identify the design document. This means intelligently analyzing multiple dimensions of information from the design document, including layers, basic components, business components, layouts, semantics, data fields, and business logic. If the intelligent recognition is not accurate, human intervention is used to correct errors. On the one hand, highly usable code is generated with low-cost intervention. On the other hand, the manual corrections can also be used as samples for online training.
- Expression ability: Mainly outputs the data and connects to engineering projects.
  - Use DSLs to turn the standard structured description into code (Schema2Code).
  - Connect to projects through IDE plugins.
- Algorithm engineering: To better support the intelligence that D2C requires, frequently used capabilities are provided as services, mainly data generation, data processing, and model services.
  - Sample generation: Mainly process each channel’s sample data and generate samples.
Summary layering of frontend intelligent D2C capabilities
We use the same Data Protocol Specification (D2C Schema) to connect different parts of the architecture shown above in the whole project. This ensures that the recognition can be mapped to the specific corresponding fields, and the code can be correctly generated through schemes such as the code generation engine in the expression layer.
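To make the role of a shared schema concrete, here is a minimal sketch of a schema tree and a toy generator that turns it into HTML. The article does not list the actual fields of the D2C Schema, so the `SchemaNode` fields and the `render_html` function below are assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Dict, List


@dataclass
class SchemaNode:
    """Hypothetical node of a D2C-style schema; the real D2C Schema fields may differ."""
    component: str                      # e.g. "div" or "span"
    class_name: str = ""                # would be filled in by a semantic step
    style: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["SchemaNode"] = field(default_factory=list)


def render_html(node: SchemaNode, indent: int = 0) -> str:
    """Toy expression-layer generator: walks the schema tree and emits HTML."""
    pad = "  " * indent
    attrs = ""
    if node.class_name:
        attrs += f' class="{escape(node.class_name)}"'
    if node.style:
        css = "; ".join(f"{key}: {value}" for key, value in node.style.items())
        attrs += f' style="{escape(css)}"'
    if not node.children:
        return f"{pad}<{node.component}{attrs}>{escape(node.text)}</{node.component}>"
    inner = "\n".join(render_html(child, indent + 1) for child in node.children)
    return f"{pad}<{node.component}{attrs}>\n{inner}\n{pad}</{node.component}>"


card = SchemaNode(
    component="div",
    class_name="item-card",
    style={"display": "flex", "flex-direction": "column"},
    children=[
        SchemaNode(component="span", class_name="item-title", text="Sample title"),
        SchemaNode(component="span", class_name="item-price", text="$199"),
    ],
)
html_output = render_html(card)
print(html_output)
```

Because every recognition step reads and writes the same tree, a different generator (for another DSL) could consume the same `SchemaNode` data without the recognition steps changing.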
Intelligent Identification Layer
In the entire D2C project, the core is the recognition capability part. The specific decomposition of this layer is as follows. The subsequent series of articles will focus on these subdivided layers.
- Material identification layer: Identifies the materials in the image through image recognition, including module recognition, atomic module recognition, basic component identification, and business component identification.
- Layer processing layer: Mainly separates the layers in the design document or image and combines them with the previous layer’s recognition results to sort out the layer meta information.
- Layer reprocessing layer: Further normalizes the data from the previous layers.
- Layout algorithm layer: Converts absolute positioning into relative positioning and Flex layout.
- Semantic layer: Uses the layers’ multi-dimensional features to give the generated code semantic expression.
- Field binding layer: Binds and maps the static data in the layers to the actual backend data.
- Business logic layer: Generates business logic code through business logic identification and expression.
- Output engine layer: Finally, outputs the code produced by each layer’s intelligent processing through various DSLs.
Technology layering of D2C identification ability
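As one example of the kind of normalization a layer in this stack might perform, the sketch below collapses unstyled groups that wrap exactly one child, a simple form of redundant node deletion. The `Node` type and the pruning rule are hypothetical and only illustrate the idea; they are not taken from imgcook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical layer-tree node; not the real imgcook data structure."""
    name: str
    style: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def prune_redundant(node: Node) -> Node:
    """Collapse unstyled groups that wrap exactly one child, working bottom-up."""
    node.children = [prune_redundant(child) for child in node.children]
    if len(node.children) == 1 and not node.style:
        return node.children[0]
    return node


def outline(node: Node, depth: int = 0) -> List[str]:
    """Flatten the tree into indented names for easy inspection."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


tree = Node("page", style={"width": "750px"}, children=[
    Node("group-1", children=[
        Node("group-2", children=[Node("title")]),
    ]),
    Node("footer", style={"height": "80px"}, children=[Node("link-a"), Node("link-b")]),
])
pruned = prune_redundant(tree)
result = outline(pruned)
print("\n".join(result))
```

Here the two nested wrapper groups around `title` disappear, while `footer` survives because it carries style and has two children.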
Of course, incomplete recognition and low recognition accuracy have always been a major topic of D2C, and it is also our core technical point. We try to analyze the factors that cause this problem from these perspectives:
1) Problem definition is inaccurate: An inaccurate problem definition is the primary cause of inaccurate model recognition. Many people think that samples and models are the main factors, but the problem definition may already be flawed before those come into play. We need to judge whether a model is suitable for the problem, and if so, how to define the rules clearly.
2) Lack of high-quality datasets: The intelligent recognition capability of each layer depends on different datasets. How many frontend development scenarios can our samples cover? What is the data quality in each scenario? Are the data standards uniform? Is the feature engineering unified? Are the samples ambiguous? Can the datasets be shared across models? These are the problems we are facing now.
3) Low model recall and misjudgment: We often pile up many different kinds of samples from different scenarios as training data, hoping to solve all identification problems with one model. However, this often leads to low recall for some of the model’s classes, and misjudgment also occurs for classes that are ambiguous.
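To see how a single model can look acceptable overall while failing on individual classes, it helps to compute recall, TP / (TP + FN), and precision, TP / (TP + FP), per class. The sketch below does this for a small set of made-up labels; the class names and predictions are illustrative only and are not measurements from imgcook.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_class_recall_precision(
    y_true: List[str], y_pred: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {label: (recall, precision)} computed from paired labels."""
    true_positive = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    actual = Counter(y_true)
    predicted = Counter(y_pred)
    result = {}
    for label in sorted(set(y_true) | set(y_pred)):
        recall = true_positive[label] / actual[label] if actual[label] else 0.0
        precision = true_positive[label] / predicted[label] if predicted[label] else 0.0
        result[label] = (recall, precision)
    return result


# Made-up labels for illustration only.
y_true = ["button", "button", "button", "button", "tab", "tab", "banner", "banner"]
y_pred = ["button", "button", "button", "button", "button", "tab", "banner", "tab"]
metrics = per_class_recall_precision(y_true, y_pred)
for label, (recall, precision) in metrics.items():
    print(f"{label}: recall={recall:.2f} precision={precision:.2f}")
```

In this toy data the overall accuracy is 6 out of 8, yet `tab` and `banner` each reach only 0.50 recall, which is the pattern described in point 3.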
At present, the computer vision models in deep learning are better suited to classification and object detection problems. To decide whether a deep model should be used for a recognition problem, we first ask whether we can judge and understand the problem ourselves, whether the problem is ambiguous, and so on. If we cannot judge it accurately ourselves, the recognition problem may not be appropriate for a deep model.
If the problem is judged suitable for deep learning classification, the next step is to define all the classes, which need to be rigorous, mutually exclusive, and completely enumerable. Take the semantic naming of images as an example: what are the common class names for common images? The analysis process is as follows:
- Step 1: Find as many relevant design documents as possible and enumerate the types of images.
- Step 2: Reasonably summarize and classify the types of images. This is where controversy arises most easily, and poor or ambiguous definitions will cause problems in the model.
- Step 3: Analyze the features of each type of image, whether these features are typical, and whether they are the core feature points, because they affect how well subsequent models generalize at inference time.
- Step 4: Check whether sample data for each type of image is available and, if not, whether it can be created automatically. If samples cannot be obtained, a model is not suitable, and hard rules can be used instead to see the effect first.
There are many such problems in D2C projects. The problem definition itself needs to be very accurate and scientifically grounded, which is challenging because there is no precedent for reference. We can only try based on known experience first and fix problems as user testing reveals them. This is a pain point that requires continuous iteration and improvement.
To improve sample quality, we need to establish standard specifications for these datasets, build multi-dimensional datasets for different scenarios, and process and provide the collected data in a uniform way. The aim is to establish a standardized data system.
We use Pipcook’s standard data format and provide a unified sample evaluation tool for different problems (classification and object detection) to evaluate each dataset’s quality. For some specific models, more effective feature engineering (normalization, edge amplification, etc.) can be adopted. In the future, we also expect samples of similar problems to be shared and compared across different models in order to evaluate the accuracy and efficiency of those models.
To address low recall and misjudgment, we try to converge scenarios. Samples from different scenarios often share similar features, or contain key features that interfere with local feature points, which causes misjudgment and a low recall rate. We expect that training models per converged scenario will improve accuracy. We converge on the following three scenarios: the wireless client-side marketing scenario, the mini-app scenario, and the PC scenario. Each scenario has its own characteristic patterns, so designing a separate recognition model for each can efficiently improve recognition accuracy within a single scenario.
Thoughts of the Process
Since a deep model is used, a practical problem is that the model cannot identify data whose features fall outside what it learned from the training samples, and its accuracy cannot fully satisfy users. Besides improving the samples, what can we do?
Throughout the D2C process, we follow a methodology for applying recognition models: design a set of protocols or rules that cover the cases where deep learning gives wrong results. This ensures that users can still get what they need when model recognition is inaccurate. The order of priority is: manual convention > rule policy > machine learning > deep learning. For example, to identify a loop in the design draft:
- First, the loop can be marked manually in the design document according to an agreed convention.
- Based on the layer’s context information, rules can judge whether it is a loop body.
- Machine learning on layer features can be used to try to optimize the rules.
- Positive and negative samples of loops can be generated for a deep learning model to learn from.
Among these, parsing the manually agreed conventions in the design document has the highest priority. This ensures that subsequent processes are not blocked or disturbed by recognition errors.
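The priority order above can be sketched as a chain of strategies in which each one either gives a verdict or abstains, and the first verdict wins. Everything in this sketch is an assumption made for illustration: the `#loop#` name tag, the equal-size and equal-gap rule, and the model placeholder that always abstains.

```python
from typing import Any, Callable, Dict, List, Optional

Layer = Dict[str, Any]  # hypothetical layer record exported from a design file


def by_manual_convention(layers: List[Layer]) -> Optional[bool]:
    """Highest priority: the designer marked the group, here with a '#loop#' name tag."""
    if any("#loop#" in str(layer.get("name", "")) for layer in layers):
        return True
    return None  # No convention present: defer to the next strategy.


def by_rule(layers: List[Layer]) -> Optional[bool]:
    """Rule: three or more siblings of equal size with equal horizontal gaps form a loop."""
    if len(layers) < 3:
        return None
    ordered = sorted(layers, key=lambda layer: layer["x"])
    same_size = len({(layer["width"], layer["height"]) for layer in ordered}) == 1
    gaps = {b["x"] - (a["x"] + a["width"]) for a, b in zip(ordered, ordered[1:])}
    return True if same_size and len(gaps) == 1 else None


def by_model(layers: List[Layer]) -> Optional[bool]:
    """Placeholder for a machine learning or deep learning prediction; always abstains here."""
    return None


STRATEGIES: List[Callable[[List[Layer]], Optional[bool]]] = [
    by_manual_convention,
    by_rule,
    by_model,
]


def is_loop(layers: List[Layer]) -> bool:
    """Ask each strategy in priority order; the first one with an opinion wins."""
    for strategy in STRATEGIES:
        verdict = strategy(layers)
        if verdict is not None:
            return verdict
    return False


cards = [{"name": f"card-{i}", "x": i * 110, "width": 100, "height": 80} for i in range(3)]
tagged = [{"name": "#loop# item", "x": 0, "width": 50, "height": 50}]
mixed = [
    {"name": "logo", "x": 0, "width": 40, "height": 40},
    {"name": "title", "x": 50, "width": 200, "height": 40},
]
print(is_loop(cards), is_loop(tagged), is_loop(mixed))  # True True False
```

Because the manual convention is checked first, a designer's explicit mark always overrides whatever the rules or models would have inferred.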
Business Landing: 2019 Double 11
After nearly two years of optimization, D2C was used for the first time across the full closed loop of marketing module development. This includes module creation, view code generation, logic code generation, writing supplementary logic code, and debugging.
For Double 11, D2C covered new modules of Tmall and Taobao across various scenarios, supporting 31 modules. About 79.34% of the code was generated by D2C, including automatically generated view code and some logic code. 98% of simple modules were generated automatically. The main reasons for manual changes to the code were new business logic, animations, field binding recognition errors, and loop recognition errors. These issues still need to be gradually improved.
Overall Landing Situation
As of 09 Nov 2019, the data is as follows:
- The number of modules is 12,681, and about 540 are newly added this week
- The number of users is 4,315, and about 150 new users are added every week
- Number of teams: 24
- Custom DSLs: 109
Currently, the services available are as follows:
- Restoration of design draft: Use the Sketch and Photoshop plugins to export the design information with one click to generate code.
- Restoration of Image: Allows you to upload images to restore and generate code directly.
The follow-up plans are as follows:
- Continue to lower the requirements on design documents by improving the intelligent identification accuracy of groups and loops and reducing the cost of manual intervention in design documents.
- Improve component identification accuracy. It is currently only 72%, so availability for business applications is low.
- Improve page-level and project-level restoration capabilities, which depend on the accuracy of page segmentation.
- Improve page-level restoration of mini-apps and PC programs, and improve overall restoration of complex forms, tables, and charts.
- Improve code generation from static images so that it can be used in the production environment.
- Improve algorithm engineering products and diversify the sample generation channels.
- Open source.
In the future, through the frontend co-construction project, we hope to use collective strength to make frontend intelligent technology solutions accessible to everyone and to accumulate more competitive samples and models, providing services with higher accuracy and availability. We hope to reduce simple, repetitive work and help web developers focus on more challenging work.
- GitHub Page: https://github.com/taofed/imgcook
- Homepage: https://www.imgcook.com/
- Pipcook (front-end machine learning framework): https://github.com/alibaba/pipcook