During the 18th TechDay of The Computing Conference, Qi Jun, CTO of Router Software Company Limited in Nanjing, gave a presentation titled "How Can Medium-sized and Small Enterprises Make Clever Use of Container Technology". He shared the company’s experiences, problems and lessons learned while using Alibaba Cloud container services and Docker. Because it focuses on how container technology affects production and overall productivity, the presentation is a useful reference for medium-sized and small enterprises.
The following is a summary of the points shared at the event.
First, let’s take a look at the first dilemma:
The product line contains too many overlapping sub-products, which leads to numerous duplicated basic services, such as e-mail and SMS services. But at the development stage, they cannot be split.
The second dilemma is O&M. At present, the company has a total of 40 servers and a large number of services, but only one full-time O&M engineer.
The third dilemma is the core issue: extremely low efficiency in delivering products and services, which seriously wastes human resources.
The fourth dilemma is the inability to manage load peaks and valleys effectively, so resources cannot be allocated reasonably.
The fifth dilemma is the company’s demanding requirements for reliability and security. Every day, hundreds of media organizations rely on our products for their publication and production. If we cannot improve business reliability, a single problem can mean the loss of an issue of a newspaper or an episode of a TV program in a region.
The above figure illustrates the company’s old architecture, in which every line is identical. A group of clients visit a public IP address, behind which there may be an ECS server. One or more applications may be deployed on that ECS server, and these applications connect to Alibaba Cloud database, cache or log services. Together, these constitute a minimum unit. At peak, the company had up to 60 minimum units involving nearly 80 ECS servers, which caused upgrade and maintenance problems. This approach had to be abandoned, because once code is released, you have to negotiate an upgrade time with each client. Clients deploy their systems independently, so it is impossible to agree on a uniform upgrade time for all of them.
In the current architecture, users access the system over the internet, first through the SLB (Server Load Balancer). The SLB can be understood as the public entry point: users reach the back-end servers through the public IP addresses that the SLB provides. Behind the SLB is a VPC network that houses around 20–30 ECS servers, which make up the four major clusters currently in use. After a container cluster is created, it needs several container instances in order to run.
Upon receiving a request, the container cluster forwards it to the right application according to the requested domain name and port number. After the request reaches the application, the containers of its various services connect to Alibaba Cloud databases or self-built databases.
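The forwarding rule described above can be pictured as a lookup keyed on domain name and port. The following is only a minimal sketch of that idea, not how the Alibaba Cloud container service implements routing; the domain names, ports and application names are hypothetical.

```python
from typing import Optional

# Hypothetical routing table: (domain name, port) -> application name.
ROUTES = {
    ("api.example.com", 443): "api-service",
    ("www.example.com", 80): "web-frontend",
    ("www.example.com", 443): "web-frontend",
}


def route_request(domain: str, port: int) -> Optional[str]:
    """Return the application that should receive the request, or None if no rule matches."""
    # Domain names are case-insensitive, so normalize before the lookup.
    return ROUTES.get((domain.lower(), port))


if __name__ == "__main__":
    print(route_request("api.example.com", 443))
    print(route_request("unknown.example.com", 80))
```

Keying on the pair rather than on the domain alone lets one domain expose different applications on different ports.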
Reliability, cost and agility are three core issues. Agility refers to agile development and deployment; cost is a shared concern of the boss and the CTO; and reliability secures our survival.
Container Technology & Agile Development
Container technology and agile development now face three problems:
First, unifying the development, testing and production environments. Differences in versions, applications and operating environments can cause a variety of problems, usually unexpected ones. Meanwhile, new employees have to set up development, testing and operating environments over and over, which lowers efficiency.
For this issue, our solution is to package all the basic environments as images in the image repository. New employees only need to download the image and put their own code into it during local debugging and development. Although the image is read-only during this process, it can still be connected to data on external disks. In addition, code developed and debugged this way will not run into environment problems, because all the images are consistent and independent of the host operating system.
Second, how can we keep developing applications that were built at different stages? This is a pain point for the development teams of many medium-sized and small enterprises. New employees are reluctant to maintain and fix issues in old code.
For such problems, we can break the application down further: several parts of a big application can be separated into small applications, which cuts maintenance costs. During continued development, these small applications do not need to be rewritten, and you can focus on the necessary tasks.
Third, code contamination and manual packaging errors urgently need solving. When a company has many small applications, packaging problems occur frequently.
Currently, we use Git auto build to solve the problem, because it pulls the code directly from Git. When we push a branch, an image is created automatically, so code contamination and manual packaging errors are unlikely to happen.
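One small piece of such a push-to-image pipeline is deciding what the automatically built image should be called. The sketch below assumes a naming scheme of branch name plus short commit hash; that scheme, the function name and the registry path are illustrative assumptions, not something described in the talk.

```python
import re


def image_tag_for_branch(repo: str, branch: str, commit: str) -> str:
    """Derive an image reference from the pushed branch and commit (hypothetical scheme)."""
    # Docker tags may contain only letters, digits, '_', '.' and '-',
    # so characters such as '/' in branch names are replaced with '-'.
    safe_branch = re.sub(r"[^A-Za-z0-9_.-]", "-", branch)
    # A short commit prefix ties the image back to the exact code it was built from.
    return f"{repo}:{safe_branch}-{commit[:7]}"


if __name__ == "__main__":
    print(image_tag_for_branch("registry.example.com/team/app", "feature/login", "a1b2c3d4e5f6"))
```

Because the tag is computed from what was pushed rather than typed by hand, two builds of the same commit always get the same name.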
Container Technology & Business Cost
Next, let’s take a look at the relationship between container technology and business costs. At present, there are several cost problems as follows:
- Long downtime during updates is costly and greatly impairs user experience. Previously, we could only update at 3 o’clock in the morning, and if any problems occurred, we had to solve them before 8 o’clock. This increased the cost considerably; sometimes a code update took a whole night, not to mention the extra effort when a rollback was required. For this issue, the blue-green release mentioned by the previous presenter is a good solution.
- Updating multiple load-balanced servers simultaneously is hard to achieve, and the cost of a rollback in case of errors is very high. Blue-green release can solve this problem. In our company, we adopt Alibaba Cloud containers that can accommodate minuses. Until the containers are fully updated, users cannot access the content in the new image.
- Fine-grained, business-level elasticity is not available, and server-level elasticity is far from enough. Among our 40 servers, utilization reaches 80% or above for probably only one third of their service time; during the remaining two thirds, resources sit idle and are wasted. With container-level elasticity, we can start more containers for web services or API services to achieve better load distribution and execution efficiency. In such circumstances, more physical servers can provide such reliable computing resources and the performance is far better than imagined, because in normal cases no server stays fully loaded all the time.
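The blue-green release mentioned in the first two points can be sketched as a small state machine: the new ("green") image is prepared alongside the old ("blue") one, and traffic switches only after a health check passes. This is a simplified illustration under assumed names (`Release`, `deploy_green`, `switch`), not the Alibaba Cloud container service API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Release:
    blue: str            # image currently serving users
    green: str = ""      # new image being prepared
    live: str = "blue"   # which side receives user traffic

    def deploy_green(self, image: str) -> None:
        """Start the new version next to the old one; users still see blue."""
        self.green = image

    def switch(self, healthy: Callable[[str], bool]) -> bool:
        """Send traffic to green only if it exists and passes the health check."""
        if self.green and healthy(self.green):
            self.live = "green"
            return True
        return False

    def rollback(self) -> None:
        """Rolling back is just pointing traffic at blue again."""
        self.live = "blue"

    def serving_image(self) -> str:
        return self.blue if self.live == "blue" else self.green


if __name__ == "__main__":
    release = Release(blue="app:1.0")
    release.deploy_green("app:1.1")
    print(release.serving_image())      # still the old image
    release.switch(lambda image: True)  # hypothetical health check that always passes
    print(release.serving_image())
```

Since the old version keeps running until the switch, neither the update nor a rollback needs a downtime window.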
Container Technology & Reliability
Now let’s talk about container technology and reliability. One truth about reliability is that full failover on the cloud may not solve a problem instantly, but everyone needs it. On the cloud, many of us overlook hot backup, thinking hardware faults are unlikely to happen there. In fact, the opposite is true. Alibaba’s container service can be configured through parameters so that when a cluster fails, you can move its containers to another cluster. There is still no perfect solution for domain name or port configuration, and manual adjustments are required, but this is great progress.
Another issue is data synchronization and sharing in multi-server load-balancing scenarios. The attachments of some old services cannot be separated because of tight coupling, or concentrated reads or writes may exist. In such cases, shared storage should be used. At present, we solve this problem with the two solutions provided by the container service: the OSS data volume and the NAS data volume. Both allow multiple containers to access the same file data source and support real-time concentrated reads and writes.
The traps we once fell into
Now I want to summarize the lessons I have learned over the years, in the hope that they can be of some help to you:
First, containers without decoupling are difficult to use. The reason is plain: if a container holds a bunch of closely related applications, then once the container goes wrong, all of those applications fail and the whole business system may even collapse. For this reason alone, decoupling applications is very important.
Second, microservices. Microservices are booming at present, but an architecture does not become amazing just because it has been split into microservices. Microservices have their own merits, but that does not mean all businesses should adopt them. Only universal, repeated and reusable applications are suitable for splitting into microservices.
Third, the bigger the project, the harder it is to use a container architecture. The great majority of our projects are currently hosted in container services, but not all of them. Some products are too big to be placed in a container service. This involves internal management as well as the product or business scenario and user requests. The bigger the project, the wider its scope, and the harder it is to impose a sweeping approach on it, that is, to place the whole project into a container cluster.
Fourth, reliability. Do not count on Docker alone to solve all reliability issues. Docker and container technology, in my opinion, are both a kind of architecture rather than a tool. Reliability depends on many elements, from the network environment and the overall architecture to the quality of the developers. These elements are beyond the control of Docker.
Finally, container technology is only a kind of architecture. Running Docker on a single server is an experiment. Only when it drives the operation of a cluster does it show its full power.
The first scenario is an efficient API cluster. Sometimes a company has some APIs for external use and others that are not, and the same API may be called both by the company’s app over the public network and by internal services. Calls from outside through the external domain name pose no problem. But when an internal service calls the API through a domain name, DNS resolution is required and public network traffic may even be consumed, although the calling server on the intranet may in fact be very close to the API server.
In such circumstances, we can solve the problem with the following model: when an internal server initiates the call, use the intranet SLB (Alibaba’s intranet SLB is free of charge); when an external server initiates the call, use the internet SLB (traffic charges only, if I recall it right). The two container clusters (Alibaba’s container services) connected to the API are each configured accordingly. One advantage is that when the service is heavily used, you can choose the path freely: on-cloud calls go through the intranet SLB and external calls go through the internet SLB, which gives the most efficient and reliable access. Response time over the internet may be around 10 milliseconds, while over the intranet it may be around 1 to 2 milliseconds.
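On the caller side, the choice between the two SLBs can be reduced to one decision. The sketch below assumes that callers inside the VPC have private IP addresses and that each SLB has its own endpoint; both endpoint URLs are hypothetical.

```python
import ipaddress

# Hypothetical endpoints for the two load balancers in front of the same API.
INTRANET_SLB = "http://api.internal.example.com"
INTERNET_SLB = "https://api.example.com"


def pick_endpoint(caller_ip: str) -> str:
    """Use the intranet SLB for callers with private addresses, the internet SLB otherwise."""
    ip = ipaddress.ip_address(caller_ip)
    # Assumption: a private address means the caller runs inside the cloud network.
    return INTRANET_SLB if ip.is_private else INTERNET_SLB


if __name__ == "__main__":
    print(pick_endpoint("10.0.3.17"))
    print(pick_endpoint("8.8.8.8"))
```

In practice the same effect is often achieved through configuration, by giving internal services the intranet address directly, rather than by inspecting addresses at call time.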
The second scenario is fast delivery of applications and services. Our company focuses on SaaS services as well as applications and software. Most of the work is development support and customer service, and some deployment is also done by developers. We solve the problem with the following model. After the product image is built, we submit it to the image repository. O&M personnel then enter the container service, create a cluster, and create an application from the orchestration template. After the application is created, its service configuration is modified. Because a container does not run in the same way as the code we are used to, some configuration has to change. Our current solution is to put all configuration in the container’s environment variables, which is also the mainstream approach. We then modify the service configuration and environment variables and restart the service. After the restart, the container reads the environment variables as its configuration and runs successfully.
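Reading configuration from environment variables at container start-up might look like the following minimal sketch. The variable names (`DB_HOST`, `DB_PORT`, `DEBUG`) and the default values are hypothetical and stand in for whatever the service actually needs.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ServiceConfig:
    db_host: str
    db_port: int
    debug: bool


def load_config(env: Mapping[str, str] = os.environ) -> ServiceConfig:
    """Build the service configuration from environment variables."""
    return ServiceConfig(
        db_host=env["DB_HOST"],                              # required: raises KeyError if missing
        db_port=int(env.get("DB_PORT", "3306")),             # optional, with an assumed default
        debug=env.get("DEBUG", "false").lower() == "true",   # optional flag
    )


if __name__ == "__main__":
    # Passing a plain dict instead of os.environ makes the loader easy to try out.
    print(load_config({"DB_HOST": "db.internal.example.com", "DB_PORT": "3307"}))
```

Failing fast on a missing required variable means a misconfigured container stops at start-up instead of running with wrong settings.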
Simply put, over the past year we went from pain and the verge of giving up to the first light of dawn, with bumps along the way, and we thank all our media clients and Alibaba for their constant support.