Fault Tolerance with Application High Availability or Batch Compute


By Afzaal Ahmad Zeeshan, Alibaba Cloud Community Blog author and Alibaba Cloud MVP.

Background and Introduction

The dynamism and strength of the cloud and its services has been one of the most prominent topics of this decade. However, the ease of dynamic cloud environments comes with huge challenges. The cloud is prone to faults and errors: different services behave differently and can generate a series of unexpected errors. This is where a system's ability to handle errors gracefully comes in, and the real challenge is to keep that handling effective in the long run.

For components to keep working and producing their expected results, an appropriate implementation of fault tolerance is of paramount importance. Responsiveness, scalability, and resilience, the core features of cloud computing, all demand efficient handling and insightful planning to foresee the possible pitfalls that a system (either hardware or software) can run into at any point in time.

In this article we will talk about two contrasting areas where fault tolerance has to be handled differently: applications that require high availability because users interact with them all the time, and batch computing, which is an operations-specific domain. It is comparatively more challenging to make user-facing or consumer-facing applications fault tolerant. They are interactive and are expected to be available every minute, because a growing business cannot afford downtime and needs high throughput with highly competitive response times for its mission-critical applications.

Batch computing, on the other hand, is used mostly for data analysis or report generation. It is most commonly related to background job scheduling and internal operations, with no front-tier end-user interaction. This makes it a less critical area in terms of enduring, predicting, and detecting faults and unexpected behavior, because customers are not waiting for a response as quickly as possible. However, highly capable systems and planning are still required: batch compute works with very large datasets, terabytes and petabytes of data, along with their analysis and modeling, so a failure in this analysis can cause the loss of useful insights.

If you want to explore the options available on Alibaba Cloud that can help you provide better performance for your end users, as well as manage how data is processed in batches for your warehousing or analysis workloads, this article is for you.

Fault Tolerance — Insights & Benchmark

There is a wide range of recommended solutions for dealing with the anomalies your system can face, depending on the nature of the service you are using: a consumer-oriented application, hardware such as servers and network components, or any other standalone external service.

As far as existing fault tolerance techniques are concerned, there are some major parameters to consider, such as scalability, reliability, response time, security, and performance. For scalability, fault tolerance means that increasing the number of nodes, the resources, and the load should not affect the resilience of your overall application. For reliability, fault tolerance ensures that the application always generates expected and accurate results, even under high pressure and during peak hours. Throughput is the number of task executions that are completed in a given period, and response time is the time your system or algorithm takes to respond to a query.

Cloud computing therefore exhibits highly dynamic behavior, which in turn can produce many unexpected faults and failures. For robust performance and the expected functioning of the system in any of the aforementioned areas, these unexpected events should be handled adequately and effectively.

Fault Tolerance in Highly Available Applications

There are multiple approaches to ensuring application high availability. The most common is the reactive technique, which means that we solve a problem once it has occurred. These measures include restarting a service or node, replication, and switching from one machine to another in the case of a crash or downtime. The other widely followed technique is the proactive one, which includes proper checks and balances, preemptive migration, and the detection of faulty components and their replacement with working ones. There are also failure detectors, which keep evaluating performance and producing reports so that the team can take further action based on the results.
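
The reactive idea of switching from one machine to another can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `call_with_failover`, `flaky_service`, and the node names are hypothetical and are not part of any Alibaba Cloud API.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[str],
                       request: Callable[[str], str]) -> str:
    """Try each replica in order and return the first successful response."""
    errors: List[str] = []
    for node in replicas:
        try:
            return request(node)
        except ConnectionError as exc:
            # Reactive step: the fault has already occurred, so move on
            # to the next replica instead of failing the whole request.
            errors.append(f"{node}: {exc}")
    raise AllReplicasFailed("; ".join(errors))


def flaky_service(node: str) -> str:
    """Simulated backend (hypothetical): node-a is down, the others work."""
    if node == "node-a":
        raise ConnectionError("node is down")
    return f"served by {node}"


result = call_with_failover(["node-a", "node-b", "node-c"], flaky_service)
print(result)  # served by node-b
```

A proactive setup would go one step further and remove `node-a` from the list before any user request reaches it.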

However, most of the time these methodologies require a highly skilled team, extra resources, and extra cost. And if these measures are not taken as required, things can go badly wrong and result in a huge loss. This is where the Alibaba Cloud Application High Availability Service (AHAS) comes in. The service is based on a widely adopted cloud model, Software as a Service (SaaS). It ensures your application's high availability by implementing effective fault tolerance strategies, such as architecture discovery and controlling and allocating resources for high traffic and load distribution. It enables your application to recover quickly from a breakdown in a cost-efficient way. Since this is a SaaS-based model, you only pay for the service and features you use, and everything is managed automatically by Alibaba Cloud.

Application High Availability Features

AHAS is built from a set of function modules that provide stable and highly competitive features to make an application actively fault tolerant. These include, but are not limited to, controlling traffic for load balancing and forwarding data packets to a specific web app instance that provides the service.

Architecture Discovery and Detection

The topology function module of AHAS provides automatic architecture-level detection. Users mostly work in different environments, including development, testing, and production, so AHAS provides a mechanism that automatically detects the application topology in order to apply recovery and tolerance solutions appropriate to each one. It also displays all the required and in-use dependencies in a visual form so that their performance can be monitored continuously. The topology is generated as a graph in the portal, which shows how users are using the services and which services forward users on to which other services.

Beyond this, AHAS uses artificial intelligence modules and models to detect all the other dependent and third-party services or resources. This helps AHAS control the traffic flow and maintain load balancing as well as high availability, since it knows when a node has gone down.

Availability Assessment

AHAS assesses high availability capabilities using its comprehensive assessment function module. The assessment uses the architectural information gathered from the topology analysis to suggest test scenarios that evaluate suspected risks and any failure points in a component. This can be done either via polling or by having the component send a response to the AHAS service at specific intervals.
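
The second option, reporting in at specific intervals, is commonly called a heartbeat. The sketch below shows the general idea only; it is not how AHAS is implemented, and `HeartbeatMonitor`, the node names, and the timeout value are all assumptions made for illustration. Timestamps are passed in explicitly so the example is deterministic.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Flags a node as suspect when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def suspects(self, now: float) -> List[str]:
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)


monitor = HeartbeatMonitor(timeout=5.0)
monitor.heartbeat("web-1", now=0.0)
monitor.heartbeat("web-2", now=0.0)
monitor.heartbeat("web-1", now=4.0)  # web-1 reports again, web-2 stays silent
suspected = monitor.suspects(now=6.0)
print(suspected)  # ['web-2']
```

With polling, the direction is reversed: the monitor would call each node on a schedule instead of waiting for the node to report.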

It maps the information it gets from the topology function module to the recommended settings of every component. It then suggests and runs different test scenarios and generates a report on them.

Availability Protection

To ensure high availability and protect the user experience, AHAS provides different ways to mitigate the risk of any resource failing. It uses AppGuard to provide the expected experience to every user by restricting the traffic flow to down or faulty nodes and redirecting it to highly available nodes.

Furthermore, handling traffic control and degradation is of paramount importance for any application. AppGuard provides these two capabilities: it controls the inbound traffic and degrades failed dependencies to protect the functioning of the other components that depend on them. AppGuard supports Java frameworks and tracks multiple fine-grained metrics such as QPS, number of threads, throughput, and response time. See the documentation for details on this service, as it is too broad a topic to cover here.
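
To make the two terms concrete, here is a minimal sketch of traffic control (a fixed-window QPS limit) and degradation (falling back when a dependency fails). It does not use the AppGuard API; `QpsLimiter`, `with_degradation`, and the simulated `recommendations` dependency are hypothetical names chosen for this example.

```python
from typing import Callable, List


class QpsLimiter:
    """Fixed-window limiter: at most `max_qps` requests per one-second window."""

    def __init__(self, max_qps: int) -> None:
        self.max_qps = max_qps
        self.window = -1  # index of the current one-second window
        self.count = 0    # requests admitted in that window

    def allow(self, now: float) -> bool:
        window = int(now)
        if window != self.window:
            # A new second has started, so reset the counter.
            self.window, self.count = window, 0
        if self.count < self.max_qps:
            self.count += 1
            return True
        return False


def with_degradation(primary: Callable[[], str],
                     fallback: Callable[[], str]) -> str:
    """Serve a degraded result when the primary dependency fails."""
    try:
        return primary()
    except Exception:
        return fallback()


def recommendations() -> str:
    """Simulated failing dependency (hypothetical)."""
    raise TimeoutError("recommendation service timed out")


limiter = QpsLimiter(max_qps=2)
decisions: List[bool] = [limiter.allow(now=t) for t in (0.1, 0.2, 0.3, 1.0)]
print(decisions)  # [True, True, False, True]

page = with_degradation(recommendations, lambda: "default recommendations")
print(page)  # default recommendations
```

The third request is rejected because the limit for that second is used up, and the page still renders with default content even though the dependency timed out.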

Compatibility Ease

All kinds of traditional and modern applications, whether monolithic, distributed, or microservice-based, can easily use AHAS to enable high availability and fault tolerance. Since this is a SaaS option, it only requires a couple of clicks on the portal, and it is up and ready to maintain the high availability of your service.

Fault Tolerance in Batch Compute

Now that we have discussed how to provide a smoother experience to your end users, let us study how to improve and maintain the services that your data warehouses use. Batch compute makes heavy processing possible by launching different processes in parallel pipelines. The datasets are huge, containing terabytes and petabytes of model data, for example for predictive analytics and analytical computing in data warehouses, and they are not aimed at real-time processing.

Due to the criticality of the system and its load, the possibility of failure is high, but the impact of a failure is comparatively small and controllable, because the Alibaba Cloud Batch Compute service is mainly used for internal operations and heavy job scheduling. Since batch computing is not a consumer or end-user service, it is the operations teams who are mainly concerned with its handling and configuration.

Apart from the configuration, the only consideration with batch jobs is valid output. Operations teams usually understand that a batch service takes some time, so they can either scale the resources up or wait, but what does not change is that they require a valid report to be generated. If your data gets lost midway, or a job fails and invalidates your data, you have no option other than to restart from the beginning.

Clusters with hundreds of machines are used for practical and scalable deployment of streaming frameworks at companies such as Alibaba, for managing and maintaining its e-commerce solutions, as well as at other organizations like Google and Yahoo. However, as the number of machines in a cluster grows, so does the likelihood that a single machine inside the cluster fails at any given moment. Machine failures include node failures, which are mostly caused by the memory overhead of the data, as well as network failures, software bugs (in provisioned or proprietary software), and resource limitations. Moreover, streaming applications run for an indefinite period, which increases the chance of disruption due to unexpected failures or planned reboots. Failures in distributed stream processing engines (DSPEs) may cause an application to block or produce erroneous results. Therefore, along with scalability and low latency, many critical streaming applications require DSPEs to be highly available.

Batch Compute — Considerable Techniques for Fault Tolerance

Fault tolerance in batch computing is an extensively discussed area. Hundreds of clusters and machines are involved in batch processing, and the number of nodes keeps growing with time, so the chance of a failure in some node of the cluster increases. Moreover, these scheduled jobs and modeling tasks keep running for very long periods, which exhausts the system and its resources. This can result in all sorts of unexpected behavior: master or slave machines going down, system crashes, interruptions in connectivity services, network failures, software or hardware failures, and so on. Hence a variety of solutions, shaped by experience with these anomalies, are used when working with batch computing.

Normally, each time a batch job fails, it would either lose data or corrupt the overall report. Batch Compute takes care of this by automatically re-provisioning an instance that performs the job again. Since the job knows the batch of data it was given, Batch Compute can maintain the overall availability and integrity of the reports that are generated.
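
The re-run behavior can be illustrated with a small retry wrapper. This is a sketch of the general pattern, not of Batch Compute internals; `run_with_retries`, `sum_job`, and the simulated crash are assumptions made for the example. The key point is that the job receives the same batch on every attempt, so a successful retry produces the same report.

```python
from typing import Callable, Dict, List, Optional


def run_with_retries(job: Callable[[List[int]], int],
                     batch: List[int],
                     max_attempts: int = 3) -> int:
    """Re-run `job` on the same batch until it succeeds or attempts run out."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return job(batch)
        except RuntimeError as exc:
            # The instance failed; a fresh attempt gets the same input batch.
            last_error = exc
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error


attempts: Dict[str, int] = {"count": 0}


def sum_job(batch: List[int]) -> int:
    """Simulated job (hypothetical): crashes on the first attempt only."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("instance crashed")
    return sum(batch)


total = run_with_retries(sum_job, [1, 2, 3, 4])
print(total, attempts["count"])  # 10 2
```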

Checkpointing and Pre-Build Options

The passive technique for detecting and enduring faults during batch computing ensures that all pre-checks and pre-build checkpoints are met prior to job execution, such as testing connectivity, access rights, and the availability of the required resources. Moreover, checkpointing can be handled asynchronously to avoid overhead. Batch Compute offers quick recovery and supports distributed job scheduling and proactive checkpointing to reduce the chance of unexpected errors and data loss.
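
Checkpointing is what removes the need to restart a failed job from the beginning. The following sketch saves progress to a local JSON file after every record and resumes from it on the next run. It is deliberately simplified: the checkpoint is written synchronously rather than asynchronously, and `process`, the `fail_at` parameter, and the file layout are hypothetical.

```python
import json
import os
import tempfile
from typing import List


def process(records: List[int], checkpoint_path: str, fail_at: int = -1) -> int:
    """Sum the records, saving progress so that a restart can resume."""
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the last saved position
    for i in range(state["next_index"], len(records)):
        if i == fail_at:
            raise RuntimeError(f"simulated crash at record {i}")
        state["total"] += records[i]
        state["next_index"] = i + 1
        # Write to a temporary file first, then rename, so a crash during
        # the write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


records = [10, 20, 30, 40, 50]
path = os.path.join(tempfile.mkdtemp(), "job.ckpt")

try:
    process(records, path, fail_at=3)  # first run crashes at the fourth record
except RuntimeError:
    pass

with open(path) as f:
    saved = json.load(f)

total = process(records, path)  # second run resumes from the checkpoint
print(saved["next_index"], total)  # 3 150
```

The second run only processes the last two records, yet the final total covers all five.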

Dynamic Job Scheduling

Batch Compute dynamically creates new VM and node instances as required. Launching and enabling resources with one click or via an API call makes recovery reliable and near-instant.
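
As a rough illustration of dynamic scheduling, the sketch below assigns tasks to a pool of workers and, when a worker crashes, provisions a replacement and puts the task back in the queue. Everything here is simulated in memory: `schedule`, the `vm-N` names, and the `crashing` set are hypothetical, and no real Batch Compute API call is made.

```python
from collections import deque
from itertools import count
from typing import Deque, Dict, List, Set


def schedule(tasks: List[str], pool_size: int, crashing: Set[str]) -> Dict[str, str]:
    """Assign tasks round-robin; replace a crashed worker and requeue its task."""
    ids = count(1)
    pool: Deque[str] = deque(f"vm-{next(ids)}" for _ in range(pool_size))
    queue: Deque[str] = deque(tasks)
    done: Dict[str, str] = {}
    while queue:
        task = queue.popleft()
        worker = pool.popleft()
        if worker in crashing:
            # The worker is lost: provision a fresh instance and retry the task.
            pool.append(f"vm-{next(ids)}")
            queue.appendleft(task)
            continue
        done[task] = worker
        pool.append(worker)
    return done


placement = schedule(["t1", "t2", "t3"], pool_size=2, crashing={"vm-2"})
print(placement)  # {'t1': 'vm-1', 't2': 'vm-1', 't3': 'vm-3'}
```

Task `t2` was first handed to `vm-2`, which crashed, so it was retried on `vm-1` while the newly provisioned `vm-3` picked up `t3`.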

Distributed Data Caching & Backup

Batch Compute supports a distributed model for data caching and backup. It runs a large number of simultaneous nodes and instances to reduce overall downtime and to provide a reliable reference point when any machine crashes, so that log files are not lost.
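
The reason replication keeps log files safe can be shown with a toy in-memory cache that stores every entry on two nodes. This is an assumption-laden sketch, not the Batch Compute storage model: `ReplicatedCache`, the node names, and the CRC-based placement are all chosen for illustration.

```python
import zlib
from typing import Dict, List, Set


class ReplicatedCache:
    """Keeps every entry on `replicas` different nodes (replicas <= node count)."""

    def __init__(self, node_names: List[str], replicas: int = 2) -> None:
        self.nodes: Dict[str, Dict[str, str]] = {n: {} for n in sorted(node_names)}
        self.replicas = replicas
        self.down: Set[str] = set()

    def owners(self, key: str) -> List[str]:
        """Pick consecutive nodes, starting from a stable hash of the key."""
        names = list(self.nodes)
        start = zlib.crc32(key.encode("utf-8")) % len(names)
        return [names[(start + i) % len(names)] for i in range(self.replicas)]

    def put(self, key: str, value: str) -> None:
        for node in self.owners(key):
            self.nodes[node][key] = value

    def get(self, key: str) -> str:
        for node in self.owners(key):
            if node not in self.down:
                return self.nodes[node][key]
        raise KeyError(key)


cache = ReplicatedCache(["node-a", "node-b", "node-c"], replicas=2)
cache.put("job-42.log", "step 7 finished")

primary = cache.owners("job-42.log")[0]
cache.down.add(primary)  # simulate a crash of the first owner

recovered = cache.get("job-42.log")
print(recovered)  # step 7 finished
```

The entry survives because the second owner still holds a copy; it would only be lost if both owners were down at the same time.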
