By Alwyn Botha, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud’s incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

This tutorial describes how to run 3 types of Kubernetes jobs:

  • Jobs using one Pod that runs once
  • Parallel Jobs with a fixed completion count
  • Parallel Jobs with a work queue

Other Pods run continuously and never complete (for example, a web server or a database).

All 3 types of Jobs above have a fixed (batch) piece of work to do. Once they finish it, their status becomes Completed.

1) Simple Job Example

Create the following as your simple job example YAML spec file.
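A minimal spec along these lines does the trick (the myjob name and the alpine image are the ones used throughout this tutorial; the container name is just my choice):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: myjob
    spec:
      template:
        spec:
          containers:
          - name: myjob
            image: alpine
            # the whole "job" is simply to sleep for 10 seconds
            command: ["sh", "-c", "sleep 10"]
          restartPolicy: Never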

Note the kind is Job. The rest of the spec is the same as that of a normal Pod.

A job does its work using Pods.

Pods do their work using their containers.

Containers work using Docker images. Our example above uses the Alpine Linux Docker image to do a job: sleep 10 seconds.

Even this simple example will teach us about Kubernetes jobs.

Create the Job
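Assuming you saved the spec above as myjob.yaml:

    kubectl create -f myjob.yaml

We will reuse this command for the later examples as well, after updating the spec file.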

Let’s list all Pods by running kubectl get pods several times.

We see this Pod ran for around 10 seconds. Then its status turns to Completed.

In the READY column we see that after 10 seconds the Pod is no longer ready. It is complete.

Delete Job

We just used kubectl get pods to monitor the progress of a job.

This example will use kubectl get job to monitor progress.

Repeatedly run kubectl get job

Only the last line shows: COMPLETIONS 1/1.

Frankly, this output is not very informative. Later, with more complex jobs, it will reveal its value.

With experience you will learn when to use which:

  • kubectl get pods
  • kubectl get job

Demo complete, delete …

2) Job backoffLimit: 2

If your job has an unrecoverable error you may want to prevent it from continuously trying to run.

backoffLimit specifies the number of retries before a Job is marked as failed. The default is 6.

We set it to 2 in the example so that we can see it in action within seconds.
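A sketch of such a spec (the same myjob/alpine setup as before, with backoffLimit: 2 added and the container forced to fail):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: myjob
    spec:
      backoffLimit: 2
      template:
        spec:
          containers:
          - name: myjob
            image: alpine
            # fail immediately with a non-zero exit code
            command: ["sh", "-c", "exit 1"]
          restartPolicy: Never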

Note the exit 1 error exit code. This container will start up and exit immediately with an error condition.

Create the Job

Monitor job progress several times:

backoffLimit: 2 … the job stops creating containers when the third error occurs.

Describe details about our job (only relevant fields shown):

Most informative lines are:

We have not discussed Parallelism: 1 and Completions: 1 yet. They simply mean that one Pod runs at a time and one Pod must complete for the job to be considered complete.

Determine a suitable backoffLimit for each of your production batch jobs.

Delete Job

3) Job Completions: 4

Job completions specifies how many Pods must complete successfully for the job to be considered complete.

Our job Pod below has work: sleep 3 seconds

4 Pods must complete … completions: 4
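A sketch of the spec; only completions: 4 and the sleep 3 command differ from the simple example:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: myjob
    spec:
      completions: 4
      template:
        spec:
          containers:
          - name: myjob
            image: alpine
            # each Pod's work: sleep 3 seconds
            command: ["sh", "-c", "sleep 3"]
          restartPolicy: Never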

Create the Job

This time follow progress using kubectl get jobs myjob repeatedly.

We can clearly see the 4 Pods each sleeping 3 seconds successfully, one by one, not all simultaneously.

Delete Job

Same job but another demo: this time we monitor progress using kubectl get po

Create the Job

Monitor progress:

Note that only 1 Pod runs at a time (the default parallelism is one).

Delete Job

4) Job Parallelism: 2

I am running this on a 4-core server.

If you are running these exercises on a server with at least 2 cores, the following demo will work.

parallelism: 2 below specifies we want to run 2 Pods simultaneously.
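A sketch of the spec, keeping the same sleep 3 workload as the previous example:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: myjob
    spec:
      completions: 4
      parallelism: 2
      template:
        spec:
          containers:
          - name: myjob
            image: alpine
            command: ["sh", "-c", "sleep 3"]
          restartPolicy: Never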

Create the Job

Monitor progress.

We needed completions: 4. Therefore 2 Pods were run in parallel, twice, to get the completions done.

Next we monitor progress using kubectl get jobs myjob

Delete Job

Same job created again.

Monitor:

As expected, COMPLETIONS are done in multiples of 2.

5) Job Parallelism: 4

If you are using a server with at least 4 CPU cores, you can run this example:

We need 4 completions — all 4 run in parallel.
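Relative to the spec just above, only the parallelism value changes; roughly:

    spec:
      completions: 4
      parallelism: 4
      # template: the same sleep 3 alpine container as before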

Create the Job

Monitor:

As expected 4 Pods running simultaneously.

Not shown — kubectl get jobs myjob … what do you expect the output to look like?

Delete Job

6) Job activeDeadlineSeconds

activeDeadlineSeconds specifies the total runtime for the Job as a whole.

From https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#job-termination-and-cleanup

Once a Job reaches activeDeadlineSeconds, all of its Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.

This exercise will demo this DeadlineExceeded condition.

Note the last line below: an absurdly low activeDeadlineSeconds: 3.
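A sketch of such a spec. The completions and parallelism of 4 match the 4 failed Pods we will see below; the sleep 10 is simply an assumed workload long enough to overrun the 3 second deadline:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: myjob
    spec:
      completions: 4
      parallelism: 4
      template:
        spec:
          containers:
          - name: myjob
            image: alpine
            command: ["sh", "-c", "sleep 10"]
          restartPolicy: Never
      activeDeadlineSeconds: 3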

Create the Job

Repeatedly monitor progress:

Note that after 3 seconds the Pods no longer exist. We will see why below.

Check job status:

Disappointingly, there is NO indication that we have a problem job.

Describe details about job:

Much more helpful: 0 Succeeded / 4 Failed

Also informative: the Warning event with reason DeadlineExceeded: Job was active longer than specified deadline.

Now we see why our Pods disappeared … they were deleted.

I do not understand the logic:

  • when Pods complete successfully they continue to exist so you can inspect the SUCCESS job logs
  • when Pods complete UNsuccessfully they get deleted so you CANNOT inspect the FAILED job logs

Let’s attempt to investigate the failed logs:
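For example (the Pod name suffix here is hypothetical; there is nothing left to fetch since the Pods are gone):

    kubectl get pods
    # no Pods listed - they were deleted when the deadline hit

    kubectl logs myjob-abc12
    # fails: the Pod object no longer exists, so neither do its logs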

NOTE: use activeDeadlineSeconds only when you have successfully resolved this missing logs issue.

Delete Job

7) Parallel Jobs with a Work Queue — Setup

The Kubernetes documentation contains 2 complex examples of parallel jobs.

More than 80% of those guides focus on setting up the work queue functionality.

This tutorial focuses on learning about parallel jobs and queues using my very simple bash implementation.

We need a directory for our bash work queue functionality:
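For example, on the node (the same /usr/share/jobdemo path that the jobscript and the Persistent Volume below use):

    mkdir -p /usr/share/jobdemo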

Our work queue:
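Something along these lines works. The 12-line count matches the 12 iterations of the jobscript below; the jobqueue-backup name is just my choice:

    # 12 one-line work items; the content does not matter, only the line count
    seq 1 12 > /usr/share/jobdemo/jobqueue

    # keep a backup copy so we can refill the queue between demos
    cp /usr/share/jobdemo/jobqueue /usr/share/jobdemo/jobqueue-backup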

We will need this backup later (running Pods will delete lines from our workqueue file).

Work queue processing script:
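Saved as /usr/share/jobdemo/jobscript.sh, a sketch along these lines matches the explanation that follows:

    #!/bin/bash
    # process up to 12 items from the shared work queue
    for i in $(seq 1 12); do
      if [ -s /usr/share/jobdemo/jobqueue ]; then
        # workqueue not empty: do one unit of work
        echo " did some work "
        # remove the first line (the item just processed), then pause briefly
        sed -i '1d' /usr/share/jobdemo/jobqueue
        sleep 1
      else
        # no more lines in the workqueue: exit with return code 0 = success
        exit 0
      fi
    done
    exit 0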

Program explanation (the focus is on getting a minimal work queue just barely working):

  • for loop loops 12 times
  • if -s /usr/share/jobdemo/jobqueue … if workqueue not empty
  • echo “ did some work “ if there are lines left in workqueue
  • remove first line from workqueue file ; sleep 1 second
  • if no more lines in workqueue file then exit with return code 0 = success

Basically, every Pod will delete the first line it finds in the workqueue file and echo that it did some work.

When a Pod finds the workqueue empty, it just exits with code 0, which means success.

Seeing this in action several times will make it more clear.

We need to place the workqueue file and the jobscript on a persistent volume so that all Pods use the same workqueue file. Every Pod will take work from the SAME queue and delete the line (the work item) it took from the queue.

Create a 10Mi Persistent Volume pointing to the path of our 2 workqueue objects.
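A sketch of such a PV. The pv-demo name, the hostPath type, and the access mode are my choices; the path, size, and storageClassName come from this setup:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: pv-demo
    spec:
      storageClassName: pv-demo
      capacity:
        storage: 10Mi
      accessModes:
      - ReadWriteOnce
      hostPath:
        # the node directory holding jobqueue and jobscript.sh
        path: /usr/share/jobdemo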

Reference : https://kubernetes.io/docs/concepts/storage/volumes/

Claim usage of storageClassName: pv-demo, which points to the Persistent Volume.
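A matching claim could look like this (the jobdemo-pvc name is just my choice; the Job spec below refers to it):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: jobdemo-pvc
    spec:
      storageClassName: pv-demo
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Mi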

Reference : https://kubernetes.io/docs/concepts/storage/persistent-volumes/

8) Parallel Jobs with a Work Queue — Simplest Example

This example uses one Pod to read and process a workqueue until it is empty.

Create myWorkqueue-Job.yaml

This spec mounts our persistent volume and the command runs our jobscript.sh
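A sketch of myWorkqueue-Job.yaml along these lines; the volume and claim names, the alpine image, and running the script via sh are my choices:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: myjob
    spec:
      parallelism: 1
      template:
        spec:
          containers:
          - name: myjob
            image: alpine
            # run the shared work queue script from the persistent volume
            command: ["sh", "/usr/share/jobdemo/jobscript.sh"]
            volumeMounts:
            - name: jobdemo-vol
              mountPath: /usr/share/jobdemo
          restartPolicy: Never
          volumes:
          - name: jobdemo-vol
            persistentVolumeClaim:
              claimName: jobdemo-pvc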

Create the Job.

Repeatedly run kubectl get pods … monitor progress for parallelism: 1

Our single Pod took 12 seconds to delete the workqueue lines 1 by 1.

We can see this in the Pod log.

Delete job.

9) Parallel Jobs with a Work Queue : Parallelism: 2

Note the last line in the spec: we are now going to run 2 Pods in parallel.

We need to put the deleted lines back into the workqueue file.
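Assuming the backup copy from the setup step, refilling the queue is just:

    cp /usr/share/jobdemo/jobqueue-backup /usr/share/jobdemo/jobqueue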

Repeatedly run kubectl get pods … monitor parallelism: 2

As expected, 2 Pods were running in parallel the whole time.

If we investigate the logs of both our Pods we can see that each did around half the work.

Both Pods then exited with exit 0 when they found ‘no more work left’ ( workqueue empty ).

Get job overview.

We can see that 2 Pods running simultaneously took 9s, versus 12s for just one Pod.

Delete job.

10) Parallel Jobs with a Work Queue : Parallelism: 4

Note the last line in the spec: we are now going to run 4 Pods in parallel.

We need to put the deleted lines back into the workqueue file.

Create the job.

Monitor:

4 Pods starting up simultaneously.

4 Pods running in parallel, as expected.

4 parallel Pods are faster than 2 (the overhead of ContainerCreating prevents it from being twice as fast).

Describe details about our job (only relevant fields shown):

4 Succeeded / 0 Failed and only success lines in events at the bottom.

This is an example of a perfectly done job.

These 2 outputs below mean the same thing (success … note all the 4s for our parallelism: 4 job).

4 lines of 4 Completed Pods with zero RESTARTS.

Delete job.

11) Parallel Jobs with a Work Queue : Parallelism: 8

Note the last line in the spec: we are now going to run 8 Pods in parallel (on a node with only 4 CPU cores).

We need to put the deleted lines back into the workqueue file.

Create job.

Monitor:

Determine total job runtime.

Using 4 Pods took 9 seconds and using 8 Pods took 8 seconds.

It does not make sense to run more CPU-intensive Pods in parallel than there are CPU cores on the server. The jobs will context switch too much.

https://en.wikipedia.org/wiki/Context_switch

Notice how we used a very simple bash script to emulate a work queue.

Even that VERY simple script enabled us to learn a great deal about parallel jobs processing ONE shared, SIMULATED work queue.

Cleanup — delete job.

12) Different Patterns for Parallel Computation

Some different ways to manage parallel jobs are discussed at https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#job-patterns

Original Source

https://www.alibabacloud.com/blog/kubernetes-batch-jobs_595020?spm=a2c41.13112151.0.0

