How to Execute Mars in a Distributed Manner

Architecture

Submitting a Job

Controlling the Execution

Canceling a Job

Preparing the Execution Graph

Compressing a Graph

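import mars.tensor as mt
# Two random tensors of 100 elements, each held in a single chunk
a = mt.random.rand(100, chunks=100)
b = mt.random.rand(100, chunks=100)
# Element-wise add followed by a reduction; this only builds the graph
c = (a + b).sum()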

Allocating an Initial Worker

  1. The first unallocated initial node and the first worker in the list that has not yet been allocated operands are selected;
  2. In the undirected graph converted from the operand graph, a depth-first search is started from that node;
  3. If the search reaches another unallocated initial node, that node is allocated to the worker selected in step 1;
  4. When the total number of operands visited exceeds the average number of operands accepted by each worker, the allocation for this worker is stopped;
  5. If there are still workers that have not been allocated operands, go to step 1. Otherwise, the computation ends.
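
The five steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Mars implementation: the function name assign_initial_workers, the edge-list input, and the reading of "average" as total operands divided by the number of workers are all hypothetical choices made for the example. Initial nodes that are never reached before the workers run out are simply left unassigned here, since the steps do not say what happens to them.

```python
from collections import defaultdict


def assign_initial_workers(edges, initial_nodes, workers):
    """Map initial nodes to workers with a bounded depth-first search."""
    # Step 2 prerequisite: build the undirected view of the operand graph.
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    all_nodes = set(neighbors) | set(initial_nodes)
    # Assumption: "average" is total operands divided by worker count.
    average = len(all_nodes) / len(workers)

    initial_set = set(initial_nodes)
    assignment = {}
    for worker in workers:
        # Step 1: pick the first initial node that has no worker yet.
        start = next((n for n in initial_nodes if n not in assignment), None)
        if start is None:
            break
        assignment[start] = worker
        visited = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            # Step 4: stop once this worker has seen more than its share.
            if len(visited) > average:
                break
            # Step 3: claim unallocated initial nodes reached by the search.
            if node in initial_set and node not in assignment:
                assignment[node] = worker
            # Sorted only to make the traversal order deterministic.
            stack.extend(sorted(neighbors[node] - visited, reverse=True))
        # Step 5: the for loop moves on to the next worker, if any.
    return assignment


# Two independent (a + b).sum() pipelines and two workers.
edges = [
    ("a1", "add1"), ("b1", "add1"), ("add1", "sum1"),
    ("a2", "add2"), ("b2", "add2"), ("add2", "sum2"),
]
result = assign_initial_workers(edges, ["a1", "b1", "a2", "b2"], ["w1", "w2"])
print(result)
```

In this example each pipeline ends up on its own worker, so the inputs of one addition are generated on the same machine that adds them.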

Scheduling Policy

Selection Policy for Operands

  1. Operands with a larger depth need to be executed first;
  2. Operands that deeper operands depend on need to be executed first;
  3. Operands with smaller output sizes need to be executed first.
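
One way to read these three rules is as a lexicographic sort key, where each rule breaks ties left by the previous one. The sketch below takes that reading as an assumption; the ReadyOperand class, its fields, and in particular max_successor_depth (used here as a stand-in for "depended on by deeper operands") are hypothetical names for illustration and do not come from Mars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadyOperand:
    key: str
    depth: int                # distance from the initial nodes
    max_successor_depth: int  # depth of the deepest operand that needs this one
    output_size: int          # estimated output size in bytes


def priority(op):
    # Smaller tuples sort first, so depths are negated to prefer larger values.
    return (-op.depth, -op.max_successor_depth, op.output_size)


ready = [
    ReadyOperand("a", depth=1, max_successor_depth=2, output_size=800),
    ReadyOperand("b", depth=3, max_successor_depth=4, output_size=800),
    ReadyOperand("c", depth=3, max_successor_depth=5, output_size=800),
    ReadyOperand("d", depth=3, max_successor_depth=5, output_size=8),
]
order = [op.key for op in sorted(ready, key=priority)]
print(order)  # ['d', 'c', 'b', 'a']
```

Here "d" wins over "c" only on output size, "c" wins over "b" because a deeper operand depends on it, and "a" runs last because it is the shallowest.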

Selection Policy for Workers

Operand State

  • UNSCHEDULED: An operand is in this state when its upstream data is not ready.
  • READY: An operand is in this state when all upstream input data is ready. After it enters this state, the OperandActor submits jobs to all workers selected in the AssignerActor. If a worker is ready to run a job, it sends a message to the scheduler. The scheduler sends a stop message to the other workers, and then sends a message to that worker to start the job execution.
  • RUNNING: An operand is in this state when its execution has been started. When it enters this state, the OperandActor checks whether the job has been submitted. If not, the OperandActor constructs a graph composed of FetchChunk operands and the current operand, and submits it to the worker. Afterwards, the OperandActor registers a callback in the worker to receive the message indicating that the job is completed.
  • FINISHED: An operand is in this state when the job is completed. When an operand enters this state and has no successors, a message is sent to the GraphActor to determine whether the execution of the entire graph has ended. At the same time, the OperandActor notifies its predecessors and successors that the execution is completed. If a predecessor receives the message, it checks whether all of its successors have been completed. If so, the data of that predecessor can be released. If a successor receives the message, it checks whether all of its predecessors have been completed. If so, the successor can transition to READY.
  • FREED: An operand is in this state when all its data has been released.
  • FATAL: An operand is in this state when all attempts to re-execute it fail. When an operand enters this state, it passes the same state to its successor nodes.
  • CANCELLING: An operand is in this state when it is being canceled. If the job is currently executing, a request to cancel execution is sent to the worker.
  • CANCELLED: An operand is in this state when the execution is canceled and stopped. If the execution enters this state, the OperandActor attempts to transition the state of all successors to CANCELLING.
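
The bookkeeping described for the FINISHED state can be sketched without any actors. The snippet below is a simplified illustration, not Mars code: the OperandState enum mirrors the state names listed above, while mark_finished and the plain dictionaries it works on are hypothetical stand-ins for the messages exchanged between OperandActors.

```python
from enum import Enum


class OperandState(Enum):
    UNSCHEDULED = "unscheduled"
    READY = "ready"
    RUNNING = "running"
    FINISHED = "finished"
    FREED = "freed"
    FATAL = "fatal"
    CANCELLING = "cancelling"
    CANCELLED = "cancelled"


DONE = (OperandState.FINISHED, OperandState.FREED)


def mark_finished(key, states, predecessors, successors):
    """Record that `key` finished and update its neighbours in place."""
    states[key] = OperandState.FINISHED
    # A predecessor's data can be released once all its successors are done.
    for pred in predecessors.get(key, ()):
        if states[pred] is OperandState.FINISHED and all(
            states[s] in DONE for s in successors.get(pred, ())
        ):
            states[pred] = OperandState.FREED
    # A waiting successor becomes READY once all its predecessors are done.
    for succ in successors.get(key, ()):
        if states[succ] is OperandState.UNSCHEDULED and all(
            states[p] in DONE for p in predecessors.get(succ, ())
        ):
            states[succ] = OperandState.READY


# Operand graph for (a + b).sum(): a, b -> add -> sum
predecessors = {"add": ["a", "b"], "sum": ["add"]}
successors = {"a": ["add"], "b": ["add"], "add": ["sum"]}
states = {
    "a": OperandState.RUNNING,
    "b": OperandState.RUNNING,
    "add": OperandState.UNSCHEDULED,
    "sum": OperandState.UNSCHEDULED,
}

mark_finished("a", states, predecessors, successors)
state_of_add_after_a = states["add"]   # still waiting for b
mark_finished("b", states, predecessors, successors)
state_of_add_after_b = states["add"]   # both inputs are ready now
states["add"] = OperandState.RUNNING   # the scheduler starts the job
mark_finished("add", states, predecessors, successors)
print({k: v.name for k, v in states.items()})
```

After the last call, the two inputs are FREED because their only consumer has finished, and the reduction has moved to READY.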

Execution Details in Workers

Controlling the Execution

Sorting of Operands

Memory Management

Future Work


Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website: https://www.alibabacloud.com
