Setting Up PySpark on Alibaba Cloud CentOS Instance

Prerequisites:

  1. One Alibaba Cloud ECS instance
  2. One EIP (Elastic IP address)

Software to be installed:

  1. Python
  2. Java
  3. Spark

Section 1: Cloud Resources

What Is an ECS?

What Is an EIP?

Acquiring an ECS Instance

Buying and Associating an EIP

Section 2: Installing Python

What Is Python?

Installing Python on Alibaba Cloud ECS Instance

yum install gcc openssl-devel bzip2-devel libffi-devel
cd /usr/src
wget https://www.python.org/ftp/python/3.7.2/Python-3.7.2.tgz
tar xzf Python-3.7.2.tgz
cd Python-3.7.2
./configure --enable-optimizations
make altinstall
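The devel packages installed at the start matter because CPython's build silently skips extension modules whose headers are missing: openssl-devel backs ssl, libffi-devel backs ctypes, and bzip2-devel backs bz2. A quick sanity check after the build (run it with the freshly installed interpreter, e.g. python3.7):

```python
# Confirm the optional extension modules backed by openssl-devel,
# libffi-devel, and bzip2-devel were compiled into this interpreter.
import importlib

for name in ("ssl", "ctypes", "bz2"):
    # import_module raises ImportError if the build skipped the module
    importlib.import_module(name)

print("ssl, ctypes, and bz2 are all available")
```

If any of these raise ImportError, install the matching devel package and rebuild.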

Section 3: Installing Spark and PySpark

What is Spark?

  1. Spark SQL: A component of Spark for querying structured data with SQL syntax.
  2. Spark Streaming: A core library for processing and handling streaming data.
  3. MLlib (Machine Learning library): A library for clustering, predictive analytics, and other common machine learning and data mining algorithms.
  4. GraphX: A library for processing graphs and networks on a Spark cluster.
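Underneath all of these libraries sits the same core model: a dataset transformed by chained map, filter, and reduce operations, which Spark distributes across the cluster. As a rough plain-Python analogue (no cluster required; the method names mirror the real RDD API, but the data here is made up):

```python
from functools import reduce

# A local list standing in for a distributed RDD of numbers
data = [1, 2, 3, 4, 5, 6]

# rdd.map(lambda x: x * x) -> square every element
squares = list(map(lambda x: x * x, data))

# rdd.filter(lambda x: x % 2 == 0) -> keep only the even squares
evens = list(filter(lambda x: x % 2 == 0, squares))

# rdd.reduce(lambda a, b: a + b) -> combine the survivors into one value
total = reduce(lambda a, b: a + b, evens)
print(total)  # 4 + 16 + 36 = 56
```

In PySpark the same chain runs lazily and in parallel, but reads almost identically.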

What is PySpark?

Installing Spark/PySpark on Alibaba Cloud ECS instance

sudo yum update
sudo yum install java-1.8.0-openjdk-headless
wget https://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar xzf spark-2.4.0-bin-hadoop2.7.tgz
mv spark-2.4.0-bin-hadoop2.7 /opt/
cd /opt/spark-2.4.0-bin-hadoop2.7
  • sbin/start-master.sh: Starts a master instance on the machine the script is executed on.
  • sbin/start-slaves.sh: Starts a slave instance on each machine specified in the conf/slaves file.
  • sbin/start-slave.sh: Starts a slave instance on the machine the script is executed on.
  • sbin/start-all.sh: Starts both a master and a number of slaves as described above.
  • sbin/stop-master.sh: Stops the master that was started via the sbin/start-master.sh script.
  • sbin/stop-slaves.sh: Stops all slave instances on the machines specified in the conf/slaves file.
  • sbin/stop-all.sh: Stops both the master and the slaves as described above.
cat /opt/spark-2.4.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-centos.out
./sbin/start-slave.sh <master-spark-URL>
./sbin/start-slave.sh spark://centos:7077
export SPARK_HOME=/opt/spark-2.4.0-bin-hadoop2.7  
export PATH=$SPARK_HOME/bin:$PATH
bin/pyspark

Sample Code

from pyspark import SparkContext

# Use the Spark master log written earlier as sample input
outFile = "file:///opt/spark-2.4.0-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-centos.out"
sc = SparkContext("local", "example app")

# Load the file as an RDD of lines and cache it in memory
outData = sc.textFile(outFile).cache()

# Count the lines that contain the letter 'a'
numAs = outData.filter(lambda s: 'a' in s).count()
print("Lines with a: %i" % numAs)
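The filter/count pair above is just the distributed version of an ordinary Python comprehension. A plain-Python equivalent of the same logic, on an in-memory list of made-up log lines, makes the lambda's behavior easy to check without a Spark installation:

```python
# The same "count lines containing 'a'" logic, locally and without Spark
lines = [
    "Starting Spark master at spark://centos:7077",  # contains 'a'
    "Successfully started service on port 7077",     # contains 'a'
    "Bound web UI to port 8080",                     # no 'a'
]

numAs = sum(1 for s in lines if 'a' in s)
print("Lines with a: %i" % numAs)  # -> Lines with a: 2
```

Spark applies exactly this predicate, but partitions the lines across workers and sums the per-partition counts for you.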
