Big Data Storage and Spark on Kubernetes

Computing and Storage of Containerized Big Data

  • Hardware Limitations: Network bandwidth on machines keeps growing rapidly, while disk throughput has remained largely flat, so reading and writing data locally offers less and less of an advantage.
  • Computing Costs: Computing and storage needs rarely grow at the same rate; when the two are coupled on the same machines, the mismatch leaves a significant share of computing power idle and wasted.
  • Storage Costs: Centralized storage reduces storage costs while providing higher SLAs, making self-built data warehouses less competitive.

Cost-Efficiency and Bringing Down Costs

Achieving Larger Storage Capacities

Obtaining Faster Read/Write Speeds

Storage Solutions Using Alibaba Cloud Spark on Kubernetes

Storing Large Numbers of Small Files
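Object storage such as Alibaba Cloud OSS handles massive numbers of small files far better than HDFS, whose NameNode must keep every file's metadata in memory. The Scala sample below, taken from the E-MapReduce examples, reads an OSS object into an RDD; note that the AccessKey credentials and endpoint are embedded directly in the oss:// URI.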

package com.aliyun.emr.example

object OSSSample extends RunLocally {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(
        """Usage: bin/spark-submit --class OSSSample examples-1.0-SNAPSHOT-shaded.jar <inputPath> <numPartitions>
          |
          |Arguments:
          |
          |  inputPath      Input OSS object path, like oss://accessKeyId:accessKeySecret@bucket.endpoint/a/b.txt
          |  numPartitions  The number of RDD partitions.
          |
        """.stripMargin)
      System.exit(1)
    }
    val inputPath = args(0)
    val numPartitions = args(1).toInt

    // Read the OSS object as an RDD of lines with the requested parallelism.
    val ossData = sc.textFile(inputPath, numPartitions)
    println("The top 10 lines are:")
    ossData.top(10).foreach(println)
  }

  override def getAppName: String = "OSS Sample"
}
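Writing results back to OSS uses the same URI scheme. The following is a minimal sketch, not part of the original sample; it assumes the same RunLocally helper (which supplies the SparkContext sc), and the bucket paths and credentials are placeholders.

package com.aliyun.emr.example

// Illustrative companion sketch: filter an OSS dataset and persist the
// result back to OSS. Input and output URIs are placeholders.
object OSSWriteSample extends RunLocally {
  def main(args: Array[String]): Unit = {
    val input  = "oss://accessKeyId:accessKeySecret@bucket.endpoint/a/b.txt"
    val output = "oss://accessKeyId:accessKeySecret@bucket.endpoint/a/b-errors"
    val lines  = sc.textFile(input, 2)
    // Keep only lines that mention ERROR, then write them out as text files.
    lines.filter(_.contains("ERROR")).saveAsTextFile(output)
  }

  override def getAppName: String = "OSS Write Sample"
}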

Storing Files Using HDFS
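When a workload still needs an HDFS-style interface, Alibaba Cloud File Storage for HDFS exposes a dfs:// endpoint that Spark can read like any Hadoop-compatible file system. The classic SimpleApp example below counts the lines containing "a" and "b" in a log file stored on such an endpoint.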

/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // HDFS-compatible endpoint of an Alibaba Cloud file system.
    val logFile = "dfs://f-5d68cc61ya36.cn-beijing.dfs.aliyuncs.com:10290/logdata/ab.log"
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    // Cache the file so both counts below reuse the same in-memory data.
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
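Because logData is cached, the second count reuses the in-memory dataset rather than fetching the file from the remote endpoint twice, which is the access pattern that benefits most when compute and storage are separated.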

Regular File Storage
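For ordinary files that a job reads and writes through the local file system, a NAS-backed PersistentVolumeClaim can be mounted into the Spark pods. The SparkApplication manifest below, written for the Kubernetes spark-operator, mounts a claim named pvc-nas at /tmp in both the driver and the executors.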

apiVersion: "sparkoperator.k8s.io/v1alpha1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  restartPolicy:
    type: Never
  volumes:
    - name: pvc-nas
      persistentVolumeClaim:
        claimName: pvc-nas
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "pvc-nas"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "pvc-nas"
        mountPath: "/tmp"
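The claim referenced above must already exist in the default namespace; the operator only mounts it. Once the volume is mounted at the same path in every pod, application code can address the shared storage with an ordinary file:// path. A minimal sketch follows, assuming a file named sample.txt has been placed on the NAS share (both the file and the class are illustrative, not from the original article):

import org.apache.spark.sql.SparkSession

object NasSample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("NAS Sample").getOrCreate()
    // file:// works here only because the same NAS claim is mounted
    // at /tmp in the driver and in every executor pod.
    val data = spark.read.textFile("file:///tmp/sample.txt")
    println(s"Line count: ${data.count()}")
    spark.stop()
  }
}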

Other Storage Structures

Conclusion

Original Source


Follow me to keep abreast of the latest technology news, industry insights, and developer trends. Alibaba Cloud website: https://www.alibabacloud.com
