Big Data Storage and Spark on Kubernetes

Computing and Storage of Containerized Big Data

  • Hardware Limitations: Network bandwidth between machines is increasing exponentially, while disk throughput has remained largely unchanged, so reading and writing data locally offers less of an advantage.
  • Computing Costs: The mismatch in scale between computing and storage requirements leaves a significant amount of computing power idle when the two are provisioned together.
  • Storage Costs: Centralized storage can reduce storage costs and ensure higher SLAs at the same time, making self-built data warehouses less competitive.

Cost-Efficiency and Bringing Down Costs

Achieving Larger Storage Capacities

Obtaining Faster Read/Write Speeds

Storage Solutions Using Alibaba Cloud Spark on Kubernetes

Storing Large Numbers of Small Files

package com.aliyun.emr.example

// RunLocally is a helper trait from the same examples project; it is assumed to provide `sc`.
object OSSSample extends RunLocally {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(
        """Usage: bin/spark-submit --class OSSSample examples-1.0-SNAPSHOT-shaded.jar <inputPath> <numPartitions>
          |  inputPath      Input OSS object path, like oss://accessKeyId:accessKeySecret@bucket.endpoint/a/b.txt
          |  numPartitions  the number of RDD partitions.
        """.stripMargin)
      System.exit(1)
    }
    val inputPath = args(0)
    val numPartitions = args(1).toInt
    val ossData = sc.textFile(inputPath, numPartitions)
    println("The top 10 lines are:")
    ossData.top(10).foreach(println)
  }

  override def getAppName: String = "OSS Sample"
}
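
The usage text above gives the OSS path layout as oss://accessKeyId:accessKeySecret@bucket.endpoint/a/b.txt. As a minimal sketch of that layout, the helper below assembles such a path from its parts. The object name OssPath and its build method are hypothetical and are not part of any Alibaba Cloud SDK.

```scala
// Hypothetical helper (not an SDK API): assembles the OSS path layout
// shown in the OSSSample usage text.
object OssPath {
  def build(accessKeyId: String,
            accessKeySecret: String,
            bucket: String,
            endpoint: String,
            objectKey: String): String = {
    // Drop a leading slash so the result has exactly one '/' after the endpoint.
    val key = objectKey.stripPrefix("/")
    s"oss://$accessKeyId:$accessKeySecret@$bucket.$endpoint/$key"
  }
}
```

The resulting string could be passed as the inputPath argument of OSSSample.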

Storing Files Using HDFS

/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "dfs://" // dfs:// URI of the input file
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    // Read the file as a Dataset[String], one element per line.
    val logData = spark.read.textFile(logFile)
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
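
The counting logic in SimpleApp does not depend on where the file is stored. The following is a minimal plain-Scala sketch of the same two filters, with no Spark dependency, which may be useful for checking the logic before submitting a job. The object name LineCounts is hypothetical.

```scala
// Plain-Scala sketch of the SimpleApp counting logic (no Spark required).
object LineCounts {
  // Returns (number of lines containing "a", number of lines containing "b").
  def count(lines: Seq[String]): (Long, Long) = {
    val numAs = lines.count(_.contains("a")).toLong
    val numBs = lines.count(_.contains("b")).toLong
    (numAs, numBs)
  }
}
```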

Regular File Storage

apiVersion: ""
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  restartPolicy:
    type: Never
  # Volume backed by the NAS persistent volume claim
  volumes:
    - name: pvc-nas
      persistentVolumeClaim:
        claimName: pvc-nas
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "pvc-nas"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "pvc-nas"
        mountPath: "/tmp"
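
With the pvc-nas volume mounted at /tmp in both the driver and the executor, the NAS share is expected to behave like an ordinary local directory. The following is a minimal sketch, assuming the mount is writable, that writes a file under a given mount path and reads it back. The object name MountedVolumeCheck and the file prefix are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Sketch: treat a PVC-backed mount path as a regular directory.
object MountedVolumeCheck {
  // Writes `text` to a new temporary file under `mountPath`, reads it back,
  // and removes the file afterwards.
  def roundTrip(mountPath: String, text: String): String = {
    val dir = Paths.get(mountPath)
    val file = Files.createTempFile(dir, "spark-nas-", ".txt") // hypothetical prefix
    try {
      Files.write(file, text.getBytes(StandardCharsets.UTF_8))
      new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
    } finally {
      Files.deleteIfExists(file)
    }
  }
}
```

Inside a pod started from the manifest above, calling MountedVolumeCheck.roundTrip("/tmp", "hello") would exercise the mounted volume.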

Other Storage Structures


