Big Data Storage and Spark on Kubernetes

Computing and Storage of Containerized Big Data

  • Hardware Limitations: Network bandwidth is growing rapidly while disk throughput has stayed largely flat, so reading and writing data locally offers less of an advantage than it once did.
  • Computing Costs: Computing and storage needs differ greatly in scale, so tying them to the same machines wastes a significant amount of computing power.
  • Storage Costs: Centralized storage reduces storage costs while ensuring higher SLAs, which makes building your own data warehouse less competitive.

Cost-Efficiency and Bringing Down Costs

Achieving Larger Storage Capacities

Obtaining Faster Read/Write Speeds

Storage Solutions Using Alibaba Cloud Spark on Kubernetes

Storing Large Numbers of Small Files

package com.aliyun.emr.example

// RunLocally comes from the same example project and is expected to supply the SparkContext `sc`.
object OSSSample extends RunLocally {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(
        """Usage: bin/spark-submit --class OSSSample examples-1.0-SNAPSHOT-shaded.jar <inputPath> <numPartitions>
          |  inputPath      Input OSS object path, like oss://accessKeyId:accessKeySecret@bucket.endpoint/a/b.txt
          |  numPartitions  The number of RDD partitions.""".stripMargin)
      System.exit(1)
    }
    val inputPath = args(0)
    val numPartitions = args(1).toInt
    // Read the OSS object as an RDD of lines, split into the requested number of partitions.
    val ossData = sc.textFile(inputPath, numPartitions)
    println("The top 10 lines are:")
    ossData.top(10).foreach(println)
  }

  override def getAppName: String = "OSS Sample"
}
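The usage message above describes an input path that packs the credentials, bucket, and endpoint into a single string. As a rough illustration of how that format breaks down, here is a minimal sketch in plain Scala. `OssPath` and `parse` are hypothetical names introduced for this example and are not part of any Alibaba Cloud SDK; the sketch also assumes the bucket name is everything before the first dot after the `@`.

```scala
// Hypothetical helper (not an SDK class): splits a path of the form
// oss://accessKeyId:accessKeySecret@bucket.endpoint/object/key into its parts.
final case class OssPath(
    accessKeyId: String,
    accessKeySecret: String,
    bucket: String,
    endpoint: String,
    objectKey: String)

object OssPath {
  // Groups: key id, key secret, bucket (up to the first dot),
  // endpoint (up to the first slash), object key (the remainder).
  private val Pattern =
    """^oss://([^:@/]+):([^@/]+)@([^./]+)\.([^/]+)/(.*)$""".r

  // Returns None when the input does not follow the assumed format.
  def parse(path: String): Option[OssPath] = path match {
    case Pattern(id, secret, bucket, endpoint, key) =>
      Some(OssPath(id, secret, bucket, endpoint, key))
    case _ => None
  }
}
```

For the sample path in the usage message, `OssPath.parse` would yield the bucket `bucket`, the endpoint `endpoint`, and the object key `a/b.txt`.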

Storing Files Using HDFS

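/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // The rest of the dfs:// path is cut off in the source; supply the full path to your log file.
    val logFile = "dfs://"
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    // Read the file as a Dataset[String] and cache it, since it is scanned twice below.
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}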
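To make the counting step concrete without a cluster, the same filter-and-count logic can be tried on an in-memory collection. This is only a sketch: `LineCounts` and the sample lines are made up for illustration, and a plain `Seq` stands in for the Spark dataset.

```scala
object LineCounts {
  // Counts the lines containing the given substring, mirroring
  // logData.filter(line => line.contains(...)).count() in the Spark job.
  def countContaining(lines: Seq[String], needle: String): Long =
    lines.count(_.contains(needle)).toLong

  def main(args: Array[String]): Unit = {
    val lines = Seq("alpha", "bravo", "echo") // made-up sample data
    val numAs = countContaining(lines, "a")
    val numBs = countContaining(lines, "b")
    // Prints: Lines with a: 2, Lines with b: 1
    println(s"Lines with a: $numAs, Lines with b: $numBs")
  }
}
```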

Regular File Storage

# The apiVersion value is missing in the source; set it to the API version of your Spark operator CRD.
apiVersion: ""
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  restartPolicy:
    type: Never
  volumes:
    - name: pvc-nas
      persistentVolumeClaim:
        claimName: pvc-nas
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "pvc-nas"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "pvc-nas"
        mountPath: "/tmp"
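Once a volume such as `pvc-nas` is mounted into the driver and executor pods, code running there sees the mount path as an ordinary directory. The sketch below shows what that could look like using only the JDK file API. `MountedVolumeIO` and the file name are hypothetical, and the example assumes the process has write permission on the mount path (`/tmp` in the manifest above).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper: reads and writes small text files under a mounted volume.
object MountedVolumeIO {
  // Writes the text to <mountPath>/<name> and returns the resulting path.
  def write(mountPath: String, name: String, content: String): Path = {
    val target = Paths.get(mountPath, name)
    Files.write(target, content.getBytes(StandardCharsets.UTF_8))
  }

  // Reads the whole file back as a UTF-8 string.
  def read(file: Path): String =
    new String(Files.readAllBytes(file), StandardCharsets.UTF_8)
}
```

Inside a pod, a call such as `MountedVolumeIO.write("/tmp", "result.txt", "done")` would place the file on the volume mounted at that path.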

Other Storage Structures

