17 Running Spark Data Processing at Low Cost #

Product Introduction #

Alibaba Cloud Elastic Container Instance (ECI) #

ECI is a secure, serverless container runtime service. You do not manage any underlying servers: you supply a packaged Docker image, ECI runs the container, and you pay only for the resources the container actually consumes.

[Figure: ECI overview]

Alibaba Cloud Container Service Product Family #

[Figure: Alibaba Cloud container service product family]

Whether you use managed Kubernetes (ACK) or serverless Kubernetes (ASK), ECI can serve as the container resource layer. Behind the scenes this is based on virtual node technology: a virtual node (the Virtual Node) connects the Kubernetes cluster to ECI.

[Figure: virtual node integration between Kubernetes and ECI]

Kubernetes + ECI #

With Virtual Kubelet, a standard Kubernetes cluster can mix ECS worker nodes with virtual nodes, using the Virtual Node as an elastic resource pool that absorbs burst traffic.
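
As a sketch of how a workload lands on that elastic pool, the Pod below is pinned to the virtual node with a nodeSelector and a toleration. The label `type: virtual-kubelet` and the taint key are common virtual-kubelet conventions, not guaranteed values; check how the virtual node is registered in your cluster. The image is a placeholder.

```yaml
# Hedged sketch: a Pod steered onto the virtual node, and therefore onto ECI.
apiVersion: v1
kind: Pod
metadata:
  name: burst-worker
spec:
  nodeSelector:
    type: virtual-kubelet            # assumed label on the virtual node
  tolerations:
    - key: virtual-kubelet.io/provider
      operator: Exists
      effect: NoSchedule             # tolerate the virtual node's taint
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
```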

[Figure: Kubernetes cluster with ECS nodes plus a Virtual Node elastic pool]

ASK (Serverless Kubernetes) + ECI #

In a serverless cluster there are no ECS worker nodes, so there is nothing to reserve or plan. There is a single Virtual Node, and every Pod is created on it, backed by ECI instances.

[Figure: ASK cluster with a single Virtual Node backed by ECI]

Serverless Kubernetes is a container- and Kubernetes-based serverless service: simple, highly elastic, cost-effective, and pay-as-you-go. It removes node management and operations as well as capacity planning, letting users focus on their applications rather than on infrastructure.

Spark on Kubernetes #

Spark gained experimental support for a new deployment mode, Running Spark on Kubernetes, in version 2.3.0; the support has since matured and was declared generally available in Spark 3.1.

Advantages of Kubernetes #

[Figure: advantages of Kubernetes]

Advantages of Spark on Kubernetes over traditional deployment modes such as YARN:

  • Unified resource management. Jobs of any type can run in a single Kubernetes cluster, removing the need to maintain a separate YARN cluster just for big data workloads.
  • Separation of compute and storage. With the traditional co-located model, scaling storage drags compute along with it, which is wasteful, while scaling only compute leaves storage idle for a time. Kubernetes breaks that coupling for offline processing, so each side can grow independently.
  • An elastic cluster infrastructure.
  • Straightforward resource isolation and limits for complex distributed applications, without the complexity of YARN queue management and allocation.
  • The advantages of containerization. Each application packages its own dependencies, including its own Spark version, into a Docker image and runs in an isolated environment; applications are completely isolated from one another.
  • A path for moving big data to the cloud. Two approaches are common today: 1) build a YARN cluster on ECS (not limited to YARN); 2) buy an EMR service, which every cloud provider offers. Kubernetes is now a third option.

Spark Scheduling #

[Figure: Spark scheduling, native flow in orange with Kubernetes extensions in yellow]

The orange part of the diagram is the native Spark application scheduling flow; Spark on Kubernetes extends it (the yellow part) with a KubernetesClusterManager. Its KubernetesClusterSchedulerBackend extends the native CoarseGrainedSchedulerBackend and adds components such as ExecutorPodsLifecycleManager, ExecutorPodsAllocator, and KubernetesClient, so that the standard Spark Driver and Executor processes run as Kubernetes Pods.

spark-submit #

Before the Spark Operator appeared, the only way to submit a Spark job to a Kubernetes cluster was spark-submit. Once the Kubernetes cluster is set up, you can submit jobs from a local machine, for example:
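
The sketch below uses the stock SparkPi example; the API server address, image repository, Spark version, and service account are placeholders rather than values from this article.

```bash
# Submit the bundled SparkPi example to Kubernetes in cluster mode.
# Replace <k8s-apiserver>, <your-repo>, and the service account with
# your cluster's values.
bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<your-repo>/spark:3.1.1 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
```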

[Figure: spark-submit job startup flow]

The basic process of job startup is as follows:

  • spark-submit creates the Spark Driver as a Pod in the Kubernetes cluster.
  • Once started, the Driver calls the Kubernetes API to create the Executor Pods, which carry out the actual computation.
  • When the job finishes, the Executor Pods are reclaimed automatically, while the Driver Pod stays in the Completed (terminal) state so users can still view logs and so on (see the kubectl sketch after this list).
  • The Driver Pod itself is only removed when a user cleans it up manually or Kubernetes garbage collection reclaims it.
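
A sketch of inspecting a finished job; the driver pod name is a placeholder that depends on how the job was submitted:

```bash
# Spark on Kubernetes labels pods with spark-role=driver / spark-role=executor.
kubectl get pods -l spark-role=driver
# Read the application logs from the completed driver pod.
kubectl logs <driver-pod-name>
# Clean up manually once the logs are no longer needed.
kubectl delete pod <driver-pod-name>
```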

Submitting Spark jobs this way is hard to maintain and far from intuitive, especially as custom parameters accumulate. Worse, the notion of a Spark Application disappears: there are only scattered Kubernetes Pods, Services, and other primitives, so as the number of applications grows, maintenance costs rise and there is no unified management mechanism.

Spark Operator #

The Spark Operator was developed to deploy and maintain Spark applications in Kubernetes clusters. It is a classic CRD + Controller combination, i.e., a Kubernetes Operator implementation.
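
For illustration, a minimal SparkApplication manifest for the open-source spark-on-k8s-operator might look like the following; the image, namespace, and resource values are placeholders, not values from this article.

```yaml
# Hedged sketch: SparkPi expressed as a SparkApplication custom resource.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: <your-repo>/spark:3.1.1          # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark                  # assumed service account
  executor:
    instances: 2
    cores: 1
    memory: "512m"
```

The controller watches these resources and drives the same spark-submit flow shown above, which restores a first-class "application" unit that can be listed, updated, and deleted like any other Kubernetes object.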

[Figure: Spark Operator architecture]

The following diagram shows the SparkApplication state machine:

[Figure: SparkApplication state machine]

Serverless Kubernetes + ECI #

Running Spark in a Serverless Kubernetes cluster simplifies the native Spark on Kubernetes setup even further.

[Figure: Spark on Serverless Kubernetes + ECI]

Storage Selection #

[Figure: storage selection]

Because the cluster is not built on HDFS, batch jobs read from external data sources: compute and storage are separated, and the Kubernetes cluster provides only compute resources.

  • Data sources can be stored in Alibaba Cloud Object Storage Service (OSS) or Alibaba Cloud's distributed file system service (HDFS); see the OSS configuration sketch after this list.
  • Temporary data and shuffle data can use the free 40 GB of system disk space that ECI provides; Alibaba Cloud data disks and CPFS/NAS file systems can also be mounted, all of which perform well.
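
To illustrate the compute/storage split, the fragment below shows how a SparkApplication could point at OSS through the open-source hadoop-aliyun connector. The endpoint is a placeholder, and in a real deployment the credentials belong in a Kubernetes Secret rather than plain text.

```yaml
# Hedged sketch: OSS access settings in a SparkApplication's sparkConf.
spec:
  sparkConf:
    "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
    "spark.hadoop.fs.oss.endpoint": "oss-cn-hangzhou-internal.aliyuncs.com"   # placeholder endpoint
    "spark.hadoop.fs.oss.accessKeyId": "<access-key-id>"                      # placeholder; use a Secret
    "spark.hadoop.fs.oss.accessKeySecret": "<access-key-secret>"              # placeholder; use a Secret
```

With this in place, the job can read and write paths of the form oss://<bucket>/<path> directly, while the Kubernetes cluster itself remains a pure compute layer.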