Sparkling Water can now run inside a Kubernetes cluster. Sparkling Water provides a beta version of Kubernetes support in the form of nightly builds. Both Kubernetes deployment modes, cluster and client, are supported, and both Sparkling Water backends with all clients are ready to be tested.
Sparkling Water on Kubernetes is currently in open beta in the development branch, but we already publish nightly docker images for Spark 2.4 and 3.0 (https://hub.docker.com/u/h2oai/), so you can give it a try right away! Official support for Kubernetes will come in the next major release, which we expect to roll out around September.
Let’s assume in this blog that we use Spark 2.4, for which the docker images are tagged latest-nightly-2.4. If you want to use Spark 3.0, please use the tag latest-nightly-3.0.
Before we start, please review the following prerequisites:
Note: In this blog when we refer to H2O, we are referring to H2O-3.
The examples below use the default Kubernetes namespace, which we enable for Spark as:
kubectl create clusterrolebinding default --clusterrole=edit --serviceaccount=default:default --namespace=default
You can also use a different namespace setup for Spark. In that case, please don’t forget to pass --conf spark.kubernetes.authenticate.driver.serviceAccountName=serviceName to your Spark commands.
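For example, a dedicated namespace and service account for Spark could be prepared as follows; the names spark and spark-sa below are purely illustrative:

kubectl create namespace spark
kubectl create serviceaccount spark-sa --namespace=spark
kubectl create clusterrolebinding spark-sa-edit --clusterrole=edit --serviceaccount=spark:spark-sa

Your Spark commands would then also carry --conf spark.kubernetes.namespace=spark together with the serviceAccountName option shown above.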
Sparkling Water on Kubernetes can be run via the internal or external backend for Scala, Python, or R. To read more about the backends, please see our documentation. In the rest of the blog, we will walk through the configuration and setup of these combinations.
This open beta of Sparkling Water on Kubernetes is an opportunity for you to try it, explore it, and give us feedback. If you have questions or run into issues, please contact us via our Community Slack channel or feel free to submit issues here.
With the internal backend of Sparkling Water, we need to pass the option spark.scheduler.minRegisteredResourcesRatio=1 to our Spark job invocation. This ensures that Spark waits for all requested resources, and therefore Sparkling Water starts H2O on all requested executors. Dynamic allocation must be disabled in Spark.
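If dynamic allocation happens to be enabled in your environment, you can switch it off explicitly on the command line. A minimal sketch of the two relevant options (both are standard Spark settings):

--conf spark.dynamicAllocation.enabled=false \
--conf spark.scheduler.minRegisteredResourcesRatio=1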
Both cluster and client deployment modes of Kubernetes are supported. To submit a Scala job in cluster mode, run:
$SPARK_HOME/bin/spark-submit \
--master k8s://KUBERNETES_ENDPOINT \
--deploy-mode cluster \
--class ai.h2o.sparkling.InitTest \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.jar
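In cluster mode, the driver runs in its own pod, so you can follow the progress of the job with plain kubectl; the driver pod name below is a placeholder:

kubectl get pods -n default
kubectl logs -f <driver-pod-name> -n default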
1. Create a headless service so Spark executors can reach the driver node:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF
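You can verify that the service exists and is indeed headless (it should report no cluster IP), for example:

kubectl get service sparkling-water-app -n default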
2. Start the pod from which we run the shell:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-app-selector=yoursparkapp --image=h2oai/sparkling-water-scala:latest-nightly-2.4 -- /bin/bash
3. Inside the container, start the shell:
$SPARK_HOME/bin/spark-shell \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.executor.instances=3
4. Inside the shell, run:
import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()
5. To access H2O Flow, we need to enable port-forwarding from the driver pod:
kubectl port-forward sparkling-water-app 54321:54321
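Once the port-forwarding is running, H2O Flow is available in your local browser at http://localhost:54321.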
To submit a batch job in client mode, first create the headless service as described in step 1 above and run:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-app-selector=yoursparkapp --image=h2oai/sparkling-water-scala:latest-nightly-2.4 -- \
/opt/spark/bin/spark-submit \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--class ai.h2o.sparkling.InitTest \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.jar
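Note that kubectl run with --restart=Never keeps the completed pod around, so re-running the same command fails because the pod name already exists; delete the pod first:

kubectl delete pod sparkling-water-app -n default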
Both cluster and client deployment modes of Kubernetes are supported. To submit a Python job in cluster mode, run:
$SPARK_HOME/bin/spark-submit \
--master k8s://KUBERNETES_ENDPOINT \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.py
1. Create a headless service so Spark executors can reach the driver node:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF
2. Start the pod from which we run the shell:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-app-selector=yoursparkapp --image=h2oai/sparkling-water-python:latest-nightly-2.4 -- /bin/bash
3. Inside the container, start the shell:
$SPARK_HOME/bin/pyspark \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.executor.instances=3
4. Inside the shell, run:
from pysparkling import *
hc = H2OContext.getOrCreate()
5. To access H2O Flow, we need to enable port-forwarding from the driver pod:
kubectl port-forward sparkling-water-app 54321:54321
To submit a batch job in client mode, first create the headless service as described in step 1 above and run:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-app-selector=yoursparkapp --image=h2oai/sparkling-water-python:latest-nightly-2.4 -- \
$SPARK_HOME/bin/spark-submit \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.py
First, start a docker container from the image h2oai/sparkling-water-r:latest-nightly-2.4, where the required dependencies are already installed. You can also install the dependencies on the physical machine, but please make sure to use the same version of RSparkling as the one used inside the docker image.
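A minimal sketch of starting such a container, assuming your kubeconfig lives in ~/.kube (mounted so the R session inside can reach the cluster) and the image permits an interactive shell:

docker run -it --rm -v ~/.kube:/root/.kube h2oai/sparkling-water-r:latest-nightly-2.4 /bin/bash

To start H2OContext in an interactive shell, run the following code in R: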
library(sparklyr)
library(rsparkling)
config <- spark_config_kubernetes("k8s://KUBERNETES_ENDPOINT",
                                  image = "h2oai/sparkling-water-r:latest-nightly-2.4",
                                  account = "default",
                                  executors = 3,
                                  version = "2.4.6",
                                  conf = list("spark.kubernetes.file.upload.path" = "file:///tmp"),
                                  ports = c(8880, 8881, 4040, 54321))
config["spark.home"] <- Sys.getenv("SPARK_HOME")
sc <- spark_connect(config = config, spark_home = Sys.getenv("SPARK_HOME"))
hc <- H2OContext.getOrCreate()
spark_disconnect(sc)
You can also submit an RSparkling batch job. In that case, create a file called batch.R with the content from the code box above and run:
Rscript --default-packages=methods,utils batch.R
Note: In the case of RSparkling, sparklyr automatically sets the Spark deployment mode, and it is not possible to specify it.
The Sparkling Water external backend can also be used in Kubernetes. First, we need to start an external H2O backend on Kubernetes. To achieve this, please follow the steps in the H2O on Kubernetes documentation, with one important exception: the image to be used needs to be h2oai/sparkling-water-external-backend:latest-nightly-2.4 and not the base H2O image mentioned in the H2O documentation, as Sparkling Water enhances the H2O image with additional dependencies.
In order for Sparkling Water to be able to connect to the H2O cluster, we need the address of the leader node of the H2O cluster. If we followed the H2O documentation on starting an H2O cluster on Kubernetes, the address is h2o-service.default.svc.cluster.local:54321, where the first part is the H2O headless service name and the second part is the name of the namespace.
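You can double-check that this service is in place before connecting, for example:

kubectl get service h2o-service -n default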
After the external H2O backend is created, we can connect to it from the Sparkling Water clients as follows:
Both cluster and client deployment modes of Kubernetes are supported. To submit a Scala job in cluster mode against the external backend, run:
$SPARK_HOME/bin/spark-submit \
--master k8s://KUBERNETES_ENDPOINT \
--deploy-mode cluster \
--class ai.h2o.sparkling.InitTest \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--conf spark.executor.instances=3 \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
local:///opt/sparkling-water/tests/initTest.jar
1. Create a headless service so Spark executors can reach the driver node:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF
2. Start the pod from which we run the shell:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-app-selector=yoursparkapp --image=h2oai/sparkling-water-scala:latest-nightly-2.4 -- /bin/bash
3. Inside the container, start the shell:
$SPARK_HOME/bin/spark-shell \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--conf spark.executor.instances=3
4. Inside the shell, run:
import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()
5. To access H2O Flow, we need to enable port-forwarding from the driver pod:
kubectl port-forward sparkling-water-app 54321:54321
To submit a batch job in client mode, first create the headless service as described in step 1 above and run:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-app-selector=yoursparkapp --image=h2oai/sparkling-water-scala:latest-nightly-2.4 -- \
/opt/spark/bin/spark-submit \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--class ai.h2o.sparkling.InitTest \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.jar
Both cluster and client deployment modes of Kubernetes are supported. To submit a Python job in cluster mode against the external backend, run:
$SPARK_HOME/bin/spark-submit \
--master k8s://KUBERNETES_ENDPOINT \
--deploy-mode cluster \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--conf spark.executor.instances=3 \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
local:///opt/sparkling-water/tests/initTest.py
1. Create a headless service so Spark executors can reach the driver node:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: sparkling-water-app
spec:
  clusterIP: "None"
  selector:
    spark-driver-selector: sparkling-water-app
EOF
2. Start the pod from which we run the shell:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-app-selector=yoursparkapp --image=h2oai/sparkling-water-python:latest-nightly-2.4 -- /bin/bash
3. Inside the container, start the shell:
$SPARK_HOME/bin/pyspark \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--conf spark.executor.instances=3
4. Inside the shell, run:
from pysparkling import *
hc = H2OContext.getOrCreate()
5. To access H2O Flow, we need to enable port-forwarding from the driver pod:
kubectl port-forward sparkling-water-app 54321:54321
To submit a batch job in client mode, first create the headless service as described in step 1 above and run:
kubectl run -n default -i --tty sparkling-water-app --restart=Never --labels spark-app-selector=yoursparkapp --image=h2oai/sparkling-water-python:latest-nightly-2.4 -- \
$SPARK_HOME/bin/spark-submit \
--conf spark.scheduler.minRegisteredResourcesRatio=1 \
--conf spark.kubernetes.container.image=h2oai/sparkling-water-python:latest-nightly-2.4 \
--master "k8s://KUBERNETES_ENDPOINT" \
--conf spark.driver.host=sparkling-water-app \
--deploy-mode client \
--conf spark.ext.h2o.backend.cluster.mode=external \
--conf spark.ext.h2o.external.start.mode=manual \
--conf spark.ext.h2o.hadoop.memory=2G \
--conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
--conf spark.ext.h2o.cloud.name=root \
--conf spark.executor.instances=3 \
local:///opt/sparkling-water/tests/initTest.py
As in the internal backend case, first start a docker container from the image h2oai/sparkling-water-r:latest-nightly-2.4, where the required dependencies are already installed. You can also install the dependencies on the physical machine, but please make sure to use the same version of RSparkling as the one used inside the docker image.
To start H2OContext in an interactive shell, run the following code in R:
library(sparklyr)
library(rsparkling)
config <- spark_config_kubernetes("k8s://KUBERNETES_ENDPOINT",
                                  image = "h2oai/sparkling-water-r:latest-nightly-2.4",
                                  account = "default",
                                  executors = 3,
                                  version = "2.4.6",
                                  conf = list(
                                    "spark.ext.h2o.backend.cluster.mode" = "external",
                                    "spark.ext.h2o.external.start.mode" = "manual",
                                    "spark.ext.h2o.hadoop.memory" = "2G",
                                    "spark.ext.h2o.cloud.representative" = "h2o-service.default.svc.cluster.local:54321",
                                    "spark.ext.h2o.cloud.name" = "root",
                                    "spark.kubernetes.file.upload.path" = "file:///tmp"),
                                  ports = c(8880, 8881, 4040, 54321))
config["spark.home"] <- Sys.getenv("SPARK_HOME")
sc <- spark_connect(config = config, spark_home = Sys.getenv("SPARK_HOME"))
hc <- H2OContext.getOrCreate()
spark_disconnect(sc)
You can also submit an RSparkling batch job. In that case, create a file called batch.R with the content from the code box above and run:
Rscript --default-packages=methods,utils batch.R
Note: In the case of RSparkling, sparklyr automatically sets the Spark deployment mode, and it is not possible to specify it.