H2O is an open-source, in-memory platform for distributed, scalable machine learning. It is a natural match for deployment on a Kubernetes cluster, the modern way of deploying, serving and scaling applications. With the major release 3.30.0.1, released in Q1 2020, H2O obtained first-class Kubernetes support.
This article explains how to create an H2O deployment on Kubernetes. It also covers selected H2O internal mechanisms to help the reader understand H2O’s behavior on a Kubernetes cluster.
To understand the behavior and limitations of a distributed H2O cluster, it is necessary to understand the basics of H2O’s design.
Once H2O nodes are started, a cluster is formed (an H2O cluster, not a cluster of Kubernetes nodes). As data are loaded into the H2O cluster, compression is applied internally and the data are evenly distributed across the H2O nodes.
This approach enables H2O to scale and perform machine learning on big data. For fast computation, the data are kept in memory: not only the loaded data, but also all intermediate computations, unless explicitly saved to persistent storage. Once data are loaded and distributed across the cluster of H2O nodes, it is currently impossible to change the cluster configuration. Adding new nodes or changing node size (memory, CPU) is prohibited, as the cluster adapts to the initial configuration, including data distribution and compression schemas. The upside of this approach is H2O’s speed.
The information above implies that an H2O cluster is stateful. If one H2O node is terminated, the cluster is immediately recognized as unhealthy and has to be restarted.
This implies H2O nodes must be treated as stateful by Kubernetes. In Kubernetes, a set of pods sharing a common state is called a StatefulSet. A Kubernetes StatefulSet ensures that each pod keeps a stable, unique identity across restarts and rescheduling.
Key takeaway: H2O is a stateful application. H2O nodes are spawned together and die together. Kubernetes tooling for stateless applications is not applicable to H2O. Databases are deployed on Kubernetes clusters in a very similar way.
For H2O to cluster up, each node requires the addresses of the other H2O nodes. On a simple local network, internal heuristics take care of the clustering. On a Kubernetes cluster, the situation is more difficult, as pods with H2O inside are distributed across Kubernetes nodes and IP addresses are assigned on demand by default. Some users create and distribute a flatfile with H2O node addresses. On a Kubernetes cluster, this is the wrong way: it is an additional, complicated and unnecessary step. First, persistent storage has to be allocated and mounted, which wastes resources and time. Then, the Kubernetes cluster has to be queried for the IP addresses of the H2O pods created. Afterwards, the flatfile is written to the persistent storage, while the H2O Docker containers inside the pods wait for the file to be present before starting H2O.
H2O is now able to discover other pods with H2O under the same service automatically, using resources native to Kubernetes: environment variables and services. No container hacking is required.
According to the Kubernetes networking model, every pod gets its own IP address. Also, pods on any node can communicate with all pods on all nodes. Therefore, a StatefulSet of H2O nodes is created and exposed via a headless service. The clustering is then performed automatically using the DNS records created by the headless service. The name of the headless service is passed to the H2O pod, and from there to the H2O Docker container, by defining the H2O_KUBERNETES_SERVICE_DNS environment variable. The format usually follows the <service-name>.<project-name>.svc.cluster.local pattern. Once this environment variable is present, H2O assumes it is running inside a Kubernetes cluster and waits for the clustering to finish before the main H2O program is actually started. All of this happens automatically.
Key takeaway: H2O is able to cluster itself inside a Kubernetes service using tools native to Kubernetes: services and environment variables. No other tools are required.
To ensure reproducibility, all requests should be directed to the H2O leader node. Leader node election is done after the node discovery process is completed. Therefore, once the cluster is formed and the leader node is known, only the pod with the H2O leader node should be made available. This also makes the service(s) on top of the deployment route all requests to the leader node only. To achieve that, the readiness probe residing on the /kubernetes/isLeaderNode address is used. Once the clustering is done, all nodes but the leader node mark themselves as not ready, leaving only the leader node exposed.
Key takeaway: To ensure reproducibility, only the leader node should be contacted. The readiness probe ensures only the leader node is reachable via the corresponding service.
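The routing effect described above can be illustrated with an ordinary (non-headless) Service placed in front of the same pods. The sketch below is an assumption rather than part of the H2O setup described in this article: the name h2o-client-service is hypothetical, while the namespace h2o-statefulset, the label app: h2o-k8s and port 54321 are taken from the manifests later in this article. Because a Kubernetes Service only forwards traffic to pods that pass their readiness probe, such a Service would be left with the leader pod as its single endpoint once clustering completes.

```yaml
# Hypothetical client-facing Service; not required for H2O node discovery.
apiVersion: v1
kind: Service
metadata:
  name: h2o-client-service      # hypothetical name
  namespace: h2o-statefulset    # assumed: same namespace as the H2O pods
spec:
  type: ClusterIP               # regular Service with a virtual IP (not headless)
  selector:
    app: h2o-k8s                # must match the label on the H2O pods
  ports:
    - protocol: TCP
      port: 54321               # default H2O port
      targetPort: 54321
```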
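To spawn an H2O cluster inside a Kubernetes cluster, the following requirements must be met: a Docker image with H2O inside, a headless service for H2O node discovery, and a StatefulSet of H2O pods. Each is described below.
A simple Docker container with H2O running on startup is enough. The simplest way to create one is demonstrated in the Dockerfile below.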
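FROM ubuntu:latest
# H2O version, passed in with --build-arg H2O_VERSION=...
ARG H2O_VERSION
# H2O runs on the JVM; unzip and wget are needed to fetch the release
RUN apt-get update \
    && apt-get install default-jdk unzip wget -y
# Download and unpack the release (the rel-zahradnik/1 path matches version 3.30.0.1)
RUN wget http://h2o-release.s3.amazonaws.com/h2o/rel-zahradnik/1/h2o-${H2O_VERSION}.zip \
    && unzip h2o-${H2O_VERSION}.zip
# Keep the version available at runtime so CMD can locate the jar
ENV H2O_VERSION=${H2O_VERSION}
# Start H2O when the container starts
CMD java -jar h2o-${H2O_VERSION}/h2o.jar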
To build the Docker image, use the following command, replacing the {image-name} placeholder with a meaningful name:
docker build . -t {image-name} --build-arg H2O_VERSION=3.30.0.1
For the purpose of this article, the Docker image is named h2o-k8s, resulting in:
docker build . -t h2o-k8s --build-arg H2O_VERSION=3.30.0.1
H2O pods deployed on a Kubernetes cluster require a headless service for H2O node discovery. Instead of load-balancing incoming requests to the underlying H2O pods, the headless service returns the addresses of all the underlying pods. This enables H2O to cluster itself.
apiVersion: v1
kind: Service
metadata:
  name: h2o-service
  namespace: h2o-statefulset
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: h2o-k8s
  ports:
    - protocol: TCP
      port: 54321
The clusterIP: None setting defines the service as headless. The port: 54321 setting is the default H2O port. Users and client libraries use this port to talk to the H2O cluster.
The app: h2o-k8s selector is of great importance, as it must match the app label carried by the H2O pods. The name itself has been chosen arbitrarily. Please make sure this setting corresponds to the label used in the H2O deployment.
It is strongly recommended to run H2O as a StatefulSet on a Kubernetes cluster. Kubernetes treats the pods inside a StatefulSet as stateful. With no liveness probe defined, as in the manifest below, it also does not restart a running pod just because the H2O cluster has become unhealthy. Once a job is triggered on an H2O cluster, the cluster is locked and no additional nodes can be added. Therefore, the cluster has to be restarted as a whole if required, which is a perfect fit for a StatefulSet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: h2o-stateful-set
  namespace: h2o-statefulset
spec:
  serviceName: h2o-service
  replicas: 3
  selector:
    matchLabels:
      app: h2o-k8s
  template:
    metadata:
      labels:
        app: h2o-k8s
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: h2o-k8s
          image: '<someDockerImageWithH2OInside>'
          resources:
            requests:
              memory: "4Gi"
          ports:
            - containerPort: 54321
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /kubernetes/isLeaderNode
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 1
          env:
            - name: H2O_KUBERNETES_SERVICE_DNS
              value: h2o-service.h2o-statefulset.svc.cluster.local
            - name: H2O_NODE_LOOKUP_TIMEOUT
              value: '180'
            - name: H2O_NODE_EXPECTED_COUNT
              value: '3'
            - name: H2O_KUBERNETES_API_PORT
              value: '8081'
Besides standard Kubernetes settings, like replicas: 3, which defines the number of H2O pods instantiated, there are several settings to pay attention to.
The application label app: h2o-k8s must match the selector of the above-defined headless service for H2O node discovery to work. H2O communicates on port 54321, therefore containerPort: 54321 must be exposed to make it possible for clients to connect.
The readiness probe residing on /kubernetes/isLeaderNode makes sure only the leader node is exposed once the cluster is formed, by marking all nodes but the leader as not ready. The default port for the H2O Kubernetes API is 8080. In the example, the optional H2O_KUBERNETES_API_PORT environment variable changes the port to 8081.
Environment variables:
H2O_KUBERNETES_SERVICE_DNS – [MANDATORY] Crucial for the clustering to work. The format usually follows the <service-name>.<project-name>.svc.cluster.local pattern. This setting enables H2O node discovery via DNS. It must be modified to match the name of the headless service created. Also, make sure the rest of the address matches the specifics of your Kubernetes implementation.
H2O_NODE_LOOKUP_TIMEOUT – [OPTIONAL] Node lookup constraint. Time before the node lookup is ended.
H2O_NODE_EXPECTED_COUNT – [OPTIONAL] Node lookup constraint. Expected number of H2O pods to be discovered.
H2O_KUBERNETES_API_PORT – [OPTIONAL] Port for Kubernetes API checks and probes to listen on. Defaults to 8080.
If none of the optional lookup constraints is specified, a sensible default node lookup timeout is set, currently 3 minutes. If any of the lookup constraints are defined, the H2O node lookup is terminated on whichever condition is met first.
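Since H2O_NODE_EXPECTED_COUNT describes how many H2O pods should be discovered, it seems sensible to keep it equal to replicas whenever a different cluster size is chosen. The fragment below is a sketch of only the relevant StatefulSet fields for a five-node cluster; the value 5 is an arbitrary illustration, and every field not shown is assumed to stay as in the StatefulSet above.

```yaml
# Fragment of the StatefulSet spec only; other fields stay as shown above.
spec:
  replicas: 5                          # number of H2O pods to start
  template:
    spec:
      containers:
        - name: h2o-k8s
          env:
            - name: H2O_NODE_EXPECTED_COUNT
              value: '5'               # kept equal to replicas
            - name: H2O_NODE_LOOKUP_TIMEOUT
              value: '180'             # same value as in the example above
```

With both constraints set, the lookup ends on whichever condition is met first, as described above.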
Exposing the H2O cluster is the responsibility of the Kubernetes administrator. By default, an Ingress can be created. Different platforms offer different capabilities, e.g. OpenShift offers Routes.
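As one possible way to expose the cluster, a minimal Ingress might look like the sketch below. It assumes an Ingress controller is installed in the cluster and uses the networking.k8s.io/v1 API. The service name h2o-service and port 54321 come from this article, while the name h2o-ingress and the host h2o.example.com are hypothetical. Whether an Ingress controller can route to a headless service depends on the controller, so a regular ClusterIP service may be needed as the backend instead.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: h2o-ingress                 # hypothetical name
  namespace: h2o-statefulset        # assumed: same namespace as the H2O pods
spec:
  rules:
    - host: h2o.example.com         # hypothetical host name
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: h2o-service   # service defined in this article
                port:
                  number: 54321     # default H2O port
```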
The resulting YAML may be put into a single file, for example h2o.yaml.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: h2o-stateful-set
  namespace: h2o-statefulset
spec:
  serviceName: h2o-service
  replicas: 3
  selector:
    matchLabels:
      app: h2o-k8s
  template:
    metadata:
      labels:
        app: h2o-k8s
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: h2o-k8s
          image: 'pscheidl/h2o-k8s'
          resources:
            requests:
              memory: "4Gi"
          ports:
            - containerPort: 54321
              protocol: TCP
          # Readiness probe from the StatefulSet section: exposes only the leader node
          readinessProbe:
            httpGet:
              path: /kubernetes/isLeaderNode
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 1
          env:
            - name: H2O_KUBERNETES_SERVICE_DNS
              value: h2o-service.h2o-statefulset.svc.cluster.local
            - name: H2O_NODE_LOOKUP_TIMEOUT
              value: '180'
            - name: H2O_NODE_EXPECTED_COUNT
              value: '3'
            - name: H2O_KUBERNETES_API_PORT
              value: '8081'
---
apiVersion: v1
kind: Service
metadata:
  name: h2o-service
  namespace: h2o-statefulset
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: h2o-k8s
  ports:
    - protocol: TCP
      port: 54321
The result might be applied locally using kubectl apply -f h2o.yaml, or copied into your favorite Kubernetes cluster provider’s interface.
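The StatefulSet above declares namespace: h2o-statefulset, and Kubernetes rejects namespaced objects whose namespace does not exist. If that namespace has not been created yet, a manifest along these lines could be applied first, or placed at the top of h2o.yaml and separated from the rest by a --- line. This is a sketch under that assumption, not part of the original example.

```yaml
# Creates the namespace referenced by the StatefulSet in h2o.yaml.
apiVersion: v1
kind: Namespace
metadata:
  name: h2o-statefulset
```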
A huge thank you belongs to Nicholas Anderson from Discover Financial Services, as they were the drivers of the H2O Kubernetes implementation, providing a lot of real-world use cases and templates.
Remember, H2O.ai is open-source and can be found on GitHub. Found a bug? Head to H2O JIRA. Have questions? H2O offers community Gitter and Slack.