September 29th, 2014

Sparkling Water Tutorials

RSS icon RSS Category: Uncategorized [EN]
spark_embedded_cluster

Please follow the updated version of tutorials here

H2O is hosting a meetup tomorrow at our officewhere attendees are encourage to hack away with us as we run Deep Learning on Sparkling Water. If you haven’t already read allabout H2O’s integration into Spark then get started withHow Sparkling Water Brings H2O to Sparkand Sparkling Water!.

For those who can’t attend the meetup tomorrow or for the overachievers that want to get a head start and come with an arsenalof probing questions for our speaker, Michal Malohlava, we have the zip file with the prepackaged demo ready for download.

Running Sparkling Water Locally

spark_embedded_cluster
The workflow of our demonstration is as follows:

  • Start up a Spark Cluster
  • Launch H2O in conjunction to the Spark Cluster
  • Import and parse an airlines dataset
  • Filter and munge using Spark to create a subset of the origin data
  • Build a model using H2O
  • Predict using the model built in H2O

Video Tutorial

Walkthrough

Step 1 – Prerequisites

  • Linux or Mac OS
  • Oracle’s Java 1.7

Step 2 – Download the zip file

Step 3 – Unzip the demo.zip file and run the example script
$ unzip demo.zip
$ cd perrier/h2o-examples
$ export MASTER="local-cluster[3,2,1024]"
$ ./run-example.sh

Note: If your machine detects multiple home addresses that H2O can launch on look for the IP address that H2O actually launches onwhich can be found under “Attempting to determine correct address”. Cancel the operation and set Spark’s Local IP address to whereH2O was launching on and execute the example script again.
$ export SPARK_LOCAL_IP='127.0.0.1'
$ ./run-example.sh

Running Standalone Sparkling Water

spark_standalone
Much like running Sparkling Water locally, we’ll start up 3 H2O nodes except instead of starting 3 worker nodes and 1 masternode on a single JVM we will start up 4 JVM that forms the Spark Cluster.

Walkthrough

Step 1 – Prerequisites

  • Linux or Mac OS
  • Oracle’s Java 1.7

Step 2 – Download the zip file

Step 3 – Unzip the demo.zip file and launch the Spark cluster
$ unzip demo.zip
$ cd perrier/sbin
$ ./launch-spark-cloud.sh
$ export MASTER="spark://localhost:7077"
Note: You can access the Spark Cluster at localhost:8080 and H2O on localhost:54321

Step 4 – Run the example script
$ cd ../h2o-examples
$ ./run-example.sh


Running H2O Commands from Spark’s Command Line

For those adventurous enough to play with the source code, there is workflow available that will give the user more flexibilityso that different datasets can be used and different algothrims can be tried.

Video Tutorial

Walkthrough

Step 1 – Prerequisites

  • Linux or Mac OS
  • Oracle’s Java 1.7

Step 2 – Download the zip file

Step 3 – Launch a Spark Cluster (UI can be accessed at localhost:8080)
$ unzip demo.zip
$ cd perrier/sbin
$ ./launch-spark-cloud.sh
$ export MASTER="spark://localhost:7077"

Note: If your machine detects multiple home addresses that H2O can launch on look for the IP address that H2O actually launches onwhich can be found under “Attempting to determine correct address”. Set Spark’s Local IP address to whereH2O was launching on, for example:
$ export SPARK_LOCAL_IP='127.0.0.1'

Step 4 – Start the Spark Shell
$ cd ../h2o-examples
$ ./sparkling-shell

Step 5 – Import H2O Client App and Launch H2O (UI can be accessed at localhost:54321)“`scalaimport water.H2OClientAppH2OClientApp.start()import water.H2OH2O.waitForCloudSize(3, 10000)

import java.io.Fileimport water.fvec.import org.apache.spark.examples.h2o._import org.apache.spark.h2o.“`

Step 6 – Import datascala
val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"
val airlinesData = new DataFrame(new File(dataFile))

Step 7 – Move Data from Spark to H2O RDD (new RDD type in Spark) and count the number of flights in the airlines data“`scalaval h2oContext = new H2OContext(sc)import h2oContext._import org.apache.spark.rdd.RDD

val airlinesTable : RDD[Airlines] = toRDDAirlinesairlinesTable.count“`

Step 8 – Do the same count int Sparkscala
val flightsOnlyToSFO = airlinesTable.filter( _.Dest.equals(Some("SFO")) )
flightsOnlyToSFO.count

Step 9 – Run a SQL query that will only return flights flying to SFOscala
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext._ // import implicit conversions
airlinesTable.registerTempTable("airlinesTable")
val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO'"
val result = sql(query) // Using a registered context and tables
result.count
result.count == flightsOnlyToSFO.count

Step 10 – Set the parameters for running a Deep Learning model and build a model“`scalaimport hex.deeplearning._import hex.deeplearning.DeepLearningModel.DeepLearningParametersval dlParams = new DeepLearningParameters()

dlParams._training_frame = result( ‘Year, ‘Month, ‘DayofMonth, ‘DayOfWeek, ‘CRSDepTime, ‘CRSArrTime,’UniqueCarrier, ‘FlightNum, ‘TailNum, ‘CRSElapsedTime, ‘Origin, ‘Dest,’Distance, ‘IsDepDelayed)dlParams.response_column = ‘IsDepDelayed.nameval dl = new DeepLearning(dlParams)val dlModel = dl.train.get“`

Step 11 – Score on the Deep Learning model and grab the output predictionsscala
val predictionH2OFrame = dlModel.score(result)('predict)
val predictionsFromModel = toRDDDoubleHolder.map ( _.result.getOrElse("NaN") ).collect


Running Sparkling Water SandboxWindows users do not abandon hope! A sandbox version is available for download and will have the environment set up with the prerequisite settings.Simply download and run through any of the tutorials above.

Walkthrough

Step 1 – Launch SandboxDownload Virtualbox in order to use the OVA file.Start the OVA file and log in as user ops:
user: ops
password: 0xdata
Note: password for sudo (root) is also 0xdata

Step 2 – Upgrade Sparkling Water
$ ./upgrade.sh

Step 3 – Try Sparkling Water REPL

Simply run sshell to start the Spark shell and run H2O operations from the shell.
$ /opt/sparkling/h2o-examples/sparkling-shell

Leave a Reply

+
Recap of H2O World India 2023: Advancements in AI and Insights from Industry Leaders

On April 19th, the H2O World  made its debut in India, marking yet another milestone

May 29, 2023 - by Parul Pandey
+
Enhancing H2O Model Validation App with h2oGPT Integration

As machine learning practitioners, we’re always on the lookout for innovative ways to streamline and

May 17, 2023 - by Parul Pandey
+
Building a Manufacturing Product Defect Classification Model and Application using H2O Hydrogen Torch, H2O MLOps, and H2O Wave

Primary Authors: Nishaanthini Gnanavel and Genevieve Richards Effective product quality control is of utmost importance in

May 15, 2023 - by Shivam Bansal
AI for Good hackathon
+
Insights from AI for Good Hackathon: Using Machine Learning to Tackle Pollution

At H2O.ai, we believe technology can be a force for good, and we're committed to

May 10, 2023 - by Parul Pandey and Shivam Bansal
H2O democratizing LLMs
+
Democratization of LLMs

Every organization needs to own its GPT as simply as we need to own our

May 8, 2023 - by Sri Ambati
h2oGPT blog header
+
Building the World’s Best Open-Source Large Language Model: H2O.ai’s Journey

At H2O.ai, we pride ourselves on developing world-class Machine Learning, Deep Learning, and AI platforms.

May 3, 2023 - by Arno Candel

Request a Demo

Explore how to Make, Operate and Innovate with the H2O AI Cloud today

Learn More