H2O.ai Blog
All models are wrong, but some models are useful!
George Box said that. There is no best model that works for all of your data. Wolpert reiterates that as the No Free Lunch theorem. Model predictive performance is domain specific. What works in one data domain sometimes has very little consequence in another. Predictably, the rise of Domain Science: Data science needs to get closer ...
Hack data with R + H2O (aka, the last Thursday of the 2013 meetup!)
Come join us and 32 other Data Scientistas to hack the airline dataset with R. This is the small, intimate open house that we hold on the last Thursday of each month, and this is the season finale! And what a year it has been for H2O! Nidhi will walk you through RStudio – don't forget to bring your tool belt (& a laptop with R install...
R & Scala for fast in-memory predictions on Hadoop via H2O!
Three of our best and brightest gave a talk last night on H2O, R, Scala and Hadoop (yes, all together, and yes, highlighting the integration). If you missed the talk last night the slides are linked here, and we're doing an encore next week (http://www.meetup.com/SF-Scala/events/153854762/). Tom Kraljevic presents using H2O on Hadoop – how w...
Scalala on H2O at Typesafe
Please come catch us, catch up with us, and meet up with us next week, on the 17th. Typesafe, the makers & maintainers of Scala, is hosting us; Adriaan Moors and the H2O team will be talking about Scala, working with data at scale, and getting the most out of your big data and domain. The meetup's in San Francisco, the details can ...
R & Scala for fast in-memory predictions on Hadoop via H2O!
Take R and Scala to Big Data using in-memory Algorithms from H2O. In this Triple Header for SF Big Data Science, Anqi Fu, our resident R wiz, will present data munging and R ad hoc analytics at scale. Be prepared for fireworks with R in RStudio and not a ton of PowerPoint. Scala has reached tremendous adoption amongst Machine learning &...
Machine Learning for Adtech
Characteristics of advertising data: tens of thousands of columns or more (top 100k or 1M sites); high collinearity factors: e.g. demographics, with a strong correlation between e.g. income and education; collinearity: sports fans follow NFL + ESPN + Bleacher Report + Fox Sports; users of Ravelry also shop Etsy. Those features are certa...
Making films is not too different from startups
Quentin Tarantino, Ang Lee and other great directors discuss making films, creative process, attention to detail and inspiring & directing one's team to do great work. ...
H2O goes to CodeMesh in London
An API for Distributed Computing: We have defined an API and built an open-source platform for dealing with in-memory distributed data. We’ve used it to build state-of-the-art predictive modeling and analytics (e.g. GLMNET, GBM, Random Forest) that’s 1000x faster than the disk-bound alternatives, and 100x faster than R (we love R but it’s...
H2O goes to QCon SF
Math Algorithms have primarily been the domain of desktop data science. With the success of scalable algorithms at Google, Amazon, and Netflix, there is an ever-growing demand for sophisticated algorithms over big data. In this talk, we get a ringside view of the making of the world's most scalable and fastest machine learning framework,...
Distributed Deep Learning with H2O in the Cloud @ eBay
Cyprien Noel will present hand-picked algorithms that work on H2O at scale and a survey of the space. We will walk users through a couple of datasets (MNIST) and demonstrate the power of Multi-layer Neural Networks at Scale in EC2. Learn more and sign up at http://www.meetup.com/Silicon-Valley-Big-Data-Science/events/132780102/ ...
Predictable Rise of Physicists: Domain Science
For years, I secretly suspected that a lot of our math came from Physics. Some of the greatest leaps in math were made closely alongside the greatest discoveries in Physics. Calculus. QED. Turing. The physics of our businesses is grounded in a complex systems understanding of domain. When Data science finally gets freed from time-sapping...
Pivotal hosts 0xdata - Distributed Random Forest, GBM, GLM & API for Big Data Algos
Distributed Machine Learning has come of age, just in time to meet the challenges of Big Data. We will present an API for extending and rolling your own Algorithms or using powerful contest-winning Gradient Boosting Machine, Generalized Linear Modeling and Random Forest at scale. Demo and Fireworks using big datasets from within ...
Frontier Big Data Meetup - Scalability & Availability
Come see Sri present on November 5th!
1. Sam Hamilton, Vice President of Data Technology at PayPal
2. SriSatish Ambati, Co-founder & CEO, 0xData
3. Sourav Mazumder, Technology Head of Big Data Practices, Infosys
4. Bruce Templeton, Co-founder & CEO, NephoScale
At Room B3 in Mission City Ballroom, Santa Clara Convention Center...
0xdata and Yelp - Machine Learning for Relevance and Serendipity/Distributed Gradient Boosting
Join us and Yelp for a chat on Machine Learning, and make sure not to miss Sri’s lightning talk on Distributed Gradient Boosting! Main Talk: Machine Learning for Relevance and Serendipity. Speaker: Aria Haghighi (Prismatic). Abstract: Careful use of well-designed machine learning systems can transform products by providing highly perso...
Our data, our math // our tools, our science!
Big data has always been with us. Our race's answer to data explosion was through math & computation. Whether it was Newton's calculus, Einstein's Relativity or Shannon's Information Theory, each generation's answer to its big data problem arose from its best and brightest. Our generation's challenge is here. Our lives are mired in d...
Building a Distributed GBM on H2O
At 0xdata we build state-of-the-art distributed algorithms – and recently we embarked on building GBM, an algorithm notorious for being impossible to parallelize, much less distribute. We built the algorithm shown in Elements of Statistical Learning II by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, on page 387 (shown at the bo...
An API For Distributed Analytics
There are so many APIs to choose from… Features of the space: Lots of data – which I’ll qualify as “bigger than 1 machine” and thus needing parallel I/O, parallel memory, & parallel compute – and distributed algorithms. Ease of programming; hide details (but expose them when you want to). High level for ease-of-use, but “under the covers” ...
Strata NYC & Hadoop World: How to Stop Worrying and Start Modeling Big Data with Better Algorithms and H2O
How to Stop Worrying and Start Modeling Big Data with Better Algorithms and H2O Srisatish Ambati (0xdata Inc), Cliff Click (0xdata Inc) 5:05pm Tuesday, 10/29/2013 Data Science Beekman Parlor – Sutton North Data Modeling has been constrained through scale; Sampling still rules the day for Adhoc Analytics. Scale brings much needed change t...
NYC Big Data Meetup - Distributed Random Forest, GBM, GLM & API for Big Data Algos
Distributed Machine Learning has come of age. Just in time to meet the challenges of Big Data, we present an API for extending and rolling your own Algorithms or using powerful contest-winning Gradient Boosting Machine, Generalized Linear Modeling and Random Forest at scale. Demo and Fireworks using big datasets from within the familiar...
GBM on Ecology - Recreating a model made for R
In the last couple of weeks we’ve had two meetups on GBM (gradient boosted classification and regression), and hence a lot of excitement about running the algorithm as presented by Cliff, Earl and Dr. Hastie. You can find the hella cool videos of both presentations here: http://www.youtube.com/0xdata One of my favorite articles on GBM ...
Join Us Tomorrow at Trulia - Distributed GBM!
Hi hackers! Just a quick reminder we’ll be joining our friends at Trulia tomorrow for a meetup on machine learning discussing Distributed GBM. GBM is one of the most popular machine learning algorithms used in data mining competitions. Most of us use GBM through the R implementation. However, we have recently written a distributed version fo...
H2O & LiblineaR: A tale of L2-LR
tl;dr: H2O and LiblineaR have nearly identical predictive performance. Overview: In this blog, we examine the single-node implementations of L2-regularized logistic regression (LR) by H2O and LiblineaR. Both LibR and H2O are driven from the R console on the same hardware and evaluated on the same datasets. We compare regression coeffici...
0xdata + Vendavo = Awesome
For those of you who missed our recent meetup at Vendavo: our data scientist Earl Hathaway and our CTO & Architect of Distributed Gradient Boosting, Cliff Click, spoke on GBM, and it was (without exaggeration) totally awesome! Eric, the Algorithms and Data Science guru at Vendavo, and their hacker-CEO, Neil Lustig, have been partnering with us d...
Running a GLM Model in H2O + R (notes from the hands-on meetup Sept. 26)
This is a walkthrough of running H2O through R. Before you get started you will need three things: R (a recent version), H2O (which you can get through GitHub: https://github.com/0xdata/h2o or directly from our website: http://0xdata.com/h2O/), and the h2oWrapper R package, which is the tool that makes H2O talk to R, and lets you talk to ...
Hands on Workshop: Hack Big Data With Math
Thursday night (September 26) at 7, resident Math Hackers will demonstrate a hands-on attitude combined with unencumbered brain-power. Wielding powerful 0xdata machine learning technology at their fingertips, they will show you how to pull predictions from a gigantic distributed heap of Java built vectors. Bring your laptop, we have Wi-Fi ...
Gradient Boosting Machine in III Acts: Trevor Hastie, Netflix & 0xdata
Gradient Boosting Machine in III Acts: Dr. Trevor Hastie, Netflix & 0xdata. Triple Header on Boosting & GBM: Act I: Trevor Hastie, of Stanford Mathematical Sciences, the mathematician behind Lasso & GBM, speaks of the nuances of the Algorithm. Act II: Cliff Click, CTO of 0xdata, the implementor of parallel and distributed GB...
Even More MNIST
Since we've been fooling around with the MNIST data set quite a bit lately (Spence is using it in benchmarking), I've been following the leaderboard and methods for the ongoing Kaggle competition around the same data. It's really amazing to see what people come up with. But of course, the purpose of H2O is entirely that one need not devo...
Replay: Modeling MNIST With RF Hands-on Demo
Last week Spencer put together a great hands-on session for modeling data using H2O (http://www.meetup.com/H2Omeetup/). This post is a write-up of the workflow for generating an RF model on MNIST data for those of you who want to walk through the demo again, or maybe missed the live action version. I’m running through one of our local servers, ...
Hands on Workshop: Hack Data With Math
Thursday night (August 29) at 7, resident math hacker Spencer A. is leading a hands-on workshop on using H2O to analyze real-world data. For those of you who are new to the math side of H2O, we have notes below to help you get prepared. H2O is a distributed math platform featuring a set of analytical tools that can be accessed through an ...
Picture it: H2O and R
August 22, 2013 | Uncategorized [EN] | Picture it: H2O and R
Big Data Science in H2O with R
Big Data Science with H2O in R from Anqi Fu. We had a great turnout at our Meetup last night! We took a look at the H2O/R API, then dove right into a hands-on demo, where we imported, cleaned, and ran GLM on the airlines data set in H2O using R commands. Here are the slides from my talk, and interested users can take a look at the ...
Public Data Sets
For your data analysis pleasure, I give you a giant list of super cool publicly available data. If you’re looking at the data sets and wondering “now what?” – you can find this list AND tutorials on how to use H2O for analysis at the H2O docs page (here: http://docs.0xdata.com). You can also get a detailed hands-on experience analyzing a...
TCP Is Not Reliable
Been too long between blogs… “TCP Is Not Reliable” – what's THAT mean? Means: I can cause TCP to reliably fail in under 5 mins, on at least 2 different modern Linux variants and on modern hardware, both in our datacenter (no hypervisor) and on EC2. What does “fail” mean? Means the client will open a socket to the server, write a bunch of st...
Run H2O From Within R
With the REST API, it’s simple to run H2O operations from within R using similar syntax to all your favorite R functions. In this post, we’ll walk through a simple demo of its capabilities. First, get H2O installed and running by following the tutorial here. Once you have the R package loaded, you can take a look at the included demos by...
Use R to run Better Algorithms on Big Data
Our resident R users will demonstrate how to use the R package and invoke big data modeling entirely from R. In this session our resident R & Math hacker, Anqi Fu will demonstrate the R API for H2O. Early users, community and customers of H2O have been invoking GLM, Random Forest and K-means from an RConsole or RStudio. In this meetu...
Random Forest Measurements for the MNIST Dataset
This post discusses the performance of H2O’s Random Forest [5] algorithm. We compare different versions of H2O as well as the RF implementation by wise.io. We use wall-clock time to measure workflows that match up with the user experience. A link to the scripts used is available here [1]. Specifications: Hardware: Amazon EC2 in US-EAS...
We the people: Our meetup member introductions
You may have noticed that we have a ton of stuff going on at 0xdata, including several upcoming meetups that I expect will be very well attended. I was feeling a little curious about who exactly would be attending. What are the common areas of interest? Are our members mostly software people or data scientists? Anyhow, I find that when I ...
Hey good looking; Visualization and Data Mining 1
I recently came across an article by Shaw et al. in Decision Support Systems (1). The article discussed the importance of data mining and information management to good customer relationship management in increasingly competitive markets. A key point of the paper that I agree with is the importance of heuristics in data mining, particular...
Big Data Cloud Computing Streaming Systems & Infrastructures
Big Data Science at Frontier Real Time Streaming Meetup. 250 Big Data enthusiasts have signed up for a Saturday presentation! Looks like it's going to be quite an interesting presentation and panel! ...
Implement a Machine Learning Algorithm in 2hrs
We will take a simple yet popular & powerful math algorithm such as Linear Regression and implement a distributed version in 2hrs. Pre-requisites: Knowledge of Java or R. See: http://h2o.0xdata.com/ Warning: Only software programmers ignore Warnings! That said, this seriously is a very hands-on, Java-intense exercise. Extinguished en...
GLM Bells and Whistles Part 2: Analysis and Results from Million Songs Data
Using the Million Songs Data we want to characterize a subset of the songs. To do this we’re going to run a binomial regression in H2O’s GLM. The approach to characterizing songs from the 90’s is the same method you can apply to your own data to characterize your customers relative to some larger group. In turn, those findings can be app...
GLM and K means to find Social Response Bias - Dating and Fibbers
In any field where data collection is dependent on what your clients, customers, public, whoever … tell you, there’s the risk that people are big fat fibbers. This often happens because people respond the way they think they SHOULD rather than with their own personal truths. Social sciences and marketing people call this phenomenon soc...
Data Science is NOT Rocket Science - H2O at Big Data Cloud
DJ Das brings Sri to talk about H2O by 0xdata to the Big Data Cloud Meetup July 10, 2013. Venue: 3200 Coronado Drive, Santa Clara ...
Building A TB-Scale Math Platform @ Uberconf 2013, Denver
Building A TB-Scale Math Platform: Datasets have gotten to PB-scale, but the modeling you can do has been limited to a single node (e.g. R, SAS), stuck inside the database, or takes hours on Hadoop-like technologies. We have built a simple clustering package, and are using it to do distributed analytics on the sum of all RAM in a cluster...
Hands-on Data Science with H2O at GlobalBigDataConference
Experience a hands-on hack data session using H2O & R at BigDataBootCamp by GlobalBigDataConference. Every few months, Sridhar puts together a content-rich conference filled with a highly engaged audience. This weekend GlobalBigDataConference is doing a BigDataBootCamp – Tickets are on sale. Sri brings H2O & R to this audience, mun...
Running analysis on the right data!
All in the day: Anqi Fu, our wickedly smart Math & Data Science hacker-intern from Stanford this summer, was characterizing GLMNet in R on sparse data and comparing with other tools. We were using a data set predicting Two Bedroom median rent based on neighborhoods from huduser.org. DATA: http://www.huduser.org/portal/datasets/fmr/...
The MillionSongs Data Part 1: Bells and Whistles of GLM in H2O
Using the Million Songs Data Set I want to go from beginning to end through H2O's GLM tool. Note that the original data are large, so downloading and fiddling with the full data set can be quite painful if you just do it from your desktop; that said, you can find it here. It’s a good opportunity to take a really detailed look at H2O so ...
Age of the Intelligent Apps Ahead
The Age of the Intelligent Apps is here – Let's gear up! Businesses are continuously data from.. yes, applications & sensors. Applications are the key to data creation. The future of Applications is to analyze data in-motion – learn the rules of the game at creation, backed by a super-intelligent model from historical data! A powerfu...
H2O at the Hadoop Summit - Machine Learning Evening with Big Data Science
In a triple header with Mahout, Alpine Data and 0xdata, Sri presents at the Machine Learning Evening at the Hadoop Summit. Be sure to bring your Data Science hat! Topic: “Big Data + Better Algorithms ==> Better Predictions with H2O” Abstract: “H2O’s fast high scale open source algorithms are set to revolutionize Predictive Analytics. A...
Saving Big Data Science is Saving Science
For time is the ultimate non-renewable resource! Data Science represents the convergence of Domain knowledge, Data Collection and a series of hypotheses validated or invalidated by use of Math. And Big Data Science takes that one step further into the realm of massive datasets that become necessary and pre-condition in Science and Busines...
Hacking K-means with Cyprien Noel
Last night, 0x offices were well populated with some very bright programmers. Big thanks to Cyprien Noel, the 0x hacker who designed k-means. He led the group as we collaboratively worked through building the code underpinning K-means modeling. In case you missed the group last night – we’re doing it again. In about a month we’ll be goin...
Convert DOS to Unix - Insert Tab A into Slot B
Every day as part of my 0x immersion program one of our hackers tries to explain something he is working on – an especially beautiful bit of code or something about data science and how the mechanics of our project work, or whatever. Every day, at least once, I am completely confused. I realize that this must be exactly how someone who...
H2O and Big Data Meetup at Elance
While giving a talk at the Meetup at Elance Thursday night Chris Pouliot (Netflix’ lead analyst) commented that good analysis happens not when you have an army of clones, but when you have a diverse set of bright, engaged minds all willing to tackle a problem. No more than an hour later, Sri was reminding us that good solutions and grea...
Standardized Coefficients
One of the (few) downsides of being in the Bay is the completely absurd traffic. Perhaps I am a bit more sensitive to this than most, given my epic daily commute. While I am normally inclined to whine about my cross-bay traverse a little, yesterday it paid off. You see, I’m used to making sense of things in my own way – which doesn’t al...
BIG VS. LITTLE: P-Values and Coefficients
The Quick and Dirty: For the moment let’s assume that we have some a priori hypothesis that we want to test. We can talk about two things: how big the relationship is and how strong it is. P-values don’t care about big – they only care about strong. To get a sense for this, recall from ANOVA the fairly common test statistic F. We decide...
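To make the "big vs. strong" distinction concrete, here is a minimal sketch. It swaps the ANOVA F test mentioned above for a simpler two-sided z-test with a known standard deviation, and every number in it is a hypothetical value chosen only for illustration. The function name `z_test` is ours, not something from H2O or R.

```python
from math import sqrt
from statistics import NormalDist


def z_test(effect, sigma, n):
    """Two-sided z-test of H0: the true mean difference is zero.

    effect: observed mean difference (how "big" the relationship is)
    sigma:  assumed known standard deviation of a single observation
    n:      sample size
    Returns the z statistic and its two-sided p-value.
    """
    z = effect / (sigma / sqrt(n))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical scenario A: a tiny effect measured on a huge sample.
z_small, p_small = z_test(effect=0.01, sigma=1.0, n=1_000_000)

# Hypothetical scenario B: an effect 100x bigger measured on a tiny sample.
z_large, p_large = z_test(effect=1.0, sigma=1.0, n=4)

print(f"tiny effect, n=1,000,000: z={z_small:.2f}, p={p_small:.3g}")
print(f"big effect,  n=4:         z={z_large:.2f}, p={p_large:.3g}")
```

The tiny effect ends up with the far smaller p-value, because the p-value tracks how much evidence there is against zero rather than how large the coefficient is.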
Chocolate Cake
Chocolate Cake (Wednesday, June 5, 2013) You know how sometimes you have one bite of really good chocolate cake, or a really amazing peach and totally assume that you could eat another 30lbs of whatever without regard for good manners or physical limitations? Yeah. Decreasing marginal returns dictate that it almost always turns out th...
Data Science is NOT Rocket Science
Finding myself at 0x is a lot less like starting fresh in a new profession and more like choosing cultural expatriation – it is a whole new (beautiful) world. On my first day everyone spoke what I was relatively sure should be English, but it felt like they were actually speaking in their own dialect (which I’ve come to think of as Hexpe...
Meetup: Distributed Random Forest at SF Data Mining
Come watch Jan Vitek present Distributed Random Forest at the SF Data Mining group. ...
Better Big Data Algorithms with H2O by 0xdata
Manhattan loves data + math better than anyone! Join us on our first New York City meetup talking high-scale algos at Pivotal Labs, Union Sq, NYC. Cliff and I will walk through a Big GLM over large datasets and deep dive into parallelizing and distributing algorithms over distributed array-let data structures. ...
Big Data Science Practice + Algo Implementation
In this double header we present a practitioner’s close view of the science and an engineer’s close view of the design and implementation of distributed algorithms. Day in the Life of a Data Scientist – Chris Pouliot: In this session, Netflix analytical leader Chris Pouliot shares his experience building a large team of data scientists at Netfli...
H2O Hack Data Meetup
Hack Data with Math, H2O Meetup: We derive insights from the Airline Dataset. We analyze the airline takeoff and landing dataset of the past 20 years and infer how flying has changed (more delays, different airports) after 9/11. ...
Hack Data with Math using H2O - Silicon Valley Big Data Science Meetup at Google
Thanks for attending! Presentations: Cliff’s H2O and API for Big Data Math Talk; Jan Vitek’s talk on Distributed Random Forest. Cliff & Jan will present a deep dive into H2O and Hacking Big Data with Math. We locked down the Venue – Google, Building 43, 1600 Amphitheatre Parkway, Mountain View, CA, 94040. Can’t wait for the fireworks...
Hack Airline Data with Math
Last Thursday of the month, April 25, 2013, is here! It’s BigDataWeek. Join us on our monthly open house and meet the artists and hackers behind H2O. This time we are hacking the airline dataset! “Have you ever been stuck in an airport because your flight was delayed or cancelled and wondered if you could have predicted it if you’d had ...
H2O does BigDataWeek at SF Data Mining
Todd Holloway hosts the SF Data Mining meetup at Trulia and brings a lot of goodness to the community of data scientists here. We are fortunate to present for his group and bigdataweek. 0xdata’s own SriSatish Ambati will be giving a talk on H2O. Sri’s talk will dive into scaling GLM (Generalized Linear Model), Random Forest, and other...
Time is ripe for a revolution in Math for Big Data!
Data has always been with us. Every time we as a race complained about data, a new kind of math evolved to crush the scourge of BigData. Whether it was Newton with Calculus or Einstein with relativity or Shannon with Information theory. Our generation’s response to BigData is due. The time is ripe. For a revolution in Math. One that opens ...
H2O at Predictive Analytics World Conference in SF
Join H2O and the 0xdata team at the Predictive Analytics World conference in San Francisco, CA on April 15 – 16, 2013. Meet us at the 0xdata booth in the Exhibitor Center at PAW where we will be demoing H2O hacking large data sets. Not to mention showing off our latest video. Be sure to look out for great talks from Netflix’s Chris Poulio...
Predicting Airline Data using a Generalized Linear Model (GLM)
Just recently I created a wiki post on the H2O GitHub page with step-by-step directions on how to predict if a flight’s arrival would be delayed or not. I essentially uploaded airline data from the American Statistical Association to H2O and used GLM (also known as generalized linear model, logistic regression, or logit regression) to...
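For readers who want a feel for what a binomial GLM does with a delayed/not-delayed target, here is a minimal sketch in plain Python. It is not H2O's GLM or its solver: it fits a one-feature logistic regression by batch gradient descent, and the toy data, the function names (`sigmoid`, `fit_logistic`) and the learning-rate settings are all assumptions made up for illustration.

```python
from math import exp


def sigmoid(t):
    """Logistic link: maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-t))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent
    on the average log-loss. Returns the intercept b0 and slope b1."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of log-loss w.r.t. the score
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


# Hypothetical toy data (not the ASA airline data):
# departure delay in hours -> arrival delayed (1) or on time (0).
dep_delay = [-0.2, 0.0, 0.1, 0.2, 0.4, 0.5, 0.8, 1.0, 1.5, 2.0]
arr_late = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

b0, b1 = fit_logistic(dep_delay, arr_late)
print(f"intercept={b0:.2f}, slope={b1:.2f}")
print(f"P(late | left 2h late)    = {sigmoid(b0 + b1 * 2.0):.3f}")
print(f"P(late | left 12min early) = {sigmoid(b0 + b1 * -0.2):.3f}")
```

On real airline data you would use many predictors (carrier, origin, day of week and so on) and a proper GLM implementation; the sketch only shows the shape of the model being fitted.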