

Saving Big Data Science is Saving Science


By H2O.ai Team | June 24, 2013


For time is the ultimate non-renewable resource!

Data Science represents the convergence of domain knowledge, data collection, and a series of hypotheses validated or invalidated with math. Big Data Science takes that one step further, into the realm of massive datasets that have become a necessary precondition in science and business. So the business of making this work is the business of doing science, from the scientific process to decision making and solving business problems. Removing the drudgery of Data Science frees the brightest minds of our time to ask BIG questions and refine their hypotheses a dozen times a minute, just as search by Google let each of us use the world's information better than ever before. Big Data Science is set to change the world in so many dimensions!
(Reprinted with permission from the experiments of our math geek!) The email below is emblematic of the issues with the state of R, the lingua franca of data science.

---------- Forwarded message ----------
From: Irene Lang <irene@0xdata.com>
Date: Mon, Jun 24, 2013 at 2:44 AM
Subject: Re: more workloads/datasets
To: SriSatish Ambati <srisatish@0xdata.com>

OK. I need your help with this, please.
I can no longer run even moderately sized datasets on my laptop. For example, I tried running a straightforward glm validation on a 2MB dataset after generating a model using a training set, and I finally gave up at about 2:30 AM after letting it run all afternoon, because I needed to get something done today.
Doing this sort of thing on my machine is slowing me down because I don’t have the memory to do anything quickly. I end up taking hours to run a single test, which limits my ability to play with the data in any meaningful way. It also means that I’ve effectively paper-weighted my computer for several hours, which is kind of painful in terms of getting other tasks done while tests are running: apparently my whole memory is in use, so other functions are nil. So, I know the limitation is my computer and not R, and I know that there has to be a relatively straightforward solution to this. I imagine that if I could run R either on the server or on Amazon Web Services, I wouldn’t have to worry about immediately replacing my computer with one that has more memory (which isn’t my first choice, because this one is perfectly good otherwise).
So, this experiment will now continue on EC2 and a bigger server. However, the problem is not her computer...

