June 24th, 2013

Saving Big Data Science is Saving Science

For time is the ultimate non-renewable resource!

Data Science represents the convergence of domain knowledge, data collection, and a series of hypotheses validated or invalidated with math. Big Data Science takes that one step further, into the realm of massive datasets that are becoming a necessity and a precondition in both science and business. So the business of making this work is the business of doing science, from the scientific process to decision making and solving business problems. Removing the drudgery from Data Science frees the brightest minds of our time to ask BIG questions and refine their hypotheses a dozen times a minute, just as search by Google let each of us use the world's information better than ever before. Big Data Science is set to change the world in so many dimensions!
(Reprinted with permission from the experiments of our math geek!) The message below is emblematic of the issues with the state of R, the lingua franca of data science.

---------- Forwarded message ----------
From: Irene Lang <irene@0xdata.com>
Date: Mon, Jun 24, 2013 at 2:44 AM
Subject: Re: more workloads/datasets
To: SriSatish Ambati <srisatish@0xdata.com>

OK. I need your help with this, please.
I can no longer run even really moderately sized datasets on my laptop. For example – I tried running a straightforward glm validation on a 2MB dataset after generating a model using a training set, and I finally gave up at about 2:30AM after letting it run all afternoon because I needed to get something done today.
Doing this sort of thing on my machine is slowing me down because I don't have the memory to do anything quickly, and it means that I end up taking hours to run a single test, limiting my ability to play with the data in any meaningful way. It also means that I've effectively paper-weighted my computer for several hours, which is kind of painful in terms of getting other tasks done while tests are running: because apparently my whole memory is in use, other functions are nil. So, I know the limitation is my computer and not R, and I know that there has to be a relatively straightforward solution to this. I imagine that if I could run R either on the server or on Amazon Web Services, I wouldn't have to worry about immediately replacing my computer with one that has more memory (which isn't my first choice, because this one is otherwise perfectly good).
So this experiment will now continue on EC2 and a bigger server. However, the problem is not her computer.
