June 24th, 2013

Saving Big Data Science is Saving Science


For time is the ultimate non-renewable resource!

Data science is the convergence of domain knowledge, data collection, and a series of hypotheses that math validates or invalidates. Big data science takes this one step further, into the realm of massive datasets that have become a necessary precondition for work in both science and business. So the business of making this work is the business of doing science, from the scientific process itself to decision making and solving business problems. Removing the drudgery from data science frees the brightest minds of our time to ask BIG questions and refine their hypotheses a dozen times a minute, just as Google search let each of us use the world's information better than ever before. Big data science is set to change the world in so many dimensions!
(Reprinted with permission from the experiments of our math geek!) The email below is emblematic of issues with the state of R, the lingua franca of data science.

---------- Forwarded message ----------
From: Irene Lang <irene@0xdata.com>
Date: Mon, Jun 24, 2013 at 2:44 AM
Subject: Re: more workloads/datasets
To: SriSatish Ambati <srisatish@0xdata.com>

OK. I need your help with this, please.
I can no longer run even moderately sized datasets on my laptop. For example, I tried running a straightforward glm validation on a 2MB dataset after generating a model using a training set, and I finally gave up at about 2:30AM after letting it run all afternoon, because I needed to get something done today.
Doing this sort of thing on my machine is slowing me down because I don’t have the memory to do anything quickly, and it means that I end up taking hours to run a single test, which limits my ability to play with the data in any meaningful way. It also means that I’ve effectively paper-weighted my computer for several hours, which is kind of painful in terms of getting other tasks done while tests are running: because apparently my whole memory is in use, other functions are nil. So, I know the limitation is my computer and not R, and I know that there has to be a relatively straightforward solution to this. I imagine that if I could run R either on the server or on Amazon Web Services, I wouldn’t have to worry about immediately replacing my computer with one that has more memory (which isn’t my first choice, because this one is perfectly good otherwise).
So this experiment will now continue on EC2 and a bigger server. However, the problem is not her computer.
