December 31st, 2013

Pathology of Data


Stephen Boyd's favorite way of summarizing a dataset at hand: “Understand the pathology of data. Sometimes it's not the pathology.” It's structure: dimensions, factors, outliers and principal components.

This is very much what data scientists want from ad hoc analytics: scope the data from enough angles, and with different tools, to build real intuition about its structure. This often comes long before any advanced algorithms are run.
Like Linus (Pauling), look for forces and bonds within the data (and gather context by fusing more sources), then fire up the imagination to probe and ask, leading to insights that drive business decisions. An immediate consequence of fusing multiple data sources is the curse of dimensionality.
One simply has far more informative dimensions about one's customers these days. Knowing the top 100 good ones would enable faster categorization and modeling. And this pathology can show up in simple and subtle ways, for example:
Single Feature Characteristics
Many single-feature characteristics are useful, including range, standard deviation, mean, distribution, and scatter plots.
Is it a constant column? Or mostly missing elements / NAs?
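The single-feature checks above can be sketched in a few lines of plain Python. This is only an illustration, not a method from the post: the function names `summarize_column` and `flag_column` and the 0.5 missing-value cutoff are assumptions made for the example.

```python
import math
import statistics


def summarize_column(values):
    """Summarize one feature. None and NaN are treated as missing."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    n = len(values)
    summary = {
        "count": n,
        "missing_fraction": (n - len(present)) / n if n else 0.0,
        # A column with zero or one distinct non-missing value carries no signal.
        "is_constant": len(set(present)) <= 1,
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    # Only report numeric statistics when every present value is numeric.
    if numeric and len(numeric) == len(present):
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        summary["mean"] = statistics.fmean(numeric)
        summary["stdev"] = statistics.stdev(numeric) if len(numeric) > 1 else 0.0
    return summary


def flag_column(summary, max_missing=0.5):
    """Return warning labels; max_missing is an arbitrary illustrative cutoff."""
    flags = []
    if summary["is_constant"]:
        flags.append("constant")
    if summary["missing_fraction"] > max_missing:
        flags.append("mostly_missing")
    return flags
```

Running `flag_column(summarize_column(col))` over every column gives a quick list of candidates to drop or inspect before any modeling starts.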
Multi-Feature & Inter-feature Characteristics
Which features are nearly identical or share a linear relationship? (e.g., delay vs. arrival_time and departure_time)
Which features share a non-linear relationship?
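One hedged way to probe both questions is to compare Pearson correlation (sensitive to linear relationships) with Spearman rank correlation (sensitive to any monotonic relationship). The sketch below is an assumption-laden illustration, not the post's method: the helper names and the 0.99 threshold are made up for the example, and it only looks at pairs of columns, so a three-way relation such as delay against both arrival_time and departure_time would need a separate check.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns NaN if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def nearly_linear_pairs(columns, threshold=0.99):
    """List column pairs whose |Pearson r| meets an illustrative threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if not math.isnan(r) and abs(r) >= threshold:
                pairs.append((a, b))
    return pairs
```

A pair with a high Spearman value but a noticeably lower Pearson value is a hint of a monotonic, non-linear relationship worth plotting. Non-monotonic relationships (for example, a U shape) would slip past both measures and need other tools.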
And how do those relations and feature characteristics influence the inquiry about the dataset at hand? Machine learning can help. So can big data: its regularization effects are irrefutable.
It's a slick mystery: Different features intertwined in your data like characters in a Hitchcock thriller. Dial 'M' for Model.
