December 31st, 2013

Pathology of Data


Stephen Boyd's favorite advice for summarizing a dataset at hand: “Understand the pathology of data. Sometimes it's not the pathology.” Often it's the structure: dimensions, factors, outliers, and principal components.

This is very much what data scientists want from ad hoc analytics: scope the data from enough angles, and with different tools, to get real intuition about its structure. This often comes long before any advanced algorithms are run.
Like Linus Pauling, look for forces and bonds within the data (and gather context by fusing more sources), then fire up the imagination to probe and ask questions, leading to insights that drive business decisions. An immediate consequence of fusing multiple data sources is the curse of dimensionality.
One simply has far more informative dimensions about one's customers these days. Knowing the best 100 of them would enable faster categorization and modeling. And this pathology can show up in simple and subtle ways, for example:
Single Feature Characteristics
Many single-feature characteristics are useful, including range, mean, standard deviation, distribution, and scatter plots.
Is it a constant column? Or mostly missing elements / NAs?
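The single-feature checks above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not a method from the original post: the function name `summarize_column` and the choice to treat `None` and NaN as missing are assumptions made for the example.

```python
import math
import statistics


def summarize_column(values):
    """Summarize one numeric column; None and NaN are counted as missing.

    Hypothetical helper for illustration only.
    """
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    n = len(values)
    summary = {
        "count": n,
        "missing_fraction": (n - len(present)) / n if n else 0.0,
        # Zero or one distinct observed value: the column carries no signal.
        # Note that an all-missing column is also flagged as constant here.
        "is_constant": len(set(present)) <= 1,
    }
    if present:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.fmean(present)
        # Sample standard deviation needs at least two observations.
        summary["std"] = statistics.stdev(present) if len(present) > 1 else 0.0
    return summary


if __name__ == "__main__":
    print(summarize_column([5, 5, 5, None]))
    print(summarize_column([1, 2, 3, 4]))
```

A column flagged as constant or mostly missing is usually a candidate for dropping before any modeling, though what counts as "mostly missing" depends on the dataset.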
Multi-Feature & Inter-Feature Characteristics
Which features are nearly identical or share a linear relationship? (e.g., delay vs. arrival_time and departure_time)
Which features share a non-linear relationship?
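One rough way to probe both questions is to compare two correlation measures per feature pair. The sketch below is an assumption-laden illustration, not the post's own method: Pearson correlation near ±1 suggests a near-linear (or near-identical) pair, while a Spearman rank correlation noticeably stronger than Pearson hints at a monotonic but non-linear relationship. The helper names are hypothetical, and pairwise checks like these will not catch non-monotonic relationships or relationships that involve three or more columns at once.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; report 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    linear = [2 * v + 1 for v in x]  # exactly linear in x
    cubic = [v ** 3 for v in x]      # monotonic but non-linear in x
    print("linear:", pearson(x, linear), spearman(x, linear))
    print("cubic: ", pearson(x, cubic), spearman(x, cubic))
```

In practice one would run this over every pair of columns and review the pairs whose absolute correlation exceeds some threshold; the threshold itself is a judgment call.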
And how do those relationships and feature characteristics influence the inquiry into the dataset at hand? Machine learning can help. So can big data: the regularizing effect of more data is hard to dispute.
It's a slick mystery: different features intertwined in your data like characters in a Hitchcock thriller. Dial 'M' for Model.
