December 31st, 2013

Pathology of Data


Stephen Boyd's favorite way of summarizing a dataset at hand: “Understand the pathology of data. Sometimes it's not the pathology.” It's structure: dimensions, factors, outliers and principal components.

It's very much what data scientists want from ad hoc analytics: scope the data from enough angles, and with different tools, to get real intuition about its structure. This often comes long before any advanced algorithms are run.
Like Linus Pauling, look for forces and bonds within the data, and gather context by fusing more sources. Then fire up the imagination to probe and ask questions, leading to insights that drive business decisions. An immediate consequence of fusing multiple data sources is the curse of dimensionality.
One simply has far more informative dimensions about one's customers these days. Knowing the top 100 good ones would enable faster categorization and modeling. And this pathology can show up in simple and subtle ways, for example:
Single Feature Characteristics
Many single-feature characteristics are useful, including range, mean, standard deviation, distribution, and scatter plots.
Is it a constant column? Or mostly missing elements / NAs?
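The single-feature checks above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference to any particular tool: the function name `profile_column`, the use of `None` to mark a missing value, and the 50% cutoff for "mostly missing" are all assumptions made for the example.

```python
import statistics


def profile_column(values, missing_threshold=0.5):
    """Summarize one numeric column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    n = len(values)
    missing_fraction = (n - len(present)) / n if n else 0.0
    summary = {
        "missing_fraction": missing_fraction,
        # The threshold is an illustrative choice, not a standard.
        "mostly_missing": missing_fraction > missing_threshold,
        # A column with at most one distinct observed value carries no signal.
        "is_constant": len(set(present)) <= 1,
    }
    if present:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.fmean(present)
        # Sample standard deviation needs at least two observations.
        summary["stdev"] = statistics.stdev(present) if len(present) > 1 else 0.0
    return summary


if __name__ == "__main__":
    print(profile_column([5, 5, 5, None]))
    print(profile_column([1, 2, 3]))
```

Running a profile like this over every column is a cheap first pass: constant and mostly-missing columns can usually be set aside before any modeling starts.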
Multi-Feature & Inter-feature Characteristics
Which features are nearly identical or share a linear relationship? (e.g., delay vs. arrival_time and departure_time)
Which features share a non-linear relationship?
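One hedged way to probe the two questions above is to compare Pearson correlation, which measures linear association, with Spearman rank correlation, which also picks up monotone non-linear association. The sketch below is written from scratch with the standard library; the function names are illustrative, and Spearman is only one possible probe, since it will miss non-monotone relationships such as a U-shape.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a constant column
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    linear = [2 * v + 1 for v in x]  # exact linear function of x
    cubic = [v ** 3 for v in x]      # monotone but non-linear in x
    print(pearson(x, linear), pearson(x, cubic), spearman(x, cubic))
```

A pair with Pearson correlation near 1 or -1 is a candidate for redundancy. A pair where Spearman is clearly higher than Pearson hints at a monotone non-linear relationship worth plotting.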
And how do those relations and feature characteristics influence the inquiry about the dataset at hand? Machine learning can help. So can big data: the regularization effects of big data are irrefutable.
It's a slick mystery: different features are intertwined in your data like characters in a Hitchcock thriller. Dial 'M' for Model.
