August 1st, 2013

Hey good looking; Visualization and Data Mining 1


I recently came across an article by Shaw et al. in Decision Support Systems (1). The article discusses the importance of data mining and information management to good customer relationship management in increasingly competitive markets. A key point of the paper that I agree with is the importance of heuristics in data mining, particularly in markets where domain-specific information is critical to data interpretation.

Slick, beautiful graphic presentations of results get a lot of love. The tools used for data mining get less attention, but are pretty important to analysts. Instead of the traditional list and review of tools, I’ve taken the 1-million-observation MovieLens data set and produced some visualizations that are meant to ask some basic questions of the data, as part of a first pass in the exploratory process.
This is just the start of a series of posts. Over the next couple of weeks I hope to cover some of the most useful (and straightforward) tools in ggplot2 and lattice to analyze the MovieLens data produced by GroupLens. There are other tools out there, but these are two of the most common, and the examples here should be easy to reproduce.

Problem:

The data are from http://www.grouplens.org.
There are 1,000,209 total observations collected from 6,040 individuals on 3,952 movies.
The three original files have been merged into one data frame, so that each observation consists of a rating for a single movie by a single individual, inclusive of both the individual level attributes and the movie level attributes.
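The merge described above can be sketched with base R's merge(). This is only a toy illustration: the three small data frames stand in for the users, movies and ratings files, and the column names (UserID, MovieID, GENDER, AGE, OCCUPATION, TITLE, GENRES, RATING) are my assumptions, not necessarily the names used in ml.fact.t.

```r
# Toy stand-ins for the three MovieLens files; all column names are assumptions.
users <- data.frame(
  UserID     = c(1, 2, 3),
  GENDER     = c("F", "M", "M"),
  AGE        = c(25, 35, 18),
  OCCUPATION = c(7, 12, 4)
)
movies <- data.frame(
  MovieID = c(10, 20),
  TITLE   = c("Movie A", "Movie B"),
  GENRES  = c("Action", "Comedy|Romance")
)
ratings <- data.frame(
  UserID  = c(1, 1, 2, 3),
  MovieID = c(10, 20, 10, 20),
  RATING  = c(5, 3, 4, 2)
)

# Join ratings to users, then to movies: one row per (user, movie) rating,
# carrying both the user-level and the movie-level attributes.
merged <- merge(merge(ratings, users, by = "UserID"), movies, by = "MovieID")

# When every key matches, the merged frame has exactly one row per rating.
nrow(merged) == nrow(ratings)
```

Checking the row count against the ratings file is a cheap way to catch keys that failed to match.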
For now, we’ll just look at the users themselves. This sort of question falls into what the marketing literature calls descriptive exploratory analysis. It’s a precursor to market segmentation, and useful in better understanding who customers are. For instance, managers in general tend to be older than the average population (because it takes a while to work your way up the ranks). Managers who subscribe to MovieLens are younger than the average manager.
In which occupations are customers predominantly employed? Are there occupational differences by gender?
It’s not surprising that most people chose not to report their occupation, nor that a large number of customers are also students. What is surprising is the high number of customers who are professionals: programmers, managers, engineers, lawyers and healthcare workers.
Women.oc histogram
Men.oc histogram

library(ggplot2)
# set up the plot (ml.fact.t is the original data frame; namedOC is a column with alpha occupation codes instead of numeric codes)
oc <- ggplot(ml.fact.t, aes(factor(namedOC)))
# build a bar chart
oc + geom_bar()

For just women:

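# slice the data to include just women
# (columns are referenced through ml.fact.t so this runs without attach())
women.oc <- ml.fact.t$namedOC[ml.fact.t$GENDER == "F"]
# set up the plot for the women-only vector
w.oc <- ggplot(data = NULL, aes(women.oc))
# build a bar chart
w.oc + geom_bar()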
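The gender question can also be looked at as a plain cross-tabulation. Here is a minimal sketch on made-up data. The UserID column is an assumption on my part, and the de-duplication step matters only if you want to count people rather than ratings (the merged data frame has one row per rating, so a heavy rater would otherwise be counted many times).

```r
# Hypothetical miniature of the merged data: one row per rating.
toy <- data.frame(
  UserID  = c(1, 1, 2, 3, 3, 3, 4),
  GENDER  = c("F", "F", "M", "M", "M", "M", "F"),
  namedOC = c("mgr", "mgr", "student", "mgr", "mgr", "mgr", "student")
)

# Keep one row per user so that heavy raters are not counted many times.
toy.users <- toy[!duplicated(toy$UserID), ]

# Counts of users by occupation (rows) and gender (columns).
counts <- table(toy.users$namedOC, toy.users$GENDER)

# Share of each occupation within each gender (every column sums to 1).
shares <- prop.table(counts, margin = 2)
round(shares, 2)
```

Comparing the columns of shares puts the two genders on the same scale, which raw bar heights do not when one group is much larger than the other.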

A surprising number of individuals self-report as managers. Because managers tend to have higher levels of disposable income, they may be an important group to target.

A very quick histogram of managers by age produces the following chart. A first-pass visual inspection suggests that one of two things is happening here: either young managers disproportionately prefer MovieLens (i.e., there is some age bias within the subset of managers), or there is some measurement error in self-reporting, with individuals who aren’t completely honest about their age, their occupation, or both.
Histogram of mngrd

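# pull the ages of just managers
mngrd <- ml.fact.t$AGE[ml.fact.t$namedOC == "mgr"]
# make a quick histogram
hist(mngrd)
# examine by gender: managers' ages cross-tabulated against managers' gender
table(mngrd, ml.fact.t$GENDER[ml.fact.t$namedOC == "mgr"])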
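One rough way to probe the age-bias idea is to compare the age mix of managers with the age mix of everyone in the same data. The sketch below uses made-up rows and assumes AGE holds the bucket codes used in the MovieLens 1M files (1, 18, 25, 35, 45, 50, 56) rather than exact ages. If your AGE column is coded differently, the levels would need to change.

```r
# Hypothetical miniature of per-user data; AGE holds bucket codes (assumption).
toy <- data.frame(
  AGE     = c(18, 25, 25, 35, 35, 45, 25, 56),
  namedOC = c("student", "mgr", "mgr", "mgr", "other", "other", "student", "other")
)

# Fix the bucket order so both tables line up, including empty buckets.
age.levels <- c(1, 18, 25, 35, 45, 50, 56)
age <- factor(toy$AGE, levels = age.levels)

# Share of each age bucket among everyone and among managers only.
all.share <- prop.table(table(age))
mgr.share <- prop.table(table(age[toy$namedOC == "mgr"]))

# Rows are groups, columns are age buckets.
comparison <- rbind(all = all.share, mgr = mgr.share)
round(comparison, 2)

# Optional chart of the two distributions side by side:
# barplot(comparison, beside = TRUE, legend.text = TRUE)
```

This only compares managers with other users in the data. It says nothing about managers in the wider population, and it cannot separate real age bias from misreported ages or occupations.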

If you want to market specifically to managers, you would want to know what sorts of movies they like.
This requires a little bit of math (which is all pretty easy in R). I pulled the total number of movies watched by managers from each genre. Dividing the number of those movies that managers actually liked by the total number of times managers picked the genre gives the percentage of movies actually liked.
It’s incredibly simple, but it’s also easy to talk about, to interpret, and to use as a starting point for more questions.
An auxiliary benefit of this type of quick and dirty look at the data is that it gives us common sense checks for any recommender system that we come up with. At a high level, a recommender that isn’t suggesting that managers in general watch action movies probably needs a little tweak, or at least you should ask why.
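The arithmetic behind the percentages can be sketched as follows. The data are made up, the GENRE and RATING column names are assumptions, and so is the cutoff for "liked" (a rating of 4 or 5). The post does not say which threshold was actually used.

```r
# Hypothetical ratings by managers, one row per (rating, genre) pair.
mgr.ratings <- data.frame(
  GENRE  = c("Action", "Action", "Action", "Comedy", "Comedy", "Drama"),
  RATING = c(5, 4, 2, 3, 4, 5)
)

# Assumption: a rating of 4 or 5 counts as "liked".
mgr.ratings$liked <- mgr.ratings$RATING >= 4

# Per genre: how often it was picked, and how often it was liked.
total       <- tapply(mgr.ratings$liked, mgr.ratings$GENRE, length)
highlyRated <- tapply(mgr.ratings$liked, mgr.ratings$GENRE, sum)

# Summary frame with genre names, totals, highly rated counts and percents.
genre.summary <- data.frame(
  names       = names(total),
  total       = as.vector(total),
  highlyRated = as.vector(highlyRated),
  percent     = as.vector(100 * highlyRated / total)
)
genre.summary
```

Keeping the totals next to the percentages is useful, since a genre with only a handful of ratings can show an extreme percentage that means very little.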
Percentlike
# make a plot (plot() here is the base R graphics function). I should note that I created a data frame with the names of the genres shown, the totals, total highly rated, and percents, which were previously calculated using basic arithmetic; names and percent are columns of that data frame.

plot(factor(names), percent)

Two quick things. The first is that we have barely scratched the surface of visualization. The second is that the citation for the original article is given below. It’s totally worth a read.
(1) Michael J. Shaw, Chandrasekar Subramaniam, Gek Woo Tan, Michael E. Welge, Knowledge management and data mining for marketing, Decision Support Systems, Volume 31, Issue 1, May 2001, Pages 127-137.
