
Welcome to the Community

We look forward to seeing what you make, maker!

Learn


Self-paced Courses

View All

 

Docs


Technical Documentation

View All

 


Blogs

Read All

 


YouTube

Watch All

 

H2O.ai Fights Fire Challenge

Help first responders and the public with new AI applications that can save lives and property.

Learn More

Find a Meetup Near You

March 3, 2022

 

White-box AutoML: Techniques for Creating Interpretable Models

January 11, 2022



Roundtable Discussion: Better Healthcare with AI Automated Workflow

November 23 - November 24, 2021


AI & Big Data Expo Europe
 

 

Dec 2nd, 2021

 

AI-Driven Manufacturing



View on Meetup

Slack Community

Discuss, learn about, and explore the H2O AI Cloud platform, products, and services with peers and H2O.ai employees.

Join the Slack Community

 


Stack Overflow

Aggregating Max using h2o in R

I have started using `h2o` for aggregating large datasets and I have found peculiar behaviour when trying to aggregate the maximum value using h2o's `h2o.group_by` function. My dataframe often has variables which comprise some or all NAs for a given grouping. Below is an example dataframe.

    df <- data.frame("ID" = 1:16)
    df$Group <- c(1,1,1,1,2,2,2,3,3,3,4,4,5,5,5,5)
    df$VarA <- c(NA_real_,1,2,3,12,12,12,12,0,14,NA_real_,14,16,16,NA_real_,16)
    df$VarB <- c(NA_real_,NA_real_,NA_real_,NA_real_,10,12,14,16,10,12,14,16,10,12,14,16)
    df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)

       ID Group VarA VarB VarD
    1   1     1   NA   NA   10
    2   2     1    1   NA   12
    3   3     1    2   NA   14
    4   4     1    3   NA   16
    5   5     2   12   10   10
    6   6     2   12   12   12
    7   7     2   12   14   14
    8   8     3   12   16   16
    9   9     3    0   10   10
    10 10     3   14   12   12
    11 11     4   NA   14   14
    12 12     4   14   16   16
    13 13     5   16   10   10
    14 14     5   16   12   12
    15 15     5   NA   14   14
    16 16     5   16   16   16

In this dataframe, Group == 1 is completely missing data for VarB (this is important information to know, so the output when aggregating for the maximum should be NA), while for Group == 1 VarA has only one missing value, so the maximum should be 3.

The behaviour of the `na.methods` argument is described here: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/groupby.html

If I set `na.methods = 'all'` as below, then the aggregated output is NA for Group 1 for both VarA and VarB (which is not what I want, but I completely understand this behaviour).

    h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "all"))

      Group max_ID max_VarA max_VarB max_VarD
    1     1      4      NaN      NaN       16
    2     2      7       12       14       14
    3     3     10       14       16       16
    4     4     12      NaN       16       16
    5     5     16      NaN       16       16

If I set `na.methods = 'rm'` as below, then the aggregated output for Group 1 is 3 for VarA (which is the desired output and makes complete sense), but for VarB it is -1.80e308 (which is not what I want, and I do not understand this behaviour).

    h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "rm"))

      Group max_ID max_VarA  max_VarB max_VarD
      <int>  <int>    <int>     <dbl>    <int>
    1     1      4        3 -1.80e308       16
    2     2      7       12  1.4 e  1       14
    3     3     10       14  1.6 e  1       16
    4     4     12       14  1.6 e  1       16
    5     5     16       16  1.6 e  1       16

I get the same output if I set `na.methods = 'ignore'`.

    h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "ignore"))

      Group max_ID max_VarA  max_VarB max_VarD
      <int>  <int>    <int>     <dbl>    <int>
    1     1      4        3 -1.80e308       16
    2     2      7       12  1.4 e  1       14
    3     3     10       14  1.6 e  1       16
    4     4     12       14  1.6 e  1       16
    5     5     16       16  1.6 e  1       16

I am not sure why something as common as completely missing data for a given variable within a specific group is being given a value of -1.80e308.

I tried the same workflow in dplyr and got results which match my expectations (but this is not a solution, as I cannot process datasets of this size in dplyr, hence my need for a solution in h2o). I realise dplyr is giving me `-Inf` values rather than NA, and I can easily recode both `-1.80e308` and `-Inf` to NA, but I am trying to make sure that this isn't a symptom of a larger problem in `h2o` (or that I am not doing something fundamentally wrong in my code when attempting to aggregate in `h2o`). I also have to aggregate normalised datasets which often have values approximately similar to -1.80e308, so I do not want to accidentally recode legitimate values to NA.

    library(dplyr)
    df %>% group_by(Group) %>% summarise(across(everything(), ~max(.x, na.rm = TRUE)))

      Group    ID  VarA  VarB  VarD
      <dbl> <int> <dbl> <dbl> <dbl>
    1     1     4     3  -Inf    16
    2     2     7    12    14    14
    3     3    10    14    16    16
    4     4    12    14    16    16
    5     5    16    16    16    16
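The value -1.80e308 is what the most negative finite double (about -1.7977e308) looks like when printed to three significant digits. One plausible reading, which is an assumption here and not something confirmed by the H2O documentation, is that the running maximum starts at that value and is never updated for a group whose values are all missing. The plain-Python sketch below reproduces that effect on the VarB column from the question and shows a guard that counts non-missing values per group instead of recoding by magnitude. The function `grouped_max` and its `guard_empty` flag are hypothetical names for illustration, not part of h2o.

```python
import math
import sys


def grouped_max(groups, values, guard_empty=True):
    """Per-group maximum that skips NaN values.

    The accumulator starts at the most negative finite double. With
    guard_empty=False that starting value leaks out for groups whose
    values are all missing; with guard_empty=True such groups yield NaN.
    """
    start = -sys.float_info.max  # assumed starting value, not confirmed h2o internals
    best, n_seen = {}, {}
    for g, v in zip(groups, values):
        best.setdefault(g, start)
        n_seen.setdefault(g, 0)
        if math.isnan(v):
            continue  # skip missing values, similar in spirit to na.methods = "rm"
        best[g] = max(best[g], v)
        n_seen[g] += 1
    if not guard_empty:
        return best
    # Decide "missing" from the count of observed values, never from magnitude.
    return {g: (best[g] if n_seen[g] > 0 else math.nan) for g in best}


nan = math.nan
group = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5]
var_b = [nan, nan, nan, nan, 10, 12, 14, 16, 10, 12, 14, 16, 10, 12, 14, 16]

naive = grouped_max(group, var_b, guard_empty=False)
guarded = grouped_max(group, var_b)

print(f"{naive[1]:.2e}")       # -1.80e+308
print(guarded[1])              # nan
print(guarded[2], guarded[3])  # 14 16
```

Keying the recode on a per-group count of non-missing values avoids touching legitimate large-magnitude values, which is the concern raised at the end of the question.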

H2O | Extended Isolation Forest | model.explain() gives KeyError: 'response_column'

I have been struggling with this error for a few hours now, and I am still lost even after reading through the documentation.

I'm using H2O's Extended Isolation Forest (EIF), an unsupervised model, to detect anomalies in an unlabelled dataset. This is working as intended; however, for the project I'm working on, model explainability is extremely important. I discovered the [explain](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/explain.html) function, which supposedly returns several explainability methods for a model. I'm particularly interested in the SHAP values from this function.

The documentation states

> The main functions, h2o.explain() (global explanation) and h2o.explain_row() (local explanation) work for individual [H2O models,](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/training-models.html) as well a list of models or an H2O AutoML object. The h2o.explain() function generates a list of explanations.

Since the H2O models link brings me to a page which covers both supervised and unsupervised models, I assume the explain function would work for both types of models.

The following code works just fine.

```python
import h2o
from h2o.estimators import H2OExtendedIsolationForestEstimator

h2o.init()

# Convert the existing df_EIF to an H2OFrame and pick the predictor columns
df_EIF = h2o.H2OFrame(df_EIF)
predictors = df_EIF.columns[0:37]

# Train the Extended Isolation Forest on the unlabelled data
eif = H2OExtendedIsolationForestEstimator(ntrees=75,
                                          sample_size=500,
                                          extension_level=len(predictors) - 1)
eif.train(x=predictors, training_frame=df_EIF)

# Score the frame and attach the anomaly score as a new column
eif_result = eif.predict(df_EIF)
df_EIF['anomaly_score_EIF'] = eif_result['anomaly_score']
```

However, trying to call explain on the model (eif)

```python
eif.explain(df_EIF)
```

gives me the following KeyError:

```python
KeyError                                  Traceback (most recent call last)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx.py in <module>
----> 1 eif.explain(df_EIF)
      2
      3
      4
      5

C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in explain(models, frame, columns, top_n_features, include_explanations, exclude_explanations, plot_overrides, figsize, render, qualitative_colormap, sequential_colormap)
   2895     plt = get_matplotlib_pyplot(False, raise_if_not_available=True)
   2896     (is_aml, models_to_show, classification, multinomial_classification, multiple_models, targets,
-> 2897      tree_models_to_show, models_with_varimp) = _process_models_input(models, frame)
   2898
   2899     if top_n_features < 0:

C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in _process_models_input(models, frame)
   2802     models_with_varimp = [model for model in models if _has_varimp(model)]
   2803     tree_models_to_show = _get_tree_models(models, 1 if is_aml else float("inf"))
-> 2804     y = _get_xy(models_to_show[0])[1]
   2805     classification = frame[y].isfactor()[0]
   2806     multinomial_classification = classification and frame[y].nlevels()[0] > 2

C:\ProgramData\Anaconda3\lib\site-packages\h2o\explanation\_explain.py in _get_xy(model)
   1790     """
   1791     names = model._model_json["output"]["original_names"] or model._model_json["output"]["names"]
-> 1792     y = model.actual_params["response_column"]
   1793     not_x = [
   1794         y,

KeyError: 'response_column'
```

From my understanding, this response column refers to the column that you are trying to predict. However, since I'm dealing with an unlabelled dataset, this response column doesn't exist. Is there a way for me to bypass this error? Is it even possible to use the explain() function on unsupervised models? If so, how do I do this? If it is not possible, is there another way to extract the SHAP values of each variable from the model? The shap.TreeExplainer also doesn't seem to work on an H2O model.

TL;DR: Is it possible to use the .explain() function from h2o on an (Extended) Isolation Forest? If so, how?
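The traceback shows the failure coming from the lookup `model.actual_params["response_column"]`, which suggests that this code path expects a model trained with a response. As a minimal sketch, and not a way to make explain() work on an unsupervised model, a caller could check for that key before calling explain() so the situation is handled explicitly instead of surfacing as a KeyError. The helper names `has_response_column` and `explain_if_supervised` and the two stub classes are hypothetical and exist only so the example runs without an H2O cluster; the sketch also assumes `actual_params` behaves like a dict.

```python
def has_response_column(model):
    """True if the model's actual_params carries a non-empty response_column."""
    params = getattr(model, "actual_params", None) or {}
    return params.get("response_column") is not None


def explain_if_supervised(model, frame):
    """Call model.explain(frame) only when a response column is present."""
    if not has_response_column(model):
        return None  # no response column; the caller decides what to do next
    return model.explain(frame)


# Hypothetical stand-ins so the sketch runs without an H2O cluster.
class StubUnsupervisedModel:
    actual_params = {"ntrees": 75, "sample_size": 500}

    def explain(self, frame):
        raise KeyError("response_column")  # same error type as in the question


class StubSupervisedModel:
    actual_params = {"response_column": "label"}

    def explain(self, frame):
        return "explanations"


print(explain_if_supervised(StubUnsupervisedModel(), frame=None))  # None
print(explain_if_supervised(StubSupervisedModel(), frame=None))    # explanations
```

With a real model, the same check would be applied to `eif` before calling `eif.explain(df_EIF)`; whether explain() supports the model type at all remains the open question of the post.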

Product Resources

Get started with our products

Datatable
 

View on Github
 

H2O-3
 

View on Github
 

H2O AI Feature Store
 

Learn More

H2O Document AI
 

View on Github
Learn More

H2O Driverless AI
 

View on Github
Learn More

H2O Hydrogen Torch
 

Learn More
Product Brief

H2O MLOps
 

Learn More
Product Brief

H2O Sparkling Water
 

View on Github
Learn More

Try the H2O.ai Cloud for free for 90 days

Get Started
 

Become part of our community by trying H2O.ai with a free 90-day trial