November 11th, 2016

Indexing 1 Billion Time Series with H2O and ISax

RSS icon RSS Category: Technical, Tutorials, Use Cases

At H2O, we have recently debuted a new feature called ISax that works on time series data in an H2O Dataframe. ISax stands for Indexable Symbolic Aggregate ApproXimation, which means it can represent complex time series patterns using a symbolic notation and thereby reducing the dimensionality of your data. From there you can run H2O’s ML algos or use the index for searching or data analysis. ISax has many uses in a variety of fields including finance, biology and cybersecurity.
Today in this blog we will use H2O to create an ISax index for analytical purposes. We will generate 1 Billion time series of 256 steps on an integer U(-100,100) distribution. Once we have the index we’ll show how you can search for similar patterns using the index.
We’ll show you the steps and you can run along, assuming you have enough hardware and patience. In this example we are using a 9 machine cluster, each with 32 cores and 256GB RAM. We’ll create a 1B row synthetic data set and form random walks for more interesting time series patterns. We’ll run ISax and perform the search, the whole process takes ~30 minutes with our cluster.
Raw H2O Frame Creation
In the typical use case, H2O users would be importing time series data from disk. H2O can read from local filesystems, NFS, or distributed systems like Hadoop. H2O cluster file reads are parallelized across the nodes for speed. In our case we’ll be generating a 256 column, 1B row frame. By the way H2O Dataframes scales better by increasing rows instead of columns. Each row will be an individual time series. The ISax algo assumes the time series data is row based.

rawdf = h2o.create_frame(cols=256, rows=1000000000, real_fraction=0.0, integer_fraction=1.0,missing_fraction=0.0)

Random Walk
Here we do a row wise cumulative sum to simulate random walks. The .head call triggers the execution graph so we can do a time measurement.

tsdf = rawdf.cumsum(axis=1)
print tsdf.head()

Lets take a quick peek at our time series


Run ISax
Now we’re ready to run isax and generate the index. The output of this command is another H2O Frame that contains the string representation of the isax word, along with the numeric columns in case you want to run ML algos.
res = tsdf.isax(num_words=20,max_cardinality=10)
Takes 10 minutes and H2O’s MapReduce framework makes efficient use of all 288 cpu cores.
Now that we have the index done, lets search for similar time series patterns in our 1B time series data set. Lets make indexes on the isax result frame and the original time series frame.

res["idx"] =1
res["idx"] = res["idx"].cumsum(axis=0)
tsdf["idx"] = 1
tsdf["idx"] = tsdf["idx"].cumsum(axis=0)

Im going to pick the second time series that we plotted (the green “C2”) time series.

myidx = res[res["iSax_index"]=="5^20_5^20_7^20_9^20_9^20_9^20_9^20_9^20_8^20_6^20

There are 4342 other time series with the same index in the 1B time series dataframe. Lets just plot the first 10 and see how similar they look

mylist = myidx.as_data_frame(use_pandas=True)["idx"][0:10].tolist()
mydf = tsdf[tsdf["idx"].isin(mylist)].as_data_frame(use_pandas=True)

The successful implementation of a fast in memory ISax algo can be attributed to the H2O platform having a highly efficient, easy to code, open source MapReduce framework, and the Rapids api that can deploy your distributed algos to Python or R. In my next blog, I will show how to get started with writing your own MapReduce functions in H2O on structured data by using ISax as an example.

Leave a Reply

A Brief Overview of AI Governance for Responsible Machine Learning Systems

Our paper “A Brief Overview of AI Governance for Responsible Machine Learning Systems” was recently

November 30, 2022 - by Navdeep Gill, Abhishek Mathur and Marcos V. Conde
H2O World Dallas Customer Talks

After three long years of not having an #H2OWorld, we finally held our first one

November 24, 2022 - by Vinod Iyengar
New in Wave 0.24.0

Another Wave release has arrived with quite a few exciting new features. Let's quickly go

November 21, 2022 - by Martin Turoci
Fallback Featured Image
+ Raises $40 Million to Democratize Artificial Intelligence for the Enterprise

Series C round led by Wells Fargo and NVIDIA MOUNTAIN VIEW, CA – November 30, 2017

November 20, 2022 - by
+ Placed Furthest in Completeness of Vision in 2021 Gartner Data Science and Machine Learning Magic Quadrant in the Visionaries Quadrant. — Copy

At, our mission is to democratize AI, and we believe driving value from data

November 18, 2022 - by Read Maloney, SVP of Marketing
+ Expands Market Footprint in Healthcare AI by Signing Hackensack Meridian Health and Other Key Providers

We’re excited to attend the HLTH conference this week in Las Vegas, NV. This industry

November 14, 2022 - by Prashant Natarajan

Start Your Free Trial