This is a top-level overview of the H2O architecture. H2O does in-memory analytics on clusters with distributed, parallelized, state-of-the-art Machine Learning algorithms. However, the platform is very generic, and very, very fast. We’re building Machine Learning tools with it, because we think they’re cool and interesting, but the platform can do much more.
H2O is built in layers, and is coded differently at different layers to best approach different tasks and objectives. We talk about the MapReduce execution flavor a lot in public, because it’s easy to explain, because it covers a lot of ground, and because we’ve implemented a bunch of dense linear-algebra-style algorithms with it – but that’s not the only thing we can do with H2O, nor is it the only coding “style”. Here’s a rundown of the layers in H2O:
In-memory distributed K/V store layer: H2O sports an in-memory (not-persistent) K/V store, with *exact* (not lazy) consistency semantics and transactions. The memory model is exactly the Java Memory Model, but distributed. Both reads and writes are fully cacheable, and typical cache-hit latencies are around 150ns (that’s 150 nanoseconds) from a NonBlockingHashMap. Let me repeat that: reads and writes go through a non-blocking hash table – we do NOT suffer from a classic hot-blocks problem. Cache-misses obviously require a network hop, and the execution times are totally driven by the size of data moved divided by available bandwidth… and of course the results are cached. The K/V store is currently used to hold control state, all results, and the Big Data itself. You can certainly build a dandy graph-based algorithm directly over the K/V store, or integrate H2O’s model scoring into a low-latency production pipeline. A small sketch of the access pattern follows.
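To make that access pattern concrete, here is a minimal sketch of the caching semantics. All the names below are invented for illustration (H2O’s real K/V API differs), and a ConcurrentHashMap stands in for the NonBlockingHashMap:

    import java.util.concurrent.ConcurrentHashMap;

    class KVSketch {
      // Local reads & writes hit a (non-blocking) hash table; a miss costs one
      // network hop to whichever node homes the Key, and the result is cached.
      static final ConcurrentHashMap<String,byte[]> CACHE =
        new ConcurrentHashMap<String,byte[]>();

      static byte[] get( String key ) {
        byte[] v = CACHE.get(key);   // ~150ns on a cache hit
        if( v == null ) {
          v = fetchRemote(key);      // cache miss: 1 network hop...
          CACHE.put(key,v);          // ...then cached locally
        }
        return v;
      }
      static void put( String key, byte[] val ) {
        CACHE.put(key,val);          // fully cacheable writes, JMM-style visibility
        publishRemote(key,val);
      }
      // Stand-ins for the network layer:
      static byte[] fetchRemote( String key )               { return new byte[0]; }
      static void   publishRemote( String key, byte[] val ) { }
    }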
* A note on cluster size: H2O clusters are peer-to-peer, and have no upper bound on their size. We have built clusters with over 100 nodes and multiple TBs of DRAM. Every node “knows” where every Key is, without actually having to hold onto the Key. Hence asking for a particular Key’s Value is no more expensive than 1 network hop from the Key requester to some Key holder, and back.
* A note on compression: Big Data is heavily (and losslessly) compressed – typically 2x to 4x better than GZIP on disk (YMMV) – and can be accessed like a Java array (a giant, greater-than-4-billion-element, distributed Java array). H2O guarantees that if the data is accessed linearly, then the access time will match what you can get out of C or Fortran – i.e., be memory-bandwidth bound, not CPU bound. You can access the array (for both reads and writes) in any order, of course, but you get strong speed guarantees for accessing in-order. You can do pretty much anything to an H2O array that you can do with a Java array, although due to size/scale you’ll probably want to access the array in a blatantly parallel style.
* A note on decompression: The data is decompressed Just-In-Time, strictly in CPU registers, in the hot inner loops – and THIS IS FASTER than decompressing beforehand, because most algorithms are memory-bandwidth bound. Moving a 32-byte cache line of compressed data into CPU registers gets more data per cache-miss than moving four 8-byte doubles. Decompression typically takes 2-4 instructions of shift/scale/add per element, and is well covered by the cache-miss costs. As an example, adding a billion compressed doubles (e.g., to compute the average) takes 12ms on a 32-core (dual-socket) E5-2670 Intel machine – that’s 8GB of logical data in 12ms, an achieved bandwidth of over 600GB/sec – hugely faster than all other Big Data solutions. Accounting for compression, H2O achieved 83GB/sec out of the theoretical maximum of 100GB/sec on this hardware. (A sketch of this decode-in-the-loop idea appears just after these notes.)
* A note on Big Data and GC: H2O keeps all of our data in-heap, but in large arrays of Java primitives. Our experience shows that we run without GC issues, even on very large heaps with the default collector. We routinely test with heaps from 2G to 200G – and never see FullGC costs exceed a few seconds every now and then (it depends on the rate of Big Data writing going on). The normal Java object allocation used to drive the system internally has a negligible GC load. We keep our data in-heap because it’s as fast as possible (memory-bandwidth limited), easy to code (pure Java), and has no interesting GC costs. Our GC tuning policy is: “only use the -Xmx flag, set to the largest you can allow given the machine resources”. Take all the other GC defaults; they will work fine.
* A note on Bigger Data and GC: We do a user-mode swap-to-disk when the Java heap gets too full, i.e., when you’re using more Big Data than physical DRAM. We won’t die with a GC death-spiral, but we will degrade to out-of-core speeds. We’ll go as fast as the disk will allow. I’ve personally tested loading a 12GB dataset into a 2GB (32-bit) JVM; it took about 5 minutes to load the data, and another 5 minutes to run a Logistic Regression.
* A note on data ingest: We read data fully parallelized from S3, HDFS, NFS, URIs, browser uploads, etc. We can typically drive HDFS disk spindles to an interesting fraction of what you can get from e.g. an HDFS file-copy. We parse & compress (in parallel) a very generous notion of a CSV file (for instance, Hive files are directly ingestable), as well as SVMLight files. We are planning on an RDD ingester – interoperability with other frameworks is in everybody’s interest.
* A note on sparse data: H2O sports about 15 different compression schemes under the hood, including ones designed to compress sparse data. We happily import SVMLight data without ever having it “blow up”, while still fully supporting the array-access API, including the speed guarantees.
* A note on missing data: Most data sets have missing elements, and most math algorithms deal with missing data specially. H2O fully supports a notion of “NA” for all data, including setting, testing, selecting in (or out), etc., and this notion is woven through the data-presentation layer.
* A note on streaming data: H2O vectors can have data inserted & removed (anywhere, in any order) continuously. In particular, it’s easy to add new data at the end and remove it from the start – i.e., build a large rolling dataset holding all the elements that fit given a memory budget and a data flow-rate. This has been on our roadmap for a while, and needs only a little more work to be fully functional.
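Before moving on to the execution layer, here is a tiny sketch of the decode-in-the-loop idea from the decompression note above. The encoding (one byte per element, plus a scale and bias) and all names are mine for illustration – H2O’s real schemes are richer:

    // Sum a column kept as compressed bytes, expanding each element to a
    // double in registers inside the hot loop (no up-front double[] needed).
    class DecodeSketch {
      static double sum( byte[] col, double scale, double bias ) {
        double sum = 0;
        for( int i = 0; i < col.length; i++ )
          sum += col[i]*scale + bias;  // a scale and an add per element, covered by cache-miss costs
        return sum;
      }
      public static void main( String[] args ) {
        byte[] col = { 1, 2, 3, 4 };              // "compressed" column: 1 byte per element
        System.out.println( sum(col, 0.5, 10) );  // 10.5 + 11 + 11.5 + 12 = 45.0
      }
    }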
Light-weight Map/Reduce layer: Map/Reduce is a nice way to write blatantly parallel code, and we support a particularly fast and efficient flavor. Here is a plain-Java example that computes sums and sums-of-squares over two vectors:

    class MyMR extends MRTask<MyMR> {
      double sum0, sum1, sq_sum0;                    // Most things are allowed here
      @Override public void map( double d0, double d1 ) {
        sum0 += d0;  sum1 += d1;  sq_sum0 += d0*d0;  // Again most any Java code here
      }
      @Override public void reduce( MyMR my ) {      // Combine two MyMRs together
        sum0 += my.sum0;  sum1 += my.sum1;  sq_sum0 += my.sq_sum0;
      }
    }
    new MyMR().doAll( v0, v1 );                      // Invoke in-parallel distributed over Vecs v0 and v1
This code will be distributed ’round the cluster, and run at memory-bandwidth speeds (on compressed data!) with no further ado. There’s a lot of mileage possible here that I’m only touching on lightly. Filtering, subsetting, writing results into temp arrays that are used in later passes, uniques on billions of rows, and ddply-style group-by operations all work in this Map/Reduce framework – and all work by writing plain old Java.
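As a taste of that, here is a filtering-style pass written the same way. This is my sketch, reusing the MRTask shape from the example above rather than code from the H2O distribution:

    // Count the rows where column 0 exceeds column 1 -- the same Map/Reduce
    // shape as above; a real filter would also write survivors into a temp array.
    class CountBigger extends MRTask<CountBigger> {
      long count;
      @Override public void map( double d0, double d1 ) { if( d0 > d1 ) count++; }
      @Override public void reduce( CountBigger other ) { count += other.count; }
    }
    CountBigger cb = new CountBigger();
    cb.doAll( v0, v1 );   // run distributed, then read the combined result
    long n = cb.count;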
* A note on Group-By: We have a GroupBy operator running at scale (called ddply in the R community), built using this Map/Reduce framework. The GroupBy can handle millions of groups on billions of rows, and runs Map/Reduce tasks on the group members (so a few large groups run as fast as many small groups). The basic shape is sketched after these notes.
* Scala, and a note on API cleanliness: We fully acknowledge Java’s weaknesses here – this is the Java6-flavor coding style; the Java7 style is nicer – but still not as nice as some other languages. We fully embrace & support alternative syntaxes over our engine. In particular, we have an engineer working on an in-process Scala interface (amongst others). We are shifting our focus now, from the excellent backend to the API side of things. This is a work-in-progress for us, and we are looking forward to much improvement over the next year.
* A note on R: H2O supports an R-like language – not full R semantics – but the obviously data-parallel data-munging aspects of R, and of course all the operators run fully parallel and distributed. There is a REPL. You can use it to add or drop columns or rows, manufacture features, impute missing values, or drop in many R expressions and have them run at-scale.
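To make the Group-By note above concrete, here is a toy sketch of the basic map/reduce shape – per-group partial sums merged in reduce. The real GroupBy operator is considerably more involved; the names and details here are mine:

    import java.util.HashMap;
    import java.util.Map;

    // Toy ddply-style group-by: column 0 holds the group id, column 1 the value.
    // map() accumulates per-group sums locally; reduce() merges the partial maps.
    class GroupSum extends MRTask<GroupSum> {
      HashMap<Long,Double> sums = new HashMap<Long,Double>();
      @Override public void map( double group, double value ) {
        Double s = sums.get( (long)group );
        sums.put( (long)group, s==null ? value : s+value );
      }
      @Override public void reduce( GroupSum other ) {
        for( Map.Entry<Long,Double> e : other.sums.entrySet() ) {
          Double s = sums.get( e.getKey() );
          sums.put( e.getKey(), s==null ? e.getValue() : s+e.getValue() );
        }
      }
    }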
Pre-Baked Algorithms Layer: We have the following algorithms pre-baked, fully optimized and full-featured: Generalized Linear Modeling, including Logistic (Binomial) Regression plus Gaussian, Gamma, Poisson, and Tweedie distributions; Deep Learning; Random Forest (which scales *out* to all the data in the cluster); Gradient Boosted Machine, which provides both multinomial-classification and regression boosted models (again, in-parallel and fully distributed); PCA; KMeans (and variants); and a full suite of data characterization – for example Quantiles (any quantile, computed exactly, in milliseconds). All these algorithms support Confusion Matrices (with adjustable thresholds), AUC & ROC metrics, and incremental test-dataset results on partially trained models during the build process. Within each algorithm, we support the full range of tuning parameters and options that you’d find in the similar R or SAS packages.
* A note on some Mahout algorithms: We’re clearly well suited to e.g. SSVD and Co-occurrence, and we have talked with Ted Dunning at length about how they would be implemented in H2O.
REST / JSON / R / Python / Excel / REPL: The system is externally drivable via URLs/REST API calls, with JSON responses. We use REST/JSON from Python to drive our entire testing harness. We have a very nice R package with H2O integrated behind R – you can issue R commands against an H2O-backed R “data.frame” – and have all the Big Math work on the Big Data in a cluster – including 90% of the typical “data munging” workflow. This same REST/JSON interface also works with e.g. Excel (yes, we have a demo) or shell scripts. We have a pretty web GUI over the REST/JSON layer that is suitable for lightweight modeling tasks.
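As a minimal sketch of driving H2O over REST from plain Java (54321 is H2O’s default port; the exact endpoint path below is illustrative, so check the API docs for the real ones):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class H2ORestPing {
      public static void main( String[] args ) throws Exception {
        URL url = new URL("http://localhost:54321/Cloud.json"); // illustrative endpoint
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        while( (line = in.readLine()) != null )
          System.out.println(line);    // JSON response describing the cluster
        in.close();
      }
    }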
Cliff