This blog post was originally published by Justin Murray of VMware and can be accessed here.
This brief article introduces a short, 4.5-minute video that explains why VMware vSphere is a great base operating platform for data scientists and data engineers. The video then demonstrates this with an example: a data scientist conducts a modeling experiment on an input data set, using the Driverless AI tool from H2O.ai for the data analysis and model training, all in VMs. The key idea is that the world of machine learning and data science is changing rapidly, with powerful new tools, platforms and versions appearing at a very fast pace. The tool vendors are racing to innovate and are producing new workbenches for both the expert and the novice in the field.
Data scientists and data engineers (who organize and cleanse the data first) want to be able to try out these new tools, and updated versions of them, while keeping a stable environment for their existing production deployments. The end goal is to produce a highly accurate trained ML model as quickly as possible for any input data set and predicted outcome. One measure of accuracy you will see at the end of the tool demo is the ROC (Receiver Operating Characteristic) curve, which plots the true positive rate against the false positive rate as the classification threshold varies; other measures of model accuracy are also available. The trained ML model will subsequently be used in production applications for the inference phase. The example model chosen here is XGBoost, a popular gradient-boosted decision tree algorithm that is often applied to tabular data.
The ML inference phase (the production deployment of the model in a pipeline) is often concerned with classification, such as flagging a fraudulent transaction, or with prediction of what may happen in the future, such as the likelihood that someone will not pay their credit card bill or will choose a related book or movie. The ML practitioners (data scientists and data engineers) want to use the best tools and platforms they can get their hands on in order to build the best trained model for recognizing these patterns.
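To make the ROC measure mentioned above more concrete, here is a minimal sketch in plain Python that builds the curve from a model's scores and computes the area under it (AUC). It is not how Driverless AI computes its metrics; the function names `roc_points` and `auc` are hypothetical, and the labels and scores are made-up values chosen only for illustration.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    labels: 1 for the positive class, 0 for the negative class.
    scores: model scores, where higher means "more likely positive".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Lower the threshold step by step, from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Rows that share a score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the ROC curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up example: 1 = fraudulent, 0 = legitimate.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 4))  # 0.7778
```

An AUC of 1.0 would mean every positive example is scored above every negative one, while 0.5 corresponds to random ranking. In this toy case 7 of the 9 positive/negative pairs are ordered correctly, which gives 7/9.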
This rapid change of tooling places a significant demand on an IT department, which must keep up with the innovation and satisfy its customers, the data scientists and data engineers, by giving them what they want while maintaining some control. Deploying on VMware vSphere gives IT the ability to create separate sandboxes for the data scientists to work in, each contained in one or more virtual machines. This provides isolation, checkpointing and the ability for the data scientist to innovate in a safe environment.
While many well-known examples of machine learning focus on image recognition and classification, the H2O tool in our demo is used to analyze tabular data, which in this example is contained in a CSV file. This data is representative of the many thousands of datasets found in enterprises, such as database tables, spreadsheets and regular human-readable files. A term that is used frequently here is “independent and identically distributed” (IID) data, meaning that each row is treated as an independent sample drawn from the same underlying distribution. Tabular data is structured into rows and columns, which is quite different from the layout of pixels in an image, so the ML models that best analyze IID/tabular data may well differ from those that deal with images. Financial institutions, insurance companies, retail operations and dozens of other enterprises have lots of this “tabular” data, so there is a big opportunity to apply machine learning to this type of dataset and enhance these enterprises’ understanding of their customers. These types of data may also be somewhat sensitive and so will likely be modeled in-house for the foreseeable future.
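As a rough illustration of what "rows and columns" means in practice, the sketch below reads a tiny CSV table with Python's standard `csv` module and separates the column to be predicted from the input columns. The column names, the values and the `load_table` helper are all hypothetical; this is not the dataset or the code used in the video.

```python
import csv
import io

# Made-up rows for illustration only. Each row is one customer (one
# observation) and each column is one attribute of that customer.
SAMPLE_CSV = """credit_limit,age,missed_payment
20000,34,0
50000,45,1
12000,23,0
"""


def load_table(text, target_column):
    """Split CSV text into feature rows and the target values to predict."""
    reader = csv.DictReader(io.StringIO(text))
    features, targets = [], []
    for row in reader:
        # Remove the target from the row; what remains are the inputs.
        targets.append(int(row.pop(target_column)))
        features.append({name: float(value) for name, value in row.items()})
    return features, targets


features, targets = load_table(SAMPLE_CSV, "missed_payment")
print(features[0])  # {'credit_limit': 20000.0, 'age': 34.0}
print(targets)      # [0, 1, 0]
```

A model trained on such a table learns to map the input columns of a row to its target value, which is the kind of task the demo hands to the tool.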