This is a guest post re-published with permission from our friends at Datapipe. The original lives here.
One of the advantages of public cloud is the ability to experiment and run a variety of workloads without committing to hardware purchases. To meet your data processing needs, however, you need a well-defined mapping between your objectives and the cloud vendor's offerings. In collaboration with Denis Perevalov (Milliman), we'd like to share some details about one of our most recent, and largest, big-data projects: an engagement with our client, Milliman, to build a machine-learning platform on Amazon Web Services.
Before we get into the details, let's introduce Datapipe's data and analytics consulting team. The team's goal is to help customers with their data processing needs. Our engagements range from data engineering efforts, where we help customers build data processing pipelines, to data science work, where our consultants help clients gain better insight into their existing datasets.
When we first started working with Milliman, one of their challenges was running compute-intensive machine learning algorithms in a reasonable amount of time on a growing set of datasets. To cope with this challenge, they had to pick a distributed processing framework and, after some investigation, narrowed their options to two: H2O and Apache Spark. Both frameworks offer distributed computing across multiple nodes, and Milliman was left to decide whether to run them on their on-premise infrastructure or on an alternative. In the end, Amazon Web Services emerged as the ideal choice for their use case and requirements. To execute on this plan, Milliman engaged our data and analytics consultants to build a scalable and secure Spark and H2O machine learning platform using AWS services: Amazon EMR, S3, IAM, and CloudFormation.
Early in our engagement with Milliman, we identified the following high-level project goals:
Security
Elasticity and cost
Automation
Cost visibility
Interactivity
A consolidated platform for Spark and H2O
Let’s dive into each of these priorities further, with a focus on Milliman’s requirements and how we mapped their requirements to AWS solutions, as well as the details around how we achieved the agreed upon project goals.
Security
Security is the most important part of any cloud application deployment, and it’s a topic we take very seriously when we get involved in client engagements. In this project, one of the security requirements was the ability to control and limit access to datasets hosted on Amazon S3 depending on the environment. For example, Milliman wanted to restrict the access of development and staging clusters to a subset of data and to ensure the production dataset is only available to the production clusters.
To achieve Milliman's security isolation goal, we took advantage of AWS Identity and Access Management (IAM). Using IAM, we created environment-specific policies and roles that limit each cluster's access to its own environment's data on S3.
Example IAM policy for the Amazon S3 development environment:
{
  "Statement": [
    {
      "Resource": [
        "arn:aws:s3:::parvizexamples/dev/data/*",
        "arn:aws:s3:::parvizexamples/dev/data"
      ],
      "Action": ["s3:*"],
      "Effect": "Allow"
    }
  ],
  "Version": "2012-10-17"
}
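To show how a policy like this might be wired up to a cluster, here is a minimal sketch; the role, profile, and file names are hypothetical, and it assumes an instance profile of the same name already exists and wraps the role.

# Attach the development data policy (saved as dev-s3-policy.json) to the
# role used by the development cluster's EC2 instances (hypothetical names)
aws iam put-role-policy \
  --role-name EMR_EC2_DevRole \
  --policy-name DevS3DataAccess \
  --policy-document file://dev-s3-policy.json

# Launch the development cluster with that instance profile, so every node
# inherits the restricted S3 permissions
aws emr create-cluster --name "dev-spark" --release-label emr-4.0.0 \
  --applications Name=Hadoop Name=Spark \
  --ec2-attributes InstanceProfile=EMR_EC2_DevRole \
  --service-role EMR_DefaultRole \
  --instance-type m3.xlarge --instance-count 3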
Elasticity And Cost
In general, there are two main approaches to deploying Spark or Hadoop platforms on AWS: building and managing the cluster yourself on EC2 instances using open-source or third-party vendor distributions, or using Amazon's managed offering, Elastic MapReduce (EMR).
For this project, our recommendation to the client was to use Amazon's Elastic MapReduce, a managed Hadoop and Spark offering. It allows us to build Spark or Hadoop clusters in a matter of minutes simply by calling APIs, and there were a number of reasons why we suggested it over building the platform from third-party vendor or open-source offerings.
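As an illustration of that elasticity (the cluster and instance-group IDs below are placeholders), a running EMR cluster can be resized from the AWS CLI and terminated when the workload is done, so you only pay for the capacity you actually use:

# Find the ID of the cluster's core instance group
aws emr list-instance-groups --cluster-id j-XXXXXXXXXXXXX

# Grow the core group to 20 nodes for a heavy training run...
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=20

# ...and tear the cluster down when the job is finished
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX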
Automation
Automation is an important aspect of any cloud deployment. While it's easy to build and deploy applications on top of cloud resources manually or through inconsistent processes, automating the deployment process consistently is the cornerstone of a scalable cloud operation. That is especially true for distributed systems, where a reproducible deployment process is the only way an organization can reliably run and maintain a data processing platform.
As we mentioned earlier, this project leveraged other AWS services, such as IAM and S3, in addition to Amazon EMR. Milliman also required multiple AWS environments, such as staging, development, and production, each with different security requirements. To automate the deployment of these services while keeping each environment distinct, we leveraged AWS CloudFormation templates.
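Below is a minimal sketch of what launching a per-environment stack might look like; the stack, template, and parameter names are hypothetical and not Milliman's actual templates.

# Launch the development stack from a shared template, passing
# environment-specific parameters (hypothetical names)
aws cloudformation create-stack \
  --stack-name emr-platform-dev \
  --template-body file://emr-platform.template \
  --parameters ParameterKey=Environment,ParameterValue=dev \
               ParameterKey=DataPrefix,ParameterValue=dev/data \
  --capabilities CAPABILITY_IAM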
Cost Visibility
One common requirement that is regularly overlooked is fine-grained visibility into your cloud costs. Most organizations delay this requirement until they're further down the path to cloud adoption, which can be a costly mistake. We were glad to hear that tracking AWS costs down to specific workloads was, in fact, part of Milliman's project requirements.
Fortunately, AWS provides an offering called Cost Allocation Tags that can be tailored to meet a client's cost visibility requirements. With Cost Allocation Tags, clients tag their AWS resources with specific keywords that AWS then uses to generate cost reports aggregated by those tags. For this project, we instructed Milliman to use EMR's tagging feature to tag each cluster with workload-specific keywords that can later be recognized in AWS' billing reports. For example, the following command line demonstrates how to tag an EMR cluster with workload-specific tags:
aws emr create-cluster --name Spark --release-label emr-4.0.0 --tags Name="Spark_Recommendation_Model_Generator" --applications Name=Hadoop Name=Spark
….
….
Interactivity
While building automated Spark and H2O clusters using AWS EMR and CloudFormation is a great start to a data processing platform, developers at times need an interactive way of working with it. In other words, using the command line to work with a data processing platform may be fine in some use cases, but in others UI access is critical so developers can engage with the cluster interactively.
As part of our data & analytics offering at Datapipe, we help customers pick the right third-party tools and vendors for a given requirement. For Milliman, to provide an interactive UI on top of the EMR Spark and H2O clusters, we leveraged IPython/Jupyter notebooks. Jupyter integrates with Hadoop, Spark, and other platforms and allows developers to type code in a UI and submit it to the cluster for execution in real time. To deploy the Jupyter notebooks, we leveraged EMR's bootstrap feature, which allows customers to install custom software on EMR's EC2 nodes.
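Here is a sketch of launching a cluster with such a bootstrap action; the S3 path and install script name are hypothetical stand-ins for the actual installation script, which is not shown in this post.

# Launch a Spark cluster and install Jupyter on the nodes via a bootstrap
# action (install-jupyter.sh is a hypothetical placeholder)
aws emr create-cluster --name "spark-notebooks" --release-label emr-4.0.0 \
  --applications Name=Hadoop Name=Spark \
  --bootstrap-actions Path=s3://parvizexamples/bootstrap/install-jupyter.sh,Name=InstallJupyter \
  --use-default-roles --instance-type m3.xlarge --instance-count 3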
Consolidated platform
Lastly, we needed to build a single platform to host both the H2O and Spark frameworks. While H2O is not a natively supported application on EMR, using Amazon EMR's bootstrap action feature we were able to install H2O on the EMR nodes and avoid creating a separate platform to host it. In other words, Milliman now has the ability to launch both Spark and H2O clusters from a single platform.
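As a sketch of that consolidation (again with a hypothetical install script), a single create-cluster call can bring up Spark from EMR's application list and add H2O through a bootstrap action, so both frameworks share one cluster:

# Spark comes from EMR's application list; H2O is added by a bootstrap
# action (install-h2o.sh is a hypothetical placeholder)
aws emr create-cluster --name "spark-h2o" --release-label emr-4.0.0 \
  --applications Name=Hadoop Name=Spark \
  --bootstrap-actions Path=s3://parvizexamples/bootstrap/install-h2o.sh,Name=InstallH2O \
  --use-default-roles --instance-type m3.xlarge --instance-count 5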
Data processing platforms require various considerations, including but not limited to security, scalability, cost, interactivity, and automation. Fortunately, by defining a clear set of project objectives and goals, and mapping those objectives to the applicable solutions (in this case, AWS offerings), companies can meet their data processing requirements efficiently and effectively. We hope this post demonstrated an example of how that can be achieved using AWS offerings. If you have any questions or would like to talk to one of our consultants, please contact us.
In the next set of blog posts, we'll provide some insight into how we use data science and a data-driven approach to gain insight into operational metrics.