Machine learning (ML) models are only as good as the data fed into them. In tabular problems, the data is a collection of rows (samples) and columns (features). So, you could say that tabular ML models are only as good as the features fed into them.
But how do you manage features? Can you share them across the company? Can you easily reuse them for multiple projects? Are you sure the ones used for training are the same as those in production? Feature stores aim to provide concise and effective answers to these questions and more. This article explains what feature stores are and the benefits of using them.
A feature store is a central repository to create, store, and serve features.
Data Scientists and Machine Learning Engineers want to spend their time building and tuning models, but they mostly spend it doing data prep. A large chunk of this is feature engineering. With a feature store, all the prep work you do on one project is saved and reusable for future projects. This results in massive productivity gains for your teams.
A feature store provides one central location for all your features and prevents problems caused by data silos. If data is siloed in different areas, you could have multiple teams working on the same solution to the same issue simultaneously! With a feature store, you can see which features have been added or updated recently and by whom. Seeing someone on the other side of the company adding features similar to yours would signal that you should send them an email to see what is going on.
Without a feature store, you will spend a lot of time checking that the features used in production match those in training. Moreover, you’ll probably need to write separate processing pipelines for training and production. With a feature store, you won’t need to do either of these things. Simply use the straightforward API to access both data types and let the feature store handle the processing differences in the background. Removing this friction point leads to much slicker model deployment and ensures your models run in production for longer without breaking.
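To make the training/production point concrete, here is a minimal sketch of the idea, not the API of any real feature store. The `ToyFeatureStore` class, its method names, and the `debt_to_income` feature are all hypothetical. The point it illustrates is that each feature is defined once, and both the batch (training) path and the single-row (serving) path run the same definition.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


class ToyFeatureStore:
    """Hypothetical in-memory registry: each feature is defined exactly once."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[Row], float]] = {}

    def register(self, name: str, fn: Callable[[Row], float]) -> None:
        # Store the single source-of-truth definition for this feature.
        self._features[name] = fn

    def get_online_features(self, row: Row, names: List[str]) -> Row:
        # Serving path: apply the registered definitions to one incoming row.
        return {name: self._features[name](row) for name in names}

    def get_training_frame(self, rows: List[Row], names: List[str]) -> List[Row]:
        # Batch path: reuse the serving path row by row, so the logic cannot diverge.
        return [self.get_online_features(row, names) for row in rows]


store = ToyFeatureStore()
store.register("debt_to_income", lambda r: r["debt"] / r["income"])

# Made-up example rows, purely for illustration.
history = [{"debt": 10.0, "income": 40.0}, {"debt": 5.0, "income": 50.0}]
train = store.get_training_frame(history, ["debt_to_income"])
live = store.get_online_features({"debt": 10.0, "income": 40.0}, ["debt_to_income"])
```

Because both paths call the same registered function, the value computed for a row at training time matches the value computed for the same row at serving time. A production system would add storage, versioning, and point-in-time lookups on top of this idea.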
Typically, you need to write different processing code depending on whether a feature is offline (e.g., created in a batch job through your data warehouse) or online (e.g., created in real-time and delivered by a streaming service such as Kafka). Just as feature stores remove the distinction between training and production data, they also eliminate the difference between offline and online features. You access both feature types with the unified API and get your data. There is no need to worry about latency issues or calculating the feature. No need to worry about streaming or batch jobs. The data is just there, ready for your models to consume. This results in more time saved and more models getting and staying in production.
Once a feature is in the feature store, it is available forever to you and everyone in the company! Gone are the days spent harassing others for data access or feature approval. Once it’s in the feature store, everyone can access it and start using it in their own models right away. This access will fuel collaboration and creativity, resulting in more innovative models being deployed and adding value to your business.
The H2O AI Feature Store lets you add over 40 metadata attributes, including a description, data source explanation, and data sensitivity categories. Keeping detailed documentation next to features increases data understanding and utilization.
Moreover, feature documentation opens the door to deeper, previously impossible insights. For example, you could rank each feature by its popularity, that is, by how many models use it. Or you could see how powerful a feature is on average based on its SHAP values across multiple production models. The value you can extract with such insights is endless.
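As a rough illustration of the popularity ranking described above, the sketch below counts how many models consume each feature. The model names, feature names, and the `rank_by_usage` helper are invented for the example; a real feature store would read this usage information from its own metadata.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical usage metadata: model name -> features it consumes.
model_features: Dict[str, List[str]] = {
    "churn_model": ["tenure_months", "avg_basket_value"],
    "fraud_model": ["avg_basket_value", "txn_count_7d"],
    "upsell_model": ["avg_basket_value", "tenure_months"],
}


def rank_by_usage(usage: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return (feature, model_count) pairs, most widely used first."""
    # set() ensures a feature listed twice by one model is counted once.
    counts = Counter(f for feats in usage.values() for f in set(feats))
    # Sort by descending count, then by name so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_by_usage(model_features)
```

The same pattern extends to other metadata: replacing the count with an average of per-model SHAP importances would give the "how powerful is it on average" view.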
All businesses would benefit from understanding the lifecycle of their features. However, this information is essential for companies in industries like Financial Services, Healthcare, and Security, which must satisfy regulatory compliance. If a regulator asks to see a model’s history, quickly accessing the lifecycle of every feature the model has used and is currently using will speed up this process immensely.
We recently announced a first-of-its-kind intelligent, enterprise-grade feature store at H2O. Let’s look at some of its unique components.
Since we’ve spent the last few years building world-class AutoML systems, we thought it was only fitting to include this knowledge in our feature store as well. If you pass H2O AI Feature Store a set of features, you can ask it how they can be improved upon. It will then automatically recommend new features or feature combinations for you to try that have a high chance of improving model performance.
Production data changes continually; this is called data (or feature) drift. You had to monitor this drift yourself in the past, but the H2O AI Feature Store does it for you. It can track individual features and feature sets and alert you if anything drifts too far. You can even set it to automatically retrain a model if the drift is above a certain threshold.
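One common way to quantify drift for a single numeric feature is the Population Stability Index (PSI), which compares the binned distribution of a reference sample with that of current data. The sketch below is an illustration only and is not how the H2O AI Feature Store computes drift. The bin count, the `1e-6` floor, the `0.25` threshold (a widely quoted rule of thumb), and the `needs_retraining` helper are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def needs_retraining(score: float, threshold: float = 0.25) -> bool:
    # Hypothetical policy: flag the model for retraining once PSI exceeds the threshold.
    return score > threshold


# Synthetic data: the "production" sample is squeezed into the upper half of the range.
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

no_drift = psi(reference, reference)
drift = psi(reference, shifted)
```

Here `no_drift` is zero because the two samples are identical, while `drift` is large because half of the reference bins are empty in the shifted sample, so `needs_retraining(drift)` returns `True`.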
There you have it, a whirlwind tour of feature stores and some of their most significant benefits. Their use is spreading, but it is by no means an industry standard yet. Indeed, the FANG companies had to build their own from scratch! The fact that they are investing heavily in this technology should demonstrate that it is here to stay and is an essential component if you want to perform top-notch machine learning at scale.
Implementing a feature store now will likely give you a competitive advantage. Expect significant productivity gains, deeper company-wide data understanding, and more models to be put into production faster. If you want to implement a cutting-edge feature store to transform your business, sign up for early access today.