Comparing Machine Learning as a Service: Amazon, Microsoft Azure, Google Prediction API
For most businesses, machine learning seems close to rocket science, appearing expensive and talent-demanding. And, if you’re aiming at building another Netflix recommendation system, it really is. But the trend of making everything-as-a-service has affected this sophisticated sphere, too. You can jump-start an ML initiative without much investment, which would be the right move if you are new to data science and just want to grab the low-hanging fruit.
One of ML’s most inspiring stories is the one about a Japanese farmer who decided to sort cucumbers automatically to help his parents with this painstaking operation. Unlike the stories that abound about large enterprises, the guy had neither expertise in machine learning nor a big budget. But he did manage to get familiar with TensorFlow and employed deep learning to recognize different classes of cucumbers.
By using machine-learning cloud services, you can start building your first working models, yielding valuable insights from predictions with a relatively small team. We’ve already discussed machine learning strategy. Now let’s have a look at the best machine learning platforms on the market and consider some of the infrastructural decisions to be made.
Machine learning as a service
ML-as-a-service platforms handle most infrastructure concerns, such as data preprocessing, model training, and model evaluation, with prediction then performed in the cloud. Prediction results can be bridged with your internal IT infrastructure through REST APIs. Amazon Machine Learning, Azure Machine Learning, and Google Prediction API are three leading cloud services that allow for fast model training and deployment with little to no data science expertise. These should be considered first if you are assembling a homegrown data science team out of available software engineers. Have a look at our data science team structures story to get a better idea of how roles are distributed.
This post doesn’t aim to provide exhaustive instructions on when and how to use these platforms, but rather to point out what to look for before you start reading through their documentation.
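To make the REST bridging concrete, here is a minimal sketch of how an internal application might package a record into a prediction request. The endpoint URL, the API key, and the payload and response field names are hypothetical placeholders, not any vendor’s actual API; each provider documents its own request format.

```python
import json
import urllib.request

# Hypothetical endpoint and key: replace with values from your provider's docs.
PREDICT_URL = "https://ml.example.com/v1/models/churn-model/predict"
API_KEY = "YOUR_API_KEY"


def build_prediction_request(record: dict) -> urllib.request.Request:
    """Wrap one record in a JSON POST request for a hosted model."""
    body = json.dumps({"instances": [record]}).encode("utf-8")
    return urllib.request.Request(
        PREDICT_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def parse_prediction_response(raw: bytes) -> list:
    """Extract predictions from an assumed {"predictions": [...]} response body."""
    return json.loads(raw.decode("utf-8"))["predictions"]


request = build_prediction_request({"tenure_months": 14, "plan": "basic"})
# Sending is left commented out because the endpoint above does not exist:
# with urllib.request.urlopen(request) as response:
#     predictions = parse_prediction_response(response.read())
```

Keeping request building and response parsing in separate functions means only those two places change if you switch providers.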
Amazon Machine Learning
Amazon Machine Learning is one of the most automated solutions on the market and the best fit for deadline-sensitive operations. The service can load data from multiple sources, including Amazon RDS, Amazon Redshift, CSV files, etc. All data preprocessing operations are performed automatically: The service identifies which fields are categorical and which are numerical, and it doesn’t ask a user to choose the methods of further data preprocessing (dimensionality reduction and whitening).
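As an illustration of what automatic field typing involves, the sketch below guesses whether each CSV column is numerical or categorical. It uses a simple heuristic of our own (a column is numerical if every non-empty value parses as a number), not Amazon’s actual logic, and the sample data is made up.

```python
import csv
import io

# Made-up sample data standing in for an uploaded CSV file.
SAMPLE_CSV = """age,plan,monthly_spend
34,basic,19.99
51,premium,49.50
27,basic,
"""


def is_number(value: str) -> bool:
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_field_types(csv_text: str) -> dict:
    """Label each column 'numerical' or 'categorical', ignoring empty cells."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    types = {}
    for field in rows[0].keys():
        values = [row[field] for row in rows if row[field] != ""]
        numeric = bool(values) and all(is_number(v) for v in values)
        types[field] = "numerical" if numeric else "categorical"
    return types


field_types = infer_field_types(SAMPLE_CSV)
```

A heuristic like this would mislabel numeric codes such as ZIP codes as numerical, which is one reason to review automatically assigned types.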
Prediction capacities of Amazon ML are limited to three options: binary classification, multiclass classification, and regression. Amazon doesn’t support any unsupervised learning methods, so a user must select a target variable and have it labeled in the training set. A user also isn’t required to know any machine learning methods because Amazon chooses them automatically after looking at the provided data.
This high automation level is both an advantage and a disadvantage of Amazon ML. If you need a fully automated yet limited solution, the service can match your expectations. However, it doesn’t contribute a lot to understanding machine learning specifics and can’t be used as a launch pad to train in-house developers in data science.
Microsoft Azure Machine Learning
Unlike the Amazon ML product, Azure Machine Learning aims to provide a powerful playground both for newcomers and experienced data scientists. Almost all operations in Azure ML must be completed manually. This includes data exploration, preprocessing, choosing methods, and validating modeling results.
Approaching machine learning with Azure entails quite a steep learning curve. But it eventually leads to a deeper understanding of all major techniques in the field. On the other hand, Azure ML provides a graphical interface to visualize each step within the workflow. Perhaps the main benefit of using Azure is the variety of algorithms available to play with. The Studio supports around 100 methods that address classification (binary and multiclass), anomaly detection, regression, recommendation, and text analysis. It’s worth mentioning that the platform has one clustering algorithm (K-means).
Another big part of Azure ML is the Cortana Intelligence Gallery. It’s a collection of machine learning solutions provided by the community to be explored and reused by data scientists. The Azure product is a powerful tool for starting with machine learning and introducing its capabilities to new employees.
Google Prediction API
The machine learning product from Google is very similar to what Amazon offers. Its minimalistic approach comes down to solving two main problems: classification (both binary and multiclass) and regression. Trained models can be deployed through a REST API.
Google doesn’t disclose exactly which algorithms are utilized for drawing predictions. Thus, Prediction API would be a weak tool for newcomers who want to acquire that knowledge. On the other hand, Google’s environment would be the best fit for running machine learning within tight deadlines and for an early launch of the ML initiative. Similar to Azure, Google offers a gallery of pre-trained models, which, unlike Azure’s, is small and yet to be expanded.
IBM Watson, TensorFlow, and others
All three platforms described above provide fairly exhaustive documentation to jump-start machine learning experiments and deploy trained models in a corporate infrastructure. There are also a number of other ML-as-a-service solutions that come from startups and are respected by data scientists, like PredicSis and BigML.
But what about IBM Watson Analytics and TensorFlow?
IBM Watson Analytics isn’t yet a full-fledged machine learning platform for the purpose of business prediction. Currently, Watson’s strength is visualizing data and describing how different values in it interact. It also has a visual recognition service similar to what Google offers and a set of other cognitive services. The current problem with Watson is that the system performs narrow and relatively simple tasks that are easy for non-professionals to operate. When it comes to custom machine learning or prediction duties, it’s too early in its development to consider IBM Watson.
TensorFlow. This is another Google product, which is a library of different machine learning tools rather than ML-as-a-service. It doesn’t have a visual interface, and the learning curve for TensorFlow would be quite steep. However, the library is also targeted at software engineers who plan to transition to data science. On top of that, Google provides a cloud infrastructure for machine learning that is built to be used with TensorFlow, but it’s still in Beta. Basically, the combination of TensorFlow and the Google Cloud service offers infrastructure-as-a-service and platform-as-a-service solutions without the software level. We’ve talked about the three-tier model of cloud in our digital transformation whitepaper. Have a look at it if you aren’t familiar with the concept.
Finding the right storage for collecting data and further processing it with machine learning is no longer a great challenge, assuming that your data scientists have enough knowledge to operate popular storage solutions.
In most cases, machine learning draws on both SQL and NoSQL data stores, and many established and trusted solutions support them, such as Hadoop Distributed File System (HDFS), Cassandra, Amazon S3, and Redshift. For organizations that have used powerful storage systems before embarking on machine learning, this won’t be a barrier. If you plan to work with an ML-as-a-service system, the most straightforward way is to choose the same provider for both storage and machine learning, as this will reduce the time spent on configuring a data source.
However, some of these platforms can be easily integrated with other storage systems. Azure ML, for instance, mainly integrates with other Microsoft products (Azure SQL, Azure Table, Azure Blob) but also supports Hadoop and a handful of other data source options. These include direct data upload from a desktop or on-premises server. Challenges may arise if your machine learning workflow is diversified and data comes from multiple sources.
Modeling and computing
We’ve discussed ML-as-a-service solutions that mainly provide computing capacities. But if the learning workflow is performed internally, the computing challenge will strike sooner or later. Machine learning in most cases requires a lot of computing power. Data sampling (making a curated subset) is still a relevant practice, even in the era of big data. While model prototyping can be done on a laptop, training a complex model using a large dataset requires investment in more powerful hardware. The same applies to data preprocessing, which can take days on regular office machines. In a deadline-sensitive environment, where models sometimes have to be altered and retrained weekly or daily, this simply isn’t an option. There are three viable approaches to handling processing while keeping high performance:
1. Accelerate hardware. If you do relatively simple tasks and don’t apply your models to big data, use solid-state drives (SSDs) for such tasks as data preparation or running analytics software. Computationally intensive operations can be addressed with one or several graphics processing units (GPUs). A number of libraries are available to let GPUs process models written even in such high-level languages as Python.
2. Consider distributed computing. Distributed computing implies having multiple machines with tasks split across them. However, this approach isn’t going to work for all machine learning techniques.
3. Use cloud computing for scalability. If your models process customer-related data that has intense peak moments, cloud computing services will allow for rapid scalability. For companies that are required to keep their data on-premises only, it’s worth considering private cloud infrastructure.
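The second approach can be sketched on a single machine. In the example below, worker threads stand in for separate machines: each computes a partial sum and count over its own shard, and the partial results are merged into a global mean. This works because a mean decomposes into independent per-shard pieces; techniques whose steps need the full dataset at once don’t split this cleanly, which is why the approach doesn’t suit every method. The shard count and the data are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(values: list, n_shards: int) -> list:
    """Deal values round-robin into n_shards lists."""
    return [values[i::n_shards] for i in range(n_shards)]


def partial_stats(shard: list) -> tuple:
    """Work done independently on one 'machine': sum and count of its shard."""
    return sum(shard), len(shard)


def distributed_mean(values: list, n_shards: int = 4) -> float:
    """Compute a mean by merging per-shard partial results."""
    shards = split_into_shards(values, n_shards)
    # Threads stand in for remote workers; a real cluster would use a
    # framework that ships each shard to a different machine.
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        results = list(pool.map(partial_stats, shards))
    total = sum(s for s, _ in results)
    count = sum(c for _, c in results)
    return total / count


data = list(range(1, 101))  # made-up data: the integers 1..100
mean = distributed_mean(data)
```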
The next move
It’s easy to get lost in the variety of solutions available. They differ in terms of algorithms, they differ in terms of required skillsets, and eventually they differ in tasks. This situation is quite common for a young market, as even the three leading solutions that we’ve talked about don’t fully compete with each other. And more than that, the velocity of change is impressive. There’s a high likelihood that you’ll stick with one vendor and another one will unexpectedly roll out something that matches your business needs.
The right move is to articulate what you plan to achieve with machine learning as early as possible. It’s not easy. Creating a bridge between data science and business value is tricky if you lack either data science or domain expertise. We at AltexSoft encounter this problem often when discussing machine learning applications with our clients. It’s usually a matter of simplifying the general problem to a single attribute. Whether it’s a price forecast or another numeric value, the class of an object, or the segregation of objects into multiple groups, once you find this attribute, deciding on a vendor and choosing among what’s offered will be simpler.
Bradford Cross, founding partner at DCVC, argues that ML-as-a-service isn’t a viable business model. According to him, it falls in the gap between data scientists who are going to use open source products and executives who are going to buy tools solving tasks at the higher levels. However, it seems that the industry is currently overcoming its teething problems and eventually we’ll see far more companies turning to ML-as-a-service to avoid expensive talent acquisitions and still possess versatile data tools.
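One way to make this simplification concrete is to look at sample values of the single attribute you want to predict and see which task family it implies. The helper below is a rough rule of thumb under our own assumptions (the function name and the rules are illustrative), not a feature of any vendor.

```python
from typing import Optional, Sequence


def suggest_task(target_values: Optional[Sequence]) -> str:
    """Suggest an ML task family from sample values of the target attribute."""
    if not target_values:
        # No attribute to predict: objects can only be grouped by similarity.
        return "clustering"
    if all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in target_values
    ):
        # Numeric target, such as a price forecast.
        # Caveat: integer-coded classes (0/1/2) would also land here.
        return "regression"
    n_classes = len(set(target_values))
    if n_classes <= 2:
        return "binary classification"
    return "multiclass classification"


price_task = suggest_task([199.0, 249.5, 180.0])
churn_task = suggest_task(["churned", "stayed", "stayed"])
species_task = suggest_task(["cat", "dog", "bird"])
grouping_task = suggest_task(None)
```

Of the services compared here, a "clustering" answer rules out Amazon ML, which supports no unsupervised methods, and points toward Azure ML with its K-means implementation.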