
Machine Learning at Microsoft with ML.NET

88 Citations, 2018
Matteo Interlandi

This paper introduces ML.NET: a recently open-sourced machine learning framework that allows developers to author complex ML pipelines, composed of data featurizers and state-of-the-art machine learning models, and to deploy them in their applications.

Abstract

We are witnessing an explosion of new frameworks for building Machine Learning (ML) models [14, 17, 7, 24, 13, 11, 18, 23, 10, 15, 4]. This profusion is motivated by the transition of machine learning from an art and science into a set of technologies readily available to every developer. An outcome of this transition is the abundance of applications that rely on trained models for functionality that evades traditional programming because of its complex statistical nature. Speech recognition and image classification are only the most prominent such cases. This unfolding future, in which most applications make use of at least one model, differs profoundly from current practice, in which data science and software engineering are performed in separate processes and sometimes even in separate organizations. Furthermore, in current practice, models are routinely deployed and managed in ways that are very different from those of other software artifacts. While typical software packages are seamlessly compiled and run on a myriad of heterogeneous devices, machine learning models are often relegated to running as services in relatively inefficient containers [6, 19, 5, 22, 12]. This pattern not only severely limits the kinds of applications one can build with machine learning capabilities, but also discourages developers from embracing ML as a core component of applications.

At Microsoft we have encountered this phenomenon across a wide spectrum of applications and devices, ranging from services and server software to mobile and desktop applications running on PCs, servers, data centers, phones, game consoles, and IoT devices. A machine learning toolkit for such diverse use cases, frequently deeply embedded in applications, must satisfy additional constraints compared to other available toolkits.
For example, it has to limit library dependencies that are uncommon for applications; it must cope with datasets too large to fit in RAM; it has to be portable across many target platforms; it has to be model-class agnostic, as different ML problems lend themselves to different model classes; and, most importantly, it has to capture the entire end-to-end prediction pipeline that takes a test example from a given domain (e.g., an email with headers and body) and produces a prediction that can often be structured and domain-specific (e.g., a collection of likely short responses). The requirement to encapsulate predictive pipelines is of paramount importance because it allows application logic to be effectively decoupled from model development. Carrying the complete train-time pipeline into production provides a dependable way of building efficient, reproducible, production-ready models [26].

The need for ML pipelines has been recognized previously. Python libraries such as Scikit-learn [23] provide the ability to author complex machine learning cascades. Python has become the most popular language for data science thanks to its simplicity, interactive nature (e.g., notebooks [8, 16]) and breadth of libraries (e.g., numpy [25], pandas [21], matplotlib [9]). However, Python-based libraries inherit many syntactic idiosyncrasies and language constraints (e.g., interpreted execution, dynamic typing, and a global interpreter lock that restricts parallelization), making them suboptimal for high-performance applications targeting a wide range of devices.

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

In this paper we introduce ML.NET: a recently open-sourced [2] machine learning framework that allows developers to author complex ML pipelines, composed of data featurizers and state-of-the-art machine learning models, and to deploy them in their applications.
Pipelines implemented and trained in ML.NET can be seamlessly surfaced for prediction without any modification: training and prediction, in fact, share the same code paths, and adding a model to an application is as easy as importing the ML.NET runtime and binding the input/output data sources. ML.NET’s ability to capture full, end-to-end pipelines has been demonstrated by the fact that thousands of Microsoft’s data scientists and developers have been using ML.NET over the past decade, infusing hundreds of products and services with machine learning models used by hundreds of millions of users worldwide.

ML.NET supports large-scale machine learning thanks to an internal design that borrows ideas from relational database management systems and is embodied in its main abstraction: DataView. DataView provides compositional processing of schematized data while gracefully and efficiently handling high-dimensional data in datasets larger than main memory. Like views in relational databases, a DataView is the result of computations over one or more base tables or views, and is generally immutable and lazily evaluated (unless forced to be materialized, e.g., when multiple passes over the data are requested). Under the hood, DataView provides streaming access to data so that working sets can exceed main memory.

ML.NET is open source and publicly available [2]; a recent demo showcasing ML.NET capabilities can be found at [1], and we refer readers to [3] for example pipelines. Next, we give an overview of ML.NET’s main concepts using a simple pipeline.
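
The lazy, streaming behavior attributed to DataView above can be illustrated with a small conceptual sketch. This is not ML.NET's API: the `LazyView` class, its `with_column` and `cache` methods, and the `read_rows` source are hypothetical names introduced only for illustration, and Python generators stand in for whatever cursoring mechanism the real implementation uses. The sketch assumes rows are plain dictionaries and shows three properties the text describes: composing views reads no data, rows are streamed one at a time, and materialization happens only when explicitly forced.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


class LazyView:
    """Hypothetical immutable, lazily evaluated view over rows (not ML.NET's API)."""

    def __init__(self, source: Callable[[], Iterator[Row]], columns: List[str]):
        self._source = source          # factory that returns a fresh row iterator
        self.columns = tuple(columns)  # schema: the column names of this view

    def with_column(self, name: str, fn: Callable[[Row], object]) -> "LazyView":
        """Return a NEW view with one extra computed column; no data is read here."""
        def rows() -> Iterator[Row]:
            for row in self._source():
                yield {**row, name: fn(row)}
        return LazyView(rows, list(self.columns) + [name])

    def cache(self) -> "LazyView":
        """Force materialization, e.g. before making several passes over the data."""
        data = list(self._source())
        return LazyView(lambda: iter(data), list(self.columns))

    def __iter__(self) -> Iterator[Row]:
        return self._source()


calls: List[int] = []  # records every row actually read from the source


def read_rows() -> Iterator[Row]:
    # Stand-in for streaming rows from a dataset too large to hold in memory.
    for i in range(1, 6):
        calls.append(i)
        yield {"x": float(i)}


base = LazyView(read_rows, ["x"])
view = base.with_column("x_squared", lambda r: r["x"] ** 2)
reads_after_compose = list(calls)  # still empty: composition is lazy

first = next(iter(view))           # pulls exactly one row through the chain
reads_after_first = list(calls)

total = sum(r["x_squared"] for r in view)  # one full streaming pass
```

In this sketch each pass over `view` re-reads the source, which is why a multi-pass consumer would call `cache()` once and iterate over the materialized result instead.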