According to notes on the project repository, Microsoft has revamped its open-source MMLSpark project to improve the integration of “many deep learning and data science tools into the Spark ecosystem.” First published last year, MMLSpark bundles a series of projects intended to make Spark more useful in many contexts, machine learning in particular but also more general areas.

Some features of MMLSpark integrate Spark with Microsoft’s machine learning offerings like Cognitive Toolkit (CNTK) and LightGBM, as well as third-party projects like OpenCV. Other functions make it possible to turn Spark into a service or a client: for example, to more easily serve Spark computations (including machine learning predictions) over the web, or to let Spark interact with other web services over HTTP. Another function, called “LIME on Spark,” provides annotated results for predictions served by an image classifier, making it possible to see at a glance whether the classifier is working correctly.

Available on Azure Databricks

MMLSpark consolidates all these functions into a set of APIs available for Scala and Python. The repo also contains a few examples to help you get started, including using web services in Spark, using OpenCV on Spark for image manipulation, and training a deep image classifier on Azure VMs with GPUs.

MMLSpark itself can be installed on existing Spark clusters as a package. It can also be used in the Databricks cloud (or a Databricks appliance on Azure), installed directly in a Python or Anaconda environment, or run in a Docker container. The integration is also available for the R language, but for now only through an auto-generated wrapper in beta.
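To make the "install as a package" route more concrete, here is a minimal sketch of how a job might be launched with an extra package pulled in through spark-submit's standard `--packages` and `--repositories` options. The helper name `spark_submit_command`, the script name `my_job.py`, and the coordinate string are illustrative assumptions, not taken from the MMLSpark documentation; the coordinate is deliberately left as a placeholder, and the real value should come from the project README.

```python
import shlex


def spark_submit_command(app, package, repository=None):
    """Build a spark-submit argument list that pulls in one extra package.

    Hypothetical helper for illustration; it only assembles the command
    and does not run Spark.
    """
    args = ["spark-submit", "--packages", package]
    if repository is not None:
        # --repositories names an additional Maven repository to resolve from.
        args += ["--repositories", repository]
    args.append(app)
    return args


# Placeholder, not a real coordinate: take the actual
# group:artifact:version string from the MMLSpark README.
MMLSPARK_COORDINATE = "<group>:<artifact>:<version>"

command = spark_submit_command("my_job.py", MMLSPARK_COORDINATE)

# Print a shell-quoted version that could be pasted into a terminal.
print(shlex.join(command))
```

The same coordinate could in principle be passed to `spark-shell` or `pyspark`, which accept the `--packages` option as well; the Databricks, Anaconda, and Docker routes mentioned above each follow their own procedure.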