Whether you’re looking to try out an implementation of Machine Learning (ML) this year, or simply looking to keep up with the trends, learning about solid ML framework libraries can help you.
Below I'm going to give a quick intro to Machine Learning per se and then jump straight into this Top Machine Learning Libraries collection, which includes embedded links to their GitHub repos.
To put together this list, I've used both their popularity according to GitHub and my own experience as a software engineer and as someone running a startup community and a tech meetup. It is not a 'ranked' list but the 'top' libraries tend to be either more popular in terms of usage or more generalist when it comes to their domain of application.
Most of these are built using a low-level language like C++ under the hood, but with an interface for higher-level languages: chiefly Python, which has become the de facto standard for implementing Machine Learning models.
What is Machine Learning and how does it work?
Computer scientists use machine learning as a term for the study, development and use of mathematical models of data that help a machine learn without direct instruction.
Computer scientist Arthur Samuel first used the term in a 1959 research paper, whilst Tom Mitchell's 1997 book Machine Learning famously defined it in terms of programs that get better with experience:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
If you were a software engineer and you went by that definition, Machine Learning would basically mean: the building and training of a data model that generalises a certain decision against a performance measure.
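To make Mitchell's definition concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the task T is classifying integer scores against a hidden cut-off, the experience E is a handful of labelled examples, and the performance measure P is accuracy on a fixed set of test inputs.

```python
# Hypothetical toy problem: the "true" rule is hidden from the learner.
HIDDEN_CUTOFF = 60


def true_label(score):
    return score >= HIDDEN_CUTOFF


def learn_cutoff(examples):
    """Estimate the cut-off as the midpoint between the highest negative
    example and the lowest positive example seen so far."""
    negatives = [score for score, label in examples if not label]
    positives = [score for score, label in examples if label]
    return (max(negatives) + min(positives)) / 2


def accuracy(cutoff, test_inputs):
    """Performance measure P: share of test inputs classified correctly."""
    correct = sum((score >= cutoff) == true_label(score) for score in test_inputs)
    return correct / len(test_inputs)


# Experience E: labelled examples arriving over time.
stream = [10, 90, 40, 70, 55, 62]
labelled = [(score, true_label(score)) for score in stream]
test_inputs = range(100)

# Measure P after seeing 2, 4 and 6 examples.
scores = [accuracy(learn_cutoff(labelled[:n]), test_inputs) for n in (2, 4, 6)]
print(scores)  # [0.9, 0.95, 0.99]
```

Each extra pair of labelled examples moves the learned cut-off closer to the hidden one, so P improves with E, which is what the definition asks for.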
What can you do with Machine Learning?
Fundamentally, modern day machine learning has two aims:
- To classify data based on a given model, and
- To make predictions for future outcomes based on such models.
For example, a machine learning algorithm specific to classifying data may be trained on an image dataset in order to recognise images containing a certain given pattern of pixels, which may indicate the presence of a certain object, whereas an algorithm for stock trading may inform the trader of predicted price movements based on historical behaviour.
Approaches to Machine Learning
Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal" or "feedback" available to the learning system:
- Supervised learning: The program is presented with example inputs and their desired outputs, which must therefore be labelled beforehand, and the goal is to learn a general rule that maps inputs to outputs.
- Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
- Reinforcement learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it navigates its problem space, the program is provided feedback that's analogous to rewards, which it tries to maximise.
There are, of course, programs that don't fall neatly into these categories, such as semi-supervised learning, topic modelling or genetic algorithms.
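The difference between the first two categories can be sketched in a few lines of plain Python. This is only an illustration on made-up one-dimensional data: the supervised part uses the labels to compute one mean per class (a nearest-centroid classifier), while the unsupervised part is a bare-bones k-means that never sees a label. The function names are mine, not taken from any library.

```python
def fit_centroids(points, labels):
    """Supervised: use the given labels to compute one mean per class."""
    grouped = {}
    for point, label in zip(points, labels):
        grouped.setdefault(label, []).append(point)
    return {label: sum(values) / len(values) for label, values in grouped.items()}


def predict(centroids, x):
    """Assign x to the class whose mean is nearest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def k_means_1d(points, centroids, iterations=10):
    """Unsupervised: no labels, just alternate assignment and mean updates."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Keep the old centre if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


points = [1.0, 2.0, 3.0, 9.0, 10.0, 11.0]

# Supervised: someone has labelled every point for us.
labels = ["small", "small", "small", "large", "large", "large"]
class_means = fit_centroids(points, labels)
prediction = predict(class_means, 4.0)

# Unsupervised: the same points, but the structure has to be discovered.
found_centres = k_means_1d(points, centroids=[min(points), max(points)])

print(class_means, prediction, found_centres)
```

Both approaches end up with the same two group centres here, but only the supervised one knows what the groups are called.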
Today, however, we don't actually have to re-implement from scratch all the algorithms that have already been discovered! That's because there are many frameworks around, which we engineers can import into a project as libraries and which can speed up the process of building and training our own data models.
So, without any further ado, here are what I think to be the Top 20 Machine Learning Libraries you should try out this year:
Top 20 Machine Learning Libraries to Try in 2022
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system, i.e. a system that uses reverse-mode automatic differentiation
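To give a feel for what "tape-based" reverse-mode automatic differentiation means, here is a tiny sketch in plain Python. It is not PyTorch code and says nothing about how PyTorch is implemented internally; the Var class and backward function are hypothetical names used only to show the idea: every operation is recorded on a tape as it runs, and gradients are obtained by replaying the tape backwards.

```python
class Var:
    """A scalar value that records the operations applied to it on a tape."""

    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        # Local derivatives of (a + b) with respect to a and b are both 1.
        self.tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        # Local derivatives of (a * b) are b with respect to a, and a with respect to b.
        self.tape.append((out, [(self, other.value), (other, self.value)]))
        return out


def backward(output):
    """Replay the tape in reverse, applying the chain rule at each step."""
    output.grad = 1.0
    for node, parents in reversed(output.tape):
        for parent, local_grad in parents:
            parent.grad += local_grad * node.grad


tape = []
x = Var(3.0, tape)
y = Var(4.0, tape)
z = x * y + x  # the forward pass records two operations on the tape

backward(z)
print(z.value, x.grad, y.grad)  # 15.0 5.0 3.0
```

The gradients match the hand-derived ones: dz/dx = y + 1 = 5 and dz/dy = x = 3.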
You can also extend the framework with other packages for scientific calculations, such as NumPy or SciPy.
It has been primarily developed by Facebook's AI Research lab (FAIR) and released under the open source 'Modified BSD' license.
PyTorch is an extremely flexible library: it has several easy-to-use built-in features, and you can extend it by writing your own functions or integrating other libraries, for example when you need to perform computations that are not differentiable.
TensorFlow is an open source library that was originally created by Google. It is used to design, build, and train deep learning models.
It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications. If you'd like to try it out, you can check out its official website, which is full of resources to get you going in no time.
TensorFlow now also offers support for federated learning, in which algorithms learn collaboratively without centralised training data.
TensorFlow provides stable Python and C++ APIs, as well as APIs for other languages that come without a backward-compatibility guarantee.
LightFM is a Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback, including efficient implementations of the Bayesian Personalised Ranking (BPR) and Weighted Approximately Ranked Pairwise (WARP) ranking losses. It's easy to use, fast (via multithreaded model estimation), and produces high quality results.
It also makes it possible to incorporate both item and user metadata into the traditional matrix factorisation algorithms. It represents each user and item as the sum of the latent representations of their features, thus allowing recommendations to generalise to new items (via item features) and to new users (via user features).
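The "sum of latent representations" idea can be sketched with NumPy alone. The snippet below does not use LightFM or its API: the feature names and embedding values are invented, nothing is trained, and bias terms are left out. It only shows how a user or item vector is built by summing the vectors of its features, and why that lets a brand-new item be scored.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, one per metadata feature.
user_feature_embeddings = {
    "likes_scifi": np.array([1.0, 0.0]),
    "age_18_25": np.array([0.5, 0.5]),
}
item_feature_embeddings = {
    "genre_scifi": np.array([2.0, 0.0]),
    "genre_romance": np.array([0.0, 2.0]),
    "released_2022": np.array([0.5, 0.5]),
}


def represent(features, table):
    """A user or item is the sum of the embeddings of its features."""
    return np.sum([table[name] for name in features], axis=0)


def score(user_features, item_features):
    """Affinity is the dot product of the two summed representations."""
    user_vector = represent(user_features, user_feature_embeddings)
    item_vector = represent(item_features, item_feature_embeddings)
    return float(user_vector @ item_vector)


user = ["likes_scifi", "age_18_25"]

# Two items with no interaction history at all, described only by features.
new_scifi_film = ["genre_scifi", "released_2022"]
new_romance_film = ["genre_romance", "released_2022"]

scifi_score = score(user, new_scifi_film)
romance_score = score(user, new_romance_film)
print(scifi_score, romance_score)  # 4.0 2.0
```

Because the score depends only on feature embeddings, the new science fiction film is ranked above the new romance film for this user even though nobody has interacted with either yet.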
It is beautifully simple to use but not simplistic and you can develop extremely powerful models thanks to it.
Whilst not really a Machine Learning framework, Pandas is an extremely useful library to do Machine Learning with. It is a Python library that is used for faster data analysis, data cleaning, and data pre-processing.
Pandas is built on top of Python's numerical library, NumPy.
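As a small illustration of the cleaning and pre-processing steps mentioned above, the sketch below takes an invented table with a duplicated row and some missing values and tidies it up. The column names and values are made up, and filling gaps with the column mean is just one of several reasonable strategies.

```python
import pandas as pd

# Hypothetical raw data: one duplicated row, one missing age, one missing city.
raw = pd.DataFrame(
    {
        "age": [25, None, 40, 40],
        "city": ["Paris", None, "Oslo", "Oslo"],
        "income": [30000.0, 42000.0, 58000.0, 58000.0],
    }
)

# Drop exact duplicate rows and renumber the index.
clean = raw.drop_duplicates().reset_index(drop=True)

# Fill missing numeric values with the column mean, missing text with a marker.
clean["age"] = clean["age"].fillna(clean["age"].mean())
clean["city"] = clean["city"].fillna("unknown")

print(clean)
```

The resulting frame has no missing values left and can be handed to a modelling library.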
NumPy is a Python library used for working with arrays. It also has functions for working in the domain of linear algebra, Fourier transforms, and matrices, which makes it pretty much the fundamental package for scientific computing with Python, and a requirement for many higher-level data manipulation libraries.
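Here is a short sketch of the three areas just mentioned, on made-up numbers: array arithmetic, solving a linear system, and a discrete Fourier transform.

```python
import numpy as np

# Arrays support element-wise arithmetic without explicit loops.
prices = np.array([10.0, 20.0, 30.0])
discounted = prices * 0.5

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)

# Fourier transform of a short periodic signal.
signal = np.array([1.0, 0.0, -1.0, 0.0])
spectrum = np.fft.fft(signal)

print(discounted, x, spectrum)
```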
SciPy is an open-source Python library used for scientific and technical computing. Built with NumPy, SciPy provides algorithms for optimisation, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems.
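Two of the problem classes listed above, optimisation and integration, can be tried in a couple of lines. The functions being minimised and integrated are arbitrary examples chosen so the answers are easy to check by hand.

```python
from scipy import integrate, optimize

# Optimisation: find the minimum of a simple parabola, which sits at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# Integration: the integral of x^2 from 0 to 3 is 9.
area, error_estimate = integrate.quad(lambda x: x ** 2, 0.0, 3.0)

print(result.x, result.fun, area)
```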
Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library.
It was developed with a focus on enabling fast experimentation, its key thesis being that going from idea to result as fast as possible is key to doing good research.
Keras is the high-level API of TensorFlow 2: an approachable, highly-productive interface for solving machine learning problems, with a focus on modern deep learning.
Keras allows you to take full advantage of the cross-platform capabilities of TensorFlow 2. For example, you can run Keras on TPU or on large clusters of GPUs, and you can export your Keras models to run in the browser or on a mobile device.
Scikit-learn (formerly scikits.learn) is another great open-source Python machine learning library.
It includes various implementations of classification, regression and clustering algorithms, such as support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
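A typical scikit-learn workflow looks roughly like the sketch below: generate or load data, split it, fit an estimator and score it. The dataset is synthetic (two well-separated blobs whose centres I picked arbitrarily), and a linear support-vector machine is just one of the algorithms you could plug in here.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class dataset with clearly separated clusters.
X, y = make_blobs(
    n_samples=200,
    centers=[[-3.0, -3.0], [3.0, 3.0]],
    cluster_std=1.0,
    random_state=0,
)

# Hold out a quarter of the data to measure generalisation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a support-vector machine and evaluate it on unseen data.
model = SVC(kernel="linear").fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Most scikit-learn estimators expose the same fit and score methods, so swapping in a random forest or gradient boosting model usually means changing a single line.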
Just like Theano, Aesara is a Python library that you can use to power large-scale computationally intensive scientific processes. It allows you to define, optimise, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Because of this, it is a key foundational library for Deep Learning in Python that you can use directly to create Deep Learning models or through wrapper libraries based on it or on its predecessor Theano.
Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents to improve via game-like environments that provide them with feedback and benchmarks.
You can write your agent using your existing numerical computation library, such as TensorFlow or Aesara/Theano.
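The feedback loop between an agent and an environment can be sketched without installing anything. The CorridorEnv class below is hypothetical: it only mimics the reset/step pattern used by reinforcement learning toolkits and is not Gym's actual API, whose exact signatures vary between versions.

```python
class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (1 = right, anything else = left) and return feedback."""
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=20):
    """Let the policy act until the episode ends or the step budget runs out."""
    observation = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def always_right(observation):
    """A fixed policy; a learning agent would adapt this from the rewards."""
    return 1


total_reward, steps = run_episode(CorridorEnv(), always_right)
print(total_reward, steps)  # 1.0 4
```

A reinforcement learning algorithm would replace always_right with a policy that is updated from the rewards it collects.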
Gym has been developed by OpenAI, the company behind the GPT-3 Natural Language Processing algorithm.
XGBoost implements machine learning algorithms under the Gradient Boosting framework. It supports multiple languages including C++, Python, R, Java, Scala and Julia.
XGBoost provides parallel tree boosting (also known as Gradient Boosting Machine [GBM] or Gradient Boosting Decision Trees [GBDT]) that solves many data science problems in a fast and accurate way.
Boosting is an ensemble technique where predictors are assembled sequentially, one after the other. In gradient boosting, each new predictor is fitted to the negative gradient of the loss function evaluated at the current ensemble's predictions, so that adding it to the ensemble amounts to a gradient descent step.
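To see boosting at work without any library, here is a minimal from-scratch sketch in plain Python. It is not XGBoost's algorithm or API: it uses one-split regression "stumps" as predictors and squared error as the loss, for which the negative gradient is simply the residual (target minus current prediction). The data, function names and learning rate are all illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree: pick the split with the lowest squared error."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x < split else right_mean


def gradient_boost(xs, ys, rounds, learning_rate=0.5):
    """Add stumps one after the other, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(rounds):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mean_squared_error(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 4, 9, 16, 25, 36]

# Training error after 0, 1 and 10 boosting rounds.
errors = [mean_squared_error(gradient_boost(xs, ys, rounds=n), xs, ys) for n in (0, 1, 10)]
print(errors)
```

Each round shrinks the training error a little further. Libraries such as XGBoost build on the same idea with full decision trees, regularisation and parallelised tree construction.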
MLflow is a framework to manage the machine learning lifecycle, including experimentation, reproducibility, deployment, and the keeping of a central model registry.
MLflow is library-agnostic. You can use it with any machine learning library, and in any programming language, since all functions are accessible through a REST API and a command line interface (CLI).
MLflow Tracking, the component made up of MLflow's tracking URI and logging API, logs and tracks your training run metrics and model artifacts, no matter your experiment's environment.
LightGBM, short for Light Gradient Boosting Machine, is a free and open source distributed gradient boosting framework for machine learning originally developed by Microsoft.
MindsDB enables advanced predictive capabilities directly inside databases. Anyone who knows the basics of SQL can therefore build data models.
It works with pretty much all the main databases out there, from MySQL and PostgreSQL to MongoDB and Kafka, as well as with the main ML frameworks, including PyTorch, TensorFlow and Scikit-Learn.
OpenNN (Open Neural Networks Library) is a software library written in C++ that implements neural networks with an emphasis on advanced performance.
This library is probably less user-friendly than TensorFlow or PyTorch, but is constantly optimised and parallelised in order to maximise its efficiency in execution speed and memory allocation.
OpenNN is an open-source library developed by the company Artelnics.
OpenCV has become almost an industry standard when it comes to Computer Vision.
It includes several hundred computer vision algorithms, which play a major role in self-driving cars and robotics as well as in photo correction apps.
By using it, you can process images and videos to identify objects, faces, or even human handwriting. When it is integrated with libraries such as NumPy, Python can process OpenCV's array structures for analysis. It has C++, C, Python and Java interfaces and supports Windows, Linux, Mac OS, iOS and Android.
NLTK (Natural Language Toolkit) is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) written in the Python programming language.
It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. It is published under the MIT license and, unlike NLTK, which is mostly used for experimenting, teaching and research, spaCy focuses on providing software for production usage.
spaCy also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow or PyTorch through its own machine learning library, Thinc.
MLPack is a fast, flexible machine learning library, written in C++, that aims to provide fast, extensible implementations of cutting-edge machine learning algorithms, such as Hidden Markov Models, Clustering and Regression models, and Tree-based searches amongst others. MLPack provides these algorithms as simple command-line programs, Python bindings, and C++ classes which can then be integrated into larger-scale machine learning solutions.
It was built on top of the linear algebra library Armadillo by NumFOCUS, a nonprofit dedicated to supporting the open source scientific computing community.
Chainer is an open-source Deep Learning framework written in Python on top of NumPy.
It was the first Deep Learning framework to introduce the define-by-run approach. In the more traditional define-and-run approach, you first define the fixed connections between mathematical operations in the network (for instance, matrix multiplication and nonlinear activations) and only then run the actual training computation. With define-by-run, those connections are instead defined on the fly as the training computation runs.
It supports GPU training via CUDA/cuDNN using CuPy, a NumPy-compatible array library.
Caffe is an open-source deep learning framework. It is written in C++, with a Python interface. It has been developed by Berkeley AI Research (BAIR), with contributions from community developers.
This software has been designed with expression, speed, modularity, openness and community support in mind, to enable seamless creation of Deep Learning models.
Concluding strategy to try ML frameworks out
There you have them: the top 20 libraries to keep an eye on or try out for yourself this year in Machine Learning.
My advice, if you're looking to get your hands dirty with an ML framework this year, is the following:
- First build a few models with a high-level, easy-to-use library that has plenty of tutorials around to follow and pre-built algorithms to experiment with, such as LightFM, TensorFlow or PyTorch. This will help you have fun and learn without worrying too much about complex environment setups. I'd stick with Python libraries.
- Once you get a sense of how machine learning works and have built functioning tools, try playing around with maths frameworks like NumPy, Pandas and Aesara to manipulate the data in your models and understand what's going on under the hood.
- Get creative with powerful wrapper libraries: now you can 'go back up' and use higher level multi-purpose wrapper libraries like Keras and TensorFlow, which you can now mold to your needs, or specialised libraries that are better suited to develop more complex production-ready models.
Ultimately, however, when progressing your journey into machine learning, only you can answer the question of what are the top frameworks to learn and use. Make your selection based on your interests and the type of applications you want to develop, such as pure data crunching vs computer vision. If you don't have a clear picture of what you will be looking to learn and develop, then let your curiosity decide:
I have no special talents. I am only passionately curious. ─Albert Einstein