Open Source Python Data Science Libraries

When I wrote the article entitled A few thoughts on scientific software, one of the responses I got was that people did not know about the existence of open-source chemistry toolkits, so I thought I’d publish a page that would hopefully stop people reinventing the wheel. Here are a few open-source cheminformatics toolkits that I’m aware of.

As a follow-up I thought I’d put together a list of useful Python libraries for data science.

If you have installed Anaconda, a number of these packages will be preinstalled. However, the fastest way to obtain conda is to install Miniconda, a minimal version of Anaconda that includes only conda and its dependencies. You can then use

conda install <package_name>

to install specific packages from the Anaconda repository. An alternative Python package manager is pip (https://pypi.org/project/pip/). To install packages use

pip install <package_name>

Pandas

https://pandas.pydata.org

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive, and it aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. There are over 1300 contributors on GitHub.
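To give a flavour of what working with “labeled” data looks like, here is a minimal sketch. The column names and values are made up purely for illustration; only the DataFrame constructor, boolean selection and groupby calls are standard pandas.

```python
import pandas as pd

# Hypothetical example data: a few compounds with measured activities.
df = pd.DataFrame(
    {
        "compound": ["A", "B", "C", "D"],
        "series": ["s1", "s1", "s2", "s2"],
        "activity": [5.0, 7.0, 6.0, 8.0],
    }
)

# Label-based selection: keep rows where activity exceeds a threshold.
active = df[df["activity"] > 5.5]

# Split-apply-combine: mean activity for each series.
mean_by_series = df.groupby("series")["activity"].mean()

print(active)
print(mean_by_series)
```

The groupby result is a Series indexed by the series label, so individual values can be looked up by name rather than by position.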

It can be installed using conda

conda install pandas


Or PIP

pip install pandas


pandas requires: NumPy 1.9.0 or higher, python-dateutil 2.5.0 or higher, pytz 2011k or higher.

There is extensive documentation

License: Open source – BSD license
Source code: https://github.com/pandas-dev/pandas
Mailing list: https://groups.google.com/forum/?fromgroups#!forum/pydata
Stackoverflow: https://stackoverflow.com/questions/tagged/pandas

Modin

https://modin.readthedocs.io/en/latest/

Modin is a library designed to accelerate Pandas by automatically distributing the computation across all of the system’s available CPU cores. Modin uses Ray to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical. Modin is a DataFrame designed for datasets from 1MB to 1TB+

It can be installed using PIP

pip install modin


If you don’t have Ray or Dask installed, you will need to install Modin with one of the targets:

pip install modin[ray] # Install Modin dependencies and Ray to run on Ray
pip install modin[dask] # Install Modin dependencies and Dask to run on Dask
pip install modin[all] # Install all of the above

Currently, Modin depends on pandas version 0.23.4.

License: Apache 2.0
Source code: https://github.com/modin-project/modin
Mailing list: https://groups.google.com/forum/#!forum/modin-dev

NumPy

https://www.numpy.org

NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities. There are over 700 contributors on GitHub.
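As a rough illustration of broadcasting and the linear algebra routines, the sketch below centres the columns of a small array and solves a 2x2 linear system. The numbers are invented for the example.

```python
import numpy as np

# A 3x2 array of hypothetical measurements (rows = samples, columns = features).
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Column means have shape (2,); broadcasting stretches them across all 3 rows.
col_means = data.mean(axis=0)
centered = data - col_means

# A small linear algebra call: solve A x = b for x.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)

print(centered)
print(x)
```

No explicit loop is needed for the subtraction: the shape (2,) array of means is matched against the trailing dimension of the (3, 2) array.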

It can be installed using conda

conda install numpy


It can be installed using PIP

pip install --user numpy


There is extensive documentation

License: BSD
Source code: https://github.com/numpy/numpy
Mailing List: https://mail.python.org/mailman/listinfo/numpy-discussion

SciPy

https://www.scipy.org/index.html

SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more. SciPy depends on NumPy, which provides convenient and fast N-dimensional array manipulation. SciPy is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. There are over 700 contributors on GitHub.
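The numerical integration and optimization routines mentioned above can be sketched as follows. The integrand and the quadratic being minimised are arbitrary choices made for the example.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi; the exact answer is 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Minimise a simple quadratic whose minimum lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)

print(area, abs_error)
print(result.x, result.fun)
```

Both routines accept plain Python callables, and quad also returns an estimate of the absolute error alongside the integral.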

It can be installed using conda

conda install scipy

It can be installed using PIP

pip install --user scipy


SciPy requires the following software installed for your platform: Python 2.7 or >= 3.4, and NumPy >= 1.8.2.

There is extensive documentation

License: BSD
Source code: https://github.com/scipy/scipy
Mailing list: https://scipy.org/scipylib/mailing-lists.html

Scikit-learn

https://scikit-learn.org/stable/
Scikit-learn is a machine learning library written in Python and built on NumPy, SciPy and matplotlib. It provides a very wide variety of tools for data mining and data analysis, and features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN. There are over 12,000 contributors on GitHub. The project was started in 2007 by David Cournapeau as a Google Summer of Code project.
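As a small, hedged example of the fit/score pattern that most scikit-learn estimators share, the sketch below trains a random forest on the bundled iris dataset. The split size, number of trees and random seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small bundled dataset as plain NumPy arrays.
X, y = load_iris(return_X_y=True)

# Hold back a quarter of the data for evaluation (seed fixed for repeatability).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a random forest and report the accuracy on the held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(f"Held-out accuracy: {accuracy:.2f}")
```

Swapping in another classifier, for example a support vector machine, should only require changing the line that constructs the model.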

Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830

Scikit-learn requires: Python (>= 2.7 or >= 3.4), NumPy (>= 1.8.2), SciPy (>= 0.13.3).

It should be noted Scikit-learn 0.20 is the last version to support Python 2.7 and Python 3.4. Scikit-learn 0.21 will require Python 3.5 or newer.

It can be installed using conda

conda install scikit-learn


or PIP

pip install -U scikit-learn


There is extensive documentation and a number of tutorials.

Also worth looking at is sklearn-pandas, a bridge between Scikit-Learn’s machine learning methods and pandas-style Data Frames.

License: Open source, commercially usable – BSD license
Source code: https://github.com/scikit-learn/scikit-learn
Mailing list: https://mail.python.org/mailman/listinfo/scikit-learn
Stackoverflow: https://stackoverflow.com/questions/tagged/scikit-learn

PyTorch

https://pytorch.org

PyTorch is a Python package that provides two high-level features:

Tensor computation (like NumPy) with strong GPU acceleration
Deep neural networks built on a tape-based autograd system

You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed. Currently in an early-release beta. Expect some adventures and rough edges. There are over 800 contributors on GitHub.

It can be installed using conda

conda install pytorch torchvision -c pytorch


or PIP

pip3 install torch torchvision


Or built from source. You will need to build from source if you want CUDA support.

License: BSD-style license.
Source code: https://github.com/pytorch/pytorch
Mailing list: https://discuss.pytorch.org

Tensorflow

https://www.tensorflow.org

TensorFlow™ is an open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. There are over 1700 contributors on GitHub.

To install the current release for CPU-only:

pip install tensorflow


Use the GPU package for CUDA-enabled GPU cards:

pip install tensorflow-gpu


Docker images are also available https://hub.docker.com/r/tensorflow/tensorflow/.

License: Apache License 2.0
Source code: https://github.com/tensorflow/tensorflow
Mailing List: https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss
Stackoverflow: https://stackoverflow.com/questions/tagged/tensorflow

Keras

https://keras.io
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. There are over 700 contributors on GitHub.

It can be installed using PIP

pip install keras


Or built from source

There is extensive documentation.

You can use Sequential Keras models (single-input only) as part of your Scikit-Learn workflow via the wrappers https://keras.io/scikit-learn-api/.

License: MIT license.
Source code: https://github.com/keras-team/keras
Mailing list: https://groups.google.com/forum/#!forum/keras-users

XGBoost

https://xgboost.ai

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples. There are over 300 contributors on GitHub.

XGBoost: A Scalable Tree Boosting System DOI

It can be installed using PIP

On macOS, first obtain gcc-7 with Homebrew (https://brew.sh/) to enable multi-threading (i.e. using multiple CPU threads for training). The default Apple Clang compiler does not support OpenMP, so using the default compiler would disable multi-threading.

pip3 install xgboost


Or built from source

License: Apache-2 license.
Source code: https://github.com/dmlc/xgboost
Mailing list: https://discuss.xgboost.ai

statsmodels

https://www.statsmodels.org/stable/

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. There are 164 contributors on GitHub.

Statsmodels: Econometric and statistical modeling with Python. Proceedings of the 9th Python in Science Conference, 2010.

It can be installed using conda

conda install statsmodels


Or PIP

pip install --upgrade --no-deps statsmodels


Or you can build from source.

There is extensive Documentation

License: Open source, Modified BSD (3-clause)
Source code: https://github.com/statsmodels/statsmodels
Mailing list: https://groups.google.com/forum/#!forum/pystatsmodels

pyjanitor

pyjanitor is a project that extends Pandas with a verb-based API, providing convenient data cleaning routines for repetitive tasks.

It can be installed using conda

conda install pyjanitor -c conda-forge


Or PIP

pip install pyjanitor


There is extensive Documentation, including a section on cleaning chemistry data.

Matplotlib

https://matplotlib.org

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits. There are nearly 800 contributors on GitHub. NOTE: The current master branch is now Python 3 only. Python 2 support is being dropped.
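A minimal sketch of producing a figure file from a script is shown below. The Agg backend is chosen so the example runs without a display, and the output file name and location are assumptions made for the example.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)

# Create a figure with a single set of axes and draw one labelled line.
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# Hypothetical output location: a PNG in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "sine.png")
fig.savefig(out_path, dpi=150)
```

Changing the file extension passed to savefig (for example to .pdf or .svg) selects a different output format.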

It can be installed using PIP

pip install -U matplotlib


Matplotlib requires the following dependencies:

Python (>= 3.5), FreeType (>= 2.3), libpng (>= 1.2), NumPy (>= 1.10.0), setuptools, cycler (>= 0.10.0), dateutil (>= 2.1), kiwisolver (>= 1.0.0), pyparsing

License: Python Software Foundation (PSF) license.
Source code: https://github.com/matplotlib/matplotlib
Mailing list: matplotlib-users@python.org

Seaborn

https://seaborn.pydata.org

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics and is closely integrated with pandas.

There is a really comprehensive set of tutorials

It can be installed using conda

conda install seaborn


Or PIP

pip install seaborn


Seaborn requires: numpy (>= 1.9.3), scipy (>= 0.14.0), matplotlib (>= 1.4.3), pandas (>= 0.15.2)

License: BSD 3-clause license
Source code: https://github.com/mwaskom/seaborn
Stackoverflow: https://stackoverflow.com/questions/tagged/seaborn

Jupyter

http://jupyter.org

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. While Jupyter runs code in many programming languages, Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook itself.

It can be installed using conda

conda install jupyter

Or PIP

pip install jupyter


There is extensive documentation

License: modified BSD license
Stackoverflow: https://stackoverflow.com/questions/tagged/jupyter

I thought I’d also mention collections. It is in the standard library, but I seem to use defaultdict regularly.
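For anyone who has not come across it, here is a short sketch of the typical use of collections.defaultdict: grouping items by a key without first checking whether the key exists. The pairs below are invented for the example.

```python
from collections import defaultdict

# Hypothetical (series, compound) pairs to be grouped by series.
pairs = [("s1", "A"), ("s1", "B"), ("s2", "C")]

# Missing keys are created on first access with an empty list as the value.
groups = defaultdict(list)
for series, compound in pairs:
    groups[series].append(compound)

print(dict(groups))  # {'s1': ['A', 'B'], 's2': ['C']}
```

With a plain dict the loop would need an explicit check or a call to setdefault before each append.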

Last Updated 9 January 2020
