When I wrote the article entitled A few thoughts on scientific software, one of the responses I got was that people did not know about the existence of open-source chemistry toolkits, so I thought I’d publish a page that hopefully stops people reinventing the wheel. Here are a few open-source cheminformatics toolkits that I’m aware of.
As a follow-up, I thought I’d put together a list of useful Python libraries for data science.
If you have installed Anaconda, a number of these packages will be preinstalled. However, the fastest way to obtain conda is to install Miniconda, a minimal version of Anaconda that includes only conda and its dependencies. You can then use
    conda install
followed by the package name to install specific packages from the Anaconda repository. An alternative Python package manager is PIP (https://pypi.org/project/pip/). To install packages, use
    pip install
followed by the package name.
Pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. There are over 1300 contributors on GitHub.
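To give a feel for what working with labeled data looks like, here is a minimal sketch. The column names and values are made up purely for illustration; only the pandas calls themselves (DataFrame, boolean indexing, groupby) are real.

```python
import pandas as pd

# Hypothetical measurements; compound names and values are invented for this example.
df = pd.DataFrame(
    {
        "compound": ["A", "A", "B", "B", "C"],
        "assay": ["x", "y", "x", "y", "x"],
        "value": [1.2, 3.4, 2.2, 0.4, 5.0],
    }
)

# Label-based selection: keep only the rows where the assay column equals "x".
assay_x = df[df["assay"] == "x"]

# Relational-style aggregation: mean value per compound.
mean_by_compound = df.groupby("compound")["value"].mean()

print(assay_x)
print(mean_by_compound)
```

The same pattern (select rows by label, then group and aggregate) covers a large share of everyday data cleaning.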
It can be installed using conda
    conda install pandas
Or PIP
    pip install pandas
pandas requires:
- NumPy: 1.9.0 or higher
- python-dateutil: 2.5.0 or higher
- pytz: 2011k or higher
There is extensive documentation
License: Open source, BSD license
Source code: https://github.com/pandas-dev/pandas
Mailing list: https://groups.google.com/forum/?fromgroups#!forum/pydata
Stackoverflow: https://stackoverflow.com/questions/tagged/pandas
Modin
https://modin.readthedocs.io/en/latest/
Modin is a library designed to accelerate pandas by automatically distributing the computation across all of the system’s available CPU cores. Modin uses Ray to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical. Modin is a DataFrame designed for datasets from 1MB to 1TB+.
It can be installed using PIP
    pip install modin
If you don’t have Ray or Dask installed, you will need to install Modin with one of the targets:
    pip install modin[ray] # Install Modin dependencies and Ray to run on Ray
    pip install modin[dask] # Install Modin dependencies and Dask to run on Dask
    pip install modin[all] # Install all of the above
Currently, Modin depends on pandas version 0.23.4.
License: Apache 2.0
Source Code: https://github.com/modin-project/modin
Mailing list: https://groups.google.com/forum/#!forum/modin-dev
NumPy
NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities. There are over 700 contributors on GitHub.
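As a quick sketch of the features just listed (arrays, broadcasting, linear algebra, random numbers and Fourier transforms), something like the following should work. The numbers are arbitrary, and np.random.default_rng assumes a reasonably recent NumPy (1.17 or later).

```python
import numpy as np

# N-dimensional array: a 3x2 matrix of made-up values.
a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Broadcasting: the column means (shape (2,)) are stretched across all three rows.
centered = a - a.mean(axis=0)

# Linear algebra: solve the 2x2 system m @ x = b.
m = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(m, b)

# Random numbers from a seeded generator, so the run is repeatable.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

# Fourier transform of a pure 5 Hz sine wave sampled at 100 Hz for one second.
t = np.arange(100) / 100.0
spectrum = np.abs(np.fft.rfft(np.sin(2 * np.pi * 5 * t)))
peak_bin = int(np.argmax(spectrum))  # bins are 1 Hz apart, so this is the frequency in Hz

print(centered)
print(x, peak_bin)
```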
It can be installed using conda
    conda install numpy
Or PIP
    pip install --user numpy
There is extensive documentation
License: BSD
Source Code: https://github.com/numpy/numpy
Mailing List: https://mail.python.org/mailman/listinfo/numpy-discussion
SciPy
https://www.scipy.org/index.html
SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more. SciPy depends on NumPy, which provides convenient and fast N-dimensional array manipulation. SciPy is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. There are over 700 contributors on GitHub.
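To illustrate the numerical integration and optimization routines mentioned above, here is a minimal sketch. The integrand and the quadratic being minimized are toy functions chosen because their answers are known in advance.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize a simple quadratic whose minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)

print(area, abs_error)
print(result.x, result.fun)
```

Note that both routines take ordinary Python callables and NumPy values, which is what "built to work with NumPy arrays" means in practice.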
It can be installed using conda
    conda install scipy
Or PIP
    pip install --user scipy
SciPy requires the following software installed for your platform:
- Python 2.7 or >= 3.4
- NumPy >= 1.8.2
There is extensive documentation
License: BSD
Source Code: https://github.com/scipy/scipy
Mailing List: https://scipy.org/scipylib/mailing-lists.html
Scikit-learn
https://scikit-learn.org/stable/
Scikit-learn is written in Python and is a library for machine learning built on NumPy, SciPy and matplotlib. It provides a very wide variety of tools for data mining and data analysis with a focus on machine learning. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN. There are over 12,000 contributors on GitHub. The project was started in 2007 by David Cournapeau as a Google Summer of Code project.
Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830
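Most of the algorithms listed above share the same fit/predict/score interface. As a minimal sketch, here is a random forest classifier trained on the iris dataset that ships with scikit-learn; the split size, number of trees and random seeds are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the model on the training split only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Mean accuracy on the held-out split.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Swapping in a different estimator (for example a support vector machine) should only require changing the import and the constructor line.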
Scikit-learn requires: Python (>= 2.7 or >= 3.4), NumPy (>= 1.8.2), SciPy (>= 0.13.3).
It should be noted Scikit-learn 0.20 is the last version to support Python 2.7 and Python 3.4. Scikit-learn 0.21 will require Python 3.5 or newer.
It can be installed using conda
    conda install scikit-learn
Or PIP
    pip install -U scikit-learn
There is extensive documentation and a number of tutorials.
Also worth looking at is sklearn-pandas, a bridge between Scikit-Learn’s machine learning methods and pandas-style Data Frames.
License: Open source, commercially usable, BSD license
Source code: https://github.com/scikit-learn/scikit-learn
Mailing list: https://mail.python.org/mailman/listinfo/scikit-learn
Stackoverflow: https://stackoverflow.com/questions/tagged/scikit-learn
PyTorch
PyTorch is a Python package that provides two high-level features:
- Tensor computation (like NumPy) with strong GPU acceleration
- Deep neural networks built on a tape-based autograd system
You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed. PyTorch is currently in an early-release beta, so expect some adventures and rough edges. There are over 800 contributors on GitHub.
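The two features listed above can be seen together in a very small sketch: a tensor computation whose operations are recorded, followed by a backward pass that computes the gradient. The function y = sum(x^2) is just a toy example chosen because its gradient (2x) is easy to check by hand.

```python
import torch

# A tensor that records the operations applied to it so gradients can be computed later.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: y = sum(x^2). Autograd records each operation as it runs.
y = (x ** 2).sum()

# Backward pass: walk the recorded operations in reverse to get dy/dx = 2x.
y.backward()

print(y.item())
print(x.grad)
```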
It can be installed using conda
    conda install pytorch torchvision -c pytorch
Or PIP
    pip3 install torch torchvision
Or it can be built from source. You will need to build from source if you want CUDA support.
License: BSD-style license.
Source code: https://github.com/pytorch/pytorch
Mailing list: https://discuss.pytorch.org
TensorFlow
TensorFlow™ is an open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. There are over 1700 contributors on GitHub.
To install the current release for CPU-only:
    pip install tensorflow
Use the GPU package for CUDA-enabled GPU cards:
    pip install tensorflow-gpu
Docker images are also available at https://hub.docker.com/r/tensorflow/tensorflow/.
License: Apache License 2.0
Source code: https://github.com/tensorflow/tensorflow
Mailing List: https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss
Stackoverflow: https://stackoverflow.com/questions/tagged/tensorflow
Keras
https://keras.io
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. There are over 700 contributors on GitHub.
It can be installed using PIP
    pip install keras
There is extensive documentation.
You can use Sequential Keras models (single-input only) as part of your Scikit-Learn workflow via the wrappers https://keras.io/scikit-learn-api/.
License: MIT license.
Source code: https://github.com/keras-team/keras
Mailing list: https://groups.google.com/forum/#!forum/keras-users
XGBoost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples. There are over 300 contributors on GitHub.
XGBoost: A Scalable Tree Boosting System DOI
It can be installed using PIP. On macOS, first obtain gcc-7 with Homebrew (https://brew.sh/) to enable multi-threading (i.e. using multiple CPU threads for training). The default Apple Clang compiler does not support OpenMP, so using the default compiler would disable multi-threading.
    pip3 install xgboost
License: Apache-2 license.
Source code: https://github.com/dmlc/xgboost
Mailing list: https://discuss.xgboost.ai
statsmodels
https://www.statsmodels.org/stable/
statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. There are 164 contributors on GitHub.
“Statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference. 2010. PDF
It can be installed using conda
    conda install statsmodels
Or PIP
    pip install --upgrade --no-deps statsmodels
Or you can build from source.
There is extensive documentation.
License: open source Modified BSD (3-clause)
Source code: https://github.com/statsmodels/statsmodels
Mailing list: https://groups.google.com/forum/#!forum/pystatsmodels
pyjanitor
pyjanitor is a project that extends Pandas with a verb-based API, providing convenient data cleaning routines for repetitive tasks.
It can be installed using conda
    conda install pyjanitor -c conda-forge
Or PIP
    pip install pyjanitor
There is extensive documentation, including a section on cleaning chemistry data.
Matplotlib
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits. There are nearly 800 contributors on GitHub. NOTE: The current master branch is now Python 3 only. Python 2 support is being dropped.
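A minimal sketch of producing a figure from a script is shown below. The Agg backend is selected so that it runs without a display, and the output filename example_plot.png is just a placeholder.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display

import matplotlib.pyplot as plt
import numpy as np

# Sample two curves on the same x range.
x = np.linspace(0.0, 2.0 * np.pi, 200)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# Write the figure to disk; the format is inferred from the file extension.
fig.savefig("example_plot.png", dpi=150)
```

In a Jupyter notebook you would normally skip the backend line and let the figure display inline.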
It can be installed using PIP
    pip install -U matplotlib
Matplotlib requires the following dependencies:
- Python (>= 3.5)
- FreeType (>= 2.3)
- libpng (>= 1.2)
- NumPy (>= 1.10.0)
- setuptools
- cycler (>= 0.10.0)
- dateutil (>= 2.1)
- kiwisolver (>= 1.0.0)
- pyparsing
License: Python Software Foundation (PSF) license.
Source code: https://github.com/matplotlib/matplotlib
Mailing list: matplotlib-users@python.org
Seaborn
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics, and it is closely integrated with pandas.
There is a really comprehensive set of tutorials.
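To show what the pandas integration looks like, here is a minimal sketch in which a DataFrame is passed straight to a plotting function and columns are referred to by name. The data and column names are made up, and seaborn_example.png is a placeholder filename.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display

import pandas as pd
import seaborn as sns

# Hypothetical dose/response data for two groups; values are invented.
df = pd.DataFrame(
    {
        "dose": [1, 2, 3, 4, 1, 2, 3, 4],
        "response": [0.1, 0.3, 0.6, 0.9, 0.2, 0.5, 0.9, 1.4],
        "group": ["control"] * 4 + ["treated"] * 4,
    }
)

# Columns are referenced by name; hue colours the points by group.
ax = sns.scatterplot(data=df, x="dose", y="response", hue="group")

# The returned object is an ordinary matplotlib Axes, so the usual methods apply.
ax.figure.savefig("seaborn_example.png")
```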
It can be installed using conda
    conda install seaborn
Or PIP
    pip install seaborn
Seaborn requires:
- numpy (>= 1.9.3)
- scipy (>= 0.14.0)
- matplotlib (>= 1.4.3)
- pandas (>= 0.15.2)
License: BSD 3-clause license
Source code: https://github.com/mwaskom/seaborn
Stackoverflow: https://stackoverflow.com/questions/tagged/seaborn
Jupyter
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. While Jupyter runs code in many programming languages, Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook itself.
It can be installed using conda
    conda install jupyter
Or PIP
    pip install jupyter
There is extensive documentation
License: modified BSD license
Stackoverflow: https://stackoverflow.com/questions/tagged/jupyter
I thought I’d also mention collections. It is in the standard library, but I seem to use defaultdict regularly.
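For anyone who has not come across it, a minimal sketch of the two patterns I reach for most often (grouping and counting) is shown below. The library/category pairs are made up for the example.

```python
from collections import defaultdict

# Made-up (library, category) pairs for illustration.
pairs = [
    ("pandas", "data"),
    ("numpy", "arrays"),
    ("modin", "data"),
    ("scipy", "arrays"),
]

# Grouping: a missing key starts out as an empty list, so no "if key in dict" check is needed.
by_category = defaultdict(list)
for name, category in pairs:
    by_category[category].append(name)

# Counting: a missing key starts out as 0.
counts = defaultdict(int)
for _, category in pairs:
    counts[category] += 1

print(dict(by_category))
print(dict(counts))
```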
Last Updated 9 January 2020