Introduction

This short primer is an introduction to the scientific Python stack for Data Science. It is designed as a tour of the major Python packages used for the main computational tasks encountered in the sexiest job of the 21st century. At the end of this tour, you'll have a broad overview of the available libraries as well as why and how they are used for each task. This notebook aims to answer the following question: which tool should I use for which task, and how? Before starting, two remarks:

  1. There exist better and faster ways to accomplish the presented computations. The goal is to present the packages and get a sense of which problems they solve.
  2. It is not meant to teach you (scientific) Python. I have however tried to include the main constructions and idioms of the language and packages. A good resource to learn scientific Python is a set of lectures from J.R. Johansson.

1 Data Science

This notebook will walk you through a typical Data Science process:

  1. Data acquisition
    1. Importation
    2. Cleaning
    3. Exploration
  2. Data exploitation
    1. Pre-processing
    2. (Feature extraction)
    3. Modeling
    4. (Algorithm design)
    5. Evaluation

Our motivating example: predict whether a credit card client will default.

  • It is a binary classification task: client will default or not ($y=1$ if yes; $y=0$ if no).
  • We have data for 30,000 real clients from Taiwan.
  • There are 23 numerical and categorical explanatory variables:
    1. $x_1$: amount of the given credit.
    2. $x_2$: gender (1 = male; 2 = female).
    3. $x_3$: education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
    4. $x_4$: marital status (1 = married; 2 = single; 3 = others).
    5. $x_5$: age (year).
    6. $x_6$ to $x_{11}$: history of past payment (monthly, from September back to April 2005) (-1 = pay duly; 1 = payment delay for one month; ...; 9 = payment delay for nine months and above).
    7. $x_{12}$ to $x_{17}$: amount of bill statement (monthly, from September back to April 2005).
    8. $x_{18}$ to $x_{23}$: amount of previous payment (monthly, from September back to April 2005).
  • The data comes from the UCI ML repository.
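
The coded categorical variables above can be made readable with plain lookup tables. The sketch below is only an illustration: the dictionary names, the `decode_client` helper and the sample record are hypothetical, and the values of the record are invented rather than taken from the dataset.

```python
# Lookup tables for the coded categorical variables x2, x3 and x4.
GENDER = {1: "male", 2: "female"}
EDUCATION = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
MARITAL_STATUS = {1: "married", 2: "single", 3: "others"}


def decode_client(x2, x3, x4):
    """Translate the integer codes of x2, x3 and x4 into readable labels."""
    return {
        "gender": GENDER[x2],
        "education": EDUCATION[x3],
        "marital status": MARITAL_STATUS[x4],
    }


# Hypothetical client, values invented for illustration only.
client = decode_client(x2=2, x3=1, x4=2)
print(client)
```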

2 Python

Before taking our tour, let's briefly talk about Python. First things first, the general characteristics of the language:

  • General purpose: not built for a particular usage, it works as well for scientific computing as for web and application development. It features high-level data structures and supports multiple paradigms: procedural, object-oriented and functional.
  • Elegant syntax: easy-to-read and intuitive code, easy-to-learn minimalistic syntax, quick to write (low boilerplate / verbosity), maintainability scales well with size of projects.
  • Expressive language: fewer lines of code, fewer bugs, easier to maintain.
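
To make the "multiple paradigms" point concrete, here is a small sketch that computes the same sum of squares in a procedural, an object-oriented and a functional style. The `SquareSummer` class and the variable names are made up for this illustration.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Procedural: explicit loop and accumulator.
total = 0
for n in numbers:
    total += n * n


# Object-oriented: data and behavior bundled in a class.
class SquareSummer:
    def __init__(self, values):
        self.values = list(values)

    def sum_of_squares(self):
        return sum(v * v for v in self.values)


oo_total = SquareSummer(numbers).sum_of_squares()

# Functional: compose map and reduce, without mutating any state.
fn_total = reduce(lambda acc, sq: acc + sq, map(lambda n: n * n, numbers), 0)

print(total, oo_total, fn_total)  # 30 30 30
```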

Technical details:

  • Dynamically typed: no need to define the type of variables, function arguments or return types. Everything is an object and can be modified at runtime.
  • Automatic memory management (garbage collector): no need to explicitly allocate and deallocate memory for variables and data arrays, which removes most memory-leak bugs.
  • Interpreted (JIT is coming): no need to compile the code. The Python interpreter reads and executes the Python code directly. It also means that a single Python source runs anywhere a runtime is available, e.g. on Windows, Mac, Linux and in the Cloud.
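
A short sketch of what dynamic typing and runtime modification look like in practice. The `Client` class and its attributes are hypothetical and exist only for this example.

```python
# A name is not tied to a type: it can be rebound to any object.
value = 42
print(type(value).__name__)  # int
value = "forty-two"
print(type(value).__name__)  # str


class Client:
    """Hypothetical class used only for this illustration."""

    def __init__(self, age):
        self.age = age


client = Client(age=30)

# Objects can be modified at runtime: add an attribute that the class
# never declared...
client.limit = 20000

# ...or attach a new method to the class after its definition.
Client.is_adult = lambda self: self.age >= 18

print(client.limit, client.is_adult())
```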

From those characteristics emerge the following advantages:

  • The main advantage is ease of programming, minimizing the time required to develop, debug and maintain the code.
  • The well-designed language encourages many good programming practices:
    • Modular and object-oriented programming, good system for packaging and re-use of code. This often results in more transparent, more maintainable and less bug-prone code.
    • Documentation tightly integrated with the code.
  • A large community geared toward open-source, an extensive standard library and a large collection of add-on packages and development tools.

And the following disadvantages:

  • There are two versions of Python in general use: 2 and 3. While Python 3 has been around since 2008, there are still libraries which only support Python 2. While you should generally go for Python 3, a specific library or legacy code can keep you on Python 2.
  • Due to its interpreted and dynamic nature, the execution of Python code can be slow compared to compiled statically typed programming languages, such as C and Fortran. That is however largely mitigated; see the available solutions at the end of this notebook.
  • There is no compiler to catch your errors. Solutions include unit / integration tests or the use of a linter such as pyflakes, Pylint or PyChecker. Flake8 combines static analysis with style checking.
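
The last point can be illustrated as follows: a mistake in the argument types is only detected when the offending line runs, and a unit test is one way to catch it early. This is a minimal sketch with the standard `unittest` module; the `mean` function and the test names are made up for the example.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# No compiler rejects this call in advance: the mistake (strings instead
# of numbers) only surfaces as an exception when the line is executed.
try:
    mean(["1", "2"])
except TypeError as error:
    print("caught at runtime:", error)


class TestMean(unittest.TestCase):
    def test_mean_of_integers(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_strings_are_rejected(self):
        with self.assertRaises(TypeError):
            mean(["1", "2"])


# Run the tests explicitly so that this also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())
```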

3 Why Python for Data Science

Let's state why Python is a language of choice for Data Scientists. Viable alternatives include MATLAB, R and Julia, and, for more statistical jobs, the SAS and SPSS statistical packages. The strengths of Python are:

  • Minimal development time.
    • Rapid prototyping for data exploration.
    • Same language and framework for R&D and production.
  • A strong position in scientific computing.
    • Large community of users, easy to find help and documentation.
    • Extensive ecosystem of open-source scientific libraries and environments.
  • Easy integration.
    • Many libraries to access data from files, databases or web scraping.
    • Many wrappers to legacy code, e.g. C, Fortran or Matlab.
  • Available and suitable for High Performance Computing (HPC).
    • Close integration with time-tested and highly optimized libraries for fast numerical mathematics like BLAS, LAPACK, ATLAS, OpenBLAS, ARPACK, MKL, etc.
    • JIT and AOT compilers.
    • Good support for parallel processing with processes and threads, interprocess communication (MPI) and GPU computing (OpenCL and CUDA).
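
As a small taste of the parallel-processing support mentioned above, here is a sketch using a thread pool from the standard library. The `slow_square` function is a hypothetical stand-in for a real task. For CPU-bound work, `ProcessPoolExecutor` from the same module offers the same interface with processes instead of threads.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(n):
    """Stand-in for an I/O-bound task such as a download (hypothetical)."""
    return n * n


# Run the task on a small pool of worker threads; map() returns the
# results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(slow_square, range(6)))

print(squares)  # [0, 1, 4, 9, 16, 25]
```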

4 Why Jupyter

The Jupyter notebook is an HTML-based notebook which allows you to create and share documents that contain live code, equations, visualizations and explanatory text. It allows a clean presentation of computational results as HTML or PDF reports and is well suited for interactive tasks such as data cleaning, transformation and exploration, numerical simulation, statistical modeling, machine learning and more. It runs everywhere (Windows, Mac, Linux, Cloud) and supports multiple languages through various kernels, e.g. Python, R, Julia, Matlab.

While Jupyter is itself becoming an Integrated Development Environment (IDE), alternative scientific IDEs include Spyder and Rodeo. Non-scientific IDEs include IDLE and PyCharm. Vim and Emacs lovers (or more recently Atom and Sublime Text) will find full support for Python in their editor of choice. An interactive prompt, useful for experimentation or as a calculator, is offered by Python itself or by IPython, the Jupyter kernel for Python.

5 Installation

During this tour, we'll need the packages shown below.

In [10]:
%%script sh
cat ./requirements.txt
numpy
scipy
matplotlib
scikit-learn
networkx

pandas
xlrd
xlwt
tables

keras
theano
# tensorflow: see https://www.tensorflow.org/versions/master/get_started/os_setup.html
# for the correct target. Something like below.
https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.11.0-cp35-cp35m-linux_x86_64.whl

jupyter
ipython

grip
In [11]:
# Windows
# !type ..\requirements.txt
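
To check which of the packages listed above can actually be imported in your environment, something like the following sketch can help. The `missing_packages` helper is hypothetical; `importlib.util.find_spec` is part of the standard library. Note that the import name can differ from the name on PyPI (scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names for some of the packages listed above. The import name
# may differ from the PyPI name (scikit-learn -> sklearn).
IMPORT_NAMES = ["numpy", "scipy", "matplotlib", "sklearn", "networkx", "pandas"]


def missing_packages(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing:", missing_packages(IMPORT_NAMES))
```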

The statements starting with % or %% are built-in magic commands, i.e. commands interpreted by the IPython kernel. E.g. %%script sh tells IPython to run the cell with the shell sh (like the #! line at the beginning of a script).
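
Roughly speaking, %%script hands the text of the cell to another program. The sketch below mimics that idea with the standard subprocess module. It is not IPython's actual implementation, and it uses the Python interpreter itself instead of sh so that it runs on any platform; the `cell_body` variable is an assumption made for the example.

```python
import subprocess
import sys

# Text that would be the body of the cell.
cell_body = "print(6 * 7)\n"

# Start an interpreter and feed it the cell body on standard input.
# The "-" argument tells Python to read the program from stdin.
result = subprocess.run(
    [sys.executable, "-"],
    input=cell_body,
    capture_output=True,
    text=True,
    check=True,
)

print(result.stdout.strip())  # 42
```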

6 Environment

6.1 Python

The Python prompt is what you get when typing python in your terminal. It is useful to test commands and check your installation. We however prefer IPython for interactive work, see below.

Python files, with the extension .py, are either scripts or modules. A Python script is a file which gets executed with either python myscript.py or ./myscript.py if it has execution permissions as well as a shebang (#!) indicating which interpreter should be used.
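
A minimal sketch of a file that works both as a script and as a module. The file name myscript.py comes from the text above; the `greet` and `main` functions are made up for the example.

```python
#!/usr/bin/env python3
"""myscript.py: usable both as a script and as an importable module."""

import sys


def greet(name):
    """Return a greeting; importable from other modules."""
    return "Hello, {}!".format(name)


def main(argv):
    # Use the first command-line argument if given, a default otherwise.
    name = argv[1] if len(argv) > 1 else "world"
    print(greet(name))


if __name__ == "__main__":
    # Only runs when the file is executed, not when it is imported.
    main(sys.argv)
```

After chmod +x myscript.py, the shebang line lets the system find the interpreter when the file is run as ./myscript.py. Importing the file from another module only defines greet and main without running anything.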

6.2 IPython

The IPython prompt is what you get when running ipython in your terminal. It is more convenient than the Python prompt and is useful for interactive work like small experiments or as a powerful calculator.

6.3 Jupyter

The Jupyter notebook is the web interface you get when running jupyter notebook. It features a file explorer, various kernels (for Python, R, Julia) and can export any notebook to HTML / PDF (via jupyter nbconvert). The basic document is a notebook which is composed of cells which are either code, results or markdown text / math. The Jupyter notebook is the interface we'll use for most of the course.

Markdown is a lightweight markup language which is widely used to generate HTML documents (e.g. on GitHub or with static website generators). See this cheatsheet for a very short introduction, or simply edit the cells in this notebook. Markdown can include LaTeX math such as $y = 2x$.

6.4 Installing and managing packages

Most of the packages, i.e. reusable pieces of code, are posted on PyPI, the Python Package Index, by their authors. The Python package manager, pip, is a command-line tool to search and download packages from PyPI.

Note that some packages, like NumPy, require native, i.e. compiled, dependencies. That is why pip install may fail, as pip only manages Python packages. In that case you need to install those dependencies by hand or with the help of a package manager like brew for Mac or whatever your Linux distribution uses.

Searching for a package goes like this (can be typed in your terminal):

In [12]:
!pip search music21
contourviz (0.2.4)      - A package that charts musical contours into a web-
                          based interactive using music21 and D3.js.
music21 (5.0.3a1)       - A Toolkit for Computer-Aided Musical Analysis.
OutputLilyPond (1.0.0)  - Produce a LilyPond file from a music21 Score.

Installing a package goes like this. Note that it warns you if the package is installed already.

In [13]:
!pip install numpy
Requirement already satisfied: numpy in /Library/Python/2.7/site-packages

You can get the list of installed packages with pip freeze. These are all the packages that are installed and available on your system. They could have been installed by pip install packname (maybe as a dependency), by conda install packname or by your system's package manager.

In [14]:
!pip freeze
altgraph==0.10.2
appnope==0.1.0
backports-abc==0.5
backports.shutil-get-terminal-size==1.0.0
backports.weakref==1.0rc1
bdist-mpkg==0.5.0
bleach==1.5.0
bonjour-py==0.3
certifi==2017.7.27.1
decorator==4.1.2
enum34==1.1.6
funcsigs==1.0.2
html5lib==0.9999999
ipython==5.4.1
ipython-genutils==0.2.0
macholib==1.5.1
Markdown==2.6.8
matplotlib==1.3.1
mock==2.0.0
modulegraph==0.10.4
networkx==1.11
nose==1.3.7
numpy==1.13.1
pandas==0.20.3
pathlib2==2.3.0
pbr==3.1.1
pexpect==4.2.1
pickleshare==0.7.4
Pillow==3.3.1
prompt-toolkit==1.0.15
protobuf==3.3.0
ptyprocess==0.5.2
py2app==0.7.3
Pygments==2.2.0
pyobjc-core==2.5.1
pyobjc-framework-Accounts==2.5.1
pyobjc-framework-AddressBook==2.5.1
pyobjc-framework-AppleScriptKit==2.5.1
pyobjc-framework-AppleScriptObjC==2.5.1
pyobjc-framework-Automator==2.5.1
pyobjc-framework-CFNetwork==2.5.1
pyobjc-framework-Cocoa==2.5.1
pyobjc-framework-Collaboration==2.5.1
pyobjc-framework-CoreData==2.5.1
pyobjc-framework-CoreLocation==2.5.1
pyobjc-framework-CoreText==2.5.1
pyobjc-framework-DictionaryServices==2.5.1
pyobjc-framework-EventKit==2.5.1
pyobjc-framework-ExceptionHandling==2.5.1
pyobjc-framework-FSEvents==2.5.1
pyobjc-framework-InputMethodKit==2.5.1
pyobjc-framework-InstallerPlugins==2.5.1
pyobjc-framework-InstantMessage==2.5.1
pyobjc-framework-LatentSemanticMapping==2.5.1
pyobjc-framework-LaunchServices==2.5.1
pyobjc-framework-Message==2.5.1
pyobjc-framework-OpenDirectory==2.5.1
pyobjc-framework-PreferencePanes==2.5.1
pyobjc-framework-PubSub==2.5.1
pyobjc-framework-QTKit==2.5.1
pyobjc-framework-Quartz==2.5.1
pyobjc-framework-ScreenSaver==2.5.1
pyobjc-framework-ScriptingBridge==2.5.1
pyobjc-framework-SearchKit==2.5.1
pyobjc-framework-ServiceManagement==2.5.1
pyobjc-framework-Social==2.5.1
pyobjc-framework-SyncServices==2.5.1
pyobjc-framework-SystemConfiguration==2.5.1
pyobjc-framework-WebKit==2.5.1
pyOpenSSL==0.13.1
pyparsing==2.0.1
python-dateutil==1.5
pytz==2013.7
pyzmq==16.0.2
scandir==1.5
scipy==0.13.0b1
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.10.0
tensorflow==1.2.1
Theano==0.8.2
tornado==4.5.1
traitlets==4.3.2
wcwidth==0.1.7
Werkzeug==0.12.2
xattr==0.6.4
zope.interface==4.1.1
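
A similar list can be obtained from within Python. The sketch below approximates the output of pip freeze with importlib.metadata, which is in the standard library since Python 3.8. The `freeze` helper is hypothetical and does not reproduce everything pip does (e.g. editable installs).

```python
from importlib import metadata  # standard library since Python 3.8


def freeze():
    """Return sorted 'name==version' strings for installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installations without a name
            lines.add("{}=={}".format(name, dist.version))
    return sorted(lines, key=str.lower)


for line in freeze()[:5]:  # show only the first few entries
    print(line)
```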

Copyright (c) 2016 Michaël Defferrard

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: