This short primer is an introduction to the scientific Python stack for Data Science. It is designed as a tour of the major Python packages used for the main computational tasks encountered in the sexiest job of the 21st century. At the end of this tour, you'll have a broad overview of the available libraries as well as why and how they are used for each task. This notebook aims to answer the following question: which tool should I use for which task, and how? Before starting, two remarks:
This notebook will walk you through a typical Data Science process:
Our motivating example: predict whether a credit card client will default.
Before taking our tour, let's briefly talk about Python. First things first, the general characteristics of the language:
Technical details:
From those characteristics emerge the following advantages:
And the following disadvantages:
Let's state why Python is a language of choice for Data Scientists. Viable alternatives include MATLAB, R and Julia, and, for more statistical jobs, the SAS and SPSS statistical packages. The strengths of Python are:
Jupyter notebook is an HTML-based notebook which allows you to create and share documents that contain live code, equations, visualizations and explanatory text. It allows a clean presentation of computational results as HTML or PDF reports and is well suited for interactive tasks such as data cleaning, transformation and exploration, numerical simulation, statistical modeling, machine learning and more. It runs everywhere (Windows, Mac, Linux, Cloud) and supports multiple languages through various kernels, e.g. Python, R, Julia, MATLAB.
While Jupyter is itself becoming an Integrated Development Environment (IDE), alternative scientific IDEs include Spyder and Rodeo. Non-scientific IDEs include IDLE and PyCharm. Vim and Emacs lovers (or more recently Atom and Sublime Text users) will find full support for Python in their editor of choice. An interactive prompt, useful for experiments or as a calculator, is offered by Python itself or by IPython, the Jupyter kernel for Python.
During this tour, we'll need the packages shown below.
%%script sh
cat ./requirements.txt
# Windows
# !type ..\requirements.txt
The statements starting with `%` or `%%` are built-in magic commands, i.e. commands interpreted by the IPython kernel. E.g. `%%script sh` tells IPython to run the cell with the shell `sh` (like the `#!` line at the beginning of a script).
The Python prompt is what you get when typing `python` in your terminal. It is useful to test commands and check your installation. We however prefer IPython for interactive work, see below.
Python files, with the extension `.py`, are either scripts or modules. A Python script is a file which gets executed with either `python myscript.py` or `./myscript.py` if it has execute permission as well as a shebang (`#!`) indicating which interpreter should be used.
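As a minimal sketch of the script versus module distinction, the file below can be both executed and imported. The file name `myscript.py` comes from the text above; the function `greet` is a made-up name used only for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical myscript.py: usable both as a script and as a module."""


def greet(name):
    """Return a greeting; other files can reuse it after `import myscript`."""
    return "Hello, {}!".format(name)


if __name__ == "__main__":
    # This branch runs only when the file is executed as a script,
    # e.g. with `python myscript.py`, not when it is imported as a module.
    print(greet("world"))
```

The first line is the shebang: it only matters when the file is launched as `./myscript.py`, and Python itself treats it as a comment.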
The IPython prompt is what you get when running `ipython` in your terminal. It is more convenient than the Python prompt and is useful for interactive work like small experiments or as a powerful calculator.
The Jupyter notebook is the web interface you get when running `jupyter notebook`. It features a file explorer, various kernels (for Python, R, Julia) and can export any notebook to HTML / PDF (via `jupyter nbconvert`). The basic document is a notebook, which is composed of cells that are either code, results or markdown text / math. The Jupyter notebook is the interface we'll use for most of the course.
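To make the cell structure concrete, here is a rough sketch of what a notebook file looks like on disk. An `.ipynb` file is a JSON document; the tiny notebook below is hand-written for illustration and only includes the fields needed to show the idea. In this layout, results are stored as the `outputs` of a code cell rather than as cells of their own.

```python
import json
from collections import Counter

# A hand-written, minimal notebook following the nbformat 4 JSON layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A title\n", "Some math: $y = 2x$"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],  # results of running the cell are stored here
            "source": ["print(2 * 21)"],
        },
    ],
}

# Round-trip through JSON text, as if reading an .ipynb file from disk.
loaded = json.loads(json.dumps(notebook))

# Count the cells of each type.
cell_counts = Counter(cell["cell_type"] for cell in loaded["cells"])
print(dict(cell_counts))
```

Because the format is plain JSON, notebooks can be inspected or post-processed with ordinary tools, which is what `jupyter nbconvert` builds on.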
Markdown is a lightweight markup language which is widely used to generate HTML documents (e.g. on GitHub or with static website generators). See this cheatsheet for a very short introduction, or simply edit the cells in this notebook. Markdown can include LaTeX math such as $y = 2x$.
Most of the packages, i.e. reusable pieces of code, are posted on PyPI, the Python Package Index, by their authors. The Python package manager, `pip`, is a command-line tool to search and download packages from PyPI.
Note that some packages, like NumPy, require native, i.e. compiled, dependencies. That is why installing with `pip install` may fail, as it only manages Python packages. In that case you need to install those dependencies by hand or with the help of a package manager like `brew` for Mac or whatever your Linux distribution uses.
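One quick way to check that an installation actually succeeded, compiled parts included, is to try importing the package. The helper below is a small sketch; `can_import` is a hypothetical name, not part of pip or of any library.

```python
import importlib


def can_import(module_name):
    """Return True if the module imports cleanly, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Raised when the package is missing or a compiled part fails to load.
        return False
    return True


for name in ["json", "numpy"]:
    status = "OK" if can_import(name) else "not importable"
    print(name, status)
```

Note that the import name can differ from the name used with `pip install`, so a `False` result does not always mean that the installation failed.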
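Searching for a package goes like this (can be typed in your terminal without the leading `!`). Note that PyPI has since disabled the search service that `pip search` relies on, so with current versions this command reports an error; search on the PyPI website instead.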
!pip search music21
!pip install numpy
You can get the list of installed packages with `pip freeze`. These are all the packages that are installed and available on your system. They could have been installed by `pip install packname` (maybe as a dependency), by `conda install packname` or by your system's package manager.
!pip freeze
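A similar list can be obtained from within Python. The sketch below uses the standard library module `importlib.metadata` (Python 3.8 or later); the function name `installed_packages` is made up for illustration, and its output only approximates what `pip freeze` prints.

```python
from importlib import metadata


def installed_packages():
    """Return sorted 'name==version' strings, similar in spirit to `pip freeze`."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata lacks a name
            lines.add("{}=={}".format(name, dist.version))
    return sorted(lines, key=str.lower)


# Show the first few entries only.
for line in installed_packages()[:5]:
    print(line)
```

This only sees packages visible to the interpreter running the code, which helps when checking which environment a notebook kernel is actually using.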
Copyright (c) 2016 Michaël Defferrard
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: