IPython Notebooks are one of the portals through which data scientists and enterprise architects will soon be able to use the xPatterns analytics framework. Although Python is steadily gaining popularity among data scientists and in enterprise applications, we recognize that some of our target users may not yet be familiar or proficient with the language.
To help bridge this gap, we’ve developed an IPython Notebook-based tutorial (primer) on Python for Data Science, designed to accomplish two goals:
- To help enterprise architects (and other technologists and business people) understand some of the basic concepts of data science.
- To help programmers with experience in other programming languages learn enough Python to use open source and proprietary Python-based data science tools, such as the xPatterns analytics framework.
The first part of the primer introduces some basic data science concepts:
- The Cross-Industry Standard Process for Data Mining (CRISP-DM)
- A Data Science Workflow
- Basic data science terminology
- An example of supervised classification, the UCI Mushroom Data Set
The Python concepts and constructs covered throughout the rest of the notebook are designed to lead up to the creation and use of a specific type of supervised classifier, a simplified decision tree, on the mushroom dataset. There are 10 exercises for those who want to actively practice and extend their learning, and solutions to the exercises are provided in separate files.
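To give a flavor of where the primer is headed, the following is a minimal sketch of a simplified decision tree built with only the Python standard library. It is not the code from the primer or from SimpleDecisionTree.py. The toy records, the function names (`entropy`, `split`, `build_tree`, `classify`) and the entropy-based choice of attribute are all assumptions made for illustration, and the data are invented rather than taken from the UCI Mushroom Data Set.

```python
from collections import Counter, defaultdict
from math import log2

# Hypothetical toy records: (label, odor, cap_color). These values are made up
# for illustration and are NOT rows from the UCI Mushroom Data Set.
TOY_DATA = [
    ("edible", "almond", "brown"),
    ("edible", "none", "white"),
    ("edible", "none", "brown"),
    ("poisonous", "foul", "white"),
    ("poisonous", "foul", "brown"),
    ("poisonous", "pungent", "white"),
]
LABEL = 0  # index of the class label in each record


def entropy(records):
    """Shannon entropy (in bits) of the class labels in records."""
    counts = Counter(r[LABEL] for r in records)
    total = len(records)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def split(records, attr):
    """Group records by their value for the attribute at index attr."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r)
    return groups


def build_tree(records, attrs):
    """Return a label (leaf) or an (attr, {value: subtree}) pair (node)."""
    labels = Counter(r[LABEL] for r in records)
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0][0]  # majority label

    def remainder(attr):
        # Weighted entropy left over after splitting on attr.
        return sum(len(g) / len(records) * entropy(g)
                   for g in split(records, attr).values())

    best = min(attrs, key=remainder)
    rest = [a for a in attrs if a != best]
    return (best, {value: build_tree(group, rest)
                   for value, group in split(records, best).items()})


def classify(tree, record, default="unknown"):
    """Follow branches until a leaf label (or an unseen value) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches.get(record[attr], default)
    return tree


tree = build_tree(TOY_DATA, attrs=[1, 2])
print(classify(tree, (None, "foul", "brown")))  # prints: poisonous
```

In this toy example, splitting on odor separates the two classes perfectly, so the tree has a single decision node. Attribute values not seen during training fall back to the `default` label, which is one simple way (among several) to handle unseen values.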
The notebook – or, more precisely, collection of notebooks – can be found in a public GitHub repository.
[Edit: The easiest way to view the notebook in a browser (if you are not running a local IPython Notebook server) is via the version rendered on the IPython nbviewer site. Links to the ipynb and html files below are not directly viewable in a browser.]
The following is an excerpt from the repository’s README.md file:
The primer is spread across a collection of IPython Notebooks, and the easiest way to use the primer is to install IPython Notebook on your computer. Alternatively, you can install Python and manually copy and paste the pieces of sample code into the Python interpreter, as the primer only makes use of the Python standard library.
There are three versions of the primer. Two versions contain the entire primer in a single notebook:
- Single IPython Notebook: Python_for_Data_Science_all.ipynb
- Single web page (HTML): Python_for_Data_Science_all.html
The other version divides the primer into 5 separate notebooks:
- Data Science: Basic Concepts
- Python: Basic Concepts
- Using Python to Build and Use a Simple Decision Tree Classifier
- Next Steps
There are several exercises included in the notebooks. Sample solutions to those exercises can be found in two Python source files:
- simple_ml.py: a collection of simple machine learning utility functions
- SimpleDecisionTree.py: a Python class to encapsulate a simplified version of a popular machine learning model
There are also 2 data files, based on the mushroom dataset in the UCI Machine Learning Repository, used for coding examples, exploratory data analysis and building and evaluating decision trees in Python.
We hope the notebook will prove useful to data scientists and enterprise architects who are interested in using Python-based data science tools – ideally including the xPatterns analytics framework – to develop data-driven solutions to their business problems. We plan to develop and post other IPython Notebooks in the near future that are more focused on how to exercise different capabilities of the xPatterns analytics framework.
Meanwhile, we welcome any feedback on this notebook, or suggestions of other topics we might include in future notebooks.