Organization and Packaging of Python Projects
A complex research project often relies on many different programs and software packages to accomplish the research goals. An important part of scientific computing is deciding how to organize and structure the code you use for research. A well-structured project can make you a more efficient and effective researcher. It is also a key component of scientific reproducibility.
Just putting all of your code into git repositories won’t magically turn a mess of scripts into a beautiful, well-organized project. More deliberate effort is required.
Types of Projects¶
Not all projects are created equal. Based on my experience, I distinguish three types of “research code” scenarios commonly encountered in the geosciences.
Exploratory analyses: When exploring a new idea, a single notebook or script is often all we need.
A Single Paper: The “paper” is a standard unit of scientific output. The code related to a single paper usually belongs together.
Reusable software elements: In the course of our research computing, we often identify specialized routines that we want to package for reuse in other projects, or by other scientists. This is where “scripts” become “software.”
This lecture outlines some suggested practices for each category.
Exploratory Analyses¶
When starting something new, we are often motivated to just start coding and get some results quickly. This is fine! Jupyter notebooks are an ideal format for open-ended exploratory analysis, since they are totally self-contained: they encapsulate text, code, and figures. If we find something cool or useful, it is important to preserve these exploratory notebooks.
A dedicated github repository can be overkill for a single file. Instead, I recommend github’s “gist” mechanism for saving and sharing such “one-off” notebooks and code snippets. Gists are like mini repos you can easily share and embed. (You can create one right now by going to https://gist.github.com/.)
You can upload any file (including an .ipynb notebook file) by dragging and dropping it into the gist website.
You have the choice of making your gist public or secret. (There is no private option, but a secret gist can only be seen by others if you give them the URL.)
GitHub’s rendering of Gists is a bit buggy. For a more consistent rendering experience, you can share your gist via http://nbviewer.ipython.org/.
A Single Paper¶
Reproducibility is a cornerstone of the scientific process. However, today one often reads that science is in the midst of a reproducibility crisis. This crisis may be due to increasing complexity and cost of scientific analysis, together with mounting pressure to publish as much and as quickly as possible.
Today almost all earth science relies on some form of computation, from simple statistical analysis and curve fitting to advanced numerical simulation. In principle, computational science should be highly reproducible. However, it also brings unique challenges. A great overview of the challenges and best practices is given in Barba, Lorena A. (2017): Barba-group Reproducibility Syllabus. figshare. https://doi.org/10.6084/m9.figshare.4879928.v1. Many of the suggestions in this lecture are adopted or paraphrased from Barba (2017).
Keep in mind that the audience for a reproducible project is not just other scientists…it’s you, a year from now, or whenever you need to repeat and/or build on earlier work. Most scientists build on their Ph.D. work for a decade following graduation. Extra time spent on reproducibility now will make you more productive in the long run.
We begin with an important observation.
An article about computational science … is not the scholarship itself, it’s merely scholarship advertisement. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
Donoho, D. et al. (2009), Reproducible research in computational harmonic analysis, Comp. Sci. Eng. 11(1):8–18, doi: 10.1109/MCSE.2009.15
Sandve et al. (2013) give some specific recommendations for computational reproducibility.
For every result, keep track of how it was produced
Avoid manual data-manipulation steps
Archive the exact versions of all external programs used
Version-control all custom scripts
Record all intermediate results, when possible in standard formats
For analyses that include randomness, note underlying random seeds
Always store raw data behind plots
Generate hierarchical analysis output, allowing layers of increasing detail to be inspected
Connect textual statements to underlying results
Provide public access to scripts, runs, and results
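Several of these recommendations take very little code to follow. The sketch below is only an illustration and is not taken from Sandve et al.: it fixes and records a random seed and stores the result together with the python version that produced it. The function names (run_analysis, run_with_provenance) and the toy “analysis” are hypothetical.

```python
import json
import platform
import random


def run_analysis(seed):
    """Toy 'analysis': the mean of 1000 draws from a seeded generator."""
    rng = random.Random(seed)  # a private generator, so the seed fully determines the draws
    samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    return sum(samples) / len(samples)


def run_with_provenance(seed):
    """Bundle the result with the information needed to reproduce it."""
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "result": run_analysis(seed),
    }


record = run_with_provenance(seed=42)
# JSON is a standard text format, so the record is easy to archive and inspect later
print(json.dumps(record, indent=2))
```

Writing such a record next to each intermediate result makes it possible to trace a number in the manuscript back to the run that produced it.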
These recommendations suggest a certain structure for a project.
A reproducible single-paper project directory structure might look something like this:
README.md
LICENSE
environment.yml
data/intermediate_results.csv
notebooks/process_raw_data.ipynb
notebooks/figure1.ipynb
notebooks/figure2.ipynb
notebooks/helper.py
manuscript/manuscript.tex
A great example of such a paper is Cesar Rocha’s Upper Ocean Seasonality project: https://github.com/crocha700/UpperOceanSeasonality.
Reusable Software Elements¶
Scientific software can perhaps be grouped into two categories: single-use “scripts” that are used in a very specific context to do a very specific thing (e.g. to generate a specific figure for a paper), and reusable components which encapsulate a more generic workflow. Once you find yourself repeating the same chunks of code in many different scripts or projects, it’s time to start composing reusable software elements.
The basic element of reusability in python is the module.
A module is a .py file which contains python objects which can be imported by other scripts or notebooks.
Let’s illustrate how modules work with a simple example.
A common task in geoscience is to calculate the great-circle distance between two points on the globe. There are several packages that could do this for you, but let’s write our own as an example of a module.
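The formula for great-circle distance is

$$ d = R \arccos\left( \sin\phi_1 \sin\phi_2 + \cos\phi_1 \cos\phi_2 \cos(\lambda_2 - \lambda_1) \right) $$

where $\phi_1, \phi_2$ are the latitudes and $\lambda_1, \lambda_2$ the longitudes of the two points (in radians), and $R$ is the radius of the Earth.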
(Note that this formula requires 64-bit precision for adequate accuracy.)
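The precision caveat is easiest to see numerically. The sketch below is an illustration and not part of the lecture’s module: it evaluates the arccos form of the great-circle distance in plain python and compares it with the haversine form, a different but mathematically equivalent formula that is brought in here only for comparison. The function names and the two test separations are assumptions chosen for the demonstration.

```python
import math

R = 6.371e6  # approximate radius of Earth in meters


def distance_arccos(lat1, lon1, lat2, lon2):
    """Great-circle distance from the arccos formula; inputs in degrees."""
    phi1, lam1, phi2, lam2 = (math.radians(v) for v in (lat1, lon1, lat2, lon2))
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(lam2 - lam1))
    # Rounding can push the value a hair outside [-1, 1], so clamp it
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return R * math.acos(cos_angle)


def distance_haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance from the haversine formula; inputs in degrees."""
    phi1, lam1, phi2, lam2 = (math.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * R * math.asin(min(1.0, math.sqrt(a)))


one_degree = (0.0, 0.0, 1.0, 0.0)       # about 111 km apart along a meridian
one_centimeter = (0.0, 0.0, 1e-7, 0.0)  # about 1 cm apart along a meridian

for label, points in [("1 degree", one_degree), ("1e-7 degree", one_centimeter)]:
    print(label, distance_arccos(*points), distance_haversine(*points))
```

For well-separated points the two forms agree closely. For very close points the argument of arccos is so near 1 that its digits are lost to rounding, even in 64-bit floats; with lower precision the same loss sets in at much larger separations.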
Let’s write a module to do this calculation. Open a file called gcdistance.py in a text editor. (The file should be in the same directory as the notebook you are working in now.) Populate it with the following code:
"""
A python module for computing great circle distance
"""

import numpy as np

# approximate radius of Earth
R = 6.371e6

def great_circle_distance(point1, point2):
    """Calculate great-circle distance between two points.

    PARAMETERS
    ----------
    point1 : tuple
        A (lat, lon) pair of coordinates in degrees
    point2 : tuple
        A (lat, lon) pair of coordinates in degrees

    RETURNS
    -------
    distance : float
    """

    # unpack coordinates
    lat1, lon1 = point1
    lat2, lon2 = point2

    # unpack and convert everything to radians
    phi1, lambda1, phi2, lambda2 = [np.deg2rad(v) for v in
                                    (point1 + point2)]

    # apply formula
    # https://en.wikipedia.org/wiki/Great-circle_distance
    return R*np.arccos(
        np.sin(phi1)*np.sin(phi2) +
        np.cos(phi1)*np.cos(phi2)*np.cos(lambda2 - lambda1))
The module begins with a docstring explaining what it does. Then it contains some data (just a constant R) and a single function.
Now let’s import our module
import gcdistance
help(gcdistance)
Help on module gcdistance:

NAME
    gcdistance - A python module for computing great circle distance

FUNCTIONS
    great_circle_distance(point1, point2)
        Calculate great-circle distance between two points.

        PARAMETERS
        ----------
        point1 : tuple
            A (lat, lon) pair of coordinates in degrees
        point2 : tuple
            A (lat, lon) pair of coordinates in degrees

        RETURNS
        -------
        distance : float

DATA
    R = 6371000.0

FILE
    /Users/rpa/Teaching/research_computing/content/lectures/python/gcdistance.py
And let’s try using it to make a calculation
gcdistance.great_circle_distance((60, 0), (50, 15))
We could just import the function we need
from gcdistance import R, great_circle_distance
R
If we change the module, we need to either restart our kernel or else reload the module. (Note that names imported via from module import func are not updated by a reload; the from ... import statement has to be run again afterwards.)
from importlib import reload
reload(gcdistance)
<module 'gcdistance' from '/Users/rpa/Teaching/research_computing/content/lectures/python/gcdistance.py'>
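The difference between reloading a module and re-importing a name can be checked directly. The sketch below is a self-contained illustration: it writes a throwaway module (hypothetically named reload_demo) into a temporary directory, edits it on disk, and reloads it. Disabling bytecode caching is a precaution for this quick demo only, not something a normal project needs.

```python
import importlib
import pathlib
import sys
import tempfile

# Work in a throwaway directory so no real project files are touched
tmpdir = tempfile.mkdtemp()
module_path = pathlib.Path(tmpdir) / "reload_demo.py"  # hypothetical module name
module_path.write_text("R = 6.371e6\n")

sys.dont_write_bytecode = True  # demo-only: rule out stale cached bytecode
sys.path.insert(0, tmpdir)      # make the temporary directory importable
importlib.invalidate_caches()

import reload_demo
from reload_demo import R       # binds the current value to a separate name

# Edit the module on disk, then reload it
module_path.write_text("R = 6371.0  # now in kilometers\n")
importlib.reload(reload_demo)

print(reload_demo.R)  # attribute access goes through the reloaded module
print(R)              # the separately bound name still holds the old value

from reload_demo import R as R_new  # re-running the import picks up the change
print(R_new)
```

This is why attribute-style access (gcdistance.great_circle_distance) is convenient while a module is still being edited.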
Modules are a simple way to share code between different scripts or notebooks in the same project. By default, python only finds a module file if it resides in the same directory as the script which imports it (or in another directory listed in sys.path). This is a big limitation; it means you can’t easily share modules between different projects.
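To see where python will look for a module, it can help to inspect the import search path. The sketch below uses only the standard library; the helper name locate is hypothetical, and the directories it prints will differ from machine to machine.

```python
import importlib.util
import sys

# Directories searched for imports, in order. The first entry is normally the
# directory of the running script (or an empty string in an interactive session).
for entry in sys.path[:3]:
    print(repr(entry))


def locate(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


print(locate("json"))        # a standard-library module is always found
print(locate("gcdistance"))  # None unless gcdistance is importable from here
```

A module that lives in some other project’s directory is not on this path, which is the limitation that packages address.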
Once you have a piece of code that is general-purpose enough to share between projects, you need to create a package.
Aside: Python Style¶
There are few absolute rules for python code style, but there is a detailed recommended style guide. Some especially relevant points are:
Line length should not exceed 79 characters
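Module names should be short and all lowercase, with underscores where they improve readability (e.g. gcdistance)
Function and variable names should be lowercase_with_underscores
Class names should be CapWords (also known as CamelCase)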
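As a concrete illustration of these naming conventions, here is a short sketch. All of the names in it are hypothetical and chosen only to show the patterns recommended by the PEP 8 style guide.

```python
"""Naming sketch; imagine this file saved as unit_conversion.py (lowercase module name)."""

METERS_PER_KILOMETER = 1000.0  # module-level constant: UPPER_CASE_WITH_UNDERSCORES


def meters_to_kilometers(distance_m):
    """Functions and variables use lowercase_with_underscores."""
    return distance_m / METERS_PER_KILOMETER


class DistanceRecord:
    """Classes use CapWords."""

    def __init__(self, start_name, end_name, distance_m):
        self.start_name = start_name
        self.end_name = end_name
        self.distance_m = distance_m

    def distance_km(self):
        """Methods follow the same convention as functions."""
        return meters_to_kilometers(self.distance_m)


record = DistanceRecord("station_a", "station_b", 2500.0)
print(record.distance_km())
```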
Packages are python’s way of encapsulating reusable code elements for sharing with others. Packaging is a huge and complicated topic. We will just scratch the surface.
We have already interacted with many packages. Browse some of their github repositories to explore the structure of a large python package:
An example of a smaller, more understandable package is our group’s xrft package:
These packages all have a common basic structure. Imagine we wanted to turn our great-circle distance module into a package. It would look like this.
README.md
LICENSE
environment.yml
requirements.txt
setup.py
gcdistance/__init__.py
gcdistance/gcdistance.py
gcdistance/tests/__init__.py
gcdistance/tests/test_gcdistance.py
The actual package is contained in the gcdistance subdirectory. The other files are auxiliary files which help others understand and install your package. Here is an overview of what they do:
README.md: Explains what the package is for
LICENSE: Defines the legal terms under which others can use the package. Open source is encouraged!
environment.yml: A conda environment which describes the package’s dependencies
requirements.txt: A file which describes the package’s dependencies for pip
setup.py: A special python script which installs your package
The actual package¶
gcdistance is the actual package. Any directory that contains an __init__.py file is recognized by python as a package. This file can be blank, but it needs to be present. From the root directory, we can import a module from the package as follows
from gcdistance import gcdistance
Yes, this is a bit redundant. That’s because the gcdistance.py module has the same name as the gcdistance package directory.
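One common way to soften this redundancy, offered here as an optional design choice rather than something the lecture prescribes, is to re-export the public names in __init__.py. The sketch below builds a throwaway copy of the package layout in a temporary directory to show the effect; the function body is a placeholder standing in for the real implementation.

```python
import importlib
import pathlib
import sys
import tempfile

# Build a minimal copy of the package layout in a throwaway directory
root = pathlib.Path(tempfile.mkdtemp())
pkg = root / "gcdistance"
pkg.mkdir()

# Placeholder module: stands in for the real gcdistance.py
(pkg / "gcdistance.py").write_text(
    "R = 6.371e6\n"
    "def great_circle_distance(point1, point2):\n"
    "    raise NotImplementedError('placeholder for the real implementation')\n"
)

# __init__.py re-exports the public names at the package level
(pkg / "__init__.py").write_text(
    "from .gcdistance import R, great_circle_distance\n"
)

sys.path.insert(0, str(root))  # make the temporary package importable
importlib.invalidate_caches()

# The shorter import now works...
from gcdistance import R, great_circle_distance
# ...and the longer one still refers to the very same function
from gcdistance.gcdistance import great_circle_distance as same_function

print(great_circle_distance is same_function)
```

With this arrangement users of the package do not need to know which internal module a function lives in.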
However, this import will only work from the parent directory. It is not globally accessible from your python environment.
setup.py is the magic file that makes your package installable and accessible anywhere. Here is an extremely basic setup.py:
from setuptools import setup

setup(
    name="gcdistance",
    version="0.1.0",
    author="Ryan Abernathey",
    packages=['gcdistance'],
    install_requires=['numpy'],
)
To run the setup script, we call the following from the command line
python setup.py install
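The package files are copied to our python library directory. (Recent versions of setuptools deprecate running setup.py directly; pip install . run from the project root is the recommended equivalent.) If we plan to keep developing the package, we can install it in “developer mode” as
python setup.py develop
In this case, the installed package is linked to our source directory rather than copied, so edits take effect without reinstalling. (The pip equivalent is pip install -e .)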
Tests don’t have to be complicated. They are simply a check to verify that your code does what it is supposed to do.
To add tests to our project, we create the file gcdistance/tests/test_gcdistance.py. (We also need an __init__.py file in the tests directory.) The example below shows a test function for our package.
import numpy as np
import pytest
from gcdistance.gcdistance import great_circle_distance

def test_great_circle_distance():
    # some known results

    # distance between two same points should be zero
    assert great_circle_distance((20., 30.), (20., 30.)) == 0

    # check distance between new york and london
    new_york = 40.7128, -74.0060
    london = 51.5074, 0.1278
    dist_nyc_london = great_circle_distance(new_york, london)
    # very strict, doesn't actually work
    # assert dist_nyc_london == 5.587e6
    # an approximate version of the above
    np.testing.assert_allclose(dist_nyc_london, 5.587e6, rtol=1e-5)

    # now check that we can't pass the wrong number of arguments
    with pytest.raises(TypeError):
        great_circle_distance(1, 2, 3, 4)
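The commented-out exact comparison in the test fails because floating-point results almost never match a rounded reference value digit for digit. The sketch below illustrates the idea behind a relative tolerance in plain python. The helper within_rtol is a hypothetical stand-in that mirrors the check np.testing.assert_allclose performs with its default absolute tolerance of zero, and the “computed” values are made up for the demonstration.

```python
import math


def within_rtol(actual, desired, rtol):
    """Relative check: |actual - desired| <= rtol * |desired|."""
    return abs(actual - desired) <= rtol * abs(desired)


# Exact equality is fragile even for very simple arithmetic
assert 0.1 + 0.2 != 0.3
assert math.isclose(0.1 + 0.2, 0.3)

reference = 5.587e6            # reference distance from the test, in meters
tolerance = 1e-5 * reference   # absolute slack implied by rtol=1e-5
print(tolerance)

# Made-up computed values: one inside the tolerance, one outside
assert within_rtol(reference + 30.0, reference, rtol=1e-5)
assert not within_rtol(reference + 1000.0, reference, rtol=1e-5)
```

With rtol=1e-5 the test accepts any result within roughly 56 m of the reference distance, which matches the number of digits the reference value actually carries.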
We will use pytest to run our tests. If you don’t have pytest installed in your active python environment, take a minute to run pip install pytest from the command line. Now run
pytest
from the root directory of your project. You should see a notification that the tests passed. Try playing around with the tests to cause something to fail.
Continuous Integration with Travis CI¶
You can configure automatic testing of your package by integrating github with Travis-CI. Travis-CI is a free “continuous integration” service: it automatically downloads your package and runs your tests in the cloud every time you commit to your repository. The travis getting started guide gives a great overview of how to use the service.
For us to use travis with our project, the steps are simple:
Push the repo to github (repo must be public)
Log in to https://travis-ci.org and click the switch to enable your repo
Add a .travis.yml file to your project with the following contents:
language: python
python:
  - 3.6
script:
  - pytest
Add the file, commit, and push to github
Go to https://travis-ci.org and watch the magic happen!