Reproducibility

If you’ve used GerryChain to do some analysis or research, you may want to ensure that your analysis is completely repeatable by anyone else on their own computer. This guide will walk you through the steps required to make that possible.

Share your code on GitHub

Before anyone can run your code, they’ll need to find it. We strongly recommend publishing your source code as a GitHub repository, and not as a .zip file on your personal website. GitHub has a desktop client that makes this easy.

Use the same versions of all of your dependencies

Anyone who repeats your analysis should run your code with exactly the same versions of all the software and packages that you used, including the same version of Python.

The easiest way to do this is to use conda to manage all of your dependencies. You can use conda to export an environment.yml file that anyone can use to replicate your environment by running the command conda env create -f environment.yml. For instructions on how to do this, see Sharing your environment and Creating an environment from an environment.yml file in the conda documentation.

If you’ve published your code on GitHub, it is a good idea to include your environment.yml file in the root folder of your code repository.
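As a supplement to environment.yml (not a replacement for it), you may also want to record the versions that were actually in use alongside your results. The following is a minimal sketch using only the standard library. The helper name describe_environment and the package list are illustrative assumptions, not part of GerryChain or conda.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a dict mapping 'python' and each package name to its version.

    Hypothetical helper for illustration. Packages that are not installed
    are reported as None instead of raising an error.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# Illustrative package list; replace it with the packages your analysis uses.
for name, version in describe_environment(["gerrychain", "networkx"]).items():
    print(f"{name}: {version}")
```

Saving this output next to your results makes it easier to spot a version mismatch later when someone else's run disagrees with yours.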

Import random from gerrychain.random

The submodule gerrychain.random is the single place where GerryChain imports the built-in Python module random and sets a random seed. Centralizing the import in this way ensures that the seed is set before any random numbers are drawn. If you use the random module anywhere in your own code (say, in your own proposal function), replace the line import random with from gerrychain.random import random. This will ensure that your code uses the same random seed as GerryChain.

GerryChain sets a random seed of 2018 after it imports random. If you wish to use a different random seed, set it immediately after importing random from gerrychain.random, and before you import anything else. That will look like this:

from gerrychain.random import random
random.seed(12345678)

from gerrychain import MarkovChain, Partition
# and so on...
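To see why the seed matters, the sketch below shows that re-seeding the generator with the same value reproduces the same sequence of draws. It uses the standard-library random module directly as a stand-in so that it runs without GerryChain installed; in a GerryChain project you would import random from gerrychain.random as described above. The function name draw_sequence is a hypothetical helper.

```python
# Stand-in import so the sketch runs on its own. In a GerryChain project,
# use `from gerrychain.random import random` instead.
import random


def draw_sequence(seed, count=5):
    """Seed the shared generator, then draw `count` floats from it.

    Hypothetical helper for illustration only.
    """
    random.seed(seed)
    return [random.random() for _ in range(count)]


first = draw_sequence(12345678)
second = draw_sequence(12345678)

# The same seed gives the same draws, which is what makes a run repeatable.
print(first == second)
```

If any part of your code draws from a differently seeded or unseeded generator, this guarantee no longer holds, which is the reason for routing every import through one module.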

Set PYTHONHASHSEED=0

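In addition to the randomness provided by the random module, Python uses a random seed for its hashing algorithm, which affects how objects are stored in sets and dictionaries. By default this seed changes each time the interpreter starts, so the order in which a set is iterated can differ from one run to the next. The hash seed must be the same on every run for GerryChain runs to be repeatable.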

The way to accomplish this is to set the environment variable PYTHONHASHSEED to 0.
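The effect of this variable can be checked with a small experiment. The sketch below starts two fresh Python processes with PYTHONHASHSEED set to 0 and compares the hash they compute for the same string. The helper name hash_in_fresh_process and the sample string are assumptions made for illustration.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(text, hash_seed):
    """Return hash(text) as computed by a newly started Python process.

    Hypothetical helper for illustration. The child process inherits the
    current environment, with PYTHONHASHSEED overridden to `hash_seed`.
    """
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


fixed_a = hash_in_fresh_process("precinct", 0)
fixed_b = hash_in_fresh_process("precinct", 0)

# With the hash seed fixed, separate processes agree on the hash value.
print(fixed_a == fixed_b)
```

Without the variable set, two separate processes will usually disagree on the hash of the same string.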

If you are using conda for managing packages, dependencies, and environments, you can save environment variables in your conda environment.

Otherwise, in macOS or Linux environments you can accomplish this by running the command export PYTHONHASHSEED=0 in the Terminal or bash shell before running your code.

In a Windows 10 environment using PowerShell, you can accomplish this by running $env:PYTHONHASHSEED=0 before running your code.
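Whichever method you use, it is easy to forget the variable in a new terminal session. One option is a small guard at the top of your script that warns when the variable is missing. This is a sketch under the assumption that a warning is enough for your workflow. The function name hash_seed_is_fixed is hypothetical.

```python
import os


def hash_seed_is_fixed(environ=os.environ):
    """Return True if PYTHONHASHSEED is set to "0" in the given environment.

    Hypothetical helper for illustration only.
    """
    return environ.get("PYTHONHASHSEED") == "0"


if not hash_seed_is_fixed():
    # Python reads the hash seed when the interpreter starts, so assigning
    # os.environ["PYTHONHASHSEED"] here would not affect this process.
    print("Warning: PYTHONHASHSEED is not 0; this run may not be repeatable.")
```

Because the guard can only detect the problem and not fix it, the variable still has to be set in the shell or conda environment before Python starts.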