Chapter 6 Managing Workflows with Make
I hope that by now, you have come to appreciate that the command line is a very convenient environment for exploratory data analysis. You may have noticed that, as a consequence of working with the command line, we:
- Invoke many different commands.
- Create custom command-line tools.
- Obtain and generate many (intermediate) files.
As this process is of an exploratory nature, our workflow tends to be rather chaotic, which makes it difficult to keep track of what we’ve done. It is very important that our steps can be reproduced, whether that is by ourselves or by others. When we, for example, continue with a project from a few weeks earlier, chances are that we have forgotten which commands we have run, on which files, in which order, and with which parameters. Imagine the difficulty of passing on your analysis to a collaborator.
You may recover some lost commands by digging into your Bash history, but this is, of course, not a good approach. A somewhat better approach would be to save your commands to a shell script run.sh. This allows you and your collaborators to at least reproduce the analysis. A shell script is, however, a sub-optimal approach because:
- It is difficult to read and to maintain.
- Dependencies between steps are unclear.
- Every step gets executed every time, which is inefficient and sometimes undesirable.
This is where Make comes in handy. Make is a command-line tool that allows you to:
- Formalize your data workflow steps in terms of input and output dependencies.
- Run specific steps of your workflow from the command line.
- Use inline code.
- Store and retrieve data from external sources.
Managing your data workflow with Make is the main topic of this chapter. As such, you’ll learn about:
- Defining your workflow with a so-called Makefile.
- Thinking about your workflow in terms of input and output dependencies.
- Building specific targets.
6.2 Introducing Make
Make organizes command execution around data and its dependencies. Your data processing steps are formalized in a separate text file (a workflow). Each step usually has zero, one, or more inputs and outputs. Make automatically resolves their dependencies and determines which commands need to be run and in which order.
This means that when you have, say, an SQL query that takes ten minutes, it only has to be executed when the result is missing or when the query has changed afterwards. Also, if you want to (re-)run a specific step, Make only considers (re-)running the steps on which it depends. This can save you a lot of time.
Having a formalized workflow allows you to easily pick up your project after a few weeks and to collaborate with others. I strongly advise you to do this, even when you think this will be a one-off project, because you never know when you’ll need to run certain steps again, or when you’ll want to reuse certain steps in another project.
6.3 A Glorified Task Runner
By default, Make searches for a file called makefile or Makefile in the current directory. I recommend calling your file the latter so that it appears at the top of a directory listing.
Let’s start with a small Makefile in a new directory:
$ cd ~
$ mkdir ch06 && cd ch06
$ pwd
/home/dst/ch06
$ cat << 'EOF' > Makefile
> numbers:
> 	seq 7
> EOF
The first line of the cat command, the leading > characters, and the closing EOF are shell syntax used to create Makefile and do not end up being part of the file itself. This way of creating files is known as a here document, or heredoc.
This Makefile contains one target called numbers. The line below, seq 7, is known as a rule.
The white space in front of the rule is a tab character. Make is picky when it comes to white space. Some editors insert spaces when you press the TAB key, known as a soft tab, which will cause Make to produce an error. We can verify that our Makefile is correct with cat -t, which displays tab characters as ^I:
$ cat -t Makefile
numbers:
^Iseq 7
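The same check can be automated. The following is only a sketch: the helper name has_soft_tabs and the file Makefile.broken are made up for illustration, and the test it performs (does any line start with a space?) is a heuristic rather than a full syntax check. The files are written with printf so that the tab is spelled out as \t and cannot be silently converted by an editor.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if any line of the given file starts with
# a space, which usually means a rule was indented with a soft tab.
has_soft_tabs() {
  grep -q '^ ' "$1"
}

workdir=$(mktemp -d)

# \t becomes a real tab character, so this Makefile is valid.
printf 'numbers:\n\tseq 7\n' > "$workdir/Makefile"

# Same content, but indented with four spaces instead of a tab.
printf 'numbers:\n    seq 7\n' > "$workdir/Makefile.broken"

if has_soft_tabs "$workdir/Makefile"; then good=spaces; else good=tabs; fi
if has_soft_tabs "$workdir/Makefile.broken"; then bad=spaces; else bad=tabs; fi

echo "Makefile: $good, Makefile.broken: $bad"
```

Lines that legitimately start with a space, such as an indented variable definition, would also be flagged, so treat a match as a hint to inspect the file with cat -t.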
If we invoke make with the name of the target:
$ make numbers
seq 7
1
2
3
4
5
6
7
then we see that Make first prints the rule itself, and then the output generated by the rule. This is known as building a target. Make was originally created to ease the compilation of source code, which explains some of this jargon.
If you don’t specify a target, make will build the first target specified in the Makefile.
In this case, we’re not actually building anything, as in, we’re not creating any new files. You could say that we’re using Make as a glorified task runner. That already provides value, because we can use a Makefile to keep project-specific shortcuts.
When I use Docker for a project, I often put a target called docker in a Makefile. For example, the following rule, which is from an actual Makefile I use, launches Jupyter Lab in a Docker container. (Yes, it’s perfectly fine to put long-running commands in a Makefile.)
$ cat some-real-project/Makefile | grep docker
docker:
	docker run --rm -it -p 9999:9999 -v "$$(pwd)/notebooks":/opt/notebooks continuumio/anaconda3 /bin/bash -c "/opt/conda/bin/jupyter lab --notebook-dir=/opt/notebooks --ip="0.0.0.0" --port=9999 --no-browser --allow-root"
This way, I don’t need to remember what incantation I used for this project, nor do I need to search my history. But Make can do much more for us!
6.4 Building Targets
Let’s modify our Makefile such that the output of the rule is written to a file:
$ cat << 'EOF' > Makefile
> numbers:
> 	seq 7 > numbers
> EOF
Now it makes more sense to speak of building a target:
$ make numbers
seq 7 > numbers
What’s more, if we run the same command again:
$ make numbers
make: 'numbers' is up to date.
we see that Make reports that target numbers is up-to-date. In other words, there’s no need to rebuild the target numbers because the file numbers already exists. That’s great because Make is saving us time by not doing unnecessary work.
In Make, it’s all about files. But keep in mind that Make only cares about the name of the target. It does not check whether a file of the same name actually gets created by the rule. If we were to write to a file called nummers, which is Dutch for “numbers”, and the target were still called numbers, then Make would always build this target. Vice versa, if the file numbers were created by some other process, whether automated or manual, then Make would still consider that target up-to-date.
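Both cases can be tried in a scratch directory. This is a sketch under a few assumptions: GNU Make is installed, the work happens in a temporary directory, and the -q option ("question mode") is used so that Make runs nothing and only reports through its exit status whether the target is up to date (0) or would be rebuilt (nonzero).

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The target is called numbers, but the rule writes to nummers.
printf 'numbers:\n\tseq 7 > nummers\n' > Makefile

# This creates the file nummers, not numbers.
make numbers > /dev/null

# Ask Make whether numbers is up to date; it is not, because no file
# called numbers exists.
make -q numbers && mismatch=0 || mismatch=$?

# Create the file by hand, without involving Make at all.
touch numbers
make -q numbers && by_hand=0 || by_hand=$?

echo "after make: $mismatch, after touch: $by_hand"
```

The first status should be nonzero and the second 0, even though the rule never produced the file numbers itself.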
We can improve a rule by using the automatic variable $@, which gets expanded to the name of the target:
$ cat << 'EOF' > Makefile
> numbers:
> 	seq 7 > $@
> EOF
Let’s verify this by removing the file numbers and calling Make again:
$ rm numbers
$ make numbers
seq 7 > numbers
$ cat numbers
1
2
3
4
5
6
7
Another reason for Make to rebuild a target is a change in its dependencies, so let’s discuss that next.
6.5 Adding Dependencies between Targets
So far, we’ve looked at targets which exist in isolation. In a typical data science workflow, steps may depend on other steps. In order to properly talk about dependencies in a Makefile, let’s consider two tasks that work with a data set about Star Wars characters. Here’s an excerpt of that data set:
$ curl -sL 'https://raw.githubusercontent.com/tidyverse/dplyr/master/data-raw/starwars.csv' |
> xsv select name,height,mass,sex,homeworld,species |
> head |
> csvlook
| name               | height | mass | sex    | homeworld | species |
| ------------------ | ------ | ---- | ------ | --------- | ------- |
| Luke Skywalker     |    172 |   77 | male   | Tatooine  | Human   |
| C-3PO              |    167 |   75 |        | Tatooine  | Droid   |
| R2-D2              |     96 |   32 |        | Naboo     | Droid   |
| Darth Vader        |    202 |  136 | male   | Tatooine  | Human   |
| Leia Organa        |    150 |   49 | female | Alderaan  | Human   |
| Owen Lars          |    178 |  120 | male   | Tatooine  | Human   |
| Beru Whitesun lars |    165 |   75 | female | Tatooine  | Human   |
| R5-D4              |     97 |   32 |        | Tatooine  | Droid   |
| Biggs Darklighter  |    183 |   84 | male   | Tatooine  | Human   |
The first task computes the ten tallest humans:
$ curl -sL 'https://raw.githubusercontent.com/tidyverse/dplyr/master/data-raw/starwars.csv' |
> grep Human |
> cut -d, -f 1,2 |
> sort -t, -k2 -nr |
> head
Darth Vader,202
Qui-Gon Jinn,193
Dooku,193
Bail Prestor Organa,191
Raymus Antilles,188
Mace Windu,188
Anakin Skywalker,188
Gregar Typho,185
Jango Fett,183
Cliegg Lars,183
The second task creates a box plot showing the distribution of heights per species:
$ curl -sL 'https://raw.githubusercontent.com/tidyverse/dplyr/master/data-raw/starwars.csv' |
> Rio -ge 'g + geom_boxplot(aes(x = height, y = species))' > heights.png
$ display heights.png
We can put these two tasks into a Makefile. Instead of doing this incrementally, I’d first like to show what a complete Makefile looks like and then explain all the syntax step by step.
$ cat << 'EOF' > Makefile
> SHELL := bash
> .ONESHELL:
> .SHELLFLAGS := -eu -o pipefail -c
>
> URL = "https://raw.githubusercontent.com/tidyverse/dplyr/master/data-raw/starwars.csv"
>
> .PHONY: all top10
>
> all: top10 heights.png
>
> data:
> 	mkdir $@
>
> data/starwars.csv: data
> 	curl -sL $(URL) > $@
>
> top10: data/starwars.csv
> 	grep Human $< |
> 	cut -d, -f 1,2 |
> 	sort -t, -k2 -nr |
> 	head
>
> heights.png: data/starwars.csv
> 	< $< Rio -ge 'g + geom_boxplot(aes(x = height, y = species))' > $@
> EOF
Let’s go through this Makefile step by step. The first three lines are there to change some default settings related to Make itself:
- All rules are executed in a shell, which by default is sh. With the SHELL variable we can change this to another shell, like bash. This way we can use everything that Bash has to offer, such as for loops. Moreover, it resembles the shell that we are working in, which makes for a more consistent experience.
- By default, every line in a rule is sent separately to the shell. With the special target .ONESHELL we can override this so that the multi-line rule for target top10 works.
- The .SHELLFLAGS line makes Bash more strict, which is considered a best practice.[2] For example, because of this, the pipeline in the rule for target top10 now stops as soon as there is an error.
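The effect of these flags can be observed outside Make by passing them to Bash directly. This is a small sketch, not part of the Makefile: the pipeline false | true is a stand-in for a real pipeline whose first command fails.

```shell
#!/usr/bin/env bash
# Default behavior: the status of a pipeline is that of its last command,
# so the failure of false goes unnoticed and the script carries on.
bash -c 'false | true; echo "still running"' > /dev/null && lenient=0 || lenient=$?

# With the flags from .SHELLFLAGS, pipefail makes the pipeline fail and
# -e aborts the script before the echo is reached.
bash -eu -o pipefail -c 'false | true; echo "still running"' > /dev/null && strict=0 || strict=$?

echo "lenient: $lenient, strict: $strict"
```

The lenient invocation exits with status 0, whereas the strict one exits with a nonzero status, which is what lets Make notice that a step has failed.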
We define a custom variable called URL. Even though this is only used once, I find it helpful to put information like this near the beginning of the file.
With the special target .PHONY we can indicate which targets are not represented by files. In our case that holds for targets all and top10. These targets will now be executed regardless of whether the directory contains files with the same name.
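To see what .PHONY changes, the sketch below places a file called top10 next to two variants of a tiny Makefile. It assumes GNU Make and uses a temporary directory; the rule that merely echoes a message is a placeholder for the real one. As before, make -q only reports through its exit status whether the target is considered up to date.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A file that happens to have the same name as the target.
touch top10

# Without .PHONY, Make sees the file and considers the target done.
printf 'top10:\n\techo building top10\n' > Makefile
make -q top10 && plain=0 || plain=$?

# With .PHONY, the file is ignored and the target is always out of date.
printf '.PHONY: top10\ntop10:\n\techo building top10\n' > Makefile
make -q top10 && phony=0 || phony=$?

echo "without .PHONY: $plain, with .PHONY: $phony"
```

The first status should be 0 (nothing to do) and the second nonzero (the rule would run).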
There are five targets: all, data, data/starwars.csv, top10, and heights.png. Let’s discuss each of them in turn:
- The target all has two dependencies but no rule. This is like a shortcut to execute one or more targets in the order in which they are specified. In this case: top10 and heights.png. The target all appears as the first target in the Makefile, which means that if we simply run make, this target will be built.
- The target data creates the directory data. Earlier I said that Make is all about files. Well, it’s also about directories. This target will only be executed when the directory data doesn’t yet exist.
- The target data/starwars.csv depends on the target data. If there’s no data directory, it will first be created. Once all dependencies are satisfied, the rule will be executed, which involves downloading a file and saving it to a file with the same name as the target.
- The target top10 is marked as phony, so it will always be built if specified. It depends on the data/starwars.csv target. It makes use of a special variable, $<, which expands to the name of the first prerequisite, namely data/starwars.csv.
- The target heights.png, like target top10, depends on data/starwars.csv and makes use of both automatic variables we’ve seen in this chapter. See https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html if you’d like to learn about other automatic variables.
Last but not least, let’s verify that this Makefile works:
$ make
mkdir data
curl -sL "https://raw.githubusercontent.com/tidyverse/dplyr/master/data-raw/starwars.csv" > data/starwars.csv
grep Human data/starwars.csv |
cut -d, -f 1,2 |
sort -t, -k2 -nr |
head
Darth Vader,202
Qui-Gon Jinn,193
Dooku,193
Bail Prestor Organa,191
Raymus Antilles,188
Mace Windu,188
Anakin Skywalker,188
Gregar Typho,185
Jango Fett,183
Cliegg Lars,183
< data/starwars.csv Rio -ge 'g + geom_boxplot(aes(x = height, y = species))' > heights.png
No surprises here. Because we didn’t specify any target, the all target will be built, which, in turn, causes both the top10 and heights.png targets to be built. The output of the former is printed to standard output and the latter creates a file heights.png. The data directory is created only once, just like the CSV file is only downloaded once.
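The way dependencies trigger a rebuild can be explored without downloading anything. The sketch below is not the Star Wars workflow: the file names input and report are hypothetical, GNU Make is assumed, and the timestamps are pinned with touch -t so that the outcome doesn’t depend on how quickly the commands run. Make treats a target as out of date when one of its prerequisites is newer than the target.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# report depends on input; the rule copies the first prerequisite ($<)
# to the target ($@).
printf 'report: input\n\tcp $< $@\n' > Makefile

echo 1 > input
make report > /dev/null

# The target is newer than its prerequisite, hence up to date.
touch -t 202001010000 input
touch -t 202001020000 report
make -q report && fresh=0 || fresh=$?

# Pretend that input was changed after report was built.
touch -t 202001030000 input
make -q report && stale=0 || stale=$?

echo "fresh: $fresh, stale: $stale"
```

The first status should be 0 and the second nonzero: once input is newer than report, Make would run the rule again.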
One of the beauties of the command line is that it allows you to play with your data. You can easily execute different commands and process different data files. It is a very interactive and iterative process. After a while, it is easy to forget which steps you have taken to get the desired result. It is therefore very important to document your steps every once in a while. This way, if you or one of your colleagues picks up your project after some time, the same result can be produced again by executing the same steps.
I have shown you that just putting every command in one Bash script is suboptimal. I have proposed to use Make as a command-line tool to manage your data workflow. By using a running example, I have shown you how to define steps and the dependencies between them. I’ve also discussed how to use variables and special targets.
There’s nothing more fun than just playing with your data and forgetting everything else. But you have to trust me when I say that it’s worthwhile to keep a record of what you have done using a Makefile. Not only will it make your life easier, but you will also start thinking about your data workflow in terms of steps. Just as your command-line toolbox, which you expand over time, makes you more efficient, so do Make workflows. The more steps you have defined, the easier it gets to keep doing it, because very often you can reuse certain steps. I hope that you will get used to Make, and that it will make your life easier.
[2] See http://redsymbol.net/articles/unofficial-bash-strict-mode/ for more information.