Chapter 6 Managing Workflows with Make

I hope that by now, you have come to appreciate that the command line is a very convenient environment for exploratory data analysis. You may have noticed that, as a consequence of working with the command line, we:

  • Invoke many different commands.
  • Create custom command-line tools.
  • Obtain and generate many (intermediate) files.

As this process is of an exploratory nature, our workflow tends to be rather chaotic, which makes it difficult to keep track of what we’ve done. It is very important that our steps can be reproduced, whether that is by ourselves or by others. When we, for example, continue with a project from a few weeks earlier, chances are that we have forgotten which commands we have run, on which files, in which order, and with which parameters. Imagine the difficulty of passing on your analysis to a collaborator.

You may recover some lost commands by digging into your Bash history, but this is, of course, not a good approach. A somewhat better approach would be to save your commands to a shell script run.sh. This allows you and your collaborators to at least reproduce the analysis. A shell script is, however, a sub-optimal approach because:

  • It is difficult to read and to maintain.
  • Dependencies between steps are unclear.
  • Every step gets executed every time, which is inefficient and sometimes undesirable.

This is where Make comes in handy. Make is a command-line tool that allows you to:

  • Formalize your data workflow steps in terms of input and output dependencies.
  • Run specific steps of your workflow from the command line.
  • Use inline code.
  • Store and retrieve data from external sources.

In the first edition, this chapter used Drake (Factual 2014) instead of Make. Drake was supposed to be a successor to Make with additional features for working with data. Unfortunately, it appears that Drake was abandoned by its creators a couple of years ago, leaving several bugs unresolved. That’s why I’ve decided to use Make instead.

6.1 Overview

Managing your data workflow with Make is the main topic of this chapter. As such, you’ll learn about:

  • Defining your workflow with a so-called Makefile.
  • Thinking about your workflow in terms of input and output dependencies.
  • Building specific targets.

6.2 Introducing Make

Make organizes command execution around data and its dependencies. Your data processing steps are formalized in a separate text file (a workflow). Each step usually has zero, one, or more inputs and outputs. Make automatically resolves their dependencies and determines which commands need to be run and in which order.

This means that when you have, say, an SQL query that takes ten minutes, it only has to be executed when the result is missing or when the query has changed afterwards. Also, if you want to (re-)run a specific step, Make only considers (re-)running the steps on which it depends. This can save you a lot of time.
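
To make this concrete, here is a minimal sketch in Bash of the check that Make performs for a single step: a target needs to be built when it is missing or when one of its inputs is newer. The function needs_rebuild and the file names query.sql and result.csv are made up for this illustration, and the sketch assumes GNU touch. It is not how Make is implemented; Make also follows dependencies of dependencies.

```shell
# Print "yes" if TARGET is missing or older than any of its inputs, else "no".
# Usage: needs_rebuild TARGET INPUT...
needs_rebuild() {
  local target=$1; shift
  [ -e "$target" ] || { echo yes; return; }
  local input
  for input in "$@"; do
    if [ "$input" -nt "$target" ]; then echo yes; return; fi
  done
  echo no
}

workdir=$(mktemp -d)
touch -d '2020-01-01' "$workdir/query.sql"
needs_rebuild "$workdir/result.csv" "$workdir/query.sql"   # result is missing
touch -d '2020-01-02' "$workdir/result.csv"
needs_rebuild "$workdir/result.csv" "$workdir/query.sql"   # result is newer than query
touch -d '2020-01-03' "$workdir/query.sql"
needs_rebuild "$workdir/result.csv" "$workdir/query.sql"   # query changed afterwards
```

Only the first and the last call report that the ten-minute query would have to run again.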

A formalized workflow allows you to easily pick up your project after a few weeks and to collaborate with others. I strongly advise you to create one, even when you think this will be a one-off project, because you never know when you’ll need to run certain steps again, or when you’ll want to reuse certain steps in another project.

6.3 A Glorified Task Runner

By default, Make searches for a file called makefile or Makefile in the current directory. I recommend calling your file the latter so that it appears at the top of a directory listing.

Let’s start with a small Makefile in a new directory:

$ cd ~
$ mkdir ch06 && cd ch06
$ pwd
/home/dst/ch06
$
$ cat << 'EOF' > Makefile
> numbers:
>       seq 7
> EOF

Note that the lines starting with cat and EOF are shell syntax used to create the Makefile; they do not end up in the file itself. This way of creating files is known as a here document, or heredoc.

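This Makefile contains one target called numbers. The line below, seq 7, is known as a rule. (Strictly speaking, the GNU Make manual calls this indented line a recipe and uses the term rule for the target, its prerequisites, and the recipe together. In this chapter I use rule for the commands.)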

The white space in front of the rule is a tab character. Make is picky when it comes to white space. Some editors insert spaces when you press the TAB key, known as a soft tab, which will cause Make to produce an error. We can verify that our Makefile is correct with cat -t, which displays tab characters as ^I:

$ cat -t Makefile
numbers:
^Iseq 7
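
If you’d rather have the offending lines pointed out, a small helper can list them. The following is only a sketch: the function soft_tab_lines and the two file names are hypothetical, and the check is a heuristic that assumes every indented line in the Makefile belongs to a rule. I use printf here because it lets me spell out the tab as \t.

```shell
# Print the numbers of the lines in a Makefile that start with a space.
# Heuristic only: it assumes that every indented line is part of a rule.
soft_tab_lines() {
  awk '/^ / { print NR }' "$1"
}

workdir=$(mktemp -d)
printf 'numbers:\n    seq 7\n' > "$workdir/Makefile.spaces"   # soft tab (four spaces)
printf 'numbers:\n\tseq 7\n'   > "$workdir/Makefile.tab"      # real tab character
soft_tab_lines "$workdir/Makefile.spaces"   # reports the second line
soft_tab_lines "$workdir/Makefile.tab"      # prints nothing
```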

If we invoke make with the name of the target:

$ make numbers
seq 7
1
2
3
4
5
6
7

then we see that Make first prints the rule itself, and then the output generated by the rule. This is known as building a target. Make was originally created to ease the compilation of source code, which explains some of this jargon.

If you don’t specify the name of a target, then make will build the first target specified in the Makefile.
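
Here is a small sketch of that behavior, using a throwaway directory and two made-up targets called first and second. The @ in front of echo tells Make not to print the command itself.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# printf turns each \t into the tab character that Make requires.
printf 'first:\n\t@echo building first\n\nsecond:\n\t@echo building second\n' > Makefile
make           # no target given, so the first one in the file is built
make second    # any other target can still be requested by name
```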

In this case, we’re not actually building anything, as in, we’re not creating any new files. You could say that we’re using Make as a glorified task runner. That already provides value, because we can use a Makefile to keep project-specific shortcuts.

When I use Docker for a project, I often put a target called docker in a Makefile. For example, the following rule, which is from an actual Makefile I use, launches Jupyter Lab in a Docker container. (Yes, it’s perfectly fine to put long-running commands in a Makefile.)

$ cat some-real-project/Makefile | grep docker
docker:
        docker run --rm -it -p 9999:9999 -v "$$(pwd)/notebooks":/opt/notebooks continuumio/anaconda3 /bin/bash -c "/opt/conda/bin/jupyter lab --notebook-dir=/opt/notebooks --ip="0.0.0.0" --port=9999 --no-browser --allow-root"

This way, I don’t need to remember what incantation I used for this project, nor do I need to search my history. But Make can do much more for us!

6.4 Building Targets

Let’s modify our Makefile such that the output of the rule is written to a file called numbers:

$ cat << 'EOF' > Makefile
> numbers:
>       seq 7 > numbers
> EOF

Now it makes more sense to speak of building a target:

$ make numbers
seq 7 > numbers

What’s more, if we run make again:

$ make numbers
make: 'numbers' is up to date.

We see that Make reports that target numbers is up-to-date. In other words, there’s no need to rebuild the target numbers because the file numbers already exists. That’s great because Make is saving us time by not doing unnecessary work.

In Make, it’s all about files. But keep in mind that Make only cares about the name of the target. It does not check whether a file of the same name actually gets created by the rule. If we were to write to a file called nummers, which is Dutch for “numbers”, while the target is still called numbers, then Make would always build this target. Conversely, if the file numbers were created by some other process, whether automated or manual, then Make would still consider that target up-to-date.
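
Both situations can be tried out in a throwaway directory. This is a sketch of the scenario just described, not something you would want in a real project:

```shell
workdir=$(mktemp -d)
cd "$workdir"
# The target is called numbers, but the command writes to nummers.
printf 'numbers:\n\tseq 7 > nummers\n' > Makefile
make numbers     # runs seq and creates nummers
make numbers     # runs seq again, because the file numbers still does not exist
touch numbers    # create the file numbers by hand
make numbers     # Make now considers the target up to date
```

The exact wording of the up-to-date message may differ slightly between versions of Make.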

We can improve a rule by using the automatic variable $@, which gets expanded to the name of the target:

$ cat << 'EOF' > Makefile
> numbers:
>       seq 7 > $@
> EOF

Let’s verify this by removing the file numbers and calling Make again:

$ rm numbers
$ make numbers
seq 7 > numbers
$ cat numbers
1
2
3
4
5
6
7

Another reason for Make to rebuild a target is a change in one of its dependencies, so let’s discuss that next.

6.5 Adding Dependencies between Targets

So far, we’ve looked at targets which exist in isolation. In a typical data science workflow, steps may depend on other steps. In order to properly talk about dependencies in a Makefile, let’s consider two tasks that work with a data set about Star Wars characters. Here’s an excerpt of that data set:

$ curl -sL 'https://raw.githubusercontent.com/tidyverse/dplyr/master/data-raw/starwars.csv' |
> xsv select name,height,mass,sex,homeworld,species |
> head |
> csvlook
| name               | height | mass | sex    | homeworld | species |
| ------------------ | ------ | ---- | ------ | --------- | ------- |
| Luke Skywalker     |    172 |   77 | male   | Tatooine  | Human   |
| C-3PO              |    167 |   75 |        | Tatooine  | Droid   |
| R2-D2              |     96 |   32 |        | Naboo     | Droid   |
| Darth Vader        |    202 |  136 | male   | Tatooine  | Human   |
| Leia Organa        |    150 |   49 | female | Alderaan  | Human   |
| Owen Lars          |    178 |  120 | male   | Tatooine  | Human   |
| Beru Whitesun lars |    165 |   75 | female | Tatooine  | Human   |
| R5-D4              |     97 |   32 |        | Tatooine  | Droid   |
| Biggs Darklighter  |    183 |   84 | male   | Tatooine  | Human   |

The first task computes the ten tallest humans:

$ curl -sL 'https://raw.githubusercontent.com/tidyverse/dplyr/master/data-raw/starwars.csv' |
> grep Human |
> cut -d, -f 1,2 |
> sort -t, -k2 -nr |
> head
Darth Vader,202
Qui-Gon Jinn,193
Dooku,193
Bail Prestor Organa,191
Raymus Antilles,188
Mace Windu,188
Anakin Skywalker,188
Gregar Typho,185
Jango Fett,183
Cliegg Lars,183

The second task creates a box plot showing the distribution of heights per species:

$ curl -sL 'https://raw.githubusercontent.com/tidyverse/dplyr/master/data-raw/starwars.csv' |
> Rio -ge 'g + geom_boxplot(aes(x = height, y = species))' > heights.png
$ display heights.png

Figure 6.1: Distribution of heights per species in Star Wars

We can put these two tasks into a Makefile. Instead of doing this incrementally, I’d first like to show what a complete Makefile looks like and then explain all the syntax step by step.

$ cat << 'EOF' > Makefile
> SHELL := bash
> .ONESHELL:
> .SHELLFLAGS := -eu -o pipefail -c
>
> URL = "https://raw.githubusercontent.com/tidyverse/dplyr/master/data-raw/starwars.csv"
>
> .PHONY: all top10
>
> all: top10 heights.png
>
> data:
>       mkdir $@
>
> data/starwars.csv: data
>       curl -sL $(URL) > $@
>
> top10: data/starwars.csv
>       grep Human $< |
>       cut -d, -f 1,2 |
>       sort -t, -k2 -nr |
>       head
>
> heights.png: data/starwars.csv
>       < $< Rio -ge 'g + geom_boxplot(aes(x = height, y = species))' > $@
> EOF

Let’s go through this Makefile step by step. The first three lines are there to change some default settings related to Make itself:

  1. All rules are executed in a shell, which, by default, is sh. With the SHELL variable we can change this to another shell, like bash. This way we can use everything that Bash has to offer, such as arrays and process substitution. Moreover, it resembles the shell that we are working in, which makes for a more consistent experience.
  2. By default, every line in a rule is sent separately to the shell. With the special target .ONESHELL we can override this, so that all lines of a rule are passed to a single shell. Without it, the multi-line pipeline in the rule for target top10 would not work.
  3. The .SHELLFLAGS line makes Bash more strict, which is considered a best practice. For example, because of this, the rule for target top10 now fails when any command in its pipeline fails, not just the last one.
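
The effect of these flags can be seen outside of Make as well. The following sketch runs the pipeline false | true, in which the first command fails and the last one succeeds, once with Bash’s defaults and once with the flags from the Makefile. The helper function pipeline_status is made up for this illustration.

```shell
# Run the pipeline "false | true" with the given Bash flags and print its exit status.
pipeline_status() {
  local status=0
  bash "$@" -c 'false | true' || status=$?
  echo "$status"
}

pipeline_status                    # Bash defaults: only the last command counts
pipeline_status -eu -o pipefail    # the flags that we set through .SHELLFLAGS
```

With the defaults the failure goes unnoticed (status 0), whereas with the stricter flags the pipeline is reported as failed (status 1), which in turn makes Make stop.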

We define a custom variable called URL. Even though this is only used once, I find it helpful to put information like this near the beginning of the file.

With the special target .PHONY we can indicate which targets are not represented by files. In our case that holds for targets all and top10. These targets will now be executed regardless of whether the directory contains files with the same name.
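
A quick way to see what .PHONY changes is to create a file that happens to have the same name as a target. In this sketch, which uses a throwaway directory, the rule for top10 merely echoes a message instead of computing anything:

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch top10    # an unrelated file that shares its name with the target

printf 'top10:\n\t@echo computing top10\n' > Makefile
make top10     # Make sees the file and considers the target up to date

printf '.PHONY: top10\ntop10:\n\t@echo computing top10\n' > Makefile
make top10     # the rule now runs, regardless of the file
```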

There are five targets: all, data, data/starwars.csv, top10, and heights.png. Let’s discuss each of them in turn:

  1. The target all has two dependencies but no rule. This is like a shortcut to execute one or more targets in the order in which they are specified. In this case: top10 and heights.png. The target all appears as the first target in the Makefile, which means that if we simply run make, this target will be built.
  2. The target data creates the directory data. Earlier I said that Make is all about files. Well, it’s also about directories. This target will only be executed when the directory data doesn’t yet exist.
  3. The target data/starwars.csv depends on the target data. If there’s no data directory, it will first be created. Once all dependencies are satisfied, the rule will be executed, which involves downloading a file and saving it to a file with the same name as the target.
  4. The target top10 is marked as phony, so it will always be built if specified. It depends on the data/starwars.csv target. It makes use of the automatic variable $<, which expands to the name of the first prerequisite, namely data/starwars.csv.
  5. The target heights.png, like target top10, depends on data/starwars.csv and makes use of both automatic variables we’ve seen in this chapter. See https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html if you’d like to learn about other automatic variables.
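
To see what the automatic variables expand to, it can help to let a rule simply print them. The sketch below uses a made-up target combined.csv with two empty prerequisites, and also shows $^, one of the other automatic variables, which expands to all prerequisites.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch a.csv b.csv
# The rule creates nothing; it only prints what each variable expands to.
printf 'combined.csv: a.csv b.csv\n\t@echo "target: $@"\n\t@echo "first prerequisite: $<"\n\t@echo "all prerequisites: $^"\n' > Makefile
make combined.csv
```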

Last but not least, let’s verify that this Makefile works:

$ make
mkdir data
curl -sL "https://raw.githubusercontent.com/tidyverse/dplyr/master/data-raw/starwars.csv" > data/starwars.csv
grep Human data/starwars.csv |
cut -d, -f 1,2 |
sort -t, -k2 -nr |
head
Darth Vader,202
Qui-Gon Jinn,193
Dooku,193
Bail Prestor Organa,191
Raymus Antilles,188
Mace Windu,188
Anakin Skywalker,188
Gregar Typho,185
Jango Fett,183
Cliegg Lars,183
< data/starwars.csv Rio -ge 'g + geom_boxplot(aes(x = height, y = species))' > heights.png

No surprises here. Because we didn’t specify any target, the all target will be built, which, in turn, causes both the top10 and heights.png targets to be built. The output of the former is printed to standard output and the latter creates a file heights.png. The data directory is created only once, just like the CSV file is only downloaded once.
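
One caveat: because data is a regular dependency of data/starwars.csv, Make also compares their timestamps. As far as I can tell, this means that adding another file to the data directory at a later moment makes the directory newer than the CSV file, which would trigger a new download. GNU Make offers order-only prerequisites, written after a | character, for exactly this situation: the directory has to exist, but its timestamp is ignored. The following sketch is a variation on our Makefile, not a replacement for it; echo stands in for curl so that it works offline, and notes.txt is a made-up file name.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# "| data" declares data as an order-only prerequisite of data/starwars.csv.
printf 'data:\n\tmkdir $@\n\ndata/starwars.csv: | data\n\techo name,height > $@\n' > Makefile
make data/starwars.csv    # creates the directory and then the file
sleep 1
touch data/notes.txt      # adding a file updates the timestamp of the directory
make data/starwars.csv    # still considered up to date
```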

6.6 Discussion

One of the beauties of the command line is that it allows you to play with your data. You can easily execute different commands and process different data files. It is a very interactive and iterative process. After a while, it is easy to forget which steps you have taken to get the desired result. It is therefore very important to document your steps every once in a while. This way, if you or one of your colleagues picks up your project after some time, the same result can be produced again by executing the same steps.

I have shown you that just putting every command in one Bash script is suboptimal. I have proposed using Make as a command-line tool to manage your data workflow. By using a running example, I have shown you how to define steps and the dependencies between them. I’ve also discussed how to use variables and special targets.

There’s nothing more fun than just playing with your data and forgetting everything else. But you have to trust me when I say that it’s worthwhile to keep a record of what you have done using a Makefile. Not only will it make your life easier, but you will also start thinking about your data workflow in terms of steps. Just as your command-line toolbox, which you expand over time, makes you more efficient, so do your Make workflows. The more steps you have defined, the easier it gets to keep doing it, because very often you can reuse certain steps. I hope that you will get used to Make, and that it will make your life easier.