Chapter 2 Getting Started

In this chapter we are going to make sure that you have all the prerequisites for doing data science at the command line. The prerequisites fall into two parts: (1) having a proper environment with all the command-line tools that we employ in this book installed, and (2) understanding the essential concepts that come into play when using the command line.

First, we describe how to install the Docker image, which is a virtual environment based on Linux that contains all the necessary command-line tools. Subsequently, we explain the essential command-line concepts through examples.

By the end of this chapter, you’ll have everything you need in order to continue with the first step of doing data science, namely obtaining data.

2.1 Overview

In this chapter, you’ll learn:

  • How to install the Docker image.
  • Essential concepts and tools necessary to perform data science at the command line.

2.2 Installing the Docker Image

This section used to be called Setting up the Data Science Toolbox and described how to install a Vagrant box containing all the command-line tools. This Vagrant box was created in 2014, and because technology around virtualisation and containerisation has moved on, it became high time for an update. So now, instead of a Vagrant box, we use a Docker image.

In this book we use many different command-line tools. Linux often comes with a whole bunch of command-line tools pre-installed. Moreover, Linux offers many packages that contain other, relevant tools. Installing these packages yourself is not too difficult. However, we also use tools that are not available as packages and require a more manual, and more involved, installation. In order to acquire the necessary command-line tools without having to go through the involved installation process of each, we encourage you to install a Docker image that was created specifically for this book.

If you still prefer to run the command-line tools natively rather than inside a Docker image, then you can, of course, install the command-line tools individually yourself. Please be aware that this is a very time-consuming process. The Appendix lists all the command-line tools used in the book. The installation instructions are for Ubuntu only. The scripts and data sets used in the book can be obtained by cloning this book’s GitHub repository.

To install the Docker image, you first need to download and install Docker itself from the Docker website. Once Docker is installed, you invoke the following command on your terminal or command prompt to download the Docker image (don’t type the dollar sign):

You can run the Docker image as follows:

It’s a good idea to remember from which directory you run the Docker container. In the above command, the option -v instructs docker to map the current directory to the /data directory inside the container, so this is the place to get data in and out of the Docker container.

You’re now inside an isolated Linux environment—known as a Docker container—with all the necessary command-line tools installed. If the following command produces an enthusiastic cow, then you know everything is working correctly:

If you would like to know more about the Docker image you can visit it on Docker Hub.

2.3 Essential GNU/Linux Concepts

In Chapter 1, we briefly showed you what the command line is. Now that you are running the Docker image, we can really get started. In this section, we discuss several concepts and tools that you will need to know in order to feel comfortable doing data science at the command line. If, up to now, you have been mainly working with graphical user interfaces, then this might be quite a change. But don’t worry, we’ll start at the beginning, and very gradually go to more advanced topics.

This section is not a complete course in GNU/Linux. We will only explain the concepts and tools that are relevant to doing data science. One of the advantages of the Docker image is that a lot is already set up. If you wish to know more about GNU/Linux, consult the Further Reading section at the end of this chapter.

2.3.1 The Environment

So you’ve just logged into a brand new environment. Before we do anything, it is worthwhile to get a high-level understanding of this environment. The environment is roughly defined by four layers, which we briefly discuss from the top down.

Command-line tools

First and foremost, there are the command-line tools that you work with. We use them by typing their corresponding commands. There are different types of command-line tools, which we will discuss in the next section. Examples of tools are: ls (Stallman and MacKenzie 2012a), cat (Granlund and Stallman 2012a), and jq (Dolan 2014).

Terminal

The terminal, which is the second layer, is the application where we type our commands. If you see the following text:

then you would type seq 3 into your terminal and press <Enter>. (The command-line tool seq (Drepper 2012) generates a sequence of numbers.) You do not type the dollar sign. It is just there to tell you that this is a command you can type in the terminal. This dollar sign is known as the prompt. The text below seq 3 is the output of the command. In Chapter 1, we showed you two screenshots of what the default terminal looks like in macOS and Ubuntu, with various commands and their output.
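As a small sketch of that convention, here is the command without the prompt, with the output we would expect shown as comments:

```shell
# Generate the numbers 1 through 3, one per line.
seq 3
# 1
# 2
# 3
```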

Shell

The third layer is the shell. Once we have typed in our command and pressed <Enter>, the terminal sends that command to the shell. The shell is a program that interprets the command. The Docker image uses Bash as the shell, but there are many others available. Once you have become a bit more proficient at the command line, you may want to look into a shell called the Z shell. It offers many additional features that may increase your productivity at the command line.

Operating system

The fourth layer is the operating system, which is GNU/Linux in our case. Linux is the name of the kernel, which is the heart of the operating system. The kernel is in direct contact with the CPU, disks, and other hardware. The kernel also executes our command-line tools. GNU, which stands for GNU’s not UNIX, refers to the set of basic tools. The Docker image is based on a particular Linux distribution called Alpine Linux.
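If you are curious which layers you are actually running, the following sketch asks the environment to identify itself. The exact output depends on your system, so treat the comments as examples rather than guarantees:

```shell
uname -s                                   # name of the kernel, for example Linux
echo "${BASH_VERSION:-not running Bash}"   # version of Bash, if Bash is your shell
echo "$TERM"                               # terminal type reported to the shell (may be empty)
```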

2.3.2 Executing a Command-line Tool

Now that you have a basic understanding of the environment, it is high time that you try out some commands. Type the following in your terminal (without the dollar sign) and press <Enter>:

Sometimes we are using commands and pipelines that are too long to fit on the page. In that case you’ll see something like the following:

The greater-than sign is the continuation prompt, which indicates that this line is a continuation of the previous one. A long command can be broken up with either a backslash or a pipe symbol. Be sure to first match any quotation marks. The following command is exactly the same:
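To make this concrete, here is a hypothetical pipeline (chosen by us for illustration, using tools that are explained later in this chapter) written three ways. All three forms run the same command; when typed interactively, the second and third would show the continuation prompt at the start of each continued line:

```shell
# On a single line:
seq 100 | grep 3 | wc -l

# Broken up after each pipe symbol:
seq 100 |
  grep 3 |
  wc -l

# Broken up with a backslash as the very last character of the line:
seq 100 | grep 3 \
  | wc -l
```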

This is as simple as it gets. You just executed a command that contained a single command-line tool. The command-line tool pwd (Meyering 2012b) prints the name of the directory where you currently are. By default, when you login, this is your home directory. You can view the contents of this directory with ls (Stallman and MacKenzie 2012a):

The command-line tool cd, which is a Bash builtin, allows you to navigate to a different directory:
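A minimal sketch of navigating with cd, pwd, and ls follows. It uses a throwaway directory created with mktemp (our assumption, so that nothing in your home directory is touched) instead of the book's own directories:

```shell
start=$(mktemp -d)     # throwaway directory standing in for your home directory
mkdir "$start/data"

cd "$start/data"       # navigate into the subdirectory
pwd                    # prints a path that ends in /data
cd ..                  # the two dots refer to the parent directory
pwd                    # prints the path of the throwaway directory
ls                     # prints: data
```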

The part after cd specifies the directory you want to navigate to. Values that come after the command are called command-line arguments or options. The two dots refer to the parent directory. Let’s try a different command:

Here we pass three command-line arguments to head (MacKenzie and Meyering 2012b). The first one is an option. The second one is a value that belongs to the option. The third one is a filename. This particular command outputs the first three lines of the file book/ch02/data/movies.txt.
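The same pattern of option, value, and filename can be tried on any file. In this sketch we assume a temporary file filled with the numbers 1 to 10 rather than the movies file from the book:

```shell
file=$(mktemp)         # temporary file, used here instead of movies.txt
seq 10 > "$file"       # fill it with ten lines

head -n 3 "$file"      # option -n, its value 3, and the filename
# 1
# 2
# 3
```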

2.3.3 Five Types of Command-line Tools

We employ the term command-line tool a lot, but so far, we have not yet explained what we actually mean by it. We use it as an umbrella term for anything that can be executed from the command line. Under the hood, each command-line tool is one of the following five types:

  • A binary executable.
  • A shell builtin.
  • An interpreted script.
  • A shell function.
  • An alias.

It’s good to know the difference between the types. The command-line tools that come pre-installed with the Docker image mostly consist of the first two types (binary executable and shell builtin). The other three types (interpreted script, shell function, and alias) allow us to further build up our data science toolbox and become more efficient and more productive data scientists.

Binary Executable

Binary executables are programs in the classical sense. A binary executable is created by compiling source code to machine code. This means that when you open the file in a text editor you cannot read it.

Shell Builtin

Shell builtins are command-line tools provided by the shell, which is Bash in our case. Examples include cd and help. Shell builtins may differ between shells. Like binary executables, they cannot be easily inspected or changed.

Interpreted Script

An interpreted script is a text file that is executed by a binary executable. Examples include: Python, R, and Bash scripts. One great advantage of an interpreted script is that you can read and change it. Example 2.1 shows a script named ~/book/ch02/fac.py. This script is interpreted by Python not because of the file extension .py, but because the first line of the script defines the binary that should execute it.

Example 2.1 (Python script that computes the factorial of an integer)

This script computes the factorial of the integer that we pass as a parameter. It can be invoked from the command line as follows:

In Chapter 4, we’ll discuss in great detail how to create reusable command-line tools using interpreted scripts.
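To see that it really is the first line, and not the file extension, that selects the interpreter, consider the following sketch. It is not the Python script from Example 2.1: we assume a Bash analogue, stored in a file that deliberately has no extension, inside a throwaway directory:

```shell
dir=$(mktemp -d)

# Write a small script; its first line names the program that should run it.
cat > "$dir/fac" << 'EOF'
#!/usr/bin/env bash
# Multiply 1 * 2 * ... * n, where n is the first argument.
result=1
for ((i = 2; i <= $1; i++)); do
  result=$((result * i))
done
echo "$result"
EOF

chmod +x "$dir/fac"    # mark the file as executable
"$dir/fac" 5           # prints 120
```

Because the file is plain text, you can open it in any editor, read it, and change it, which is exactly what sets an interpreted script apart from a binary executable.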

Shell Function

A shell function is a function that is, in our case, executed by Bash. Shell functions provide similar functionality to a Bash script, but they are usually (though not necessarily) smaller than scripts. They also tend to be more personal. The following command defines a function called fac, which, just like the interpreted Python script above, computes the factorial of the integer we pass as a parameter. It does so by generating a list of numbers using seq, putting those numbers on one line with * as the delimiter using paste (Ihnat and MacKenzie 2012), and passing the resulting expression into bc (Nelson 2006), which evaluates it and outputs the result.

The file .bashrc, which is a configuration file for Bash, is a good place to define your shell functions, so that they are always available.

Alias

Aliases are like macros. If you often find yourself executing a certain command with the same parameters (or a part of it), you can define an alias for this. Aliases are also very useful when you continue to misspell a certain command (see https://github.com/chrishwiggins/mise/blob/master/sh/aliases-public.sh for a long list of useful aliases). The following command defines such an alias:

Now, if you type the following on the command line, the shell will replace each alias it finds with its value:

Aliases are simpler than shell functions as they don’t allow parameters. The function fac could not have been defined using an alias because of the parameter. Still, aliases allow you to save lots of keystrokes. Like shell functions, aliases are often defined in the .bashrc or .bash_aliases configuration files, which are located in your home directory. To see all aliases currently defined, you simply run alias without arguments. Try it. What do you see?
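The following sketch defines two hypothetical aliases, one as a shorthand and one to catch a common typo, and then uses the first. The names l and les and their values are our own choices for illustration:

```shell
# Scripts and other non-interactive Bash sessions ignore aliases unless this
# option is set; at an interactive prompt it is already on.
shopt -s expand_aliases

alias l='ls -1'        # shorthand: list one entry per line
alias les=less         # guard against misspelling less

dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/b.txt"
l "$dir"               # the shell replaces l with: ls -1
alias                  # lists all aliases currently defined
```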

In this book we will focus mostly on the last three types of command-line tools: interpreted scripts, shell functions, and aliases. This is because these can easily be changed. The purpose of a command-line tool is to make your life on the command line easier, and to make you a more productive and more efficient data scientist. You can find out the type of a command-line tool with type (which is itself a shell builtin):

type returns two command-line tools for pwd. In that case, the first reported command-line tool is used when you type pwd. In the next section we will look at how to combine command-line tools.
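Here is a brief sketch of how type can be used in Bash. The options -a (report all matches) and -t (report only the kind) are standard in Bash; the exact paths printed will depend on your system:

```shell
type -a pwd    # lists every match, in the order in which the shell would use them
type -t cd     # prints only the kind of tool: builtin
type -t ls     # prints file, or alias if your shell defines one for ls
```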

2.3.4 Combining Command-line Tools

Because most command-line tools adhere to the UNIX philosophy, they are designed to do only one thing, and do it really well. For example, the command-line tool grep (Meyering 2012a) can filter lines, wc (Rubin and MacKenzie 2012) can count lines, and sort (Haertel and Eggert 2012) can sort lines. The power of the command line comes from its ability to combine these small, yet powerful command-line tools. The most important way of combining command-line tools is through a so-called pipe. The output from the first tool is passed to the second tool. There are virtually no limits to this.

Consider, for example, the command-line tool seq, which generates a sequence of numbers. Let us generate a sequence of five numbers.

The output of a command-line tool is by default passed on to the terminal, which displays it on our screen. We can pipe the output of seq to a second tool, called grep, which can be used to filter lines. Imagine that we only want to see numbers that contain a “3”. We can combine seq and grep as follows:

And if we wanted to know how many numbers between 1 and 100 contain a three, we can use wc, which is very good at counting things:

The option -l specifies that wc should only output the number of lines that are passed into it. By default, it also reports the number of words and bytes.
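The three steps described above can be sketched as follows. The range passed to seq in the second command is our assumption; the counts in the comments follow from the commands themselves:

```shell
seq 5                      # prints 1 to 5, one number per line

seq 30 | grep 3            # keeps only lines that contain a 3
# 3
# 13
# 23
# 30

seq 100 | grep 3 | wc -l   # counts the lines that remain: 19
```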

You may start to see that combining command-line tools is a very powerful concept. In the rest of the book you will be introduced to many more tools and the functionality they offer when combining them.

2.3.5 Redirecting Input and Output

We mentioned that, by default, the output of the last command-line tool in the pipeline is printed to the terminal. You can also save this output to a file. This is called output redirection, and works as follows:

Here, we save the output of the seq tool to a file named ten-numbers in the directory ~/book/ch02/data. If this file does not exist yet, it is created. If it already exists, its contents are overwritten. You can also append the output to a file with >>, meaning the output is put after the original contents:

The tool echo just outputs the value you specify. The -n option specifies that echo should not output a trailing newline.
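A minimal sketch of both forms of output redirection, assuming Bash and a throwaway directory in place of ~/book/ch02/data:

```shell
dir=$(mktemp -d)

seq 10 > "$dir/ten-numbers"           # > creates the file, or overwrites it

echo -n "Hello" > "$dir/hello-world"  # -n: no trailing newline
echo " World" >> "$dir/hello-world"   # >> appends after the existing contents

cat "$dir/hello-world"                # prints: Hello World
```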

Saving the output to a file is useful if you need to store intermediate results, for example for continuing with your analysis at a later stage. To use the contents of the file hello-world again, we can use cat (Granlund and Stallman 2012a), which reads a file and prints it.

(Note that the -w option tells wc to only count words.) The same result can be achieved with the following notation:

This way, you are directly passing the file to the standard input of wc without running an additional process. If the command-line tool also allows files to be specified as command-line arguments, which many do, you can also do the following for wc:
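The three ways of feeding a file to wc can be compared side by side. This sketch assumes a temporary file containing the two words from the earlier example:

```shell
file=$(mktemp)
echo "Hello World" > "$file"

cat "$file" | wc -w    # cat reads the file and pipes it to wc: 2
wc -w < "$file"        # the shell connects the file to standard input: 2
wc -w "$file"          # wc opens the file itself: 2, followed by the filename
```

Only the last form prints the filename, because only there does wc know which file it is reading.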

2.3.6 Working With Files

As data scientists, we work with a lot of data. This data is often stored in files. It is important to know how to work with files (and the directories they live in) on the command line. Every action that you can do using a graphical user interface, you can do with command-line tools (and much more). In this section we introduce the most important ones to create, move, copy, rename, and delete files and directories.

You have already seen how we can create new files by redirecting the output with either > or >>. In case you need to move a file to a different directory you can use mv (Parker, MacKenzie, and Meyering 2012):

You can also rename files with mv:

You can also rename or move entire directories. In case you no longer need a file, you delete (or remove) it with rm (Rubin, MacKenzie, Stallman, et al. 2012):

In case you want to remove an entire directory with all its contents, specify the -r option, which stands for recursive:

In case you want to copy a file, use cp (Granlund, MacKenzie, and Meyering 2012). This is useful for creating backups:

You can create directories using mkdir (MacKenzie 2012):

Using the command-line tools to manage your files can be scary at first, because you have no graphical overview of the filesystem to provide immediate feedback.

All of the above command-line tools accept the -v option, which stands for verbose, so that they output what’s going on. All but mkdir accept the -i option, which stands for interactive, and causes the tools to ask you for confirmation.
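To practice safely, you can try all of these tools inside a throwaway directory. The file and directory names in this sketch are hypothetical:

```shell
dir=$(mktemp -d)       # sandbox, so that no real files are at risk
cd "$dir"
seq 5 > numbers.txt

mkdir archive scratch                        # create two directories
cp numbers.txt numbers.bak                   # copy: keep a backup
mv numbers.txt archive/                      # move the file into a directory
mv numbers.bak archive/numbers-backup.txt    # move and rename in one step
rm archive/numbers-backup.txt                # delete a single file
rm -r scratch                                # delete a directory and its contents

ls archive                                   # prints: numbers.txt
```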

2.3.7 Help!

As you are finding your way around the command line, it may happen that you need help. Even the most seasoned Linux users need help at some point. It is impossible to remember all the different command-line tools and their possible arguments. Fortunately, the command line offers several ways to get help.

The most important command to get help is perhaps man (Eaton and Watson 2014), which is short for manual. It contains information for most command-line tools. Imagine that we forgot the different options to the tool cat. You can access its man page using:

Sometimes you’ll see us use head, fold, or cut at the end of a command. This is only to ensure that the output of the command fits on the page; you don’t have to type these. For example, head -n 5 only prints the first five lines, fold wraps long lines to 80 characters, and cut -c1-80 trims lines that are longer than 80 characters.
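The effect of these three tools is easy to check on an artificially long line. In this sketch, the 200-character line is our own test input:

```shell
# Build a single line of 200 x characters.
long=$(printf 'x%.0s' $(seq 200))

echo "$long" | fold | wc -l         # wrapped into 3 lines (80 + 80 + 40)
echo "$long" | cut -c1-80 | wc -c   # trimmed to 80 characters plus a newline: 81
seq 100 | head -n 5                 # only the first five lines: 1 to 5
```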

Not every command-line tool has a man page. For shell builtins, such as cd, you need to use the help command-line tool:

This help also covers other topics of Bash, in case you are interested (try help without command-line arguments for a list of topics). Remember that you can use type to figure out the kind of a specific command-line tool.

Newer tools that can be used from the command line often lack a man page as well. In that case, your best bet is to invoke the tool with the --help option (and sometimes the -h option). For example:

Specifying the --help option also works for the GNU command-line tools such as cat. However, the corresponding man page often provides more information. If, after trying these three approaches, you are still stuck, then it is perfectly acceptable to consult the Internet. In the appendix, there’s a list of all command-line tools used in this book. Besides how each command-line tool can be installed, it also shows how you can get help.
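The three approaches can be combined into one small shell function. The name gethelp is hypothetical and the function is only a sketch, assuming Bash: it uses help for builtins and keywords, and otherwise tries man before falling back to the --help option:

```shell
gethelp() {
  case "$(type -t "$1")" in
    builtin|keyword)
      help "$1" ;;                           # Bash's own documentation
    *)
      man "$1" 2>/dev/null || "$1" --help ;; # man page, else the tool's --help
  esac
}

gethelp cd | head -n 1    # first line of the builtin help for cd
```

A function like this is a good candidate for your .bashrc file, although typing the three commands by hand works just as well.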

2.4 Further Reading