Chapter 2 Getting Started
In this chapter we are going to make sure that you have all the prerequisites for doing data science at the command line. The prerequisites fall into two parts: (1) having a proper environment with all the command-line tools that we employ in this book installed, and (2) understanding the essential concepts that come into play when using the command line.
First, we describe how to install the Docker image, which is a virtual environment based on Linux that contains all the necessary command-line tools. Subsequently, we explain the essential command-line concepts through examples.
By the end of this chapter, you’ll have everything you need in order to continue with the first step of doing data science, namely obtaining data.
In this chapter, you’ll learn:
- How to install the Docker image.
- Essential concepts and tools necessary to perform data science at the command line.
2.2 Installing the Docker Image
In this book we use many different command-line tools. Linux often comes with a whole bunch of command-line tools pre-installed. Moreover, Linux offers many packages that contain other, relevant tools. Installing these packages yourself is not too difficult. However, we also use tools that are not available as packages and require a more manual, and more involved, installation. In order to acquire the necessary command-line tools without having to go through the involved installation process of each, we encourage you to install a Docker image that was created specifically for this book.
To install the Docker image, you first need to download and install Docker itself from the Docker website. Once Docker is installed, you download the Docker image by invoking docker pull, followed by the name of the image, on your terminal or command prompt.
You can then run the Docker image with docker run.
You’re now inside an isolated Linux environment—known as a Docker container—with all the necessary command-line tools installed. If the following command produces an enthusiastic cow, then you know everything is working correctly:
Type exit to exit the container. If you want to get data in and out of the container, you can add a volume, which means that a local directory gets mapped to a directory inside the container. We recommend that you create a new directory, navigate to this new directory, and then run the following when you’re on macOS or Linux:
Or the following when you’re on Windows and using the command line:
Or the following when you’re using Windows PowerShell:
In the above commands, the option -v instructs docker to map the current directory to the /data directory inside the container, so this is the place to get data in and out of the Docker container.
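As a rough sketch of the volume mapping on macOS or Linux, the snippet below only prints the command instead of running it. The image name is a hypothetical placeholder, and the --rm and -it options are assumptions on our part (remove the container on exit, and attach an interactive terminal). Only the -v option and the /data directory come from the text above.

```shell
# Hypothetical placeholder; substitute the name of the image for this book.
image="your/image-name"

# -v maps a host directory to a directory inside the container (host:container).
# --rm and -it are assumptions: clean up on exit and attach an interactive terminal.
echo docker run --rm -it -v "$(pwd)":/data "$image"
```

Anything you save in /data inside the container then appears in the directory from which you started the container.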
2.3 Essential GNU/Linux Concepts
In Chapter 1, we briefly showed you what the command line is. Now that you are running the Docker image, we can really get started. In this section, we discuss several concepts and tools that you will need to know in order to feel comfortable doing data science at the command line. If, up to now, you have been mainly working with graphical user interfaces, then this might be quite a change. But don’t worry, we’ll start at the beginning, and very gradually go to more advanced topics.
2.3.1 The Environment
So you’ve just logged into a brand new environment. Before we do anything, it is worthwhile to get a high-level understanding of this environment. The environment is roughly defined by four layers, which we briefly discuss from the top down.
- Command-line tools
First and foremost, there are the command-line tools that you work with. We use them by typing their corresponding commands. There are different types of command-line tools, which we will discuss in the next section. Examples of tools include ls (Stallman and MacKenzie 2012a) and cat (Granlund and Stallman 2012a).
- Terminal
The terminal, which is the second layer, is the application where we type our commands. If you see a line such as $ seq 3 followed by some output, then you would type seq 3 into your terminal and press <Enter>. (The command-line tool seq (Drepper 2012) generates a sequence of numbers.) You do not type the dollar sign. It is just there to tell you that this is a command you can type in the terminal. This dollar sign is known as the prompt. The text below seq 3 is the output of the command. In Chapter 1, we showed you two screenshots of what the default terminal looks like in macOS and Ubuntu, with various commands and their output.
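To make the prompt convention concrete, here is the same command written without the prompt, as you would type it. The comments show the output that seq prints for this argument.

```shell
# Generate the numbers 1 through 3, one per line.
seq 3
# 1
# 2
# 3
```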
- Shell
The third layer is the shell. Once we have typed in our command and pressed <Enter>, the terminal sends that command to the shell. The shell is a program that interprets the command. The Docker image uses Bash as the shell, but there are many others available. Once you have become a bit more proficient at the command line, you may want to look into a shell called the Z shell. It offers many additional features that may increase your productivity at the command line.
- Operating system
The fourth layer is the operating system, which is GNU/Linux in our case. Linux is the name of the kernel, which is the heart of the operating system. The kernel is in direct contact with the CPU, disks, and other hardware. The kernel also executes our command-line tools. GNU, which stands for GNU’s not UNIX, refers to the set of basic tools. The Docker image is based on a particular Linux distribution called Alpine Linux.
2.3.2 Executing a Command-line Tool
Now that you have a basic understanding of the environment, it is high time that you try out some commands. Type pwd in your terminal (without a dollar sign) and press <Enter>.
Sometimes we use commands and pipelines that are too long to fit on the page. In that case, the command is split over several lines, and each continuation line starts with a greater-than sign. This is the continuation prompt, which indicates that the line is a continuation of the previous one. A long command can be broken up after either a backslash or a pipe symbol. Be sure to first match any quotation marks. A command that is split this way is exactly the same as the command typed on a single line.
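As an illustration of breaking up a long command, the following sketch writes one small pipeline in both ways. The pipeline itself is our own example, not one from the book. In an interactive terminal, the shell would show the continuation prompt at the start of the second and third lines.

```shell
# Broken after each pipe symbol: the shell keeps reading until the
# pipeline is complete.
seq 10 |
sort -n -r |
head -n 3

# The same pipeline, broken with a backslash at the end of each line.
seq 10 \
  | sort -n -r \
  | head -n 3
```

Both forms print the three largest numbers, 10, 9, and 8, each on its own line.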
Running pwd is as simple as it gets. You just executed a command that contained a single command-line tool. The command-line tool pwd (Meyering 2012b) prints the name of the directory where you currently are. By default, when you log in, this is your home directory. You can view the contents of this directory with ls (Stallman and MacKenzie 2012a).
The command-line tool cd, which is a Bash builtin, allows you to navigate to a different directory. The part after cd specifies the directory you want to navigate to. Values that come after the command are called command-line arguments or options. Two dots (..) refer to the parent directory, so cd .. moves you one level up.
Let’s try a different command: head -n 3 followed by a filename. Here we pass three command-line arguments to head (MacKenzie and Meyering 2012b). The first one is an option. The second one is a value that belongs to the option. The third one is a filename. With book/ch02/data/movies.txt as the filename, this command outputs the first three lines of that file.
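The commands discussed in this section can be tried together in a small, self-contained session. This is only a sketch: the scratch directory and the file sample.txt are made up for illustration and stand in for the book’s own data files.

```shell
# Create a scratch directory with a small file so the example is self-contained.
demo_dir=$(mktemp -d)
printf 'line 1\nline 2\nline 3\nline 4\nline 5\n' > "$demo_dir/sample.txt"

cd "$demo_dir"         # navigate into the scratch directory
pwd                    # print the name of the current directory
ls                     # list its contents: sample.txt
head -n 3 sample.txt   # an option (-n), its value (3), and a filename
cd ..                  # two dots: move up to the parent directory
```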
2.3.3 Five Types of Command-line Tools
We employ the term command-line tool a lot, but so far, we have not yet explained what we actually mean by it. We use it as an umbrella term for anything that can be executed from the command line. Under the hood, each command-line tool is one of the following five types:
- A binary executable.
- A shell builtin.
- An interpreted script.
- A shell function.
- An alias.
It’s good to know the difference between the types. The command-line tools that come pre-installed with the Docker image mostly consist of the first two types (binary executable and shell builtin). The other three types (interpreted script, shell function, and alias) allow us to further build up our data science toolbox and become more efficient and more productive data scientists.
- Binary Executable
Binary executables are programs in the classical sense. A binary executable is created by compiling source code to machine code. This means that when you open the file in a text editor you cannot read it.
- Shell Builtin
Shell builtins are command-line tools provided by the shell, which is Bash in our case. Examples include cd and help. Shell builtins may differ between shells. Like binary executables, they cannot be easily inspected or changed.
- Interpreted Script
An interpreted script is a text file that is executed by a binary executable. Examples include Python, R, and Bash scripts. One great advantage of an interpreted script is that you can read and change it. Example 2.1 shows a script named ~/book/ch02/fac.py. This script is interpreted by Python not because of the file extension .py, but because the first line of the script defines the binary that should execute it.
Example 2.1 (Python script that computes the factorial of an integer)
This script computes the factorial of the integer that we pass as a parameter. It can be invoked from the command line by typing the path to the script followed by an integer.
In Chapter 4, we’ll discuss in great detail how to create reusable command-line tools using interpreted scripts.
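To illustrate how the first line of a script selects its interpreter, here is a minimal sketch that uses a Bash script rather than the Python script of Example 2.1. The file name fac.sh and the scratch directory are hypothetical. The script is written with a here-document so that the example is self-contained.

```shell
demo_dir=$(mktemp -d)

# Write a small script whose first line names the interpreter.
cat > "$demo_dir/fac.sh" <<'EOF'
#!/usr/bin/env bash
# Compute the factorial of the integer given as the first argument.
result=1
for ((i = 2; i <= $1; i++)); do
  result=$((result * i))
done
echo "$result"
EOF

chmod +x "$demo_dir/fac.sh"   # mark the file as executable
"$demo_dir/fac.sh" 5          # run it like any other command-line tool
```

Because the file is plain text, you can open it in any editor to read or change it, which is not possible with a binary executable.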
- Shell Function
A shell function is a function that is, in our case, executed by Bash. Shell functions provide similar functionality to a Bash script, but they are usually (though not necessarily) smaller than scripts. They also tend to be more personal. Consider a function called fac, which, just like the interpreted Python script above, computes the factorial of the integer we pass as a parameter. It does so by generating a list of numbers using seq, putting those numbers on one line with * as the delimiter using paste (Ihnat and MacKenzie 2012), and passing this equation into bc (Nelson 2006), which evaluates it and outputs the result.
The file .bashrc, which is a configuration file for Bash, is a good place to define your shell functions, so that they are always available.
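The function described above could be written along the following lines. This is a sketch reconstructed from the description, so the exact options may differ from the definition used in the book.

```shell
# seq generates 1..n, paste joins the numbers with '*' on one line,
# and bc evaluates the resulting expression.
fac() {
  seq "$1" | paste -s -d '*' - | bc
}

seq 5 | paste -s -d '*' -   # the intermediate expression: 1*2*3*4*5
fac 5                       # the result: 120 (requires bc to be installed)
```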
- Alias
Aliases are like macros. If you often find yourself executing a certain command with the same parameters (or a part of it), you can define an alias for this. Aliases are also very useful when you continue to misspell a certain command (see https://github.com/chrishwiggins/mise/blob/master/sh/aliases-public.sh for a long list of useful aliases). You define an alias with the alias builtin by giving it a name and the command it stands for. If you then type that name on the command line, the shell will replace each alias it finds with its value.
Aliases are simpler than shell functions as they don’t allow parameters. The function fac could not have been defined using an alias because of the parameter. Still, aliases allow you to save lots of keystrokes. Like shell functions, aliases are often defined in .bashrc or .bash_aliases configuration files, which are located in your home directory. To see all aliases currently defined, you simply run alias without arguments. Try it. What do you see?
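A minimal sketch of defining and using an alias follows. The alias name l and the scratch directory are hypothetical. Note that Bash expands aliases only in interactive shells by default, so the first line is needed only when you run this as a script.

```shell
# Scripts do not expand aliases by default in Bash; interactive shells do.
shopt -s expand_aliases 2>/dev/null || true

# Define an alias named l for a one-column listing.
alias l='ls -1'

demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt"

l "$demo_dir"   # the shell replaces l with: ls -1
alias           # list all aliases currently defined
```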
In this book we will focus mostly on the last three types of command-line tools: interpreted scripts, shell functions, and aliases. This is because these can easily be changed. The purpose of a command-line tool is to make your life on the command line easier, and to make you a more productive and more efficient data scientist. You can find out the type of a command-line tool with type (which is itself a shell builtin). When asked for all matches (with the -a option), type can return more than one command-line tool for the same name; for pwd, for example, it returns two. In that case, the first reported command-line tool is used when you type pwd. In the next section we will look at how to combine command-line tools.
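To make the use of type concrete, the following sketch inspects a few tools. What type reports for ls and pwd depends on your system and on the aliases you have defined, so no output is shown for those.

```shell
type cd        # cd is a shell builtin
type -a pwd    # every match for pwd, in order of precedence
type -a ls     # an alias (if one is defined) and/or a binary executable
```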
2.3.4 Combining Command-line Tools
Because most command-line tools adhere to the UNIX philosophy, they are designed to do only one thing, and do it really well. For example, the command-line tool grep (Meyering 2012a) can filter lines, wc (Rubin and MacKenzie 2012) can count lines, and sort (Haertel and Eggert 2012) can sort lines. The power of the command line comes from its ability to combine these small, yet powerful command-line tools. The most important way of combining command-line tools is through a so-called pipe, written as a vertical bar (|). The output from the first tool is passed to the second tool. There are virtually no limits to this.
Consider, for example, the command-line tool seq, which generates a sequence of numbers. Let us generate a sequence of five numbers. The output of a command-line tool is by default passed on to the terminal, which displays it on our screen. We can pipe the output of seq to a second tool, called grep, which can be used to filter lines. Imagine that we only want to see numbers that contain a “3”. We can combine seq with grep to do this.
And if we wanted to know how many numbers between 1 and 100 contain a three, we can use wc, which is very good at counting things. The option -l specifies that wc should only output the number of lines that are passed into it. By default it also returns the number of characters and words.
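The three steps just described can be written as follows. The options are the ones named in the text, and the comments describe what each command prints.

```shell
seq 5                      # the numbers 1 through 5, one per line
seq 5 | grep 3             # only the line that contains a 3
seq 100 | grep 3 | wc -l   # 19 numbers between 1 and 100 contain a 3
```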
You may start to see that combining command-line tools is a very powerful concept. In the rest of the book you will be introduced to many more tools and the functionality they offer when combining them.
2.3.5 Redirecting Input and Output
We mentioned that, by default, the output of the last command-line tool in the pipeline is printed to the terminal. You can also save this output to a file. This is called output redirection, and it uses the > operator.
For example, we can save the output of the seq tool to a file named ten-numbers in the directory ~/book/ch02/data. If this file does not exist yet, it is created. If this file already exists, its contents are overwritten. You can also append the output to a file with >>, meaning the output is put after the original contents.
The tool echo just outputs the value you specify. The -n option specifies that echo should not output a trailing newline.
Saving the output to a file is useful if you need to store intermediate results, for example for continuing with your analysis at a later stage. To use the contents of the file hello-world again, we can use cat (Granlund and Stallman 2012a), which reads a file and prints it, and pipe its output into another tool such as wc. (Note that the -w option tells wc to count only words.) The same result can be achieved with input redirection, which uses the < operator. This way, you pass the file directly to the standard input of wc without running an additional process. If the command-line tool also allows files to be specified as command-line arguments, which many do, you can also pass the filename to wc as an argument.
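The redirection operators from this section can be tried out as follows. The sketch works in a scratch directory rather than in ~/book/ch02/data, and the strings written to hello-world are our own example.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

seq 10 > ten-numbers            # create (or overwrite) the file
echo -n "Hello" > hello-world   # -n: no trailing newline
echo " World" >> hello-world    # append after the existing contents

cat hello-world | wc -w         # pipe the file contents into wc
< hello-world wc -w             # redirect the file to standard input
wc -w hello-world               # pass the filename as an argument
```

All three wc invocations count two words. The last one also prints the filename, because wc knows which file it was given.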
2.3.6 Working With Files
As data scientists, we work with a lot of data. This data is often stored in files. It is important to know how to work with files (and the directories they live in) on the command line. Every action that you can do using a graphical user interface, you can do with command-line tools (and much more). In this section we introduce the most important ones to create, move, copy, rename, and delete files and directories.
You have already seen how we can create new files by redirecting the output with either > or >>. In case you need to move a file to a different directory, you can use mv (Parker, MacKenzie, and Meyering 2012). You can also rename files with mv, and you can rename or move entire directories.
In case you no longer need a file, you delete (or remove) it with rm (Rubin, MacKenzie, Stallman, et al. 2012). In case you want to remove an entire directory with all its contents, specify the -r option, which stands for recursive.
In case you want to copy a file, use cp (Granlund, MacKenzie, and Meyering 2012). This is useful for creating backups. You can create directories using mkdir (MacKenzie 2012).
Using the command-line tools to manage your files can be scary at first, because you have no graphical overview of the filesystem to provide immediate feedback.
All of the above command-line tools accept the -v option, which stands for verbose, so that they output what’s going on. All but mkdir accept the -i option, which stands for interactive, and causes the tools to ask you for confirmation.
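The file-management tools above can be combined into a short session like the one below. The file and directory names are made up for illustration, everything happens inside a scratch directory, and the -v option assumes the GNU versions of these tools.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

seq 10 > numbers.txt                   # create a file by redirecting output
mkdir -v backup archive                # create two directories
cp -v numbers.txt backup/numbers.bak   # copy the file as a backup
mv -v numbers.txt ten-numbers          # rename the file
mv -v ten-numbers archive/             # move it into a directory
rm -v backup/numbers.bak               # delete a single file
rm -r -v backup                        # delete a directory and its contents
```

When working on real data, adding -i to mv, cp, or rm makes the tool ask for confirmation before it overwrites or deletes anything.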
As you are finding your way around the command line, it may happen that you need help. Even the most seasoned Linux users need help at some point. It is impossible to remember all the different command-line tools and their possible arguments. Fortunately, the command line offers several ways to get help.
The most important command to get help is perhaps man (Eaton and Watson 2014), which is short for manual. It contains information for most command-line tools. Imagine that we forgot the different options to the tool cat. You can access its man page using:
$ man cat | head -n 20
CAT(1)                        User Commands                        CAT(1)

NAME
       cat - concatenate files and print on the standard output

SYNOPSIS
       cat [OPTION]... [FILE]...

DESCRIPTION
       Concatenate FILE(s), or standard input, to standard output.

       -A, --show-all
              equivalent to -vET

       -b, --number-nonblank
              number nonempty output lines, overrides -n

       -e     equivalent to -vE
Sometimes you’ll see head, fold, or cut at the end of a command. This is only to ensure that the output of the command fits on the page; you don’t have to type these. For example, head -n 5 only prints the first five lines, fold wraps long lines to 80 characters, and cut -c1-80 trims lines that are longer than 80 characters.
Not every command-line tool has a man page. For shell builtins, such as cd, you need to use the help command-line tool:
$ help cd | head -n 20
cd: cd [-L|[-P [-e]] [-@]] [dir]
    Change the shell working directory.

    Change the current directory to DIR.  The default DIR is the value of the
    HOME shell variable.

    The variable CDPATH defines the search path for the directory containing
    DIR.  Alternative directory names in CDPATH are separated by a colon (:).
    A null directory name is the same as the current directory.  If DIR begins
    with a slash (/), then CDPATH is not used.

    If the directory is not found, and the shell option `cdable_vars' is set,
    the word is assumed to be a variable name.  If that variable has a value,
    its value is used for DIR.

    Options:
      -L        force symbolic links to be followed: resolve symbolic links in
                DIR after processing instances of `..'
      -P        use the physical directory structure without following symbolic
                links: resolve symbolic links in DIR before processing instances
This help also covers other topics of Bash, in case you are interested (try help without command-line arguments for a list of topics). Remember that you can use type to figure out the kind of a specific command-line tool.
Newer tools that can be used from the command line often lack a man page as well. In that case, your best bet is to invoke the tool with the --help option (and sometimes the -h option). The --help option also works for GNU command-line tools such as cat. However, the corresponding man page often provides more information. If, after trying these three approaches, you are still stuck, then it is perfectly acceptable to consult the Internet. In the appendix, there’s a list of all command-line tools used in this book. Besides how each command-line tool can be installed, it also shows how you can get help.
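The three ways of getting help can be compared side by side. This sketch assumes Bash and the GNU version of cat. Because man is not installed on every system, that call is guarded.

```shell
help cd | head -n 5        # shell builtins: use help
cat --help | head -n 5     # many tools, including the GNU ones: --help

# man pages, when man is available on the system
if command -v man >/dev/null; then
  man cat | head -n 20
fi
```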
2.4 Further Reading
- Heddings, Lowell. 2006. “Keyboard Shortcuts for Bash.” http://www.howtogeek.com/howto/ubuntu/keyboard-shortcuts-for-bash-command-shell-for-ubuntu-debian-suse-redhat-linux-etc.
- Peek, Jerry, Shelley Powers, Tim O’Reilly, and Mike Loukides. 2002. Unix Power Tools. 3rd Ed. O’Reilly Media.