Chapter 3 Obtaining Data

This chapter deals with the first step of the OSEMN model: obtaining data. After all, without any data, there is not much data science that we can do. We assume that the data that is needed to solve the data science problem at hand already exists at some location in some form. Our goal is to get this data onto your computer (or into your Data Science Toolbox) in a form that we can work with.

According to the Unix philosophy, text is a universal interface. Almost every command-line tool takes text as input, produces text as output, or both. This is the main reason why command-line tools can work so well together. However, as we’ll see, even just text can come in multiple forms.

Data can be obtained in several ways—for example by downloading it from a server, by querying a database, or by connecting to a Web API. Sometimes, the data comes in a compressed form or in a binary format such as Microsoft Excel. In this chapter, we discuss several tools that help tackle this from the command line, including: curl (Stenberg 2012), in2csv (Groskopf 2014b), sql2csv (Groskopf 2014c), and tar (Bailey, Eggert, and Poznyakoff 2014).

3.1 Overview

In this chapter, you’ll learn how to:

  • Obtain data from the Internet
  • Query databases
  • Connect to Web APIs
  • Decompress files
  • Convert Microsoft Excel spreadsheets into usable data

3.2 Copying Local Files to the Data Science Toolbox

A common situation is that you already have the necessary files on your own computer. This section explains how you can get those files onto the local or remote version of the Data Science Toolbox.

3.2.1 Local Version of Data Science Toolbox

We mentioned in Chapter 2 that the Vagrant version of the Data Science Toolbox is an isolated virtual environment. Luckily, there is one exception to that: files can be transferred in and out of the Data Science Toolbox. The local directory from which you ran vagrant up (the one that contains the file Vagrantfile) is mapped to a directory in the Data Science Toolbox called /vagrant. Please note that this is not your home directory. Let us check the contents of this directory:
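
    $ ls -l /vagrant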

If you have a file on your local computer, and you want to apply some command-line tools to it, all you have to do is copy or move the file to that directory. Let’s assume that you have a file called logs.csv on your Desktop. If you are running Linux or macOS, execute the following command on your operating system (and not inside the Data Science Toolbox):
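(The directory ~/MyDataScienceToolbox below is just an example; use whichever directory contains your Vagrantfile.)

    $ cp ~/Desktop/logs.csv ~/MyDataScienceToolbox/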

And if you are running Windows, you can run the following commands on the command prompt:
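(Again, MyDataScienceToolbox stands for the directory that contains your Vagrantfile.)

    > copy %UserProfile%\Desktop\logs.csv %UserProfile%\MyDataScienceToolbox\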

You may also drag-and-drop the file into the directory using Windows Explorer.

The file is now located in the directory /vagrant. It is a good idea to keep your data in a separate directory (here we use ~/book/ch03/data). So, after you have copied the file, you can move it by running:
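
    $ mv /vagrant/logs.csv ~/book/ch03/data/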

3.2.2 Remote Version of Data Science Toolbox

If you are running Linux or macOS, you can use scp (Rinne and Ylonen 2014), which stands for secure copy, to copy files onto the EC2 instance. You will need the same key pair file that you used to log in to the EC2 instance.
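A sketch of the invocation (the key pair filename mykey.pem and the remote username ubuntu are assumptions; adjust them to your setup):

    $ scp -i mykey.pem ~/Desktop/logs.csv \
      ubuntu@ec2-184-73-72-150.compute-1.amazonaws.com:~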

Replace the host name in the example ec2-184-73-72-150.compute-1.amazonaws.com with the value you see on the EC2 overview page in the AWS console.

3.3 Decompressing Files

If the original data set is very large or it’s a collection of many files, the file may be a (compressed) archive. Data sets which contain many repeated values (such as the words in a text file or the keys in a JSON file) are especially well suited for compression.

Common file extensions of compressed archives are: .tar.gz, .zip, and .rar. To decompress these, you would use the command-line tools tar (Bailey, Eggert, and Poznyakoff 2014), unzip (Smith 2009), and unrar (Asselstine, Scheurer, and Winkelmann 2014), respectively. There exist a few more, though less common, file extensions for which you would need yet other tools. For example, in order to extract a file named logs.tar.gz, you would use:
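
    $ tar -xzvf logs.tar.gz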

Indeed, tar is notorious for its many command-line arguments. In this case, the four command-line arguments x, z, v, and f specify that tar should extract files from an archive, use gzip as the decompression algorithm, be verbose, and use the file logs.tar.gz. In time, you’ll get used to typing these four characters, but there’s a more convenient way.

Rather than remembering the different command-line tools and their options, there’s a handy script called unpack (Brisbin 2013), which will decompress many different formats. unpack looks at the extension of the file that you want to decompress, and calls the appropriate command-line tool.

The unpack tool is part of the Data Science Toolbox. Remember that you can look up how it can be installed in the appendix. Example 3.1 shows the source of unpack. Although Bash scripting is not the focus of this book, it’s still useful to take a moment to figure out how it works.

Example 3.1 (Decompress various file formats)
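A minimal sketch of what such a script can look like, dispatching on the file extension (the actual unpack script handles more formats and options):

    #!/usr/bin/env bash
    # unpack: extract an archive based on its file extension (sketch)
    case "$1" in
      *.tar.gz|*.tgz) tar -xzvf "$1" ;;
      *.tar.bz2)      tar -xjvf "$1" ;;
      *.zip)          unzip "$1" ;;
      *.rar)          unrar x "$1" ;;
      *.gz)           gunzip "$1" ;;
      *)              echo "unpack: cannot extract '$1'" >&2; exit 1 ;;
    esac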

Now, in order to decompress this same file, you would simply use:
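
    $ unpack logs.tar.gz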

3.4 Converting Microsoft Excel Spreadsheets

For many people, Microsoft Excel offers an intuitive way to work with small data sets and perform calculations on them. As a result, a lot of data is embedded into Microsoft Excel spreadsheets. These spreadsheets are, depending on the extension of the filename, stored in either a proprietary binary format (.xls) or as a collection of compressed XML files (.xlsx). In both cases, the data is not readily usable by most command-line tools. It would be a shame if we could not use those valuable data sets just because they are stored this way.

Luckily, there is a command-line tool called in2csv (Groskopf 2014b), which is able to convert Microsoft Excel spreadsheets to CSV files. CSV stands for comma-separated values. Working with CSV can be tricky because it lacks a formal specification; RFC 4180 describes the de facto CSV format according to the following three points:

  1. Each record is located on a separate line, delimited by a line break (CRLF). For example:

    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF
  2. The last record in the file may or may not have an ending line break. For example:

    aaa,bbb,ccc CRLF
    zzz,yyy,xxx
  3. There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional header parameter of this MIME type). For example:

    field_name,field_name,field_name CRLF
    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF

Let’s demonstrate in2csv using a spreadsheet that contains the top 250 movies from the Internet Movie Database (IMDb). The file is named imdb-250.xlsx and can be obtained from http://www.overthinkingit.com/2011/10/11/imdb-top-250-movies-4th-edition/. To extract its data, we invoke in2csv as follows:
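
    $ in2csv imdb-250.xlsx > imdb-250.csv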

The format of the file is automatically determined by the extension, .xlsx in this case. If we were to pipe the data into in2csv, we would have to specify the format explicitly. Let’s look at the data:
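
    $ head -n 5 imdb-250.csv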

As you can see, CSV by default is not too readable. You can pipe the data to a tool called csvlook (Groskopf 2014d), which will nicely format the data into a table. Here, we’ll display a subset of the columns using csvcut such that the table fits on the page:
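The exact column names depend on the spreadsheet; Title, Year, and Rating below are assumptions:

    $ csvcut -c Title,Year,Rating imdb-250.csv | head | csvlook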

A spreadsheet can contain multiple worksheets. By default, in2csv extracts the first worksheet. To extract a different worksheet, you need to pass the name of the worksheet to the --sheet option.
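For example (both the spreadsheet filename and the worksheet name here are hypothetical):

    $ in2csv --sheet "Sheet 2" data.xlsx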

The tools in2csv, csvcut, and csvlook are actually part of Csvkit, which is a collection of command-line tools to work with CSV data. Csvkit will be used quite often in this book because it has so many valuable tools. If you’re running the Data Science Toolbox, you already have Csvkit installed. Otherwise, see the appendix for instructions on how to install it.

An alternative approach to in2csv is to open the spreadsheet in Microsoft Excel or an open source variant such as LibreOffice Calc, and manually export it to CSV. While this works as a one-off solution, the disadvantage is that it does not scale well to multiple files and is not automatable. Furthermore, when you are working on the command line of a remote server, chances are that you don’t have such an application available.

3.5 Querying Relational Databases

Most companies store their data in a relational database. Examples of relational databases are MySQL, PostgreSQL, and SQLite. These databases each have a slightly different interface. Some provide a command-line tool or a command-line interface, while others do not. Moreover, they are not very consistent when it comes to their usage and output.

Fortunately, there is a command-line tool called sql2csv, which is part of the Csvkit suite. Because it leverages the Python SQLAlchemy package, we only have to use one tool to execute queries on many different databases through a common interface, including MySQL, Oracle, PostgreSQL, SQLite, Microsoft SQL Server, and Sybase. The output of sql2csv is, as its name suggests, in CSV format.

We can obtain data from relational databases by executing a SELECT query on them. (sql2csv also supports INSERT, UPDATE, and DELETE queries, but that’s not the purpose of this chapter.) To select a specific set of data from an SQLite database named iris.db, sql2csv can be invoked as follows:
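A sketch of such an invocation (the table name iris is an assumption):

    $ sql2csv --db 'sqlite:///iris.db' \
      --query 'SELECT * FROM iris WHERE sepal_length > 7.5'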

Here, we are selecting all rows where sepal_length is larger than 7.5. The --db option specifies the database URL, of which the typical form is: dialect+driver://username:password@host:port/database.

3.6 Downloading from the Internet

The Internet provides without a doubt the largest resource for data. This data is available in various forms, using various protocols. The command-line tool cURL (Stenberg 2012) can be considered the command line’s Swiss Army knife when it comes to downloading data from the Internet.

When you access a URL, which stands for uniform resource locator, through your browser, the data that is being downloaded can be interpreted. For example, an HTML file is rendered as a website, an MP3 file may be automatically played, and a PDF file may be automatically downloaded or opened by a viewer. However, when cURL is used to access a URL, the data is downloaded as is and printed to standard output. Other command-line tools may then be used to process this data further.

The easiest invocation of cURL is to simply specify a URL as a command-line argument. For example, to download the book Adventures of Huckleberry Finn by Mark Twain from Project Gutenberg, we can run the following command:
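A sketch, assuming the plain-text edition is available at the following Project Gutenberg URL (the exact URL may differ):

    $ curl -s https://www.gutenberg.org/files/76/76-0.txt | head -n 10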

By default, cURL outputs a progress meter that shows the download rate and the expected time of completion. If you are piping the output directly to another command-line tool, such as head, be sure to specify the -s command-line argument, which stands for silent, so that the progress meter is disabled. Compare, for example, the output of the command above with the output of the following command, where the progress meter is not disabled:
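Using the same (assumed) URL, but without -s:

    $ curl https://www.gutenberg.org/files/76/76-0.txt | head -n 10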

Note that the output of the second command, where we do not disable the progress meter, contains unwanted text and even an error message. If you save the data to a file, then you do not necessarily need to specify the -s option:
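The filename finn.txt is an arbitrary choice:

    $ curl https://www.gutenberg.org/files/76/76-0.txt > finn.txt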

You can also save the data by explicitly specifying the output file with the -o option:
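Using the same assumed URL and filename:

    $ curl https://www.gutenberg.org/files/76/76-0.txt -o finn.txt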

When downloading data from the Internet, the URL will most likely use the protocols HTTP or HTTPS. To download from an FTP server, which stands for File Transfer Protocol, you use cURL in exactly the same way. When the URL is password protected, you can specify a username and a password as follows:
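A sketch with placeholder host, path, and credentials:

    $ curl -u username:password ftp://host/path/to/data.csv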

If the specified URL is a directory, curl will list the contents of that directory.

When you access a shortened URL, such as the ones that start with http://bit.ly/* or http://t.co/*, your browser automatically redirects you to the correct location. With curl, however, you need to specify the -L or --location option in order to be redirected:
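For example (the shortened URL below is a made-up placeholder):

    $ curl -L http://bit.ly/example-short-url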

If you do not specify the -L or --location option, you may get something like:

By specifying the -I or --head option, curl fetches only the HTTP header of the response:
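Again with a placeholder for the shortened URL:

    $ curl -I http://bit.ly/example-short-url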

The first line indicates the HTTP status code, which is 301 (moved permanently) in this case. You can also see the location this URL redirects to: http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio. Inspecting the header and getting the status code is a useful debugging tool in case curl does not give you the expected result. Other common HTTP status codes include 404 (not found) and 403 (forbidden). This page lists all HTTP status codes: http://en.wikipedia.org/wiki/List_of_HTTP_status_codes.

To conclude this section, cURL is a straightforward command-line tool for downloading data from the Internet. Its three most common command-line arguments are -s to suppress the progress meter, -u to specify a username and password, and -L to automatically follow redirects. See its man page for more information.

3.7 Calling a Web API

In the previous section we explained how to download individual files from the Internet. Another way data can come from the Internet is through a web API, which stands for application programming interface. The number of APIs being offered by organizations is growing at an increasing rate, which means a lot of interesting data for us data scientists.

Web APIs are not meant to be presented in a nice layout, the way websites are. Instead, most web APIs return data in a structured format, such as JSON or XML. Having data in a structured form has the advantage that the data can be easily processed by other tools, such as jq. For example, the API from http://randomuser.me returns its data in a JSON structure.
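A sketch of such a call, piping the response through jq to pretty-print it (the exact endpoint may have changed since this was written):

    $ curl -s http://api.randomuser.me/ | jq '.'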

The data is piped to a command-line tool jq in order to display it in a nice way. jq has many more possibilities that we will explore in Chapter 5.

Some web APIs return data in a streaming manner. This means that once you connect to it, the data will continue to pour in forever. A well-known example is the Twitter “firehose”, which constantly streams all the tweets being sent around the world. Luckily, most command-line tools that we use also operate in a streaming manner, so that we can also make use of this kind of data.

Some APIs require you to log in using the OAuth protocol. There is a handy command-line tool called curlicue (Foster 2014) that assists in performing the so-called “OAuth dance”. Once this has been set up, curlicue will call curl with the correct headers. First, you set things up once for a particular API with curlicue-setup, and then you can call that API using curlicue. For example, to use curlicue with the Twitter API you would run:
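A sketch of both steps (the OAuth endpoint URLs below follow Twitter’s documentation at the time of writing and may have changed):

    $ curlicue-setup \
        'https://api.twitter.com/oauth/request_token' \
        'https://api.twitter.com/oauth/authorize?oauth_token=$oauth_token' \
        'https://api.twitter.com/oauth/access_token' \
        credentials
    $ curlicue -f credentials \
        'https://api.twitter.com/1.1/statuses/home_timeline.json'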

For more popular APIs, there are specialized command-line tools available. These are wrappers that provide a convenient way to connect to the API. In Chapter 9, for example, we’ll be using the command-line tool bigmler that only connects to BigML’s prediction API.

3.8 Further Reading