Chapter 9 Modeling Data

In this chapter we’re going to perform the fourth and last step of the OSEMN model that we can do on a computer: modeling data. Generally speaking, to model data is to create an abstract or higher-level description of it. Just as with creating visualizations, modeling means taking a step back from the individual data points.

However, visualizations are characterized by shapes, positions, and colors, which means that we can interpret them by looking at them. Models, on the other hand, are internally characterized by a bunch of numbers, which means that computers can use them, for example, to make predictions about new data points. (We can still visualize models so that we can try to understand them and see how they are performing.)

In this chapter we’ll consider four common types of algorithms to model data:

  • Dimensionality reduction.
  • Clustering.
  • Regression.
  • Classification.

These four types of algorithms come from the field of machine learning. As such, we’re going to change our vocabulary a bit. Let’s assume that we have a CSV file, also known as a data set. Each row, except for the header, is considered to be a data point. For simplicity we assume that each column that contains numerical values is an input feature. If a data point also contains a non-numerical field, such as the species column in the Iris data set, then that is known as the data point’s label.

The first two types of algorithms (dimensionality reduction and clustering) are most often unsupervised, which means that they create a model based on the features of the data set only. The last two types of algorithms (regression and classification) are by definition supervised algorithms, which means that they also incorporate the labels into the model.

This chapter is by no means an introduction to machine learning, which means that we have to skim over many details. We strongly advise that you become familiar with an algorithm before applying it blindly to your data.

9.1 Overview

In this chapter, you’ll learn how to:

  • Reduce the dimensionality of your data set.
  • Identify groups of data points with three clustering algorithms.
  • Predict the quality of white wine using regression.
  • Classify wine as red or white via a prediction API.

9.2 More Wine Please!

In this chapter, we’ll be using a data set of wine tastings; specifically, of red and white Portuguese “Vinho Verde” wine. Each data point represents a wine and consists of 11 physicochemical properties: (1) fixed acidity, (2) volatile acidity, (3) citric acid, (4) residual sugar, (5) chlorides, (6) free sulfur dioxide, (7) total sulfur dioxide, (8) density, (9) pH, (10) sulphates, and (11) alcohol. There is also a quality score. This score lies between 0 (very bad) and 10 (excellent) and is the median of at least three evaluations by wine experts. More information about this data set is available at http://archive.ics.uci.edu/ml/datasets/Wine+Quality.

There are two data sets: one for white wine and one for red wine. The very first step is to obtain the two data sets using curl (and of course parallel because we haven’t got all day):
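
The exact command isn’t reproduced here; a minimal sketch looks as follows. The download URL is an assumption based on the UCI page mentioned above, so check that page if the files have moved:

$ URL="http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality"
$ parallel "curl -sL $URL/winequality-{}.csv > wine-{}.csv" ::: red white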

The triple colon is yet another way we can pass data to parallel. Let’s inspect both data sets using head and count the number of rows using wc -l:
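
For example (the braces are expanded by the shell into both filenames):

$ head -n 3 wine-{red,white}.csv
$ wc -l wine-{red,white}.csv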

At first sight this data appears to be very clean already. Still, let’s scrub this data a little bit so that it conforms more with what most command-line tools are expecting. Specifically, we’ll:

  • Convert the header to lowercase.
  • Convert the semi-colons to commas.
  • Convert spaces to underscores.
  • Remove unnecessary quotes.

These things can all be taken care of by tr. Let’s use a for loop this time, for old times’ sake, to process both data sets:
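
A sketch of that loop, using one tr invocation per conversion for clarity:

for T in red white; do
  # lowercase, semi-colons to commas, spaces to underscores, and drop the quotes
  < wine-$T.csv tr '[A-Z]' '[a-z]' | tr ';' ',' | tr ' ' '_' | tr -d '"' > wine-${T}-clean.csv
done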

Let’s also create a data set by combining the two data sets. We’ll use csvstack to add a column named “type” which will be “red” for rows of the first file, and “white” for rows of the second file:

The new column type is added to the beginning of the table. Because some of the command-line tools that we’ll use in this chapter assume that the class label is the last column, we’ll rearrange the columns using csvcut. Instead of typing all 13 columns, we temporarily store the desired header in a variable $HEADER before we call csvstack.
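
A sketch of this step; csvstack’s -g and -n options add the grouping column, and csvcut then moves type to the end:

$ HEADER="$(head -n 1 wine-red-clean.csv),type"
$ csvstack -g red,white -n type wine-red-clean.csv wine-white-clean.csv |
  csvcut -c "$HEADER" > wine-both-clean.csv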

It’s good to check whether there are any missing values in this data set:
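
csvstat can report this directly:

$ csvstat --nulls wine-both-clean.csv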

Excellent! Just out of curiosity, let’s see how the distribution of quality looks for both red and white wines.

From the density plot we can see the quality of white wine is distributed more towards higher values. Does this mean that white wines are overall better than red wines, or that the white wine experts more easily give higher scores than red wine experts? That’s something that the data doesn’t tell us. Or is there perhaps a correlation between alcohol and quality? Let’s use Rio and ggplot again to find out:
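
A sketch of such a plot, assuming Rio behaves as in the plotting chapter: -g loads ggplot2, -e evaluates the expression, the CSV is exposed as the data frame df together with a ggplot object g, and the resulting PNG is written to standard output:

$ < wine-both-clean.csv Rio -ge 'g + geom_point(aes(x = alcohol, y = quality, color = type), position = "jitter", alpha = 0.2)' | display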

Eureka! Ahem, let’s carry on with some modeling, shall we?

9.3 Dimensionality Reduction with Tapkee

The goal of dimensionality reduction is to map high-dimensional data points onto a lower-dimensional space. The challenge is to keep similar data points close together in the lower-dimensional mapping. As we’ve seen in the previous section, our wine data set contains 13 columns (11 physicochemical features plus quality and type). We’ll stick with two dimensions because that’s straightforward to visualize.

Dimensionality reduction is often regarded as part of the exploring step. It’s useful when there are too many features to plot at once. A scatter-plot matrix is one alternative, but that only shows two features at a time. Dimensionality reduction is also useful as a pre-processing step for other machine learning algorithms.

Most dimensionality reduction algorithms are unsupervised. This means that they don’t employ the labels of the data points in order to construct the lower-dimensional mapping.

In this section we’ll look at two techniques: PCA, which stands for Principal Components Analysis (Pearson 1901) and t-SNE, which stands for t-distributed Stochastic Neighbor Embedding (Maaten and Hinton 2008).

9.3.1 Introducing Tapkee

Tapkee is a C++ template library for dimensionality reduction (Lisitsyn, Widmer, and Garcia 2013). The library contains implementations of many dimensionality reduction algorithms, including:

  • Locally Linear Embedding
  • Isomap
  • Multidimensional scaling
  • PCA
  • t-SNE

Tapkee’s website, http://tapkee.lisitsyn.me/, contains more information about these algorithms. Although Tapkee is mainly a library that can be included in other applications, it also offers a command-line tool. We’ll use this to perform dimensionality reduction on our wine data set.

9.3.2 Installing Tapkee

If you aren’t running the Data Science Toolbox, you’ll need to download and compile Tapkee yourself. First make sure that you have CMake installed. On Ubuntu, you simply run:
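
For example:

$ sudo apt-get install cmake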

Please consult Tapkee’s website for instructions for other operating systems. Then execute the following commands to download the source and compile it:
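
A sketch of those steps, assuming the source still lives in the lisitsyn/tapkee repository on GitHub and builds with the default CMake setup (Tapkee depends on the Eigen library, so install libeigen3-dev if CMake complains):

$ curl -sL https://github.com/lisitsyn/tapkee/archive/master.tar.gz | tar -xz
$ cd tapkee-master
$ mkdir build && cd build
$ cmake .. && make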

This creates a binary executable named tapkee.

9.3.3 Linear and Non-linear Mappings

First, we’ll scale the features using standardization such that each feature is equally important. This generally leads to better results when applying machine learning algorithms.

To scale we use a combination of cols and Rio:
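
The exact invocation isn’t reproduced here; a minimal alternative sketch performs the standardization in a single Rio expression, assuming Rio exposes the CSV as the data frame df and writes a resulting data frame back as CSV. Columns 1 to 11 are the features; quality and type are passed through untouched:

$ < wine-both-clean.csv Rio -e 'cbind(data.frame(scale(df[, 1:11])), df[, 12:13])' > wine-both-scaled.csv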

Now we apply both dimensionality reduction techniques and visualize the mapping using Rio-scatter:
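
The full pipeline isn’t reproduced here; a rough sketch of the dimensionality reduction step is shown below. It assumes that tapkee reads one comma-separated data point per line on standard input, writes the two-dimensional coordinates to standard output, and that --method with the names pca and t-sne selects the algorithm; check tapkee’s help output for the exact options, and convert the commas to spaces with tr if it objects to them. The body and header tools keep the CSV header out of tapkee’s way and rename the resulting columns, and the coordinates can then be pasted onto the type column and plotted with Rio-scatter:

$ csvcut -C type,quality wine-both-scaled.csv | body tapkee --method pca | header -r x,y > wine-pca.csv
$ csvcut -C type,quality wine-both-scaled.csv | body tapkee --method t-sne | header -r x,y > wine-tsne.csv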

Figure 9.1: PCA

Figure 9.2: t-SNE

Note that there’s not a single GNU core util (i.e., classic command-line tool) in this one-liner. Now that’s the power of the command line!

9.4 Clustering with Weka

In this section we’ll be clustering our wine data set into groups. Like dimensionality reduction, clustering is usually unsupervised. It can be used to gain an understanding of how your data is structured. Once the data has been clustered, you can visualize the result by coloring the data points according to their cluster assignment. For most algorithms you have to specify upfront how many groups you want the data to be clustered into; some algorithms are able to determine a suitable number of groups themselves.

For this task we’ll use Weka, which is maintained by the Machine Learning Group at the University of Waikato (Hall et al. 2009). If you already know Weka, then you probably know it as a piece of software with a graphical user interface. However, as you’ll see, Weka can also be used from the command line (albeit with some modifications). Besides clustering, Weka can also do classification and regression, but we’re going to be using other tools for those machine learning tasks.

9.4.1 Introducing Weka

You may ask: surely there are better command-line tools for clustering? And you’d be right. One reason we include Weka in this chapter is to show you how you can work around such imperfections by building additional command-line tools. As you spend more time on the command line and try out other command-line tools, chances are that you’ll come across one that seems very promising at first but doesn’t work as you expected. A common imperfection is that a command-line tool doesn’t handle standard input or standard output correctly. In the next section we’ll point out these imperfections and demonstrate how to work around them.

9.4.2 Taming Weka on the Command Line

Weka can be invoked from the command line, but it’s definitely not straightforward or user friendly. Weka is programmed in Java, which means that you have to run java, specify the location of the weka.jar file, and specify the individual class you want to call. For example, Weka has a class called MexicanHat, which generates a toy data set. To generate 10 data points using this class, you would run:
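
Assuming weka.jar lives at /path/to/weka.jar (adjust to wherever your copy is), that call looks roughly like this; the -n option is assumed to set the number of data points to generate, and the class’s own help output lists the exact options:

$ java -cp /path/to/weka.jar weka.datagenerators.classifiers.regression.MexicanHat -n 10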

Don’t worry about the output of this command, we’ll discuss that later. At this moment, we’re concerned with the usage of Weka. There are a couple of things to note here:

  • You need to run java, which is counter-intuitive.
  • The jar file contains over 2000 classes, and only about 300 of those can be used from the command line directly. How do you know which ones?
  • You need to specify the entire namespace of the class: weka.datagenerators.classifiers.regression.MexicanHat. How are you supposed to remember that?

Does this mean that we’re going to give up on Weka? Of course not! Since Weka does contain a lot of useful functionality, we’re going to tackle these issues in the next three subsections.

9.4.2.1 An Improved Command-line Tool for Weka

Save the following snippet as a new file called weka and put it somewhere on your PATH:
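
A minimal version of that snippet (the real one may differ slightly); it assumes the environment variable WEKAPATH points at the directory containing weka.jar:

#!/usr/bin/env bash
# weka: call a Weka class without typing java, the classpath, or the weka. prefix
java -cp "${WEKAPATH}/weka.jar" "weka.$@"

Don’t forget to make it executable with chmod +x weka.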

Subsequently, add the following line to your .bashrc file so that weka can be called from anywhere:
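
For example (the path is hypothetical; point it at the directory that contains weka.jar):

export WEKAPATH=/path/to/weka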

We can now call the previous example with:
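
That is, with the same assumed -n option as before:

$ weka datagenerators.classifiers.regression.MexicanHat -n 10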

9.4.2.2 Usable Weka Classes

As mentioned, the file weka.jar contains over 2000 classes. Many of them cannot be used from the command line directly. We consider a class usable from the command line when it provides us with a help message if we invoke it with -h. For example:
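
Continuing with the data generator from before, whose -h output lists all of its options:

$ weka datagenerators.classifiers.regression.MexicanHat -h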

Now that’s usable. This, for example, is not a usable class:

The following pipeline runs weka with every class in weka.jar and -h and saves the standard output and standard error to a file with the same name as the class:
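
A sketch of such a pipeline; it assumes the class names can be recovered from the jar’s file listing, filters out inner classes (names containing $), and uses a timeout because some classes try to start a graphical interface when invoked. The exact number of files you end up with depends on how you filter the class list:

$ unzip -l "$WEKAPATH/weka.jar" |
  grep -o 'weka/[^ ]*\.class' |
  grep -v '\$' |
  sed -e 's/\.class$//' -e 's|/|.|g' -e 's/^weka\.//' |
  parallel --timeout 5 "weka {} -h > {}.out 2>&1"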

We now have 749 files. With the following command we save the filename of every file that does not contain the string Exception to weka.classes:
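
One way to do that; grep -L lists the files that do not match:

$ grep -L 'Exception' *.out | sed 's/\.out$//' > weka.classes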

This still comes down to 332 classes! Here are a few classes that might be of interest:

  • attributeSelection.PrincipalComponents
  • classifiers.bayes.NaiveBayes
  • classifiers.evaluation.ConfusionMatrix
  • classifiers.functions.SimpleLinearRegression
  • classifiers.meta.AdaBoostM1
  • classifiers.trees.RandomForest
  • clusterers.EM
  • filters.unsupervised.attribute.Normalize

As you can see, weka offers a whole range of classes and functionality.

9.4.2.3 Adding Tab Completion

At this moment, you still need to type in the entire class name yourself. You can add so-called tab completion by adding the following snippet to your .bashrc file after you export WEKAPATH:
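
A sketch of such a snippet; it assumes you’ve copied weka.classes into $WEKAPATH:

_completeweka() {
  local cur=${COMP_WORDS[COMP_CWORD]}
  COMPREPLY=( $(compgen -W "$(cat "$WEKAPATH"/weka.classes)" -- "$cur") )
}
complete -F _completeweka weka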

This function makes use of the weka.classes file we generated earlier. If you now type: weka clu<Tab><Tab><Tab> on the command line, you are presented with a list of all classes that have to do with clustering:

$ weka clusterers.
clusterers.CheckClusterer
clusterers.CLOPE
clusterers.ClusterEvaluation
clusterers.Cobweb
clusterers.DBSCAN
clusterers.EM
clusterers.FarthestFirst
clusterers.FilteredClusterer
clusterers.forOPTICSAndDBScan.OPTICS_GUI.OPTICS_Visualizer
clusterers.HierarchicalClusterer
clusterers.MakeDensityBasedClusterer
clusterers.OPTICS
clusterers.sIB
clusterers.SimpleKMeans
clusterers.XMeans

Creating a command-line tool weka and adding tab completion makes sure that Weka is a little bit more friendly to use on the command line.

9.4.3 Converting between CSV and ARFF Data Formats

Weka uses ARFF as a file format. This is basically CSV with additional information about the columns. We’ll use two convenient command-line tools to convert between CSV and ARFF, namely csv2arff (see Example 9.1 ) and arff2csv (see Example 9.2).

Example 9.1 (Convert CSV to ARFF)
Example 9.2 (Convert ARFF to CSV)
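
The two listings aren’t reproduced here; minimal reconstructions could look as follows, assuming Weka’s converter classes write to standard output when no output file is specified (if your version insists on one, add -o /dev/stdout):

#!/usr/bin/env bash
# csv2arff: read CSV on standard input, write ARFF on standard output
weka core.converters.CSVLoader /dev/stdin

#!/usr/bin/env bash
# arff2csv: read ARFF on standard input, write CSV on standard output
weka core.converters.CSVSaver -i /dev/stdin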

9.4.4 Comparing Three Cluster Algorithms

Unfortunately, in order to cluster data using Weka, we need yet another command-line tool to help us. The AddCluster class is needed to assign data points to the learned clusters, but it does not accept data from standard input, not even when we specify -i /dev/stdin, because it expects a file with the .arff extension. We consider this to be bad design. The source code of weka-cluster is:
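
A sketch of what such a wrapper could look like; the temporary .arff file works around AddCluster’s refusal to read from standard input, and -W, -i, and -o are assumed to be the usual Weka filter options for the clusterer specification, input, and output:

#!/usr/bin/env bash
# weka-cluster: read CSV on stdin, cluster it with the given Weka clusterer,
# and write the result (including the new cluster column) as CSV on stdout
ALGO="$@"
IN=$(mktemp --suffix=.arff)
trap 'rm -f "$IN"' EXIT
csv2arff > "$IN"
weka filters.unsupervised.attribute.AddCluster -W "weka.$ALGO" -i "$IN" -o /dev/stdout |
  arff2csv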

Now we can apply the EM clustering algorithm and save the assignment as follows (see the sketch after this list):

  • Use the scaled features, and don’t use the features quality and type for clustering.
  • Apply the algorithm using weka-cluster.
  • Only save the cluster assignment.
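
A sketch of that pipeline, assuming AddCluster names the new column cluster:

$ csvcut -C quality,type wine-both-scaled.csv |
  weka-cluster clusterers.EM |
  csvcut -c cluster > wine-cluster-em.csv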

We’ll run the same command again for SimpleKMeans and Cobweb algorithms. Now we have three files with cluster assignments. Let’s create a t-SNE mapping in order to visualize the cluster assignments:

Next, the cluster assignments are combined with the t-SNE mapping using paste and a scatter plot is created using Rio-scatter:

Figure 9.3: EM

Figure 9.4: SimpleKMeans

Figure 9.5: Cobweb

Admittedly, we have gone through a lot of trouble taming Weka. The exercise was worth it, because some day you may run into a command-line tool that works differently from what you expect. Now you know that there are always ways to work around such command-line tools.

9.5 Regression with SciKit-Learn Laboratory

In this section, we’ll be predicting the quality of white wines based on their physicochemical properties. Because the quality is a number between 0 and 10, we can consider predicting the quality to be a regression task. Using the training data points, we’ll train three regression models with three different algorithms.

We’ll be using the SciKit-Learn Laboratory (or SKLL) package for this. If you’re not using the Data Science Toolbox, you can install SKLL using pip:
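
That is:

$ pip install skll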

If you’re running Python 2.7, you also need to install the following packages:

9.5.1 Preparing the Data

SKLL expects that the train and test data have the same filenames, located in separate directories. However, in this example, we’re going to use cross-validation, meaning that we only need to specify a training data set. Cross-validation is a technique that splits up the whole data set into a certain number of subsets. These subsets are called folds. (Usually, five or ten folds are used.)

We need to add an identifier to each row so that we can easily identify the data points later (the predictions are not in the same order as the original data set):
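
One way to do this is with nl, which numbers every line; the sed call then renames the header’s number to id, which SKLL is assumed to pick up as the example identifier. The directory and file names are tied to the configuration file shown in the next section:

$ mkdir train
$ < wine-white-clean.csv nl -s, -w1 -v0 | sed '1s/^0,/id,/' > train/features.csv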

9.5.2 Running the Experiment

Create a configuration file called predict-quality.cfg:
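
A sketch of what this configuration could look like; key names differ slightly between SKLL versions (newer releases use train_directory instead of train_location, for instance), so treat this as a starting point rather than a verbatim listing:

[General]
experiment_name = Wine
task = cross_validate

[Input]
train_location = train
featuresets = [["features"]]
learners = ["LinearRegression", "GradientBoostingRegressor", "RandomForestRegressor"]
label_col = quality

[Tuning]
grid_search = false
objective = pearson

[Output]
log = output
results = output
predictions = output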

We run the experiment using the run_experiment command-line tool:
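
That is:

$ run_experiment -l predict-quality.cfg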

The -l command-line argument indicates that we’re running in local mode. SKLL also offers the possibility to run experiments on clusters. The time it takes to run the experiment depends on the complexity of the chosen algorithms.

9.5.3 Parsing the Results

Once all algorithms are done, the results can now be found in the directory output:

SKLL generates four files for each learner: one log, two with results, and one with predictions. Moreover, SKLL generates a summary file, which contains a lot of information about each individual fold (too much to show here). We can extract the relevant metrics using the following SQL query:
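
A sketch of that query, assuming the summary ends up in output/Wine_summary.tsv (the name depends on experiment_name) and contains learner_name and pearson columns; csvsql’s -t flag handles the tab delimiter, and averaging over the folds keeps the output short:

$ csvsql -t --query "SELECT learner_name, AVG(pearson) AS pearson FROM Wine_summary \
  GROUP BY learner_name ORDER BY pearson DESC" output/Wine_summary.tsv | csvlook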

The relevant column here is pearson, which indicates Pearson’s correlation coefficient. This is a value between -1 and 1 that indicates the correlation between the true quality scores and the predicted ones. Let’s paste all the predictions back to the data set:

And create a plot using Rio:

9.6 Classification with BigML

In this fourth and last modeling section we’re going to classify wines as either red or white. For this we’ll be using a solution called BigML, which provides a prediction API. This means that the actual modeling and predicting takes place in the cloud, which is useful if you need a bit more power than your own computer can offer.

Although prediction APIs are relatively young, they are on the rise, which is why we’ve included one in this chapter. Other providers of prediction APIs are Google (see https://developers.google.com/prediction) and PredictionIO (see http://prediction.io). One advantage of BigML is that they offer a convenient command-line tool called bigmler (BigML 2014) that interfaces with their API. We can use this command-line tool like any other presented in this book, but behind the scenes, our data set is being sent to BigML’s servers, which perform the classification and send back the results.

9.6.1 Creating Balanced Train and Test Data Sets

First, we create a balanced data set to ensure that both classes are represented equally. For this, we use csvstack (Groskopf 2014h), shuf (Eggert 2012), head, and csvcut:
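
The command isn’t reproduced here; a sketch that matches the description below (the process substitution <( ) is the file redirection referred to, and 1599 is the number of red wines):

$ csvstack -g red,white -n type wine-red-clean.csv \
  <(head -n 1 wine-white-clean.csv; tail -n +2 wine-white-clean.csv | shuf | head -n 1599) |
  csvcut -c "$(head -n 1 wine-red-clean.csv),type" > wine-balanced.csv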

This long command breaks down as follows:

  • csvstack is used to combine multiple data sets. It creates a new column type, which has the value red for all rows coming from the first file wine-red-clean.csv and white for all rows coming from the second file.
  • The second file is passed to csvstack using file redirection. This allows us to create a temporary file using shuf, which creates a random permutation of wine-white-clean.csv, and head, which only selects the header and the first 1599 rows.
  • Finally, we reorder the columns of this data set using csvcut because by default, bigmler assumes that the last column is the label.

Let’s verify that wine-balanced.csv is actually balanced by counting the number of instances per class using parallel and grep:
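
For example; because type is the last column, we can anchor the pattern at the end of the line:

$ parallel --tag "grep -c ',{}$' wine-balanced.csv" ::: red white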

As you can see, the data set wine-balanced.csv contains both 1599 red and 1599 white wines. Next we split it into train and test data sets using split (Granlund and Stallman 2012b):
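
A sketch of those steps; plain head stands in for the book’s header tool, split’s -d option gives the numeric suffixes x00 and x01 that the description refers to, and -n r/2 distributes the lines round robin over the two files:

$ head -n 1 wine-balanced.csv > wine-header.csv
$ tail -n +2 wine-balanced.csv | shuf | split -d -n r/2
$ parallel --xapply "cat wine-header.csv {1} > {2}" ::: x00 x01 ::: wine-train.csv wine-test.csv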

This is another long command that deserves to be broken down:

  • Get the header using header and save it to a temporary file named wine-header.csv.
  • Mix up the red and white wines using tail and shuf and split them into two files named x00 and x01 using a round-robin distribution.
  • Use cat to combine the header saved in wine-header.csv and the rows stored in x00 to save it as wine-train.csv; similarly for x01 and wine-test.csv. The --xapply command-line argument tells parallel to loop over the two input sources in tandem.

Let’s check again number of instances per class in both wine-train.csv and wine-test.csv:
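
The same trick works for both files at once; without --xapply, parallel generates every combination of the two input sources:

$ parallel --tag "grep -c ',{2}$' {1}" ::: wine-train.csv wine-test.csv ::: red white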

It looks like our data sets are well balanced. We’re now ready to call the prediction API using bigmler.

9.6.2 Calling the API

You can obtain a BigML username and API key at https://bigml.com/developers. Be sure to set the variables BIGML_USERNAME and BIGML_API_KEY in .bashrc with the appropriate values.

The API call is quite straightforward, and the meaning of each command-line argument is obvious from its name.
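
A sketch of the call; the flag names are assumptions based on bigmler’s documentation, and the tool expects BIGML_USERNAME and BIGML_API_KEY to be set as described above:

$ csvcut -C type wine-test.csv > wine-test-blind.csv
$ bigmler --train wine-train.csv --test wine-test-blind.csv --output-dir output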

The file wine-test-blind.csv is just wine-test.csv with the type column (i.e., the label) removed. After this call is finished, the results can be found in the output directory:

9.6.3 Inspecting the Results

The file which is of most interest is output/predictions.csv:
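
For example:

$ head output/predictions.csv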

We can compare these predicted labels with the labels in our test data set. Let’s count the number of misclassifications (see the sketch after this list):

  • First, we combine the type columns of both wine-test.csv and output/predictions.csv.
  • Then, we use awk to keep count of when the two columns differ in value.
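
A sketch of those two steps; it assumes the predictions file has a header and that its predicted label column is also called type, so adjust the csvcut call if bigmler names it differently:

$ paste -d, <(csvcut -c type wine-test.csv) <(csvcut -c type output/predictions.csv) |
  awk -F, 'NR > 1 && $1 != $2 { n++ } END { print n + 0 }'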

As you can see, BigML’s API misclassified 766 wines out of 1599. This isn’t a good result, but please note that we just blindly applied an algorithm to a data set, which we normally wouldn’t do.

9.6.4 Conclusion

BigML’s prediction API has proven to be easy to use. As with many of the command-line tools discussed in this book, we’ve barely scratched the surface with BigML. For completeness, we should mention that:

  • BigML’s command-line tool also allows for local computations, which is useful for debugging.
  • Results can also be inspected using BigML’s web interface.
  • BigML can also perform regression tasks.

Please see https://bigml.com/developers for a complete overview of BigML’s features.

Although we’ve only been able to experiment with one prediction API, we do believe that prediction APIs in general are worthwhile to consider for doing data science.

9.7 Further Reading

  • Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4). Elsevier:547–53.
  • Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. “The WEKA Data Mining Software: An Update.” SIGKDD Explorations 11 (1). ACM.
  • Pearson, K. 1901. “On Lines and Planes of Closest Fit to Systems of Points in Space.” Philosophical Magazine 2 (11):559–72.
  • Maaten, Laurens van der, and Geoffrey Everest Hinton. 2008. “Visualizing Data Using T-SNE.” Journal of Machine Learning Research 9:2579–2605.