Chapter 9 Modeling Data

In this chapter we’re going to perform the fourth and last step of the OSEMN model that we can do on a computer: modeling data. Generally speaking, to model data is to create an abstract or higher-level description of your data. Just like with creating visualizations, it’s like taking a step back from the individual data points.

Visualizations, on the one hand, are characterized by shapes, positions, and colors, which means we can interpret them by looking at them. Models, on the other hand, are internally characterized by a bunch of numbers, which means that computers can use them, for example, to make predictions about new data points. (We can still visualize models so that we can try to understand them and see how well they perform.)

In this chapter we’ll consider four common types of algorithms to model data:

  • Dimensionality reduction.
  • Clustering.
  • Regression.
  • Classification.

These four types of algorithms come from the field of machine learning. As such, we’re going to change our vocabulary a bit. Let’s assume that we have a CSV file, also known as a data set. Each row, except for the header, is considered to be a data point. For simplicity we assume that each column that contains numerical values is an input feature. If a data point also contains a non-numerical field, such as the species column in the Iris data set, then that is known as the data point’s label.

The first two types of algorithms (dimensionality reduction and clustering) are most often unsupervised, which means that they create a model based on the features of the data set only. The last two types of algorithms (regression and classification) are by definition supervised algorithms, which means that they also incorporate the labels into the model.

This chapter is by no means an introduction to machine learning, which means that we have to skim over many details. We strongly advise that you become familiar with an algorithm before applying it blindly to your data.

9.1 Overview

In this chapter, you’ll learn how to:

  • Reduce the dimensionality of your data set.
  • Identify groups of data points with three clustering algorithms.
  • Predict the quality of white wine using regression.
  • Classify wine as red or white via a prediction API.

9.2 More Wine Please!

In this chapter, we’ll be using a data set of wine tastings. Specifically, red and white Portuguese “Vinho Verde” wine. Each data point represents a wine, and consists of 11 physicochemical properties: (1) fixed acidity, (2) volatile acidity, (3) citric acid, (4) residual sugar, (5) chlorides, (6) free sulfur dioxide, (7) total sulfur dioxide, (8) density, (9) pH, (10) sulphates, and (11) alcohol. There is also a quality score. This score lies between 0 (very bad) and 10 (excellent) and is the median of at least three evaluations by wine experts. More information about this data set is available at http://archive.ics.uci.edu/ml/datasets/Wine+Quality.

There are two data sets: one for white wine and one for red wine. The very first step is to obtain the two data sets using curl (and of course parallel because we haven’t got all day):

$ cd ~/book/ch09
$ parallel "curl -sL http://archive.ics.uci.edu/ml/machine-learning-databases"\
> "/wine-quality/winequality-{}.csv > data/wine-{}.csv" ::: red white

The triple colon is yet another way we can pass data to parallel. Let’s inspect both data sets using head and count the number of rows using wc -l:

$ head -n 5 wine-{red,white}.csv | fold
==> wine-red.csv <==
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"f
ree sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";
"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6

==> wine-white.csv <==
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"f
ree sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";
"quality"
7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6
8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6
7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
$ wc -l wine-{red,white}.csv
  1600 wine-red.csv
  4899 wine-white.csv
  6499 total

At first sight this data appears to be very clean already. Still, let’s scrub this data a little bit so that it conforms more with what most command-line tools are expecting. Specifically, we’ll:

  • Convert the header to lowercase.
  • Convert the semi-colons to commas.
  • Convert spaces to underscores.
  • Remove unnecessary quotes.

These things can all be taken care of by tr. Let’s use a for loop this time—for old times’ sake—to process both data sets:

$ for T in red white; do
> < wine-$T.csv tr '[A-Z]; ' '[a-z],_' | tr -d \" > wine-${T}-clean.csv
> done

Let’s also create a data set by combining the two data sets. We’ll use csvstack to add a column named “type” which will be “red” for rows of the first file, and “white” for rows of the second file:

$ HEADER="$(head -n 1 wine-red-clean.csv),type"
$ csvstack -g red,white -n type wine-{red,white}-clean.csv |
> csvcut -c $HEADER > wine-both-clean.csv

The new column type is added to the beginning of the table. Because some of the command-line tools that we’ll use in this chapter assume that the class label is the last column, we’ll rearrange the columns by using csvcut. Instead of typing all 13 columns, we temporarily store the desired header in the variable $HEADER before we call csvstack.
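
To double-check that type indeed ended up as the last column, here’s a quick sanity check using head and tr; the last three column names should be alcohol, quality, and type:

$ head -n 1 wine-both-clean.csv | tr ',' '\n' | tail -n 3
alcohol
quality
type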

It’s good to check whether there are any missing values in this data set:

$ csvstat wine-both-clean.csv --nulls
  1. fixed_acidity: False
  2. volatile_acidity: False
  3. citric_acid: False
  4. residual_sugar: False
  5. chlorides: False
  6. free_sulfur_dioxide: False
  7. total_sulfur_dioxide: False
  8. density: False
  9. ph: False
 10. sulphates: False
 11. alcohol: False
 12. quality: False
 13. type: False

Excellent! Just out of curiosity, let’s see how the distribution of quality looks for both red and white wines:

$ < wine-both-clean.csv Rio -ge 'g+geom_density(aes(quality, '\
> 'fill=type), adjust=3, alpha=0.5)' | display

From the density plot we can see the quality of white wine is distributed more towards higher values. Does this mean that white wines are overall better than red wines, or that the white wine experts more easily give higher scores than red wine experts? That’s something that the data doesn’t tell us. Or is there perhaps a correlation between alcohol and quality? Let’s use Rio and ggplot again to find out:

$ < wine-both-clean.csv Rio -ge 'ggplot(df, aes(x=alcohol, y=quality, '\
> 'color=type)) + geom_point(position="jitter", alpha=0.2) + '\
> 'geom_smooth(method="lm")' | display
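
If you’d rather put a number on that relationship than eyeball it, here’s a quick sketch using csvcut and Rio, assuming (as elsewhere in this chapter) that Rio exposes the piped-in CSV as the data frame df:

$ < wine-both-clean.csv csvcut -c alcohol,quality |
> Rio -e 'cor(df$alcohol, df$quality)'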

Eureka! Ahem, let’s carry on with some modeling, shall we?

9.3 Dimensionality Reduction with Tapkee

The goal of dimensionality reduction is to map high-dimensional data points onto a lower-dimensional space. The challenge is to keep similar data points close together in that lower-dimensional space. As we’ve seen in the previous section, our wine data set has 13 columns. We’ll stick with two dimensions because that’s straightforward to visualize.

Dimensionality reduction is often regarded as part of the exploring step. It’s useful when there are too many features to plot. You could create a scatter-plot matrix, but that only shows two features at a time. Dimensionality reduction is also useful as a pre-processing step for other machine learning algorithms.

Most dimensionality reduction algorithms are unsupervised. This means that they don’t employ the labels of the data points in order to construct the lower-dimensional mapping.

In this section we’ll look at two techniques: PCA, which stands for Principal Components Analysis (Pearson 1901) and t-SNE, which stands for t-distributed Stochastic Neighbor Embedding (Maaten and Hinton 2008).

9.3.1 Introducing Tapkee

Tapkee is a C++ template library for dimensionality reduction (Lisitsyn, Widmer, and Garcia 2013). The library contains implementations of many dimensionality reduction algorithms, including:

  • Locally Linear Embedding
  • Isomap
  • Multidimensional scaling
  • PCA
  • t-SNE

Tapkee’s website, http://tapkee.lisitsyn.me/, contains more information about these algorithms. Although Tapkee is mainly a library that can be included in other applications, it also offers a command-line tool. We’ll use this to perform dimensionality reduction on our wine data set.

9.3.2 Installing Tapkee

If you aren’t running the Data Science Toolbox, you’ll need to download and compile Tapkee yourself. First make sure that you have CMake installed. On Ubuntu, you simply run:

$ sudo apt-get install cmake

Please consult Tapkee’s website for instructions for other operating systems. Then execute the following commands to download the source and compile it:

$ curl -sL https://github.com/lisitsyn/tapkee/archive/master.tar.gz > \
> tapkee-master.tar.gz
$ tar -xzf tapkee-master.tar.gz
$ cd tapkee-master
$ mkdir build && cd build
$ cmake ..
$ make

This creates a binary executable named tapkee.

9.3.3 Linear and Non-linear Mappings

First, we’ll scale the features using standardization such that each feature is equally important. This generally leads to better results when applying machine learning algorithms.

To scale we use a combination of cols and Rio:

$ < wine-both-clean.csv cols -C type Rio -f scale > wine-both-scaled.csv
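
To convince yourself that the scaling worked, you could check that a feature now has a mean of (roughly) 0 and a standard deviation of 1. A quick sketch using csvcut and Rio, again assuming Rio exposes the input as the data frame df:

$ < wine-both-scaled.csv csvcut -c alcohol |
> Rio -e 'c(mean(df$alcohol), sd(df$alcohol))'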

Now we apply both dimensionality reduction techniques and visualize the mapping using Rio-scatter:

$ < wine-both-scaled.csv cols -C type body tapkee --method pca |
> header -r x,y,type | Rio-scatter x y type |
> tee tapkee-wine-pca.png | display

Figure 9.1: PCA

$ < wine-both-scaled.csv cols -C type body tapkee --method t-sne |
> header -r x,y,type | Rio-scatter x y type |
> tee tapkee-wine-t-sne.png | display

Figure 9.2: t-SNE

Note that, apart from tee, there’s not a single GNU coreutil (i.e., classic command-line tool) in this one-liner. Now that’s the power of the command line!

9.4 Clustering with Weka

In this section we’ll be clustering our wine data set into groups. Like dimensionality reduction, clustering is usually unsupervised. It can be used to gain an understanding of how your data is structured. Once the data has been clustered, you can visualize the result by coloring the data points according to their cluster assignment. For most algorithms you specify upfront how many groups you want the data to be clustered into; some algorithms are able to determine a suitable number of groups themselves.

For this task we’ll use Weka, which is maintained by the Machine Learning Group at the University of Waikato (Hall et al. 2009). If you already know Weka, then you probably know it as a piece of software with a graphical user interface. However, as you’ll see, Weka can also be used from the command line (albeit with some modifications). Besides clustering, Weka can also do classification and regression, but we’re going to use other tools for those machine learning tasks.

9.4.1 Introducing Weka

You may ask: surely there are better command-line tools for clustering? And you are right. One reason we include Weka in this chapter is to show you how you can work around a tool’s imperfections by building additional command-line tools. As you spend more time on the command line and try out other command-line tools, chances are that you’ll come across one that seems very promising at first but does not work as you expected. A common imperfection is that a command-line tool does not handle standard input or standard output correctly. In the next sections we’ll point out these imperfections and demonstrate how to work around them.

9.4.2 Taming Weka on the Command Line

Weka can be invoked from the command line, but it’s definitely not straightforward or user friendly. Weka is programmed in Java, which means that you have to run java, specify the location of the weka.jar file, and specify the individual class you want to call. For example, Weka has a class called MexicanHat, which generates a toy data set. To generate 10 data points using this class, you would run:

$ java -cp ~/bin/weka.jar weka.datagenerators.classifiers.regression.MexicanHat\
>  -n 10 | fold
%
% Commandline
%
% weka.datagenerators.classifiers.regression.MexicanHat -r weka.datagenerators.c
lassifiers.regression.MexicanHat-S_1_-n_10_-A_1.0_-R_-10..10_-N_0.0_-V_1.0 -S 1
-n 10 -A 1.0 -R -10..10 -N 0.0 -V 1.0
%
@relation weka.datagenerators.classifiers.regression.MexicanHat-S_1_-n_10_-A_1.0
_-R_-10..10_-N_0.0_-V_1.0

@attribute x numeric
@attribute y numeric

@data

4.617564,-0.215591
-1.798384,0.541716
-5.845703,-0.072474
-3.345659,-0.060572
9.355118,0.00744
-9.877656,-0.044298
9.274096,0.016186
8.797308,0.066736
8.943898,0.051718
8.741643,0.072209

Don’t worry about the output of this command; we’ll discuss it later. At this moment, we’re concerned with the usage of Weka. There are a couple of things to note here:

  • You need to run java, which is counterintuitive.
  • The jar file contains over 2000 classes, and only about 300 of those can be used from the command line directly. How do you know which ones?
  • You need to specify the entire namespace of the class: weka.datagenerators.classifiers.regression.MexicanHat. How are you supposed to remember that?

Does this mean that we’re going to give up on Weka? Of course not! Since Weka does contain a lot of useful functionality, we’re going to tackle these issues in the next three subsections.

9.4.2.1 An Improved Command-line Tool for Weka

Save the following snippet as a new file called weka and put it somewhere on your PATH:

#!/usr/bin/env bash
java -Xmx1024M -cp ${WEKAPATH}/weka.jar "weka.$@"

Subsequently, add the following line to your .bashrc file so that weka can be called from anywhere:

export WEKAPATH=/home/vagrant/repos/weka

We can now call the previous example with:

$ weka datagenerators.classifiers.regression.MexicanHat -n 10

9.4.2.2 Usable Weka Classes

As mentioned, the file weka.jar contains over 2000 classes. Many of them cannot be used from the command line directly. We consider a class usable from the command line when it provides us with a help message if we invoke it with -h. For example:

$ weka datagenerators.classifiers.regression.MexicanHat -h

Data Generator options:

-h
        Prints this help.
-o <file>
        The name of the output file, otherwise the generated data is
        printed to stdout.
-r <name>
        The name of the relation.
-d
        Whether to print debug informations.
-S
        The seed for random function (default 1)
-n <num>
        The number of examples to generate (default 100)
-A <num>
        The amplitude multiplier (default 1.0).
-R <num>..<num>
        The range x is randomly drawn from (default -10.0..10.0).
-N <num>
        The noise rate (default 0.0).
-V <num>
        The noise variance (default 1.0).

Now that’s usable. This, for example, is not a usable class:

$ weka filters.SimpleFilter -h
java.lang.ClassNotFoundException: -h
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:171)
        at weka.filters.Filter.main(Filter.java:1344)
-h

The following pipeline runs weka with every class in weka.jar and -h and saves the standard output and standard error to a file with the same name as the class:

$ unzip -l $WEKAPATH/weka.jar |
> sed -rne 's/.*(weka)\/([^g])([^$]*)\.class$/\2\3/p' |
> tr '/' '.' |
> parallel --timeout 1 -j4 -v "weka {} -h > {} 2>&1"

We now have 749 files. With the following command we save the filename of every file that does not contain the string Exception to weka.classes:

$ grep -L 'Exception' * | tee $WEKAPATH/weka.classes

This still comes down to 332 classes! Here are a few classes that might be of interest:

  • attributeSelection.PrincipalComponents
  • classifiers.bayes.NaiveBayes
  • classifiers.evaluation.ConfusionMatrix
  • classifiers.functions.SimpleLinearRegression
  • classifiers.meta.AdaBoostM1
  • classifiers.trees.RandomForest
  • clusterers.EM
  • filters.unsupervised.attribute.Normalize

As you can see, weka offers a whole range of classes and functionality.

9.4.2.3 Adding Tab Completion

At this moment, you still need to type in the entire class name yourself. You can add so-called tab completion by adding the following snippet to your .bashrc file after you export WEKAPATH:

_completeweka() {
  local curw=${COMP_WORDS[COMP_CWORD]}
  local wordlist=$(cat $WEKAPATH/weka.classes)
  COMPREPLY=($(compgen -W '${wordlist[@]}' -- "$curw"))
  return 0
}
complete -o nospace -F _completeweka weka

This function makes use of the weka.classes file we generated earlier. If you now type: weka clu<Tab><Tab><Tab> on the command line, you are presented with a list of all classes that have to do with clustering:

$ weka clusterers.
clusterers.CheckClusterer
clusterers.CLOPE
clusterers.ClusterEvaluation
clusterers.Cobweb
clusterers.DBSCAN
clusterers.EM
clusterers.FarthestFirst
clusterers.FilteredClusterer
clusterers.forOPTICSAndDBScan.OPTICS_GUI.OPTICS_Visualizer
clusterers.HierarchicalClusterer
clusterers.MakeDensityBasedClusterer
clusterers.OPTICS
clusterers.sIB
clusterers.SimpleKMeans
clusterers.XMeans

Creating the command-line tool weka and adding tab completion make Weka a little bit friendlier to use on the command line.

9.4.3 Converting between CSV and ARFF Data Formats

Weka uses ARFF as a file format. This is basically CSV with additional information about the columns. We’ll use two convenient command-line tools to convert between CSV and ARFF, namely csv2arff (see Example 9.1) and arff2csv (see Example 9.2).

Example 9.1 (Convert CSV to ARFF)

#!/usr/bin/env bash
weka core.converters.CSVLoader /dev/stdin

Example 9.2 (Convert ARFF to CSV)

#!/usr/bin/env bash
weka core.converters.CSVSaver -i /dev/stdin
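
As a quick sanity check, assuming you’ve saved both snippets as executable scripts named csv2arff and arff2csv somewhere on your PATH, a tiny round trip should give you back (roughly) the CSV you put in:

$ printf 'x,y\n1,2\n3,4\n' | csv2arff | arff2csv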

9.4.4 Comparing Three Cluster Algorithms

In order to cluster data using Weka, we need yet another command-line tool to help us. The AddCluster class is needed to assign data points to the learned clusters. Unfortunately, this class does not accept data from standard input, not even when we specify -i /dev/stdin, because it expects a file with the .arff extension. We consider this to be bad design. The command-line tool weka-cluster works around this by writing its input to a temporary ARFF file. Its source code is:

#!/usr/bin/env bash
ALGO="$@"
IN=$(mktemp --tmpdir weka-cluster-XXXXXXXX).arff

finish () {
        rm -f $IN
}
trap finish EXIT

csv2arff > $IN
weka filters.unsupervised.attribute.AddCluster -W "weka.${ALGO}" -i $IN \
-o /dev/stdout | arff2csv

Now we can apply the EM clustering algorithm and save the assignment as follows:

$ < wine-both-scaled.csv csvcut -C quality,type |
> weka-cluster clusterers.EM -N 5 |
> csvcut -c cluster > wine-both-cluster-em.csv

  • Use the scaled features, but exclude the quality and type columns from the clustering.
  • Apply the EM algorithm with five clusters using weka-cluster.
  • Only save the cluster assignment.
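
For reference, here’s a sketch of the analogous invocations for the other two algorithms (Cobweb determines a suitable number of clusters itself, so it takes no -N option):

$ < wine-both-scaled.csv csvcut -C quality,type |
> weka-cluster clusterers.SimpleKMeans -N 5 |
> csvcut -c cluster > wine-both-cluster-simplekmeans.csv
$ < wine-both-scaled.csv csvcut -C quality,type |
> weka-cluster clusterers.Cobweb |
> csvcut -c cluster > wine-both-cluster-cobweb.csv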

After running the same pipeline for the SimpleKMeans and Cobweb algorithms (as sketched above), we have three files with cluster assignments. Let’s create a t-SNE mapping in order to visualize them:

$ < wine-both-scaled.csv csvcut -C quality,type | body tapkee --method t-sne |
> header -r x,y > wine-both-xy.csv

Next, the cluster assignments are combined with the t-SNE mapping using paste and a scatter plot is created using Rio-scatter:

$ parallel -j1 "paste -d, wine-both-xy.csv wine-both-cluster-{}.csv | "\
> "Rio-scatter x y cluster | display" ::: em simplekmeans cobweb

Figure 9.3: EM


Figure 9.4: SimpleKMeans


Figure 9.5: Cobweb

Admittedly, we have gone through a lot of trouble taming Weka. The exercise was worth it, because some day you may run into a command-line tool that works differently from what you expect. Now you know that there are always ways to work around such command-line tools.

9.5 Regression with SciKit-Learn Laboratory

In this section, we’ll be predicting the quality of white wines based on their physicochemical properties. Because the quality is a number between 0 and 10, we can consider predicting it to be a regression task. Using the training data points, we’ll train three regression models using three different algorithms.

We’ll be using the SciKit-Learn Laboratory (or SKLL) package for this. If you’re not using the Data Science Toolbox, you can install SKLL using pip:

$ pip install skll

If you’re running Python 2.7, you also need to install the following packages:

$ pip install configparser futures logutils

9.5.1 Preparing the Data

SKLL expects that the train and test data have the same filenames, located in separate directories. However, in this example, we’re going to use cross-validation, meaning that we only need to specify a training data set. Cross-validation is a technique that splits up the whole data set into a certain number of subsets. These subsets are called folds. (Usually, five or ten folds are used.)

We need to add an identifier to each row so that we can easily identify the data points later (the predictions are not in the same order as the original data set):

$ mkdir train
$ < wine-white-clean.csv nl -s, -w1 -v0 | sed '1s/0,/id,/' > train/features.csv
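
To verify that the identifier was added correctly, you can peek at the first few column names; the first one should now be id:

$ head -n 1 train/features.csv | cut -d, -f1-3
id,fixed_acidity,volatile_acidity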

9.5.2 Running the Experiment

Create a configuration file called predict-quality.cfg:

[General]
experiment_name = Wine
task = cross_validate

[Input]
train_location = train
featuresets = [["features.csv"]]
learners = ["LinearRegression","GradientBoostingRegressor","RandomForestRegressor"]
label_col = quality

[Tuning]
grid_search = false
feature_scaling = both
objective = r2

[Output]
log = output
results = output
predictions = output

We run the experiment using the run_experiment command-line tool:

$ run_experiment -l predict-quality.cfg

The -l command-line argument indicates that we’re running in local mode. SKLL also offers the possibility to run experiments on clusters. The time it takes to run the experiment depends on the complexity of the chosen algorithms.

9.5.3 Parsing the Results

Once all the algorithms are done, the results can be found in the output directory:

$ cd output
$ ls -1
Wine_features.csv_GradientBoostingRegressor.log
Wine_features.csv_GradientBoostingRegressor.predictions
Wine_features.csv_GradientBoostingRegressor.results
Wine_features.csv_GradientBoostingRegressor.results.json
Wine_features.csv_LinearRegression.log
Wine_features.csv_LinearRegression.predictions
Wine_features.csv_LinearRegression.results
Wine_features.csv_LinearRegression.results.json
Wine_features.csv_RandomForestRegressor.log
Wine_features.csv_RandomForestRegressor.predictions
Wine_features.csv_RandomForestRegressor.results
Wine_features.csv_RandomForestRegressor.results.json
Wine_summary.tsv

SKLL generates four files for each learner: one log, two with results, and one with predictions. Moreover, SKLL generates a summary file, which contains a lot of information about each individual fold (too much to show here). We can extract the relevant metrics using the following SQL query:

$ < Wine_summary.tsv csvsql --query "SELECT learner_name, pearson FROM stdin "\
> "WHERE fold = 'average' ORDER BY pearson DESC" | csvlook
|----------------------------+----------------|
|  learner_name              | pearson        |
|----------------------------+----------------|
|  RandomForestRegressor     | 0.741860521533 |
|  GradientBoostingRegressor | 0.661957860603 |
|  LinearRegression          | 0.524144785555 |
|----------------------------+----------------|

The relevant column here is pearson, which indicates the Pearson correlation coefficient. This is a value between -1 and 1 that indicates how strongly the predicted quality scores correlate with the true ones. Let’s paste all the predictions back onto the data set:

$ parallel "csvjoin -c id train/features.csv <(< output/Wine_features.csv_{}"\
> ".predictions | tr '\t' ',') | csvcut -c id,quality,prediction > {}" ::: \
> RandomForestRegressor GradientBoostingRegressor LinearRegression
$ csvstack *Regres* -n learner --filenames > predictions.csv
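
Before plotting, it can be useful to compute a simple per-learner error metric such as the mean absolute error. Here’s a sketch using csvsql, which, as before, loads the piped-in data into a table named stdin:

$ < predictions.csv csvsql --query "SELECT learner, "\
> "ROUND(AVG(ABS(quality - prediction)), 3) AS mae "\
> "FROM stdin GROUP BY learner ORDER BY mae" | csvlook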

And create a plot using Rio:

$ < predictions.csv Rio -ge 'g+geom_point(aes(quality, round(prediction), '\
> 'color=learner), position="jitter", alpha=0.1) + facet_wrap(~ learner) + '\
> 'theme(aspect.ratio=1) + xlim(3,9) + ylim(3,9) + guides(colour=FALSE) + '\
> 'geom_smooth(aes(quality, prediction), method="lm", color="black") + '\
> 'ylab("prediction")' | display

9.6 Classification with BigML

In this fourth and last modeling section we’re going to classify wines as either red or white. For this we’ll be using a solution called BigML, which provides a prediction API. This means that the actual modeling and predicting takes place in the cloud, which is useful if you need a bit more power than your own computer can offer.

Although prediction APIs are relatively young, they are becoming more common, which is why we’ve included one in this chapter. Other providers of prediction APIs are Google (see https://developers.google.com/prediction) and PredictionIO (see http://prediction.io). One advantage of BigML is that they offer a convenient command-line tool called bigmler (BigML 2014) that interfaces with their API. We can use this command-line tool like any other presented in this book, but behind the scenes, our data set is being sent to BigML’s servers, which perform the classification and send back the results.

9.6.1 Creating Balanced Train and Test Data Sets

First, we create a balanced data set to ensure that both classes are represented equally. For this, we use csvstack (Groskopf 2014h), shuf (Eggert 2012), head, and csvcut:

$ csvstack -n type -g red,white wine-red-clean.csv \                   
> <(< wine-white-clean.csv body shuf | head -n 1600) |                 
> csvcut -c fixed_acidity,volatile_acidity,citric_acid,\               
> residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,\
> density,ph,sulphates,alcohol,type > wine-balanced.csv

This long command breaks down as follows:

  • csvstack is used to combine multiple data sets. It creates a new column type, which has the value red for all rows coming from the first file wine-red-clean.csv and white for all rows coming from the second file.
  • The second file is passed to csvstack using file redirection. This allows us to create a temporary file using shuf, which creates a random permutation of wine-white-clean.csv, and head, which selects the header and the first 1599 rows.
  • Finally, we reorder the columns of this data set using csvcut because by default, bigmler assumes that the last column is the label.

Let’s verify that wine-balanced.csv is actually balanced by counting the number of instances per class using parallel and grep:

$ parallel --tag grep -c {} wine-balanced.csv ::: red white
red      1599
white    1599

As you can see, the data set wine-balanced.csv contains both 1599 red and 1599 white wines. Next we split into train and test data sets using split (Granlund and Stallman 2012b):

$ < wine-balanced.csv header > wine-header.csv                   
$ tail -n +2 wine-balanced.csv | shuf | split -d -n r/2          
$ parallel --xapply "cat wine-header.csv x0{1} > wine-{2}.csv" \ 
> ::: 0 1 ::: train test

This is another long command that deserves to be broken down:

  • Get the header using header and save it to a temporary file named wine-header.csv
  • Mix up the red and white wines using tail and shuf and split it into two files named x00 and x01 using a round robin distribution.
  • Use cat to combine the header saved in wine-header.csv and the rows stored in x00 to save it as wine-train.csv; similarly for x01 and wine-test.csv. The --xapply command-line argument tells parallel to loop over the two input sources in tandem.

Let’s again check the number of instances per class in both wine-train.csv and wine-test.csv:

$ parallel --tag grep -c {2} wine-{1}.csv ::: train test ::: red white
train red       821
train white     778
test white      821
test red        778

It looks like our data sets are well balanced. We’re now ready to call the prediction API using bigmler.

9.6.2 Calling the API

You can obtain a BigML username and API key at https://bigml.com/developers. Be sure to set the variables BIGML_USERNAME and BIGML_API_KEY in .bashrc with the appropriate values.
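
The test file we’ll pass to bigmler, wine-test-blind.csv, is simply wine-test.csv with the type column (i.e., the label) removed. A minimal sketch to create it, using csvcut as earlier in this chapter:

$ < wine-test.csv csvcut -C type > wine-test-blind.csv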

The API call is quite straightforward, and the meaning of each command-line argument is obvious from its name.

$ bigmler --train wine-train.csv \
> --test wine-test-blind.csv \
> --prediction-info full \
> --prediction-header \
> --output-dir output \
> --tag wine \
> --remote

Recall that wine-test-blind.csv is just wine-test.csv with the type column (i.e., the label) removed. After this call is finished, the results can be found in the output directory:

$ tree output
output
├── batch_prediction
├── bigmler_sessions
├── dataset
├── dataset_test
├── models
├── predictions.csv
├── source
└── source_test

0 directories, 8 files

9.6.3 Inspecting the Results

The file which is of most interest is output/predictions.csv:

$ csvcut output/predictions.csv -c type | head
type
white
white
red
red
white
red
red
white
red

We can compare these predicted labels with the labels in our test data set. Let’s count the number of misclassifications:

$ paste -d, <(csvcut -c type wine-test.csv) \
> <(csvcut -c type output/predictions.csv) |
> awk -F, '{ if ($1 != $2) {sum+=1 } } END { print sum }' 
766
  • First, we combine the type columns of both wine-test.csv and output/predictions.csv.
  • Then, we use awk to keep count of when the two columns differ in value.

As you can see, BigML’s API misclassified 766 wines out of 1599. This isn’t a good result, but please note that we just blindly applied an algorithm to a data set, which we normally wouldn’t do.
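
If you’d rather see the error rate than the raw count, the same approach can be extended slightly; here’s a sketch in which the NR > 1 condition skips the header row:

$ paste -d, <(csvcut -c type wine-test.csv) \
> <(csvcut -c type output/predictions.csv) |
> awk -F, 'NR > 1 { n++; if ($1 != $2) w++ } END { print w / n }'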

9.6.4 Conclusion

BigML’s prediction API has proven to be easy to use. As with many of the command-line tools discussed in this book, we’ve barely scratched the surface with BigML. For completeness, we should mention that:

  • BigML’s command-line tool also allows for local computations, which is useful for debugging.
  • Results can also be inspected using BigML’s web interface.
  • BigML can also perform regression tasks.

Please see https://bigml.com/developers for a complete overview of BigML’s features.

Although we’ve only been able to experiment with one prediction API, we do believe that prediction APIs in general are worthwhile to consider for doing data science.

9.7 Further Reading

  • Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4). Elsevier:547–53.
  • Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. “The WEKA Data Mining Software: An Update.” SIGKDD Explorations 11 (1). ACM.
  • Pearson, K. 1901. “On Lines and Planes of Closest Fit to Systems of Points in Space.” Philosophical Magazine 2 (11):559–72.
  • Maaten, Laurens van der, and Geoffrey Everest Hinton. 2008. “Visualizing Data Using T-SNE.” Journal of Machine Learning Research 9:2579–2605.