You can sum the counts in each CSV file using Rio and the aggregate function in R:

$ cat *.csv | header -a borough,count |
> Rio -e 'aggregate(count ~ borough, df, sum)' |
> csvsort -rc count | csvlook
|----------------|--------|
|  borough       | count  |
|----------------|--------|
|  unspecified   | 467    |
|  manhattan     | 274    |
|  brooklyn      | 103    |
|  queens        | 77     |
|  bronx         | 44     |
|  staten_island | 35     |
|----------------|--------|
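
If you'd rather not depend on R for such a simple aggregation, the same result can be obtained with awk. This is a minimal sketch, assuming (as in the example above) that each CSV file contains headerless borough,count pairs:

$ cat *.csv |
> awk -F, '{ sum[$1] += $2 } END { for (b in sum) print b "," sum[b] }' |
> sort -t, -k2,2 -rn | header -a borough,count | csvlook

The sort step is needed because awk's for (b in sum) loop visits the keys in an unspecified order.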

Or, if you prefer to use SQL to aggregate results, you can use csvsql as discussed in Chapter 5:

$ cat *.csv | header -a borough,count |
> csvsql --query 'SELECT borough, SUM(count) AS count FROM stdin '\
> 'GROUP BY borough ORDER BY count DESC' | csvlook
|----------------|--------|
|  borough       | count  |
|----------------|--------|
|  unspecified   | 467    |
|  manhattan     | 274    |
|  brooklyn      | 103    |
|  queens        | 77     |
|  bronx         | 44     |
|  staten_island | 35     |
|----------------|--------|
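
Keep in mind that csvsql answers the query by first loading the data into a database (an in-memory SQLite database, unless you specify otherwise with --db) and then executing the SQL there. This is convenient for moderately sized data, but for very large inputs the loading step can become a bottleneck.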

9.1 Discussion

As data scientists, we work with data, and sometimes a lot of it. This means that you sometimes need to run a command multiple times or distribute data-intensive commands over multiple cores. In this chapter I have shown you how easy it is to parallelize commands. parallel is a very powerful and flexible tool for speeding up ordinary command-line tools and distributing them over multiple cores and remote machines. It offers a lot of functionality, and in this chapter I've only been able to scratch the surface. Some features of parallel that I haven't covered are (a small sketch illustrating a few of them follows the list):

  • Different ways of specifying input
  • Keeping a log of all the jobs
  • Only starting new jobs when the machine is below a certain load
  • Timing out, resuming, and retrying jobs
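
The following is a minimal sketch rather than an example from this chapter's data: the flags are documented parallel options (--joblog, --load, --timeout, and --retries), while the input (seq 100) and the command are placeholders. It logs every job to jobs.log, only starts new jobs while the system load is below 90%, kills jobs that run longer than 60 seconds, and retries failing jobs up to three times:

$ seq 100 | parallel --joblog jobs.log --load 90% --timeout 60 --retries 3 \
> 'echo Processing {}; sleep 1'

If such a run is interrupted, rerunning the same command with --resume added makes parallel consult the job log and skip the jobs that already finished.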

Once you have a basic understanding of parallel and its most important options, I recommend that you take a look at its tutorial, which is listed in the Further Reading section.

9.2 Further Reading