Chapter 9 TODO: Use rich and tidyverse in this example
You can sum the counts in each CSV file using Rio and the aggregate function in R:
$ cat *.csv | header -a borough,count |
> Rio -e 'aggregate(count ~ borough, df, sum)' |
> csvsort -rc count | csvlook
|----------------|--------|
| borough        | count  |
|----------------|--------|
| unspecified    | 467    |
| manhattan      | 274    |
| brooklyn       | 103    |
| queens         | 77     |
| bronx          | 44     |
| staten_island  | 35     |
|----------------|--------|
Or, if you prefer to use SQL to aggregate results, you can use csvsql, as discussed in Chapter 5:
$ cat *.csv | header -a borough,count |
> csvsql --query 'SELECT borough, SUM(count) AS count FROM stdin '\
> 'GROUP BY borough ORDER BY count DESC' | csvlook
|----------------|--------|
| borough        | count  |
|----------------|--------|
| unspecified    | 467    |
| manhattan      | 274    |
| brooklyn       | 103    |
| queens         | 77     |
| bronx          | 44     |
| staten_island  | 35     |
|----------------|--------|
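If neither R nor csvkit is at hand, the same group-and-sum step can be sketched with awk and sort alone. This is only an illustration: the two input files and their counts are made up and are not the chapter's data, and the sketch assumes headerless two-column files whose fields contain no quoted commas.

```shell
# Hypothetical sample data standing in for the per-file counts.
tmp=$(mktemp -d)
printf 'manhattan,3\nbrooklyn,1\n' > "$tmp/a.csv"
printf 'manhattan,2\nqueens,4\n'   > "$tmp/b.csv"

# Sum column 2 per distinct value of column 1,
# then sort by the sum (descending), breaking ties by name.
result=$(cat "$tmp"/*.csv |
  awk -F, '{ sum[$1] += $2 } END { for (b in sum) print b "," sum[b] }' |
  sort -t, -k2,2nr -k1,1)

echo "borough,count"
echo "$result"

rm -r "$tmp"
```

The awk array plays the role of GROUP BY, and the final sort mirrors ORDER BY count DESC.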
9.1 Discussion
As data scientists, we work with data, and sometimes a lot of data.
This means that sometimes you need to run a command multiple times or distribute data-intensive commands over multiple cores.
In this chapter I have shown you how easy it is to parallelize commands. parallel is a very powerful and flexible tool to speed up ordinary command-line tools and distribute them over multiple cores and remote machines. It offers a lot of functionality, and in this chapter I’ve only been able to scratch the surface. Some features of parallel that I haven’t covered are:
- Specifying input in different ways
- Keeping a log of all the jobs
- Starting new jobs only when the machine is under a certain load
- Timing out, resuming, and retrying jobs
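As a rough pointer to where these features live, the sketch below combines a few of them in one invocation. It assumes GNU parallel is installed and does nothing otherwise. The echo job and the file names a.csv, b.csv, and c.csv are placeholders, not commands from this chapter.

```shell
# Sketch only: requires GNU parallel; the job and its arguments are placeholders.
log=$(mktemp)   # the job log is written here; delete it when you are done
out=""
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # :::          passes input as arguments on the command line
  #              (input can also come from stdin, or from a file via ::::)
  # --joblog     writes a header plus one line per job (runtime, exit value, command)
  # --timeout    kills a job that runs longer than the given number of seconds
  # --retries    reruns a failing job up to the given number of times
  # --keep-order prints output in the order of the input
  # Not used here: --load (hold back new jobs while the machine is busy)
  # and --resume (skip jobs already recorded in the job log).
  out=$(parallel --keep-order --joblog "$log" --timeout 10 --retries 2 \
    echo "processed {}" ::: a.csv b.csv c.csv)
  echo "$out"
  cat "$log"
fi
```

The exact semantics of each option are described in the parallel manual, so treat the comments here as a summary rather than a specification.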
Once you have a basic understanding of parallel and its most important options, I recommend that you take a look at its tutorial listed in the Further Reading section.
9.2 Further Reading
- Tange, Ole. 2011. “GNU Parallel - the Command-Line Power Tool.” ;login: The USENIX Magazine 36 (1). Frederiksberg, Denmark: 42–47. http://www.gnu.org/s/parallel.
- Tange, Ole. 2014. “GNU Parallel.” http://www.gnu.org/software/parallel.
- Amazon Web Services. 2014. “AWS Command Line Interface.” http://aws.amazon.com/cli.