Chapter 11 Conclusion
In this final chapter, the book comes to a close. I’ll first recap what I’ve discussed in the previous nine chapters, and will then offer you three pieces of advice and provide some resources to further explore the related topics we touched upon. Finally, in case you have any questions, comments, or new command-line tools to share, I provide a few ways to get in touch with me.
11.1 Let’s Recap
This book explored the power of using the command line to perform data science tasks. It is an interesting observation that the challenges posed by this relatively young field can be tackled by such a time-tested technology. It is our hope that you now see what the command line is capable of. The many command-line tools offer all sorts of possibilities that are well suited to the variety of tasks encompassing data science.
There are many definitions for data science available. In Chapter 1, I introduced the OSEMN model as defined by Mason and Wiggens, because it is a very practical one that translates to very specific tasks. The acronym OSEMN stands for obtaining, scrubbing, exploring, modeling, and interpreting data. Chapter 1 also explained why the command line is very suitable for doing these data science tasks.
In Chapter 2, I explained how you can set up your own Data Science Toolbox and install the bundle that is associated with this book. Chapter 2 also provided an introduction to the essential tools and concepts of the command line.
The four OSEMN model chapters focused on performing those practical tasks using the command line. I haven’t devoted a chapter to the fifth step, interpreting data, because, quite frankly, the computer, let alone the command line, is of very little use here. I have, however, provided some pointers for further reading on this topic.
In the three intermezzo chapters, we looked at some broader topics of doing data science at the command line, topics which are not really specific to one particular step. In Chapter 4, I explained how you can turn one-liners and existing code into reusable command-line tools. In Chapter 6, I described how you can manage your data workflow using a command-line tool called Drake. In Chapter 8, I demonstrated how ordinary command-line tools and pipelines can be run in parallel using GNU Parallel. These topics can be applied at any point in your data workflow.
It is impossible to demonstrate all command-line tools that are available and relevant for doing data science. New command-line tools are created on a daily basis. As you may have come to understand by now, this book is more about the idea of using the command line, rather than giving you an exhaustive list of tools.
11.2 Three Pieces of Advice
You probably spent quite some time reading these chapters and perhaps also following along with the code examples. In the hope that it maximizes the return on this investment and increases the probability that you’ll continue to incorporate the command line into your data science workflow, I would like to offer you three pieces of advice: (1) be patient, (2) be creative, and (3) be practical. In the next three subsections I elaborate on each piece of advice.
11.2.1 Be Patient
The first piece of advice that I can give is to be patient. Working with data on the command line is different from using a programming language, and therefore it requires a different mindset.
Moreover, the command-line tools themselves are not without their quirks and inconsistencies. This is partly because they have been developed by many different people, over the course of multiple decades. If you ever find yourself at a loss regarding their mind-dazzling options, don’t forget to use –help, man, or your favorite search engine to learn more.
Still, especially in the beginning, it can be a frustrating experience. Trust us, you will become more proficient as you practice using the command line and its tools. The command line has been around for many decades, and will be around for many more to come. It is a worthwhile investment.
11.2.2 Be Creative
The second, related piece of advice is to be creative. The command line is very flexible. By combining the command-line tools, you can accomplish more than you might think.
I encourage you to not immediately fall back onto your programming language. And when you do have to use a programming language, think about whether the code can be generalized or reused in some way. If so, consider creating your own command-line tool with that code using the steps I discussed in Chapter 4. If you believe your command-line tool may be beneficial for others, you could even go one step further by making it open source.
11.2.3 Be Practical
The third piece of advice is to be practical. Being practical is related to being creative, but deserves a separate explanation. In the previous subsection, I mentioned that you should not immediately fall back to a programming language. Of course, the command line has its limits. Throughout the book, I have emphasized that the command line should be regarded as a companion approach to doing data science.
I’ve discussed four steps for doing data science at the command line. In practice, the applicability of the command line is higher for step 1 than it is for step 4. You should use whatever approach works best for the task at hand. And it’s perfectly fine to mix and match approaches at any point in your workflow. The command line is wonderful at being integrated with other approaches, programming languages, and statistical environments. There’s a certain trade-off with each approach, and part of becoming proficient at the command line is to learn when to use which.
In conclusion, when you’re patient, creative, and practical, the command line will make you a more efficient and productive data scientist.
11.3 Where To Go From Here?
As this book is on the intersection of the command line and data science, many related topics have only been touched upon. Now, it’s up to you to further explore these topics. The following subsections provide a list of topics and suggested resources to consult.
- Russell, Matthew. 2013. Mining the Social Web. 2nd Ed. O’Reilly Media.
- Warden, Pete. 2011. Data Source Handbook. O’Reilly Media.
11.3.2 Shell Programming
- Winterbottom, David. 2014. “Commandlinefu.com.” http://www.commandlinefu.com.
- Peek, Jerry, Shelley Powers, Tim O’Reilly, and Mike Loukides. 2002. Unix Power Tools. 3rd Ed. O’Reilly Media.
- Goyvaerts, Jan, and Steven Levithan. 2012. Regular Expressions Cookbook. 2nd Ed. O’Reilly Media.
- Cooper, Mendel. 2014. “Advanced Bash-Scripting Guide.” http://www.tldp.org/LDP/abs/html.
- Robbins, Arnold, and Nelson H. F. Beebe. 2005. Classic Shell Scripting. O’Reilly Media.
11.3.3 Python, R, and SQL
- Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer.
- McKinney, Wes. 2012. Python for Data Analysis. O’Reilly Media.
- Rossant, Cyrille. 2013. Learning Ipython for Interactive Computing and Data Visualization. Packt Publishing.
11.3.4 Interpreting Data
- Shron, Max. 2014. Thinking with Data. O’Reilly Media.
- Patil, DJ. 2012. Data Jujitsu. O’Reilly Media.
11.4 Getting in Touch
This book would not have been possible without the many people who created the command line and the numerous command-line tools. It’s safe to say that the current ecosystem of command-line tools for data science is a community effort. I have only been able to give you a glimpse of the many command-line tools available. New ones are created everyday, and perhaps some day you will create one yourself. In that case, I would love to hear from you. I’d also appreciate it if you would drop us a line whenever you have a question, comment, or suggestion. There are a couple of ways to get in touch:
- Email: firstname.lastname@example.org
- Twitter: @jeroenhjanssens
- Book website: https://datascienceatthecommandline.com/
- Book GitHub repository: https://github.com/jeroenjanssens/data-science-at-the-command-line
Cooper, Mendel. 2014. “Advanced Bash-Scripting Guide.” http://www.tldp.org/LDP/abs/html.
Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4). Elsevier: 547–53.
Docopt. 2014. “Command-Line Interface Description Language.” http://docopt.org.
Dolan, Stephen. 2014. “Jq.” http://stedolan.github.com/jq.
Dougherty, Dale, and Arnold Robbins. 1997. Sed & Awk. 2nd Ed. O’Reilly Media.
Drepper, Ulrich. 2012. “Seq.” http://www.gnu.org/software/coreutils.
Eaton, John W., and Colin Watson. 2014. “Man.”
Factual. 2014. “Drake.” https://github.com/Factual/drake.
Goyvaerts, Jan, and Steven Levithan. 2012. Regular Expressions Cookbook. 2nd Ed. O’Reilly Media.
Granlund, Torbjorn, David MacKenzie, and Jim Meyering. 2012. “Cp.” http://www.gnu.org/software/coreutils.
Granlund, Torbjorn, and Richard M. Stallman. 2012. “Cat.” http://www.gnu.org/software/coreutils.
Groskopf, Christopher. 2014. “Csvkit.” http://csvkit.readthedocs.org.
Haertel, Mike, and Paul Eggert. 2012. “Sort.” http://www.gnu.org/software/coreutils.
Heddings, Lowell. 2006. “Keyboard Shortcuts for Bash.” http://www.howtogeek.com/howto/ubuntu/keyboard-shortcuts-for-bash-command-shell-for-ubuntu-debian-suse-redhat-linux-etc.
Ihnat, David M., and David MacKenzie. 2012. “Paste.” http://www.gnu.org/software/coreutils.
Janssens, Jeroen H.M. 2014. “Rio.” https://github.com/jeroenjanssens/data-science-at-the-command-line.
Kogan, Dima. 2014. “Feedgnuplot.” http://search.cpan.org/perldoc?feedgnuplot.
MacKenzie, David. 2012. “Mkdir.” http://www.gnu.org/software/coreutils.
MacKenzie, David, and Jim Meyering. 2012. “Head.” http://www.gnu.org/software/coreutils.
Mason, Hilary, and Chris H. Wiggins. 2010. “A Taxonomy of Data Science.” http://www.dataists.com/2010/09/a-taxonomy-of-data-science.
Meyering, Jim. 2012a. “Grep.” http://www.gnu.org/software/grep.
———. 2012b. “Pwd.” http://www.gnu.org/software/coreutils.
Molinaro, Anthony. 2005. SQL Cookbook. O’Reilly Media.
Nelson, Philip A. 2006. “Bc.” http://www.gnu.org/software/bc.
Parker, Mike, David MacKenzie, and Jim Meyering. 2012. “Mv.” http://www.gnu.org/software/coreutils.
Patil, DJ. 2012. Data Jujitsu. O’Reilly Media.
Peek, Jerry, Shelley Powers, Tim O’Reilly, and Mike Loukides. 2002. Unix Power Tools. 3rd Ed. O’Reilly Media.
Raymond, Eric Steven. 2014. “Basics of the Unix Philosophy.” http://www.faqs.org/docs/artu/ch01s06.html.
Rossant, Cyrille. 2013. Learning Ipython for Interactive Computing and Data Visualization. Packt Publishing.
Rubin, Paul, and David MacKenzie. 2012. “Wc.” http://www.gnu.org/software/coreutils.
Rubin, Paul, David MacKenzie, Richard M. Stallman, and Jim Meyering. 2012. “Rm.” http://www.gnu.org/software/coreutils.
Russell, Matthew. 2013. Mining the Social Web. 2nd Ed. O’Reilly Media.
Schutt, Rachel, and Cathy O’Neil. 2013. Doing Data Science. O’Reilly Media.
Shron, Max. 2014. Thinking with Data. O’Reilly Media.
Stallman, Richard M., and David MacKenzie. 2012. “Ls.” http://www.gnu.org/software/coreutils.
Tange, O. 2011. “GNU Parallel - the Command-Line Power Tool.”;Login: The USENIX Magazine 36 (1). Frederiksberg, Denmark: 42–47. http://www.gnu.org/s/parallel.
Tange, Ole. 2014. “GNU Parallel.” http://www.gnu.org/software/parallel.
Torvalds, Linus, and Junio C. Hamano. 2014. “Git.” http://git-scm.com.
Warden, Pete. 2011. Data Source Handbook. O’Reilly Media.
Wikipedia. 2014. “List of Http Status Codes.” http://en.wikipedia.org/wiki/List_of_HTTP_status_codes.
Winterbottom, David. 2014. “Commandlinefu.com.” http://www.commandlinefu.com.
Wirzenius, Lars. 2013. “Writing Manual Pages.” http://liw.fi/manpages/.