Chapter 10 Conclusion

In this final chapter, the book comes to a close. We'll first recap what we discussed in the previous nine chapters, then offer you three pieces of advice and point you to resources for further exploring the topics we touched upon. Finally, in case you have any questions, comments, or new command-line tools to share, we provide a few ways to get in touch.

10.1 Let’s Recap

This book explored the power of employing the command line to perform data science tasks. It is an interesting observation that the challenges posed by this relatively young field can be tackled by such a time-tested technology. It is our hope that you now see what the command line is capable of. Its many tools offer all sorts of possibilities that are well suited to the variety of tasks that make up data science.

There are many definitions of data science available. In Chapter 1, we introduced the OSEMN model as defined by Mason and Wiggins, because it is a practical one that translates into very specific tasks. The acronym OSEMN stands for obtaining, scrubbing, exploring, modeling, and interpreting data. Chapter 1 also explained why the command line is very suitable for doing these data science tasks.

In Chapter 2, we explained how you can set up your own Data Science Toolbox and install the bundle that is associated with this book. Chapter 2 also provided an introduction to the essential tools and concepts of the command line.

The OSEMN model chapters—Chapter 3 (obtaining), Chapter 5 (scrubbing), Chapter 7 (exploring), and Chapter 9 (modeling)—focused on performing those practical tasks using the command line. We haven’t devoted a chapter to the fifth step, interpreting data, because, quite frankly, the computer, let alone the command line, is of very little use here. We have, however, provided some pointers for further reading on this topic.

In the three intermezzo chapters, we looked at some broader topics of doing data science at the command line, topics that are not specific to any one step. In Chapter 4, we explained how you can turn one-liners and existing code into reusable command-line tools. In Chapter 6, we described how you can manage your data workflow using a command-line tool called Drake. In Chapter 8, we demonstrated how ordinary command-line tools and pipelines can be run in parallel using GNU Parallel. These topics can be applied at any point in your data workflow.

It is impossible to demonstrate all the command-line tools that are available and relevant for doing data science. New command-line tools are created on a daily basis. As you may have come to understand by now, this book is more about the idea of using the command line than about giving you an exhaustive list of tools.

10.2 Three Pieces of Advice

You probably spent quite some time reading these chapters and perhaps also following along with the code examples. In the hope that it maximizes the return on this investment and increases the probability that you’ll continue to incorporate the command line into your data science workflow, we would like to offer you three pieces of advice: (1) be patient, (2) be creative, and (3) be practical. In the next three subsections we elaborate on each piece of advice.

10.2.1 Be Patient

The first piece of advice that we can give is to be patient. Working with data on the command line is different from using a programming language, and therefore it requires a different mindset.

Moreover, the command-line tools themselves are not without their quirks and inconsistencies. This is partly because they have been developed by many different people over the course of multiple decades. If you ever find yourself at a loss regarding their mind-dazzling options, don't forget to use --help, man, or your favorite search engine to learn more.
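For instance, most GNU tools print a usage summary when invoked with --help, and man shows the full manual page. A quick illustrative session, using grep purely as an arbitrary example:

  $ grep --help | head -n 5    # print the first lines of grep's usage summary
  $ man grep                   # open the full manual page for grep
  $ man -k csv                 # search all manual page descriptions for a keyword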

Still, especially in the beginning, it can be a frustrating experience. Trust us, you will become more proficient as you practice using the command line and its tools. The command line has been around for many decades, and will be around for many more to come. It is a worthwhile investment.

10.2.2 Be Creative

The second, related piece of advice is to be creative. The command line is very flexible. By combining command-line tools, you can accomplish more than you might think.
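To illustrate, consider the classic pipeline below, which chains a handful of small, general-purpose tools to list the ten most frequent words in a text file. None of these tools was designed for this particular task, yet together they solve it in a single line (data.txt is just a placeholder filename):

  $ tr -cs '[:alpha:]' '\n' < data.txt |  # split the text into one word per line
  > tr '[:upper:]' '[:lower:]' |          # normalize everything to lowercase
  > sort | uniq -c |                      # count the occurrences of each word
  > sort -rn | head -n 10                 # print the ten most frequent words

Each tool does one small thing well; it is the pipe that makes the whole greater than the sum of its parts.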

We encourage you not to fall back immediately on your programming language. And when you do have to use a programming language, think about whether the code can be generalized or reused in some way. If so, consider creating your own command-line tool with that code using the steps we discussed in Chapter 4. If you believe your command-line tool may be beneficial for others, you could even go one step further by making it open source.
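As a minimal sketch of those steps, suppose you want to reuse the word-count pipeline shown earlier. You could save it as a hypothetical script named top-words (both the name and the contents are merely an illustration):

  #!/usr/bin/env bash
  # top-words: print the NUM most frequent words read from standard input
  NUM="${1:-10}"    # default to the top 10 words when no argument is given
  tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -rn | head -n "$NUM"

After running chmod +x top-words and moving the script to a directory on your PATH (for example ~/bin), you can combine it with other tools just like any built-in command: top-words 5 < data.txt.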

10.2.3 Be Practical

The third piece of advice is to be practical. Being practical is related to being creative, but it deserves a separate explanation. In the previous subsection, we mentioned that you should not immediately fall back on a programming language. Of course, the command line has its limits. Throughout the book, we have emphasized that the command line should be regarded as a companion approach to doing data science.

We’ve discussed four steps for doing data science at the command line. In practice, the command line lends itself more readily to step 1, obtaining data, than to step 4, modeling data. You should use whatever approach works best for the task at hand. And it’s perfectly fine to mix and match approaches at any point in your workflow. The command line integrates wonderfully with other approaches, programming languages, and statistical environments. There’s a certain trade-off with each approach, and part of becoming proficient at the command line is learning when to use which.
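For example, nothing stops you from preparing data with classic Unix tools and handing the intermediate result to a Python script halfway through a pipeline. In the sketch below, sales.csv and summarize.py are hypothetical placeholders:

  $ cut -d, -f2,3 sales.csv |   # select two columns with a classic Unix tool
  > python summarize.py |       # hand the rows to a Python script for modeling
  > sort -rn | head -n 3        # post-process the script's output in the shell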

In conclusion, when you’re patient, creative, and practical, the command line will make you a more efficient and productive data scientist.

10.3 Where to Go from Here?

As this book sits at the intersection of the command line and data science, many related topics have only been touched upon. Now, it’s up to you to explore these topics further. The following subsections provide a list of topics and suggested resources to consult.

10.3.1 APIs

  • Russell, Matthew. 2013. Mining the Social Web. 2nd Ed. O’Reilly Media.
  • Warden, Pete. 2011. Data Source Handbook. O’Reilly Media.

10.3.2 Shell Programming

  • Winterbottom, David. 2014. “Commandlinefu.com.” http://www.commandlinefu.com.
  • Peek, Jerry, Shelley Powers, Tim O’Reilly, and Mike Loukides. 2002. Unix Power Tools. 3rd Ed. O’Reilly Media.
  • Goyvaerts, Jan, and Steven Levithan. 2012. Regular Expressions Cookbook. 2nd Ed. O’Reilly Media.
  • Cooper, Mendel. 2014. “Advanced Bash-Scripting Guide.” http://www.tldp.org/LDP/abs/html.
  • Robbins, Arnold, and Nelson H. F. Beebe. 2005. Classic Shell Scripting. O’Reilly Media.

10.3.3 Python, R, and SQL

  • Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer.
  • McKinney, Wes. 2012. Python for Data Analysis. O’Reilly Media.
  • Rossant, Cyrille. 2013. Learning IPython for Interactive Computing and Data Visualization. Packt Publishing.

10.3.4 Interpreting Data

  • Shron, Max. 2014. Thinking with Data. O’Reilly Media.
  • Patil, DJ. 2012. Data Jujitsu. O’Reilly Media.

10.4 Getting in Touch

This book would not have been possible without the many people who created the command line and the numerous command-line tools. It’s safe to say that the current ecosystem of command-line tools for data science is a community effort. We have only been able to give you a glimpse of the many command-line tools available. New ones are created every day, and perhaps some day you will create one yourself. In that case, we would love to hear from you. We’d also appreciate it if you would drop us a line whenever you have a question, comment, or suggestion. There are a couple of ways to get in touch: