How to Clean Data at the Command Line
Cleaning data is a familiar step that lets us explore data and see beyond its raw form. Many tools can do the job, but there is a problem.
The data-driven problem we face
Whenever you want to import a CSV file, you go to Google out of habit to look up the two lines you always forget (in Python, for example). Then you open your text editor, create a file, and paste in what you found.
Why the command line?
Even the simplest data cleaning tasks can feel frustrating or like a waste of time. Maybe you use a higher-level library like Pandas, but I bet you still write more code than you would in the terminal, where several lines of code can often be packed into a single one-liner.
This ebook makes dealing with CSV files, JSON, or in general any text file much easier.
What's in it for you?
In this ebook, I'm trying to save you time and the hassle of dealing with files at the system level. You may also enjoy the adventure of exploring command-line tools and programs that you may not have heard of. I encourage you to try these tools, as I do on my workdays.
While dealing with the command line may sound a bit geeky, this ebook is simple and easy to follow, and it's a lot of fun.
There are real examples from a scientific paper, COVID tracking project data, Reddit user data, and more. You can practice with them and try useful programs and tools from the comfort of your command line.
Content
In this ebook, you'll clean data using the command-line tools tr, grep, sort, uniq, awk, sed, and csvlook. You'll also practice cleaning a COVID-19 CSV file with the command-line programs csvkit and xsv, comparing the performance of each.
You'll also see how to sort and concatenate a large CSV file with csvkit and xsv, and how their performance compares with Pandas.
In the last chapter, you'll get to know how to clean a JSON file using the command-line program jq.
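To give a taste of the kind of one-liner the first chapter builds toward, here is a minimal sketch that chains a few of the basic tools named above (tr, sort, uniq, and awk). The sample data is made up for illustration and is not taken from the ebook; the pipeline counts how often each value appears after normalizing its case.

```shell
# Hypothetical sample data, invented for this sketch
sample='Apple
banana
apple
Cherry
banana
apple'

# 1. tr lowercases everything so "Apple" and "apple" are treated as equal
# 2. sort groups identical lines together (uniq only merges adjacent lines)
# 3. uniq -c prefixes each distinct line with its count
# 4. sort -rn ranks the lines by count, highest first
# 5. awk prints "value,count" and drops the padding uniq adds
counts=$(printf '%s\n' "$sample" \
  | tr '[:upper:]' '[:lower:]' \
  | sort \
  | uniq -c \
  | sort -rn \
  | awk '{ print $2 "," $1 }')

printf '%s\n' "$counts"
```

With a real file, you would feed it to the first command (for example with input redirection) instead of using the inline sample.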
Read this before you buy
The content is a curated collection of blog posts I published on my personal site, distributed among the book's chapters:
Chapter 1: https://www.ezzeddinabdullah.com/posts/how-to-clean-text-data-at-the-command-line
Chapter 2: https://www.ezzeddinabdullah.com/posts/how-to-clean-csv-data-at-the-command-line
Chapter 3: https://www.ezzeddinabdullah.com/posts/how-to-clean-csv-data-at-the-command-line-part-2
Chapter 4: https://www.ezzeddinabdullah.com/posts/how-to-clean-json-data-at-the-command-line
What makes the ebook different from these blog posts?
I've fixed some benchmark results and some of the commands used, added syntax highlighting to the code, and removed calls to action from inside the chapters. Also, the format is PDF, so you get all four chapters in one package and can read it on your laptop or phone.
Buying this ebook will encourage me to publish more, whether ebooks or courses.
Screenshot
Here is a screenshot of how it looks (with a sample of the syntax-highlighted code snippets):
Who is this for?
If you are a data scientist, data engineer, data analyst, or software developer, or you work with data files a lot (such as TXT, CSV, or JSON), this ebook is for you.