Experimental Web Scraping for Data Science
Who is this ebook for?
This ebook is for people who are new to web scraping and want a light, practical way to learn the basics for real-world applications.
It's important to note that this ebook is designed as a practical introduction and is not intended to be a comprehensive guide to web scraping, at least not in this first edition.
Although this is an introductory ebook, it's also a good starting point if you're a data analyst or a data scientist. That's because it isn't only about getting data from the web: I'll also show how to store that data in a data frame and then export it to whatever format you want, whether that's CSV or JSON. You can find more about that in the projects section, which is the highlight of this ebook.
This ebook is also for you if you're a marketer, a new startup with limited API access, or a fresh graduate aspiring to learn web scraping.
Many people have worked with the web for a while and copied and pasted information manually, but have not yet found a good way to automate the process. If that sounds like you, then this ebook is definitely for you.
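As a minimal sketch of that storing and exporting step, assuming the pandas library is installed, the snippet below puts a few records into a data frame and exports them as CSV and JSON text. The records and their field names (`title`, `price`) are made up for illustration; in a real project they would come from a scraped page.

```python
import json

import pandas as pd

# Hypothetical scraped records; real values would come from a parsed page.
records = [
    {"title": "Book A", "price": 12.5},
    {"title": "Book B", "price": 8.0},
]

# Store the records in a data frame, one row per record.
df = pd.DataFrame(records)

# With no file path given, to_csv and to_json return the output as a string.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

print(csv_text)
print(json.loads(json_text))
```

To write files instead of strings, you could pass a path of your choice as the first argument, for example `df.to_csv("books.csv", index=False)`, where `books.csv` is only a placeholder name.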
What's in it for you?
In this ebook, I'll cover the following topics:
- What is web scraping?
- How to parse an HTML web page with Beautiful Soup?
- How to find HTML tag(s) with or without attributes?
- How to extract links and multiple tags?
- How to go up the DOM tree?
- How to go sideways in the tree?
- How to deal with multiple pages?
- How to store the web data in a data frame?
- How to export your data to CSV or JSON?
- How to use an API to get data?
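To give a taste of a few of these topics, here is a minimal sketch, assuming the `beautifulsoup4` package is installed. The HTML snippet, its tag names, and its class names are invented for illustration and do not come from a real site. It parses the page, finds a tag by attribute, extracts links, and moves up and sideways in the tree.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML page used only for this illustration.
html = """
<html><body>
<div id="stories">
<h2 class="title">First story</h2>
<p class="summary">Summary of the first story.</p>
<a href="/story/1">Read more</a>
<a href="/story/2">Next story</a>
</div>
</body></html>
"""

# Parse the page with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Find a single tag by name and class attribute.
heading = soup.find("h2", class_="title")

# Extract the href attribute from every link on the page.
links = [a["href"] for a in soup.find_all("a")]

# Go up the tree: the heading's parent is the surrounding div.
container = heading.parent

# Go sideways: the next <p> sibling after the heading.
summary = heading.find_next_sibling("p")

print(heading.get_text())
print(links)
print(container.get("id"))
print(summary.get_text())
```

Each of these calls gets a fuller treatment in its own chapter; the point here is only that a handful of lines is enough to navigate a page.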
4 Reasons why web scraping is important
If you want to get data from a website, first check whether it has an API (Application Programming Interface). If the site already has one, you may not need to reinvent the wheel.
Note: An API is a programming interface that allows you to access data from the website.
Here are the four reasons why you might need a web scraper:
- The website you want to get data from does not provide an API in the first place.
- The API provided is not free.
- The API provided has limitations, such as a cap on the number of times you can call it.
- The API does not expose all the data you wish to extract.
Why web scraping for data science?
Data on the web gives data scientists an opportunity to collect, analyze, and visualize it. There are lots and lots of "raw materials" out there on the web, and you can use them to build your own data science projects.
The web exposes a lot of interesting opportunities such as:
- You might find an interesting table on a Wikipedia page that you can retrieve and do some statistical analysis on.
- Perhaps you want to get a list of reviews on Amazon to perform some sentiment analysis on, create a recommendation system, or build a machine learning model to spot fake reviews.
- You might wish to visualize property listings from a real-estate site.
- You'd like to enrich the natural language processing model in your article classification project with more data from news websites and blogs to reduce bias.
- You might be curious about social media analytics on Twitter, Facebook, and other platforms.
- It might be interesting to monitor a tech-focused site like Hacker News to see the trending stories that interest you.
Once you learn web scraping and start paying more attention to it, you will find you have the power to do a lot more with your data. It will also show you many business ideas that you can implement in your data science projects, or ways to make money from web scraping as a freelancer.