Introduction To Web Scraping

Written by Prithvi Singh | 6 min read

Hey guys, today we will talk about one of the most popular techniques for data generation: web scraping.

Most of you may already know about web scraping, or may even have done some work with it. But if you don't have any experience with web scraping, or you're looking to learn more about it, you're in the right place.

What is Web Scraping?

Web scraping is the process of extracting data from a website, transforming the information on a webpage into structured data for further analysis. Since the internet holds a huge amount of data, web scraping has become an important way of building big data sets.

People write web scraping scripts to generate large amounts of data, which is then passed on for further processing like aggregation, filtering, reducing, etc. The transformed data is then sent to other systems such as data streams, databases, or file systems.

Use cases for web scraping

Web scraping is used in a variety of digital businesses that rely on data harvesting, which is another name for web scraping. Some popular use cases of scraping:

Google uses search bots to crawl a site, analyze its content, and then rank it.

Market research companies scrape data from social media, blogs, or forums and then analyze it to provide insights.

You can even build your own scraping bot that fetches prices of different products for a user and notifies them when a product's price drops.

But web scraping is also used for illegal purposes. People scrape data from a site and then use it for illegal activities, such as republishing copyrighted content.

How does web scraping work?

There are many tools available that provide web scraping services. But if you want a custom script that scrapes data according to your needs, then you need to know how the scraping process works.

Each domain has a particular HTML structure, which you have to recognize.

You need to extract that HTML content and parse its DOM elements.

You need to target the particular elements you want from the site and scrape them.

And lastly, you need to store the scraped data in a database, a file, or some other storage system.

But while scraping a website, you need to follow that website's rules and policies regarding web scraping, because some websites don't allow scraping of their content, or only allow it for authorized users. Some websites provide their own APIs for accessing data, which are not free, so you need to buy a suitable plan. But then again, there is always a workaround for paid stuff, isn't there? We won't discuss that in this article, though.

Now, let's start with our own web scraping script. The two most used languages for this purpose are Python and Node.js, and there are many debates about which one is faster. In my experience, a Python script is faster, but Node.js provides more control over the DOM elements of the site.

In my example, I am going to use Node.js with the Axios and cheerio libraries, and store the results in a CSV file. We will scrape one of the most popular search engines, Bing.

Create a Node project using the npm init command inside an empty directory. Now let's install the libraries we are going to need, Axios and cheerio, using the npm install command.
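The setup described above boils down to two commands (run inside an empty directory):

```shell
npm init -y               # scaffold a package.json with default answers
npm install axios cheerio # HTTP client + server-side, jQuery-like HTML parsing
```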

Now let's create our app.js file, which will contain the logic to scrape the Bing search result page. We will write a function that makes an Axios request to the URL you provide and then uses a cheerio object for further operations.

Let me explain the code. We marked the function async in order to use await: Axios returns a promise, and await resolves it. After that, we take the HTML content from the Axios response and load it into a cheerio object. This gives you control over each DOM element of the response HTML.

If you right-click the search result page and click Inspect, you will see the HTML tags of the page. While selecting tags, prefer targeting elements that have an id, because the class names of tags can be dynamic.

Now let's target each search result shown on the page. You will see that the li element is repeated. Don't worry: when you target a tag that is repeated, cheerio returns a list containing all of the matches.

But we don't want all of the li elements, only those with the class b_algo, because those contain the main result elements.

In this way, you get a list of the search results on the page. Now we can extract things like the title, description, etc. from each of those list elements.

Now, before I post my code, try it yourself and see if you can target the correct HTML tags. Once done, come back to the article to see my code.

Let me walk you through the code. I created a list that will hold the data we scrape from each element. Cheerio provides functions like map and each to iterate over the elements of the list and parse them one by one, and inside the callback I have used this to target the current element.

I am scraping the title and description of each element, assigning them to an object, and storing that object in the list. I hope you were able to target the HTML tags correctly; if not, these will work for you as long as you are on Bing. Some of you may have targeted different tags and still gotten results; that happens because the tags also change based on location and various other factors. Just make sure the tags you target do not change on each search: they should be static, not dynamic.

Wrapping it up

If you want to know more about these libraries or about web scraping, here are some links that will help you do that.