Data collection is the first step in building any data-driven application, data analysis or machine learning project. Web crawling is one way of collecting data from the internet. Many of us are familiar with crawling libraries like Scrapy, Beautiful Soup, Selenium or Puppeteer.

But before crawling any website, we need to understand that different websites have different ways of storing and displaying data. Just so we have a reference, I'll assume we are crawling an e-commerce website. In this article, I will try to outline the main points to consider when crawling:

  1. Analyse the website
  2. Find the API
  3. Find the embedded data
  4. Using XPath or CSS Selector
  5. Rendering through browser

Analyse the website

The first thing is to understand the data being displayed on the website. This is important because conventions differ across countries, and they change how text is displayed. For example, countries like the US, UK and India show prices with a dot for the decimal and a comma for the thousand or million mark ($1,243.50). But many European countries, such as Germany and Italy, swap the dots and commas (€1.243,50).

So start with a general exploration and note these details. Try opening product pages from different categories, and look for the different ways in which the website displays its offers and discounted prices.
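To make the price example concrete, here is a minimal sketch of normalising both formats to a number. The function name parse_price and its decimal_sep parameter are my own, purely for illustration; the separator is something you decide per website after exploring it, not something the code detects.

```python
import re
from decimal import Decimal


def parse_price(text, decimal_sep="."):
    """Convert a displayed price such as '$1,243.50' or '€1.243,50' to a Decimal.

    decimal_sep is the decimal separator the website uses. It is an
    assumption you set per site; nothing here tries to guess it.
    """
    if decimal_sep not in (".", ","):
        raise ValueError("decimal_sep must be '.' or ','")
    # Keep only digits and the two possible separators (drops currency symbols).
    cleaned = re.sub(r"[^\d.,]", "", text)
    thousands_sep = "," if decimal_sep == "." else "."
    # Remove the grouping separator, then normalise the decimal separator to '.'.
    cleaned = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


print(parse_price("$1,243.50"))                   # dot as decimal separator
print(parse_price("€1.243,50", decimal_sep=","))  # comma as decimal separator
```

Both calls should give the same value, which is the point: store one normalised number regardless of how the site displays it.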

Find the API

If the website uses APIs to display its data, it will be easier to write the crawler. Everyone likes structured data, and APIs provide it with ease, as opposed to pulling values out of unstructured HTML with XPaths. Such websites mostly return data in the form of JSON or XML.

But how do we know whether a website is using an API call to display data? And how do we find that API?

  1. Many websites are built as single-page applications. Open a webpage and inspect it with the browser's developer tools. Go to the Network tab, select the XHR filter and reload the page. You will see a list of all the requests of type XHR (the data-specific ones). Analyse each network request that you see.
  2. If the web page has a ‘load more’ button, or displays more data as you scroll, you may not find the API you need on the initial load. The API request is made only when you scroll or hit the ‘load more’ button, so do that while watching the network requests.
  3. Once you have the API, try modifying the request, for example by changing the page number. You’ll find some other parameters as well that you can play around with.
  4. On some websites you might find the API inside a script. To check this, view the page source and search for keywords like API, URL, XMLHttpRequest or Ajax.
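Once you have found such an endpoint, walking through its pages could look roughly like the sketch below. Everything specific here is an assumption for illustration: the URL, the 'page' query parameter and the {"products": [...]} response shape are hypothetical, so replace them with whatever you actually see in the Network tab.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(url):
    """Download a URL and decode its JSON body (standard library only, no retries)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0",
                                    "Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_products(base_url, params, fetch=fetch_json, max_pages=50):
    """Yield items page by page until the API returns an empty page.

    Assumes a hypothetical 'page' parameter and a hypothetical response
    of the form {"products": [...]}; both vary from site to site.
    """
    for page in range(1, max_pages + 1):
        url = f"{base_url}?{urlencode({**params, 'page': page})}"
        items = fetch(url).get("products", [])
        if not items:
            break  # an empty page is treated as the end of the listing
        yield from items


# Example call (hypothetical endpoint, so it is left commented out):
# for product in iter_products("https://shop.example.com/api/products",
#                              {"category": "shoes"}):
#     print(product)
```

The fetch argument is there so you can swap in your own downloader (one with retries or a session, for example) without touching the paging logic. The max_pages cap is only a safety net against an API that never returns an empty page.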

Find the embedded data

Some websites embed the data in a script and display it from there. You will not find this data in an API, but you can find it inside a <script> tag of type application/ld+json (JSON-LD). These scripts are used by Facebook Ads and other platforms to show product-specific data quickly in the ad block. It is a cool way to find all the important information in a single place in the HTML. You can use the 'extruct' library by Scrapinghub to find them easily.

But be careful: sometimes the data in the script is not complete.
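As a rough illustration of what such an extraction involves, the sketch below pulls embedded JSON out of a page using only Python's standard library. It is not how extruct works internally, and the sample page and its product fields are made up for the example.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip a malformed block instead of failing the crawl


def extract_json_ld(page):
    parser = JsonLdParser()
    parser.feed(page)
    parser.close()
    return parser.items


# Made-up page for illustration only.
sample_page = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Blue Shirt",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
</script>
</head><body></body></html>
"""

for item in extract_json_ld(sample_page):
    # .get() guards against the incomplete data mentioned above.
    print(item.get("name"), item.get("offers", {}).get("price"))
```

Reading fields with .get() rather than indexing directly means a product with a missing price shows up as None instead of crashing the crawler.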

Using XPath or CSS Selector

If the above steps did not get you the required data, you can use an XPath or CSS selector. But note that an XPath copied from the browser may not match the HTML you download. This happens because JavaScript changes the structure of the page dynamically after it loads. You can use a Chrome plugin called 'XPath Helper' to help you find the correct XPath.

So first download the HTML page and open it in the browser. Now test the selector against that copy, or, if you are using the Scrapy framework for crawling, use the Scrapy shell first to verify the path.
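Here is a small sketch of that mismatch between the browser's view and the downloaded HTML. It shows one well-known case, caused by the browser's HTML parser rather than by JavaScript: browsers add a <tbody> element to tables, so a copied path contains a step that the downloaded markup does not have. The snippet is hypothetical and deliberately well formed, because the standard library's ElementTree only parses XML and supports a limited subset of XPath. For real pages you would more likely use Scrapy's selectors or lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for the HTML as downloaded.
downloaded = """
<html><body>
  <table id="offers">
    <tr><td class="name">Blue Shirt</td><td class="price">19.99</td></tr>
    <tr><td class="name">Red Shirt</td><td class="price">24.50</td></tr>
  </table>
</body></html>
"""
root = ET.fromstring(downloaded.strip())

# Path as it might be copied from the browser: it includes <tbody>,
# which exists in the browser's DOM but not in the downloaded markup.
copied_from_browser = "./body/table/tbody/tr/td[@class='price']"

# A shorter path anchored on an attribute instead of the full structure.
robust = ".//td[@class='price']"

print([td.text for td in root.findall(copied_from_browser)])  # → []
print([td.text for td in root.findall(robust)])               # → ['19.99', '24.50']
```

Short paths anchored on a class or id tend to survive these structural differences better than long absolute paths.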

Rendering through browser

Sometimes you won’t be able to get the data you need through any of the above methods. This is when you have to use rendering. Rendering basically opens an entire browser and mimics the process as if a human were doing it.

You can use Selenium or Puppeteer for this. But this process should be the last resort, because it takes up too much memory.
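For Selenium, a minimal sketch of rendering a page might look like the following. It assumes Selenium 4 with Chrome installed. The function name render_page and the make_driver parameter are my own; the parameter only exists so that the browser can be swapped for another driver or a test double.

```python
def render_page(url, make_driver=None):
    """Return the HTML of a page after the browser has run its JavaScript."""
    if make_driver is None:
        # Imported lazily so the function can be used with a custom driver
        # even when Selenium is not installed.
        from selenium import webdriver

        def make_driver():
            options = webdriver.ChromeOptions()
            options.add_argument("--headless=new")  # run Chrome without a window
            return webdriver.Chrome(options=options)

    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser to release its memory


# Example call (hypothetical URL, needs Chrome, so it is left commented out):
# html = render_page("https://shop.example.com/product/123")
```

Calling quit() in a finally block matters for exactly the reason given above: each browser instance is heavy, and a crawler that leaks them will run out of memory quickly. Pages that load data late may also need an explicit wait before reading page_source, which this sketch leaves out.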

I hope you find this article helpful! For any feedback, or if you just want to say hi, reach out to me at [email protected]!