Wednesday 11 December 2019

Using WebScraper to extract data from paginated search result pages: another real case study


First we need to give WebScraper our starting URL. This may be as simple as performing the search in a browser and grabbing the URL of the first page of results. In this case things are even easier, because the landing page on the website has the first pageful of results. So that's our starting URL.

After that, we want WebScraper to 'crawl' the pagination (i.e. next/previous, or page 1, 2, 3 etc.) and ignore any of the links to other parts of the site. In this case, the pages have a nice clean querystring which only includes the page number (?page= followed by the number).
So as far as the crawl goes, we can tell WebScraper to ignore any URL which doesn't contain "?page=". We're in the Advanced section of the Scan tab for these next few steps.
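
To make the effect of that scan rule concrete, here's a rough sketch in Python. It isn't WebScraper's own code, just the same "contains ?page=" test applied to a handful of links. The example.com URLs and the function name should_crawl are made up for illustration.

```python
def should_crawl(url: str) -> bool:
    """Mimic the scan rule: only follow URLs that contain '?page='."""
    return "?page=" in url


# Hypothetical links found on a results page.
links = [
    "https://example.com/names?page=2",
    "https://example.com/names/somename",
    "https://example.com/about",
    "https://example.com/names?page=3",
]

# Only the pagination links survive the rule.
to_crawl = [u for u in links if should_crawl(u)]
print(to_crawl)
```

Only the two ?page= links are kept, so the crawl stays on the pagination and never wanders off into the rest of the site.
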
In some cases, the pages we want the data from are a further click from the search results pages. Let's call those 'detail pages'. It's possible to tell WebScraper to stick to the scan rule(s) we've just set up, but also to visit detail pages and scrape data only from those. This is done by setting up rules on the 'Output filter' tab. We don't need to do that here but this article goes into more detail about that.

In our case we're still going to use the output filter to make sure that we only scrape data from pages that we're interested in. We're starting within the 'directory' /names, so the first rule, "URL does contain /names", shouldn't be necessary*. There's a silly quirk here, which is that the first page won't stick at ?page=1; it redirects back to /names. Therefore we can't use /names?page= here if we're going to capture the data from that first page. I notice that there are detail pages in the form /names/somename, so in order to filter those from the results, I've added a second rule here, "URL does not contain names/".
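
As a sanity check on those two output filter rules, here's a small Python sketch of the same logic. Again, this is only an illustration of the rules described above, not how WebScraper implements them, and the example.com URLs and the name should_scrape are invented.

```python
def should_scrape(url: str) -> bool:
    """Mimic the two output filter rules:
    URL does contain '/names' AND URL does not contain 'names/'."""
    return "/names" in url and "names/" not in url


# Hypothetical URLs the crawl might encounter.
candidates = [
    "https://example.com/names",            # first page (after the redirect)
    "https://example.com/names?page=2",     # a later results page
    "https://example.com/names/somename",   # a detail page
]

for url in candidates:
    print(url, should_scrape(url))
```

The first page and the paginated pages pass both rules, while the detail page is rejected by the second rule because its URL contains "names/".
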
If we're lucky, the site will be responsive and won't clam up on us after a limited number of responses, and our activities will go unnoticed. However, it's always a good policy to put in a little throttling (the lower the number, the longer the scan will take).
Don't worry about the number of threads too much here; the above setting takes that into account, and it's beneficial to use multiple threads even when throttling right down.
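
For anyone curious how one limit can apply across several threads, here's a minimal Python sketch. It assumes the throttle is a cap on requests per minute shared by all threads; that's my assumption for the illustration, and the Throttle class and its names are hypothetical, not WebScraper internals.

```python
import threading
import time


class Throttle:
    """Space out calls to wait() so that, across all threads,
    no more than max_per_minute calls are released per minute."""

    def __init__(self, max_per_minute: float) -> None:
        self.interval = 60.0 / max_per_minute  # seconds between requests
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free time slot while holding the lock...
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self.interval
        # ...then sleep outside the lock so other threads can queue up.
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


# A very high limit so the demonstration finishes quickly.
throttle = Throttle(max_per_minute=6000)
start = time.monotonic()
for _ in range(5):
    throttle.wait()  # a worker thread would make its request after this
elapsed = time.monotonic() - start
print(f"5 requests released in {elapsed:.3f}s")
```

Because every thread reserves its slot from the same shared object, adding threads doesn't increase the overall request rate; a lower limit just means a longer interval between slots.
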

Finally, we set up our columns. If you don't see the Output file columns tab, click Complex setup (below the Go button). By default, you'll see two columns set up already, for page URL and page title. Page URL is often useful later, page title not so much. You can delete one or both of those if you don't want them in the output file.

In this example, we want to extract the price, which appears (if we 'view source') as data-business-name-price="2795". This is very easy to turn into a regex: we just replace the part we want to collect with (.+?)
I had a little confusion over whether this appeared on the page with single-quotes or double-quotes, and so to cover the possibility that the site isn't consistent with these, I also replaced the quotes in the expression with ["'] (square brackets containing a double and single quote). That gives data-business-name-price=["'](.+?)["'] as the full expression.
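
If you'd like to try the expression outside WebScraper, here's a quick Python sketch. The first snippet of HTML uses the attribute from this site; the second line, with single quotes and the value 1200, is made up purely to show the quote handling.

```python
import re

# The attribute name is from the page source; ["'] accepts either quote
# style and (.+?) lazily captures the value between the quotes.
PRICE_RE = re.compile(r"""data-business-name-price=["'](.+?)["']""")

# Sample markup: the second div is a hypothetical single-quoted variant.
html = """
<div data-business-name-price="2795">...</div>
<div data-business-name-price='1200'>...</div>
"""

prices = PRICE_RE.findall(html)
print(prices)  # ['2795', '1200']
```

The lazy (.+?) matters here: a greedy (.+) would run on to the last quote on the line rather than stopping at the first one after the value.
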

(Setting up the other columns for this job was virtually identical to the above step).

If necessary, WebScraper has a Helper tool to help us create and test that Regex. (View>Helper for classes/Regex.)

The Test button on the Output columns tab will run a limited scan, which should take a few seconds and put a few rows in the Results tab. We can confirm that the data we want is being collected.

Note that, by default, WebScraper outputs to CSV and puts the data from one page on one row. That's fine if we want one piece of data from each page (or want to extract a number of different pieces into different columns), but here we have multiple results on each page.

Choosing JSON as the output format gets around this because the format allows for arrays within arrays. If we need CSV as the output, then WebScraper will initially put all of a page's values for a column in a single cell, separated with a separator character (pipe by default). There's then an option on the Post-process tab to split those rows and put each item of data on a separate row. This is done when exporting, so won't be previewed on the Results tab.
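
To picture what that split does, here's a rough Python sketch of the idea. It's only my illustration of turning one pipe-separated row into several rows, not WebScraper's actual post-processing; the column names, sample values and the split_rows function are all hypothetical, and I've assumed that single-valued columns such as the page URL are simply repeated on each row.

```python
def split_rows(row: dict, multi_columns: list, sep: str = "|") -> list:
    """Turn one row with separator-joined cells into one row per item."""
    parts = {col: row[col].split(sep) for col in multi_columns}
    count = max(len(values) for values in parts.values())
    result = []
    for i in range(count):
        new_row = dict(row)  # single-valued columns are repeated as-is
        for col, values in parts.items():
            # Pad with an empty string if a column has fewer items.
            new_row[col] = values[i] if i < len(values) else ""
        result.append(new_row)
    return result


# One scraped page, three results, each column joined with pipes.
page_row = {
    "url": "https://example.com/names?page=2",
    "name": "alpha|beta|gamma",
    "price": "2795|1200|950",
}

rows = split_rows(page_row, ["name", "price"])
for r in rows:
    print(r)
```

Each result on the page ends up on its own row, with the name and price lined up by position.
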

That's the setup. Go runs the scan, and Export exports the data.
* Remembering to answer 'Directory' when asked whether the starting URL is a page or a directory.


