Saturday, 14 April 2018

Scraping details from a web database using WebScraper

In the example below, you can see that the important data are not identified by a *unique* class or id.  This is a common scenario, so I thought I'd write this post to explain how to crawl such a site and collect this information. This tutorial uses WebScraper from PeacockMedia.
1. As there's no unique class or id identifying the data we want*  we must use a reguar expression (regex).

Toggle to 'Complex setup' if you're not set to that already.

Add a column to the output file and choose Regular expression. Then use the helper button (unless you're a regex aficionado and have already written the expression). You can search the source for the part you want and test your expression as you write it. (see below). Actually writing the expression is outside the scope of this article, but one important thing to say is that if there is no capturing in your expression (round brackets) then the result of the whole expression will be included in your output. If there is capturing (as in my example below) then only the captured data is output.

 2. paste the expression into the 'add column' dialog and OK.   Here's what the setup looks like after adding that first column.
 3. This is a quick test, I've limited the scan to 100 links simply to test it. You can see this first field correctly going into the column. At this point I've also added another regex column for the film title, because the page title is useless here (it's the same for every detail page)
4. Add further columns for the remaining information as per step 1.

5. If you limited the links while you were testing as I did, remember to change that back to a suitably high limit (the default is 200,000 I believe) and set your starting url to the home page of the site or something suitable for crawling all of the pages you're interested in.


As always, please contact support if you need more help or would like to commission us to do the job.



*  as I write this, webscraper checks for class, id and itemprop  inside <div>, <span>, <dd> and <p> tags, this is likely to be expanded in future.

No comments:

Post a Comment