Tuesday 27 November 2018

New app - 'Time machine for your website'

[Edit 5 Dec - video added - scroll to bottom to see it]
[Edit 18 Dec - beta made public, available from website]
[Edit 10 Mar 2019 - version 2 made public]
[Important note 17 Mar 2019: Name of app has had to be changed slightly from Watchman to Website Watchman because of a conflict with a Linux open source title]

We kick off quite a few experimental projects. In most cases they never really live up to the original vision or no-one's interested.

This is different. It's already working so beautifully and is proving indispensable here, I'm convinced that it will be even more important than Integrity and Scrutiny.

So what is it?

It monitors a whole website (or part of a website, or a single page) and reports changes.

You may want to monitor a page of your own, or a competitor's or a supplier's, and be alerted to changes. You may want to simply use it as a 'time machine' for your own website and have a record of all changes. There are probably use-cases that we haven't thought of.

You can easily schedule an hourly, daily, weekly or monthly scan so that you don't have to remember to do it, and the app doesn't even need to be running, it'll start up at the scheduled time.

Other services like this exist. But this is a desktop app that you own and are in control of. It goes very deep. It can scan your entire site, with loads of scanning options just like Integrity and Scrutiny, plus blacklisting and whitelisting of partial urls. It doesn't just take a screenshot, it keeps its own record of every change to every resource used by every page. It can display a page at any point in time - not just a screenshot but a 'living' version of the historic page using the javascript & css as it was at the time.

It allows you to switch between different versions of the page and spot changes. It'll run a comparison and highlight the changes in the code or the visible text or resources.

It stores the website in a web archive, you can export any version of any page at any point in time as a screenshot image or a collection of all of the files (html, js, css, images etc) involved in that version of that page.

The plan was to release this in beta in the New Year. But it's already at the stage where all of the fundamental functionality is in place and we're using it for real.

If you're at all interested in trying an early version and reporting back to us, you can now download the beta from the website. [Edit: version 1 is now the stable release and is free, version 2 is also free and is in beta]

The working title has been Watchtower, but it won't be possible to use that name because of a clash with the 'Watchtower library' and related apps. It'll likely be some variation on that name.

Monday 5 November 2018

Webscraper and pagination with case studies

If you're searching for help using WebScraper for MacOS then the chances are that the job involves pagination, because this situation provides some challenges.

Right off, I'll say that there is another approach to extracting data in cases like this from certain sites. It uses a different tool which we haven't made publicly available, but contact me if you're interested.

Here's the problem:  the search results are paginated (page 1, 2, 3 etc). In this case, all of the information we want is right there on the search results pages, but it may be that you want Webscraper to follow the pagination, and then follow the links through to the actual product pages (let's call them 'detail pages') and extract the data from those.

1. We obviously want to start WebScraper at the first page of search results. It's easy to grab that url and give it to WebScraper:

2. We aren't interested in Webscraper following any links other than those pagination links. (we'll come to detail pages later). In this case it's easy to 'whitelist' those pagination pages.

3. The pagination may stop after a certain number of pages. But in this case it seems to go on for ever. One way to limit our crawl is to use these options:

A more precise way to stop the crawl at a certain point in the pagination is to set up more rules:

4. At this point, running the scan proves that WebScraper will follow the search results pages we're interested in, and stop when we want.

5. In this particular case, all of the information we want is right there in the search results lists. So we can use WebScraper's class and regex helpers to set up the output columns.

Detail pages

In the example above, all of the information we want is there on the search result pages, so the job is done. But what if we have to follow the 'read more' link and then scrape the information from the detail page?

There are a few approaches to this, and a different approach that I alluded to at the start. The best way will depend on the site.

1. Two-step process

This method involves using the technique above to crawl the pagination, and collect *only* the urls of the detail pages  in a single column of the output file.  Then as a separate project, use that list as your starting point (File > Open list of links)  so that WebScraper scrapes data from the pages whose those urls, ie your detail pages. This is a good clean method, but it does involve a little more work to run it all. With the two projects set up properly and saved as project files,  you can open the first project, run it, export the results, open the second project, run it and then export your final results.

2. Set up the rules necessary to crawl through to the detail pages and scrape the information from only those.

Here are the rules for a recent successful project

"?cat=259&sort=price_asc&set_page_size=12&page=" is the rule which allows us to crawl the paginated pages.
"?productid="  is the one which identifies our product page.

Notice here that the two rules appear to contradict each other. But when using 'Only follow', the two rules are 'OR'd. The 'ignore' rules that we used in the first case study are 'AND'ed, which results in no results if you have more than one 'ignore urls that don't contain'.

So here we're following pages which are search results pages, or product detail pages.  

The third rule is necessary because the product page (in this case) contains links to 'related products' which aren't part of our search but do fit our other rules. We need to ignore those, otherwise we'll end up crawling all products on the entire site.

That would probably work fine, but we'd get irrelevant lines in our output because WebScraper will try to scrape data from the search results pages as well as the detail pages. This is where the Output filter comes into play.

The important one is "scrape data from pages where... URL does contain ?productid".  The other rule probably isn't needed (because we're ignoring those pages during the crawl) but I added it to be doubly sure that we don't get any data from 'related product' pages.

Whichever of those methods you try, the next thing is to set up the columns in the output file (ie what data you want to scrape.)  That's beyond the scope of this article, and the 'helpers' are much improved in recent WebScraper versions. There's a separate article about using regex to extract the information you want here.