Friday, 18 January 2019

scraping Yelp for phone numbers of all plumbers in California (or whatever in wherever)

I've written similar tutorials before, but I made these screenshots and notes this morning while helping someone, and wanted to preserve the information here.

We're using the WebScraper app. The procedure below will work while you're in demo mode, but the number of results will be limited.

We enter our starting url. Perform your search on Yelp and then click through to the next page of results. Note how the url changes as you click through. In this case it goes:

(I added &start=0 to the first one; that part isn't there when you first reach the results, but including it avoids some duplication, because the default page and &start=0 are the same page.)

So our starting url should be:

In order to crawl through the results pages, we limit our crawl to urls that match the pattern we've observed above. We can do that by asking the crawl to ignore any urls that don't contain
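To illustrate the pattern, here's a quick sketch of how those paginated urls could be generated. The base url and the page size of 10 results per page are assumptions for illustration; check how the &start= value actually changes as you click through your own search.

```python
# Sketch: generate paginated search-result urls of the kind described above.
# BASE and the page size of 10 are assumptions, not Yelp's documented scheme.
BASE = "https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&start={start}"

def pagination_urls(pages, page_size=10):
    """Return the urls for the first `pages` pages of results."""
    return [BASE.format(start=n * page_size) for n in range(pages)]

for url in pagination_urls(3):
    print(url)
```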

In this case we're going to additionally click the links in the results, so that we can scrape the information we want from the businesses' pages. This is done on the 'Output filter' tab. Check 'Filter output' and enter these rules:
URL contains /biz/
and URL contains ?osq=Plumbers
(The phone numbers and business names are right there on the results pages, we could grab them from there, but for this exercise we're clicking through to the business page to grab the info from there. It has advantages.)

Finally we need to set up the columns in our output file and specify what information we want to grab. On the business page, the name of the business is in the h1 tags, so we can simply select that. The phone number is helpfully in a div called 'biz-phone' so that's easy to set up too.
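For anyone scripting this rather than using WebScraper's point-and-click setup, here's a minimal sketch of the same extraction using only Python's standard library. The sample HTML is a made-up fragment mirroring the structure described above (name in the h1, phone in a 'biz-phone' div); the real markup may differ.

```python
# Minimal sketch of the extraction described above, stdlib only.
from html.parser import HTMLParser

class BizParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.name = self.phone = ""
        self._target = None  # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._target = "name"
        elif tag == "div" and attrs.get("class") == "biz-phone":
            self._target = "phone"

    def handle_endtag(self, tag):
        self._target = None

    def handle_data(self, data):
        if self._target:
            setattr(self, self._target, getattr(self, self._target) + data.strip())

# Hypothetical fragment standing in for a business page:
sample = "<h1>Ace Plumbing</h1><div class='biz-phone'>(555) 010-0199</div>"
p = BizParser()
p.feed(sample)
print(p.name, p.phone)   # → Ace Plumbing (555) 010-0199
```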

Then we run the scan by pressing the Go button. In an unlicensed copy of the WebScraper app you should see 10 results. Once licensed, the app will crawl through the pagination and collect all (in this case) 200+ results.


I was able to get all of the results (matching those available while using the browser) for this particular category. For some others I noticed that Yelp didn't seem to want to serve more than 25 pages of results, even when the page said that there were more pages. Skipping straight to the 25th page and then clicking 'Next' resulted in a page with hints about searching.

This isn't the same as becoming blacklisted, which happens when you have made too many requests in a given time. That's obvious when it happens, because you then can't access Yelp in your browser without changing your IP address. One measure to avoid this problem is to use ProxyCrawl: get yourself an account (free initially), switch on 'Use ProxyCrawl' in WebScraper and enter your token in Preferences.

Wednesday, 9 January 2019

New website monitor / archive utility for Mac arrives at full stable release and is still free

Watchman is an easy-to-use website monitoring / archiving utility.

You can use it to watch a single page, a part of a website or a whole website. It can run on a schedule (hourly, daily, weekly, monthly) and alert you to the changes you're interested in - visible text, source, resources appearing on the page, its status - or you can simply leave it to archive all changes to all files. You can set up multiple website configurations, each with its own schedule. It uses the system's launchd, meaning that Watchman doesn't have to be left running; it'll just start as needed.

Watchman uses the same fast, efficient crawling engine as Scrutiny and Integrity, which has been developed over 12 years and offers a huge amount of configuration and tuning. This is coupled with a new web archive format.

Its web archive format can store changes like a Time Machine backup. You can view any page as it appeared on a certain date. When you do so, you're viewing a 'living' version of the page, with its css and javascript running as it was at the time, not a simple screenshot. You can of course export a version of a page as an image, or as a collection of all the files under their original filenames, as they were on that date. You can switch between versions of a page to compare them.

It's a desktop app running on your own Mac, so you own your own data.

It's early days and there are many more features in the pipeline, but for now it's stable and doing invaluable work. And version 1.x is free to download and use. (The next major version may not be free, or may be 'freemium', but the current version will continue to work and remain free.)

I've been flagging the release of Watchman for a while. It's been a long time since I've been so excited about a new project, and I believe it'll become a more important title for us than Scrutiny.

Tuesday, 27 November 2018

New app - 'Time machine for your website'

[Edit 5 Dec - video added - scroll to bottom to see it]
[Edit 18 Dec - beta made public, available from website]

We kick off quite a few experimental projects. In most cases they never really live up to the original vision or no-one's interested.

This is different. It's already working so beautifully, and proving so indispensable here, that I'm convinced it will be even more important than Integrity and Scrutiny.

So what is it?

It monitors a whole website (or part of a website, or a single page) and reports changes.

You may want to monitor a page of your own, or a competitor's or a supplier's, and be alerted to changes. You may want to simply use it as a 'time machine' for your own website and have a record of all changes. There are probably use-cases that we haven't thought of.

You can easily schedule an hourly, daily, weekly or monthly scan so that you don't have to remember to do it, and the app doesn't even need to be running, it'll start up at the scheduled time.

Other services like this exist. But this is a desktop app that you own and are in control of. It goes very deep. It can scan your entire site, with loads of scanning options just like Integrity and Scrutiny, plus blacklisting and whitelisting of partial urls. It doesn't just take a screenshot; it keeps its own record of every change to every resource used by every page. It can display a page at any point in time - not just a screenshot but a 'living' version of the historic page using the javascript & css as it was at the time.

It allows you to switch between different versions of the page and spot changes. It'll run a comparison and highlight the changes in the code or the visible text or resources.

It stores the website in a web archive, you can export any version of any page at any point in time as a screenshot image or a collection of all of the files (html, js, css, images etc) involved in that version of that page.

The plan was to release this in beta in the New Year. But it's already at the stage where all of the fundamental functionality is in place and we're using it for real.

If you're at all interested in trying an early version and reporting back to us, you can now download the beta from the website.

The working title has been Watchtower, but it won't be possible to use that name because of a clash with the 'Watchtower library' and related apps. It'll likely be some variation on that name.

Monday, 5 November 2018

Webscraper and pagination with case studies

If you're searching for help using WebScraper for MacOS then the chances are that the job involves pagination, because pagination presents some challenges.

Right off, I'll say that there is another approach to extracting data in cases like this from certain sites. It uses a different tool which we haven't made publicly available, but contact me if you're interested.

Here's the problem: the search results are paginated (page 1, 2, 3 etc). In this case, all of the information we want is right there on the search results pages, but it may be that you want WebScraper to follow the pagination, and then follow the links through to the actual product pages (let's call them 'detail pages') and extract the data from those.

1. We obviously want to start WebScraper at the first page of search results. It's easy to grab that url and give it to WebScraper:

2. We aren't interested in Webscraper following any links other than those pagination links. (we'll come to detail pages later). In this case it's easy to 'whitelist' those pagination pages.

3. The pagination may stop after a certain number of pages. But in this case it seems to go on for ever. One way to limit our crawl is to use these options:

A more precise way to stop the crawl at a certain point in the pagination is to set up more rules:

4. At this point, running the scan proves that WebScraper will follow the search results pages we're interested in, and stop when we want.

5. In this particular case, all of the information we want is right there in the search results lists. So we can use WebScraper's class and regex helpers to set up the output columns.

Detail pages

In the example above, all of the information we want is there on the search result pages, so the job is done. But what if we have to follow the 'read more' link and then scrape the information from the detail page?

There are a few approaches to this, and a different approach that I alluded to at the start. The best way will depend on the site.

1. Two-step process

This method involves using the technique above to crawl the pagination and collect *only* the urls of the detail pages, in a single column of the output file. Then, as a separate project, use that list as your starting point (File > Open list of links) so that WebScraper scrapes data from the pages at those urls, ie your detail pages. This is a good clean method, but it does involve a little more work to run it all. With the two projects set up properly and saved as project files, you can open the first project, run it, export the results, then open the second project, run it and export your final results.
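The two-step idea can be sketched in a few lines of Python. The function names, the fetch placeholder and the toy data are all hypothetical stand-ins; in WebScraper the two steps are the two saved project files.

```python
# Sketch of the two-step process: collect detail-page urls, then scrape them.
def collect_detail_urls(results_pages):
    """Step 1: crawl the pagination, keeping only links to detail pages."""
    urls = []
    for page in results_pages:
        urls.extend(u for u in page["links"] if "?productid=" in u)
    return urls

def scrape_details(detail_urls, fetch):
    """Step 2: open the saved list of links and scrape each page."""
    return [fetch(u) for u in detail_urls]

# Toy data standing in for crawled search-results pages:
pages = [{"links": ["/about", "/shop?productid=1", "/shop?productid=2"]}]
urls = collect_detail_urls(pages)
rows = scrape_details(urls, fetch=lambda u: {"url": u})
print(urls)   # → ['/shop?productid=1', '/shop?productid=2']
```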

2. Set up the rules necessary to crawl through to the detail pages and scrape the information from only those.

Here are the rules for a recent successful project

"?cat=259&sort=price_asc&set_page_size=12&page=" is the rule which allows us to crawl the paginated pages.
"?productid="  is the one which identifies our product page.

Notice here that the two rules appear to contradict each other. But when using 'Only follow', the rules are OR'd. The 'ignore' rules that we used in the first case study are AND'ed, which would give you no results at all if you had more than one 'ignore urls that don't contain' rule.

So here we're following pages which are search results pages, or product detail pages.  
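In code terms, the difference between OR'd and AND'ed rules looks like this (a sketch; the rule strings are the ones from this project):

```python
# 'Only follow' rules are OR'd: a url survives if it matches ANY rule.
# 'Ignore urls that don't contain' rules are AND'ed: it must match EVERY rule.
FOLLOW_RULES = ["?cat=259&sort=price_asc&set_page_size=12&page=", "?productid="]

def only_follow(url, rules):
    return any(rule in url for rule in rules)        # OR'd

def ignore_unless_contains(url, rules):
    return all(rule in url for rule in rules)        # AND'ed

print(only_follow("/shop?productid=7", FOLLOW_RULES))            # → True
print(ignore_unless_contains("/shop?productid=7", FOLLOW_RULES)) # → False
```

The AND'ed version fails here because no single url can contain both rule strings, which is exactly why contradictory-looking 'Only follow' rules still work.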

The third rule is necessary because the product page (in this case) contains links to 'related products' which aren't part of our search but do fit our other rules. We need to ignore those, otherwise we'll end up crawling all products on the entire site.

That would probably work fine, but we'd get irrelevant lines in our output because WebScraper will try to scrape data from the search results pages as well as the detail pages. This is where the Output filter comes into play.

The important one is "scrape data from pages where... URL does contain ?productid".  The other rule probably isn't needed (because we're ignoring those pages during the crawl) but I added it to be doubly sure that we don't get any data from 'related product' pages.

Whichever of those methods you try, the next thing is to set up the columns in the output file (ie what data you want to scrape.)  That's beyond the scope of this article, and the 'helpers' are much improved in recent WebScraper versions. There's a separate article about using regex to extract the information you want here.

Wednesday, 26 September 2018

http requests - they're not all the same

This is the answer to a question that I was asked yesterday. I thought that the discussion was such an interesting one that I'd post the reply publicly here.

A common perception is that a request for a web page is simply a request. Why might a server give different responses to different clients? To be specific, why might Integrity / Scrutiny receive one response when testing a url, yet a browser sees something different? What are the differences?

user-agent string

This is sent with a request to identify "who's asking". Abuses of the user-agent string by servers range from sending a legitimate-looking response to search engine bots and dodgy content to browsers, through to refusing to respond to requests that don't appear to come from browsers. Integrity and Scrutiny are good citizens and by default have their own standards-compliant user-agent string. If it's necessary for testing purposes, this can be changed to that of a browser or even a search engine bot.

header fields

A request contains a bunch of header fields. These are specifically designed to allow a server to tailor its content to the client. There are loads of possible fields, and you can invent custom ones; some are mandatory, many optional. By default, Scrutiny includes the ones that the common browsers include, with similar settings. If your own site requires a particular unusual or custom header field / value to be present, you can add it (in Scrutiny's 'Advanced settings').
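As a sketch of what's being sent, here's how a client sets its user-agent string and other header fields using Python's standard library. The values are illustrative, not what Scrutiny actually sends.

```python
# Building a request with an explicit user-agent and header fields.
import urllib.request

req = urllib.request.Request(
    "https://example.com/",
    headers={
        "User-Agent": "MyCrawler/1.0 (+https://example.com/bot)",   # "who's asking"
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-GB,en;q=0.9",
    },
)
# urllib stores header names capitalised, e.g. 'User-agent':
print(req.get_header("User-agent"))   # → MyCrawler/1.0 (+https://example.com/bot)
```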

cookies and javascript

Browsers have these things enabled by default; they're just part of our online lives now (though accessibility standards say that sites should be usable without them). In Scrutiny they're options, and both are deliberately off by default. I'm discovering more and more sites which test for cookies being enabled in the browser (with a handshake-type thing) and refuse to serve if they're not. There are a few sites which refuse to work properly without javascript being enabled in the browser. This is a terrible practice but it does happen, thankfully rarely. Switch cookies on in Scrutiny if you need to. But always leave the javascript option *off* unless your site does this when you switch js off in your browser:
[Image: a blank web page with the message 'This site requires Javascript to work']


There are a couple of other things under Scrutiny's Preferences > Links > Advanced (and Integrity's Preferences > Advanced): 'Use GET for all connections' and 'Load data for all connections'. Both will probably be off by default.
[Screenshot: a couple of Scrutiny's preferences, 'Use GET for all connections' and 'Load data for all connections']

A browser will generally use GET when making a request (unless you're sending a form) and it will probably load all of the data that is returned. For efficiency, a webcrawler can use the HEAD method when testing external links (because it doesn't need the actual content of the page, only the status code). If it does use GET (for internal connections where it does want the content, or if you have 'always use GET' switched on) and it doesn't need the page content, it can cancel the request after getting the status code. This very rarely causes a problem, but I have had one or two cases where a large number of cancelled requests to the same server caused problems.

'Use GET for all connections' is unlikely to make any visible difference when scanning a site. Using the HEAD method (which by all standards should work) may not always work, but if a link returns any kind of error after using the HEAD method, Integrity / Scrutiny tests the same url again using GET.
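That HEAD-then-GET fallback can be sketched like this. The real request machinery is abstracted behind a status(url, method) function so the retry logic stands on its own; the 405-on-HEAD fake below imitates a server that mishandles HEAD.

```python
# The HEAD-then-GET fallback, with the transport abstracted away.
def check_link(url, status):
    """status(url, method) -> HTTP status code."""
    code = status(url, "HEAD")      # cheap: no response body is transferred
    if code >= 400:                 # some servers mishandle HEAD...
        code = status(url, "GET")   # ...so retry the same url with a full GET
    return code

# A fake server that rejects HEAD with 405 but serves GET normally:
fake = lambda url, method: 405 if method == "HEAD" else 200
print(check_link("https://example.com/page", fake))   # → 200
```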

Other considerations

Outside of the particulars of the http request itself are a couple of things that may also cause different responses to be returned to a webcrawler and a browser. 

One is the frequency of the requests. Integrity and Scrutiny will send many more requests in a given space of time than a browser, probably many at the same time (depending on your settings). This is one of the factors involved in LinkedIn's infamous 999 response code. 

The other is authentication. A frequently-asked question is why a link to a social media site returns a response code such as 'forbidden' when the link works fine in a browser. Having cookies switched on (see above) may resolve this, but we forget that when we visit social media sites we have logged in at some point in the past and our browser remembers who we are. It may be necessary to be authenticated as a genuine user of a site when viewing a page that may appear 'public'. Scrutiny and Webscraper allow authentication; the Integrity family doesn't.

I love this subject. Comments and discussion are very welcome.

Friday, 21 September 2018

New free flashcard / Visualisation & Association method for MacOS

Vocabagility is more than a flashcard system, it's a method. Cards are selected and shuffled, one side is shown. Give an answer, did you get it right? Move on. As quick and easy as using a pack of real cards in your pocket.

The system also encourages you to invent an amusing mental image linking the question and answer (Visualisation and Association).

Cards that you're not certain about have a greater probability of being shown.
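That weighting could be implemented along these lines. This is a guess at the idea, not Vocabagility's actual algorithm; the weight formula and card data are made up.

```python
# Sketch: cards you're less certain about get a higher chance of being drawn.
import random

def draw(cards, confidence, rng=random):
    """confidence[card] runs from 0.0 (no idea) to 1.0 (certain)."""
    weights = [2.0 - confidence[c] for c in cards]   # less certain -> heavier
    return rng.choices(cards, weights=weights, k=1)[0]

cards = ["chien", "chat", "oiseau"]
confidence = {"chien": 1.0, "chat": 0.2, "oiseau": 0.5}
print(draw(cards, confidence))   # 'chat' comes up more often than 'chien'
```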

This is an effective system for learning vocabulary / phrases for any language but could be used for learning other things too.

Download Vocabagility for Mac for free here.

Sunday, 16 September 2018

ScreenSleeves ready to go as a standalone app

In the last post I gave a preview of a new direction for ScreenSleeves and now it's ready to go.

Changes in MacOS Mojave have made it impossible to continue with ScreenSleeves as a true screensaver. Apple have not made it possible (as far as I know at the time of writing) to grant a screensaver plugin the necessary permission to communicate with or control other apps.

Making ScreenSleeves run as a full app (in its own window) has several benefits:

  • Resize the window from tiny to large, and put it into full-screen mode.
  • Choose to keep the window on top of others when it's small, or allow others to move on top of it
  • The new version gives you the option to automate certain things, emulating a screensaver:
    • Switch to full-screen mode with a keypress (cmd-ctrl-F) or after a configurable period of inactivity
    • Switch back from full-screen to the floating window with a wiggle of the mouse or keypress
    • Block system screensaver, screen sleep or computer sleep while in full-screen mode and as long as music is playing
As mentioned, Mojave has much tighter security. The first time you run this app, you'll be asked to allow ScreenSleeves access to several other things. It won't ask for permission for anything which isn't necessary for it to function as intended. You should only be troubled once for each thing that Screensleeves needs to communicate with.

The new standalone version (6.0.0) is available for download. It runs for free for a trial period, then costs a small price to continue using. (Previously, the screensaver came in free and 'pro' versions, with extras in the paid version.)