Friday 18 January 2019

Scraping Yelp for phone numbers of all plumbers in California (or whatever in wherever)

I've written tutorials similar to this one before, but I made the screenshots and wrote this one this morning to help someone, and wanted to preserve the information here.

We're using the WebScraper app. The procedure below will work while you're in demo mode, but the number of results will be limited.

We enter our starting url. Perform your search on Yelp and then click through to the next page of results. Note how the url changes as you click through. In this case it goes:
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=0
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=20
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=40
etc.

(I added &start=0 to the first one. That part isn't there when you first go to the results, but including it avoids some duplication, because the default page and the &start=0 page are the same page.)

So our starting url should be:
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=0
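
For anyone who wants to see the pagination pattern spelled out, here is a minimal Python sketch that builds the same urls. It only assumes what was observed above, that start goes up by 20 per page; the function name and constants are mine and have nothing to do with WebScraper itself.

```python
from urllib.parse import urlencode

BASE = "https://www.yelp.com/search"
PAGE_SIZE = 20  # step between pages, as observed in the urls above


def results_page_url(page_index, desc="Plumbers", loc="California"):
    """Return the url of the results page with the given zero-based index."""
    query = urlencode({
        "find_desc": desc,
        "find_loc": loc,
        "ns": 1,
        "start": page_index * PAGE_SIZE,
    })
    return f"{BASE}?{query}"


# The first three pages: start=0, start=20, start=40
first_three = [results_page_url(i) for i in range(3)]
```

Changing desc and loc gives the 'whatever in wherever' version of the same search.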

In order to crawl through the results pages, we limit our crawl to urls that match the pattern we've observed above. We can do that by asking the crawl to ignore any urls that don't contain
?find_desc=Plumbers&find_loc=California&ns=1&start=
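
The rule is just a substring test. As a rough illustration (not how WebScraper is implemented internally, and should_follow is a made-up name), it behaves like this:

```python
REQUIRED_FRAGMENT = "?find_desc=Plumbers&find_loc=California&ns=1&start="


def should_follow(url):
    """True if the url contains the pagination pattern, i.e. it is a results page."""
    return REQUIRED_FRAGMENT in url


results_page = "https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=20"
other_page = "https://www.yelp.com/search?find_desc=Electricians&find_loc=California&ns=1&start=20"
```

Here should_follow(results_page) is True and should_follow(other_page) is False, so a different search linked from the page would not be crawled.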

In this case we're going to additionally click the links in the results, so that we can scrape the information we want from the businesses' pages. This is done on the 'Output filter' tab. Check 'Filter output' and enter these rules:
URL contains /biz/
and URL contains ?osq=Plumbers
(The phone numbers and business names are right there on the results pages and we could grab them from there, but for this exercise we're clicking through to each business page to grab the info from there. It has advantages.)
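
Both rules have to match for a page to appear in the output. A small sketch of that logic follows; the function name and the example urls are hypothetical, only the two 'contains' rules come from the settings above.

```python
def is_output_page(url):
    """True only if both output filter rules match the url."""
    return "/biz/" in url and "?osq=Plumbers" in url


# Hypothetical urls, for illustration only
business_page = "https://www.yelp.com/biz/example-plumbing-los-angeles?osq=Plumbers"
results_page = "https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=0"
```

is_output_page(business_page) is True, while the results page fails the /biz/ rule and so is crawled but not written to the output.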

Finally we need to set up the columns in our output file and specify what information we want to grab. On the business page, the name of the business is in the h1 tags, so we can simply select that. The phone number is helpfully in a div called 'biz-phone' so that's easy to set up too.
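
If you wanted to do the same extraction outside the app, a rough Python sketch using only the standard library could look like the following. It assumes the markup described above (the name in the first h1, the phone number in an element with the class 'biz-phone'). The sample html is invented for the example, and Yelp's real pages may differ or change.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class BizPageParser(HTMLParser):
    """Collect the text of the first <h1> and the first element with class 'biz-phone'."""

    def __init__(self):
        super().__init__()
        self.fields = {"name": [], "phone": []}
        self._active = None  # field currently being captured, if any
        self._depth = 0      # nesting depth inside the captured element
        self._done = set()   # fields that have already been captured

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._active:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "name" not in self._done:
            self._active = "name"
        elif "biz-phone" in classes and "phone" not in self._done:
            self._active = "phone"
        if self._active:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._active:
            return
        self._depth -= 1
        if self._depth == 0:
            self._done.add(self._active)
            self._active = None

    def handle_data(self, data):
        if self._active:
            self.fields[self._active].append(data)


def extract_name_and_phone(html):
    """Return {'name': ..., 'phone': ...} with whitespace collapsed."""
    parser = BizPageParser()
    parser.feed(html)
    parser.close()
    return {key: " ".join("".join(parts).split()) for key, parts in parser.fields.items()}


# Invented sample markup, not copied from Yelp
sample = """
<html><body>
  <h1 class="biz-page-title">Example Plumbing Co.</h1>
  <div class="biz-phone">
    (555) 010-0199
  </div>
</body></html>
"""
result = extract_name_and_phone(sample)
```

For the sample above, result holds the business name and the phone number with the surrounding whitespace stripped, which corresponds to the two columns set up in the app.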

Then we run by pressing the Go button. In an unlicensed copy of the WebScraper app, you should see 10 results. Once licensed, the app should crawl through the pagination and collect all (in this case) 200+ results.

Limitations

I was able to get all of the results (matching those available while using the browser) for this particular category. For some others I noticed that Yelp didn't seem to want to serve more than 25 pages of results, even when the page said that there were more pages. Skipping straight to the 25th page and then clicking 'Next' resulted in a page with hints about searching.

This isn't the same as becoming blacklisted, which will happen when you have made too many requests in a given time. That is obvious when it happens, because you then can't access Yelp in your browser without changing your IP address. One way to avoid this problem is to use ProxyCrawl, a service that requires an account (free initially). Once you have one, switch on 'Use ProxyCrawl' in WebScraper and enter your token in Preferences.

Wednesday 9 January 2019

New website monitor / archive utility for Mac arrives at full stable release and is still free

Watchman is an easy-to-use website monitoring / archiving utility.

You can use it to watch a single page, a part of a website or a whole website. It can run on a schedule (hourly, daily, weekly, monthly) and alert you to the changes you're interested in, which could be visible text, source, resources appearing on the page, or its status. Or you can simply leave it to archive all changes to all files. You can set up multiple website configurations, each with its own schedule. It uses the system's launchd, meaning that Watchman doesn't have to be left running; it'll just start as needed.

Watchman uses the same fast, efficient crawling engine as Scrutiny and Integrity, which has been developed over 12 years and offers a huge amount of configuration and tuning. This is coupled with a new web archive format.

Its web archive format can store changes like a Time Machine backup. You can view any page as it appeared on a certain date. When you do so, you're viewing a 'living' version of the page, with its CSS and JavaScript running as they were at the time, not a simple screenshot. You can of course export a version of a page as an image, or as a collection of all the files under their original filenames, as they were on that date. You can switch between versions of a page to compare them.

It's a desktop app running on your own Mac, so you own your own data.

It's early days and there are many more features in the pipeline, but for now it's stable and doing invaluable work. And version 1.x is free to download and use. (The next major version may not be free, or may be 'freemium', but the current version will continue to work and remain free.)

I've been flagging the release of Watchman for a while. It's been a long time since I've been so excited about a new project, and I believe it'll become a more important title for us than Scrutiny.

https://peacockmedia.software/mac/watchman