I've written similar tutorials to this one before, but I made these screenshots and wrote this tutorial this morning to help someone, and wanted to preserve the information here.
We're using the WebScraper app. The procedure below works while you're in demo mode, but the number of results will be limited.
We enter our starting URL. Perform your search on Yelp and then click through to the next page of results. Note how the URL changes as you click through. In this case it goes:
(I added &start=0 to the first URL; that parameter isn't there when you first reach the results, but including it avoids some duplication, because the default first page and &start=0 are the same page.)
So our starting URL should be:
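The pagination pattern described above can be sketched in a few lines of Python. The base URL here is hypothetical (your own search will have its own query string), and the step of 10 assumes ten results per page:

```python
# Sketch of the &start= pagination pattern. The base URL is a made-up
# example; the step of 10 assumes ten results per results page.
BASE = "https://www.yelp.com/search?find_desc=Plumbers&find_loc=London"

def page_urls(base, pages, step=10):
    """Yield one URL per results page by varying the &start= parameter."""
    for n in range(pages):
        yield f"{base}&start={n * step}"

urls = list(page_urls(BASE, 3))
# urls[0] ends with &start=0, urls[1] with &start=10, urls[2] with &start=20
```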
In order to crawl through the results pages, we limit the crawl to URLs that match the pattern observed above. We can do that by asking the crawl to ignore any URL that doesn't contain
In this case we're going to additionally click the links in the results, so that we can scrape the information we want from the businesses' pages. This is done on the 'Output filter' tab. Check 'Filter output' and enter these rules:
URL contains /biz/
and URL contains ?osq=Plumbers
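As a sketch, those two output-filter rules amount to a simple predicate. This is illustrative only; WebScraper applies the rules internally and no code is needed in the app:

```python
# A sketch of the two 'Output filter' rules as a Python predicate.
# Illustrative only: WebScraper applies these rules internally.
def keep(url):
    """True for business pages reached from this particular search."""
    return "/biz/" in url and "?osq=Plumbers" in url

keep("https://www.yelp.com/biz/ace-plumbing?osq=Plumbers")   # True
keep("https://www.yelp.com/search?find_desc=Plumbers")       # False
```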
Finally we need to set up the columns in our output file and specify what information we want to grab. On the business page, the name of the business is in the h1 tags, so we can simply select that. The phone number is helpfully in a div called 'biz-phone' so that's easy to set up too.
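To illustrate what those column settings are selecting, here is a standard-library sketch that pulls the same two fields from a simplified page. The sample HTML is made up, mirroring the structure mentioned above (name in the h1, phone in a div called 'biz-phone'); WebScraper does this for you via its column settings:

```python
from html.parser import HTMLParser

# Made-up sample page mirroring the structure described in the text:
# business name in <h1>, phone number in <div class="biz-phone">.
SAMPLE = """
<html><body>
  <h1>Ace Plumbing</h1>
  <div class="biz-phone">020 7946 0000</div>
</body></html>
"""

class BizPage(HTMLParser):
    def __init__(self):
        super().__init__()
        self._field = None           # which field the next text belongs to
        self.name = self.phone = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._field = "name"
        elif tag == "div" and attrs.get("class") == "biz-phone":
            self._field = "phone"

    def handle_data(self, data):
        if self._field and data.strip():
            setattr(self, self._field, data.strip())
            self._field = None

p = BizPage()
p.feed(SAMPLE)
print(p.name, p.phone)   # Ace Plumbing 020 7946 0000
```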
Then we run the scan by pressing the Go button. In an unlicensed copy of the WebScraper app you should see 10 results. Once licensed, the app should crawl through the pagination and collect all (in this case) 200+ results.
I was able to get all of the results (matching those available while using the browser) for this particular category. For some others I noticed that Yelp didn't seem to want to serve more than 25 pages of results, even when the page said that there were more pages. Skipping straight to the 25th page and then clicking 'Next' resulted in a page with hints about searching.
This isn't the same as being blacklisted, which happens when you've made too many requests in a given time. You'll know, because you then can't access Yelp in your browser without changing your IP address. One measure to avoid this problem is ProxyCrawl: get yourself an account (free initially), switch on 'Use ProxyCrawl' in WebScraper and enter your token in Preferences.
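The "too many requests in a given time" problem is a pacing issue. WebScraper handles its own request pacing, but as a sketch, a client-side throttle like this is one general way to keep requests spaced out (the interval here is an assumed example, not a known Yelp limit):

```python
import time

# Illustrative only: a minimal throttle that keeps successive requests
# at least min_interval seconds apart. The 2-second default is an
# assumption, not a documented Yelp rate limit.
class Throttle:
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval   # seconds between requests
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to honour the minimum interval."""
        now = time.monotonic()
        delay = self.min_interval - (now - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```

You'd call `wait()` before each fetch; the first call returns immediately and subsequent calls pause as needed.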
Friday 18 January 2019
Wednesday 9 January 2019
You can use it to watch a single page, part of a website or a whole website. It can run on a schedule (hourly, daily, weekly, monthly) and alert you to the changes you're interested in: visible text, source, resources appearing on the page, or the page's status. Or you can simply leave it to archive all changes to all files. You can set up multiple website configurations, each with its own schedule. It uses the system's launchd, meaning that Watchman doesn't have to be left running; it just starts as needed.
Watchman uses the same fast, efficient crawling engine as Scrutiny and Integrity, which has been developed over 12 years and offers a huge amount of configuration and tuning. This is coupled with a new web archive format.
It's a desktop app running on your own Mac, so you own your own data.
It's early days and there are many more features in the pipeline, but for now it's stable and doing invaluable work. Version 1.x is free to download and use. (The next major version may not be free, or may be 'freemium', but the current version will continue to work and remain free.)
I've been flagging up the release of Watchman for a while. It's been a long time since I've been so excited about a new project, and I believe it'll become a more important title for us than Scrutiny.