I've recently been working on some enhancements to WebScraper so that it can handle some of the odd problems encountered while scraping the US Yellowpages.com. With some setting up, it now does that pretty well (though scraping is a constantly-shifting area).
Users' needs are usually pretty specific. I wondered whether I could make a much simplified version of WebScraper, pre-configured to scan that particular site.
It turns out that Yellowpages.com provides an API, which changes the game. This app can be much more efficient. The trade-off is that there is a 'fair use' limit per API key, but this is very generous and the limit I've built into my app is pretty hard to hit.
This is the interface.
Yes, that's all there is.
In return for such a simple interface, all configuration is fixed. It's locked to yellowpages.com and the output file is pre-configured too.
For more flexibility (or to scrape a different site) please look at WebScraper.
Thursday, 20 April 2017
Saturday, 1 April 2017
Finding redirect chains using Scrutiny for Mac
Setting up redirects is important when moving a site, but don't let it get out of hand over time!
John Mueller has said that the maximum number of hops Googlebot will follow in a chain is five.
Scrutiny keeps track of the number of times any particular request is redirected, and can report these to you if you have any.
Here's how:
First you need to scan your site. Add a config in Scrutiny, give it a name and your starting url (home page).
Then press 'Scan now'.
Once you've scanned your site and you're happy that you don't need to tweak your settings for any reason, go to the SEO results.
If there are any urls with a redirect chain, they will be shown in this list:
(Note that at the time of writing, Scrutiny is configured to include pages in this count if they have more than 5 redirects, but you can see all redirect counts in the Links 'by link' view as described later.)
You can see the pages in question by choosing 'Redirect chain' from the Filter button over on the right:
That will show you the urls in question. (As things stand in the current version as I write this, it'll show the *final* url. This is appropriate here because this SEO table lists pages, not links, so the url shown is the actual url of the page in question.)
A powerful feature of Scrutiny is the ability to see a trace of the complete journey.
Find the url in the Links results. (You can sort by url, or paste a url into the search box.) Note that as of version 7.2, there is a 'Redirect count' column in the Links 'by link' view. You may need to switch the column on using the selector to the top-left of the table. You can sort by this column to find the worst offenders:
... and double-click to open the link inspector. The button to the right of the redirect field will show the number of redirects, and you can use this button to begin the trace:
Some of this functionality is new (or improved) in version 7.2. Users of 7.x should update.
There is a very reasonable upgrade path for users of versions earlier than 7.
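If you'd like to see the idea of a redirect trace outside Scrutiny, here's a rough Python sketch. It is not how Scrutiny works internally; it just follows 'Location' headers one hop at a time and records each url. The `fetch` callable, the `FAKE_SITE` table and the example.com urls are all made up for illustration, so the sketch runs without touching the network. In real use you'd swap in a function that makes a request without auto-following redirects.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# A fetcher takes a url and returns (status code, Location header or None).
# This signature is an assumption made for the sketch.
Fetcher = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def trace_redirects(url: str, fetch: Fetcher, max_hops: int = 20) -> List[str]:
    """Return every url visited: the starting url first, the final url last."""
    chain = [url]
    while len(chain) - 1 < max_hops:
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break  # reached a page that does not redirect
        target = urljoin(chain[-1], location)  # Location may be relative
        looped = target in chain
        chain.append(target)
        if looped:
            break  # redirect loop: stop after recording the repeat
    return chain


# Hypothetical site used only to demonstrate the function.
FAKE_SITE = {
    "http://example.com/a": (301, "/b"),
    "http://example.com/b": (302, "http://example.com/c"),
    "http://example.com/c": (200, None),
}


def fake_fetch(url: str) -> Tuple[int, Optional[str]]:
    return FAKE_SITE.get(url, (404, None))


chain = trace_redirects("http://example.com/a", fake_fetch)
print(" -> ".join(chain))
print("redirect count:", len(chain) - 1)
```

The redirect count is simply the number of hops in the chain, which is the figure you'd compare against the five-hop limit mentioned above.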
Wednesday, 29 March 2017
First public beta of 404bypass
Problem: You've moved a website, your pages are now on a different domain and for various reasons the urls may no longer be an exact match. Maybe the directory structure is different, maybe some pages have had urls altered for SEO reasons.
Solution: a .htaccess file at the root of the old domain to redirect old pages. Users as well as Google will automagically follow those redirects (with a 301 server response code to tell them that the page has moved permanently) and receive the new pages.
Creating that file requires a lot of manual matching, copying and pasting.
Better solution: Use an app which will scan both versions of the site, compare the pages, make 'smart matches' and produce that .htaccess file for you.
404bypass uses the Integrity scanning engine to quickly scan your sites, compiling lists of your old and new pages. It can then use smart and manual matching to quickly and easily produce that redirect file.
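To give a feel for what 'matching' old and new pages might involve, here's a minimal Python sketch. It is not 404bypass's algorithm (that isn't described here); it simply compares url paths with the standard library's `difflib` and writes one `Redirect 301` line per match. The function names, the 0.6 cutoff and the old.example / new.example urls are all assumptions for illustration.

```python
import difflib
from typing import Dict, List, Optional
from urllib.parse import urlsplit


def smart_match(old_urls: List[str], new_urls: List[str],
                cutoff: float = 0.6) -> Dict[str, Optional[str]]:
    """Pair each old url with the new url whose path looks most similar.

    Old urls with no candidate above the cutoff map to None, so they
    can be matched by hand afterwards.
    """
    new_by_path = {urlsplit(u).path: u for u in new_urls}
    matches: Dict[str, Optional[str]] = {}
    for old in old_urls:
        close = difflib.get_close_matches(
            urlsplit(old).path, list(new_by_path), n=1, cutoff=cutoff)
        matches[old] = new_by_path[close[0]] if close else None
    return matches


def htaccess_lines(matches: Dict[str, Optional[str]]) -> List[str]:
    """Build one Apache 'Redirect 301' line per matched page."""
    return [
        f"Redirect 301 {urlsplit(old).path} {new}"
        for old, new in matches.items()
        if new is not None
    ]


# Hypothetical page lists, as if produced by scanning both sites.
old_pages = [
    "http://old.example/products/blue-widget.html",
    "http://old.example/about-us.html",
    "http://old.example/contact.php",
]
new_pages = [
    "https://new.example/shop/blue-widget.html",
    "https://new.example/about.html",
]

matches = smart_match(old_pages, new_pages)
for line in htaccess_lines(matches):
    print(line)
```

Anything left unmatched (here, the contact page) is exactly the sort of thing that would need a manual match.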
The use of templates means that you can tweak the format of your output file, or export your matches as other types of file. A CSV template is built in, and you can even create your own templates, giving the app uses that we may not have thought of yet!
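The general idea of template-driven output can be sketched in a few lines of Python. This is only an illustration: the placeholder names (`$old_path`, `$new_url`) and the template strings are invented for the sketch and are not 404bypass's actual template syntax.

```python
from string import Template
from typing import Iterable, Tuple

# Hypothetical templates: one output line per matched pair of pages.
TEMPLATES = {
    "htaccess": Template("Redirect 301 $old_path $new_url"),
    "csv": Template('"$old_path","$new_url"'),
}


def render(pairs: Iterable[Tuple[str, str]], template: Template) -> str:
    """Fill the template once for each (old path, new url) pair."""
    return "\n".join(
        template.substitute(old_path=old, new_url=new) for old, new in pairs
    )


# Made-up matches for demonstration.
pairs = [
    ("/about-us.html", "https://new.example/about.html"),
    ("/news/index.html", "https://new.example/blog/"),
]

print(render(pairs, TEMPLATES["htaccess"]))
print(render(pairs, TEMPLATES["csv"]))
```

The same list of matches produces a different file just by swapping the template. Note that the CSV template here does no escaping, so a real exporter would want the `csv` module for values containing quotes.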
The manual is here
Download from here (First public beta is free for a limited time)
Screenshots below.
Thursday, 2 February 2017
Scrutiny 7 launched! 50% deal via MacUpdate
After many months in development and more in testing, Scrutiny v7 is now available.
Scrutiny builds on the link tester Integrity. As well as the crawling and link-checking functionality it also handles:
- SEO - loads of data about each page
- Sitemap - generate and ftp your XML sitemap (broken into parts with a sitemap index for larger sites)
- Spelling and grammar check
- Site search with many parameters including multiple search terms and 'pages that don't contain'
- Many advanced features such as authentication, cookies and JavaScript.
The main new features of version 7 are:
- Multiple windows - have as many windows open as you like to run concurrent scans, view data, configure sites, all at once
- New UI, includes breadcrumb widget for good indication of where you are, and switching to other screens
- Organise your sites into folders if you choose
- Autosave saves data for every scan, giving you easy access to results for any site you've scanned
- Better reporting - summary report looks nicer, full report consists of the summary report plus all the data as CSVs
- Many more new features and enhancements
MacUpdate are currently running a 50% discount. [update, now finished, but look out for more]
Note that there's an upgrade path for users of v5 and v6 with a small fee ($20). You can use this form for the upgrade.
Monday, 16 January 2017
How to use WebScraper to compile a list of names and numbers from a directory
[updated 17 April 2017]
First we find a web search which produces the results we're after. This screenshot shows how that's done, and the url we want to grab as the starting url for our crawl.
That url goes into the starting url box in Webscraper.
We obviously don't want the crawling engine to scan the entire site but we do want it to follow those 'More info' links, because that's where the detail is. We notice that those links go through to a url which contains /mip/ so we can use that term to limit the scan (called 'whitelisting').
We also notice the pagination here. It'll be useful if Webscraper will follow those links, to find further results for our search, and then follow the 'more information' links on those pages. We notice that the pagination uses "&page=" in its urls, so we can whitelist that term too in order to allow the crawler access to page 2, page 3 etc of our search results.
The whitelist field in WebScraper allows for multiple expressions, so we can add both of these terms (separated with a comma). WebScraper won't follow any url that contains neither of these terms.
** note a recent change to Webscraper's fields and my advice here - see the end of this article.
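The whitelist rule described above boils down to a simple 'contains any of these terms' test. Here's a small Python sketch of that logic. It's an illustration rather than WebScraper's code: the function names and sample urls are made up, and treating an empty whitelist as 'follow everything' is an assumption.

```python
from typing import List


def parse_terms(field: str) -> List[str]:
    """Split a comma-separated field into trimmed, non-empty terms."""
    return [term.strip() for term in field.split(",") if term.strip()]


def should_follow(url: str, whitelist: List[str]) -> bool:
    """Follow a url only if it contains at least one whitelisted term."""
    if not whitelist:
        return True  # assumption: no whitelist means no restriction
    return any(term in url for term in whitelist)


whitelist = parse_terms("/mip/, &page=")

# Made-up urls in the style of the ones discussed above.
samples = [
    "https://www.yellowpages.com/search?search_terms=plumber&page=2",
    "https://www.yellowpages.com/boston-ma/mip/acme-plumbing-123",
    "https://www.yellowpages.com/about",
]
for url in samples:
    print(should_follow(url, whitelist), url)
```

The first two urls pass because they contain "&page=" and "/mip/" respectively; the third contains neither, so it would be skipped.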
That's the setting up. If the site in question requires you to be logged in, you'll need to check 'attempt authentication' and use the button to visit the site and log in. That's dealt with in another article.
Kick off the scan with the Go button. At present, the way Webscraper works is that you perform the scan, then when it completes, you build your output file and finally save it.
If the scan appears to be scanning many more pages than you expected, then you can use the Live View button to see what's happening, and if necessary, stop and adjust your settings. You're very welcome to contact support if you need help.
When the scan finishes, we're presented with the output file builder. I'm after a csv containing a few columns. I always start with the page url as a reference. That allows you to view the page in question if you need to check anything. Select URL and press Add.
Here's the fun part. We need to find the information we want, and add column(s) to our output file using a class or id if possible, or maybe a regular expression. First we try the class helper.
This is the class / id helper. It displays the page, it shows a list of classes found on the page, and even highlights them as you hover over the list. Because we want to scrape information off the 'more info' pages, I've clicked through to one of those pages. (You can just click links within the browser of the class helper.)
Rather helpfully, the information I want here (the phone number) has a class "phone". I can double-click that in the table on the left to enter it into the field in the output file builder, and then press the Add button to add it to my output file. I do exactly the same to add the name of the business (class - "sales-info").
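For the curious, pulling text out of a page by class name can be sketched with nothing more than Python's built-in HTML parser. This is not what WebScraper does under the hood, just an illustration of the idea. The class names "phone" and "sales-info" come from the page discussed above; the sample markup, the business name and the number are invented.

```python
from html.parser import HTMLParser
from typing import List

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given class."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.depth = 0            # > 0 while inside a matching element
        self.buffer: List[str] = []
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1       # nested element inside a match
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:       # the matching element just closed
            self.results.append(" ".join("".join(self.buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_by_class(html: str, class_name: str) -> List[str]:
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Invented markup standing in for a 'more info' page.
SAMPLE_PAGE = """
<div class="sales-info"><h1>Acme Plumbing</h1></div>
<p class="phone">(555) 010-0199</p>
"""

print(extract_by_class(SAMPLE_PAGE, "sales-info"))
print(extract_by_class(SAMPLE_PAGE, "phone"))
```

Each class name becomes one column in the output, which is essentially what adding a class in the output file builder sets up.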
For good measure I've also added the weblink to my output file. (I'm going to go into detail re web links in a different article because there are some very useful things you can do.)
So I press save and here's the output file. (I've not bothered to blur any of the data here, as it's available on the web.)
So that's it. How about getting a nice html file with all those weblinks as clickable links? That'll be in the next article.
** update - screen-scraping is a constantly-shifting thing... Recently yellowpages began putting "you may also be interested in...." links on the information pages - these links contain /mip/ and they aren't limited to the area that you originally searched for. Result - WebScraper's crawl going on ad infinitum.
What we needed was a way to follow the pagination from our original search results page, to follow any links through to the information pages, to scrape the data from those pages (and only those pages) but not follow any links on them.
So now WebScraper (as of v2.0.3) has a field below the 'blacklist' and 'whitelist' fields labelled 'information page:'. A partial url in that box indicates to WebScraper that matching urls are to be scraped, but not parsed for more links. My setup for a yellowpages scrape looks like this and it works beautifully:
Once again, you're very welcome to contact support if you need help with any of this.
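To make the 'scrape but don't parse' rule concrete, here's a toy Python crawl over a made-up link graph. It is only a sketch of the behaviour described above, not WebScraper's implementation. The urls, the `get_links` callable and the choice to put "&page=" in the whitelist and "/mip/" in the information page field are assumptions for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(start: str,
          get_links: Callable[[str], Iterable[str]],
          whitelist: List[str],
          info_terms: List[str]) -> List[str]:
    """Return the information pages reached from the starting url.

    Pages matching info_terms are scraped but never parsed for links.
    Other pages are parsed, and their links are followed only if they
    match the whitelist or look like information pages.
    """
    seen = {start}
    queue = deque([start])
    scraped: List[str] = []
    while queue:
        url = queue.popleft()
        if any(term in url for term in info_terms):
            scraped.append(url)   # scrape it, but do not follow its links
            continue
        for link in get_links(url):
            allowed = any(term in link for term in whitelist + info_terms)
            if allowed and link not in seen:
                seen.add(link)
                queue.append(link)
    return scraped


# Made-up site: each url maps to the links found on that page.
SITE = {
    "http://example.com/search?q=x": [
        "http://example.com/search?q=x&page=2",
        "http://example.com/mip/a",
        "http://example.com/about",
    ],
    "http://example.com/search?q=x&page=2": [
        "http://example.com/mip/b",
    ],
    # A 'you may also be interested in' link that must not be followed.
    "http://example.com/mip/a": ["http://example.com/mip/elsewhere"],
    "http://example.com/mip/b": [],
}

pages = crawl("http://example.com/search?q=x",
              lambda url: SITE.get(url, []),
              whitelist=["&page="],
              info_terms=["/mip/"])
print(pages)
```

The 'elsewhere' page is never visited because the crawl stops parsing as soon as it lands on an information page, which is what keeps the scan from running on indefinitely.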
Tuesday, 3 January 2017
Crawling a website that requires authentication
This is a big subject, and it gets bigger and more complicated as websites become increasingly clever at preventing non-human visitors from being able to log in.
My post How to use Scrutiny to test a website which requires authentication has been updated a number of times in its history and I've just updated it again to include a relatively recent Scrutiny feature. It's a simple trick involving a browser window within Scrutiny which allows you to log into your site. If there's a tracking cookie, that's then retained for Scrutiny's scan.
It used to be possible to simply log in using Safari - Safari's cookies seem to have been systemwide, but after Yosemite, a browser's cookies seem to be specific to that browser.
The reason for this all being on my mind today is that I've just worked the same technique into WebScraper. I wanted to compile a list of some website addresses from pages within a social networking site which is only visible to authenticated users.
Webscraper doesn't have the full authentication features of Scrutiny but I think this method will work with the majority of websites which require authentication.
(This feature, and others, are in Webscraper 1.3 which will be available very shortly)
Sunday, 1 January 2017
17% off Integrity Plus
We'd like to wish you a happy and prosperous New Year.
Of course, that means having the best tools, and if you're a user of website link checker Integrity, or have trialled Integrity Plus, you'll enjoy the extra features of Integrity Plus for Mac. As well as the fast and accurate link check, you can filter and search your results, manage settings for multiple sites and generate an xml sitemap.
So we're offering a 17% discount to kick off 2017 (see what we did there?). The offer expires 14 Jan 2017.
There's no coupon, simply buy from within the app or use this secure link:
https://pay.paddle.com/checkout/496583