Thursday 9 July 2015

Using Integrity to scan Blogger sites for broken links - some specifics

I've recently been helping someone with a few issues experienced when testing a Blogger blog with Integrity.

Some of these things are of general interest; others will be useful to anyone else who's link-checking a Blogger site. These tips apply equally to Integrity Plus and Scrutiny.

1. Share links being reported as bad

You may have these share links at the bottom of each post.
As you'd expect, they redirect to a login page, so no danger of Integrity actually sharing any of your posts. The problem comes when you're testing a larger site with more threads. These links may eventually begin to return an error code. I don't know whether this is because of the heavy bombardment on the share functionality, or whether Blogger is detecting the abnormal use. Either way, you may begin to get lots of red in your results.

One solution is to turn down the number of threads to a minimum. This isn't desirable because the crawl will then take hours. A better solution is to ask Integrity not to check those links (it's pretty certain that they'll be ok).

(Note: Even though these links use a querystring with parameters, checking 'ignore querystrings' won't work. These links are on a different domain from the blog address, so they look like external links, and the 'ignore querystrings' setting only applies to internal links.)
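To illustrate the internal/external distinction, here is a minimal Python sketch. It is not Integrity's own code; the function name, the blog address and both example urls are made up for illustration, and it simply assumes that 'internal' means 'same host as the starting url'.

```python
from urllib.parse import urlsplit


def is_internal(link: str, start_url: str) -> bool:
    """Treat a link as internal only if its host matches the starting url's host."""
    return urlsplit(link).hostname == urlsplit(start_url).hostname


# Hypothetical addresses, for illustration only.
blog = "http://example.blogspot.com/"
share_link = "https://www.blogger.com/share-post.g?blogID=123&postID=456&target=email"
post_link = "http://example.blogspot.com/2015/07/a-post.html?showComment=1"

# The share link lives on another host, so a rule that only applies to
# internal links never gets the chance to strip its querystring.
share_is_internal = is_internal(share_link, blog)
post_is_internal = is_internal(post_link, blog)
print(share_is_internal, post_is_internal)
```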

Add a 'blacklist rule' using the little [+] button (screenshot below). Make a rule that says 'do not check urls containing share-post'.
While here, add similar rules for 'delete-comment' and 'post-edit'. It was a concern to see these urls appearing in my link-check results. They do indeed appear in the pages' html code, although they're hidden by the browser if you're browsing as a guest. But no need to worry - as you'd expect, they also redirect to a login screen and Integrity isn't capable of logging in. *
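In effect, a 'do not check urls containing ...' rule is a substring test on each url before it is queued for checking. The sketch below shows the idea in Python; the function and variable names and the example urls are my own inventions for illustration, and Integrity's real matching may differ in its details.

```python
# The three terms suggested above for a Blogger site.
BLACKLIST = ["share-post", "delete-comment", "post-edit"]


def should_check(url: str, blacklist=BLACKLIST) -> bool:
    """Return False if the url contains any blacklisted term."""
    return not any(term in url for term in blacklist)


# Hypothetical urls of the kind found in a Blogger page's html.
links = [
    "http://example.blogspot.com/2015/07/a-post.html",
    "https://www.blogger.com/share-post.g?blogID=123&postID=456&target=email",
    "https://www.blogger.com/delete-comment.g?blogID=123&postID=789",
    "https://www.blogger.com/post-edit.g?blogID=123&postID=456",
]

# Only the ordinary post url survives the filter.
to_check = [url for url in links if should_check(url)]
print(to_check)
```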


2. A large amount of yellow

Integrity highlights redirected urls in yellow. Not an error but a 'FYI'. Some webmasters like to find and deal with redirects, but the Blogger server uses redirects extensively and it's just part of the way it works. When testing a Blogger site, you will see a lot of these but it's not usually something you need to worry about.

If you like, you can change the colour that Integrity uses to highlight such links - you can change it to white or, better still, transparent. See Preferences > Views and then click the yellow colour-well to see the standard OS X colour picker with an 'opacity' slider.

3. Pageviews on your website

Given that Google Analytics uses client-side javascript to make it work (meaning that crawling apps like Integrity don't trigger page views **) I was surprised to find Integrity triggering page views with a Blogger site. I guess it counts the views server-side.

It seems that changing the user-agent string to that of Googlebot stopped these hits from registering.

The user-agent string is how any browser or web crawler identifies itself. It's useful for a web server to know what is making requests to it.

Posing as Googlebot by using the Googlebot user-agent string:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
... seems to work - it prevents hits from triggering page views in Blogger's dashboard

Deliberately using another string (known as 'spoofing') is technically misuse of the user-agent string, but until Google recognises Scrutiny and Integrity as web crawlers, I think this is forgivable. If you'd like to be a little more transparent, I've found that this alternative also works:
Integrity/5.4 (posing as: Googlebot/2.1; +http://www.google.com/bot.html)

I will shortly be building this Googlebot string into the drop-down picker in Preferences. In the meantime, just go to Preferences > Global and paste one of those strings into the 'User-agent string' box.
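For anyone curious about what that setting amounts to, the user-agent string is just a header sent with each HTTP request. Here is a minimal Python sketch using the standard library; the helper name and the blog address are hypothetical, and the request is only built, not sent.

```python
from urllib.request import Request

# The Googlebot string quoted above.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url: str, user_agent: str = GOOGLEBOT_UA) -> Request:
    """Build (but do not send) a request carrying the given user-agent string."""
    return Request(url, headers={"User-Agent": user_agent})


# Hypothetical blog address, for illustration only.
req = build_request("http://example.blogspot.com/")

# urllib stores header names capitalised as 'User-agent'.
sent_user_agent = req.get_header("User-agent")
print(sent_user_agent)
```

Passing the more transparent 'posing as' string as the second argument works in the same way.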




* Neither Integrity nor Integrity Plus is capable of authenticating itself; in effect, they view websites as an anonymous guest. Scrutiny is capable of authentication. It's a feature that's much in demand (if you want to test a website which requires you to log in before you see the content), but it must be used with care - it's not possible to switch it on without seeing warnings and advice.

** I guess that Scrutiny could trigger page views when its 'run js' feature is switched on, though I haven't tested that.
