Tuesday 26 January 2016

Integrity and Scrutiny displaying a 200 code for a url that doesn't exist

This problem is specific to the user (i.e. someone else somewhere else may correctly get an error reported for the same url). When the url is pasted into the browser (or visited from within Integrity or Scrutiny) a page is shown, branded with the internet provider's logo, with a search box and maybe some advertising.

What's happening? 

The user's internet service provider is recognising that the server being requested doesn't exist, and is 'helpfully' displaying something it considers more useful. My own provider says (quote from their website) "this service is provided free and is designed to enhance the surfing experience by reducing the frustration caused by error pages".

(Note the advertising - your provider is making money out of this.)

The content of the page they provide is neither helpful nor unhelpful, but the 200 code they return with the page is decidedly unhelpful when we're trying to crawl a website and find problems. A web crawler like Integrity or Scrutiny can only know that there's a problem with a link by the server response code.

Personally I think this practice is wrong. If you request a url where the server doesn't exist, it's incorrect to be shown a page with a 200 code.

This is similar to a soft 404 because a 200 is being returned when a request is sent for a page that doesn't exist. I'm tempted to call this a 'soft 5xx' because 5xx codes are server errors, although in this case, if there is no server, then we can't have a server response code.
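One rough way to check whether your own connection is affected is to look up a hostname that cannot exist and see whether it resolves anyway. The Python sketch below is only an illustration, not how Integrity or Scrutiny work internally. The function name `resolver_rewrites_nxdomain` is made up for this example, and it assumes the provider does its redirecting at the DNS level. It relies on the `.invalid` top-level domain, which is reserved and should never resolve.

```python
import socket
import uuid


def resolver_rewrites_nxdomain(resolve=socket.getaddrinfo):
    """Return True if a hostname that cannot exist still resolves.

    `resolve` is injectable so the check can be exercised without a network;
    by default it is the standard library's socket.getaddrinfo.
    """
    # .invalid is a reserved top-level domain, so this random name never exists.
    hostname = f"{uuid.uuid4().hex}.invalid"
    try:
        resolve(hostname, 80)
    except socket.gaierror:
        # The lookup failed, which is the correct behaviour.
        return False
    # Something answered for a name that does not exist.
    return True


if __name__ == "__main__":
    if resolver_rewrites_nxdomain():
        print("Lookups for non-existent servers are being answered.")
    else:
        print("Lookups for non-existent servers fail as expected.")
```

A provider may only redirect certain domains, so a 'fails as expected' result here does not prove the service is switched off.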

What can we do?

I now know of two providers that offer to switch this service off. Do some digging; your provider may have a web page that allows you to switch this preference yourself. If not, contact them and ask them to switch it off. Integrity / Scrutiny will then behave as expected.

If that fails, then you can use Integrity / Scrutiny's 'soft 404' feature (Preferences > Links). Find some unique text on the error page (maybe in the page title) and type part or all of that text into the box provided there.

The problem urls will then be reported with a 'soft 404' status which is better than the 200.
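The idea behind a soft 404 check can be sketched in a few lines of Python. This is an illustration only, not Integrity or Scrutiny's actual code. The function name `classify` and the example phrase are invented, and matching case-insensitively is an assumption on my part.

```python
def classify(status_code, body, soft_404_phrases):
    """Return a status label for a fetched page.

    A page that came back with a 200 but contains one of the user's
    'soft 404' phrases is reported as 'soft 404' instead of '200'.
    """
    if status_code == 200:
        lowered = body.lower()
        for phrase in soft_404_phrases:
            # Skip empty entries so a blank box never matches every page.
            if phrase and phrase.lower() in lowered:
                return "soft 404"
    return str(status_code)


# Hypothetical text taken from the title of a provider's search page.
phrases = ["Example Broadband Search"]
page = "<html><title>Example Broadband Search</title></html>"
print(classify(200, page, phrases))
print(classify(200, "<html><title>My real page</title></html>", phrases))
```

The more distinctive the phrase, the less chance that a genuine page on the site is flagged by mistake.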

Saturday 23 January 2016

What happened to 'recheck bad links' in Integrity / Scrutiny?

Lots of people have missed this feature. 

Unfortunately it was always problematic. Besides bugs and problems with the actual implementation there were some more logical problems too. One example is if a user has 'fixed' an outdated link by removing it from the page. Scrutiny would simply re-test the url it has and continue to report it as a bad link. The fix for this is convoluted. Given that the user may have altered anything on the site since the scan, it's slightly flawed in principle to simply re-check urls that Scrutiny is holding.

There's often a more specific reason for wanting the feature rather than a broad-brush 'recheck all bad links'. For example, the server may have given a few 50x's at a particular point and the user just wants to re-check those. Or the user has fixed a specific link, or links on a specific page.

After working with a user on this, we found a solution that answered the above requirements, while being more sensible in principle. 

From Scrutiny 6.2.1 (and soon in Integrity 6 too) the Links views (by link, by page and by status) allow multiple selection. Context menus then give access to 'Mark as fixed' and 'Re-check selected'.
It is possible to select all urls, all bad ones, all urls on a particular page or all urls with a specific status. It is still possible to re-check a link which may no longer exist on a page. That is illogical, but if the user selects the url and chooses the re-check option then it's a deliberate action on their part.
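In effect, the selections described here (by status, by page) narrow the list of results before anything is re-tested. A minimal Python sketch of that filtering follows. The names `LinkResult` and `select_for_recheck` are hypothetical and are not taken from Scrutiny itself.

```python
from dataclasses import dataclass


@dataclass
class LinkResult:
    url: str     # the link target that was tested
    page: str    # the page the link appears on
    status: int  # the last response code recorded for the link


def select_for_recheck(results, statuses=None, page=None):
    """Return the results matching the given statuses and/or page.

    Passing neither filter selects everything.
    """
    selected = []
    for result in results:
        if statuses is not None and result.status not in statuses:
            continue
        if page is not None and result.page != page:
            continue
        selected.append(result)
    return selected


results = [
    LinkResult("http://example.com/a", "http://example.com/", 200),
    LinkResult("http://example.com/b", "http://example.com/", 503),
    LinkResult("http://example.com/c", "http://example.com/about", 404),
]

# Re-check only the links that returned a 50x during the scan.
for result in select_for_recheck(results, statuses={500, 502, 503, 504}):
    print(result.url)
```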

6.2.1 is currently in beta; the link below will download it and will give 30 days' use.

Wednesday 20 January 2016

Live view in Scrutiny

I'm sure this will please a lot of people.

Scrutiny 5 included a new UI. (In reviews, the interface of Scrutiny 4 and earlier was responsible for lost marks, and the new previous / next system was well-received.) But not being able to see the results as they happen has been a running theme in support calls since then.

There are many reasons for the lonely progress bar in v5. Not least is that constantly refreshing tableviews / outlineviews eats up cpu and resources. With a very large site, the engine goes faster and further if a tableview is not visible. (Since the v6 engine, there are Scrutiny users crawling millions of links in one sitting now. Efficiency is important!)

As practical as it is in those ways, one major downside of the bare progress bar is that if your scan goes a little pear-shaped (maybe because of timeouts, or because some settings need tweaking) you don't know that until the scan finishes, or until you realise that it's going on far longer than it should. The 'workaround' for this was a menu option 'View > Partial results' which you'll find in more recent versions of 5 and in 6 (up to 6.1.5). (You need to pause before this option can be used.)

But there has still been demand to see what's happening. Maybe so that you can spot any problems as they're happening, maybe so that you can begin to visually scan the results while the scan is taking place, or maybe because it's just fun to watch the numbers change in front of your eyes!

So in 6.2 there's 'live view'. Alongside the progress bar is a button which unfolds a table (see screenshot above). There is a warning (once per session, hopefully not too annoying) that it's not recommended for larger sites. It's not possible to open up a link inspector, or expand one of the rows to see further details, but there's enough there to give that satisfying visual feedback and spot any problems as they happen.

[update] Scrutiny 6.2 is now out of beta. Please download the current version from Scrutiny's home page.

(New users - use it in demo mode for up to 30 days.)

Please let me know of any problems you might spot.