Showing posts with label Server reponse codes. Show all posts

Tuesday, 26 January 2016

Integrity and Scrutiny displaying a 200 code for a url that doesn't exist

This problem is specific to the user (i.e. someone else, somewhere else, may correctly get an error reported for the same url). When pasted into the browser (or visited from within Integrity or Scrutiny), the url displays a page branded with the internet provider's logo, with a search box and maybe some advertising.

What's happening? 

The user's internet service provider is recognising that the server being requested doesn't exist, and is 'helpfully' displaying something it considers more useful. My own provider says (quote from their website) "this service is provided free and is designed to enhance the surfing experience by reducing the frustration caused by error pages".

(Note the advertising - your provider is making money out of this.)

The content of the page they provide is neither helpful nor unhelpful, but the 200 code they return with the page is decidedly unhelpful when we're trying to crawl a website and find problems. A web crawler like Integrity or Scrutiny can only know that there's a problem with a link by the server response code.

Personally I think this practice is wrong. If you request a url where the server doesn't exist, it's incorrect to be shown a page with a 200 code.

This is similar to a soft 404, because a 200 is returned for a request for a page that doesn't exist. I'm tempted to call this a 'soft 5xx', because 5xx codes are server errors, although in this case, if there is no server, we can't have a server response code at all.
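The hijacking described above can be detected programmatically: resolve a hostname that cannot possibly exist, and see whether the lookup still succeeds. Here's a minimal Python sketch of the idea (this is not part of Integrity or Scrutiny; the function name and the injectable resolver are mine, for illustration):

```python
import socket
import uuid


def isp_hijacks_nxdomain(resolve=socket.gethostbyname):
    """Return True if a clearly nonexistent hostname still resolves,
    which suggests the provider is intercepting failed DNS lookups
    and pointing them at its own 'helpful' landing page."""
    # A random hostname that is vanishingly unlikely to be registered.
    bogus = "nxdomain-test-" + uuid.uuid4().hex + ".example-nonexistent.com"
    try:
        resolve(bogus)
        return True   # it resolved: something is answering for dead domains
    except socket.gaierror:
        return False  # normal behaviour: the lookup fails
```

Injecting the resolver keeps the check testable without touching the network; in real use you'd call it with the default `socket.gethostbyname`.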

What can we do?

I now know of two providers that offer to switch this service off. Do some digging; your provider may have a web page that allows you to change this preference yourself. If not, contact them and ask them to switch it off. Integrity / Scrutiny will then behave as expected.

If that fails, then you can use Integrity / Scrutiny's 'soft 404' feature (Preferences > Links). Find some unique text on the error page (maybe in the page title) and type part or all of that text into the 'soft 404' field.
The problem urls will then be reported with a 'soft 404' status, which is more useful than a misleading 200.
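The check behind a 'soft 404' feature amounts to a simple text match against a 200 response. A hedged sketch of the idea in Python (the function name and marker text are hypothetical, not Scrutiny's actual implementation):

```python
def looks_like_soft_404(status, page_text, marker="search the web"):
    """Flag a response as a 'soft 404': a 200 whose content contains
    marker text known to appear on the provider's substitute page.
    `marker` is a placeholder; in practice you'd use text unique to
    the error page your ISP serves (e.g. part of its page title)."""
    return status == 200 and marker.lower() in page_text.lower()
```

A crawler applying this would report matching urls as 'soft 404' rather than trusting the 200 the server (or the ISP) returned.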

Saturday, 23 January 2016

What happened to 'recheck bad links' in Integrity / Scrutiny?

Lots of people have missed this feature. 

Unfortunately it was always problematic. Besides bugs and problems with the actual implementation, there were some more logical problems too. One example: if a user has 'fixed' an outdated link by removing it from the page, Scrutiny would simply re-test the url it holds and continue to report it as a bad link. Fixing that is convoluted. Given that the user may have altered anything on the site since the scan, it's flawed in principle to simply re-check urls that Scrutiny is holding.

There's often a more specific reason for wanting the feature than a broad-brush 'recheck all bad links'. For example, the server may have returned a few 50x responses at a particular point and the user just wants to re-check those. Or the user has fixed a specific link, or the links on a specific page.

After working with a user on this, we found a solution that answered the above requirements, while being more sensible in principle. 

From Scrutiny 6.2.1 (and soon in Integrity 6 too) the Links views (by link, by page and by status) allow multiple selection. Context menus then give access to 'Mark as fixed' and 'Re-check selected'.
It is possible to select all urls, or all bad ones, all urls on a particular page, or all urls with a specific status. It is still possible to re-check a link which may no longer exist on a page, but if the user selects that url and chooses the re-check option, then it's a deliberate action on their part.

6.2.1 is currently in beta; the link below will download it and give 30 days' use.

Tuesday, 25 November 2014

203 server response code

Here's a very interesting code that Scrutiny turned up on a website.


Scrutiny reports '203 non-authoritative information'. W3C elaborates on this a little bit:


Partial Information 203
When received in the response to a GET command, this indicates that the returned metainformation is not a definitive set of the object from a server with a copy of the object, but is from a private overlaid web. This may include annotation information about the object, for example.

So this means that a third-party is providing the information you see. Presumably this is no different from something many of us do - making some space available on the page and allowing Google or another third party to populate it with advertising. (And indeed the page you get at the domain in question here is the kind of thing you'd expect from clicking on an ad).

What's interesting here is that you can visit a domain and see a page not controlled by the owner of that domain. I guess a less responsible owner wouldn't have the server give this response code, but this seems to me like information I'd really like to know while I'm browsing. Should your browser alert you to this?

Monday, 4 November 2013

509 server response in Integrity / Scrutiny results

I love support calls where I learn something new.

I'd never seen a 509 server response before. 5xx errors mean that the server was unable to fulfil a valid request, and 509 means "Bandwidth Limit Exceeded" (it isn't used by all servers).

In this case it meant that Integrity was hitting the site too fast for the server's comfort. It was easily solved here by moving the 'Threads' slider over to the left, but if that hadn't fixed it, it's possible to throttle the crawler further by specifying a small delay between requests.
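A 'delay between requests' setting boils down to pacing: wait until at least the configured interval has passed since the previous request before sending the next one. A minimal Python sketch of that pacing logic (the class name is mine, and this is a simplification, not Integrity's actual crawler code):

```python
import time


class PoliteFetcher:
    """Wait at least `delay` seconds between successive requests,
    mirroring a crawler's 'delay between requests' preference.
    The clock and sleep functions are injectable so the pacing
    can be tested without real waiting."""

    def __init__(self, delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, or None

    def wait_turn(self):
        """Block until it's polite to issue the next request."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Calling `wait_turn()` before each request would keep the crawl under the server's bandwidth threshold, at the cost of a slower scan.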