Thursday 31 October 2013

Scanning a specific language version of a website

I've just had a very good support question about testing a particular language version of a website.

I'm posting a generic version of the question and answer here for the benefit of those who pose the same question via Google.


We have regional sites which look for a country / language cookie to keep them in a certain locale. I would like to set up the crawl to stay in China or Japan but there needs to be a cookie to tell our website where the crawler is supposed to stay. Otherwise the site links you back to US/English.

Do Integrity or Scrutiny set or maintain cookies?


Cookies are disabled by default because a crawl can delete pages from the site if you're logged into a CMS whose controls appear as links. Cookies are system-wide, so Scrutiny will pick up the authentication from them, and testing a link which is really a delete button effectively operates that button.

If you're sure that this won't happen then here's how you enable cookies for Scrutiny crawls:

In your settings tab, press Advanced and then tick 'Attempt to authenticate'. Leave the username and password fields blank, and make sure that when you visit the site in Safari you're logged out (i.e. seeing the site as a public user).

Then make sure that Safari is seeing the site in the language that you want to check (i.e. so that Safari has that setting in its cookies). Scrutiny should then pick up and send that cookie.
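
Scrutiny handles all of this for you, but if it helps to see what 'sending the cookie' means at the HTTP level, here's a minimal Python sketch. The cookie name 'locale', the value 'ja-JP' and the example.com address are all made up for illustration; your site will use its own cookie name and values, which you can look up in Safari.

```python
from urllib.request import Request


def build_localised_request(url, cookies):
    """Return a Request that carries the given cookies in a Cookie header."""
    # A Cookie header is a list of name=value pairs separated by '; '.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header})


# Hypothetical cookie name and value, purely for illustration.
request = build_localised_request("https://www.example.com/", {"locale": "ja-JP"})
print(request.get_header("Cookie"))
```

Nothing is fetched here; the sketch only builds the request so that you can see the header a crawler would send along with each page request.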

Integrity (being free) offers link checking with fewer advanced options, and doesn't allow authentication or handle cookies.

Friday 25 October 2013

Scrutiny licence half price today

Just a quick 'heads up' - webmaster tool suite Scrutiny is on 50% offer today - see the ad on their home page.

Tuesday 15 October 2013

Amazing tech for everyday tasks

It's little things that both amuse and amaze me about today's technology.

It tickled me when Scrutiny started up and began working exactly in sync with the sixth pip of the time signal on the radio. (It's set to scan one of my sites at midday on a Tuesday.)

Maybe younger people take this kind of thing for granted but it's taken some amazing technologies to make that happen so accurately - my processor can perform several thousand million operations a second, my system is regularly updating its time from a time server. Rather ironically I wouldn't have noticed this effect if I'd been listening to R4 on DAB rather than FM!

Wednesday 9 October 2013

Finding 'soft 404' internal and external links.

A 'soft 404' is a page which isn't the one that the user requested, but which returns a 200 response code. The page may say 'Page not found' or it may be a default page such as the home page or a special page set up for the purpose.

If the page doesn't state that the requested page hasn't been found then it's confusing for the visitor. And unless the page returns a 404 or 410 code, it's very difficult for a web crawler to recognise the link as broken.

Google doesn't like such pages - they don't want to index a page which isn't the expected page. They and other search engines are testing sites for soft 404s. It's best if your site returns a 404 or 410 code when a page isn't found.
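
To make the difference concrete, here's a rough Python sketch of the server side. The function name, the 'pages' dictionary and the wording of the message are all invented for illustration, and a real site would do this in its web server or CMS. The point is only that the friendly 'not found' text goes out with a genuine 404 status rather than a 200.

```python
from http import HTTPStatus


def respond(path, pages):
    """Return (status, body) for a requested path.

    'pages' is a hypothetical mapping of known paths to their content.
    """
    if path in pages:
        return HTTPStatus.OK, pages[path]
    # A helpful message for the visitor, but with a real 404 status so that
    # crawlers and search engines can tell that the page doesn't exist.
    return HTTPStatus.NOT_FOUND, "Sorry, the page you are looking for does not exist."


status, body = respond("/no-such-page", {"/": "Home page"})
print(int(status), body)
```

A soft 404 is what you'd get if that last return statement used HTTPStatus.OK instead.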

However, if your site does return soft 404s or if you want to find your external links that link to soft 404s, then from version 4.5, Scrutiny and Integrity can try to spot and highlight them.

There's a new section in Preferences:

You can switch the feature off (in Preferences) if you have a large site, want best performance and don't consider this feature important.

If your site does generate such pages and you'd like them marked as 404 rather than 200, then find a term in the page content itself (such as 'sorry, the page you are looking for does not exist') and add this phrase to the list on a new line. Alternatively, if your soft 404 page has a specific url, you can add all or part of this url to the list.

To find external pages which may be soft 404s, the box in Preferences contains a list of suspicious terms. The list may be increased in future versions, but you can add to it yourself too.

Pages which look like soft 404s (i.e. they return a 2xx code but contain one of the terms in their url or content) will have a status of 'soft 404' and will be marked red as per regular 404 pages.
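
For the curious, the rule described above can be sketched in a few lines of Python. This is not Scrutiny's actual code, just an illustration of the idea. The function name and the example terms are made up, and matching without regard to upper or lower case is my assumption here.

```python
def classify(status_code, url, content, terms):
    """Return 'soft 404' for a 2xx response whose url or content contains
    one of the listed terms; otherwise return the status code as text."""
    if 200 <= status_code < 300:
        url_lower = url.lower()
        content_lower = content.lower()
        for term in terms:
            needle = term.strip().lower()  # assumption: case-insensitive match
            if needle and (needle in url_lower or needle in content_lower):
                return "soft 404"
    return str(status_code)


# Hypothetical list, one entry per line as in the Preferences box.
terms = ["page not found", "/not-found"]

print(classify(200, "https://www.example.com/old", "<h1>Page Not Found</h1>", terms))
print(classify(200, "https://www.example.com/", "Welcome", terms))
```

The first call reports 'soft 404' because the content contains a listed term; the second reports a plain 200. A page which already returns a real 404 is left alone, since the check only applies to 2xx responses.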