
Wednesday, 16 February 2022

In simple terms, the best settings for Integrity and Scrutiny

The default settings for Integrity, Integrity Plus, Integrity Pro and Scrutiny have been tweaked over 15 years. Generally speaking, they will be the best settings and the most likely* to perform a successful, full and useful scan.

The very short version of this post is: go with the defaults, and only adjust them if you understand and want to use an additional feature, or if you have a problem that may be cured by making a change. Please contact support if you're unsure about anything. 


The rest of this post gives a very basic 'layman' explanation of the site-specific options and settings. 

In version 12 (in beta as I write this) these settings have been rearranged and grouped into a more logical order. They're listed below as they are grouped in version 12.


Options

These are optional features. In general, only enable them (or change the default) if you understand what they mean and are prepared to troubleshoot if the option causes unexpected results.


  • This page only:  Simple enough: sometimes you may want to scan a single page. If you want to scan an entire site, leave this switched off.
  • Check linked js and css files:  This will drill more deeply into the site. If you're looking for a straightforward link check, leave this off.
  • Check for broken images:  Finding broken images is probably as important as finding broken links, so leave this on.
  • Check lazyload images/load images:  It's possible that your site uses lazyloading of images. If you know that it does then you may want to enable this. NB there is no standard for lazyloading images. Integrity will try to find the image urls in a couple of likely places, but this option can lead to false positives or duplication. Be prepared for troubleshooting.
  • Check anchors:  An anchor link takes you, not just to a page, but to a specific point on a page. With this option on, Integrity will check that the anchor point exists on the target page. If you know that your site uses this type of link, and you want to test them, enable this option.
  • Flag missing link url:  Sometimes during development, you'll create links with empty targets, or use # as a placeholder. This is a way to find those 'unfinished' links.



Advanced

Here we have the controls that may sometimes need to be altered to 'tune' Integrity to your site. In general, the default values should work; only change them if you have a reason.


  • User-agent string:  The user-agent string is how Integrity identifies itself, and the default value should work almost all of the time. The default is now the string of a real browser, which should be fine. (Occasionally a site will give different pages to a mobile browser, a desktop browser, or Googlebot.)
  • Accept language:  Can be used to check specific language pages of a multilingual site (the value is a standard language code such as en-GB or fr-FR).
  • Timeouts and delays:  Use the defaults. If you have problems with timeouts or certain errors then it may be necessary to adjust these settings.


Site characteristics 

Here are a few settings which may need to be adjusted for your particular site. Again, the defaults should be fine, but refer to this guide or ask for help if you have problems.


  • Ignore querystrings: This is the option most likely to need changing to suit your particular site. The default is off, and that'll probably be fine. However, sometimes a session id or similar is included in the querystring (the part after the ? in a url), and this can cause loops or duplication; in that case the setting should be on. On the flip side, sometimes important information is carried in the querystring, such as a page id, and for those sites you definitely need the setting to be off (see the example below this list).
  • Page urls have no file extension (more recently renamed 'Starting url has page name without file extension'): The explanation of what this box actually does is lengthy, and it's more than likely that you don't need it switched on. Where it is required, Integrity should recognise this, ask you an explicit question and set the box accordingly.
  • Ignore trailing slash: It's very unlikely that this needs to be switched off (the default is on). It has become less important in version 12 because its inner workings are slightly different.
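
To illustrate the querystring point (these addresses are made up): if the querystring only carries a session id, these two urls lead to the same page, and ignoring querystrings avoids crawling it twice:

https://example.com/products?sessionid=abc123
https://example.com/products?sessionid=xyz789

But on a site built like this, the querystring identifies the page itself, so it must not be ignored:

https://example.com/index.php?page=42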


If you're using version 10 or earlier, then you'll have the option to Check links on error pages. I strongly advise leaving this switched off, as it's pretty likely to cause problems or confusion. Version 12 doesn't have the option.

If you have a custom error page (which is likely to be a single page) and want to test the links on it, then test it separately by setting up a single-page configuration pointing to a non-existent url (such as mysite.com/xyzabc).
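
If you want to confirm what your server actually returns for a non-existent url, a quick check from Terminal will tell you (using the same made-up address):

# ask for the response headers only; the first line shows the status code
curl -sI https://mysite.com/xyzabc | head -n 1

A well-configured server answers 404 here rather than 200.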


Rules

If you have a specific problem, then you can sometimes cure that with a targeted 'ignore' or 'don't check' rule.  

The other very useful application of rules is to ignore an entire section of a site, or to limit the crawl to a specific part of it.




*It may not seem that way if yours is one of the sites that needs a change from the default settings, but that's probability for you. In practice, only the querystring setting is an unpredictable 'it depends' setting. Go with the default, and contact support if you need help.




Monday, 31 January 2022

Locate : an overlooked feature in Integrity and Scrutiny

The Locate feature is easily overlooked in Integrity and Scrutiny. It answers a common support question: "Integrity is reporting a broken link on a page but I don't know where to find that page" or "that page shouldn't exist any more".

It shows you, as a user, how to find the link in question and the page it's on, as a trail of clickable hyperlinks.


There may be more than one route to click through from the home page (or your starting url) to the link in question, but this tool should show the shortest.

It's important to distinguish here between link urls, and a single instance of a link.

In this example, I've selected a link url which has tested as good. There may be links with that target url on multiple pages (or multiple links on the same page). For example a link to the home page probably exists on every page of a site, maybe in more than one place on a page.


A context menu triggered by a right-click or control-click on that url row will only show options relevant to that url, or to the page at that url. To access the Locate feature, you need to right-click one of the link instances, which are revealed when you expand the row.

The By page and By status views both show link instances when the page / status is expanded, so Locate can be accessed in those views after expanding a page or status. All Links is a flat table showing link instances, so Locate will appear when you right-click any row. In all of these cases it's important to select only a single row; Locate can't work on multiple selected items.

Similarly, if you open the link inspector, it concerns a link url and the status of that url when tested, and it has a table listing the instances or occurrences of that url in links. Before using the Locate button (or context menu in that table) it's important to select one of the instances in the table.



Recent versions of Integrity and Scrutiny may have Locate in these context menus but it may not appear to do anything; this is fixed in Integrity and Scrutiny v10.4.12 and later.


Friday, 20 August 2021

Many 'soft 404s' found on the KFC website

One way to 'fix' your bad links is to make your server send a 200 code with your custom error page.


Google frowns upon it as "bad practice" and so do I. It makes bad links difficult to find using a link checker. Even if a page says "File not found", no crawling tool will understand that; it will see the 200 and move on. Maybe this is why the KFC UK website has so many of them.
The way that Integrity and Scrutiny handle this is to look for specified text on the page and in the title. Obviously the list can't be pre-filled with all of the possible terms which might appear on anyone's custom error page, so if you know that you use soft 404s on your site, you must give Integrity / Scrutiny a term that's likely to appear on the error page and unlikely to appear anywhere else. Fortunately with this site, WHOOPS! fits the bill. The switch for the soft 404 search and the list of search terms is in Preferences (above).
And here we see them being reported with the status 'soft 404' in place of the actual (incorrect) 200 status returned by the server.
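
The same test can be sketched in a couple of lines of Terminal, if you'd like to check a suspect page by hand (the url is a stand-in; the apps do this for every page during a scan):

# fetch the page body and capture the status code
status=$(curl -s -o /tmp/page.html -w '%{http_code}' https://example.com/some-page)
# a 200 combined with the error-page wording suggests a soft 404
if [ "$status" = "200" ] && grep -qi 'WHOOPS!' /tmp/page.html; then echo "soft 404"; fi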

[update 29 Nov 2021] To be fair to KFC, that long list of bad links is now mostly cleared up, although the soft 404 problem still exists, which isn't going to make it easy to find bad links:


If anyone from KFC reads this, we offer a subscription-based monthly website report and would be very happy to include the 'soft 404' check at no extra charge.  

Monday, 6 April 2020

Checking your browser's bookmarks

I had not considered this until someone recently asked about using Integrity to do the job.



Yes, in principle you can export your bookmarks from Safari or Firefox as a .html file and ask Integrity, Integrity Plus, Integrity Pro or Scrutiny to check all of the links it contains.

The only issue is that the free Integrity, and the App Store versions of Integrity Plus and Integrity Pro, are 'sandboxed', meaning that for security reasons they generally only have access to local files within their own 'container'. Apple insists on this measure for apps distributed via their App Store.

For this reason, those versions of those apps will not be able to fully crawl a website stored locally (some people like to do this, although there are some advantages to crawling via a web server, even the apache server included with macOS).

However, here we're only talking about parsing a single html file for links, and testing those.

A sandboxed app can access any file that you have chosen via an open or save dialog.

So all you need to do is use File > Open to choose your bookmarks.html file, rather than typing its name or dragging it to the starting url field. (Remember to use 'this page only' to ensure that you only check the links in the bookmarks file and the app doesn't try to follow all of them.)
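
Incidentally, if you're curious about what's involved, the job can be roughly sketched in Terminal: pull the link urls out of the exported file and test each one. This assumes bookmarks.html is in the current folder, and it's a crude approximation of what the apps do:

# extract the href values from the exported bookmarks file
grep -Eio 'href="[^"]+"' bookmarks.html | cut -d'"' -f2 |
while read -r url; do
  # print the status code next to each url (HEAD request, 20 second timeout)
  echo "$(curl -s -o /dev/null -I -m 20 -w '%{http_code}' "$url") $url"
done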
I have bookmarks in Safari going back many years (nearly 2,000 of them, apparently). There are so many pages there I'd forgotten about, and some that clearly no longer exist or have moved.

Tuesday, 10 March 2020

Changes to nofollow links : sponsored and ugc attributes : how to check your links

Google announced changes last year to the way they'd like publishers to mark nofollow links.

The rel attribute can also contain 'sponsored' or 'ugc' to indicate paid links and user-generated content. Until now, nofollow links have not been used for indexing or ranking purposes, but this is changing: Google will no longer treat nofollow as a strict instruction not to follow or index.

This article on moz.com lists the changes and how these affect you.

As from version 9.5.6 of Integrity (including Integrity Plus and Pro) and version 9.5.7 of Scrutiny, these apps allow you to see and sort your links according to these attributes.

There was already a column in the links views for 'rel', which displayed the content of the rel attribute, and a column for 'nofollow', which displayed 'yes' or 'no' as appropriate. Now there are new columns for 'sponsored' and 'ugc' (also displaying yes/no for easy sorting). Many of the views have a column selector. Where these columns are visible, they're sortable and they'll be included in csv exports.
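
For reference, this is the kind of markup involved, and a quick way to spot such links in a saved page from Terminal (page.html is hypothetical):

# list anchors whose rel attribute contains nofollow, sponsored or ugc
grep -Eo '<a [^>]*rel="[^"]*(nofollow|sponsored|ugc)[^"]*"[^>]*>' page.html
# a typical match looks like: <a href="https://example.com/ad" rel="sponsored">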


Sunday, 7 July 2019

Press Release - Integrity Pro v9 released

Integrity Pro version 9 is now fully released. It is a free update for existing licence holders.

The major new features are as follows:
  • Improved Link and Page inspectors. New tabs on the link inspector show all of a url's redirects and any warnings that were logged during the scan.

  • Warnings. A variety of things may now be logged during the scan. For example, a redirect chain or certain problems discovered with the html. If there are any such issues, they'll be highlighted in orange in the links views, and the details will be listed on the new Warnings tab of the Link Inspector.
  • Rechecking. This is an important part of your workflow: check, fix, re-check. You may have 'fixed' a link by removing it from a page, or by editing the target url. In these cases, simply re-checking the url that Integrity reported as bad will not help; it's necessary to re-check the page that the link appeared on. Now you can ask Integrity to re-check a url, or the page that the url appeared on. In either case, you can select multiple items before choosing the re-check command.
  • Internal changes. There are some important changes to the internal flow which will eliminate certain false positives.


More general information about Integrity Pro is here:
https://peacockmedia.software/mac/integrity-pro/

Wednesday, 26 September 2018

http requests - they're not all the same

This is the answer to a question that I was asked yesterday. I thought that the discussion was such an interesting one that I'd post the reply publicly here.

A common perception is that a request for a web page is simply a request. Why might a server give different responses to different clients? To be specific, why might Integrity / Scrutiny receive one response when testing a url, yet a browser sees something different? What are the differences?


user-agent string

This is sent with a request to identify "who's asking". Abuses of the user-agent string by servers range from sending a legitimate-looking response to search engine bots and dodgy content to browsers, through to refusing to respond to requests that don't appear to come from browsers. Integrity and Scrutiny are good citizens and by default have their own standards-compliant user-agent string. If it's necessary for testing purposes, this can be changed to that of a browser or even a search engine bot.
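
You can see the effect for yourself with curl, whose -A flag sets the user-agent string (both strings here are illustrative):

# request the same page as a generic crawler, then as a desktop browser
curl -sI -A "MyCrawler/1.0" https://example.com | head -n 1
curl -sI -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36" https://example.com | head -n 1
# some servers will answer these two requests quite differently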

header fields

A request contains a bunch of header fields. These are specifically designed to allow a server to tailor its content to the client. There are loads of possible ones, and you can invent custom ones; some are mandatory, many optional. By default, Scrutiny includes the ones that the common browsers include, with similar settings. If your own site requires a particular unusual or custom header field / value to be present, you can add it (in Scrutiny's 'Advanced settings').
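
Again curl is handy for experimenting; -H adds or overrides a header field (these values are examples, not anything a particular site requires):

# request German-language content, plus a hypothetical custom header
curl -s -H "Accept-Language: de-DE" -H "X-Custom-Token: abc123" https://example.com -o /dev/null -w '%{http_code}\n'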

cookies and javascript

Browsers have these things enabled by default; they're just part of our online lives now (though accessibility standards say that sites should be usable without them), but they're options in Scrutiny and both are deliberately off by default. I'm discovering more and more sites which test for cookies being enabled in the browser (with a handshake-type thing) and refuse to serve if not. There are a few sites which refuse to work properly without javascript being enabled in the browser. This is a terrible practice, but it does happen, thankfully rarely. Switch cookies on in Scrutiny if you need to, but always leave the javascript option *off* unless your site does this when you switch js off in your browser:
[Image: a blank web page with the message "This site requires Javascript to work"]
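
The cookie handshake can be imitated with curl's cookie jar, as a rough sketch of what 'cookies on' means (example.com standing in for a real site):

# the first request stores any cookies the server sets
curl -s -c /tmp/cookies.txt https://example.com -o /dev/null
# the second sends them back, as a browser would on subsequent requests
curl -s -b /tmp/cookies.txt https://example.com -o /dev/null -w '%{http_code}\n'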


GET and HEAD

There are a couple of other things under Scrutiny's Preferences > Links > Advanced (and Integrity's Preferences > Advanced): 'Use GET for all connections' and 'Load data for all connections'. Both will probably be off by default.
[Screenshot: Scrutiny's preferences, showing 'always use GET' and 'load data for all connections']

A browser will generally use GET when making a request (unless you're sending a form) and it will probably load all of the data that is returned. For efficiency, a webcrawler can use the HEAD method when testing external links (because it doesn't need the actual content of the page, only the status code). If it does use GET (for internal connections where it does want the content, or if you have 'always use GET' switched on) and it doesn't need the page content, it can cancel the request after getting the status code. This very rarely causes a problem, but I have had one or two cases where a large number of cancelled requests to the same server caused problems.

'Use GET for all connections' is unlikely to make any visible difference when scanning a site. Using the HEAD method (which by all standards should work) may not always work, but if a link returns any kind of error after using the HEAD method, Integrity / Scrutiny tests the same url again using GET.
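
The difference between the two methods is easy to demonstrate (example.com again):

# HEAD asks for the status line and headers only - no body
curl -sI https://example.com | head -n 1
# GET fetches the whole page; -w reports how much was downloaded
curl -s https://example.com -o /dev/null -w '%{http_code}, %{size_download} bytes downloaded\n'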

Other considerations

Outside of the particulars of the http request itself are a couple of things that may also cause different responses to be returned to a webcrawler and a browser. 

One is the frequency of the requests. Integrity and Scrutiny will send many more requests in a given space of time than a browser, probably many at the same time (depending on your settings). This is one of the factors involved in LinkedIn's infamous 999 response code. 

The other is authentication. A frequently-asked question is why a social media link returns a response code such as 'forbidden' when the link works fine in a browser. Having cookies switched on (see above) may resolve this, but we forget that when we visit social media sites we have logged in at some point in the past, and our browser remembers who we are. It may be necessary to be authenticated as a genuine user of a site when viewing a page that may appear 'public'. Scrutiny and Webscraper allow authentication; the Integrity family doesn't.

I love this subject. Comments and discussion are very welcome.

Monday, 27 August 2018

Migrating website data when switching from app store version of Integrity Plus / Pro to web download version

There are reasons why you might want to start using the web download version of Integrity Plus or Integrity Pro after buying the App Store version.

(We're happy to provide a key, given evidence of the App Store purchase, as long as it's for the same user.)

The App Store version is necessarily 'sandboxed', a security measure imposed by Apple for apps sold on their Store. However, this kills certain features, such as the ability to crawl a site stored as local files. So the web download version remains un-sandboxed (it pre-dates sandboxing).

The sandboxed and un-sandboxed apps store their data in different places. When switching from the web download version to the app store version, the migration is taken care of by the system (this is the direction Apple wants you to go, so they make it easy; invisible, in fact).

The app doesn't (yet) detect and automatically handle the reverse situation. But it's possible to do this manually.

Option 1. Integrity Plus / Pro have the option to export / import your websites. 
This requires you to export while you have the app store version installed, and import after you've replaced it with the web download version.

Option 2. Use these Terminal commands. 

They check for and remove any preference file which will be present if you've already run the web download version, then copy the data from the sandbox 'container' to the location used by the web download version.

(This first set of instructions is for Integrity Plus. For Integrity Pro, scroll down)

First make sure Integrity Plus isn't running.

Then enter this into Terminal:

rm ~/Library/Preferences/com.peacockmedia.Integrity-plus.plist

(if it says there's no file there, don't worry.) Then this:

cp ~/Library/Containers/com.peacockmedia.integrityPlus/Data/Library/Preferences/com.peacockmedia.integrityPlus.plist ~/Library/Preferences/com.peacockmedia.Integrity-plus.plist

Important: now log out of the system and log back in. The system does some wicked things with caching these files. It's sometimes possible to make the change 'stick' using another Terminal command (killall cfprefsd is the usual suggestion), but I've not found that as reliable for these purposes as logging out and in.

Now start the web download Integrity Plus and see whether your data appears.

Here are the corresponding instructions for Integrity Pro

Make sure Integrity Pro isn't running

First enter into Terminal:

rm ~/Library/Preferences/com.peacockmedia.Integrity-pro.plist

(if it says there's no file there, don't worry.) Then this:

cp ~/Library/Containers/com.peacockmedia.Integrity-pro/Data/Library/Preferences/com.peacockmedia.Integrity-pro.plist ~/Library/Preferences/com.peacockmedia.Integrity-pro.plist

Important: log out of the system and back in.

Wednesday, 11 April 2018

Migrating from Integrity Plus to Integrity Pro

I'm gathering all of this information together so that it's all in one place.

Features

If you're an Integrity Plus user, the newer Integrity Pro offers:

Upgrade fee

If you're interested in upgrading from Plus to Pro then you'll only need to pay the difference between the full current price of the two apps. The Integrity Pro upgrade form is here.

Migrating your website configurations

This is very new; by the time you read this, the necessary feature should exist within the current versions of Integrity Plus and Pro.

Making sure you're on at least 8.0.8 (8.0.86) of the source application (Integrity Plus), you can export all of your website configurations in one batch, or selected ones individually. Save the resulting file somewhere you can easily find it.


Again making sure you're on at least 8.0.8 (8.0.86) of the destination application (Integrity Pro), use File > Open to open the file that you just saved. Integrity Pro should do the rest, and your websites should appear in the list on the left. If there are any problems, contact support.





Here are some screenshots showing Integrity Pro in action.





Saturday, 24 March 2018

Earlier in the year I made the business decision to sell our apps via the app store once again. It's working out OK; the store seems to have become much more user-friendly from the back end, and we seem to be reaching more people with the apps.

Integrity, Integrity Plus and now the new Integrity Pro are all available at the store for those who prefer to obtain their apps that way, and for those who might not have discovered them otherwise.

Here are the links:





You'll be wondering which one suits you best. There's a handy chart here, which shows a broad outline of features and prices.



Monday, 5 March 2018

Open Graph (og:), Twitter Card and more meta data being collected by Integrity Pro and Scrutiny

Yesterday I wrote about the development of Integrity Pro.

One of the enhancements to come in Scrutiny v8 is a load more data about your links and pages,  part of that is the collection of most of the meta data from your pages.

These are some screenshots showing how Integrity Pro displays meta data: as an additional table under SEO, and within the 'page inspector' which pops up when you click one of the pages in the 'links by page' view.




The reporting of og: and Twitter Card information has been on the Scrutiny enhancement list for a while, and will be in v8 as you see it here in these shots.

Integrity Pro, which also contains this functionality, is now out in full at version 8.0.x.