
Wednesday, 16 February 2022

In simple terms, the best settings for Integrity and Scrutiny

The default settings for Integrity, Integrity Plus, Integrity Pro and Scrutiny have been tweaked over 15 years. Generally speaking, they will be the best settings and the most likely* to perform a successful, full  and useful scan.

The very short version of this post is: go with the defaults, and only adjust them if you understand and want to use an additional feature, or if you have a problem that may be cured by making a change. Please contact support if you're unsure about anything. 


The rest of this post gives a very basic 'layman' explanation of the site-specific options and settings. 

In version 12 (in beta as I write this) these settings have been rearranged and grouped into a more logical order. They're listed below as they are grouped in version 12.


Options

These are optional features. In general, only enable them (or change the default) if you understand what they mean and are prepared to troubleshoot if the option causes unexpected results.


  • This page only:  Simple - sometimes you may want to scan a single page. If you want to scan an entire site, leave this switched off.
  • Check linked js and css files:  This will drill more deeply into the site. If you're looking for a straightforward link check, leave this off.
  • Check for broken images:  Finding broken images is probably as important as finding broken links, so leave this on.
  • Check lazyload images/load images:  It's possible that your site uses lazyloading of images. If you know that it does then you may want to enable this. NB there is no standard for lazyloading images. Integrity will try to find the image urls in a couple of likely places, but this option can lead to false positives or duplication. Be prepared for troubleshooting.
  • Check anchors:  An anchor link takes you, not just to a page, but to a specific point on a page. With this option on, Integrity will check that the anchor point exists on the target page. If you know that your site uses this type of link, and you want to test them, enable this option (a simple sketch of the idea follows this list).
  • Flag missing link url:  Sometimes during development, you'll create links with empty targets, or use # as a placeholder. This is a way to find those 'unfinished' links.
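
As an illustration of the idea behind the anchor check, here's a minimal sketch in Swift. It isn't Integrity's actual code; it just shows the principle of fetching the target page and looking for a matching id (or old-style name) attribute.

```swift
import Foundation

// Given a link such as https://example.com/page.html#pricing,
// check whether the target page contains an element with that id
// (or an old-style <a name="..."> anchor).
func anchorExists(in html: String, fragment: String) -> Bool {
    // A real crawler would parse the HTML properly; a simple
    // substring test is enough to illustrate the idea.
    return html.contains("id=\"\(fragment)\"")
        || html.contains("name=\"\(fragment)\"")
}

let pageHTML = "<h2 id=\"pricing\">Pricing</h2>"
print(anchorExists(in: pageHTML, fragment: "pricing"))   // true
print(anchorExists(in: pageHTML, fragment: "contact"))   // false
```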



Advanced

Here we have the controls that may sometimes need to be altered to 'tune' Integrity to your site. In general, the default values should work; only change them if you have a reason.


  • User-agent string:  The default value should work almost all of the time. The user-agent string is how Integrity identifies itself, and if it's set to that of a real browser (which is now the default) then that should be fine. (Occasionally a site will serve different pages to a mobile browser, a desktop browser or Googlebot.) The sketch after this list shows what these headers look like in a request.
  • Accept language:  Can be used to check specific language pages of a multilingual site.
  • Timeouts and delays:  Use the defaults. If you have problems with timeouts or certain errors then it may be necessary to adjust these settings.
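
For anyone curious what these settings amount to at the HTTP level, here's a minimal Swift sketch of a request carrying a browser-style user-agent string and an Accept-Language header. The header values and timeout are examples for illustration, not the apps' actual defaults.

```swift
import Foundation

var request = URLRequest(url: URL(string: "https://example.com/")!)
// Identify ourselves; some servers vary their response by user-agent.
request.setValue("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15",
                 forHTTPHeaderField: "User-Agent")
// Ask for a specific language version of a multilingual page.
request.setValue("fr-FR,fr;q=0.9", forHTTPHeaderField: "Accept-Language")
// A generous timeout; the apps let you tune this per site.
request.timeoutInterval = 30

URLSession.shared.dataTask(with: request) { data, response, error in
    if let http = response as? HTTPURLResponse {
        print("Status:", http.statusCode)
    }
}.resume()
```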


Site characteristics 

Here are a few settings which may need to be adjusted for your particular site. Again, the defaults should be fine, but refer to this guide or ask for help if you have problems.


  • Ignore querystrings: This is the option that is most likely to need changing to suit your particular site. The default is off, and that'll probably be fine. However, sometimes a session id or similar is included in the querystring (the part after the ? in a url), and these can cause loops or duplications; in that case the setting should be on. On the flip side, sometimes important information is carried in the querystring, such as a page id, and for those sites you definitely need the setting to be off. (The sketch after this list illustrates the difference.)
  • Page urls have no file extension (more recently renamed 'Starting url has page name without file extension'): The explanation of what this box actually does is lengthy and it's more than likely that you don't need it switched on.  In the case where it's required, Integrity should recognise this and ask you an explicit question, and set this box accordingly.
  • Ignore trailing slash: It's very unlikely that this needs to be switched off (default is on). It has become less important in version 12 because its inner workings are slightly different.
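
To make the querystring option concrete, here's a minimal Swift sketch of what 'ignoring' a querystring means: two urls that differ only after the ? are treated as the same page. This is an illustration of the idea, not the apps' internal code.

```swift
import Foundation

// With 'ignore querystrings' on, a crawler treats these as one page:
//   https://example.com/products?sessionid=abc123
//   https://example.com/products?sessionid=xyz789
// With it off, they are two distinct urls (important if the
// querystring carries a page id, e.g. ?page=42).
func canonicalURL(_ url: URL, ignoringQuerystring: Bool) -> URL {
    guard ignoringQuerystring,
          var components = URLComponents(url: url, resolvingAgainstBaseURL: false)
    else { return url }
    components.query = nil   // drop everything after the ?
    return components.url ?? url
}

let a = URL(string: "https://example.com/products?sessionid=abc123")!
print(canonicalURL(a, ignoringQuerystring: true))   // https://example.com/products
```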


If you're using version 10 or earlier, then you'll have the option to Check links on error pages.  I strongly advise leaving this switched off, as it's pretty likely to cause problems or confusion. v12 doesn't have the option. 

If you have a custom error page (which is likely to be one page) and want to test the links on it, then test it separately by setting up a single-page configuration pointing to a non-existent url (such as mysite.com/xyzabc).


Rules

If you have a specific problem, then you can sometimes cure that with a targeted 'ignore' or 'don't check' rule.  

Rules are also very useful for ignoring an entire section of a site, or for limiting the crawl to a specific part of it.




*It may not seem that way if yours is one of the sites that needs a change from default settings, but that's probability for you.  In practice, only the querystring setting is an unpredictable 'it depends' setting. Go with the default, contact support if you need help.




Monday, 31 January 2022

Locate : an overlooked feature in Integrity and Scrutiny

The Locate feature is easy to overlook in Integrity and Scrutiny. It answers the common questions, "Integrity is reporting a broken link on a page but I don't know where to find that page" and "that page shouldn't exist any more".

It tells you, as a user, how to find the link in question and the page it's on. These hyperlinks are clickable.


There may be more than one route to click through from the home page (or your starting url) to the link in question, but this tool should show the shortest.

It's important to distinguish here between link urls, and a single instance of a link.

In this example, I've selected a link url which has tested as good. There may be links with that target url on multiple pages (or multiple links on the same page). For example a link to the home page probably exists on every page of a site, maybe in more than one place on a page.


A context menu* triggered by a right-click or control-click on that url row will only show options that are relevant to that url, or the page at that url. In order to access the Locate feature, you need to right-click one of the link instances, which are revealed when you expand the row.

The By page and By status views both show link instances when the page / status is expanded, so Locate can be accessed in those views after expanding a page/status. All Links is a flat table showing link instances, so Locate will appear when you right-click any row. In all of these cases it's important to select only a single row; Locate can't work on multiple selected items.

Similarly, if you open the link inspector, it concerns a link url and the status of that url when tested, and it has a table listing the instances or occurrences of that url in links. Before using the Locate button (or context menu in that table) it's important to select one of the instances in the table.



Recent versions of Integrity and Scrutiny may have Locate in these context menus but it may not appear to do anything. This is fixed in Integrity and Scrutiny v10.4.12+


Friday, 27 August 2021

The future for Integrity and Scrutiny

[update Feb 22 2022]: Integrity, Integrity Plus and Pro v12 are out of beta and available here.

[post originally written August 2021, updated in the meantime.]

It feels as though the various flavours of Integrity and Scrutiny have reached a plateau; they do what they do, and judging by their popularity, they're doing it well (all comments welcome).

That's not to say that they're dormant. Far from it. You can see from the release notes that they've all received frequent updates. But these now tend to be improvements and updates rather than new features.

The biggest news recently has been the HTML validation, and work on that will continue.

Work has already begun on v11 of Integrity and Scrutiny, and it'll necessarily be a deep rewrite of the engine. Which will of course be called the v12 engine, because who's heard of a v11 engine?!

Futureproofing is needed. Partly to keep up with changes in the MacOS system, partly to revise the internal structure of the data and partly to replace some tired stuff with newer stuff, for example our current 'sitesucker-like' archiving system.  

One feature of Integrity and Scrutiny that has been a bit slack is the archiving. Originally this was simply a dump of the html files during the crawl. It developed a bit, but the archiving and processing in Webarch and Website Watchman have left Integrity and Scrutiny behind, so Integrity and Scrutiny will be brought up to scratch with Webarch-style archiving.

There are long-standing issues that need deeper rewrites in order to fix properly. And parts of the interface that could do with a facelift, particularly Scrutiny's website / config management screen.

On the business front, it's more than likely that there will be a price increase, but as usual, no upgrade fee for licence holders of v7 or above. (hint: now is a very good time to buy!)


[Update 28 Nov 2021]

I've just posted a video showing the new Integrity Pro in action. If you use Integrity Pro, this won't *look* tremendously different. The changes are as outlined above; much is under-the-hood, for efficiency or just to keep up-to-date with the changing system and web standards. There are one or two important features missing from the interface in this video.


Friday, 20 August 2021

Many 'soft 404s' found on the KFC website

One way to 'fix' your bad links is to make your server send a 200 code with your custom error page.


Google frowns upon it as "bad practice" and so do I. It makes bad links difficult to find using a link checker. Even if a page says "File not found", a crawling tool won't understand that; it will see the 200 and move on. Maybe this is why the KFC UK website has so many of them.
The way that Integrity and Scrutiny handle this is to look for specified text on the page and in the title. Obviously it can't be pre-filled with all of the possible terms which might appear on anyone's custom error page, so if you know that you use soft 404s on your site, you must give Integrity / Scrutiny  a term that's likely to appear on the error page and that's unlikely to appear anywhere else. Fortunately with this site, WHOOPS!  fits the bill. The switch for the soft 404 search and the list of search terms is in Preferences (above).
And here we see them being reported with the status 'soft 404' in place of the actual (incorrect) 200 status returned by the server.
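
The principle is simple enough to sketch. Here's a hedged Swift illustration (not the apps' actual code): the server says 200, but the page body or title contains one of your configured soft-404 terms, so the link is flagged.

```swift
import Foundation

// Terms you'd configure because they appear on *your* error page
// and are unlikely to appear anywhere else (e.g. "WHOOPS!").
let soft404Terms = ["WHOOPS!", "Page not found"]

func status(for httpStatus: Int, pageText: String) -> String {
    if httpStatus == 200,
       soft404Terms.contains(where: { pageText.localizedCaseInsensitiveContains($0) }) {
        return "soft 404"   // reported in place of the misleading 200
    }
    return "\(httpStatus)"
}

print(status(for: 200, pageText: "<title>WHOOPS! We can't find that</title>"))  // soft 404
print(status(for: 200, pageText: "<title>Original Recipe</title>"))             // 200
```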

[update 29 Nov 2021] To be fair to KFC, that long list of bad links is now mostly cleared up, although the soft 404 problem still exists, which isn't going to make it easy to find bad links:


If anyone from KFC reads this, we offer a subscription-based monthly website report and would be very happy to include the 'soft 404' check at no extra charge.  

Thursday, 10 June 2021

First Apple Silicon (ARM / M1) builds of our apps

Late to the party, I know. Being at the cutting edge has never been in our mission statement, and no problems have been reported with our apps running on Big Sur under Rosetta (Good job Apple).


This was the first attempt at building Integrity as a UB (which contains native binaries for ARM and Intel based Macs). Of course it was going to be more efficient but I didn't realise just how much faster this would run as a native app on one of the new machines.

The UB versions of Integrity, Integrity Plus, Integrity Pro and Scrutiny are all available on their home pages. They are still in testing and the previous known-stable versions are there too, but all seems fine so far. 


Monday, 6 April 2020

Checking your browser's bookmarks

I had not considered this until someone recently asked about using Integrity to do the job.



Yes, in principle you can export your bookmarks from Safari or Firefox as a .html file and ask Integrity, Integrity Plus, Pro and Scrutiny to check all of the links it contains.

The only issue is that the free Integrity, and App Store versions of Integrity Plus and Integrity Pro are 'sandboxed', meaning that for security reasons, they generally only have access to local files within their own 'container'. Apple insists on this measure for apps distributed via their App Store.

For this reason, those versions of those apps will not be able to fully crawl a website stored locally (some people like to do this, although there are some advantages if you crawl via a web server, even via the apache server included with MacOS).

However, here we're only talking about parsing a single html file for links, and testing those.

A sandboxed app can access any file that you have chosen via an open or save dialog.

So all you need to do is to use File > Open to choose your bookmarks.html file rather than typing its name or dragging it to the starting url field. (Remember 'check this page only' to ensure that you only check the links on the bookmarks file and the app doesn't try to follow all of them.)
I have bookmarks in Safari going back many years (nearly 2,000, apparently). There are so many pages there I'd forgotten about, and some that clearly no longer exist or have moved.

Tuesday, 10 March 2020

Changes to nofollow links : sponsored and ugc attributes : how to check your links

Google announced changes last year to the way they'd like publishers to mark nofollow links.

The rel attribute can also contain 'sponsored' or 'ugc' to indicate paid links and user-generated content. Until now, nofollow links have not been used for indexing or ranking purposes, but this is changing: Google will no longer treat them as a strict instruction not to follow or index.

This article on moz.com lists the changes and how these affect you.

As from version 9.5.6 of Integrity (including Integrity Plus and Pro) and version 9.5.7 of Scrutiny, these apps allow you to see and sort your links according to these attributes.

There was already a column in the links views for 'rel', which displayed the content of the rel attribute, and a column for 'nofollow', which displayed 'yes' or 'no' as appropriate. Now there are new columns for 'sponsored' and 'ugc' (also displaying yes/no for easy sorting). Many of the views have a 'column' selector. If visible, these columns will be sortable and they'll be included in csv exports.
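
For anyone wondering how the yes/no columns relate to the markup, here's a minimal Swift sketch of splitting a link's rel attribute into the nofollow / sponsored / ugc flags. It's illustrative only, not the apps' parser.

```swift
import Foundation

struct RelFlags {
    let nofollow: Bool
    let sponsored: Bool
    let ugc: Bool
}

// e.g. <a href="..." rel="nofollow sponsored">
func parseRel(_ rel: String) -> RelFlags {
    let tokens = Set(rel.lowercased().split(separator: " ").map(String.init))
    return RelFlags(nofollow: tokens.contains("nofollow"),
                    sponsored: tokens.contains("sponsored"),
                    ugc: tokens.contains("ugc"))
}

let flags = parseRel("nofollow sponsored")
print(flags.nofollow, flags.sponsored, flags.ugc)   // true true false
```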


Wednesday, 26 September 2018

http requests - they're not all the same

This is the answer to a question that I was asked yesterday. I thought that the discussion was such an interesting one that I'd post the reply publicly here.

A common perception is that a request for a web page is simply a request. Why might a server give different responses to different clients? To be specific, why might Integrity / Scrutiny receive one response when testing a url, yet a browser sees something different? What are the differences?


user-agent string

This is sent with a request to identify "who's asking". Abuses of the user-agent string by servers range from sending a legitimate-looking response to search engine bots and dodgy content to browsers, through to refusing to respond to requests that don't appear to come from browsers. Integrity and Scrutiny are good citizens and by default have their own standards-compliant user-agent string. If it's necessary for testing purposes, this can be changed to that of a browser or even a search engine bot.

header fields

A request contains a bunch of header fields. These are specifically designed to allow a server to tailor its content to the client. There are loads of possible ones and you can invent custom ones, some are mandatory, many optional. By default, Scrutiny includes the ones that the common browsers include, with similar settings.  If your own site requires a particular unusual or custom header field / value to be present, you can add them (in Scrutiny's 'Advanced settings'). 

cookies and javascript

Browsers have these things enabled by default; they're just part of our online lives now (though accessibility standards say that sites should be usable without them), but they're options in Scrutiny and deliberately both off by default. I'm discovering more and more sites which will test for cookies being enabled in the browser (with a handshake-type thing) and refuse to serve if not. There are a few sites which refuse to work properly without javascript being enabled in the browser. This is a terrible practice but it does happen, thankfully rarely. Switch cookies on in Scrutiny if you need to. But always leave the javascript option *off* unless your site does this when you switch js off in your browser:
[Image: a blank web page with the message "This site requires Javascript to work"]
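
If it helps to picture what 'cookies on' means for a crawler, here's a minimal Swift sketch using URLSession. It's an assumption for illustration, not how Scrutiny is actually built; the point is simply that the session accepts and returns cookies, so a handshake-style cookie check will pass.

```swift
import Foundation

// A session configuration that accepts and sends cookies,
// so sites that test for cookie support will serve content.
let config = URLSessionConfiguration.ephemeral
config.httpCookieAcceptPolicy = .always
config.httpShouldSetCookies = true

let session = URLSession(configuration: config)
let url = URL(string: "https://example.com/")!
session.dataTask(with: url) { _, response, _ in
    if let http = response as? HTTPURLResponse {
        print("Status with cookies enabled:", http.statusCode)
    }
}.resume()
```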


GET and HEAD

There are a couple of other things under Scrutiny's Preferences > Links > Advanced (and Integrity's Preferences > Advanced).   'Use GET for all connections' and 'Load data for all connections'. Both will probably be off by default. 
[Screenshot: a couple of Scrutiny's preferences, 'always use GET' and 'load data for all connections']

A browser will generally use GET when making a request (unless you're sending a form) and it will probably load all of the data that is returned. For efficiency, a webcrawler can use the HEAD method when testing external links (because it doesn't need the actual content of the page, only the status code). If it does use GET (for internal connections where it does want the content, or if you have 'always use GET' switched on) and it doesn't need the page content, it can cancel a request after getting the status code. This very rarely causes a problem, but I have had one or two cases where a large number of cancelled requests to the same server can cause problems.

'Use GET for all connections' is unlikely to make any visible difference when scanning a site. Using the HEAD method (which by all standards should work) may not always work, but if a link returns any kind of error after using the HEAD method, Integrity / Scrutiny tests the same url again using GET.
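
Here's a hedged Swift sketch of the HEAD-then-GET idea described above (an illustration of the technique, not the apps' code): try the cheap HEAD request first, and if the server objects, repeat with GET.

```swift
import Foundation

// Test a url with HEAD (no body needed, only the status code).
// If that errors, fall back to GET, which some servers handle better.
func check(_ url: URL, completion: @escaping (Int?) -> Void) {
    var head = URLRequest(url: url)
    head.httpMethod = "HEAD"
    URLSession.shared.dataTask(with: head) { _, response, _ in
        let status = (response as? HTTPURLResponse)?.statusCode
        if let status = status, status < 400 {
            completion(status)                 // HEAD was enough
        } else {
            var get = URLRequest(url: url)     // retry with GET
            get.httpMethod = "GET"
            URLSession.shared.dataTask(with: get) { _, response, _ in
                completion((response as? HTTPURLResponse)?.statusCode)
            }.resume()
        }
    }.resume()
}

check(URL(string: "https://example.com/page")!) { print("Status:", $0 ?? -1) }
```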

Other considerations

Outside of the particulars of the http request itself are a couple of things that may also cause different responses to be returned to a webcrawler and a browser. 

One is the frequency of the requests. Integrity and Scrutiny will send many more requests in a given space of time than a browser, probably many at the same time (depending on your settings). This is one of the factors involved in LinkedIn's infamous 999 response code. 

The other is authentication. A frequently-asked question is why a social media link returns a response code such as 'forbidden' when the link works fine in a browser. Having cookies switched on (see above) may resolve this, but we forget that when we visit social media sites we have logged in at some point in the past and our browser remembers who we are. It may be necessary to be authenticated as a genuine user of a site when viewing a page that may appear 'public'. Scrutiny and Webscraper allow authentication; the Integrity family doesn't.

I love this subject. Comments and discussion are very welcome.

Tuesday, 24 July 2018

Dark mode and Integrity / Integrity Plus / Integrity Pro / Scrutiny

I have to admit that I really love dark mode. It's very easy on the eye, and it jars a little when you have to look at a web page with a white background.

(Does anyone know whether it's possible for a website to detect a mac's dark mode setting and display a dark version using an appropriate css? Let me know.)

I did naively expect that the OS would simply draw all windows and controls in the dark colours. But it's up to each developer to build their apps under the new SDK, and carefully check for hard-coded colours and unsuitable images within the app.

We're just about there with Integrity / Integrity Plus / Integrity Pro / Scrutiny  and it's been a pleasure to do.  The 'dark mode enabled' version of all of this family of apps will be 8.1.5 (a minor point release, as there are very few functional changes).



[update 26 July 18 and again 8 Aug 18] Scrutiny 8.1.8 is available and looks great under dark mode. Obviously will only look dark on 10.14 with dark mode selected.

[update 30 July 18 and again 8 Aug 18] Integrity, Integrity Plus and Integrity Pro are also available with dark mode enabled. As above, 10.14 is required to see them in dark mode

Saturday, 24 March 2018

Earlier in the year I made the business decision to sell our apps via the app store once again. It's working out OK; the store seems to have become much more user-friendly from the back end, and we seem to be reaching more people with the apps.

Integrity, Integrity Plus and now the new Integrity Pro are all available at the store for those who prefer to obtain their apps that way, and for those who might not have discovered them otherwise.

Here are the links:





You'll be wondering which one suits you best. There's a handy chart here, which shows a broad outline of features and prices.



Sunday, 4 March 2018

Big news - Integrity Pro

Recently I wrote about the new releases of Integrity and Integrity Plus and how, despite being +2 major version numbers, they're unlikely to get many people this excited.

Something that I hope will light more fires is Integrity Pro! It's not intended as a replacement for Scrutiny, but it will bring the SEO and spellcheck functionality into Integrity and be priced midway between Integrity Plus and Scrutiny.

That makes it a natural and affordable step for Integrity Plus users, who will have a fair upgrade path. It will be free for any licensed Scrutiny user who wants the more classic interface and doesn't need the full weight of Scrutiny.

Integrity and Integrity Plus v8 are available now.

Update: Integrity Pro is available in full now.




Sunday, 18 February 2018

Integrity v8 engine under development

[29 Aug 2021  This post is now old, but left here as a record of past developments. This version of the engine has been improved but is still current. The next generation is under development.]

The beta of link checker Integrity v8 is imminent, with Integrity Plus following closely behind and then website scrutineer Scrutiny.

[update 23 Apr 2018 - version 8 is now the full release in Integrity, Integrity Plus and the new Integrity Pro]

Obviously we've been looking forward to version 8 so that we can say 'V8 engine'. The biggest changes in the upcoming versions of Integrity and Scrutiny are in the engine rather than interface, so there's not a huge amount to see on the surface but there is a lot to say about it.

Changes to the way that the data is structured in memory

For 11 years, the way that the data has been structured in Integrity and its siblings has meant that for certain views, tables had to be built following the scan. And rebuilt when certain actions were taken (rechecking, marking as fixed). Not a big deal with small sites, but heavy on time and resources for very large sites. It also meant that certain information wasn't readily available (number of internal backlinks, for example) and had to be looked up or calculated either on the fly or as a background task after the scan. And I have to admit to some cases of instability, unresponsiveness or unexpected side effects caused over the years by that stuff going on in the background.

So we've invested a lot of time in something that isn't a killer new feature, but will make things run more smoothly, efficiently and reliably. Initial informal tests show improvements in speed and efficiency (memory use).

More information about your links

There have been a number of requests to add more information about the properties of a link, eg hreflang and more of the 'rel' attributes (it has been displaying the 'nofollow' status but not much else). HTML5 allows many possible values in the 'rel' attribute, and if the one you want doesn't have its own yes/no column in the new apps then you can view the entire rel attribute. You can switch these columns on in any of the link tables as well as the one in the link inspector.


The information displayed in Scrutiny's SEO table will include more data about the pages, notably og: meta data which is important to a lot of users.

More information about redirects

Redirect information wasn't stored permanently, only the start and end of a redirect chain (the majority are a single step, but if there is more than one redirect, you'll want to know all the in-between details).


Better control over volume of requests

It's becoming increasingly necessary to apply some limits on the rate of the requests that Integrity / Scrutiny make.

Your server may keep responding no matter how hard you hit it. That's great; turn up those threads and watch Integrity / Scrutiny go through your site like a dose of salts.

But I'm seeing more cases of servers objecting to rapid requests from the same source. And the reason isn't always clear. They may return 429 (which means 'too many requests') or some other less obvious error response. Or they may just stop responding, perhaps a defensive measure.

If that's your site, you'll no longer have to turn down the threads and use trial and error with the delay field. You'll be able to simply enter a maximum number of requests per minute, while still using multiple threads for efficiency.

Generally a little more helpful

For a while, Integrity has automatically been re-trying certain bad links, in case the reason for the first unsuccessful attempt was temporary. v8 will build on this. For example, a 'too many http requests' error is often caused when a website won't work without the user having cookies enabled (this is true, I'll be happy to discuss further with examples), and in these cases the link will be re-tried with cookies enabled and this will usually then check out ok. In the case of 429 (too many requests), links will be re-tried after a pause (if a 'Retry-After' header is sent by the server, the recommended delay will be observed before retrying). The scan will be paused if too many are received and advice given to the user. On continue, the 429s will automatically be re-tried. Once again, these are enhancements that will be invisible to the user.
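
To illustrate the 429 handling described above, here's a minimal Swift sketch (an illustration of the technique, not the shipping code): read the Retry-After header if the server sends one, wait that long, then retry. The fallback delay and the retry call are assumptions for the example.

```swift
import Foundation

// If a server answers 429 (too many requests), honour its
// Retry-After header (in seconds) before trying the url again.
func retryDelay(for response: HTTPURLResponse) -> TimeInterval? {
    guard response.statusCode == 429 else { return nil }
    if let value = response.allHeaderFields["Retry-After"] as? String,
       let seconds = TimeInterval(value) {
        return seconds
    }
    return 60   // no header: fall back to a conservative pause
}

// Usage, inside a completion handler:
// if let delay = retryDelay(for: httpResponse) {
//     DispatchQueue.global().asyncAfter(deadline: .now() + delay) {
//         retry(url)   // hypothetical retry function
//     }
// }
```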

Why wasn't I informed about Integrity version 7?

Because it doesn't exist; Integrity will skip a version. v7 of Scrutiny was mostly about the interface. Integrity continued to use the classic interface and so remained v6. Integrity, Integrity Plus and Scrutiny (and a new app, Integrity Pro, you heard it here first) will use the v8 engine when it's ready and so for consistency all apps will be called v8.

Charge for an upgrade?  Price increase?

So in summary, as far as Scrutiny v8 and Integrity v8 are concerned, the changes won't be immediately visible, but they'll do what they do more efficiently, providing more information in a smaller memory footprint.

There are no immediate plans for price increases, or for upgrade fees. Prices are of course always under review, but any future increases will be for new users, upgrading to 8 from Scrutiny 7 or Integrity 6 will be free.

Update

Version 8 of Integrity, Integrity Plus and now also Integrity Pro are through beta and available in full.



Friday, 19 January 2018

Limiting Integrity / Scrutiny's scan to a particular section of the site or directory

This is a frequently-asked question and a subject that's been on the workbench this week.

Integrity / Integrity Plus and Scrutiny allow you to set up rules which limit the scan, or prevent checking of urls matching a pattern:


The rules dialog* will shortly look as above; the dialog will appear as a sheet attached to its main window. It'll allow 'urls which contain' / 'urls which don't contain', and the 'only follow' option is gone from the first menu, leaving 'ignore', 'do not check' and 'do not follow'.

('only follow' was confusing because if you have two such rules, then it doesn't make logical sense. 'do not follow urls that don't contain' does the same job and makes sense if you have more than one.)

It's important to say that it shouldn't be necessary to use a rule such as 'do not follow urls that don't contain /mac/scrutiny'  if you are starting at peacockmedia.software/mac/scrutiny  because Integrity and Scrutiny should only follow urls at or below the 'directory' that you start at.

It's also important to note that 'not following' isn't the same as 'ignoring'. If a link appears on a page then you will see it in the link check results regardless of whether it matches your 'follow' pattern. If you only want to see urls which follow a certain pattern, use an 'ignore' rule instead.

(That last statement applies to the link check results. The SEO table in Scrutiny and Sitemap table in Integrity Plus and Scrutiny should only show pages which meet your rules and appear at or below the directory you start in.)

Important note - if you're 'ignoring urls not containing'  then you probably won't be able to find broken images or linked files (unless their urls happen to match your rule's pattern). So if you have the 'images and linked files' option checked then you'll need to use a 'follow' rule rather than 'ignore'.

Protip - this has been possible for a while but not very well documented: you can use certain special characters. An asterisk to mean 'any number of any character' and a dollar sign at the end to mean 'appears at the end'. For example,  "don't check urls that contain .dmg$" will only match urls where .dmg appears at the end of the url. And peacockmedia.software/*/mac/scrutiny will match  peacockmedia.software/en/mac/scrutiny  and peacockmedia.software/fr/mac/scrutiny

Regex is not supported (yet) in these rules (though it is available in Scrutiny's site search). It's not necessary to type whole urls or to use the asterisk at the start or end of your term. By default it's a partial match, so "urls not containing /fr/mac/scrutiny" will work.
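
A hedged Swift sketch of how such wildcard patterns could be matched (not the apps' actual implementation): treat * as 'any run of characters', treat a trailing $ as 'must end here', and otherwise allow a partial match.

```swift
import Foundation

// '*' matches any run of characters; a trailing '$' anchors the
// pattern to the end of the url; otherwise a partial match is enough.
func urlMatches(_ url: String, rulePattern: String) -> Bool {
    var pattern = rulePattern
    let anchored = pattern.hasSuffix("$")
    if anchored { pattern.removeLast() }
    // Escape regex metacharacters, then turn '*' back into '.*'
    var regex = NSRegularExpression.escapedPattern(for: pattern)
        .replacingOccurrences(of: "\\*", with: ".*")
    if anchored { regex += "$" }
    return url.range(of: regex, options: .regularExpression) != nil
}

print(urlMatches("https://example.com/downloads/app.dmg", rulePattern: ".dmg$"))   // true
print(urlMatches("https://example.com/app.dmg.html", rulePattern: ".dmg$"))        // false
print(urlMatches("https://peacockmedia.software/fr/mac/scrutiny/",
                 rulePattern: "peacockmedia.software/*/mac/scrutiny"))             // true
```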


The new "urls that don't contain" functionality is in Scrutiny from v7.6.8  and should shortly be available in Integrity and Integrity plus from v6.11.14


*Years ago, these rules were called 'Blacklist / Whitelist Rules'. I'm mentioning that in case anyone searches for those terms.

Wednesday, 3 January 2018

New Year - Thoughts on the Mac App Store, Integrity Plus available there once again

If you've asked me over the last few years why I don't sell my apps via the Mac App Store, I'll have answered that I hate the store. One day I threw a tantrum over a problem (a catch-22  involving version numbers following a few rejections with one of my apps) and I petulantly took all of my apps off the store.

A decision made with the heart rather than the head, which is in character. But I'd been frustrated with the store since day one. Back then Apple took 2-3 weeks to review an app... Fast fixes to problems weren't possible.

Sandboxing was introduced and then made compulsory for apps listed on the Store. This is a security measure which gives you hoops to jump through and makes certain features impossible.

The iTunesConnect site (used for upload / submission) was not very user-friendly early on (let's not get into why the default music player was involved with installing apps on your computer). Apps get rejected. All of this meant that getting an app listed on the store was time-consuming and frustrating. Then Apple took a whopping 30% of your revenue.

It didn't (and still doesn't) allow for trial periods, which has always been very important for me. Developers get around this by making the app free and offering a 'pro' version (which is something I already have with Integrity / Integrity Plus), or making the app free and offering in-app purchases or subscriptions to a service, or using advertising. I don't really like any of those in other apps and won't get involved with them myself (another emotional rather than business decision).

I've been waiting for the day that the only way to install an app is via the Store, or by 'jailbreaking' your Mac, as with iOS. That day hasn't come (yet) but Gatekeeper has steered people away from web-downloading apps - even a code-signed app requires a 'lower' security setting than installing apps from the app store.

With all of this said, I do find things much improved.

New year - new start. I have let my logical head overrule my feelings and made the business decision to list paid apps on the store once more. The reason is simple, I want as many people to discover and use my apps as possible. In return for the frustrations and giving Apple a big cut, you get more users.

There are many things to conform with  (screenshot sizes, compulsory fields, all of which is pretty reasonable). Uploading and submitting wasn't so painful. Sandboxing still applies, as does the huge cut Apple take. I was amazed to find Integrity Plus approved within hours of finally managing to submit it.
Future plans - I don't think that Scrutiny will appear on the Store again, it has some important features which aren't compatible with Sandboxing, such as the scheduling. But I am drawing plans for an 'Integrity Pro' which adds the SEO stuff that only Scrutiny has at the moment.


Monday, 20 November 2017

Black Friday / Cyber Monday offers on Scrutiny for Mac and Integrity Plus




50% off Scrutiny

For Black Friday / Cyber Monday


If you're a user of Integrity or Integrity Plus, or have trialled Scrutiny, we hope that this discount will help you to make the decision to add Scrutiny for Mac to your armoury. 

Simply buy through the app or Scrutiny's home page, the discount will be applied. Exp 28-11-2017


The offer will run for the next week or so, please feel free to share this offer, or use it to buy more licences if you have multiple users.


50% off Integrity Plus

For Black Friday / Cyber Monday


If you're a user of the free Integrity, you may like to learn about the extra power that Integrity Plus gives you. Search / filter / export your link results, generate an XML sitemap and more.

This offer is being run by MacUpdate, visit Integrity Plus's page at MacUpdate to take advantage. Exp 28-11-2017

The offer will run for the next week or so, please feel free to share this offer or use it to buy more licences if you have multiple users.


Sunday, 9 July 2017

NSURLConnection won't die!


Integrity has been the most interesting project of my life.

I believe that the success of the web is down to the html standard and its flexibility *. It's human-readable, human-writable and web browsers do their best to render a page, whatever problems there might be.

I hate to think how many hundreds of hours of work that this has made for me over the last ten years.

With many applications, you can press every button, test all scenarios and be sure that it works. But with a web crawler you can test it on 99 websites, and it'll fail when the first person tries to use it **.

In order to have a reliable web crawler which parses html itself, you have to investigate every problem and improve your code  to handle whatever new unpredictable thing has been tripping it up.

It's taken ten years of this hard work for Integrity (and other related apps which use the same engine) to be as stable as it is. It has more users than ever and head-scratching problems are very few and far between now.

The worst times are where the problem happens at a deeper system level and you can get no debug information.

At a very high level, you can obtain data for a url in a single line. At the opposite extreme you can get involved with sockets etc. For Integrity I've taken the middle ground, creating the request and asynchronous connection and using delegate methods to monitor what's happening and be able to intervene if necessary.

But there comes a point where you say au revoir to the request / connection and wait for your various notifications. If the response / notification is unexpected then you're down to some educated guesswork and trial and error.

That's what's happened this week. At a certain point through scanning particular sites, all NSURLConnections would appear to 'lock up' and all would return timeout notifications. (And any further NSURLConnections created to any url within that app would also time out until the app was quit and re-started, even though the same urls would respond in any other app.)

As usual there are many questions and answers online, with many suggestions that aren't relevant or have no effect.

I eventually got somewhere with a process of elimination - stripping the relevant code down to bare essentials until the problem had gone, then adding the original code back in, chunk by chunk until it stopped working again.

It appears that the problem is related to connections staying alive. The app obviously manages the number of simultaneous connections, and either lets one connection load all its data and complete naturally, or cancels it (if that data isn't needed) before creating a new connection to replace it.

I think what's going wrong in these cases is that when you think you've let go of a connection with [connection cancel], that connection sometimes stays open, and the next one isn't replacing it but adding to the number until some limit is hit.

Removing all [connection cancel]s and allowing every connection to load and finish naturally completely solved the problem.

Making better use of the HEAD method (when you know that you only need the status but not the data) and explicitly making sure those requests have 'connection: close' in the request header should solve the problem but it doesn't entirely.

There's a lot I still don't know - why is a connection sometimes staying alive after it's been cancelled (or in the case of a HEAD request, when it has supplied the header info and says that it's done). If anyone knows, do tell!







* Despite Microsoft's and Netscape's best attempts to make it their own, it's survived as a truly universal standard - anyone can make a web page that can be read in any browser. It's an unusual thing and the IoT has a lesson to learn.

** There are some ridiculously unexpected things in the code of some websites (written by humans and written by machines)

Friday, 26 May 2017

Alongside our recent post about supporting International Domain names within our web crawling tools, we're very proud and excited to announce that work has started on translating some apps and some web pages into other languages. Initially French, and initially Integrity, Integrity Plus and Scrutiny.


It's impossible to do all of the work at once and it will take a little while, but the web pages for Integrity, Integrity Plus and Scrutiny now allow you to choose (top-right) English or French versions of the pages.

Localized context help in French is shortly to go into those apps, followed by the rest of the text within the apps.

Wednesday, 24 May 2017

Heads-up : Internationalised Domain Names (IDNs) supported in our web crawlers

We're a UK-based concern, and our apps have almost always been available in a single language - UK English (or just English as we call it here in England!). The vast majority of our users as I write this are from English-speaking countries.

Our alphabet consists entirely of characters available in ascii, and so there has been little call for Integrity, Integrity Plus and Scrutiny (and other tools based on the same engine) to support internationalised domain names - ie domain names which contain characters not found in the ascii character set.



But now we've started work on localisation of our apps and web pages, and have received the odd question concerning IDNs.

Let's not confuse this with unusual characters in the path and filename of the url. Our apps have long supported these. You may still see the non-ascii characters displayed, but behind the scenes, those characters are encoded before the http request is put together, usually using a percent-encoding system.

The method is similar with the domain name, but uses a more complex and clever system of character encoding. Browsers (and our web crawlers) often still display the user-friendly unicode version.

You can enter your starting url in the unicode form or the 'punycode' form and it'll be handled correctly. The same goes for unicode or punycode links found on your pages.

Personally, I'm not keen; this does allow for spoofing of legitimate domains using similar characters. There are rules excluding many characters for these reasons.

After lots of extra homework for us, Scrutiny is now handling IDNs, and is in testing.

[update 26 May 2017] Integrity and Integrity Plus also have this enhancement and are also in testing.

If this is useful for you, and you'd like to try the new version (remembering that there may still be the odd bug to iron out)  then you're very welcome to download and use it (with the condition that you let us know about any issues you spot.)

[edit - links taken out, out of date]

Monday, 22 May 2017

List of all of a site's images, with file sizes

A recent enhancement to Scrutiny and Integrity makes it easy to see a list of all images on a site, with file sizes.

It was already possible to check all images (not just the ones that are linked, ie a href = "image.jpg", but the actual images on the page, img src = "image.jpg" srcset = "image@2x.jpg 2x").

The file size was also held within Scrutiny and Integrity, but wasn't displayed in the links views.

Now it is. It's a sortable column and will be included in the csv or html export.

Before the crawl, make sure that you switch on checking images:

or

You may need to switch on that column if it's not already showing - it's called 'Target size'.

Once it is showing, as with other columns in these tables, you can drag and drop them into a different order, and resize their width.

To see just the images - choose Images from the filter button over on the right (Scrutiny and Integrity Plus)


If you're checking other linked files (js or css) then their sizes may be displayed, but will probably have a ? beside them to indicate that the file hasn't been downloaded and its uncompressed size verified (the size shown is taken from the server's header fields).

This last point applies to Integrity and Integrity Plus, and will appear in Scrutiny shortly.

Note that all of this is just a measure of the sizes of all files found during a crawl. For a comprehensive load speed test on a given page, Scrutiny has such a tool - access it with cmd-2 or Tools > Page Analysis







Tuesday, 16 May 2017

Hidden Gems in Scrutiny 7: Locate a broken link

This tip applies to Integrity, Integrity Plus and Scrutiny.

So the app is reporting a broken link (or maybe it's a redirect you're interested in, or just a good link). You can easily see, copy or visit the target of the link, or the page it appears on. But how did the crawl find this particular page?

The Locate function will tell you. First open the link inspector by double-clicking on the link in one of the Links views. Then highlight the 'appears on' page you're interested in, and click 'Locate'.

It won't show you every possible route to that link, but it will show you the shortest.

Note that there's a context menu there too with these options.


You may have noticed that the link inspector and the context menus have a 'Highlight' option too. If you're having trouble seeing the link on the page, the Highlight option will do its best to open the page and apply yellow highlighter.