Sunday 22 March 2020

Improvements to free sitemap visualiser

SiteViz is, as its name suggests, a tool for visualising a sitemap.

Integrity Plus, Pro and Scrutiny can export their sitemap data in the form of a .dot file, which contains the information necessary to draw a chart.
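DOT is the plain-text graph description language used by Graphviz. As a rough illustration of the kind of information such a file carries, here is a minimal sketch that builds DOT text from a list of page-to-page links. The helper name DotFromLinks and the sample URLs are hypothetical, and the exact contents of the files exported by the apps are not shown in this post, so the layout here is an assumption.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds DOT text from (source, target) URL pairs.
static NSString *DotFromLinks(NSArray<NSArray<NSString *> *> *links) {
    NSMutableString *dot = [NSMutableString stringWithString:@"digraph sitemap {\n"];
    for (NSArray<NSString *> *link in links) {
        // Each edge line means "the first page links to the second page"
        [dot appendFormat:@"    \"%@\" -> \"%@\";\n", link[0], link[1]];
    }
    [dot appendString:@"}\n"];
    return dot;
}

int main(void) {
    @autoreleasepool {
        // Made-up sample links, not taken from a real crawl
        NSArray<NSArray<NSString *> *> *links = @[
            @[@"/", @"/about"],
            @[@"/", @"/products"],
            @[@"/products", @"/products/widget"]
        ];
        NSLog(@"\n%@", DotFromLinks(links));
    }
    return 0;
}
```

A viewer only needs the nodes and the edges between them to lay out a chart, which is why a small text format like this is enough.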

Early on, OmniGraffle did a great job of displaying these .dot files, but it became expensive. There were other options, but nothing free and easy to use. That's where SiteViz came in, and its functionality has also been built into Scrutiny.

Its default chart didn't look great, though. The layout was flawed, which made the charts less than ideal.

Version 3 contains some vast improvements to that 'bubble tree' theme, along with improvements to the colouring. Nodes can be coloured according to links in or 'link juice' (think liquid flowing through your links). Connections can be coloured according to multiple criteria too. The charts now look much more professional and are much more useful, especially with the bubble tree theme. The screenshots you see on this page were made using it.

It will remain a free viewer. You can export the chart as a PNG image or PDF, but it can't create the .dot sitemap. For that you'll need a website crawler like Integrity Plus or Pro.

Thursday 19 March 2020

What happens in Objective-C if you compare two NSObjects with a simple 'greater than' or 'less than'?

This is one of those odd bugs that goes unnoticed because a line of code that shouldn't work strangely does, at least most of the time.

The problem exhibited itself under certain circumstances. That led me to investigate and discover this line (in pseudocode):

if ([object getProperty] > [anotherObject getProperty]){

Years ago, this line was correct, because this particular object's getProperty used to return a primitive such as an int or NSInteger (I can't remember which).

But at some point getProperty was changed so that it returned an NSNumber, which is an object rather than a simple value.

The line should have been updated to (and has been now):

if ([[object getProperty] integerValue] > [[anotherObject getProperty] integerValue]){

Of course I should have searched the source for 'getProperty' and updated accordingly but this particular line escaped. It went unnoticed. The compiler didn't complain, and everything still seemed to work.
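Here is a minimal, self-contained sketch of the corrected comparison. The helper names ValueIsGreater and ValueIsGreaterUsingCompare are hypothetical and stand in for the real code, which isn't shown here. The second helper uses NSNumber's compare: method, which is an alternative to unwrapping with integerValue.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when a's wrapped value is greater than b's.
static BOOL ValueIsGreater(NSNumber *a, NSNumber *b) {
    // integerValue unwraps each object, so this compares primitives
    return [a integerValue] > [b integerValue];
}

// Hypothetical helper: the same test using NSNumber's own comparison.
static BOOL ValueIsGreaterUsingCompare(NSNumber *a, NSNumber *b) {
    // NSOrderedDescending means the receiver (a) is greater than b
    return [a compare:b] == NSOrderedDescending;
}

int main(void) {
    @autoreleasepool {
        NSNumber *two = @2;
        NSNumber *ten = @10;
        NSLog(@"integerValue: ten > two is %d", ValueIsGreater(ten, two));
        NSLog(@"compare:      ten > two is %d", ValueIsGreaterUsingCompare(ten, two));
    }
    return 0;
}
```

Both helpers look at the values the objects hold, so the result no longer depends on where the objects happen to live.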

If a tree falls in a forest... If a bad line of code appears to work under testing and no-one reports a problem, is it still a bug?

It didn't always work though. Under certain circumstances that particular thing went screwy (no crash, no exception, just odd results sometimes). But not randomly. It worked or didn't work predictably, depending on the circumstances.

I can't find confirmation of this but it seems from what I've observed that

(NSObject > NSObject) 

returns true or false depending on the order in which they were created. I'm assuming that the comparison is being made using the objects' addresses (pointers) treated as numeric values. This makes sense because declaring NSObject *myobject declares myobject as a pointer, which is a kind of primitive known to contain a memory address.

A simple experiment seems to bear this out.

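    NSNumber *number1 = [NSNumber numberWithInt:2];
    NSNumber *number2 = [NSNumber numberWithInt:10];
    // %p logs the pointer values, which are what the test below compares
    NSLog(@"number1: %p, number2: %p", number1, number2);
    if (number1 > number2) {  // compares the pointers, not the values 2 and 10
        NSLog(@"number1 is greater than number2");
    } else {
        NSLog(@"number2 is greater than number1");
    }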


NumberTest[89819:23978762] number1: 0x7312659ea1d74f79, number2: 0x7312659ea1d74779
NumberTest[89819:23978762] number1 is greater than number2

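It's interesting that the objects seem to be allocated going backwards in memory in this example. I assume that allocation of memory is not predictable, but that there would be some general sequence to it, backwards or forwards. One caveat: on 64-bit Apple platforms, small NSNumber values are usually stored as tagged pointers, with the value packed into the pointer itself rather than held in allocated memory, so the two values logged above may not be real addresses and may say little about allocation order.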

I'm obviously pleased to have found and fixed a problem in this app. But more than anything this has amused and interested me.

If you have anything to contribute on this subject, leave it in the comments.

Monday 16 March 2020

Scraping email addresses with WebScraper

This has been a popular question from WebScraper users.

It has already been possible to use WebScraper to scrape email addresses, with some caveats: a/ it required setting up with your favourite regular expression for matching email addresses (these are easy to find online if you're not a regex demon), and b/ it only works if the email addresses appear unobfuscated in the source or visible text. It has long been popular to use some method of hiding email addresses from bots.
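To make the regular expression approach concrete, here is a minimal sketch using NSRegularExpression. The pattern is a simple illustrative one and is an assumption, not the expression WebScraper uses, and the helper name FindEmails is hypothetical. The sample text also shows caveat b: the obfuscated 'info [at] example.com' is not found.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every email-like match in text, in order.
static NSArray<NSString *> *FindEmails(NSString *text) {
    // Illustrative pattern only; an assumption, not WebScraper's default
    NSString *pattern = @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];
    NSMutableArray<NSString *> *found = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        [found addObject:[text substringWithRange:match.range]];
    }
    return found;
}

int main(void) {
    @autoreleasepool {
        // Made-up page source: one address as a link, one obfuscated
        NSString *html = @"<a href=\"mailto:sales@example.com\">sales@example.com</a> "
                         @"or info [at] example.com";
        NSLog(@"%@", FindEmails(html));
    }
    return 0;
}
```

Note that the linked address is matched twice, once in the href and once in the visible text.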

To help with this particular task, you can now set up WebScraper more easily (as of version 4.11.0). 'Email Addresses' now appears in the drop-down buttons for the simple and complex setups.

Here's the simple setup. It couldn't be easier: 'Email Addresses' appears in the second drop-down if you choose 'Content' in the first. Skip to the Post-processing tab >>

Output file columns

For the complex setup, you get the URL and Title columns by default. You may like to keep those so that you can see which page each email address appears on, or delete them if you simply want a list of email addresses. Then add a column. As with the simple setup, choose Content and then Email Addresses.

Results tab

At this point (after running the scan or a test) each page is presented on a row of this table. If there are multiple email addresses on a page, they'll be listed in a single cell, separated by a pipe. We'll fix that later. There may also be big gaps where pages don't contain an email address. That's also something we can fix.

During the scan, at the point where email addresses are scraped from a page, the results are uniqued. So if the same address appears multiple times on the same page (which is likely if the address appears as a link - it may be in the link and in the visible text) then it'll only appear once in that row on the Results tab.
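The per-page uniquing described above can be sketched like this. It is only an illustration of the idea, not WebScraper's actual code: the helper name UniqueAddresses is hypothetical, and treating addresses as case-sensitive is an assumption.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes repeats while keeping first-seen order.
// Comparison is exact, so "Sales@example.com" and "sales@example.com"
// would count as different addresses (an assumption for this sketch).
static NSArray<NSString *> *UniqueAddresses(NSArray<NSString *> *addresses) {
    return [NSOrderedSet orderedSetWithArray:addresses].array;
}

int main(void) {
    @autoreleasepool {
        // The same address found in a link's href and in its visible text
        NSArray<NSString *> *onePage = @[
            @"sales@example.com",
            @"sales@example.com",
            @"support@example.com"
        ];
        NSLog(@"%@", UniqueAddresses(onePage));
    }
    return 0;
}
```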

Post-processing tab

Here's where we can tidy up the output. The first checkbox will split multiple results onto separate rows. The third checkbox will skip rows where there are no results. The drop-down button will contain a list of your output columns; choose your email address column. As the label says, these things will be done when you export your output to csv.
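As a rough sketch of what those two options do to the email address column, assuming each page's cell holds its results separated by a pipe: the helper name ExportRows is hypothetical, and this is an illustration rather than the app's implementation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: takes one cell per page and returns one row per address.
static NSArray<NSString *> *ExportRows(NSArray<NSString *> *cells) {
    NSCharacterSet *spaces = [NSCharacterSet whitespaceCharacterSet];
    NSMutableArray<NSString *> *rows = [NSMutableArray array];
    for (NSString *cell in cells) {
        // Split multiple results in one cell onto separate rows
        for (NSString *part in [cell componentsSeparatedByString:@"|"]) {
            NSString *trimmed = [part stringByTrimmingCharactersInSet:spaces];
            // Skip rows where there is no result
            if (trimmed.length > 0) {
                [rows addObject:trimmed];
            }
        }
    }
    return rows;
}

int main(void) {
    @autoreleasepool {
        // Made-up cells: two addresses, none, then one
        NSArray<NSString *> *cells = @[
            @"sales@example.com|support@example.com",
            @"",
            @"info@example.org"
        ];
        NSLog(@"%@", ExportRows(cells));
    }
    return 0;
}
```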


The default expression used for this task will match strings that look like email addresses. I found that it can also match certain image filenames that have an @ symbol in them. If you wish to improve the regular expression, then simply change it in this field in Preferences.
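One possible refinement for the image filename problem is a negative lookahead that rejects matches ending in a common image extension. This is a sketch under assumptions: the pattern and the list of extensions are illustrative, the helper name FindEmailsSkippingImages is hypothetical, and it assumes the regex engine supports lookahead (NSRegularExpression, used here, does).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: finds email-like strings but skips names such as logo@2x.png.
static NSArray<NSString *> *FindEmailsSkippingImages(NSString *text) {
    // (?!...) rejects a final part that is an image extension (illustrative list)
    NSString *pattern = @"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\."
                        @"(?!(?:png|jpe?g|gif|svg|webp)\\b)[A-Za-z]{2,}";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];
    NSMutableArray<NSString *> *found = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        [found addObject:[text substringWithRange:match.range]];
    }
    return found;
}

int main(void) {
    @autoreleasepool {
        // Made-up text containing an image name and a real-looking address
        NSString *text = @"img/logo@2x.png and sales@example.com";
        NSLog(@"%@", FindEmailsSkippingImages(text));
    }
    return 0;
}
```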


Here's my output for this example (I chose not to include the URL and title columns). Note that the same address appears a lot. At the time of writing, WebScraper doesn't have a 'unique' option on the post-processing tab, but that's under consideration. Also note that caveat b at the top of this article still applies.

Tuesday 10 March 2020

Changes to nofollow links: sponsored and ugc attributes: how to check your links

Google announced changes last year to the way they'd like publishers to mark nofollow links.

The rel attribute can now also contain 'sponsored' or 'ugc' to indicate paid links and user-generated content respectively. Previously, nofollow links were not used for indexing or ranking purposes, but this is changing: Google will no longer treat nofollow as a strict instruction not to follow or index.

This article lists the changes and how they affect you.

As of version 9.5.6 of Integrity (including Integrity Plus and Pro) and version 9.5.7 of Scrutiny, these apps allow you to see and sort your links according to these attributes.

There was already a column in the links views for 'rel' which displayed the content of the rel attribute, and a column for 'nofollow' which displayed 'yes' or 'no' as appropriate. Now there are new columns for 'sponsored' and 'ugc' (also displaying yes/no for easy sorting). Many of the views have a 'column' selector. If visible, these columns will be sortable and will be included in csv exports.
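As a rough illustration of how a yes/no value for each of these columns can be derived from a link's rel attribute, here is a minimal sketch. It relies on rel being a list of space-separated, case-insensitive tokens. The helper name RelContains is hypothetical and this is not the apps' actual code.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if the rel value contains the given lowercase token.
static BOOL RelContains(NSString *rel, NSString *token) {
    // rel holds space-separated tokens; lowercase them for a case-insensitive test
    NSArray<NSString *> *parts = [rel.lowercaseString
        componentsSeparatedByCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    return [parts containsObject:token];
}

int main(void) {
    @autoreleasepool {
        // Made-up rel value from a link in a comments section
        NSString *rel = @"nofollow ugc";
        for (NSString *token in @[@"nofollow", @"sponsored", @"ugc"]) {
            NSLog(@"%@: %@", token, RelContains(rel, token) ? @"yes" : @"no");
        }
    }
    return 0;
}
```

Matching whole tokens rather than searching for a substring avoids treating an unrelated value that merely contains the letters 'ugc' as a match.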