Sunday 21 June 2015

Surprising SEO results with a personal blog

A small but perhaps very useful enhancement to Scrutiny (in progress right now) is to replace the limited summary on the SEO results screen (previously it just gave counts of pages without a title or meta description) with a more comprehensive summary:
Screenshot showing Scrutiny's SEO results table plus the new summary text

Bit of a surprise with this one (a personal blog).

Previously with Scrutiny you had to use the filter button to visit the results for each test in turn. Now the list is just there at a glance, and I guess I haven't been very vigilant here: I wasn't aware that Blogger doesn't automatically add a meta description, and I've always been too excited about each new blog post to worry about image alt text...
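For anyone curious about what these checks amount to, here's a minimal sketch of flagging a missing title, missing meta description and images without alt text for a single page. This is not Scrutiny's code - just an illustration using requests and BeautifulSoup, with example.com as a placeholder URL.

```python
# Rough sketch of the per-page SEO checks described above (missing title,
# missing meta description, images without alt text). Illustrative only.
import requests
from bs4 import BeautifulSoup

def seo_summary(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    title = soup.find("title")
    description = soup.find("meta", attrs={"name": "description"})
    images_without_alt = [img.get("src") for img in soup.find_all("img")
                          if not img.get("alt")]

    return {
        "url": url,
        "missing title": title is None or not title.get_text(strip=True),
        "missing meta description": description is None
                                    or not description.get("content"),
        "images without alt text": len(images_without_alt),
    }

if __name__ == "__main__":
    print(seo_summary("https://example.com/"))  # placeholder URL
```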

Wednesday 10 June 2015

Visualising a website in 3D

This is very much a work-in-progress but here's a sneaky peeky at the 3D functionality I'm experimenting with for SiteViz. This is the Peacockmedia site exported by Scrutiny and visualised in 3D by SiteViz:
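SiteViz itself is a native app, so this is nothing like its internals, but as a rough sketch of what laying out a Scrutiny-exported .dot file in 3D involves, here's a few lines using networkx and matplotlib (the file name is made up):

```python
# Illustrative only: read a .dot file, compute a 3D force-directed layout
# and plot links and pages. Not SiteViz's code.
import networkx as nx
import matplotlib.pyplot as plt

graph = nx.nx_pydot.read_dot("peacockmedia.dot")      # hypothetical file name
positions = nx.spring_layout(graph, dim=3, seed=42)   # 3D spring layout

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# Draw each link as a line between its two endpoints
for a, b in graph.edges():
    xs, ys, zs = zip(positions[a], positions[b])
    ax.plot(xs, ys, zs, color="lightgrey", linewidth=0.5)

# Draw each page as a point
xs, ys, zs = zip(*positions.values())
ax.scatter(xs, ys, zs, s=10)
plt.show()
```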


Friday 5 June 2015

Spidering wikipedia

I've reached a milestone in my 'crawling the English Wikipedia' project. (I'm hoping to find out whether the 'six degrees' principle is true*.) Scrutiny has now managed a scan taking in 3 million links, which includes 1.279 million pages in its sitemap results. This is the largest single scan I've ever seen any of my applications run.



My instance of Scrutiny must have been feeling very enlightened after parsing this eclectic raft of articles including Blue tit, Conway Twitty, Wolverine (character) [yes, there are a surprising number of other Wolverines!], Benjamin Anderson (adventurer) and Personal Jesus.

The most fascinating thing about this crawl is that out of the pages scanned here, the article with the most links (excluding a few unusual page types) is Alcohol. It has over 6,000 hyperlinks on its page.** This suggests that we have more to say about nature's gift of fermentation than about World War Two, which has two-thirds the number of links.


* The uncertainty here is that if you imagine a node structure with each node linking to, say, 100 pages, then you can reach a million pages in three clicks (100 × 100 × 100 = 1,000,000). But those aren't a million unique pages; the number of previously-undiscovered pages diminishes with each page parsed.
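A crude back-of-envelope illustration of that point (entirely synthetic numbers with uniform random links, nothing like real Wikipedia's link structure):

```python
# With ~100 links per page, three clicks follow 100**3 = 1,000,000 link paths,
# but far fewer *unique* pages, because later links increasingly point at
# pages already discovered. Synthetic illustration only.
import random

N_PAGES = 1_000_000     # pretend encyclopedia size
LINKS_PER_PAGE = 100    # the footnote's illustrative branching factor

def links(page):
    # Each page links to 100 random other pages (a crude stand-in for Wikipedia)
    random.seed(page)
    return random.sample(range(N_PAGES), LINKS_PER_PAGE)

frontier = {0}          # start at the 'home' page
seen = {0}
for click in range(1, 4):
    frontier = {target for page in frontier for target in links(page)} - seen
    seen |= frontier
    print(f"after {click} click(s): {len(seen):,} unique pages "
          f"(vs {LINKS_PER_PAGE ** click:,} link paths followed)")
```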

** This does include 'edit' links and citation anchor links. For future crawls I'll blacklist these for efficiency.
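As a rough illustration of that filtering (not Scrutiny's actual blacklist syntax; just requests and BeautifulSoup, with the URL patterns being my reading of the two categories above):

```python
# Count the hyperlinks on a Wikipedia article, with and without section
# 'edit' links and citation anchor links. Illustrative only.
import requests
from bs4 import BeautifulSoup

def count_links(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]

    def is_blacklisted(href):
        # The two categories mentioned above: section 'edit' links and
        # citation anchor links
        return "action=edit" in href or href.startswith("#cite_note")

    kept = [h for h in hrefs if not is_blacklisted(h)]
    return len(hrefs), len(kept)

total, filtered = count_links("https://en.wikipedia.org/wiki/Alcohol")
print(f"{total} links in total, {filtered} after excluding edit/citation links")
```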

Thursday 4 June 2015

Internal backlinking

The graphs for this website's sitemap are unusual and very attractive.

Here's the 'Daisy' themed graph: perhaps more attractive, but maybe less obvious what's going on.

So what *is* going on here? Upon investigation (aka switching on labels by clicking a button in the toolbar)...

... the 2015 pages are all linked from a page, two clicks from home, called 2015 (and only from that page). On that page is a link to a page called 2014, and on that page are links to all the 2014 pages plus a link to 2013, and so on.

Visiting any of these pages makes it clearer. This is an unusual kind of pagination, a little like scrolling to the bottom of some content and clicking 'more' to load older content. From a user's point of view it does work very well: everything's really obvious, and no-one's going to struggle to find the older content; it'll just take more clicks.
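To make the 'clicks from home' effect concrete, here's a tiny sketch (networkx, with made-up page names following the structure described above): each year's archive page sits one click deeper than the next, and its reports one click deeper still.

```python
# Tiny illustration of the archive-chain structure described above, with
# made-up page names. Each year's page links to its reports and to the
# previous year's page, so older reports sit progressively further from home.
import networkx as nx

g = nx.DiGraph()
g.add_edge("home", "reports")          # one click from home
previous = "reports"
for year in [2015, 2014, 2013, 2012]:
    g.add_edge(previous, str(year))    # link on to the next archive page
    g.add_edge(str(year), f"{year} report A")
    g.add_edge(str(year), f"{year} report B")
    previous = str(year)

# 'Clicks from home' is just shortest-path length from the home page
depths = nx.single_source_shortest_path_length(g, "home")
for page in sorted(depths, key=depths.get):
    print(f"{depths[page]} clicks: {page}")
```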

So is this a problem? The pages are all discoverable, so no problem there. But some might say that this site isn't making the best use of internal backlinking. In this particular case I don't think it matters: these are reports going back in time, and it's unlikely that a visitor is as interested in the older reports as in the newer ones.

If you have any other thoughts on the analysis of these graphs, or on thin internal backlinking, please comment.

(graphs generated by SiteViz, using sitemap files generated by Scrutiny)

Yesterday's post in this series: analysing the structure of larger sites

Wednesday 3 June 2015

What does Amazon's website structure look like?

While I was discussing the visual analysis of website structure with a Scrutiny user, he mused that it would be useful to see what a successful website such as Amazon would 'look like'. Well, here it is:



The eyeball shape is completely unintended and unexpected, and I think really funny. (And slightly ironic.)

In fact this isn't the real picture at all. It only shows pages (of Amazon.co.uk) within 2 clicks of home, not because all pages on the site lie therein but because traversing 100,000 links and including 1,000 pages in this chart barely scratches the surface of the website. (There are ~120 links on the homepage, and if every page carries an average of 100 links (it does), then given Scrutiny's top-down approach it would need to include some 10,000 pages in this chart just to reach the 'escape velocity' of the second level.) This project is on the back-burner for another day in favour of some smaller commercial sites.
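Just to spell out that bracketed arithmetic, using the rough figures quoted above rather than anything measured:

```python
# Rough arithmetic: ~120 links on the homepage, ~100 links per page on average.
homepage_links = 120
links_per_page = 100

level_1 = homepage_links            # pages one click from home
level_2 = level_1 * links_per_page  # upper bound two clicks out, before de-duplication

print(f"level 1 (one click from home):  ~{level_1:,} pages")
print(f"level 2 (two clicks from home): up to ~{level_2:,} pages before de-duplication")
print("...so a 1,000-page chart barely scratches the surface")
```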

NB the placement of each page in this chart is based on 'clicks from home', not necessarily 'navigation by navbar' or the directories implied by the URLs.

Other sites


Here are a couple of sites, crawled to completion, to see how successful commercial sites appear.

The first is my favourite clothes retailer:
There are a relatively small number of pages 4 clicks from home, but the vast majority of the product pages can be reached within 3 clicks. Based only on this blogger's history of using this site, it *is* more usual to browse than to search with this type of site.

Next up is my favourite shoe site. Again crawled in its entirety.
Very similar, especially if we take into account that it has fewer pages than the clothes retailer.

9 circles

Finally in this tour, for comparison, here's the site of a local authority (middle-tier local government). These are not commercial organisations and not generally renowned for the user-friendliness of their websites.

This '9 circles of hell' extends outwards well beyond this screenshot. Though to be fair, all of the actual website content is 6 clicks from home or fewer*; after that we're into pages of planning documents etc.

These graphs are analogous to browsing the site. (I have some experience in local authority websites and it is more common than you'd think for users to browse rather than search.) If, in the real world, the search box is used, then the user is 2 clicks from home. If the user starts with Google, then they potentially land on the page they need (assuming the page is indexed). But the object of this exercise is to see how successful websites are organised in terms of their link structure and to see what we can learn. These three sites have a similar number of pages.**

I'm working on some other ideas, so please keep an eye on this blog: besides the Amazon project, I'd love to crawl the entire English content of Wikipedia to see whether the 'six degrees' game holds true. I believe this is feasible; I've now successfully made a crawl of up to a million links (which included half a million pages in the sitemap), so I don't think the 4-point-something million articles are out of the question.

These websites were crawled by Scrutiny and the graphs generated by SiteViz, a tool I've been working on for a long time to view the .dot files generated by Scrutiny and Integrity Plus. SiteViz is very new and in beta. Other graphing applications can also open Scrutiny's .dot files.

If you have any other thoughts on what we can learn from these charts and figures, please leave them in the comments.

* To be clear, pages are shown here at the fewest clicks possible from the starting page (as far as Scrutiny was able to discover).

** In the same ballpark: ~3,000 for the shoe site, ~6,500 for the clothes retailer and ~5,000 for the local authority.