Sunday, 27 September 2015

Breakthrough in wikipedia spidering project - 3 million links checked

Today a scan of 3 million links finished. That in itself isn't a breakthrough, because I've previously made such a scan. But this time, thanks to a very small change at the very heart of the new v6 crawling engine, the app was still working within expected resources at the end of the scan, and both the app and the Mac were still responsive.
That seems a very large number of broken links, but it does appear to be a true result.
So we're now 'game on' for a 5 million link crawl. The aim of the game is to find out whether the 'six degrees' theory is true (whether you can really reach any of the 5 million English-language pages within six clicks).
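
In graph terms, the 'six clicks' question is a breadth-first search with a depth limit of six. Here's a minimal sketch in Python, assuming the crawl results are held as a dictionary mapping each page to the pages it links to. The function name and the toy data are made up for illustration; this isn't how the crawling engine itself stores or walks the links.

```python
from collections import deque

def pages_within_clicks(links, start, max_clicks=6):
    """Return {page: clicks} for every page reachable from start within max_clicks."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if distance[page] == max_clicks:
            continue  # don't follow links beyond the click limit
        for target in links.get(page, ()):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance

# Toy link graph (hypothetical data, not real crawl output)
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}
reachable = pages_within_clicks(links, "A", max_clicks=2)
print(reachable)
```

For a given starting page, the theory holds if every page in the crawl turns up in the result with the default limit of six.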

Friday, 18 September 2015

Generating an XML sitemap for your website

This new video explains XML sitemaps and demonstrates how to generate one for any site using Integrity Plus. It also looks at some of the options.
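
For the curious, a minimal XML sitemap is just a `urlset` element in the sitemaps.org namespace with one `url` / `loc` entry per page. The sketch below builds one with Python's standard library. The function name and the example.com addresses are placeholders, and this is only an illustration of the file format, not how Integrity Plus generates its sitemaps.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Build a minimal sitemap document (one <url><loc> entry per page)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")

# Placeholder addresses for illustration
sitemap_xml = build_sitemap([
    "https://example.com/",
    "https://example.com/about.html",
])
print(sitemap_xml)
```

A real sitemap file would normally also start with an XML declaration and may carry optional fields such as the last-modified date for each page.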

Wednesday, 16 September 2015

Using Scrutiny to make important checks over your EverWeb website


In this new video we take a look at how to use Scrutiny to make some important UX and SEO checks over your EverWeb website.

The checks themselves apply to any website, but EverWeb makes it easy to correct the issues, as we demonstrate here.

This tutorial uses Scrutiny for Mac and the EverWeb 'drag and drop' content management system.

Links become broken over time (link rot), so a regular link check is important. EverWeb helps with this issue because it manages the links in your navigator, but links in your content are still vulnerable to your own or external pages being moved, changed or deleted. Fortunately, finding them and fixing them is easy, as demonstrated.

The title tag and meta description are very important (and a good opportunity) for SEO. Scrutiny will highlight any that are missing, too long or too short. EverWeb makes it a breeze to update these where necessary.

Alt text for your images is also important (depending on the image). Once again, Scrutiny can highlight any potential issues / keyword opportunities and the video shows you how to update your site.

On the subject of keywords, it's very important to do your keyword research and ensure that your pages contain a reasonable amount of good quality content. Once Scrutiny has scanned your site, you can see any pages with thin content or keyword stuffing, check occurrences of your target keywords and even see a full keyword analysis.

Wednesday, 9 September 2015

Integrity, Integrity Plus and Scrutiny - updates to fix recent mysterious crashes

Since earlier in September there have been a number of support requests with the following pattern:

  • Integrity or Scrutiny being run on Yosemite (seems fine on Mavericks or El Capitan)
  • The app quits at a consistent point in the scan
  • The site contains links to "docs.google...." or "drive.google..."
  • The crash report shows that control is with the system rather than Integrity or Scrutiny at the point of the crash
  • Integrity and Scrutiny have been very stable for a long time; this problem will have started recently and without you changing anything

The Google links work in a browser without the browser crashing, even with cookies and JavaScript switched off (which is how Integrity and Scrutiny generally send their requests).

Having traced this to the Google links, I then narrowed it to the User-Agent string. The problem seems to be with sending an HTTP request to these URLs with the default User-Agent string for Integrity or Scrutiny. (By default my apps are honest about their identity.)
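
To show what the User-Agent string is at the HTTP level, here's a small sketch in Python: the same request can identify itself as the app or as a browser just by changing one header. Both strings below are made-up placeholders, not the real values sent by Integrity, Scrutiny or any browser, and the URL is only an example.

```python
import urllib.request

# Placeholder strings: NOT the real values used by Integrity, Scrutiny or any browser
APP_USER_AGENT = "ExampleLinkChecker/1.0"
BROWSER_USER_AGENT = "Mozilla/5.0 (Macintosh) ExampleBrowser/1.0"

def build_request(url, user_agent):
    """Prepare a request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

request = build_request("https://example.com/", BROWSER_USER_AGENT)
print(request.get_method(), request.get_header("User-agent"))

# To actually send it (needs a network connection):
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```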

I have two workarounds. The first is easy - go to Preferences and switch the User Agent string to one of the browsers.

The second is to blacklist these links by setting up two rules saying:

Do not check urls containing docs.google
Do not check urls containing drive.google
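
In effect, rules like these are a simple 'contains' filter applied to each URL before it is checked. A rough sketch of the idea in Python (the names are made up for illustration, and the apps' own rule matching may differ in detail):

```python
# The two substrings from the rules above
BLACKLIST = ["docs.google", "drive.google"]

def should_check(url, blacklist=BLACKLIST):
    """Return False if the URL contains any blacklisted substring."""
    return not any(fragment in url for fragment in blacklist)

# Example URLs (hypothetical)
for url in [
    "https://docs.google.com/document/d/abc",
    "https://drive.google.com/file/d/xyz",
    "https://example.com/page.html",
]:
    print(url, "check" if should_check(url) else "skip")
```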


I've just released new versions of Integrity and Integrity Plus (v5.4.2) and Scrutiny (v5.9.12) which contain a little hack meaning that you can continue using the default UA string for Integrity or Scrutiny and the scan will complete with these links being tested.

Tuesday, 8 September 2015

The cosmic watchmaker and the genetic algorithm



Reading about artificial neural networks has been a life-changer. That led on to the unexpected topic of the genetic algorithm, which is very effective by itself (i.e. without using a neural network) at solving tricky problems.

After working through my first example I was really astonished to find that by applying rules like crossover and mutation (mimicking our own reproduction) to a population of initially random data, you arrive at a very fit population (i.e. one that gives you lots of good answers very close to the best answer) in remarkably few generations.

This is really profound stuff and I'm more excited than I have been since I first came to conventional computing in the mid-80s. If you watch a trace of your data 'evolving', it's perfectly obvious why we reproduce sexually - anything which reproduces this way can become fit for a new environment or solve a problem in remarkably few generations. 

It's also clear that you can produce something very distinct (all depending on your test for fitness) from random data very quickly. Thus the intricate watch found on a beach (as long as it could be the product of rules such as genetic crossover and mutation and meets a need perfectly) isn't so remarkable.

To demonstrate all this, and just as a fun exercise in this stuff, here is a face (a very special one) evolving from random noise.

Without going into too much detail (that's all here) this exercise starts with a 'population' of chromosomes made of random numbers, which represent pixels in the images. For each new generation, the rules of crossover and mutation* and a test for fitness are applied. 'Survival of the fittest' isn't a good description of what really goes on in the natural world or in our algorithm here. Instead, individuals are randomly selected for reproduction with a bias towards the fittest.

In the human population, 'fittest' means thriving and feeding yourself until you can reproduce. Here, it means how closely each chromosome resembles the target picture. (Think females choosing males with a highly decorative tail.) (Target picture is shown on the right for reference.)
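
The process described above can be sketched in a few lines of Python. This is a toy version under my own assumptions, not the code behind the video: the 'picture' is just 16 grey-level pixels, and the population size, mutation rate, single-point crossover and the 'fitter of two random candidates' selection are all choices made for illustration.

```python
import random

def fitness(chromosome, target):
    """Higher is better: negative total pixel difference from the target."""
    return -sum(abs(a - b) for a, b in zip(chromosome, target))

def select(population, target, rng):
    """Pick a parent at random, biased towards the fitter of two candidates."""
    a, b = rng.sample(population, 2)
    return a if fitness(a, target) >= fitness(b, target) else b

def crossover(mum, dad, rng):
    """Single-point crossover: the start of one parent, the rest of the other."""
    point = rng.randrange(1, len(mum))
    return mum[:point] + dad[point:]

def mutate(chromosome, rng, rate=0.02):
    """Occasionally replace a pixel with a new random value."""
    return [rng.randrange(256) if rng.random() < rate else gene
            for gene in chromosome]

def mean_fitness(population, target):
    return sum(fitness(c, target) for c in population) / len(population)

def evolve(target, pop_size=100, generations=200, seed=1):
    """Evolve random chromosomes towards target; return population and fitness history."""
    rng = random.Random(seed)
    population = [[rng.randrange(256) for _ in target] for _ in range(pop_size)]
    history = [mean_fitness(population, target)]
    for _ in range(generations):
        population = [
            mutate(crossover(select(population, target, rng),
                             select(population, target, rng), rng), rng)
            for _ in range(pop_size)
        ]
        history.append(mean_fitness(population, target))
    return population, history

# A made-up 16-pixel 'picture' to aim for
target = [0, 64, 128, 255] * 4
population, history = evolve(target)
print("mean fitness at start:", history[0])
print("mean fitness at end:  ", history[-1])
```

Note that nothing here keeps 'the winner' from one generation to the next. Every individual is replaced each time, and the only pressure is the bias in `select`.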

The infinite monkeys concept isn't helpful here either. Sometimes a solution to a problem can appear out of random data (the bigger the population, the more likely), but this isn't generating random data until the right answer is found. It's starting with random data and applying some rules to work towards, and very quickly arrive at, the right answer.

For a smoother animation, the picture in this video isn't the best picture from each generation, but an average of all of them. It shows that a whole population becomes very fit in a short time, rather than just a few outstanding individuals getting very fit and then passing on their genes as the terms 'survival of the fittest' or 'natural selection' imply.


* We tend to associate genetic mutation with disease, but it's an important part of this process. 
** My title is a response to the 'cosmic watchmaker' argument. It's a terrible analogy for many reasons, not least of which is that a timepiece is obviously a manufactured tool (like a flint axe) and not a living, reproducing being. But I'm really not interested in the religious argument, only in the uses for these amazing, almost magical techniques.