PeacockMedia: Spidering wikipedia

Friday, 5 June 2015

Spidering wikipedia

I've reached a milestone in my 'crawling the English Wikipedia project'. (I'm hoping to find out whether the 'six degrees' principal is true*.) Scrutiny has now managed a scan taking in 3 million links which includes 1.279 million pages in its sitemap results. This is the largest single scan I've ever seen any of my applications run.

My instance of Scrutiny must have been feeling very enlightened after parsing this eclectic raft of articles including Blue tit, Conway Twitty, Wolverine (character) [yes, there are a surprising number of other Wolverines!], Benjamin Anderson (adventurer) and Personal Jesus.

The most fascinating thing about this crawl is that out of the pages scanned here, the article with the most links (excluding a few unusual page types) is alcohol. It has over 6,000 hyperlinks on its page** This suggests that we have more to say about nature's gift of fermentation than about World War Two, which has two thirds the number of links.

*The uncertainty here is that if you imagine a node structure with each node linking to, say, 100 pages, then you can reach a million pages in three clicks. But those aren't a million unique pages. The number of previously-undiscovered pages diminishes with each page parsed

** this does include 'edit' links and citation anchor links. For future crawls I'll blacklist these for efficiency.

Friday, 5 June 2015

Spidering wikipedia

No comments:

Post a Comment