Tuesday 31 December 2019

Finding mixed / insecure website content using Scrutiny

Some browsers have been warning for a while now when a page is insecure. I read recently that Google Chrome will start blocking HTTP resources in HTTPS pages.

If you've not yet migrated your website to https:// then you're probably thinking about doing it now.

Once the certificate is installed (which I won't go into here), you must weed out links to your http:// pages, and pages that have 'mixed' or 'insecure' content, i.e. references to images, css, js and other files which are http://.

Scrutiny makes it easy to find these.
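
If you're curious what counts as 'mixed content', here's a rough sketch in Python of the kind of thing a checker looks for (an illustration only, not how Scrutiny works internally; example.com is a placeholder):

    import re
    import urllib.request

    def find_insecure_refs(page_url):
        # Fetch the page source (assumes the page needs no authentication)
        with urllib.request.urlopen(page_url) as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            html = resp.read().decode(charset, errors="replace")
        # Links to http:// pages
        links = re.findall(r'<a[^>]+href=["\'](http://[^"\']+)["\']', html, re.I)
        # Resources (images, scripts, stylesheets) loaded over http:// -- these
        # are what triggers the browser's mixed-content warning
        resources = re.findall(
            r'<(?:img|script|link)[^>]+(?:src|href)=["\'](http://[^"\']+)["\']', html, re.I)
        return links, resources

    links, resources = find_insecure_refs("https://example.com/")
    print("http:// links:", links)
    print("http:// resources:", resources)

Scrutiny does this kind of checking (and follows the links) across the whole site in one scan.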

If you're not a Mac user or you'd simply like me to do it for you, I'm able to supply a mixed content report for a modest one-off price. It will list

  • pages with links to internal http: pages
  • pages which use resources (images, style sheets, etc) which are http
  • https:// pages which have a canonical which is http://
  • https:// urls which redirect to a http:// url


If you're interested in using Scrutiny to do this yourself, read on.

1. Find links to http pages and pages with insecure content.

First you have to give Scrutiny your https:// address as your starting url, and make sure that these two boxes are ticked in your site-specific settings,

and these two as well.

After running a scan, Scrutiny will offer to show you these issues.

You'll have to fix-and-rescan until there's nothing reported. (When you make certain fixes, that may reveal new pages to Scrutiny for testing).

2. Fix broken links and images

Once those are fixed, there may be some broken links and broken images to fix too (I was copying stuff onto a new server and trying to only copy what was needed. There are inevitably things that you miss...). Scrutiny will report these and make them easy to find.

3. Submit to Google.

Scrutiny can also generate the xml sitemap for you, listing your new pages (and images and pdf files too if you want).
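
The XML sitemap format that Google expects is the standard sitemaps.org one. For reference, a minimal sketch in Python of the essential structure (placeholder URLs; this isn't Scrutiny's own code):

    import xml.etree.ElementTree as ET

    # A bare-bones sitemaps.org urlset with a couple of placeholder pages
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in ["https://example.com/", "https://example.com/about.html"]:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page
    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)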

Apparently Google treats the https:// version of your site as a separate 'property' in its Search Console (was Google Webmaster Tools). So you'll have to add the https:// site as a new property and upload the new sitemap.

4. Redirect

As part of the migration process, Google recommends that you then "Redirect your users and search engines to the HTTPS page or resource with server-side 301 HTTP redirects"  (full article here)
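
How you implement those 301s depends on your server, but once they're in place they're easy to spot-check. A quick sketch in Python (example.com is a placeholder) that shows the server's first answer for an http:// url, which should be a 301 pointing at the https:// equivalent:

    import http.client
    from urllib.parse import urlparse

    def check_redirect(http_url):
        parts = urlparse(http_url)
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        conn.request("GET", parts.path or "/")
        resp = conn.getresponse()
        # Hoping for: 301 https://example.com/...
        print(http_url, "->", resp.status, resp.getheader("Location"))
        conn.close()

    check_redirect("http://example.com/")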





Wednesday 11 December 2019

Using Webscraper to extract data from paginated search result pages, another real case study


First we need to give WebScraper our starting URL. This may be as simple as performing the search in a browser and grabbing the url of the first page of results. In this case things are even easier, because the landing page on the website has the first pageful of results. So that's our starting URL.

After that, we want WebScraper to 'crawl' the pagination (i.e. next/previous, or page 1, 2, 3 etc) and ignore any of the other links to other parts of the site. In this case, the pages have a nice clean querystring which only includes the page number:
So as far as the crawl goes, we can tell WebScraper to ignore any url which doesn't contain "?page=". We're in the Advanced section of the Scan tab for these next few steps.
In some cases, the pages we want the data from are a further click from the search results pages. Let's call those 'detail pages'. It's possible to tell WS to stick to the scan rule(s) we've just set up, but to also visit and only scrape data from detail pages. This is done by setting up rules on the 'Output filter' tab. We don't need to do that here but this article goes into more detail about that.

In our case we're still going to use the output filter to make sure that we only scrape data from pages that we're interested in. We're starting within the 'directory' /names, so the first part, "url does contain /names", shouldn't be necessary*. There's a silly quirk here: the first page won't stay at ?page=1, it redirects back to /names, so we can't use /names?page= here if we're going to capture the data from that first page. I notice that there are detail pages in the form /names/somename, so in order to filter those from the results, I've added a second rule here, "URL does not contain names/"
If we're lucky, the site will be responsive, won't clam up on us after a limited number of requests, and our activities will go unnoticed. However, it's always a good policy to put in a little throttling (the lower the number, the longer the scan will take).
Don't worry too much about the number of threads here; the above setting takes that into account, and it's beneficial to use multiple threads even when throttling right down.

Finally, we set up our columns. If you don't see the Output file columns tab, click Complex setup (below the Go button). By default, you'll see two columns set up already, for page URL and page title. Page URL is often useful later, page title not so much. You can delete one or both of those if you don't want them in the output file.

In this example, we want to extract the price, which appears (if we 'view source') as
data-business-name-price="2795". This is very easy to turn into a regex: we just replace the part we want to collect with (.+?)
I had a little confusion over whether this appeared on the page with single-quotes or double-quotes, and so to cover the possibility that the site isn't consistent with these, I also replaced the quotes in the expression with ["'] (square brackets containing a double and single quote).
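
To sanity-check an expression like that outside WebScraper, you can try it in Python; the attribute name is the one from the example above, and the snippet of HTML is made up:

    import re

    html = '<li data-business-name-price="2795"></li> <li data-business-name-price=\'3150\'></li>'
    # ["'] matches either quote style; (.+?) captures the price non-greedily
    prices = re.findall(r'data-business-name-price=["\'](.+?)["\']', html)
    print(prices)   # ['2795', '3150']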

(Setting up the other columns for this job was virtually identical to the above step).

If necessary, WebScraper has a Helper tool to help us create and test that Regex. (View>Helper for classes/Regex.)

The Test button on the Output columns tab will run a limited scan, which should take a few seconds and put a few rows in the Results tab. We can confirm that the data we want is being collected.

Note that, by default, Webscraper outputs to csv and puts the data from one page on one row. That's fine if we want one piece of data from each page (or extract a number of different pieces into different columns) but here we have multiple results on each page.

Choosing json as an output format gets around this because the format allows for arrays within arrays. If we need csv as the output then WS will initially put the data in a single cell, separated with a separator character (pipe by default) and then has an option on the Post-process tab to split those rows and put each item of data on a separate row. This is done when exporting, so won't be previewed on the Results tab.
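
If you ever need to do that splitting step yourself on an exported file, it only takes a few lines. A rough sketch in Python, assuming the default pipe separator and a two-column csv (your own column layout may differ):

    import csv

    # Each input row is one crawled page; the second column holds several values
    # joined with "|". Write one output row per value instead.
    with open("results.csv", newline="") as src, open("results_split.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        for page_url, joined in csv.reader(src):
            for value in joined.split("|"):
                writer.writerow([page_url, value])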

That's the setup. Go runs the scan, and Export exports the data.
*remembering to answer 'Directory' when asked whether the starting url is a page or directory



Sunday 8 December 2019

Manual for Webscraper for Mac

Webscraper is becoming one of our more popular apps and it's growing in terms of its features and options too.
When I use it to do a job for myself or for someone else, even I have trouble remembering the rules for certain features.

So I've written a manual in Plain English which should help everyone whether they are a first-time user or just need to check a detail.

https://peacockmedia.software/mac/webscraper/manual.html

There are links on the app's home page and support page. I'll put a link under the app's Help menu too.


Monday 25 November 2019

Plain Text Stickies

How often do you want to retain the styles when you copy text from one place to another? Almost never? Me neither. It frustrates me that the simple copy and paste, which has been with us for a very long time, includes styles by default*.

But that's not why we're here. This new app isn't some kind of clipboard manager or system extension that gives you system-wide plain-text pasting.

I keep a lot of notes. My memory was never good and it's worse now that I'm old. Stickies is still with us and has seemingly always been in Mac OS (actually since system 7 - that's 1991's OS 7, not OSX 10.7) and I have always loved it. It's almost perfect. You can have as many as you like showing, choose different colours for different topics and they just show up as they were when you quit and restart the system. There's no need to get bogged down in saving them as files, just keep them open as long as you need them and then close them when done.

The newer Notes syncs your notes across Apple devices but doesn't allow you to scatter various-coloured notes across your desktop.

(Image caption: "This is *NOT* how I like my notes to end up looking")
In both of these cases, and with the myriad third-party 'sticky note' apps**, rich text is the default and there is no option for a note to be plain text. Some even let you choose the type of note (list, rtf, image, etc) but without 'plain text' as an option.

Plain text doesn't have to mean that your notes are all in a boring default monospace or sans-serif font. You can choose a font, font size and font colour for each note, while the text itself remains plain text.

The closest match I've found for my strange plain-text craving is good old TextEdit, which allows you to set a document to be plain text (and even set the default for new documents to be plain text). For some time I have kept a number of these plain-text TextEdit documents open, resized and positioned where I like. If only it would allow you to set the background colour differently for each window (it does) and then remember that background-colour setting when you quit and re-start TextEdit (it doesn't)  then that would almost do.

Am I alone in this strange obsession?

Time will tell.

The new app is available as a free download. See here.

Update 12 Dec 2019: A newer version which adds some features including optional iCloud syncing (keeps notes in sync across Macs) is in beta and also available for free download.



*You have to take further action to get rid of that styling, such as using 'paste as plain text'. To make matters worse, 'paste as plain text' isn't a standard OS feature; it may have a different keyboard shortcut depending on the app you're pasting into, and there may not be a shortcut, or a 'paste as plain text' option, at all.

** Forgive me if you have written a third-party app which does do sticky notes with plain text by default. I didn't find it, or if I did, there were other reasons why I didn't continue using it.

Monday 4 November 2019

Heads-up: new project - Mac Deals

This is a short 'heads-up' post about a new side project I've started called Mac Deals.

(Or possibly Mac Software Deals or Mac App Deals; that's not fixed yet.)

I'll be maintaining a database of Mac apps (within certain criteria) which are currently on offer.

This will include PeacockMedia apps from time to time, but will mainly feature those from other developers.

I've opened a mailing list especially for this and will mail the list out regularly.

It currently resides here: https://peacockmedia.software/mac-deals/deals.py

Thursday 19 September 2019

New feature for Website Watchman and spin-off website archiving app

I wouldn't be without Website Watchman scanning the PeacockMedia website on schedule every day. As with Time Machine, I can dial up any day in the past year and browse the site as it appeared on that day. And export all files associated with any page if I need to.

It also optionally alerts you to changes to your own or anyone else's web page(s).

It took a lot of development time last year. I have (literally) a couple of reports of websites that it has trouble with but on the whole it works well and I think the tricks it does are useful.

Since it went on sale early this year, it has sold. But not in the numbers I'd hoped. Maybe it's just one of those apps that people don't know that they need until they use it.

Here's the interesting part. Of the support requests I've had, more have been about one question than any other: frustrated emails asking how to make it export the entire site that they've just scanned. What those people are after is an app which does a 'one-shot' scan of a website and then saves all the files locally. It's a reasonable question, because WW's web page talks about archiving a website.

My stock answer is that WW is designed to do two specific things, neither of which is running a single scan and exporting the entire thing. Like Time Machine, it does hold an archive, which it builds over time; that archive is designed to be browsed within WW, and any files that you need from a given date are recoverable.

There is of course quite a well-known and long-standing app which does suck and save an entire site. I've had conversations with my users about why they're not using that.

So knowing about those requirements, and owning software which already nearly does that task, the time seems right to go ahead.

I've mused for a while about whether to make an enhancement to Website Watchman so that it can export the files for an entire site (remember that it's designed to make multiple scans of the same website and build an archive with a time dimension, so this option must involve selecting a date if more than one scan has been made), or whether what people are after is a very simple 'one-trick' app which lets you type a starting URL, press Go, choose a save location, job done.

I've decided to do both. The new simple app will obviously be much cheaper and do one thing. Website Watchman will have the new full-site export added to its feature list. I'm using WebArch for the new app as a working title which may stick.

So where are these new things?

It's been more than a week since writing the news above. Even though we already had toes in this water with the archive functionality (such as it is) in Integrity Plus and Pro and Scrutiny, it turns out that it's no easy thing to make a local copy of a website browsable. In some cases it's easy and the site just works. But each site we tested was like opening a new Pandora's box, with new things that we hadn't taken into account. It's been an intense time but I'm really happy with the stage it's at now. I'd like to hear from users in cases where the local copy isn't browsable or problems are seen.

The new functionality is in Website Watchman from version 2.5.0. There's a button above the archive browser, and a menu item with keyboard shortcut under the File menu.

The new simple app is WebArch and is here for free download.

Friday 19 July 2019

Migrating to a secure (https://) website using Scrutiny 9

There is a more recent and updated version of this article here.

Yesterday I moved another website to https:// and thought I'd take the opportunity to make an updated version of this article. Scrutiny 9 has just been launched.

Google have long been pushing the move to https. Browsers now display an "insecure" message if your site isn't served over https://.

Once the certificate is installed (which I won't go into here), you must weed out links to your http:// pages, and pages that have 'mixed' or 'insecure' content, i.e. references to images, css, js and other files which are http://.

Scrutiny makes it easy to find these.

1. Find links to http pages and pages with insecure content.

First, make sure that you're giving your https:// address as your starting url, and that these two boxes are ticked in your settings,

and these boxes ticked in your Preferences.

After running a scan, Scrutiny will offer to show you these issues. If you started at an https:// url, and you had the above boxes checked, then you'll automatically see this box (if there are any issues).
You'll have to fix-and-rescan until there's nothing reported. (When you make certain fixes, that may reveal new pages to Scrutiny for testing).

2. Fix broken links and images

Once those are fixed, there may be some broken links and broken images to fix too (I was copying stuff onto a new server and chose to only copy what was needed. There are inevitably things that you miss...) Scrutiny will report these and make them easy to find.

3. Submit to Google.

Scrutiny can also generate the xml sitemap for you, listing your new pages (and images and pdf files too if you want).

Apparently Google treats the https:// version of your site as a separate 'property' in its Search Console (was Google Webmaster Tools). So you'll have to add the https:// site as a new property and upload the new sitemap.

[update 15 Jul] I uploaded my sitemap on Jul 13; it was processed on Jul 14.

4. Redirect

As part of the migration process, Google recommends that you then "Redirect your users and search engines to the HTTPS page or resource with server-side 301 HTTP redirects"  (full article here)





Sunday 7 July 2019

Press Release - Integrity Pro v9 released

Integrity Pro version 9 is now fully released. It is a free update for existing licence holders.

The major new features are as follows:
  • Improved Link and Page inspectors. New tabs on the link inspector show all of a url's redirects and any warnings that were logged during the scan.

  • Warnings. A variety of things may now be logged during the scan. For example, a redirect chain or certain problems discovered with the html. If there are any such issues, they'll be highlighted in orange in the links views, and the details will be listed on the new Warnings tab of the Link Inspector.
  • Rechecking. This is an important part of your workflow. Check, fix, re-check. You may have 'fixed' a link by removing it from a page, or by editing the target url. In these cases, simply re-checking the url that Integrity reported as bad will not help. It's necessary to re-check the page that the link appeared on. Now you can ask Integrity to recheck a url, or the page that the url appeared on. And in either case, you can select multiple items before choosing the re-check command.
  • Internal changes. There are some important changes to the internal flow which will eliminate certain false positives.


More general information about Integrity Pro is here:
https://peacockmedia.software/mac/integrity-pro/

Friday 5 July 2019

Two Mac bundles: Web Maestro Bundle and Web Virtuoso Bundle

I recently answered a question about the overlap with some of our apps. The customer wanted to know which apps he needed in order to possess all of the functionality.

For example, Website Watchman goes much further with its archiving functionality than Integrity and Scrutiny, but Webscraper entirely contains the crawling and markdown conversion of HTML2MD.

The answer was that he'd need three apps. It was clear that there should be a bundle option. So here it is.

There are two bundles. One contains Integrity Pro, which crawls a website checking for broken links, SEO issues and spelling, and generates an XML sitemap. The alternative bundle contains Scrutiny, which has many advanced features over Integrity Pro, such as scheduling and js rendering.

These are the bundles.

Web Maestro Bundle:

Scrutiny: Link check, SEO checks, Spelling, Searching, Advanced features
Website Watchman: Monitor, Archive. Time Machine for your website
Webscraper: Extract and Convert data or entire content. Extract content as html, markdown or plain text. Extract data from spans, divs etc using classes or ids. Or apply a Regex to the pages.


Web Virtuoso Bundle:

Integrity Pro: Link check, SEO checks, Spelling
Website Watchman: Monitor, Archive. Time Machine for your website
Webscraper: Extract and Convert data or entire content. Extract content as html, markdown or plain text. Extract data from spans, divs etc using classes or ids. Or apply a Regex to the pages.


The link for accessing these bundles is:
https://peacockmedia.software/#bundles

If you're interested in an affiliate scheme which allows you to promote and earn from these bundles and the separate products, this is the sign-up form:
https://a.paddle.com/join/program/198


Wednesday 29 May 2019

Press Release - Scrutiny 9 Launched

Scrutiny version 9 is now fully released. It is a free update for v7 or v8 licence holders. There is a small upgrade fee for holders of a v5 / v6 licence. Details here: https://peacockmedia.software/mac/scrutiny/upgrade.html

The major new features are as follows:
  • Improved Link and Page inspectors. New tabs on the link inspector show all of a url's redirects and any warnings that were logged during the scan.

  • Warnings. A variety of things may now be logged during the scan. For example, a redirect chain or certain problems discovered with the html. If there are any such issues, they'll be highlighted in orange in the links views, and the details will be listed on the new Warnings tab of the Link Inspector.
  • Rechecking. This is an important part of your workflow. Check, fix, re-check. You may have 'fixed' a link by removing it from a page, or by editing the target url. In these cases, simply re-checking the url that Scrutiny reported as bad will not help. It's necessary to re-check the page that the link appeared on. Now you can ask Scrutiny to recheck a url, or the page that the url appeared on. And in either case, you can select multiple items before choosing the re-check command.
  • Internal changes. There are some important changes to the internal flow which will eliminate certain false positives.
  • Reporting. The summary of the 'full report' is customisable (in case you're checking a customer's site and want to add your own branding to the report). You now have more choice over which tables you include in csv format with that summary.





A full and detailed run-down of version 9's new features is here:
https://blog.peacockmedia.software/2019/05/scrutiny-version-9-preview-and-run-down.html

More general information about Scrutiny is here:
https://peacockmedia.software/mac/scrutiny/

Saturday 18 May 2019

Scrutiny version 9, a preview and run-down of new features

Like version 8, version 9 doesn't have any dramatic changes in the interface, so it'll be free for existing v7 or v8 licence holders. It'll remain a one-off purchase, but there may be a small increase in price for new customers or upgraders from v6 or earlier.

But that's not to say that there aren't some important changes going on, which I'll outline here.

All of this applies to Integrity too, although the release of Scrutiny 9 will come first.

Inspectors and warnings

The biggest UI change is in the link inspector. It puts the list of redirects (if there are any) on a tab rather than a sheet, so the information is more obvious.  There is also a new 'Warnings' tab.
Traditionally in Integrity and Scrutiny, a link coloured orange means a warning, and in the past this meant only one thing: a redirect (which some users don't want to see, and that's OK; there's an option to switch redirect warnings off).

Now the orange warning could mean one or more of a number of things. While scanning, the engine may encounter things which may not be showstoppers but which the user might be grateful to know about. There hasn't been a place for such information. In version 9, these things are displayed in the Warnings tab and the link appears orange in the table if there are any warnings (including redirects, unless you have that switched off.)

Examples of things you may be warned about include more than one canonical tag on the target page, and unterminated or improperly-terminated script tags or comments (i.e. <!-- with no -->, which could be interpreted as commenting out the rest of the page, though browsers usually seem to ignore the opening tag if there's no closing one). Redirect chains also appear in the warnings. The threshold for a chain can now be set in Preferences; redirect chains were previously visible in the SEO results with a hard-coded threshold of 3.
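
To make the first couple of those concrete, here's the kind of check involved, sketched in Python (an illustration only, not Scrutiny's own code):

    import re

    def warnings_for(html):
        warnings = []
        # More than one canonical tag on the page
        canonicals = re.findall(r'<link[^>]+rel=["\']canonical["\'][^>]*>', html, re.I)
        if len(canonicals) > 1:
            warnings.append("%d canonical tags found" % len(canonicals))
        # An opening comment with no closing --> could comment out the rest of the page
        if html.count("<!--") > html.count("-->"):
            warnings.append("unterminated comment")
        return warnings

    print(warnings_for('<link rel="canonical" href="https://a/">'
                       '<link rel="canonical" href="https://b/"><!-- oops'))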

The number of things Scrutiny can alert you about in the warnings tab will increase in the future.

Strictly speaking, there's an important distinction between link properties and page properties. A link has link text, a target url and other attributes such as rel-nofollow. A page is the file that your browser loads; it contains links to other pages.

A 'page inspector' has long been available in Scrutiny. It shows a large amount of information; meta data, headings, word count, the number of links on the page, the number of links *to* that page and more. A lot of this is of course visible in the SEO table.

Whether you see the link inspector or the page inspector depends on the context (for example, the SEO table is concerned with pages rather than links, so a double-click opens the page inspector.) But when viewing the properties of a link, you may want to see the page properties. (In the context of a link, you may think about the parent page or the target page, but the target may more usually come to mind). So it's now possible to see some basic information about the target on the 'target' tab of the link inspector and you can press a button to open the full page inspector.

[update 26 May] The page inspector has always had fields for the number of inbound links and outbound links. Now it has sortable tables showing the inbound and outbound links for the page being inspected:

Rechecking

This is an important function that Integrity / Scrutiny have never done very well.

You run a large scan, you fix some stuff, and want to re-check. But you only want to re-check the things that were wrong during the first scan, not spend hours running another full scan.

Re-checking functionality has previously been limited. The interface has been changed quite recently to make it clear that you'll "Recheck this url", i.e. the url that the app found during the first scan.

Half of the time, possibly most of the time, you'll have 'fixed' the link by editing its target, or removing it from the page entirely.

The only way to handle this is to re-check the page(s) that your fixed link appears on. This apparently simple-sounding thing is by far the most complex and difficult of the v9 changes.

Version 9 still has the simple 're-check this url' available from the link inspector and various context menus (and it can be used after a multiple selection). It also now has 'recheck the page this link appears on' (which can likewise be used with a multiple selection).

Reporting

This is another important area that has had less than its fair share of development in the past. Starting to offer services ourselves has prompted some improvements here.

Over time, simple functionality is superseded by better options. "On finish save bad links / SEO as csv" are no longer needed and have now gone because "On finish save report" does those things and more. This led to confusion when they all existed together, particularly for those users who like to switch everything on with abandon. Checking all the boxes would lead to the same csvs being saved multiple times and sometimes multiple save dialogs at the end of the scan.

The 'On finish' section of the settings now looks like this. If you want things saved automatically after a scheduled scan or manual 'scan with actions' then switch on 'Save report' and then choose exactly what you want included.

Reduced false positives

The remaining v9 changes are invisible. One of those is a fundamental change to the program flow. Previously link urls were 'unencoded' for storage / display and encoded for testing. In theory this should be fine, but I've seen some examples via the support desk where it's not fine. In one case a redirect was in place which redirected a version of the url containing a percent-encoding, but not the identical url without the percent-encoding. The character wasn't one that you'd usually encode. This unusual example shows that as a matter of principle, a crawler ought to store the url exactly as found on the page and use it exactly as found when making the http request. 'Unencoding' should only be done for display purposes.

When you read that, it'll seem obvious that a crawler should work that way. But it's the kind of decision you make early on in an app's development, and tend to work with as time goes on rather than spending a lot of time making fundamental changes to your app and risking breaking other things that work perfectly well.
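
A contrived illustration in Python of why the round trip isn't safe (the real support case involved different characters, but the principle is the same):

    from urllib.parse import quote, unquote

    # A url as found on the page, with an encoded slash inside one path segment
    found = "https://example.com/files/report%2F2019"
    decoded = unquote(found)              # https://example.com/files/report/2019
    reencoded = quote(decoded, safe=":/")
    print(found == reencoded)             # False -- the request would go to a different url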

Anyhow, that work is done. It'll affect very few people, but for those people it'll reduce those false positives (or 'true negatives', whichever way you want to look at a link reported bad that is actually fine).

[update 23 May 2019] I take it back; possibly the hardest thing was adding the sftp option (v9 will allow ftp / ftp with TLS (aka ftps) / sftp) for automatically ftp'ing the sitemap or checking for orphaned pages.

update 10 Jun: A signed and notarized build of 9.0.3 is available. Please feel free to run it if you're prepared to contact us with any problems you notice. Please wait a while if you're a production user, i.e. if you already use Scrutiny and rely on it.

update 19 Jun: v9 is now the main release, with 8.4.1 still available for download.

Thursday 25 April 2019

Apple's notarization service

A big change that's been happening quietly in MacOS is Apple's Notarization service.

Ever since the App Store opened and was the only place to obtain software for the iPhone ('jailbreaking' excepted), I've been waiting for the sun to set on being able to download and install Mac apps from the web. Which is the core of my business. (Mac App Store sales amount to a small proportion of my income. That balance is fine, because Apple take a whopping 1/3 of the selling price).

Notarization is a step in that direction, although it still leaves developers free to distribute outside the App Store. It means that Apple examine the app for malware. At this point they can't reject your app for any reason other than the malware check. They do specify 'hardened runtime', which is a tighter security constraint, but I've not found this to restrict functionality, as the Sandboxing requirement did when the App Store opened.

When the notarization service started last year, it was optional. Now Gatekeeper gives a friendlier message when an app is notarized, and it looks as if 10.15's Gatekeeper will refuse to install apps that haven't been notarized.

It's easy to feel threatened by this and imagine a future where Apple are vetting everything in the same way they do for the app store. For users that's a great thing, it guarantees them a certain standard of quality in any app they may be interested in. As a developer it feels like a constraint on my freedom to build and publish.

It genuinely seems geared towards reducing malware on the Mac. "This is a good thing" says John Martellaro in his column.

https://www.macobserver.com/columns-opinions/editorial/notarization-apple-greatly-reduce-malware-on-macs/?utm_source=macobserver&utm_medium=rss&utm_campaign=rss_everything

Wednesday 17 April 2019

New tool for analysing http response

Your web page looks great in your browser but a tool such as a link checker or SEO checker reports something unexpected.

It's an age-old story. I'm sure we've all been there a thousand times.

OK, maybe not, but it does happen. An http request can be very different depending on what client is sending it. Maybe the server gives a different response depending on the user-agent string or another http request header field. Maybe the page is a 'soft 404'.

What's needed is a tool that allows you to make a custom http request and view exactly what's coming back; the status, the response header fields and the content. And maybe to experiment with different request settings.
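
At its most basic, the idea looks something like this (a bare-bones Python sketch; the url and user-agent string are just examples):

    import http.client
    from urllib.parse import urlparse

    def inspect(url, user_agent="Mozilla/5.0 (Windows NT 10.0; Trident/7.0; rv:11.0) like Gecko"):
        parts = urlparse(url)
        Connection = http.client.HTTPSConnection if parts.scheme == "https" else http.client.HTTPConnection
        conn = Connection(parts.netloc, timeout=15)
        conn.request("GET", parts.path or "/", headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        print(resp.status, resp.reason)           # the status line
        for name, value in resp.getheaders():     # every response header field
            print("%s: %s" % (name, value))
        print("--- %d bytes of content ---" % len(resp.read()))
        conn.close()

    inspect("https://example.com/")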

I've been meaning to build such a tool for a long time. Incorporated into Scrutiny, or maybe a standalone app.

I've also been thinking of making some online tools. The ultimate plan is an online Integrity. The request / response tool is the ideal candidate for an online tool.

For a name, one of my unused domains seemed strangely appropriate. A few minutes with the Wacom and it has a logo.



It's early days and there are more features to add. But feel free to try it. If you see any problems, please let me know.

http://purpleyes.com

Monday 1 April 2019

Scraping links from Google search results

This tutorial assumes that you want to crawl the first few search results pages and collect the links from the results.

Remember that:

1. The search results that you get from WebScraper will be slightly different from the ones that you see in your browser. Google's search results are personalised. If someone else in a different place does the same search, they will get a different set of results from you, based on their browsing history. WebScraper doesn't use cookies and so it doesn't have a personality or browsing history.

2. Google limits the number of pages that you can see within a certain time. If you use this once with these settings, that'll be fine, but if you use it several times, it'll eventually fail. I believe Google allows each person to make 100 requests per hour before showing a CAPTCHA but I'm not sure about that number.  If you run this example a few times, it may stop working. If this happens, press the 'log in' button at the bottom of the scan tab. You should see the CAPTCHA. If you complete it you should be able to continue.  If this is a problem, adapt this tutorial for another search engine which doesn't have this limit.

We're using WebScraper for Mac which has some limits in the unregistered version.

1. The crawl
Your starting url looks like this: http://www.google.com/search?q=ten+best+cakes. Set Crawl maximum to '1 click from home' because in this example we can reach the first ten pages of search results within one click of the starting url (see Pagination below).
One important point not ringed in the screenshot: choose 'IE11 / Windows' for your user-agent string. Google serves different code depending on the browser. The regex below was written with the user-agent set to IE11/Windows.

2. Pagination 
We want to follow the links at the bottom of the page for the next page of search results, and nothing else. (These settings are about the crawl, not about collecting data). So we set up a rule that says "ignore urls that don't contain &start= " (see above)

3. Output file
Add a column, choose Regex. The expression is <div class="r"><a href="(.*?)"
You can bin the other two default columns if you want to. (I didn't bother and you'll see them in the final results at the bottom of this article.)
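
Outside WebScraper, the same expression behaves like this (a sketch only; Google's markup changes often, so the class name may not survive, and the user-agent string here is just an IE11-style example):

    import re
    import urllib.request

    req = urllib.request.Request(
        "http://www.google.com/search?q=ten+best+cakes",
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Trident/7.0; rv:11.0) like Gecko"})
    html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")
    links = re.findall(r'<div class="r"><a href="(.*?)"', html)
    print(links)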

4. Separating our results
WebScraper is designed to crawl a site and extract a piece of data from each page. Each new row in the output file represents a page from the crawl. Here we want to collect multiple items of data from each page, so the scraped urls from each search results page will appear in a single cell (separated by a special character). We therefore ask WebScraper to split these onto separate rows, which it does when it exports.
5. Run
Press the Go button, and you'll see the results fill up in the Results tab. As mentioned, each row is one page of the crawl; you'll need to export to split the results onto separate rows.
Here's the output in Finder/quicklook. The csv should open in your favourite spreadsheet app.

Wednesday 27 March 2019

C64 / 6502 IDE for Mac

This post seems to fit equally well here on the PM blog as it does on the separate blog for my hobby project. To avoid duplication I'll post it there and link to it from here.

https://newstuffforoldstuff.blogspot.com/2019/03/homebrew-game-for-c64-part-5-c64-6502.html


Version 9 of Integrity and Scrutiny underway

As with version 8, the next major release is likely to look very similar on the surface.

The biggest change will be more information of the 'warning' kind.

So far it's been easy to isolate or export 'bad links'. These tend to be 4xx or 5xx server response codes, so you can already see at a glance what's wrong. You may have seen links marked with orange, which usually denotes a redirect, and you can easily filter those and export them if you want.
But there are many other warnings that Integrity and Scrutiny could report.

At this point, this won't mean a full validation of the html (that's another story) but there are certain things that the crawling engine encounters as it parses the page for links. One example is the presence of two canonical tags (it happens more often than you'd think). If they contain different urls, it's unclear what Google would do. There are a number of situations like this that the engine handles, and which the user might like to know about.

There are other types of problem, such as an old perennial: one or more "../" ('up a directory') at the start of a relative url which technically takes the url above the server's http root. (Some users take the line "if it works in my browser, then it's fine and your software is incorrectly reporting an error", and for that reason there's a preference to make the app as tolerant as browsers tend to be with these links.)
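
For what it's worth, most url-resolving libraries are about as forgiving as browsers here; Python, for example, simply drops the surplus '../' at the root (a quick illustration, nothing to do with Integrity's internals):

    from urllib.parse import urljoin

    # Three '..' segments from a page only two directories deep
    print(urljoin("https://example.com/a/b/page.html", "../../../other.html"))
    # -> https://example.com/other.html  (the extra '..' is silently ignored)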

In short, there are various warnings that Integrity and Scrutiny could list. Version 9 will make use of the orange highlighting. There will be tabs within the link inspector so that it can display more information, and the warnings relating to that link (if any) will be one of these tabs.

Reporting / exporting is important to users, so there will be another main table (with an export option) alongside the link results, displaying a list of each warning, by link / page.

This will be the main focus but there will be other features. Better re-checking is on the list. This is early days, so if there's anything else that you'd like to see, please do tell!

Saturday 23 March 2019

Press release - Slight change of name and full launch for Website Watchman



Press release
Mon 25 Mar 2019


begins---------------

Watchman from PeacockMedia is now fully released and it's now called Website Watchman. The original name clashed with another project, but Website Watchman better sums up the app.

It'll be useful to anyone with a Mac and a website. It does a very impressive trick: as it scans a website it checks all of the files that make up a page (html, js, css, images and linked documents) and archives any changes, so the web archive that it builds is four-dimensional. You can browse the pages, and then browse the versions of each page back through time. You view a 'living' version of the page, not just a snapshot.

As the name suggests, you can watch a single page, part of a website or a whole website, or even part of a page, using various filters including a regex, and be alerted to changes.

Thank you for your attention and any exposure that you can give Website Watchman.

https://peacockmedia.software/mac/watchman

ends---------------

Thursday 21 March 2019

Hidden gems: Scrutiny's site search

Yes, you can use Google to search a particular site. But you can't search the entire source, or show pages which *don't* contain something. For example, you may want to search your website for pages that don't have your Google Analytics code.

The usual options that you've set up in your site config apply to the crawl: Blacklisting / whitelisting of pages, link and level limits and lots of other options, including authentication.

Here's where you enter your search term (or search terms; you can search for multiple terms at once, and the results will have a column showing which term(s) were found on the page).

There's a 'reverse' button so that you can see pages that *don't* contain your term. Your search term can be a regular expression (Regex) if you want to match a pattern. You can search the source or the visible text.

Method

1. Add your site to Scrutiny if it's not there already and check the basic settings. See https://peacockmedia.software/mac/scrutiny/manual/v9/en.lproj/getting-started.html

2. Instead of 'Scan now', open 'More tasks' and choose 'Search Site'.
3. Enter your search term (or multiple search terms) and review the other options in the dialog.





Note

1. If you're searching the entire source, make sure that your search term matches the way it appears in the source. Today I was confused when I searched for a sentence on Integrity's French page, "Le vérificateur de liens pour vos sites internet". The accented character appears in the source (correctly encoded) as &eacute;. If you search the body text rather than the source, any such entities will be 'unencoded' before checking.
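
The entity point is easy to check in isolation (a quick illustration in Python):

    from html import unescape

    source = "Le v&eacute;rificateur de liens pour vos sites internet"
    print(unescape(source))   # Le vérificateur de liens pour vos sites internet
    # So a search of the source needs "v&eacute;rificateur",
    # while a search of the visible text needs "vérificateur".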

2. There's a global setting for what should be included / excluded from the visible text of a page (Preferences > General > Content). You may want to ignore the contents of navs and footers, for example. One of these options is 'only look at the contents of paragraph and heading tags'. If this is switched on, then quite a lot is excluded; visible text may not necessarily be in <p> tags.

The search dialog above now has a friendly warning which informs you if you're searching body text and if you've got some of these exclusions switched on.

The download for Scrutiny is here and includes a free 30-day trial.

Wednesday 13 March 2019

Raspberry Pi Zero W - baby steps

I don't know why I've waited so long to do this. I love messing around with home automation, and here's a fully-functional computer with wireless and 20-odd GPIO (input/output) pins.

I've also been meaning to begin using Linux (technically I already do; I use MacOS's Terminal a lot, and have set up and run a couple of Ubuntu servers with Python and MySQL). The Pi has a desktop, USBs for keyboard and mouse, and HDMI output. This one has a 1GHz processor, 512MB of RAM and, for a HD, whatever free space is on the card that you put in. All of this on a board which is half the size of a credit card and costs less than a tenner (or less than a fiver for the non-wireless version).

When we finally invent time travel, or at least find a way to communicate across time (as per Gibson's The Peripheral, which I heartily recommend), my teenage self will be astounded by this information. I remember the day when I first heard the word 'megabyte'. It wasn't far off the day that I felt very powerful after plugging an 8k expansion into my computer.

Anyhow, back to the plot. What I've learned so far is that the 'less than a tenner' Raspberry Pi Zero W is 'bare bones'. I've bought a few bits and pieces that have cost much more than the computer(!), including a header for the GPIO pins, a breadboard & components kit, a pre-loaded micro SD card (effectively the HD and OS), and a mini HDMI to full-size HDMI adaptor.

Monday 11 March 2019

Website archiving - Watchman's commercial release

[NB since version 2.1.0, we have had to make a slight change to the name, its full title is now Website Watchman.]

It has been a (deliberately) long road but Watchman for Mac now has its first commercial release.
This product does such a cool job that I've long believed that it could be as important to us as Integrity and Scrutiny. So I've been afraid to rush things. Version zero was officially beta, and a useful time for discovering shortcomings and improving the functionality. Version one was free. Downloads were healthy and feedback slim, which I take as a good sign. Finally it's now released with a trial period and reasonable introductory price tag. Users of version one are welcome to continue to use it, but it obviously won't get updates.

So what does it do? In a few words. "Monitor and archive a website".

There are apps that monitor a url and alert you to changes. There are apps that scan an entire website and archive it.

Watchman can scan a single page, part of a website or a whole website. It can do this on schedule - hourly, daily, weekly, monthly. It can alert you to changes. It builds a web archive which you can view (using Watchman itself or the free 'WebArchive Viewer' which is included in the dmg). You can browse the urls that it has scanned, and for each, view how that page looked on a particular day.

We're not talking about screenshots but a 'living' copy of the page. Watchman looks for and archives changes in every file: html, css, js and other linked files such as pdfs. You can obviously export that page as a screenshot, or as a collection of the files making up that page as they stood on that date.

A 'must have' for every website owner?

Try Watchman for free / buy at the introductory price.

Friday 8 March 2019

SID tune for C64 homebrew game - part 1

My enthusiasm for this project has surprised even me, and to avoid this blog filling up with my ramblings about making this 8-bit game, and to keep all of those posts in one suitable place, I've moved this post and the others to their own blog.

This post has moved to:

https://newstuffforoldstuff.blogspot.com/2019/03/homebrew-game-for-c64-part-1-sid-tune.html






Monday 4 March 2019

Website archiving utility, version 2

Watchman for Mac is a utility that can scan a single page or a whole site on schedule; it'll archive the page(s) and alert the user to any changes in the code or visible text.

As it builds its archive it's possible to browse the historical versions of the pages, as they stood on particular dates. It displays a 'living' version of the historical pages, with javascript and stylesheets also archived.

We've just made a version 2 beta available. It features a 'non-UI' or headless mode, which means that it can remain in the background and not interrupt the user when a scheduled scan starts. Its windows can still be accessed from a status bar menu.

https://peacockmedia.software/mac/watchman/

Version 1 is still available. It's free and will remain free. The new beta is also available and free at present, but version 2 will eventually require a licence.

Wednesday 27 February 2019

Vic20 Programmer's Reference Guide

Having become more and more interested in the computers that started it all for me, I've been poking (geddit?) around in my attic among my collection of 8-bit computers, running emulators and buying hardware (leads and adaptors) and starting to use some of these machines again.

I have quite a collection - there was a time when you could easily find them at car boot sales for a fiver, and I think that one of the Vic20s I have is the one I was given as a teenager.

I've always thought that I'd return to these computers, maybe in retirement, but it's happened a bit sooner.
There's something very exciting about these early home 8-bit machines. You're presented with a flashing cursor, and in order to do anything you need to type a command or two in BASIC (LOAD and RUN if you wanted to play a game). I believe this fact is why so many of my generation became so interested in software development, and it's something that was lost when a/ many turned to consoles because they only wanted to play games and b/ computers gained more graphical interfaces, separating the user from the workings.

Back to the point. I've begun writing a game for C64. I couldn't help myself. After much searching of the attic, I couldn't find a copy of the C64 Programmer's Reference Guide. I easily found it as a scanned pdf online, which is fine, but it's harder to use that than simply flicking through the pages of a book. (I will buy a copy, they come up on eBay. No doubt I will then find a copy I already owned.)

One of the things I really wanted was a reference guide for the instruction set. Although the C64 has a 6510 processor and the Vic20 a 6502, the two processors are identical but for a small difference. And I did find a Vic20 Prog Ref Guide, and the instruction set reference is identical.
It'll be handy when I make the Vic20 version of my new game.

It's very well-thumbed - I was as obsessed with programming as I am now. This is clearly my own original copy. It has some lined A4 with notes in my handwriting. It feels weird handling the book, a little bit like I've gone back 38 years in time, grabbed the book and brought it back with me.
(This page has my own instructions for using the monitor programme that I wrote in machine code, as a tool for writing more machine code. I used to write my programmes on paper and assemble them by hand before typing in the hex. I'm sure the monitor is on a tape somewhere in the attic.)

End note:
As part of my foray into this new-old world, I've discovered that there are some amazing people producing hardware and games (on cartridge/disc/tape) for these old computers. To support and encourage that, I've begun this project.

Friday 18 January 2019

Scraping Yelp for phone numbers of all plumbers in California (or whatever in wherever)

I've written similar tutorials to this one before, but I've made the screenshots and written the tutorial this morning to help someone and wanted to preserve the information here.

We're using the Webscraper app. The procedure below will work while you're in demo mode, but the number of results will be limited.

We enter our starting url. Perform your search on Yelp and then click through to the next page of results. Note how the url changes as you click through. In this case it goes:
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=0
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=20
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=40
etc.

(I added in &start=0 on the first one; that part isn't there when you first go to the results, but this avoids some duplication due to the default page and &start=0 being the same page.)

So our starting url should be:
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=0

In order to crawl through the results pages, we limit our crawl to urls that match the pattern we've observed above. We can do that by asking the crawl to ignore any urls that don't contain
?find_desc=Plumbers&find_loc=California&ns=1&start=

In this case we're going to additionally click the links in the results, so that we can scrape the information we want from the businesses' pages. This is done on the 'Output filter' tab. Check 'Filter output' and enter these rules:
URL contains /biz/
and URL contains ?osq=Plumbers
(The phone numbers and business names are right there on the results pages and we could grab them from there, but for this exercise we're clicking through to the business page to grab the info. It has advantages.)

Finally we need to set up the columns in our output file and specify what information we want to grab. On the business page, the name of the business is in the h1 tags, so we can simply select that. The phone number is helpfully in a div called 'biz-phone' so that's easy to set up too.
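
Roughly what those two columns amount to, expressed as Python regexes against a made-up fragment of markup (Yelp's real class names and layout can of course change, and the phone number here is fake):

    import re

    page = '''<h1 class="biz-page-title">Acme Plumbing</h1>
    <div class="biz-phone">(555) 010-0199</div>'''

    name = re.search(r"<h1[^>]*>(.*?)</h1>", page, re.S).group(1).strip()
    phone = re.search(r'class="biz-phone"[^>]*>(.*?)<', page, re.S).group(1).strip()
    print(name, "|", phone)   # Acme Plumbing | (555) 010-0199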

Then we run by pressing the Go button. In an unlicensed copy of the WebScraper app, you should see 10 results. Once licensed, the app should crawl through the pagination and collect all (in this case) 200+ results.

Limitations

I was able to get all of the results (matching those available while using the browser) for this particular category. For some others I noticed that Yelp didn't seem to want to serve more than 25 pages of results, even when the page said that there were more pages. Skipping straight to the 25th page and then clicking 'Next' resulted in a page with hints about searching.

This isn't the same as becoming blacklisted, which will happen when you have made too many requests in a given time. That is obvious when it happens, because you then can't access Yelp in your browser without changing your IP address. One measure to avoid this problem is ProxyCrawl, a service you can use by getting yourself an account (free initially), switching on 'Use ProxyCrawl' in WebScraper and entering your token in Preferences.