Wednesday, 11 December 2019

Using WebScraper to extract data from paginated search result pages, another real case study


First we need to give WebScraper our starting URL. This may be as simple as performing the search in a browser and grabbing the URL of the first page of results. In this case things are even easier, because the landing page on the website has the first pageful of results. So that's our starting URL.

After that, we want WebScraper to 'crawl' the pagination (i.e. next/previous, or pages 1, 2, 3, etc.) and ignore any of the other links to other parts of the site. In this case, the pages have a nice clean query string which only includes the page number:
So as far as the crawl goes, we can tell WebScraper to ignore any URL which doesn't contain "?page=". We're in the Advanced section of the Scan tab for these next few steps.
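In pseudocode terms, that crawl rule boils down to a simple substring test on each URL the crawler finds. Here's a minimal sketch of the idea (illustrative only, not WebScraper's own code; example.com is a placeholder):

    def should_crawl(url):
        # Follow a link only if its URL contains the pagination query string.
        return "?page=" in url

    should_crawl("https://example.com/?page=2")   # True
    should_crawl("https://example.com/contact")   # False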
In some cases, the pages we want the data from are a further click away from the search results pages. Let's call those 'detail pages'. It's possible to tell WebScraper to stick to the scan rule(s) we've just set up, but also to visit the detail pages and scrape data only from them. This is done by setting up rules on the 'Output filter' tab. We don't need to do that here, but this article goes into more detail about that.

In our case we're still going to use the output filter to make sure that we only scrape data from pages we're interested in. We're starting within the 'directory' /names, so the first rule, "URL does contain /names", shouldn't be necessary*. There's a silly quirk here: the first page doesn't stay at ?page=1, it redirects back to /names, so we can't use /names?page= here if we're going to capture the data from that first page. I notice that there are detail pages in the form /names/somename, so in order to filter those out of the results, I've added a second rule, "URL does not contain names/".
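The output filter described above amounts to two more substring tests, something like this (again just an illustration of the logic, with /names taken from the example site):

    def should_scrape(url):
        # Keep the listing pages (/names and /names?page=N) but skip the
        # detail pages, which take the form /names/somename.
        return "/names" in url and "names/" not in url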
If we're lucky, the site will be responsive, won't clam up on us after a limited number of requests, and our activities will go unnoticed. However, it's always good policy to put in a little throttling (the lower the number, the longer the scan will take).
Don't worry too much about the number of threads here; the throttle setting above takes that into account, and it's beneficial to use multiple threads even when throttling right down.
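For the curious, that kind of throttling amounts to spacing requests out globally, regardless of how many worker threads are fetching. Here's a minimal sketch of the idea (an illustration only, not WebScraper's implementation; the one-request-per-second figure is an assumption):

    import threading
    import time
    import urllib.request

    MIN_INTERVAL = 1.0            # seconds between requests, across all threads
    _lock = threading.Lock()
    _last_request = [0.0]

    def polite_fetch(url):
        # All threads share the lock, so the delay applies globally.
        with _lock:
            wait = MIN_INTERVAL - (time.time() - _last_request[0])
            if wait > 0:
                time.sleep(wait)
            _last_request[0] = time.time()
        with urllib.request.urlopen(url) as resp:
            return resp.read()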

Finally, we set up our columns. If you don't see the Output file columns tab, click Complex setup (below the Go button). By default, you'll see two columns set up already, for page URL and page title. Page URL is often useful later, page title not so much. You can delete one or both of those if you don't want them in the output file.

In this example, we want to extract the price, which appears (if we 'view source') as
data-business-name-price="2795". This is very easy to turn into a regex: we just replace the part we want to collect with (.+?)
I had a little confusion over whether this appeared on the page with single quotes or double quotes, so to cover the possibility that the site isn't consistent with these, I also replaced the quotes in the expression with ["'] (square brackets containing a double and a single quote).
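As a quick sanity check outside the app, here's how that expression behaves in Python (an illustration only; the sample HTML and price are made up):

    import re

    # ["'] matches either a double or a single quote.
    price_pattern = re.compile(r'''data-business-name-price=["'](.+?)["']''')

    sample = '<a class="result" data-business-name-price="2795">Some Name</a>'
    match = price_pattern.search(sample)
    if match:
        print(match.group(1))   # prints 2795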

(Setting up the other columns for this job was virtually identical to the above step).

If necessary, WebScraper has a Helper tool to help us create and test that regex (View > Helper for classes/Regex).

The Test button on the Output columns tab will run a limited scan, which should take a few seconds and put a few rows in the Results tab. We can confirm that the data we want is being collected.

Note that, by default, WebScraper outputs to CSV and puts the data from one page on one row. That's fine if we want one piece of data from each page (or to extract a number of different pieces into different columns), but here we have multiple results on each page.

Choosing JSON as the output format gets around this, because that format allows for arrays within arrays. If we need CSV as the output, WebScraper will initially put the data in a single cell, separated by a separator character (pipe by default), and there's an option on the Post-process tab to split those rows and put each item of data on a separate row. This is done when exporting, so it won't be previewed on the Results tab.
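Here's a rough Python equivalent of that post-process split, to make the behaviour concrete (the column layout and pipe separator are assumptions for illustration; this isn't WebScraper's own code):

    import csv

    # One scanned page gave one row, with several prices packed into a
    # single pipe-separated cell.
    rows = [["/names?page=2", "2795|1995|2495"]]

    with open("split.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for page_url, prices in rows:
            for price in prices.split("|"):
                writer.writerow([page_url, price])   # one price per output row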

That's the setup. Go runs the scan, and Export exports the data.
*remembering to answer 'Directory' when asked whether the starting URL is a page or a directory.



Sunday, 8 December 2019

Manual for WebScraper for Mac

WebScraper is becoming one of our more popular apps, and it's growing in terms of its features and options too.
When I use it to do a job for myself or for someone else, even I have trouble remembering the rules for certain features.

So I've written a manual in Plain English which should help everyone, whether they're a first-time user or just need to check a detail.

https://peacockmedia.software/mac/webscraper/manual.html

There are links on the app's home page and support page. I'll put a link under the app's Help menu too.


Monday, 25 November 2019

Plain Text Stickies

How often do you want to retain the styles when you copy text from one place to another? Almost never? Me neither. It frustrates me that the simple copy and paste, which has been with us for a very long time, includes styles by default*.

But that's not why we're here. This new app isn't some kind of clipboard manager or system extension that gives you system-wide plain-text pasting.

I keep a lot of notes. My memory was never good and it's worse now that I'm old. Stickies is still with us and has seemingly always been in Mac OS (actually since System 7 - that's 1991's System 7, not OS X 10.7) and I have always loved it. It's almost perfect. You can have as many notes as you like showing, choose different colours for different topics, and they just show up as they were when you quit and restart the system. There's no need to get bogged down in saving them as files; just keep them open as long as you need them and close them when done.

The newer Notes syncs your notes across Apple devices but doesn't allow you to scatter various-coloured notes across your desktop.

(Image: This is *NOT* how I like my notes to end up looking.)
In both of these cases, and with the myriad third-party 'sticky note' apps**, rich text is the default and there is no option for a note to be plain text. Some even let you choose the type of note (list, rtf, image, etc) but without 'plain text' as an option.

Plain text doesn't have to mean that your notes are all in a boring default monospace or sans-serif font. You can choose a font, font size and font colour for each note, while the text itself remains plain text.

The closest match I've found for my strange plain-text craving is good old TextEdit, which allows you to set a document to be plain text (and even set the default for new documents to be plain text). For some time I have kept a number of these plain-text TextEdit documents open, resized and positioned where I like. If only it would allow you to set the background colour differently for each window (it does) and then remember that background-colour setting when you quit and restart TextEdit (it doesn't), then that would almost do.

Am I alone in this strange obsession?

Time will tell.

The new app is available as a free download. See here.

Update 12 Dec 2019: A newer version which adds some features including optional iCloud syncing (keeps notes in sync across Macs) is in beta and also available for free download.



*You have to take further action to get rid of that styling, such as using 'paste as plain text'. To make matters worse, 'paste as plain text' isn't a standard OS feature; it may have a different keyboard shortcut depending on the app you're pasting into, and there may not be a shortcut for it, or a 'paste as plain text' option at all.

** Forgive me if you have written a third-party app which does do sticky notes with plain text by default. I didn't find it, or if I did, there were other reasons why I didn't continue using it.

Monday, 4 November 2019

Heads-up: new project - Mac Deals

This is a short 'heads-up' post about a new side project I've started called Mac Deals.

(Or possibly Mac Software Deals or Mac App Deals; that's not fixed yet).

I'll be maintaining a database of Mac apps (within certain criteria) which are currently on offer.

This will include PeacockMedia apps from time to time, but will mainly feature those from other developers.

I've opened a mailing list especially for this and will mail the list of deals out to it regularly.

It currently resides here: https://peacockmedia.software/mac-deals/deals.py

Thursday, 19 September 2019

New feature for Website Watchman and spin-off website archiving app

I wouldn't be without Website Watchman scanning the PeacockMedia website on schedule every day. As with Time Machine, I can dial up any day in the past year and browse the site as it appeared on that day, and export all the files associated with any page if I need to.

It also optionally alerts you to changes to your own or anyone else's web page(s).

It took a lot of development time last year. I have (literally) a couple of reports of websites that it has trouble with but on the whole it works well and I think the tricks it does are useful.

Since it went on sale early this year, it has sold. But not in the numbers I'd hoped. Maybe it's just one of those apps that people don't know that they need until they use it.

Here's the interesting part. Of the support requests I've had, more have been about one question than any other: frustrated emails asking how to make it export the entire site that they've just scanned. What those people are after is an app which does a 'one-shot' scan of a website and then saves all the files locally. It's a reasonable question, because WW's web page talks about archiving a website.

My stock answer is that WW is designed to do two specific things, neither of which is running a single scan and exporting the entire thing. Like Time Machine, it does hold an archive, which it builds over time; that archive is designed to be browsed within WW, and any files you need from a given date are recoverable.

There is of course quite a well-known and long-standing app which does suck down and save an entire site. I've had conversations with my users about why they're not using that.

So knowing about those requirements, and owning software which already nearly does that task, the time seems right to go ahead.

I've mused for a while about whether to make an enhancement to Website Watchman so that it can export the files for an entire site (remember that it's designed to make multiple scans of the same website and build an archive with a time dimension, so this option must involve selecting a date if more than one scan has been made), or whether what people are after is a very simple 'one-trick' app which lets you type a starting URL, press Go, choose a save location, and job done.

I've decided to do both. The new simple app will obviously be much cheaper and do just the one thing, and Website Watchman will have the new full-site export added to its feature list. I'm using 'WebArch' as a working title for the new app, which may stick.

So where are these new things?

It's been more than a week since I wrote the news above. Even though we already had toes in this water with the archive functionality (such as it is) in Integrity Plus, Integrity Pro and Scrutiny, it turns out that it's no easy thing to make a local copy of a website browsable. In some cases it's easy and the site just works, but each site we tested was like opening a new Pandora's box, with new things that we hadn't taken into account. It's been an intense time, but I'm really happy with the stage it's at now. I'd like to hear from users in cases where the local copy isn't browsable or problems are seen.
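To give a flavour of one of those Pandora's-box issues (a simplified sketch, not the code Website Watchman actually uses): absolute links in the saved HTML still point at the live server, so they need rewriting before the local copy will browse properly offline.

    import re

    SITE = "https://peacockmedia.software"   # the site being archived

    def rewrite_links(html, local_root="."):
        # Only handles the simplest case; protocol-relative URLs, query
        # strings, CSS url(...) references etc. each need their own handling.
        pattern = re.escape(SITE) + r"""/([^"' >]*)"""
        return re.sub(pattern,
                      lambda m: f"{local_root}/{m.group(1) or 'index.html'}",
                      html)

    print(rewrite_links('<a href="https://peacockmedia.software/mac/">Mac</a>'))
    # -> <a href="./mac/">Mac</a>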

The new functionality is in Website Watchman from version 2.5.0. There's a button above the archive browser, and a menu item with keyboard shortcut under the File menu.

The new simple app is WebArch and is here for free download.

Friday, 19 July 2019

Migrating to a secure (https://) website using Scrutiny 9

Yesterday I moved another website to https:// and thought I'd take the opportunity to make an updated version of this article. Scrutiny 9 has just been launched.

Google have long been pushing the move to https://, and browsers now display a "not secure" warning if your site isn't served over https://.

Once the certificate is installed (which I won't go into), you must weed out links to your http:// pages, and pages that have 'mixed' or 'insecure' content, i.e. references to images, CSS, JS and other files which are http://.

Scrutiny makes it easy to find these.

1. Find links to http pages and pages with insecure content.

First, make sure that you're giving your https:// address as your starting URL, and that these two boxes are ticked in your settings,

and these boxes ticked in your Preferences,

After running a scan, Scrutiny will offer to show you these issues. If you started at an https:// URL and had the above boxes checked, you'll automatically see this box (if there are any issues).
You'll have to fix and re-scan until there's nothing reported. (Making certain fixes may reveal new pages to Scrutiny for testing.)
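For anyone wondering what 'insecure content' looks like in the source, it's simply an http:// reference inside a page served over https://. Here's a simplified illustration of the kind of reference being flagged (not Scrutiny's own checking code; the sample HTML is made up):

    import re

    INSECURE = re.compile(r'''(?:src|href)\s*=\s*["'](http://[^"']+)["']''', re.I)

    page = '''<img src="http://example.com/logo.png">
    <link rel="stylesheet" href="https://example.com/style.css">'''

    for url in INSECURE.findall(page):
        print("insecure reference:", url)   # only the http:// image is reported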

2. Fix broken links and images

Once those are fixed, there may be some broken links and broken images to fix too. (I was copying stuff onto a new server and chose to copy only what was needed; there are inevitably things that you miss.) Scrutiny will report these and make them easy to find.

3. Submit to Google.

Scrutiny can also generate the XML sitemap for you, listing your new pages (and images and PDF files too, if you want).
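For reference, the file Scrutiny generates is a standard XML sitemap. A minimal hand-rolled equivalent would look something like this (the URLs are placeholders; Scrutiny builds the list from its scan):

    from xml.sax.saxutils import escape

    urls = ["https://example.com/", "https://example.com/about.html"]

    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    sitemap = ('<?xml version="1.0" encoding="UTF-8"?>\n'
               '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
               f"{entries}\n</urlset>\n")

    with open("sitemap.xml", "w") as f:
        f.write(sitemap)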

Apparently Google treats the https:// version of your site as a separate 'property' in its Search Console (was Google Webmaster Tools). So you'll have to add the https:// site as a new property and upload the new sitemap.

[update 15 Jul] I uploaded my sitemap on Jul 13; it was processed on Jul 14.

4. Redirect

As part of the migration process, Google recommends that you then "Redirect your users and search engines to the HTTPS page or resource with server-side 301 HTTP redirects" (full article here).
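The redirect itself lives in the web server's configuration rather than in Scrutiny, but it's easy to confirm that it's working. A small illustrative check (example.com is a placeholder):

    import http.client
    from urllib.parse import urlparse

    def check_redirect(url):
        # One request, without following redirects, so we can see the raw
        # status code and Location header the server actually returns.
        parsed = urlparse(url)
        conn = http.client.HTTPConnection(parsed.netloc)
        conn.request("HEAD", parsed.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")

    status, location = check_redirect("http://example.com/")
    print(status, location)   # hoping for: 301 https://example.com/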





Sunday, 7 July 2019

Press Release - Integrity Pro v9 released

Integrity Pro version 9 is now fully released. It is a free update for existing licence holders.

The major new features are as follows:
  • Improved Link and Page inspectors. New tabs on the link inspector show all of a URL's redirects and any warnings that were logged during the scan.

  • Warnings. A variety of things may now be logged during the scan, for example a redirect chain or certain problems discovered with the HTML. If there are any such issues, they'll be highlighted in orange in the links views, and the details will be listed on the new Warnings tab of the Link Inspector.
  • Rechecking. This is an important part of your workflow: check, fix, re-check. You may have 'fixed' a link by removing it from a page, or by editing the target URL. In these cases, simply re-checking the URL that Integrity reported as bad will not help; it's necessary to re-check the page that the link appeared on. Now you can ask Integrity to re-check a URL, or the page that the URL appeared on, and in either case you can select multiple items before choosing the re-check command.
  • Internal changes. There are some important changes to the internal flow which will eliminate certain false positives.


More general information about Integrity Pro is here:
https://peacockmedia.software/mac/integrity-pro/