Tuesday, 4 February 2020

How to extract a table from html and save to csv (web to spreadsheet)

WebScraper users have sometimes asked about extracting data contained in tables on multiple pages.

Tables on multiple pages


That's fine if the table is for layout, or if there's just one bit of info that you want to grab from each page, identifiable using a class or id.

But taking the whole table raises some questions: how do you map the web table to your output file? It can work if a similar table (with matching columns) can be identified on every page, so that each one can be appended and the columns line up, and if the first row is always headings (or marked up as th) so that it can be skipped on every page except perhaps the first.

It's a scenario with a lot of ifs and buts, which means it may be one of those problems best dealt with on a case-by-case basis rather than by trying to make a configurable app handle it. (If you do have this requirement, please do get in touch.)

Table from a single page


But this week someone asked about extracting a table from a single web page. It's pretty simple to copy the source of the web page and paste it into an online tool, or to copy the table itself from the web page and paste it into a spreadsheet app like Numbers or Excel, and that was my answer.

But this set me thinking about the job of parsing html and extracting the table data ready for saving in whatever format.

At the core of this is a Cocoa class for parsing the html and extracting the table (or tables, if there's more than one on the page). With a view to possibly building this into WebScraper so that it can handle the 'tables on multiple pages' task, and to have it ready should the need arise to use it in a custom app for a one-off job, I've now written that parser and built a small free app around it.
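For the curious, here's a minimal sketch of the general approach in Swift, using Foundation's XMLDocument with its HTML-tidying option. To be clear, this isn't the actual class inside the app; the URL and output filename are placeholders:

    import Foundation

    // A minimal sketch only; the URL and output filename are placeholders.
    do {
        // Fetch the page source.
        let url = URL(string: "https://example.com/page.html")!
        let html = try String(contentsOf: url, encoding: .utf8)

        // XMLDocument can tidy real-world html into XML, ready for XPath queries.
        let doc = try XMLDocument(xmlString: html, options: .documentTidyHTML)

        // Take every row of the first table, treating th and td cells alike.
        var csvLines: [String] = []
        for row in try doc.nodes(forXPath: "//table[1]//tr") {
            let cells = try row.nodes(forXPath: ".//th|.//td")
            // Quote each field and double any embedded quotes (RFC 4180 style).
            let fields = cells.map {
                "\"" + ($0.stringValue ?? "").replacingOccurrences(of: "\"", with: "\"\"") + "\""
            }
            csvLines.append(fields.joined(separator: ","))
        }

        try csvLines.joined(separator: "\n")
            .write(toFile: "table.csv", atomically: true, encoding: .utf8)
    } catch {
        print("Extraction failed: \(error)")
    }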

That app is the imaginatively-titled HTMLTabletoCSV, which is now available here.


Wednesday, 29 January 2020

Upgrading the hard disk in a G4 iMac

I bought this iMac new in 2003 and used it as my main machine for some time. I was writing software at the time and had a number of products out, but it wasn't my full-time business then. Eventually it was packed away to make room for newer machines.
Isn't it beautiful? OSX 10.4 is a really great version of the operating system, and this was the time when iTunes was at its height. I'd long had it in mind that it would make a great jukebox for the kitchen or bedroom, with its great Harman Kardon speakers and optional subwoofer.

To be useful again, mine needed:

  • Wireless card (I never did fit one)
  • Hard drive upgrade* (the standard 80GB one isn't big enough for half of my music collection)
  • Replacement CD drive** (mine had broken somewhere along the line)
  • New battery (small alkaline battery for keeping the time and date)

The parts I expected to be hard were easy, and the parts that should have been easy were hard. I'm documenting my experiences here because there are some things that may be useful to others.

I'd already collected some of these parts, intending to do the work one day. Last week something reminded me of Cro-Mag Rally. It's a great game and the Mac version is now free, thank you Pangea. The OSX version needs a PowerPC Mac, and my clamshell iBook proved too slow, so out came the G4 iMac.
It wasn't remembering the time, which of course means that the battery is dead, and a dead battery inside an almost 20-year-old computer is bad news. So perhaps the time was right to carry out those other upgrades at the same time.
The wireless card and RAM are pretty much a trapdoor job (well, behind the stainless steel cover, which has four captive crosshead screws; very simple). No worries if that's all you have to do.

NB: 10.4 (Tiger) allows connection with WPA2, which is what I needed at home, and the AirPort Extreme card is working fine. 10.3 and earlier don't have the WPA options, only WEP. Whether those earlier systems can be made to work with a modern router, I can't say.
I had to go deeper; you need to go one layer further even to replace that battery. This is the new hard drive and replacement CD drive. The latter came from an eBay seller, used but working (allegedly). The HD and my new battery are from the BookYard, whom I recommend (as a UK buyer); their website makes it easy to find the right parts for your particular machine, and the service was efficient and swift.
I had read that delving more deeply into the machine is an advanced task, so I was expecting it to be harder than it really was. It's actually very simple. I'm not going to give instructions here; there are plenty elsewhere. That's the old CD/HD assembly coming out; note the horrible amount of dust in there. With those out, it's possible to vacuum out the fan and all of the nooks and crannies, taking anti-static precautions.

One point of uncertainty for me was about the jumper settings for the new HD and CD drives. The replacement CD came with a single ribbon cable with a 'tap' halfway for the HD. The existing ones each had their own cables (there are two sockets on my board). I suspect that the CD drive came from a different model (there are several variations, three of them with 17" screens: 800MHz, 1GHz and 1.25GHz). Besides the difference with the cable, the little metal screen at the front was slightly different and I had to nab that part from the old drive.

Back to the jumper settings. The replacement CD drive came jumpered as a slave, which may be consistent with it having been on the same ribbon cable as its HD. With my existing drives, each had its own cable and was jumpered for 'cable select'. I decided to use my existing cables and jumper both drives for cable select as before. That worked.

NB: if you're going further than the trapdoor, there's some thermal paste that you must clean off and renew. See the iFixit article for details.

With the replacement battery in and everything back together, it chimed, and as expected there was the icon for 'no startup volume'.
This is where the games really started. The new CD drive worked and I expected it to be easy to make a fresh OS install on the new blank drive, but I had a very frustrating time.

I have numerous discs from various computers. However, not all were readable, and some were upgrade discs rather than full install discs. Those that I downloaded from the web and burned to CD were either not bootable or simply would not boot. (If you're reading this and know why, please leave the answer in the comments.)

I still had the drive that I'd taken out, and 'restoring' or cloning an existing volume to a new one should be very easy using Disk Utility, which comes on every system and every bootable install disc.

At first I opened up the computer again and plugged the old drive into the CD drive's socket on the motherboard. This worked; I could boot from it and perform the 'restore'.

However, it didn't work first time; the clone on the new drive wouldn't boot, and I had to try again and again. Because my first method was a little makeshift, I devised this: a FireWire 'caddy' made from an old LaCie FireWire hard drive. I opened it up, unplugged its own HD and plugged in my computer's HD, and voila! It plugs into the reassembled computer and can boot it (you can boot these machines from a FireWire external drive but not a USB one).
The old drive had three different versions of OSX installed on different partitions. Back in the day this allowed me to boot into OSX 10.2, 10.3 or 10.4. Very useful for testing my software.

My first attempts at cloning the 10.4 partition failed a couple of times. The first time I thought it was because I hadn't checked 'erase' before doing the copy. But I don't know why the second attempt wasn't successful.

I'd read mixed opinions about the maximum size for the boot volume with this machine. I thought this might be the problem for a little while but I can tell you that I have a 500GB drive with a single partition / volume, and that it now boots.

For my final attempt I tried booting into 10.3 (Panther) and using its version of Disk Utility (this, I think, being the system that came with the machine). I don't know whether that made the difference, or whether it was because I reformatted the new drive at that point (it had come apparently formatted, so I hadn't bothered before). That new clone of the 10.3 volume did successfully boot. After that I successfully used an upgrade disc to take it to 10.4, and then a multipart update downloaded from the Apple developer site to take it to 10.4.11.

Now I have a Mac with a system that really doesn't feel out of date, a version of iTunes (7) which is a joy to use, a 500GB drive, and wireless (10.4 allows me to connect with WPA2). A useful and usable machine, and one which looks amazing.




* Why not an SSD? I had intended to do this for some time, but an ATA/IDE/PATA SSD isn't so easy to find, and I read that the limitations of the ATA system mean that the speed benefit is lost. A brand new spinning disk is cheaper and still quiet, so that's what I went for.
** I refer to the drive as a CD drive in this post. Actually it's a combined CD reader / writer and DVD reader, known as a 'superdrive'.

Tuesday, 31 December 2019

Finding mixed / insecure website content using Scrutiny

For a while now, some browsers have been warning when a page is insecure. I read recently that Google Chrome will start blocking HTTP resources in HTTPS pages.

If you've not yet migrated your website to https:// then you're probably thinking about doing it now.

Once the certificate is installed (which I won't go into here), you must weed out links to your http:// pages, and pages that have 'mixed' or 'insecure' content, ie references to images, css, js and other files which are http://.

Scrutiny makes it easy to find these.
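If you're curious about what such a check involves at its simplest, here's a rough sketch in Swift. This is an illustration only, not how Scrutiny works internally (its checks are more thorough), and the page url is a placeholder:

    import Foundation

    // A rough illustration only; the page url is a placeholder.
    do {
        let pageURL = URL(string: "https://example.com/")!
        let html = try String(contentsOf: pageURL, encoding: .utf8)

        // On an https:// page, any src= or href= attribute whose value
        // begins with plain http:// is a candidate for mixed content.
        let pattern = "(?:src|href)\\s*=\\s*[\"'](http://[^\"']+)[\"']"
        let regex = try NSRegularExpression(pattern: pattern, options: .caseInsensitive)
        let whole = NSRange(html.startIndex..., in: html)
        for match in regex.matches(in: html, range: whole) {
            if let r = Range(match.range(at: 1), in: html) {
                print("insecure reference:", html[r])
            }
        }
    } catch {
        print("Check failed: \(error)")
    }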

If you're not a Mac user, or you'd simply like me to do it for you, I can supply a mixed-content report for a modest one-off price. It will list:

  • pages with links to internal http: pages
  • pages which use resources (images, style sheets, etc) which are http
  • https:// pages which have a canonical which is http://
  • https:// urls which redirect to a http:// url


If you're interested in using Scrutiny to do this yourself, read on.

1. Find links to http pages and pages with insecure content.

First you have to give Scrutiny your https:// address as your starting url, and make sure that these two boxes are ticked in your site-specific settings, and these two as well.

After running a scan, Scrutiny will offer to show you these issues.

You'll have to fix and re-scan until there's nothing reported. (Certain fixes may reveal new pages for Scrutiny to test.)

2. Fix broken links and images

Once those are fixed, there may be some broken links and broken images to fix too. (I was copying stuff onto a new server and trying to copy only what was needed; there are inevitably things that you miss.) Scrutiny will report these and make them easy to find.

3. Submit to Google.

Scrutiny can also generate the xml sitemap for you, listing your new pages (and images and pdf files too if you want).
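If you've not seen one before, an xml sitemap is just a list of urls in a standard format. A minimal example (with example.com as a placeholder) looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/</loc>
      </url>
      <url>
        <loc>https://example.com/about.html</loc>
      </url>
    </urlset>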

Apparently Google treats the https:// version of your site as a separate 'property' in its Search Console (formerly Google Webmaster Tools), so you'll have to add the https:// site as a new property and upload the new sitemap.

4. Redirect

As part of the migration process, Google recommends that you then "Redirect your users and search engines to the HTTPS page or resource with server-side 301 HTTP redirects"  (full article here)
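How you set that up depends on your server. On Apache, for example (assuming mod_rewrite is enabled; other servers have their own equivalents), a site-wide 301 redirect is commonly done with something like this in the .htaccess file:

    RewriteEngine On
    # Redirect any request arriving over plain http to the https equivalent.
    RewriteCond %{HTTPS} off
    RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]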





Wednesday, 11 December 2019

Using Webscraper to extract data from paginated search result pages, another real case study


First we need to give WebScraper our starting URL. This may be as simple as performing the search in a browser and grabbing the url of the first page of results. In this case things are even easier, because the landing page on the website has the first pageful of results. So that's our starting URL.

After that, we want WebScraper to 'crawl' the pagination (ie next/previous, or page 1, 2, 3, etc) and ignore any of the other links to other parts of the site. In this case, the pages have a nice clean querystring which only includes the page number.
So as far as the crawl goes, we can tell WebScraper to ignore any url which doesn't contain "?page=". We're in the Advanced section of the Scan tab for these next few steps.
In some cases, the pages we want the data from are a further click from the search results pages. Let's call those 'detail pages'. It's possible to tell WS to stick to the scan rule(s) we've just set up, but to also visit and only scrape data from detail pages. This is done by setting up rules on the 'Output filter' tab. We don't need to do that here but this article goes into more detail about that.

In our case we're still going to use the output filter to make sure that we only scrape data from pages that we're interested in. We're starting within the 'directory' /names, so the first rule, "url does contain /names", shouldn't be necessary*. There's a silly quirk here, which is that the first page won't stay at ?page=1; it redirects back to /names. Therefore we can't use /names?page= here if we're going to capture the data from that first page. I notice that there are detail pages in the form /names/somename, so in order to filter those from the results, I've added a second rule here: "URL does not contain names/".
If we're lucky, the site will be responsive, won't clam up on us after a limited number of requests, and our activities will go unnoticed. However, it's always good policy to put in a little throttling (the lower the number, the longer the scan will take).
Don't worry too much about the number of threads here; the above setting takes that into account, and it's beneficial to use multiple threads even when throttling right down.

Finally, we set up our columns. If you don't see the Output file columns tab, click Complex setup (below the Go button). By default, you'll see two columns set up already, for page URL and page title. Page URL is often useful later, page title not so much. You can delete one or both of those if you don't want them in the output file.

In this example, we want to extract the price, which appears (if we 'view source') as data-business-name-price="2795". This is very easy to turn into a regex; we just replace the part we want to collect with (.+?)
I had a little confusion over whether this appeared on the page with single-quotes or double-quotes, and so to cover the possibility that the site isn't consistent with these, I also replaced the quotes in the expression with ["'] (square brackets containing a double and single quote).
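If you'd like to see that expression working outside WebScraper, here's a quick Swift test. The sample line of html is made up for illustration:

    import Foundation

    // Quick test of the price-capturing expression; the sample line is made up.
    do {
        let sample = "<div data-business-name-price='2795' class='result'>"
        let pattern = "data-business-name-price=[\"'](.+?)[\"']"
        let regex = try NSRegularExpression(pattern: pattern)
        let whole = NSRange(sample.startIndex..., in: sample)
        if let match = regex.firstMatch(in: sample, range: whole),
           let priceRange = Range(match.range(at: 1), in: sample) {
            print(sample[priceRange])   // prints: 2795
        }
    } catch {
        print("Bad pattern: \(error)")
    }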

(Setting up the other columns for this job was virtually identical to the above step).

If necessary, WebScraper has a Helper tool to help us create and test that Regex. (View>Helper for classes/Regex.)

The Test button on the Output columns tab will run a limited scan, which should take a few seconds and put a few rows in the Results tab. We can confirm that the data we want is being collected.

Note that, by default, WebScraper outputs to csv and puts the data from one page on one row. That's fine if we want one piece of data from each page (or a number of different pieces extracted into different columns), but here we have multiple results on each page.

Choosing json as the output format gets around this, because that format allows for arrays within arrays. If we need csv as the output, WS will initially put the data in a single cell, separated by a separator character (pipe by default), and then there's an option on the Post-process tab to split those rows and put each item of data on a separate row. This is done when exporting, so it won't be previewed on the Results tab.

That's the setup. Go runs the scan, and Export exports the data.
*remembering to answer 'Directory' when asked whether the starting url is a page or directory



Sunday, 8 December 2019

Manual for Webscraper for Mac

WebScraper is becoming one of our more popular apps, and it's growing in terms of its features and options too.
Even when I use it to do a job for myself or for someone else, I have trouble remembering the rules for certain features.

So I've written a manual in plain English which should help everyone, whether they're a first-time user or just need to check a detail.

https://peacockmedia.software/mac/webscraper/manual.html

There are links on the app's home page and support page. I'll put a link under the app's Help menu too.


Monday, 25 November 2019

Plain Text Stickies

How often do you want to retain the styles when you copy text from one place to another? Almost never? Me neither. It frustrates me that the simple copy and paste, which has been with us for a very long time, includes styles by default*.

But that's not why we're here. This new app isn't some kind of clipboard manager or system extension that gives you system-wide plain-text pasting.

I keep a lot of notes. My memory was never good and it's worse now that I'm old. Stickies is still with us and has seemingly always been in Mac OS (actually since System 7 - that's 1991's System 7, not OSX 10.7) and I have always loved it. It's almost perfect. You can have as many notes as you like showing, choose different colours for different topics, and they just show up as they were when you quit and restart the system. There's no need to get bogged down in saving them as files; just keep them open as long as you need them and close them when done.

The newer Notes syncs your notes across Apple devices but doesn't allow you to scatter various-coloured notes across your desktop.

(Screenshot: this is *NOT* how I like my notes to end up looking.)
In both of these cases, and with the myriad third-party 'sticky note' apps**, rich text is the default and there is no option for a note to be plain text. Some even let you choose the type of note (list, rtf, image, etc) but without 'plain text' as an option.

Plain text doesn't have to mean that your notes are all in a boring default monospace or sans-serif font. You can choose a font, font size and font colour for each note, while the text itself remains plain text.

The closest match I've found for my strange plain-text craving is good old TextEdit, which allows you to set a document to be plain text (and even set the default for new documents to be plain text). For some time I have kept a number of these plain-text TextEdit documents open, resized and positioned where I like. If only it would allow you to set the background colour differently for each window (it does) and then remember that background-colour setting when you quit and restart TextEdit (it doesn't), then that would almost do.

Am I alone in this strange obsession?

Time will tell.

The new app is available as a free download. See here.

Update 12 Dec 2019: A newer version which adds some features including optional iCloud syncing (keeps notes in sync across Macs) is in beta and also available for free download.



*You have to take further action to get rid of that styling, such as using 'paste as plain text'. To make matters worse, 'paste as plain text' isn't a standard OS feature; it may have a different keyboard shortcut depending on the app you're pasting into, and there may not be a shortcut for it, or even a 'paste as plain text' option at all.

** Forgive me if you have written a third-party app which does do sticky notes with plain text by default. I didn't find it, or if I did, there were other reasons why I didn't continue using it.

Monday, 4 November 2019

Heads-up: new project - Mac Deals

This is a short 'heads-up' post about a new side project I've started called Mac Deals.

(Or possibly Mac Software Deals or Mac App Deals; the name isn't fixed yet.)

I'll be maintaining a database of Mac apps (within certain criteria) which are currently on offer.

This will include PeacockMedia apps from time to time, but will mainly feature those from other developers.

I've opened a mailing list especially for this and will mail out to the list regularly.

It currently resides here: https://peacockmedia.software/mac-deals/deals.py