Tuesday, 4 February 2020

How to extract a table from html and save to csv (web to spreadsheet)

WebScraper users have sometimes asked about extracting data contained in tables on multiple pages.

Tables on multiple pages

That's fine if the table is for layout, or if there's just one bit of info that you want to grab from each, identifiable using a class or id.

But to take the whole table raises some questions - how do you map the web table to your output file? It may work if you can identify a similar table on all pages (matching columns) so that each one can be appended and match up, and if the first row is always headings (or marked up as th) and can be ignored, except for maybe the first one.

It's a scenario with a lot of ifs and buts, which means that it may be one of those problems that's best dealt on a case-by-case basis rather than trying to make a configurable app handle it. (if you do have this requirement, please do get in touch.)

Table from a single page

But this week someone asked about extracting a table from a single web page. It's pretty simple to copy the source from the web page, paste it into an online tool, or copy the table from the web page and paste into a spreadsheet app like Numbers or Excel and that was my answer.

But this set me thinking about the job of parsing html and extracting the table data ready for saving in whatever format.

At the core of this is a cocoa class for parsing the html and extracting the table (or tables if there are more than one on the page). With a view to possibly building this into WebScraper to allow it to do the 'tables on multiple pages' task, or for having this ready, should the need arise to use this in a custom app for a one-off job, I've now written that parser and built a small free app around it.

That app is the imaginatively-titled HTMLTabletoCSV which is now available here.