
Tuesday, 4 February 2020

How to extract a table from html and save to csv (web to spreadsheet)

WebScraper users have sometimes asked about extracting data contained in tables on multiple pages.

Tables on multiple pages


That's fine if the table is used for layout, or if there's just one piece of information that you want to grab from each page, identifiable using a class or id.

But taking the whole table raises some questions - how do you map the web table to your output file? It may work if you can identify a similar table on every page (with matching columns) so that each one can be appended and the columns line up, and if the first row is always headings (or marked up as th) so that it can be ignored, except perhaps for the first one.

It's a scenario with a lot of ifs and buts, which means that it may be one of those problems best dealt with on a case-by-case basis rather than by trying to make a configurable app handle it. (If you do have this requirement, please do get in touch.)

Table from a single page


But this week someone asked about extracting a table from a single web page. It's pretty simple to copy the source from the web page and paste it into an online tool, or to copy the table from the web page and paste it into a spreadsheet app like Numbers or Excel, and that was my answer.

But this set me thinking about the job of parsing html and extracting the table data ready for saving in whatever format.

At the core of this is a Cocoa class for parsing the HTML and extracting the table (or tables, if there is more than one on the page). With a view to possibly building this into WebScraper so that it can handle the 'tables on multiple pages' task, or simply to have it ready should the need arise to use it in a custom app for a one-off job, I've now written that parser and built a small free app around it.
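For the curious, here's a rough sketch of the kind of thing that parser does. It isn't the actual code from the app, just the idea, using Foundation's XMLDocument in its tidy-HTML mode and with deliberately simplistic CSV quoting: find each table, walk its rows, and join the cell text with commas.

import Foundation

// Sketch only: parse an HTML string, pull out every <table>,
// and turn each one into a block of CSV text.
func tablesAsCSV(fromHTML html: String) throws -> [String] {
    // .documentTidyHTML cleans up real-world HTML so it can be treated as XML
    let doc = try XMLDocument(data: Data(html.utf8), options: [.documentTidyHTML])
    var csvTables: [String] = []
    for table in try doc.nodes(forXPath: "//table") {
        var lines: [String] = []
        for row in try table.nodes(forXPath: ".//tr") {
            let cells = try row.nodes(forXPath: ".//th | .//td")
            let fields = cells.map { cell -> String in
                let text = (cell.stringValue ?? "")
                    .trimmingCharacters(in: .whitespacesAndNewlines)
                    .replacingOccurrences(of: "\"", with: "\"\"")
                return "\"\(text)\""
            }
            lines.append(fields.joined(separator: ","))
        }
        csvTables.append(lines.joined(separator: "\n"))
    }
    return csvTables   // one CSV string per table found on the page
}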

That app is the imaginatively-titled HTMLTabletoCSV which is now available here.


Monday, 4 June 2018

Test HTML validation and accessibility checkpoints of whole website

Didn't Scrutiny use to do this?

Yes, but when the W3C validator's 'nu' engine came online, it broke Scrutiny's ability to test every page. The 'nu' engine no longer returned the number of errors and warnings in the response header, which Scrutiny had used as a fast way to get basic stats for each page. It also stopped responding after a limited number of requests (and some Scrutiny users have large websites).

Alternative solutions

After exploring some other options (notably html tidy, which is installed on every Mac), we found that the W3C now offers a web service which is responding well; we haven't seen it clam up after a large number of fast requests, even when using a large number of threads.

The work in progress is called Tidiness (obviously a reference to tidy, which we've been experimenting with).

It contains a newer version of tidy than the one installed on your Mac. However, while tidy's html validation results are useful, they're not as definitive as the ones from the W3C service.
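(If you're curious what the local route looks like, here's a rough sketch that simply runs the copy of tidy that ships with macOS - Tidiness bundles its own newer build, so this isn't its actual code. The -e flag asks tidy to report only its errors and warnings, -q keeps it quiet otherwise, and the report arrives on stderr.)

import Foundation

// Sketch only: run the system copy of tidy against a saved page
// and return its errors-and-warnings report.
func tidyReport(forFileAt fileURL: URL) throws -> String {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/tidy")
    process.arguments = ["-e", "-q", fileURL.path]   // errors/warnings only, no tidied output
    let stderrPipe = Pipe()
    process.standardError = stderrPipe               // tidy writes its report to stderr
    try process.run()
    process.waitUntilExit()                          // exit status: 0 = clean, 1 = warnings, 2 = errors
    let data = stderrPipe.fileHandleForReading.readDataToEndOfFile()
    return String(data: data, encoding: .utf8) ?? ""
}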

So Tidiness as it stands is a bit of a hybrid. It crawls your website, passing each page to the W3C web service. If you like, you can switch to tidy for the validation, which makes things much quicker as everything then runs locally. You can also make accessibility checks at level 1, 2 or 3 at the same time, with all of the results presented together.
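For anyone interested, here's roughly what 'passing each page to the W3C service' involves - a sketch against the Nu checker's documented out=json interface, not Tidiness's actual code. You POST the page's HTML and get back a JSON list of messages, where errors have type "error" and warnings come through as "info" with a subType of "warning".

import Foundation

// Sketch only: send one page's HTML to the W3C Nu checker and decode its messages.
struct NuMessage: Decodable {
    let type: String        // "error" or "info"
    let subType: String?    // "warning" for warnings
    let message: String
    let lastLine: Int?
}
struct NuResponse: Decodable { let messages: [NuMessage] }

func validate(html: String, completion: @escaping ([NuMessage]) -> Void) {
    var request = URLRequest(url: URL(string: "https://validator.w3.org/nu/?out=json")!)
    request.httpMethod = "POST"
    request.setValue("text/html; charset=utf-8", forHTTPHeaderField: "Content-Type")
    request.setValue("Tidiness-sketch/0.1", forHTTPHeaderField: "User-Agent")  // be polite to the service
    request.httpBody = Data(html.utf8)
    URLSession.shared.dataTask(with: request) { data, _, _ in
        let decoded = data.flatMap { try? JSONDecoder().decode(NuResponse.self, from: $0) }
        completion(decoded?.messages ?? [])
    }.resume()
}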

Here are some shots.



[Update 3 Jan 2021] Due to lack of interest, this project is mothballed. We have since built more html validation tests into Scrutiny / Integrity Pro, and that functionality will be expanded through 2021. If you are interested in Tidiness, please tell us.

Friday, 27 April 2018

HTMLtoMD and WebScraper - a comparison: converting & archiving a site as Markdown

I noticed a few people visiting the website as a result of searching for "webpage to markdown" and similar.

Our app HTMLtoMD was designed to do exactly that, and is free. But it was experimental and hasn't received very much development in a long time.

It still does its job, but our app WebScraper can also do that job with many more options, and it's a much more active product: it's selling and is under regular development now.

I thought it was worth writing a quick article to compare the output and features of the two apps.

First HTMLtoMD. It's designed for one job and so is simple to use. Just type or paste the address of your page or homepage into the address bar and press Go. If you choose the single page option, you'll immediately see the page as markdown, and can copy that markdown to paste somewhere else.
If you don't choose the single page option, the app will scan your site, and then offer to save the pages as markdown.

Here's the big difference with WebScraper: HTMLtoMD will save each page as a separate markdown file:

So here's how to do the same thing with WebScraper. It has many more options (which I won't go into here); the advanced ones tend not to be visible until you need them. So here's WebScraper when it opens.
For this task, simply leave everything at its defaults, enter your website address and make sure you choose "collect content as markdown" from the 'simple setup' as shown above. Then press Go.

For this first example I left the output file format set to csv. When the scan has run, the Results table shows how the csv file will look. In this example we have three field names: url, title, content. Note that the 'complex setup' allows you to choose more fields; for example, you might like to collect the meta description too.
You may have noticed in one of the screenshots above that the options for output include csv, json and txt. Unlike HTMLtoMD, WebScraper always saves everything as a single file. The csv output is shown above. The text file is also worth looking at here: although it is a single file, it is structured. I've highlighted the field headings in the screenshot below. As I mentioned above, you can set things up so that more information (e.g. the meta description) is captured.


[Update]

It occurred to me after writing the article above that collecting images is a closely related task. WebScraper can download and save all images while performing its scan. Here are the options; you'll need to switch to the 'complex setup' to access them:


If you're unsure about any of this, or if you can't immediately see how to do what you need to do, please do get in touch. It's really useful to know what jobs people are using our software for (in confidence, of course). Even if what you want isn't possible, we may have suggestions. We have other apps that can do similar but different things, and we have even written custom apps for specific jobs.

Tuesday, 3 January 2017

Crawling a website that requires authentication

This is a big subject, and it gets bigger and more complicated as websites become increasingly clever at preventing non-human visitors from being able to log in.

My post How to use Scrutiny to test a website which requires authentication has been updated a number of times in its history, and I've just updated it again to include a relatively recent Scrutiny feature. It's a simple trick involving a browser window within Scrutiny which allows you to log into your site. If there's a tracking cookie, it's then retained for Scrutiny's scan.

It used to be possible to simply log in using Safari - Safari's cookies seemed to be system-wide - but since Yosemite, a browser's cookies appear to be specific to that browser.

The reason this is all on my mind today is that I've just worked the same technique into WebScraper. I wanted to compile a list of website addresses from pages within a social networking site which is only visible to authenticated users.
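In case you're wondering what 'retaining the cookie' amounts to under the hood, here's a rough sketch of the idea - not Scrutiny's or WebScraper's actual code, and it assumes the built-in browser window is a WKWebView on a recent macOS: once you've logged in, copy the browser's cookies across to the cookie storage that the crawl's own requests will use.

import WebKit

// Sketch only: after the user has logged in within the embedded browser,
// hand its cookies (including any session/tracking cookie) to the crawler's session.
func adoptSessionCookies(from webView: WKWebView,
                         into configuration: URLSessionConfiguration,
                         completion: @escaping () -> Void) {
    webView.configuration.websiteDataStore.httpCookieStore.getAllCookies { cookies in
        let storage = configuration.httpCookieStorage ?? HTTPCookieStorage.shared
        cookies.forEach { storage.setCookie($0) }   // crawl requests now carry the login cookie
        completion()
    }
}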



Webscraper doesn't have the full authentication features of Scrutiny but I think this method will work with the majority of websites which require authentication.

(This feature, along with others, is in WebScraper 1.3, which will be available very shortly.)

Friday, 6 December 2013

Creating a FAQs web page using Clipassist

Clipassist version 3.4 has the ability to export a selected folder or all clips as a web page with expandable answers. A real-life example of this being used is here: http://peacockmedia.software/integrity_support.html

Here's how.

1. Organise yourself a folder containing the clips that you want to include. Give each a short name. Note that at present they're sorted alphabetically by the short name, so you can use a, b, c to order them. A future version will allow you to drag to re-order.

2. A new feature of Clipassist 3.4 is the 'Full question' field. This is what appears as the question on the web page. Access it with the 'reveal' button or View > Show Full Question

3. The text in the main pane will appear as the answer to the question.

4. Fill in meta data keywords to help you search for this item later on.

5. Choose File > Export Folder to HTML or File > Export All to HTML

6. The exported file will open and function as a web page in its own right, but it is designed for you to be able to copy and paste the questions and answers into your existing page. The Qs and As are tagged <h2> and <p>, so they should pick up your existing styles, and they have a class ("faq") so that you can customise their style further (there's an illustrative sketch of this markup after the list).


7. Note that if you copy and paste the Qs and As, you'll also need to copy the jQuery code and styles from the head of the exported page into your own page.
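To make steps 6 and 7 a little more concrete, here's a hypothetical sketch (written as a small Swift helper purely for illustration) of the shape of markup described above. Clipassist's actual output may differ in detail, and remember that the expand/collapse behaviour comes from the jQuery and styles in the head of the exported page, which you'd copy across too.

// Hypothetical illustration only - not Clipassist's actual export code.
// Each Q&A becomes an <h2> (the question) and a <p> (the answer), both carrying
// the "faq" class so your own stylesheet can pick them up.
func faqEntryHTML(question: String, answer: String) -> String {
    return """
    <h2 class="faq">\(question)</h2>
    <p class="faq">\(answer)</p>
    """
}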