PeacockMedia: scraping email addresses with WebScraper

Monday, 16 March 2020


This has been a popular question from WebScraper users.

It has already been possible to use WebScraper to scrape email addresses, with two caveats: (a) it required some setting up, using your favourite regular expression for matching email addresses (these are easy to find online if you're not a regex demon), and (b) it only works if the email addresses appear unobfuscated in the page source or visible text. Hiding email addresses from bots by some method has long been popular, so many sites won't expose them this way.
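To make the two caveats concrete, here is a minimal Python sketch of matching addresses with a regular expression. The pattern and the function name `find_emails` are illustrative assumptions, not the expression WebScraper itself uses, and the pattern is deliberately simple rather than a full address validator.

```python
import re

# Illustrative pattern only (an assumption, not WebScraper's built-in expression):
# a local part, "@", a domain, then a dot and a TLD of two or more letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_PATTERN.findall(text)

# One address appears twice (in the link and in the visible text);
# the second address is obfuscated, so the pattern cannot see it.
sample = (
    'Contact <a href="mailto:info@example.com">info@example.com</a> '
    "or sales [at] example.com"
)
print(find_emails(sample))
# ['info@example.com', 'info@example.com']
```

The obfuscated `sales [at] example.com` is missed entirely, which is caveat (b) in action.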

To help with this particular task, you can now set up WebScraper more easily (as of version 4.11.0). 'Email Addresses' now appears in the drop-down buttons for both the simple and complex setups.

Here's the simple setup. It couldn't be easier: choose 'Content' in the first drop-down and 'Email Addresses' appears in the second. If you're using the simple setup, you can now skip ahead to the Post-processing tab.

Output file columns

For the complex setup, you get the URL and Title columns by default. You may like to keep those so that you can see which page each email address appears on, or delete them if you simply want a list of email addresses. Then add a column. As with the simple setup, choose Content and then Email Addresses.

Results tab

At this point (after running the scan or a test), each page is presented as a row of this table. If there are multiple email addresses on a page, they'll be listed in a single cell, separated by a pipe. We'll fix that later. There may also be big gaps where pages don't contain an email address. That's also something we can fix.

During the scan, at the point where email addresses are scraped from a page, the results are de-duplicated. So if the same address appears multiple times on the same page (which is likely if the address appears as a link - it may be in the link and in the visible text) then it'll only appear once in that row on the Results tab.
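The per-page behaviour described above can be sketched roughly as follows. This is an illustration of the idea, not WebScraper's actual code; the function name `page_cell` is hypothetical, and joining with a bare `|` (no surrounding spaces) is an assumption about the exact cell format.

```python
def page_cell(addresses):
    """De-duplicate one page's addresses, keeping first-seen order,
    and join them into a single pipe-separated cell."""
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return "|".join(dict.fromkeys(addresses))

# The same address found in a mailto: link and in the visible text.
found_on_page = ["info@example.com", "info@example.com", "sales@example.com"]
print(page_cell(found_on_page))
# info@example.com|sales@example.com
```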

Post-processing tab

Here's where we can tidy up the output. The first checkbox will split multiple results onto separate rows. The third checkbox will skip rows where there are no results. The drop-down button contains a list of your output columns; choose your email address column. As the label says, these things will be done when you export your output to CSV.
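If it helps to picture what those two options do to the data, here is a small Python sketch of the same transformation. The function `post_process`, its parameters, and the column names `URL` and `Email` are all hypothetical, chosen for illustration; this is not how WebScraper is implemented.

```python
def post_process(rows, column, split_rows=True, skip_empty=True):
    """Split pipe-separated values in `column` onto separate rows
    and optionally drop rows where that column is empty."""
    out = []
    for row in rows:
        values = [v.strip() for v in row[column].split("|") if v.strip()]
        if not values:
            # No addresses on this page: keep the row only if asked to.
            if not skip_empty:
                out.append(dict(row))
            continue
        if split_rows:
            # One output row per address, other columns repeated.
            for value in values:
                out.append({**row, column: value})
        else:
            out.append(dict(row))
    return out

rows = [
    {"URL": "https://example.com/", "Email": "info@example.com|sales@example.com"},
    {"URL": "https://example.com/about", "Email": ""},
]
for row in post_process(rows, "Email"):
    print(row)
```

With both options on, the first page yields two rows (one per address) and the page with no addresses disappears from the output.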

Preferences

The default expression used for this task will match strings like xxxx@xxxx.xxx. I found that this can also match certain image filenames that contain an @ symbol. If you wish to improve the regular expression, simply change it in this field in Preferences.
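As an example of the kind of improvement you might try, the Python sketch below compares a simple pattern with a stricter one that refuses a match when the part after the final dot is a common image extension (think of filenames like logo@2x.png). Both patterns are illustrative assumptions rather than WebScraper's default, and they are shown with Python's `re` module; check that lookahead syntax is supported before pasting anything similar into Preferences. The stricter pattern covers the common case only and is not a guarantee against every odd filename.

```python
import re

# Simple illustrative pattern (an assumption, not WebScraper's default).
SIMPLE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Same pattern, plus a negative lookahead that rejects a "TLD"
# which is really a common image file extension.
STRICT = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\."
    r"(?!(?:png|jpe?g|gif|svg|webp)\b)[A-Za-z]{2,}",
    re.IGNORECASE,
)

html = '<img src="/img/logo@2x.png"> write to info@example.com'

print(SIMPLE.findall(html))
# ['logo@2x.png', 'info@example.com']
print(STRICT.findall(html))
# ['info@example.com']
```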

Export

Here's my output for this example (I chose not to include the URL and Title columns). Note that the same address appears many times, because de-duplication happens per page and not across the whole scan. At the time of writing, WebScraper doesn't have a 'unique' option on the Post-processing tab, but that's under consideration. Also note that caveat b at the top of this article still applies.
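Until there is a 'unique' option, one workaround is to de-duplicate the exported file yourself. Here is a minimal Python sketch; it assumes the CSV has a header row and that the email column is headed `Email Addresses`, which may not match your own export, and the function name `unique_column_values` is hypothetical.

```python
import csv
import io

def unique_column_values(csv_text, column):
    """Return the distinct non-empty values of `column`, in first-seen order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = (row[column] for row in reader if row[column])
    return list(dict.fromkeys(values))

# Stand-in for the contents of an exported file (assumed layout).
exported = (
    "Email Addresses\n"
    "info@example.com\n"
    "sales@example.com\n"
    "info@example.com\n"
)
print(unique_column_values(exported, "Email Addresses"))
# ['info@example.com', 'sales@example.com']
```

For a real file, read its contents with `open(path, newline="").read()` and pass the text in the same way.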