It has already been possible to use WebScraper to scrape email addresses, with some caveats: a/ it required some setting up, using your favourite regular expression for matching email addresses (these are easy to find online if you're not a regex demon), and b/ it only works if the email addresses appear unobfuscated in the source or visible text. It has long been popular to use some method of hiding email addresses from bots.
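To give an idea of what the regular expression approach involves, here is a minimal sketch in Python. The pattern is one of the common simplified ones found online, not the expression WebScraper itself uses, and the `find_emails` name is made up for this illustration. It also shows caveat b/: an obfuscated address simply doesn't match.

```python
import re

# A deliberately simple pattern (an assumption for this sketch). It covers
# everyday addresses but is not a full implementation of the email standard.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the addresses found in text, in order, without repeats."""
    seen = set()
    found = []
    for address in EMAIL_RE.findall(text):
        key = address.lower()  # treat Info@ and info@ as the same address
        if key not in seen:
            seen.add(key)
            found.append(address)
    return found


page = '<a href="mailto:info@example.com">info@example.com</a> or sales@example.co.uk'
print(find_emails(page))

# An obfuscated address is not picked up at all.
print(find_emails("info [at] example [dot] com"))
```

The first call finds two addresses even though info@example.com appears twice in the snippet, once in the link and once in the visible text.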
To help with this particular task, you can now set up WebScraper more easily (as of version 4.11.0). 'Email Addresses' now appears in the drop-down buttons for the simple and complex setups.
Here's the simple setup. It couldn't be easier; 'Email Addresses' appears in the second drop-down if you choose 'Content' in the first. Skip to the Post-processing tab >>
Output file columns
For the complex setup, you get the URL and Title columns by default. You may like to keep those so that you can see which page each email address appears on, or delete them if you simply want a list of email addresses. Then add a column. As with the simple setup, choose Content and then Email Addresses.
Results tab
At this point (after running the scan or a test) each page is presented on a row of this table. If there are multiple email addresses on a page, they'll be listed in a single cell separated by a pipe. We'll fix that later. There may also be big gaps where pages don't contain an email address. That's also something we can fix.
During the scan, at the point where email addresses are scraped from a page, the results are uniqued. So if the same address appears multiple times on the same page (which is likely if the address appears as a link - it may be in the link and in the visible text) then it'll only appear once in that row on the Results tab.
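To picture the shape of the data at this stage, and what tidying it up amounts to, here is a rough sketch in Python. It isn't how WebScraper does it internally; the `flatten_rows` name and the sample rows are invented for the example, and it assumes each row is a URL plus a cell of pipe-separated addresses as described above.

```python
def flatten_rows(rows, separator="|"):
    """Turn (url, cell) rows into one (url, address) row per address.

    Pages whose cell is empty produce no output rows, which removes the gaps.
    """
    flat = []
    for url, cell in rows:
        for address in cell.split(separator):
            address = address.strip()  # tolerate spaces around the pipe
            if address:
                flat.append((url, address))
    return flat


# Hypothetical rows as they might look on the Results tab.
results = [
    ("https://example.com/", "info@example.com|sales@example.com"),
    ("https://example.com/about", ""),
    ("https://example.com/contact", "info@example.com"),
]

for url, address in flatten_rows(results):
    print(url, address)
```

Note that info@example.com still appears twice here, once for each page it was found on. The uniquing described above applies within a page, not across the whole site.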