HTML-Scraping with RegEx

by Oct 20, 2011

To scrape information from websites with PowerShell, you can download the HTML code and then use regular expressions to extract what you are after. That's not hard. Here is a sample:

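# Download the raw HTML of the page as a single string
$webclient = New-Object System.Net.WebClient
$html = $webclient.DownloadString('http://www.cnn.com')

# (?i) ignores case, (?s) lets "." match line breaks,
# and \b[^>]* tolerates attributes such as <h1 class="...">
$headerpattern = '(?is)<h1\b[^>]*>(.*?)</h1>'

# Return the text captured between each opening and closing tag
$header = ([regex]$headerpattern).Matches($html) |
  ForEach-Object { $_.Groups[1].Value }

$header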

The script downloads the HTML content from www.cnn.com, finds every <h1>…</h1> element, and returns the text between the tags (capture group 1). That gives you a quick overview of the page's headlines.
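
Real pages rarely use bare, single-line tags, so here is a minimal sketch of the same idea that runs offline. The $sampleHtml text is made up for illustration (it is not CNN's markup), and the cleanup steps that strip nested tags and collapse whitespace are an assumption about what you would want from a headline, not part of the original tip:

```powershell
# Hypothetical sample markup standing in for a downloaded page
$sampleHtml = @'
<html><body>
<H1 class="top">First headline</H1>
<p>Some text</p>
<h1>
  Second <b>headline</b>
</h1>
</body></html>
'@

# (?i) ignores case, (?s) lets "." span line breaks,
# \b[^>]* allows attributes inside the opening tag
$pattern = '(?is)<h1\b[^>]*>(.*?)</h1>'

$headlines = [regex]::Matches($sampleHtml, $pattern) |
  ForEach-Object {
    # Remove nested tags, then collapse whitespace and trim
    $text = $_.Groups[1].Value -replace '<[^>]+>', ''
    ($text -replace '\s+', ' ').Trim()
  }

$headlines
```

With this sample, $headlines holds two strings: "First headline" and "Second headline". Regular expressions like this are fine for quick, predictable extractions; heavily nested or malformed HTML may need a real parser.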
