Scraping Information from Web Pages

by Oct 6, 2010

Regular expressions are a great way to identify and retrieve text patterns. The following code fragment defines a regular expression that matches HTML div elements whose class attribute is "post-summary". It then downloads the PowerShell team blog page and returns the summaries of all posts on it as plain text:

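# Match each div whose class is "post-summary" and capture its content.
# (?s) lets "." span line breaks; the lazy .*? stops at the first </div>.
$regex = [RegEx]'(?s)<div class="post-summary">(.*?)</div>'

# Download the raw HTML of the blog's front page
$url = 'http://blogs.msdn.com/b/powershell/'
$wc = New-Object System.Net.WebClient
$content = $wc.DownloadString($url)

# Output the captured group (the summary text) of every match
$regex.Matches($content) | ForEach-Object { $_.Groups[1].Value }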
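
To see what the pattern captures without depending on a live web page, you can run it against a small inline string. This is only a minimal sketch: the markup in $sample is made up for illustration and is not taken from the actual blog, and the (?s) option is added here so that a summary spanning several lines is still matched.

```powershell
# Hypothetical sample markup standing in for a downloaded page (an assumption)
$sample = @'
<div class="post-summary">First post summary.</div>
<div class="post-summary">Second summary,
spanning two lines.</div>
<div class="other">Not a summary.</div>
'@

# (?s) lets "." match line breaks; the lazy .*? stops at the first </div>
$regex = [RegEx]'(?s)<div class="post-summary">(.*?)</div>'

# Collect the first capture group of every match, trimmed of surrounding whitespace
$summaries = @($regex.Matches($sample) | ForEach-Object { $_.Groups[1].Value.Trim() })

$summaries
```

Only the two "post-summary" divs are returned; the div with a different class is ignored. Because the pattern stops at the first closing div tag, it would cut a summary short if that summary itself contained nested div elements.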
