HTML-Scraping with RegEx

by ps1, Oct 20, 2011

To scrape valuable information from websites with PowerShell, you can download the HTML code and then use regular expressions to extract what you are after. That's not hard. Here is a sample:

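# Download the page's HTML as a single string
$webclient = New-Object System.Net.WebClient
$html = $webclient.DownloadString('http://www.cnn.com')

# (?i) ignores case, (?s) lets . span line breaks, [^>]* allows attributes in the tag
$headerpattern = '(?is)<h1[^>]*>(.*?)</h1>'

# Collect the text captured by group 1 of every match
$header = ([regex]$headerpattern).Matches($html) |
  ForEach-Object { $_.Groups[1].Value }

$header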

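It downloads the HTML content from www.cnn.com and then extracts the text of every <h1>…</h1> header: the lazy group (.*?) captures whatever sits between each pair of tags, and Groups[1] returns it. That way, you get a quick headline overview.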
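To try the pattern without a network connection, here is a minimal sketch that runs it against a fixed string. The sample markup and the variable names $sampleHtml, $pattern and $headers are made up for illustration. The pattern is a slightly more tolerant variant that also accepts attributes inside the opening tag and headers that span several lines.

```powershell
# Hypothetical sample markup standing in for a downloaded page
$sampleHtml = @'
<html><body>
<h1>First headline</h1>
<p>Some text</p>
<H1 class="top">Second
headline</H1>
</body></html>
'@

# (?i) ignores case, (?s) lets . match line breaks,
# [^>]* tolerates attributes inside the opening tag
$pattern = '(?is)<h1[^>]*>(.*?)</h1>'

# Take capture group 1 of each match and collapse runs of whitespace
$headers = [regex]::Matches($sampleHtml, $pattern) |
  ForEach-Object { ($_.Groups[1].Value -replace '\s+', ' ').Trim() }

$headers
```

With this sample, $headers holds the two strings "First headline" and "Second headline". Regular expressions like this are fine for a quick overview, but they can miss or mangle content on pages with nested or malformed markup.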
