Automating “Live” Websites

by Oct 8, 2018

Occasionally, you need to automate tasks on websites that were opened manually, for example because you first have to log into internal web pages through a web form. Provided the website is displayed in Internet Explorer (not Edge or any 3rd-party browser), you can use a COM interface to access the live browser content.

This can even be valuable for plain “HTML-scraping” when you visit dynamic web pages. A pure WebClient (or the cmdlet Invoke-WebRequest) returns only the static HTML delivered by the server, which is not necessarily what users see in their browsers once scripts have run. When you use a real browser to show website content, your scripts can access the full HTML that drives the display.

To test-drive this, open Internet Explorer (Edge will not work with this approach), and navigate to a website of your choice. In our example, we navigate to www.powershellmagazine.com.

$obj = New-Object -ComObject Shell.Application
$browser = $obj.Windows() | 
    Where-Object FullName -like '*iexplore.exe' |
    # adjust the below to match your URL
    Where-Object LocationUrl -like '*powershellmagazine.com*' |
    # take the first browser that matches in case there are
    # more than one
    Select-Object -First 1

In $browser, you now have access to the object model of the live browser. If $browser is empty, make sure you adjusted the filter for LocationUrl in the code so it matches your URL. Do not forget the asterisks at both ends.
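Before working with the page, it can help to confirm that a window was actually found and that it has finished loading. The following is a minimal sketch, assuming $browser was set by the code above. Busy and ReadyState are properties of the Internet Explorer COM object; the 30-second timeout is an arbitrary value chosen for this example.

```powershell
# Assumes $browser was set by the lookup code above
if ($null -eq $browser) {
    Write-Warning 'No matching Internet Explorer window found. Check the LocationUrl filter.'
}
else {
    # ReadyState 4 means the document has finished loading
    $timeout = (Get-Date).AddSeconds(30)   # arbitrary limit chosen for this sketch
    while (($browser.Busy -or $browser.ReadyState -ne 4) -and (Get-Date) -lt $timeout) {
        Start-Sleep -Milliseconds 200
    }

    # Show the title and URL of the page that was found
    '{0} ({1})' -f $browser.LocationName, $browser.LocationUrl
}
```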

If you wanted to scrape all images off the website, this is how you would get the list of images:

 
$browser.Document.images | Out-GridView 
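If you are only interested in a few details per image, you can reduce the output to selected properties. This is a sketch that assumes $browser still holds the live browser window; src, alt, width, and height are standard properties of HTML image elements, and the choice of columns is just an example.

```powershell
# Assumes $browser holds the Internet Explorer window found earlier
$browser.Document.images |
    # skip image elements without a source URL
    Where-Object { $_.src } |
    Select-Object -Property src, alt, width, height |
    # list each image URL only once
    Sort-Object -Property src -Unique
```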

Likewise, if you wanted to scrape information off the website content, this line returns the page HTML:

 
PS> $browser.Document.body.innerHTML
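The page HTML is an ordinary string, so the usual .NET regex tools apply to it. The sketch below uses a small sample string in place of the live page HTML so it runs without a browser; the sample markup and the pattern are assumptions for illustration, and the pattern only handles double-quoted href attributes.

```powershell
# Sample HTML standing in for the HTML string returned by the live browser
$html = '<p>Read <a href="https://example.com/one">part one</a> and ' +
        '<a class="more" href="https://example.com/two">part two</a>.</p>'

# Capture the href value of each anchor tag (double-quoted attributes only)
$pattern = '<a\s[^>]*?href="([^"]*)"'

$links = [regex]::Matches($html, $pattern, 'IgnoreCase') |
    ForEach-Object { $_.Groups[1].Value }

$links
# https://example.com/one
# https://example.com/two
```

Regular expressions work well for simple, predictable markup like this. For deeply nested or irregular HTML, walking the document object model in $browser.Document is usually more robust.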

You could now use regular expressions to scrape content. There is one limitation, though: the login belongs to the browser session, not to PowerShell. If you need to perform additional actions in the context of the logged-in web visitor, such as downloading files that require a web login, you have to invoke them via the Internet Explorer object model.

You would not be able to use Invoke-WebRequest or another simple web client to download the file because PowerShell runs in its own context and does not share the browser session, so to the website it appears as an anonymous visitor.

Using the Internet Explorer object model to perform more advanced actions such as downloading files or videos is not impossible. It is just very complex because, essentially, you would need to send clicks and keystrokes to the user interface.
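The first step, a click on a page element, can be done through the document object model. The following is only a sketch under stated assumptions: $browser holds the live browser window, and '*.zip' is a hypothetical pattern for the link you want. Internet Explorer typically responds to a download link with its own prompt, which is not part of the page, so that prompt is where sending keystrokes to the user interface would come in.

```powershell
# Assumes $browser holds the Internet Explorer window found earlier.
# '*.zip' is a hypothetical pattern; adjust it to the link you are after.
$link = $browser.Document.links |
    Where-Object { $_.href -like '*.zip' } |
    Select-Object -First 1

if ($link) {
    # click() runs inside the browser, so the request is made
    # in the session of the logged-in user
    $link.click()
}
else {
    Write-Warning 'No matching link found on the page.'
}
```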
