Scraping Quotes from WikiQuote

Aug 24, 2017

Here is a fun script that may not work forever, because it depends on the current layout of the Wikiquote pages. It takes one or more topics of your choice and returns one or more random quotes from the matching Wikiquote pages:

PS> Get-Quote 

If you don't know anything about computers, just remember that they are machines that do exactly w...

PS> Get-Quote -Topics men

Text                                                                     Author                      
----                                                                     ------                      
But man is not made for defeat. A man can be destroyed but not defeated.  Ernest Hemingway (1899–1...

PS> Get-Quote -Topics jewelry
WARNING: Topic 'jewelry'  not found. Try a different one!

PS> Get-Quote -Topics jewel

 Cynicism isn't smarter, it's only safer. There's nothing fluffy about optimism . … People have th...

The script below first downloads the HTML of the topic page and then uses a regular expression to scrape the quotes from it. This only works if the quotes follow a pattern that the script can rely on. At the time of writing, all quotes on Wikiquote used this HTML scheme:

<li>Quote<ul><li>Author</li></ul></li>

The code below searches for this pattern and then cleans up the text it finds: HTML tags such as links are removed, and runs of multiple spaces are collapsed into a single space. Both steps are handled by the nested function Remove-Tag.
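
Before looking at the full script, the matching step can be tried offline on a hand-written fragment. The following is only a minimal sketch: the `$sample` string is made up for illustration and merely mimics the list structure described above, and the pattern is the one the script uses.

```powershell
# Hypothetical sample: two quotes in the <li>quote<ul><li>author</li></ul></li> layout
$sample = '<ul><li>First quote<ul><li>Author A</li></ul></li>' +
          '<li>Second quote<ul><li>Author B</li></ul></li></ul>'

# Same pattern as in Get-Quote: group 1 captures the quote, group 2 the author
$pattern = '(?im)<li>(.*?)<ul><li>(.*?)</li></ul></li>'

# Turn every match into an object with one property per capture group
$found = [Regex]::Matches($sample, $pattern) | ForEach-Object {
    [PSCustomObject]@{
        Text   = $_.Groups[1].Value
        Author = $_.Groups[2].Value
    }
}

$found | Format-Table -AutoSize
```

Because both groups are lazy (`.*?`), each match stops at the first author list that follows a quote, so the two sample quotes come back as two separate objects rather than one long match.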

Here is the code:

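function Get-Quote ($Topics='Computer', $Count=1)
{
    # removes HTML tags from $Text and collapses runs of whitespace
    function Remove-Tag ($Text)
    {
        $tagCount = 0
        $text = -join $Text.ToCharArray().Foreach{
            switch ($_)
            {
                '<'  { $tagCount++ }
                '>'  { $tagCount--; ' ' }
                default { if ($tagCount -eq 0) {$_} }
            }
        }
        $text -replace '\s{2,}', ' '
    }

    # group 1 is the quote, group 2 is the author
    $pattern = "(?im)<li>(.*?)<ul><li>(.*?)</li></ul></li>"
    Foreach ($topic in $topics)
    {
        # assumption: the address is missing from the source text;
        # this is the usual English Wikiquote page URL
        $url = "https://en.wikiquote.org/wiki/$Topic"
        try
        {
            $content = Invoke-WebRequest -Uri $url -UseBasicParsing -ErrorAction Stop
        }
        catch
        {
            # untyped catch: the exception type differs between
            # Windows PowerShell and PowerShell 7
            Write-Warning "Topic '$Topic' not found. Try a different one!"
            continue
        }

        $html = $content.Content.Replace("`n",'').Replace("`r",'')
        [Regex]::Matches($html, $pattern) |
        ForEach-Object {
            [PSCustomObject]@{
                Text = Remove-Tag $_.Groups[1].Value
                Author = Remove-Tag $_.Groups[2].Value
                Topic = $Topic
            }
        } | Get-Random -Count $Count
    }
}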

Get-Quote -Topic Car
Get-Quote -Topic Jewel
Get-Quote -Topic PowerShell
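
The tag-stripping idea behind Remove-Tag can also be checked in isolation, without a web request. What follows is a sketch under stated assumptions rather than the script's own code: the helper name `Remove-TagSketch` and the sample string are hypothetical, and the loop is written as a plain `foreach` statement instead of the `.Foreach{}` method.

```powershell
# Hypothetical helper, deliberately named differently from the script's Remove-Tag
function Remove-TagSketch ([string]$Text)
{
    $depth = 0   # greater than 0 while the loop is inside a tag
    $kept = foreach ($char in $Text.ToCharArray())
    {
        switch ([string]$char)
        {
            '<'     { $depth++ }
            '>'     { $depth--; ' ' }                  # each tag leaves one space behind
            default { if ($depth -eq 0) { $char } }    # keep only text outside tags
        }
    }
    # join the kept characters, then collapse runs of whitespace into one space
    (-join $kept) -replace '\s{2,}', ' '
}

# Made-up input: a link plus extra spaces
$clean = Remove-TagSketch 'A <a href="/wiki/X">linked</a>   word'
$clean
```

For this input the function should return `A linked word`: the opening and closing link tags each turn into a single space, and the final replace merges them with the surrounding spaces.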

