Scraping Quotes from WikiQuote

by Aug 24, 2017

Here is a fun script that may stop working whenever Wikiquote changes its page layout. It takes one or more topics of your choice and returns one or more random quotes scraped from the matching Wikiquote pages:

 
PS> Get-Quote 

Text                                                                                                 
----                                                                                                 
If you don't know anything about computers, just remember that they are machines that do exactly w...



PS> Get-Quote -Topics men

Text                                                                     Author                      
----                                                                     ------                      
But man is not made for defeat. A man can be destroyed but not defeated.  Ernest Hemingway (1899–1...



PS> Get-Quote -Topics jewelry
WARNING: Topic 'jewelry'  not found. Try a different one!

PS> Get-Quote -Topics jewel

Text                                                                                                 
----                                                                                                 
 Cynicism isn't smarter, it's only safer. There's nothing fluffy about optimism . … People have th...
 

The script below first downloads the page's HTML, then uses a regular expression to scrape the quotes from it. This of course only works if the quotes follow a pattern that the script can match. At the time of writing, all quotes on Wikiquote used this HTML scheme:

<li>Quote<ul><li>Author</li></ul></li>
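
To see how the pattern behaves in isolation, here is a minimal sketch that runs the script's regular expression against a small, made-up HTML fragment. The sample markup and the variable names $sampleHtml and $found are assumptions for illustration only; real Wikiquote pages are far larger and their markup may differ.

```powershell
# Hypothetical sample markup, invented for this illustration
$sampleHtml = '<ul><li>First quote<ul><li>Author A</li></ul></li>' +
              '<li>Second quote<ul><li>Author B</li></ul></li></ul>'

# Same pattern as in the script: group 1 is the quote, group 2 the author
$pattern = '(?im)<li>(.*?)<ul><li>(.*?)</li></ul></li>'

# Turn every match into an object with Text and Author properties
$found = [Regex]::Matches($sampleHtml, $pattern) | ForEach-Object {
    [PSCustomObject]@{
        Text   = $_.Groups[1].Value
        Author = $_.Groups[2].Value
    }
}

$found
```

Because both groups are lazy (.*?), each match stops at the first author list that follows a quote instead of swallowing several quotes at once.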

The code below searches for this pattern, then cleans up the text it finds: HTML tags such as links are removed, and runs of multiple spaces are collapsed into a single space. Both steps are handled by the nested function Remove-Tag.
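
The cleanup step can also be tried on its own. The sketch below is not the Remove-Tag function from the script; it is a shorter regex-based alternative under the hypothetical name Remove-TagSketch, and it assumes that no attribute value contains a ">" character. The input string is made up for illustration.

```powershell
# Hypothetical helper, not part of the original script
function Remove-TagSketch ([string]$Text)
{
    # Replace each tag with a space, collapse whitespace runs, trim the ends
    ($Text -replace '<[^>]*>', ' ' -replace '\s{2,}', ' ').Trim()
}

$clean = Remove-TagSketch 'A <a href="/wiki/Man">man</a> can   be <b>destroyed</b> but not defeated.'
$clean
```

Like Remove-Tag, this leaves a space where each tag was, so a tag directly followed by punctuation produces a stray space, as in the "optimism ." sample output above.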

Here is the code:

function Get-Quote ($Topics='Computer', $Count=1)
{
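    # strips HTML tags from $Text and collapses runs of whitespace
    function Remove-Tag ($Text)
    {
        $tagCount = 0
        $clean = -join $Text.ToCharArray().Foreach{
            switch($_)
            {
                '<'  { $tagCount++ }
                # a closing bracket ends the tag; emit a space in its place
                '>'  { $tagCount--; ' ' }
                # keep characters only while outside of any tag
                default { if ($tagCount -eq 0) {$_} }
            }
        }
        $clean -replace '\s{2,}', ' '
    }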

    $pattern = "(?im)<li>(.*?)<ul><li>(.*?)</li></ul></li>"
    
    foreach ($topic in $Topics)
    {
        $url = "https://en.wikiquote.org/wiki/$topic"
    
        try
        {
            $content = Invoke-WebRequest -Uri $url -UseBasicParsing -ErrorAction Stop
        }
        catch
        {
            # Windows PowerShell raises System.Net.WebException here; newer
            # PowerShell versions raise a different type, so catch them all
            Write-Warning "Topic '$topic' not found. Try a different one!"
            # skip this topic but keep processing the remaining ones
            continue
        }

        # remove line breaks so the pattern can match across lines
        $html = $content.Content.Replace("`n",'').Replace("`r",'')
        [Regex]::Matches($html, $pattern) |
        ForEach-Object {
            [PSCustomObject]@{
                Text = Remove-Tag $_.Groups[1].Value
                Author = Remove-Tag $_.Groups[2].Value
                Topic = $topic
            }
        } | Get-Random -Count $Count
    }
}



Get-Quote
Get-Quote -Topics Car
Get-Quote -Topics Jewel
Get-Quote -Topics PowerShell
