Splitting Text Into Words

Apr 28, 2009

If you ever need to read a file and split its content into words, there are a couple of gotchas to keep in mind. First, remember that Get-Content returns the file line by line, as one string per line. To apply regular expressions or split operations to the entire text, you should first combine all lines into a single string, for example with Out-String:

$text = Get-Content k:\eula.1031.txt | Out-String

Out-String has one major disadvantage: it uses a fixed maximum line width, so words may be truncated. A better approach is the Join() method of the .NET String class:

$text = [String]::Join(' ', (Get-Content k:\eula.1031.txt))
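
To make the joining step concrete, here is a minimal sketch that writes a small sample file and reads it back. The file name, its location in the temp folder, and the sample lines are assumptions made up for this illustration.

```powershell
# Hypothetical sample file in the temp folder (name and content are illustrative only)
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'split-words-sample.txt'
Set-Content -Path $path -Value 'first line', 'second line', 'third line'

# Get-Content returns one string per line
$lines = Get-Content -Path $path
$lines.Count    # 3

# Join all lines into a single string, separated by one space
$text = [String]::Join(' ', $lines)
$text           # first line second line third line

# Clean up the sample file
Remove-Item -Path $path
```

Joining with a space instead of an empty string keeps the last word of one line from fusing with the first word of the next.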

Once you have the complete text stored in $text, you can then split it into words. Often, people use simple text split operations like this:

$words = $text.Split(" `t=".ToCharArray(), [StringSplitOptions]::RemoveEmptyEntries)

This uses a space, a tab, or an equals sign as word boundaries and removes empty entries. However, this approach is not very dependable because there are many more non-word characters to handle.
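
A quick way to see the problem is to run a simple split against a sentence that contains punctuation. The sample sentence below is an assumption chosen for illustration.

```powershell
# Illustrative sample text (not taken from a real file)
$sample = "Hello, world. Is x=1 (really)?"

# Split on space, tab, and equals sign only
$parts = $sample.Split(" `t=".ToCharArray(), [StringSplitOptions]::RemoveEmptyEntries)
$parts
# Hello,
# world.
# Is
# x
# 1
# (really)?
```

The comma, period, parentheses, and question mark all stay attached to the neighboring words, because none of them is in the separator list.
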
A better approach is to split with a regular expression, as in this example:

[regex]::Split($text, '[\s,.]') |
Where-Object { $_ -like 'a*' } |
Group-Object |
Sort-Object {$_.Name.Length} -descending

Here, any whitespace character, comma, or period separates words. This approach is still not perfect: other punctuation stays attached to words, and consecutive separators produce empty entries. A much better approach leaves it to the regular expression to identify word boundaries. Use Matches() instead of Split() to match explicit instances of words (\w+) delimited by word boundaries (\b):

[regex]::Matches($text, '\b\w+\b') |
ForEach-Object { $_.Value } |
Group-Object |
Sort-Object Count -descending |
Select-Object -first 10
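
The difference between the two regular expression approaches is easiest to see on a short string. This is only a sketch: the sample sentence and the variable names are assumptions for illustration.

```powershell
# Illustrative sample text (not taken from a real file)
$sample = 'The cat saw the dog. The dog saw the cat, too!'

# Split(): adjacent separators such as ". " and ", " yield empty strings,
# and the exclamation mark stays attached to the last word
$splitParts = [regex]::Split($sample, '[\s,.]')
$emptyCount = @($splitParts | Where-Object { $_ -eq '' }).Count
$emptyCount         # 2
$splitParts[-1]     # too!

# Matches(): only runs of word characters are returned
$words = [regex]::Matches($sample, '\b\w+\b') | ForEach-Object { $_.Value }
$words.Count        # 11
$words[-1]          # too

# Most frequent word (Group-Object compares case-insensitively by default)
$top = $words | Group-Object | Sort-Object Count -Descending | Select-Object -First 1
$top.Count          # 4
```

Because Group-Object ignores case by default, "The" and "the" fall into the same group, which is usually what you want when counting words.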