If you ever need to read a file and split its content into words, there are a couple of gotchas to keep in mind. First, remember that Get-Content reads files line by line and returns one string per line. To apply regular expressions or split operations to the entire text, first combine all lines into a single string, for example by piping them to Out-String.
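As a minimal sketch of this first step, the snippet below compares the two results. The temp file name and its contents are made up purely so the example is self-contained; substitute your own file path.

```powershell
# Hypothetical sample file, created only for illustration
$path = Join-Path ([IO.Path]::GetTempPath()) 'sample-words.txt'
Set-Content -Path $path -Value 'alpha beta', 'gamma delta'

# Get-Content returns one string per line of the file
$lines = Get-Content -Path $path
$lines.Count

# Out-String merges the lines into a single string
$text = Get-Content -Path $path | Out-String
$text.GetType().Name
```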
Out-String has one major disadvantage: it uses a fixed maximum line width, so words may be truncated. A better approach is the Join() method of the .NET String class.
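A sketch of the Join() variant might look like this. Again, the sample file is hypothetical, and joining with a single space is an assumption; any separator that cannot occur inside a word would do.

```powershell
# Hypothetical sample file, created only for illustration
$path = Join-Path ([IO.Path]::GetTempPath()) 'sample-words.txt'
Set-Content -Path $path -Value 'alpha beta', 'gamma delta'

# @() guarantees an array even if the file has only one line;
# Join() then concatenates all lines, separated by a space
$text = [String]::Join(' ', @(Get-Content -Path $path))
$text
```

The -join operator, as in (Get-Content -Path $path) -join ' ', should give the same result.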
Once you have the complete text stored in $text, you can split it into words. Often, people use a simple split operation: the Split() method of the String class with a fixed list of separator characters.
Such a call might use a space, a tab, or an equals sign to identify word boundaries and remove empty entries. However, this approach is not very dependable, because there are many more non-word characters to handle, such as commas and periods.
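The simple split described above could be sketched as follows. The sample string is invented for illustration, and the explicit ToCharArray() call is a defensive choice, since newer .NET versions also offer a Split() overload that treats a string argument as one multi-character separator.

```powershell
# Invented sample text containing spaces, a tab, an equals sign, and a comma
$text = "alpha beta`tgamma=delta  epsilon,zeta"

# Split on space, tab, or '=' and drop the empty entries
# that consecutive separators would otherwise produce
$separators = " `t=".ToCharArray()
$words = $text.Split($separators, [StringSplitOptions]::RemoveEmptyEntries)
$words
```

Note that "epsilon,zeta" comes back as one entry because the comma is not in the separator list, which is exactly the weakness of this approach.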
A better approach is to split with a regular expression, as in this example:
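# Split on whitespace, commas, or dots; keep words starting with "a", longest first
[regex]::Split($text, '[\s,.]') |
  Where-Object { $_ -like 'a*' } |
  Group-Object |
  Sort-Object { $_.Name.Length } -Descending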
Here, any whitespace character, comma, or dot is used to separate words. Still, this approach is not perfect, because every other non-word character stays attached to the words. A much better approach leaves it to the regular expression engine to identify word boundaries. Use Matches() instead of Split() to match explicit instances of words (\w+) delimited by word boundaries (\b):
# Match whole words and list the ten most frequent ones
[regex]::Matches($text, '\b\w+\b') |
  ForEach-Object { $_.Value } |
  Group-Object |
  Sort-Object Count -Descending |
  Select-Object -First 10