RubyGuides
Share this post!

N-gram Analysis for Fun and Profit

What would you do if you are given a big collection of text and you want to extract some meaning out of it? A good start is to break up your text into n-grams.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text.
– Wikipedia

For example:

If we take the phrase “Hello there, how are you?” then the unigrams (ngrams of one element) would be: "Hello", "there", "how", "are", "you", and the bigrams (ngrams of two elements): ["Hello", "there"], ["there", "how"], ["how", "are"], ["are", "you"].

If you learn better with images here is a picture of that.

ngram analysis

Downloading Sample Data

Before we can get our hands dirty we will need some sample data. If you don’t have any to work with you could download a few Wikipedia or blog articles. In this particular case, I decided to download some IRC logs from #ruby freenode’s channel.

The logs can be found here: irclog.whitequark.org/ruby

A note on data formats:

If a plain text version of the resource you want to analyze is not available, then you can use Nokogiri to parse the page and extract the data. In this case, the irc logs are available in plain text by appending .txt at the end of the url so we will take advantage of that.

This class will download and save the data for us:

This is a pretty straightforward class. We use RestClient as our HTTP client and then we save the results in a file so we don’t have to request them multiple times while we make modifications to our program.

Analyzing the data

Now that we have our data we can analyze it. Here is a simple Ngram class. In this class we use the Array#each_cons method which produces the ngrams.

Since this method returns an Enumerator we need to call to_a on it to get an Array.

Then we put everything together using a loop, Hash#merge! and Enumerable#sort_by:

Note: the get_trigrams_for_date method is not here for brevity, but you can find it on github.

This is what the output looks like:

As you can see wanting to do things is very popular in #ruby 🙂

Conclusion

Now it’s your turn, crack open your editor and start playing around with some n-gram analysis. Another way to see n-grams in action is the Google Ngram Viewer.

Natural language processing (NLP) can be a fascinating subject, wikipedia has a good overview of the topic.

You can find the complete code for this post here: https://github.com/matugm/ngram-analysis/blob/master/irc_histogram.rb