RubyGuides
Share this post!

N-gram Analysis for Fun and Profit

What would you do if you are given a big collection of text and you want to extract some meaning out of it? A good start is to break up your text into n-grams.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text.
– Wikipedia

For example:

If we take the phrase “Hello there, how are you?” then the unigrams (ngrams of one element) would be: "Hello", "there", "how", "are", "you", and the bigrams (ngrams of two elements): ["Hello", "there"], ["there", "how"], ["how", "are"], ["are", "you"].

If you learn better with images here is a picture of that.

ngram analysis

Downloading Sample Data

Before we can get our hands dirty we will need some sample data. If you don’t have any to work with you could download a few Wikipedia or blog articles. In this particular case, I decided to download some IRC logs from #ruby freenode’s channel.

The logs can be found here: irclog.whitequark.org/ruby

A note on data formats:

If a plain text version of the resource you want to analyze is not available, then you can use Nokogiri to parse the page and extract the data. In this case, the irc logs are available in plain text by appending .txt at the end of the url so we will take advantage of that.

This class will download and save the data for us:

require 'restclient'

class LogParser
  LOG_DIR  = 'irc_logs'

  def initialize(date)
    @date = date
    @log_name = "#{LOG_DIR}/irc-log-#{@date}.txt"
  end

  def download_page(url)
    return log_contents if File.exist? @log_name
    RestClient.get(url).body
  end

  def save_page(page)
    File.open(@log_name, "w+") { |f| f.puts page }
  end

  def log_contents
    File.readlines(@log_name).join
  end

  def get_messages
    page = download_page("http://irclog.whitequark.org/ruby/#{@date}.txt")
    save_page(page)
    page
  end
end

log = LogParser.new("2015-04-15")
msg = log.get_messages

This is a pretty straightforward class. We use RestClient as our HTTP client and then we save the results in a file so we don’t have to request them multiple times while we make modifications to our program.

Analyzing the data

Now that we have our data we can analyze it. Here is a simple Ngram class. In this class we use the Array#each_cons method which produces the ngrams.

Since this method returns an Enumerator we need to call to_a on it to get an Array.

class Ngram
  def initialize(input)
    @input = input
  end

  def ngrams(n)
    @input.split.each_cons(n).to_a
  end
end

Then we put everything together using a loop, Hash#merge! and Enumerable#sort_by:

# Filter words that appear less times than this
MIN_REPETITIONS = 20

total = {}

# Get the logs for the first 15 days of the month and return the bigrams
(1..15).each do |n|
  day = '%02d' % [n]
  total.merge!(get_trigrams_for_date "2015-04-#{day}") { |k, old, new| old + new }
end

# Sort in descending order
total = total.sort_by { |k, v| -v }.reject { |k, v| v < MIN_REPETITIONS }

total.each { |k, v| puts "#{v} => #{k}" }

Note: the get_trigrams_for_date method is not here for brevity, but you can find it on github.

This is what the output looks like:

112 => i want to
83  => link for more
82  => is there a
71  => you want to
66  => i don't know
66  => i have a
65  => i need to

As you can see wanting to do things is very popular in #ruby 🙂

Conclusion

Now it’s your turn, crack open your editor and start playing around with some n-gram analysis. Another way to see n-grams in action is the Google Ngram Viewer.

Natural language processing (NLP) can be a fascinating subject, wikipedia has a good overview of the topic.

You can find the complete code for this post here: https://github.com/matugm/ngram-analysis/blob/master/irc_histogram.rb