RubyGuides

Parsing HTML in Ruby

If you have ever tried to write a scraping tool you probably had to deal with parsing HTML. This task can be a bit difficult if you don’t have the right tools. Ruby has this wonderful library called Nokogiri, which makes HTML parsing a walk in the park.

Let’s see some examples.

First install the nokogiri gem with:  gem install nokogiri

Extracting the title

Then create the following script, which contains a basic HTML snippet that will be parsed by Nokogiri. The output will be the page title.

Extracting anchor links

So that was pretty easy, wasn’t it?

Well, it doesn’t get much harder than that. For example, if we want all the links from a page, we need to use the xpath method on the object we get back from Nokogiri. Then we can print the individual attributes of each tag or the text inside it:

And that’s it. As you may have already guessed, the xpath method uses the XPath query language. For more info on XPath, check out this link.

You can also use CSS selectors, which I find a lot easier to work with. You just need to replace the xpath method with the css method.

Example:

Note: The difference between at_css & css is that the first one only returns the first matched element, but the latter returns ALL matched elements.

To find the correct CSS selector you can use your browser’s developer tools.

Summary

In this post you learned about Nokogiri, a tool used to parse (make sense of) HTML source code. You also learned how to use it to extract data from the HTML, like the page’s title and its links.

For more on Nokogiri read the documentation here: http://www.rubydoc.info/github/sparklemotion/nokogiri