If you have ever tried to write a scraping tool, you probably had to deal with parsing HTML. This task can be a bit difficult if you don’t have the right tools. Ruby has a wonderful library called Nokogiri, which makes HTML parsing a walk in the park.
Let’s see some examples.
First install the nokogiri gem with: gem install nokogiri
Then create the following script, which contains a basic HTML snippet that will be parsed by Nokogiri. The output will be the page title.
    require 'nokogiri'

    html = "<html><head><title>test</title></head><body>actual content here...</body></html>"

    parsed_data = Nokogiri::HTML.parse(html)
    puts parsed_data.title
    # prints: test
So that was pretty easy, wasn’t it?
Well, it doesn’t get much harder than that. For example, if we want all the links from a page, we need to use the xpath method on the object we get back from Nokogiri. Then we can print the individual attributes of the tag or the text inside the tags:
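    parsed_data = Nokogiri::HTML.parse(html)

    # Select every <a> tag that has an href attribute.
    anchor_tags = parsed_data.xpath("//a[@href]")

    # Assumes the page contains at least one link; otherwise first returns nil.
    puts anchor_tags.first[:href] + " " + anchor_tags.first.text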
And that’s it. As you may have already guessed, the xpath method uses the XPath query language.
You can also use CSS selectors, which I find a lot easier to work with. You just need to replace the xpath method with the css method.
    parsed_data = Nokogiri::HTML.parse(html)
    anchor_tag = parsed_data.at_css("a")
    puts anchor_tag.text
Note: The difference between at_css and css is that the first one only returns the first matched element, but the latter returns ALL matched elements.
To find the correct CSS selector, you can use your browser’s developer tools.
In this post you learned about Nokogiri, a tool used to parse (make sense of) HTML source code. You also learned how to use it to extract data from the HTML, like the page’s title and its links.
For more on Nokogiri read the documentation here: http://www.rubydoc.info/github/sparklemotion/nokogiri