Getting Started with Nokogiri and XML in Ruby

13 Jun 2012

Here's a short post on getting started with Nokogiri - a Ruby gem that wraps libxml2.

I'm writing this because, well, the Nokogiri docs kind of suck.

I wanted to read a simple XML document. My XPath fu was a little rusty, although all I wanted to do was read some attributes from a root element, some element values off of the root, and then a short collection of items (very similar to an Atom document).

My main bone of contention with the Nokogiri docs was their use of the @doc.xpath("//character") search operator at the very beginning of their parsing tutorial.

How about we start from the beginning:

Here is a sample XML document. Save this to your local disk, install the Nokogiri gem, and fire up IRB.

<Collection version="2.0" id="74j5hc4je3b9">
  <Name>A Funfair in Bangkok</Name>
  <PermaLink>Funfair in Bangkok</PermaLink>
  <PermaLinkIsName>True</PermaLinkIsName>
  <Description>A small funfair near On Nut in Bangkok.</Description>
  <Date>2009-08-03T00:00:00</Date>
  <IsHidden>False</IsHidden>
  <Items>
    <Item filename="AGC_1998.jpg">
      <Title>Funfair in Bangkok</Title>
      <Caption>A small funfair near On Nut in Bangkok.</Caption>
      <Authors>Anthony Bouch</Authors>
      <Copyright>Copyright © Anthony Bouch</Copyright>
      <CreatedDate>2009-08-07T19:22:08</CreatedDate>
      <Keywords>
        <Keyword>Funfair</Keyword>
        <Keyword>Bangkok</Keyword>
        <Keyword>Thailand</Keyword>
      </Keywords>
      <ThumbnailSize width="133" height="200" />
      <PreviewSize width="532" height="800" />
      <OriginalSize width="2279" height="3425" />
    </Item>
    <Item filename="AGC_1164.jpg" iscover="True">
      <Title>Bumper Cars at a Funfair in Bangkok</Title>
      <Caption>Bumper cars at a small funfair near On Nut in Bangkok.</Caption>
      <Authors>Anthony Bouch</Authors>
      <Copyright>Copyright © Anthony Bouch</Copyright>
      <CreatedDate>2009-08-03T22:08:24</CreatedDate>
      <Keywords>
        <Keyword>Bumper Cars</Keyword>
        <Keyword>Funfair</Keyword>
        <Keyword>Bangkok</Keyword>
        <Keyword>Thailand</Keyword>
      </Keywords>
      <ThumbnailSize width="200" height="133" />
      <PreviewSize width="800" height="532" />
      <OriginalSize width="3725" height="2479" />
    </Item>
  </Items>
</Collection>

From our IRB prompt - the first thing we'll do is require nokogiri.

>> require 'nokogiri'
=> true

Now let's load our XML document.

>> f = File.open("/path/to/the/collection.xml")
=> #<File:/path/to/the/collection.xml>
>> doc = Nokogiri::XML(f)
=> # You'll see the XML document output to the console.

The first thing we'd like to do is select the id attribute from the root. There are two ways you can do this.

>> doc.at_xpath("/*/@id")
=> #<Nokogiri::XML::Attr:0x3ff90e073644 name="id" value="74j5hc4je3b9">

This returns the XML attribute as a Nokogiri::XML::Attr (which inherits from Node). You can call .value, .text, or .inner_text on the returned object to retrieve the actual value. Notice we've used the at_xpath method to select the attribute: it returns the first match, or nil if nothing matches. xpath on its own returns a NodeSet (with just one member in this case).

The second way to get a root attribute is to select the root element first, using doc.root.

>> root = doc.root
=> # Again, you'll see the complete XML document output to the console.

Now we can access the id attribute using a convenient hash-style index notation, which returns the value immediately, or with an XPath attribute expression, which again returns an XML::Attr object from which we can retrieve the value.

>> root["id"]
=> "74j5hc4je3b9"

or

>> root.at_xpath("@id")
=> #<Nokogiri::XML::Attr:0x3ff90e073644 name="id" value="74j5hc4je3b9">

Since we're already positioned at the root element of the document, selecting elements beneath the root is easy.

>> root.at_xpath("Name")
=> #<Nokogiri::XML::Element:0x3ff90e072dfc name="Name" children=[#<Nokogiri::XML::Text:0x3ff90e072bf4 "A Funfair in Bangkok">]>

You can use root.at_xpath("Name").text to retrieve the text value, but only if you're absolutely sure the element is present. If it isn't, at_xpath returns nil and calling .text on it raises a NoMethodError (undefined method `text' for nil:NilClass).

Now let's select the items in our document, returning a NodeSet of items that we can iterate over.

>> items = root.xpath("Items/Item")
=> #You'll see the xml for our two items output to the console.
>> items.count
=> 2

We can select an attribute for an item using the convenient index style syntax, or a regular XPath select with the @ sign.

>> items[0]["filename"]
=> "AGC_1998.jpg"

And of course we can repeat and rinse with all of our element selectors, as well as move further down the structure of the document and select the keywords.

>> items[0].at_xpath("Title")
=> #<Nokogiri::XML::Element:0x3ff90e07e580 name="Title" children=[#<Nokogiri::XML::Text:0x3ff90e07e364 "Funfair in Bangkok">]>

Finally, there's the // XPath search operator, which matches elements with the given name at every level of the document. It's a very different use case, and for some reason it's the first one the Nokogiri parsing tutorial decided to present.

>> doc.xpath("//Keywords")
=> # Returns a NodeSet of every Keywords element in the document, at whatever level it appears. In the sample above that's two, one per Item.

Last but not least - we'll close our file.

>> f.close

Of course, the better way to do this in code is to pass a block to File.open (File.open(path) do |f| ... end), which ensures the file is closed as soon as the block exits, even if an exception is raised inside it.

And there you have it. Hope this helps anyone else who is using Nokogiri for the first time and would like to get started with very basic XPath queries to select attributes and elements from a simple XML document.

Comments

Thanks for this post! Was getting confused with Nokogiri's numerous selectors, but this wrapped it up pretty nicely!

Very useful article but wow are these alternating backgrounds distracting and annoying.

Ah - you mean the pictures on the site background. Yes you can pause those. And well, they're my pictures and I quite like the effect.

@JJ you do realize you can pause that right?