If you use HTTPBuilder to crawl web pages and extract information, you will have noticed that it uses Groovy’s XML support to parse HTML. Groovy’s GPath is powerful, but HTML has something more powerful (not to mention simple, easy and intuitive) for selecting elements: CSS selectors. jQuery has proved that CSS selectors are indeed the best way to pick elements for DOM manipulation.

CSS selectors are available for Java through this library:

http://github.com/chrsan/css-selectors

I wrote a small facade class, CSSSelector, that exposes CSS selectors the Groovy way:

http://code.google.com/p/css-selector-httpbuilder/

Here’s an example:

import groovyx.net.http.CSSSelector
import groovyx.net.http.HTTPBuilder
import org.cyberneko.html.parsers.DOMParser
import org.xml.sax.InputSource

def http = new HTTPBuilder('http://www.google.com/')

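// replace the default text/html parser with one that wraps the DOM in a CSSSelector
http.parser.'text/html' = { resp ->
  DOMParser p = new DOMParser()
  def content = resp.getEntity().getContent()
  p.parse(new InputSource(content))
  // returning is optional in Groovy, but it makes the intent explicit
  return new CSSSelector(p.getDocument())
}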

def html = http.get(path: 'search', query: [q: 'groovy'])

// print the search result titles; this may stop working, as the markup of Google's result page keeps changing
html.'ol li h3 a'.each {
  println it.text()
}

// same titles as above, but collected into a list with the spread operator
println html.query('ol li h3 a')*.text()

The CSSSelector class has no dependencies on HTTPBuilder, so it can be used with any Groovy library that gives you a DOM document. If you are working in plain Java, you can use the css-selectors library on GitHub directly.
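
As a rough sketch of using CSSSelector without HTTPBuilder, the snippet below parses an HTML string with NekoHTML and queries the resulting document. The markup and the selector are made up for illustration. The snippet assumes the CSSSelector constructor, query() and text() behave as they do in the HTTPBuilder example above.

```groovy
import groovyx.net.http.CSSSelector
import org.cyberneko.html.parsers.DOMParser
import org.xml.sax.InputSource

// hypothetical markup, used only for this illustration
def markup = '''
<html><body>
  <ul>
    <li><a href="/first">First link</a></li>
    <li><a href="/second">Second link</a></li>
  </ul>
</body></html>
'''

// parse the string into a DOM document, with no HTTPBuilder involved
def parser = new DOMParser()
parser.parse(new InputSource(new StringReader(markup)))

// wrap the document, as the custom parser does in the example above
def html = new CSSSelector(parser.getDocument())

// print the text of every matched link
println html.query('ul li a')*.text()
```

Any other source of a DOM document should work the same way, provided it produces an org.w3c.dom.Document like the one DOMParser returns.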