Extracting Tabular Data from HTML Tables

Despite having been a significant catalyst for the success of the world-wide-web, use of the humble 'ol table has been driven to the backroads with current conventions of CSS presentation and structural markup. The rhetoric of semantic XHTML has relegated tables to the secondary sounding status of presenting tablular data. Sounds pretty boring, especially with the moves towards XUL, XAML, and standards like RSS, RDF and Atom becoming increasingly viable for distributing web content.

I'm as guilty as anyone for deriding tables as overly presentation specific and not particularly relevant to semantic content, but on a recent project I came to the realisation that the basic HTML table syntax is a perfect lightweight format for encoding and extracting simple item based data using HTML class attributes to define metadata. Such a format could be used like an analogue to RSS, with the added bonus that the raw information can also be accessed visually by default in a web browser.

The key to making this work is to exploit SAX based parsing of the table elements, allowing a high degree of flexibility from very simple code. SAX parsing is built in to the native PHP distribution (see: xml_set_object) but for this example, we will use the PEAR::XML_HTMLSax library.

To understand how this works, examine the structure of the following table:

<!-- Each row represents an 'item' of key-value pairs -->
<table>
	<tr id="1">
		<td class="title">Title of Item 1</td>
		<td class="description">Lorem ipsum dolor sit amet.</td>		
		<td class="author">Cicero</td>		
		<td class="date">7-09-2004</td>		
	</tr>
	<tr id="2">
		<td class="title">Title of Item 2</td>
		<td class="description">Now is the winter of our discontent.</td>		
		<td class="author">Richard</td>		
		<td class="date">7-09-2004</td>		
	</tr>
</table>

Note that this example doesn't demonstrate the use of table header or table body elements, which are now more commonly used in semantic table mark-up. For the time being, we'll keep things as simple as we possibly can.

This table markup is fairly linear, and could represent the structure of a database table with the primary key being set as the id attribute on the table row, and the field names being set as class attributes on each table cell.

This structure seems fairly easy to encode, but to do anything useful with it, we need to be able to extract each row with the class names and cell data being mapped to associative key-value pairs. Enter the SAX Handler:

/**
 * SAX handler to extract rows as 'items' from an HTML table
 */
class TableReader {
	var $items = array();
	var $item = array();

	/**
	 * returns extracted rows/items as an array
	 */
	function getItems() {
		return $this->items;
	}

	/**
	 * gets the value of a class attribute
	 */
	function getClass($attrs) {
		foreach($attrs as $name=>$value) {
			if ($name == 'class') {
				return $value;
			} else {
				return null;
			}
		}
	}

	/**
	 * open tag element handler
	 */
	function tagOpen(&$parser, $name, $attrs) {
		switch($name) {
			case 'tr':
				$this->item['id'] = $attrs['id'];
				break;
			case 'td':
				if ($class = $this->getClass($attrs)) {
					$this->currentKey = $class;
				}
				break;
			}
		}
	}

	/**
	 * tag data element handler (reads contents of the tag)
	 */
	function tagData(&$parser, $data) {
		if (isset($this->currentKey)) {
			$this->item[$this->currentKey] = $data;
			unset($this->currentKey);
		}
	}

	/**
	 * close tag element handler
	 */
	function tagClose(&$parser, $name, $attrs) {
		if ($name == 'tr') {
			$this->items[] = $this->item;
			$this->item = array();
		}
	}
}

The way of the SAX handler might not be immediately intuitive if you're used to thinking about transforming XML into arrays that directly represent the tag structure. Pay close attention to the tagOpen() method - it's the switch that will get repeatedly triggered when the parser comes across the start of a tag declaration. In this case, there are only two case conditions needed to handle the parser callbacks: the table row, which begins reading a new item, starting with the id of the current row, and the table cell, which sets up the current class name ready to read in the value data from the tag contents. Based on this, the tagData() method appends the contents of the current table cell to the appropriate item array key and the tagClose() method checks if the close tag is a table row, in which case the item properties get appended to the items array then reset to start reading in the next row.

Now for the procedural punch, we'll actually put this object to use:

// make the handler and parser objects
$handler = &new TableReader;
$parser = &new XML_HTMLSax;

// set the parser to callback on the TableReader
$parser->set_object($handler);
$parser->set_element_handler('tagOpen','tagClose');
$parser->set_data_handler('tagData');

// extract the HTML table to a string
$content = file_get_contents('table.html');

// parse the table content
$parser->parse($content);

// get the extracted data
$items = $table->getItems();

$items[0]['title']; // prints 'Title of Item 1'
$items[1]['author']; // prints 'Richard'

// loop it out
foreach($items as $item) {
	echo $item['title'];
	echo $item['description'];
}

As I mentioned above, the interesting thing about this approach is that it can be viewed as tabular data in a web browser as well as being a 'pure' data source. You could easily add more elements to the table to improve the presentation too - as long as they don't contain id or class attributes, they won't affect the way the table parses.

On a recent PHP project I used this approach as the basis for archiving an image collection consisting of jpg files and an associated database table. The interface allowed users to make a selection of images, and save this selection as a zip file. The content from the zip files had to be able to be imported back into the system, essentially providing a way to 'rebuild' a collection or share files between multiple collections.

Encoding the metadata for each image in a table row made the archive process very straightforward to implement. Each time an archive of image files is generated, the HTML table is saved at the root level of the zip. Importing the images is simply a matter of extracting and copying the files out of the zip, and inserting the item data from the html table into the database table.

For a more standardized application, the metadata should actually be encoded in the excif headers of each jpg file. But I really like the transparency of storing data in the HTML table format. If a user manually extracted the contents of the zip, they could simply click on the HTML archive to view the contents of the collection in their web browser. It makes an interesting addendum to the concept that XML should be human and machine readable.

My intention is not to suggest you should ignore semantic XML approaches in favour of tabular data - in fact, if you follow Tantek's advice, you should encode machine readable data in attributes, rather than inside open/close tags, and you can of course use an declaration to display the XML data visually in the latest browsers using CSS or XSL, as illustrated on Sam Ruby's Atom feed.

So what is this good for? I think what makes the approach unique and potentially useful is the flexibility of metadata fields- as long as each table row is defined by an id attribute, the SAX handler shown here could be used to extract, any key-value pairs encoded in the table cells - within the constraints given by the id/class attributes of course. Parsing a more complex table structure (where the data entities are in vertical columns rather than horizontal rows for example) would involve a bit more work for the element handlers. Also note that I've left out a few conditions that would make the parsing process more robust. In a production environment, you would want to pay a bit more attention to this.

Overall, the most significant aspect of this experiment for me wasn't the results achieved, it was the way that my previous assumptions and dogma about tables was overturned, allowing me to see something old and familiar in a totally new light. I think this is an essential aspect of programming and design - and perhaps any learning.