September 10 2004
Despite having been a significant
catalyst for the success of the world-wide-web, use of the humble 'ol
I'm as guilty as anyone for deriding tables as overly presentation specific
and not particularly relevant to semantic content, but on a recent project
I came to the realisation that the basic HTML table syntax is a perfect
lightweight format for encoding and extracting simple item based data, and
could be used like an analogue to RSS, with the added bonus that the raw
information can also be accessed visually by default in a web browser. The key to making this work is to exploit SAX
based parsing of the table elements, allowing a high degree of flexibility
from very simple code. SAX parsing is built in to the native PHP distribution
(see: xml_set_object) but
for this example, we will use the PEAR::XML_HTMLSax
library. To understand how this works, examine the structure of the following table: table.html: Note that this example doesn't demonstrate the use of This table markup is fairly linear, and could represent the structure of
a database table with the primary key being set as the id attribute on the
This structure seems fairly easy to encode, but to do anything useful with
it, we need to be able to extract each row with the class names and cell
data being mapped to associative key-value pairs. Enter the SAX Handler: TableReader.php: The way of the SAX handler might not be immediately intuitive if you're
used to thinking about transforming XML into arrays that directly represent
the tag structure. Pay close attention to the tagOpen()
method - it's the switch that will get repeatedly triggered when the parser
comes across the start of a tag declaration. In this case, there are only
two case conditions needed to handle the parser callbacks: the Now for the procedural punch, we'll actually put this object to use: As I mentioned above, the interesting thing about this approach is that
it can be viewed as tabular data in a web browser as well as being a 'pure'
data source. You could easily add more elements to the table to improve
the presentation too - as long as they don't contain id or class attributes,
they won't affect the way the table parses. On a recent PHP project I used this approach as the basis for archiving
an image collection consisting of jpg files and an associated database
table. The interface allowed users to make a selection of images, and
save this selection as a zip file. The content from the zip files had
to be able to be imported back into the system, essentially providing
a way to 'rebuild' a collection or share files between multiple collections. Encoding the metadata for each image in a table row made the archive
process very straightforward to implement. Each time an archive of image
files is generated, the HTML table is saved at the root level of the zip.
Importing the images is simply a matter of extracting and copying the
files out of the zip, and inserting the item data from the html table
into the database table. For a more standardized application, the metadata should actually be
encoded in the excif
headers of each jpg file. But I really like the transparency of storing
data in the HTML table format. If a user manually extracted the contents
of the zip, they could simply click on the HTML archive to view the contents
of the collection in their web browser. It makes an interesting addendum
to the concept that XML should be human and machine readable. My intention is not to suggest you should ignore semantic XML approaches
in favour of tabular data - in fact, if you follow Tantek's
advice, you should encode machine readable data in attributes, rather
than inside open/close tags, and you can of course use an declaration to display the XML data visually in the latest browsers
using CSS or XSL, as illustrated on Sam Ruby's Atom feed. So-o what is this good for? I think what makes the approach unique and
potentially useful is the flexibility of metadata fields- as long as each
The most significant aspect of this experiment wasn't the results achived,
it was the way that my previous assumptions and dogma about tables was overturned,
allowing me to see something old and familiar in a totally new light. I
think this is an essential aspect of programming and design - and perhaps
any learning. This Note Asides has been drived to the backroads with current conventions
of CSS presentation and structural markup. The rhetoric of semantic XHTML
has relegated tables to the secondary sounding status of presenting
tablular data. Sounds pretty boring, especially with the moves towards
XUL, XAML,
and standards like RSS,
RDF and Atom
becoming increasingly viable for distributing web content.
Title of Item 1
Lorem ipsum dolor sit amet.
Cicero
7-09-2004
Title of Item 2
Now is the winter of our discontent.
Richard
7-09-2004
or elements, which are now more commonly
used in semantic table mark-up. For the time being, we'll keep things as
simple as we possibly can.
, and the field names being set as class
attributes on each .
/**
* SAX handler to extract rows as 'items' from an HTML table
*/
class TableReader {
var $items = array( );
var $item = array( );
function TableReader( ) { }
/**
* returns extracted rows/items as an array
*/
function getItems( ) {
return $this->items;
}
/**
* gets the value of a class attribute
*/
function getClass($attrs) {
foreach($attrs as $name=>$value) {
if ($name == 'class') {
return $value;
} else {
return null;
}
}
}
/**
* open tag element handler
*/
function tagOpen(&$parser,$name,$attrs) {
switch($name) {
case 'tr':
$this->item['id'] = $attrs['id'];
break;
case 'td':
if ($class = $this->getClass($attrs)) {
$this->currentKey = $class;
}
break;
}
}
/**
* tag data element handler (reads contents of the tag)
*/
function tagData(&$parser,$data) {
if (isset($this->currentKey)) {
$this->item[$this->currentKey] = $data;
unset($this->currentKey);
}
}
/**
* close tag element handler
*/
function tagClose(&$parser,$name,$attrs) {
if ($name == 'tr') {
$this->items[] = $this->item;
$this->item = array();
}
}
}
,
which begins reading a new item, starting with the id of the current row,
and the , in which case the item properties get appended
to the items array then reset to start reading in the next row., which sets up the current class name ready to read in
the value data from the tag contents. Based on this, the tagData()
method appends the contents of the current to the appropriate
item array key and the tagClose() method checks if
the close tag is a // make the handler and parser objects
$handler =& new TableReader;
$parser =& new XML_HTMLSax;
// set the parser to callback on the TableReader
$parser->set_object($handler);
$parser->set_element_handler('tagOpen','tagClose');
$parser->set_data_handler('tagData');
// extract the HTML table to a string
$content = file_get_contents('table.html');
// parse the table content
$parser->parse($content);
// get the extracted data
$items = $table->getItems();
$items[0]['title']; // prints 'Title of Item 1'
$items[1]['author']; // prints 'Richard'
// loop it out
foreach($items as $item) {
echo $item['title'];
echo $item['description'];
}
is defined by an id attribute, the
SAX handler shown here could be used to extract, any key-value
pairs encoded in the - within the constraints
of the id/class attributes of course. Parsing a more complex table structure
(where the data entities are in vertical columns rather than horizontal
rows for example) would involve a bit more work for the element handlers.
Also note that I've left out a few conditions that would make the parsing
process more robust. In a production environment, you would want to pay
a bit more attention to this.