Posters

[1] W3C Recommendation for XQuery Full-Text and its Implementation in Enterprise Database Systems

Dušan Petković, University of Applied Sciences Rosenheim

In March 2011, W3C released the final specification for XQuery full-text search, called XQuery and XPath Full Text 1.0. XQuery is the standard language for querying XML documents. It is developed by the XML Query Working Group and is designed to query sources with XML content, including both databases and documents. A full-text query searches for text-based information using terms and special operators. In other words, the search runs against documents, and the returned results are again documents.

In this paper, we give an overview of the W3C Recommendation for XQuery and XPath Full Text. After that, we show to what extent this specification is implemented in the enterprise database systems IBM DB2, Oracle and MS SQL Server. In addition, we show the non-standard functions that are implemented in some of these database systems and how these functions relate to the standard ones.

http://www.fh-rosenheim.de/~petkovic

[2] User-land parallel processing of XSLT

Volodymyr Mykhailyk, IntelliArts Consulting

If you need to process large amounts of XML data faster, you have two choices: optimize the transformation or process documents in parallel.

This poster aims to:

Show which approaches can be used to transform small and big documents in parallel.
Present a custom solution with multithreading support implemented on top of Saxon-HE.
Compare the user-land implementation with built-in solutions.
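
The document-level approach can be sketched as follows. This is only an illustration under stated assumptions: it is written in Python rather than on top of Saxon-HE, the `transform` function is a hypothetical stand-in for a real XSLT transformation, and `transform_all` is an invented helper name. It shows the fan-out of many independent documents over a pool of workers, not the solution presented in the poster.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def transform(xml_text):
    """Hypothetical stand-in for an XSLT step: rename every <item> to <entry>."""
    root = ET.fromstring(xml_text)
    # Materialize the matches first so that renaming cannot disturb iteration.
    for elem in list(root.iter("item")):
        elem.tag = "entry"
    return ET.tostring(root, encoding="unicode")


def transform_all(documents, workers=4):
    """Transform independent documents concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(transform, documents))


docs = [
    "<list><item>1</item></list>",
    "<list><item>2</item><item>3</item></list>",
]
results = transform_all(docs)
```

Note that threads in CPython do not run pure-Python code on several cores at once, so this sketch shows the structure of the approach rather than a speed-up. On the JVM, where Saxon-HE runs, worker threads can execute transformations truly in parallel.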

[3] An Embedded GATE XProc Pipeline for Publishing Scottish Building Regulations as Open Linked Data

Lewis McGibbney, Glasgow Caledonian University, Bimal Kumar, Glasgow Caledonian University

In recent years, both socially and politically, our perceptions of how legislation is produced, authored, amended and published have changed beyond all recognition. The fundamental driver behind this global paradigm shift is the need to increase the openness and transparency of our democratic states, whilst embracing our digital world by linking past, present and future segments of our society. A huge part of this process involves making Government information more accessible to its users. The planning procedures in Scotland are designed to control positive change within towns and cities, but also to protect local historic buildings and areas of natural beauty, making sure that the land is developed in everyone's long-term interest. In practice, however, the task of ensuring compliance is blurred by the extremely complex, highly subjective, performance-based legislation produced as a result of dated drafting workflows.

This research proposes a novel methodology for authoring and publishing Scottish Statutory Instruments, in particular the Scottish Technical Standards, as Open Linked Data for widespread use at local and national level, by professionals and the public alike. It describes an XProc-managed authoring workflow for developing an RDF/XML-based version of the Scottish Technical Standards, modelled on ongoing work at Legislation.gov.uk: the most successful global implementation of Open Legislation. The processing pipeline has three phases: (1) data cleaning and mapping of legacy elements to Crown Legislation Markup Language; (2) resolving identifiers and annotating data with GATE, through entity extraction based upon international construction dictionary definitions and industry-defined terminology; (3) validation against Crown Legislation Schemas prior to RDF/XML transformation.

This research could be regarded as the initial stage of a semantic annotation process, with the ultimate goal of opening up the information in these documents as semantically rich open datasets, which will make their usage much more transparent and meaningful.

Additional information, including a brief outline of the primary author's field of research, can be found on his homepage at the link below.

http://www.gcu.ac.uk/ebe/staff/researchstudents/lewisjohnmcgibbney/

[4] Schematron Testing Framework

Tony Graham, Mentea

A suite of Schematron tests contains many contexts where a bug in a document will make a Schematron assert fail or a report succeed, so it follows that for any new test suite and any reasonably sized but buggy document set, there will straight away be many assert and report messages produced by the tests. When that happens, how can you be sure your Schematron tests all worked as expected? How can you separate the expected results from the unexpected? What’s needed is a way to characterise the Schematron tests before you start as reporting only what they should, no more, and no less.

stf is an XProc pipeline that runs a Schematron test suite on test documents (that you create), winnows out the expected results and reports just the unexpected. stf uses a processing instruction (PI) in each of a set of (typically small) test documents to indicate the test’s expected asserts and reports: the expected results are ignored, and all you see is what’s extra or missing. And when you have no more unexpected results from your test documents, you’re ready to use the Schematron on your real documents.
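
The winnowing idea can be sketched in a few lines of Python. This is a minimal illustration, not stf itself: the PI target `expect`, its space-separated list of message identifiers, and the function names are all assumptions made for this example, and stf's actual PI syntax is not reproduced here.

```python
from collections import Counter
from xml.dom import minidom


def expected_from_pi(xml_text, target="expect"):
    """Collect expected message ids from a (hypothetical) <?expect ...?> PI."""
    doc = minidom.parseString(xml_text)
    expected = Counter()
    for node in doc.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == target:
            expected.update(node.data.split())
    return expected


def winnow(expected, actual_ids):
    """Return (extra, missing): results that fired unexpectedly or failed to fire."""
    actual = Counter(actual_ids)
    extra = actual - expected    # fired, but not listed as expected
    missing = expected - actual  # listed as expected, but did not fire
    return extra, missing


test_doc = "<?expect a1 r2?><doc/>"
# Ids that a Schematron run is assumed to have produced for this document.
extra, missing = winnow(expected_from_pi(test_doc), ["a1", "a3"])
```

Using multisets rather than plain sets means that a message expected once but produced twice still shows up as extra.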

http://inasmuch.as/2011/12/21/schematron-testing-framework/

[5] In-Memory Representations of XML Documents with Low Memory Footprint

Stelios Joannou, University of Leicester, Rajeev Raman, University of Leicester

The SiXML initiative at the University of Leicester aims at using succinct, or space-efficient, data structures to reduce the memory footprint needed to hold and process XML documents in memory. Unlike normal data compression, SiXML's in-memory representation can often be manipulated in a similar manner to a standard XML document representation with very little slowdown. For example, SiXDOM, an implementation of DOM, has a memory footprint typically smaller than the XML file on disk, and is able to support all the operations in DOM Core Level 1-3 except those that involve modification/updates to the tree. Tests show that SiXDOM 1.0 is only about 1.8 times slower than Xerces, and typically uses memory amounting to less than 50% of the file size, while Xerces-C uses 500% of the file size.

We report on the latest release of SiXDOM (SiXDOM 1.2), which adds the following features over the previous version (SDOM 1.0) presented at XML Prague 2010:

Fast, memory-efficient parsing, using the Xerces-C and EXPAT SAX parsers.
Cross-platform support for languages such as Java and TCL, using interface bindings.
Upgrade to 64-bit; it has already been successfully tested on documents four times larger than the largest document parsed in SDOM 1.0.

We also report on current research in succinct data structures with a view towards improving XML representations. These include dynamizing succinct representations - allowing us to efficiently make changes to the underlying data structures - which will eventually lead to support for the dynamic operations in DOM, and experiments with operating on the XML document represented as a directed acyclic graph obtained by sharing identical subtrees.
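
The idea of sharing identical subtrees can be illustrated with a short Python sketch. It is not the SiXML data structure: the function name `share_subtrees` is invented for this example, the sketch uses an ordinary dictionary rather than a succinct representation, and it ignores mixed-content tail text for brevity. It only shows how a tree collapses into a directed acyclic graph when structurally equal subtrees receive the same node id.

```python
import xml.etree.ElementTree as ET


def share_subtrees(root):
    """Assign one id per distinct subtree; return (root_id, tree_nodes, dag_nodes)."""
    table = {}  # structural key -> DAG node id
    total = 0   # number of element nodes in the original tree

    def visit(elem):
        nonlocal total
        total += 1
        # Children are visited first, so their ids are already canonical.
        child_ids = tuple(visit(child) for child in elem)
        key = (
            elem.tag,
            (elem.text or "").strip(),
            tuple(sorted(elem.attrib.items())),
            child_ids,
        )
        # Identical subtrees produce identical keys and thus share one id.
        return table.setdefault(key, len(table))

    root_id = visit(root)
    return root_id, total, len(table)


tree = ET.fromstring("<r><a><b/></a><a><b/></a><c/></r>")
root_id, tree_nodes, dag_nodes = share_subtrees(tree)
```

In this small document the two `<a><b/></a>` subtrees are stored once, so six tree nodes become four DAG nodes. The saving grows with the amount of repeated structure in the document.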

[6] Defining an (XML?) Vocabulary for Heterogeneous Audiences – Pains and Benefits

Felix Sasaki, DFKI

Web-focused vocabularies are being made available in more and more formats like RDFa, Microdata or Microformats. These examples are only the tip of the iceberg. In the deep Web the same vocabulary may be available as XML, within relational databases, proprietary CMS storage etc. The situation is complicated by different usages for the data: publishing on the Web, analysis, merging with other (non Web) data sources etc.

We report on a project that aims at defining a metadata vocabulary to be available on the Web (e.g. as RDFa), in various deep Web, XML-based and other formats, and as part of different workflows. The purpose of the metadata vocabulary is to enhance language-related technologies (e.g. machine translation) and workflows (e.g. translation and localization).

XML plays a crucial role as a format for producing content. The challenge for us is to convince XML authors and tool developers to take the pain of implementing awareness of the metadata, even if it is not consumed in the XML tool chain. We argue that making the effort of bridging to different data formats will be beneficial for XML authors too: they can reach out to new (global) audiences and customers.