$le-tex logo$

tokenized-to-tree

An XProc/XSLT Library For Patching Back Tokenization/Analysis Results Into Marked-up Text

Gerrit Imsieke (@gimsieke), le-tex publishing services (@letexml)

Problems addressed / Applications

Adding linguistic analysis information to .docx documents
Tagging a PDF’s line and page breaks in the TEI source document
Inserting links into tagged documents

… with a single, customizable, open-source XProc/XSLT library

Linguistic analysis of .docx files

Word file about Celts

Linguistic analysis of .docx files

linguistically anylzed file

Linguistic analysis of .docx files

linguistically anylzed file

Tagging a PDF’s line breaks in the TEI source document
Taken from: Uwe Johnson Werkausgabe. Ein Vorhaben der Berlin-Brandenburgischen Akademie der Wissenschaften an der Universität Rostock

Inserting links into tagged documents

See the synthetic ttt-linking-demo repo on Github

It links occurrences (in the paragraphs) of chapter titles to the respective chapters.

The similar real-life application (for Deutscher Apotheker Verlag) linked occurrences of an entry title to the primary entry in major reference works (thousands of pages).

→ see demo source

Inserting links into tagged documents

Step-by-step inspection of the linking scenario. Macroscopic steps:

p:xslt with custom prepare-target-list.xsl
ttt:prepare-input (library step with 3 XSLT passes)
p:xslt with custom find-candidates.xsl
ttt:process-paras (library step with 3 XSLT passes)
ttt:merge-results (library step with 3 XSLT pass)

→ demo (diff between input and linked output is on next slide)

Diff

the diff between sample input and linked output

Output of the bespoke `find-candidates.xsl` pass

A ttt:para element with the normalized input para and
the regex-matching-based tokenization result (indented for better fit into this slide)

Performance Detail: Converting token start/end milestones into spanning elements

Occurs as last XSLT pass in ttt:process-paras

“Upward projection” method for pulling up milestones to immediately beneath the paragraph
Only then may they be connected into spanning elements
Why is it a good idea to split up the input into paragraph-like units?
- Performance scales roughly with node count times splitting point count
- 10 paras with 20 elements and 4 milestones each ⇒ (10×20)×(10×4) = 8000 when processing as single chunk, 10×20×4 = 800 individually

Common properties of the three tasks

Deeply nested XML
Flat-string-based tokenization, often by regex matching
Enriching the tokens with analysis results:
- part of speech information
- page/line numbers from a PDF
- link targets
Merging the new token structure with the source XML

⇒ Overlapping markup

Interface of mark-linebreaks.xpl

(The PDF scenario)

Pulling up milestones is optional; not used here
Good showcase for multiple output ports in XProc

Details of mark-linebreaks.xpl

Advantages of XProc in orchestrating these pipelines

Functional language
Excellent encapsulation
Thereby great customizability (config files or overriding XSLT on input ports)
In contrast to other functional languages: Multiple “return values” (output ports) whose consumption can be deferred to when they are needed (ttt:prepare-input on the previous slide)
Wealth of other libraries, in particular for dealing with zipped XML formats such as .docx, IDML, or EPUB (for ex. docx2hub, hub2docx)

Thank you!

Library: github.com/transpect/tokenized-to-tree

DocBook linking sample application: github.com/gimsieke/ttt-linking-demo/

Paper: archive.xmlprague.cz/2018/files/xmlprague-2018-proceedings.pdf (p. 229)

Video: on the Youtube XML Prague channel

tokenized-to-tree

An XProc/XSLT Library For Patching Back Tokenization/Analysis Results Into Marked-up Text

Problems addressed / Applications

Linguistic analysis of .docx files

Linguistic analysis of .docx files

Linguistic analysis of .docx files

Tagging a PDF’s line breaks in the TEI source document Taken from: Uwe Johnson Werkausgabe. Ein Vorhaben der Berlin-Brandenburgischen Akademie der Wissenschaften an der Universität Rostock

Tagging a PDF’s line breaks in the TEI source document Taken from: Uwe Johnson Werkausgabe. Ein Vorhaben der Berlin-Brandenburgischen Akademie der Wissenschaften an der Universität Rostock

Tagging a PDF’s line breaks in the TEI source document Taken from: Uwe Johnson Werkausgabe. Ein Vorhaben der Berlin-Brandenburgischen Akademie der Wissenschaften an der Universität Rostock

Inserting links into tagged documents

Inserting links into tagged documents

Diff

Output of the bespoke find-candidates.xsl pass

Performance Detail: Converting token start/end milestones into spanning elements

Common properties of the three tasks

Interface of mark-linebreaks.xpl

Details of mark-linebreaks.xpl

Advantages of XProc in orchestrating these pipelines

Thank you!

Tagging a PDF’s line breaks in the TEI source document
Taken from: Uwe Johnson Werkausgabe. Ein Vorhaben der Berlin-Brandenburgischen Akademie der Wissenschaften an der Universität Rostock

Tagging a PDF’s line breaks in the TEI source document
Taken from: Uwe Johnson Werkausgabe. Ein Vorhaben der Berlin-Brandenburgischen Akademie der Wissenschaften an der Universität Rostock

Tagging a PDF’s line breaks in the TEI source document
Taken from: Uwe Johnson Werkausgabe. Ein Vorhaben der Berlin-Brandenburgischen Akademie der Wissenschaften an der Universität Rostock

Output of the bespoke `find-candidates.xsl` pass