Anatomy of an OTMI File
From OpenTextMining
This section walks through an OTMI file using a standard rule notation, Augmented Backus-Naur Form (ABNF). The productions are collected together in Annex B. They define the abstract schema for OTMI, while a concrete implementation of OTMI using the Atom Entry Document serialization was presented in the previous section.
@@ this section needs OTMI XML examples along with the grammar @@
The following datatypes are used in this grammar: DATETIME, URI, and UTF-8.
DATETIME = ; see RFC 3339 [RFC3339], e.g. 2006-08-03T00:00:00Z
This (incomplete) rule defines the standard representation for timestamps on the Internet. The comment refers the reader to the relevant RFC for consultation.
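As an illustration of how a consumer might read a DATETIME value, here is a minimal Python sketch. The helper name parse_datetime is hypothetical and not part of OTMI. It handles the common form shown above, not every variant that RFC 3339 allows.

```python
from datetime import datetime, timezone


def parse_datetime(value: str) -> datetime:
    """Parse a timestamp such as 2006-08-03T00:00:00Z (hypothetical helper)."""
    # Older versions of datetime.fromisoformat() reject a trailing "Z",
    # so it is rewritten as an explicit UTC offset first.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


stamp = parse_datetime("2006-08-03T00:00:00Z")
print(stamp.isoformat())  # 2006-08-03T00:00:00+00:00
```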
URI = ; see RFC 3986 [RFC3986] for generic URI syntax
This (incomplete) rule defines the generic syntax for identifiers used on the Web – the familiar Uniform Resource Identifier, or URI. Again, the comment refers the reader to the relevant RFC.
UTF-8 = ; see RFC 3629 [RFC3629] for UTF-8 transformation format encoding
Finally, this (incomplete) rule defines UTF-8, the 8-bit Unicode (or UCS – Universal Character Set) Transformation Format, which is one of the standard encodings used to represent Unicode characters. It should be noted that UTF-8 is the default encoding for XML documents.
The body of every OTMI file is the data payload which can be defined as
data = stoplist section *( figure ) [ references ]
What this rule says is that an OTMI data payload consists of a word stoplist, followed by a text section, zero or more figure’s, and an optional references section.
The word stoplist is defined simply enough as
stoplist = URI
and is just a network reference (using a familiar URI Web identifier) to an external file listing the actual stopword list (if any) that has been applied. (The expectation would be that the URI be network retrievable. That is, one would probably anticipate a regular HTTP URI being used here, although this is not mandated. The important thing is that the stoplist be identified so that it can be acquired – or at least be known about – through whatever means are available to the user.)
The heart of any OTMI file, though, is the section block. This is more involved than it might appear at first glance, as section’s can be nested; section thus actually defines the root of a section hierarchy, or tree of section’s. The section block is essentially the core of OTMI and the production rule is
@@ This isn’t right – doesn’t properly account for leaf nodes. @@
section = 1*( section name ) / vectors snippets [ text ]
@@ Shouldn’t it be something like - well no @@
section = 1*( section name ) / section-body
section-body = vectors snippets [ text ]
This recursive rule states that section incorporates one or more section’s, each of which incorporates one or more section’s or has terminal content (or ‘leaf’ content, following the usual tree analogy) of word vectors and text snippets followed by an optional text block.
Note that OTMI section's are named. That is, each section has a name attribute. While the set of name’s is not fixed to any standard list, at least there is an expectation that all section’s have name’s – and hence be typed – and so made more amenable to further analysis.
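The tree shape described above can be sketched as a small Python data structure. This is only an illustration under assumptions: the class Section, its field names, and the section names used ("article", "abstract", "body", "methods") are hypothetical, and the grammar for section is itself still under discussion in this draft.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Section:
    """A named node: either nested sections or leaf content (hypothetical model)."""
    name: str
    children: List["Section"] = field(default_factory=list)
    vectors: Dict[str, int] = field(default_factory=dict)  # term -> count
    snippets: List[str] = field(default_factory=list)
    text: Optional[str] = None  # optional text block

    def is_leaf(self) -> bool:
        return not self.children


def leaf_names(section: Section) -> List[str]:
    """Collect the names of all leaf sections, depth first."""
    if section.is_leaf():
        return [section.name]
    names: List[str] = []
    for child in section.children:
        names.extend(leaf_names(child))
    return names


article = Section("article", children=[
    Section("abstract", vectors={"protein": 3}, snippets=["A snippet."]),
    Section("body", children=[
        Section("methods", vectors={"assay": 2}, snippets=["Another snippet."]),
    ]),
])
print(leaf_names(article))  # ['abstract', 'methods']
```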
Word vectors are defined as
vectors = split-regex 1*( vector )
which simply says that the vectors block contains the regular expression (or regex) split-regex that was used to split the text to generate the actual word vector’s. Following this is a list of one or more actual word vector’s.
The main contribution that OTMI delivers is a listing of word vector’s for a given piece of content. These are specified according to the following rule:
vector = term count
That is, a word vector consists of a term (commonly a word, but more generally known as a token) together with a count of the number of occurrences of that term within the text that has been processed. Processing of this text involves the removal of all structured markup (such as XML element tags), the replacement of any storage ‘entities’ (or in other cases maybe TeX macros, or other placeholders) that may be present, and the removal of any words that occur within any word stoplist that is being used. The productions for term and count are:
term = UTF-8
This rule for term defines it as being a generic UTF-8 string, while the rule for count
count = 1*( DIGIT )
defines it simply as being an integer number.
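The processing steps described above (strip markup, replace entities, split, drop stopwords, count) can be sketched in Python as follows. This is a rough illustration, not the NPG implementation: the function name word_vectors is hypothetical, the regex tag stripping stands in for a real XML parser, html.unescape stands in for proper entity replacement, and the lowercasing of terms is an assumption that the text above does not state.

```python
import html
import re
from collections import Counter


def word_vectors(marked_up: str, split_regex: str, stopwords: set) -> Counter:
    """Return term -> count for the given marked-up text (hypothetical helper)."""
    # Remove structured markup; a crude stand-in for a real XML parser.
    flat = re.sub(r"<[^>]+>", " ", marked_up)
    # Replace entities such as &amp; with the characters they denote.
    flat = html.unescape(flat)
    # Split with the declared split-regex; lowercasing is an assumption.
    terms = (t.lower() for t in re.split(split_regex, flat) if t)
    # Drop stopwords and count what remains.
    return Counter(t for t in terms if t not in stopwords)


vectors = word_vectors(
    "<p>The cell &amp; the <i>cell</i> wall</p>", r"\W+", {"the"}
)
print(dict(vectors))  # {'cell': 2, 'wall': 1}
```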
Text snippets are defined as
snippets = split-regex 1*( snippet )
This rule directly mirrors the vectors rule above. Again the snippets block contains the regular expression (or regex) split-regex that was used to split the text to generate the actual text snippet’s. Following this is a list of one or more actual text snippet’s.
An actual snippet is defined as
snippet = phrase
where phrase here denotes an arbitrary piece of text (typically UTF-8). Currently these are presented in the NPG implementation as complete sentences, but they could be presented as smaller units of text, such as phrases within sentences, if more obfuscation were required.
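As a rough illustration of how a split-regex might yield sentence-level snippets, consider the Python sketch below. The function name make_snippets and the particular regex are assumptions chosen for the example; the split-regex actually used in any given OTMI file is whatever that file declares.

```python
import re
from typing import List

# Hypothetical split-regex: break on whitespace that follows ., ! or ?
SENTENCE_SPLIT = r"(?<=[.!?])\s+"


def make_snippets(raw_text: str, split_regex: str) -> List[str]:
    """Split flattened text into non-empty snippets (hypothetical helper)."""
    return [p.strip() for p in re.split(split_regex, raw_text) if p.strip()]


print(make_snippets("Cells divide. Walls form quickly! Why?", SENTENCE_SPLIT))
# ['Cells divide.', 'Walls form quickly!', 'Why?']
```

A finer-grained regex (for example one that also splits on commas) would give phrase-level snippets instead of whole sentences.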
Text can be disclosed according to publisher preference either as raw-text (with XML markup removed) or as reduced-text (with XML markup removed and with a word stoplist applied), i.e.
text = raw-text / reduced-text
The text production allows full text to be made available at different levels of transparency for selected objects, e.g. to disclose first paragraphs, teasers, etc.
raw-text = UTF-8 ; flattened and cleaned (i.e. entities replaced) text from XML
reduced-text = UTF-8 ; 'raw-text' with all stopwords removed
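The relationship between raw-text and reduced-text can be sketched as below. This is an assumption-laden illustration: the function name reduce_text is hypothetical, and the choices to split on whitespace and to ignore case and punctuation when matching stopwords are not specified by the text above.

```python
import re


def reduce_text(raw_text: str, stopwords: set) -> str:
    """Derive reduced-text from raw-text by dropping stopwords (hypothetical)."""
    kept = []
    for word in raw_text.split():
        # Compare without punctuation and case (an assumption of this sketch).
        bare = re.sub(r"\W", "", word).lower()
        if bare not in stopwords:
            kept.append(word)
    return " ".join(kept)


print(reduce_text("The cell and the wall.", {"the", "and"}))  # cell wall.
```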
Text from figure’s is made available according to the following rule:
figure = title caption / title / caption
title = raw-text
caption = raw-text
That is, figure text comprises title and/or caption. Both title and caption are disclosed as text (i.e. raw-text or reduced-text).
@@ Why would we limit to figures and exclude other objects such as tables, etc.? Or should we rather define a generic object of which figure is a specific instance which we are choosing to use for expedience only? @@
The optional references element is defined as
references = 1*( ref-id ) refs-noid / 1*( ref-id ) / refs-noid
ref-id = URI ; ID for ref
refs-noid = 1*( DIGIT ) ; count of refs with no ID
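Read together, these rules suggest that references lists an identifier for each cited reference that has one, plus a count of those that do not. A minimal Python sketch of that reading follows; the function name summarize_references and the example.org URIs are hypothetical.

```python
from typing import List, Optional, Tuple


def summarize_references(ids: List[Optional[str]]) -> Tuple[List[str], int]:
    """Split cited references into (ref-id list, refs-noid count)."""
    ref_ids = [i for i in ids if i]
    refs_noid = len(ids) - len(ref_ids)
    return ref_ids, refs_noid


ref_ids, refs_noid = summarize_references(
    ["https://example.org/ref/a", None, "https://example.org/ref/b", None, None]
)
print(ref_ids, refs_noid)
# ['https://example.org/ref/a', 'https://example.org/ref/b'] 3
```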
Having looked at the contents of the data payload of an OTMI file, we can now turn to the overall contents of an OTMI file, which can be summarized by the following rule:
OTMI = key-data properties data
That is, an OTMI file consists of key-data, followed by properties, followed by the actual data payload as defined above. The rule for key-data defines key properties that should be disclosed by all OTMI files.
key-data = title id link published updated 1*( author ) rights
title = UTF-8
id = URI
link = href
href = URI
published = DATETIME
updated = DATETIME
author = name
name = UTF-8
rights = UTF-8
Note that author is currently specified only down to name level.
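For illustration, the key-data rule could be modelled as a simple Python record. The class name KeyData, its field names, and all of the example values are hypothetical. The only constraint enforced here is the 1*( author ) requirement of at least one author.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class KeyData:
    title: str
    id: str              # URI
    link: str            # href, a URI
    published: datetime  # DATETIME
    updated: datetime    # DATETIME
    authors: List[str]   # author = name
    rights: str

    def __post_init__(self) -> None:
        # key-data requires 1*( author ), i.e. at least one author.
        if not self.authors:
            raise ValueError("key-data requires at least one author")


example = KeyData(
    title="An example article",
    id="https://example.org/articles/1",
    link="https://example.org/articles/1.html",
    published=datetime(2006, 8, 3, tzinfo=timezone.utc),
    updated=datetime(2006, 8, 3, tzinfo=timezone.utc),
    authors=["A. Author"],
    rights="Copyright (c) Example Publisher",
)
```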
The rule for properties defines descriptive product-specific properties that may be disclosed by OTMI files.
properties = ; descriptive metadata specific to product
This (incomplete) rule defines properties as being a set of descriptive metadata specific to the product. This will, of course, vary from product type to product type. For example, for a journal article one might require the following properties (here defined in terms of PRISM – or ‘Publisher Requirements for Industry Standard Metadata’):
properties = publicationName volume number startingPage endingPage issn eIssn
where
publicationName = UTF-8
volume = UTF-8
number = UTF-8
startingPage = UTF-8
endingPage = UTF-8
issn = ISSN
eIssn = ISSN
ISSN = 4*4( DIGIT ) [ "-" ] 3*3( DIGIT ) ( DIGIT / "X" )
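The ISSN rule translates directly into a regular expression. The Python sketch below checks syntax only, as the rule does; it does not verify the ISSN check digit. The names ISSN_PATTERN and is_issn_syntax are hypothetical. Because quoted strings in ABNF are case-insensitive, a lowercase "x" is accepted as well.

```python
import re

# Four digits, an optional hyphen, three digits, then a digit or "X".
# ABNF quoted strings are case-insensitive, so "x" is accepted too.
ISSN_PATTERN = re.compile(r"[0-9]{4}-?[0-9]{3}[0-9Xx]")


def is_issn_syntax(value: str) -> bool:
    """Return True if value matches the ISSN production (syntax only)."""
    return ISSN_PATTERN.fullmatch(value) is not None


print(is_issn_syntax("0028-0836"))  # True
print(is_issn_syntax("1234-567X"))  # True
print(is_issn_syntax("0028-083"))   # False
```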
