XTranscript

XTranscript is an online tool for converting transcripts saved in mainstream document formats (such as Microsoft Word, Open Document, PDF or TXT) into a lightweight XML format.

It has been developed in the field of linguistics to enable combined qualitative and quantitative studies of spoken language. Converting transcripts into XML allows for powerful and mature XML processing tools, such as XPath and XQuery, to be used to search or summarise features of the transcripts.

Upload and convert your transcripts to XML

Features

XTranscript currently offers two configurations for the conversion process:

Basic: Utterances will be detected and the text will be tokenised (split into words).
Conversation Analysis notation: In addition to the utterances, Jefferson notation (and a few known extensions) will be identified and recorded in XML elements.
- View the CA-as-XML schema
- See our guide to querying the XML using XPath or XQuery

Part-of-speech (grammatical) tagging can also be performed for English texts using the Stanford CoreNLP library. The Stanford tagger uses the Penn Treebank tag-set; a summary of the tag-set is available.

If you are interested in converting your own notation, please get in touch and we'll see what we can do.

Results of Conversion

Given the text:

    1   BOB:    he's pretty special
    2           still beat everybody else
    3   SALLY:  yeah
    ...

The resulting XML file will have the structure:

    <transcript file="example.doc">
        <body>
            <u who="BOB" n="1" line="1">
                he 's pretty special
                still beat everybody else
            </u>
            <u who="SALLY" n="2" line="3">
                yeah
            </u>
            ...
        </body>
    </transcript>

Each utterance has a who attribute identifying the speaker and an n attribute giving its number. Where possible, a line attribute records the line number from the original file.
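
As a minimal sketch of how such a file might be read outside XTranscript, the following Python uses the standard library's xml.etree.ElementTree to list utterances, optionally filtered by speaker. The sample string mirrors the structure above with the "..." placeholders left out; the function name utterances is an assumption made for illustration and is not part of XTranscript.

```python
import xml.etree.ElementTree as ET

# Sample mirroring the structure shown above (placeholders omitted).
SAMPLE = """<transcript file="example.doc">
    <body>
        <u who="BOB" n="1" line="1">
            he 's pretty special
            still beat everybody else
        </u>
        <u who="SALLY" n="2" line="3">
            yeah
        </u>
    </body>
</transcript>"""

root = ET.fromstring(SAMPLE)


def utterances(root, speaker=None):
    """Return (n, who, line, text) tuples, optionally for one speaker only."""
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    path = ".//u" if speaker is None else f".//u[@who='{speaker}']"
    rows = []
    for u in root.findall(path):
        # Collapse the indentation and line breaks inside the utterance.
        text = " ".join("".join(u.itertext()).split())
        rows.append((u.get("n"), u.get("who"), u.get("line"), text))
    return rows


for row in utterances(root):
    print(row)
```

Calling utterances(root, "SALLY") returns only Sally's utterances. A full XPath or XQuery processor (such as BaseX) offers far more expressive queries than this subset.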

With part-of-speech tagging turned on, word elements will also be included:

    <transcript file="example.doc">
        <body>
            <u who="BOB" n="1" line="1">
                <w n="1.1" pos="PRP" lemma="he">he</w>
                <w n="1.2" pos="VBZ" lemma="be">'s</w>
                <w n="1.3" pos="RB" lemma="pretty">pretty</w>
                <w n="1.4" pos="JJ" lemma="special">special</w>
                ...
            </u>
            ...
        </body>
    </transcript>

Note that only the transcript text is stored as element content in the XML; all other data is stored in attributes. This makes extracting textual content easier and also allows the XML files to be used in corpus linguistics software (such as AntConc and WordSmith).
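
The sketch below illustrates this separation, assuming a tagged file shaped like the example above (again with the "..." placeholders removed). It counts part-of-speech tags from the pos attributes and recovers the plain text from the element content alone. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample mirroring the tagged structure shown above (placeholders omitted).
SAMPLE = """<transcript file="example.doc">
    <body>
        <u who="BOB" n="1" line="1">
            <w n="1.1" pos="PRP" lemma="he">he</w>
            <w n="1.2" pos="VBZ" lemma="be">'s</w>
            <w n="1.3" pos="RB" lemma="pretty">pretty</w>
            <w n="1.4" pos="JJ" lemma="special">special</w>
        </u>
    </body>
</transcript>"""

root = ET.fromstring(SAMPLE)

# Frequency of each part-of-speech tag, read from the attributes.
pos_counts = Counter(w.get("pos") for w in root.iter("w"))

# Words carrying a particular tag (JJ = adjective in the Penn Treebank set).
adjectives = [w.text for w in root.findall(".//w[@pos='JJ']")]

# Plain text: itertext() yields element content only, never attribute values.
plain = " ".join("".join(root.itertext()).split())

print(pos_counts)
print(adjectives)
print(plain)
```

Because attribute values never appear in the text content, the plain string contains only the transcribed words, which is what concordancing tools expect.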

Error Messages

Messages will be displayed in the XTranscript interface or as tags at the end of the generated XML file. Some of these will give information about errors that occurred during the conversion process.

A common warning you may see relates to the 'autoclosing' of annotations. This happens when only an opening or closing bracket is found where both are required, in which case XTranscript finds the longest span of text that the annotation can be applied to. This is a suitable solution for overlaps, which may be written as:

    4   BOB:    won another [medal
    5   SALLY:              [yeah

Autoclosing may also happen when different annotations overlap, which would generate invalid XML unless a change is made. In these cases, one of the annotations will be moved so that it is contained within the other. For example:

    6   SALLY:  still [beat (everybody] yes)

The parentheses will be closed at the end of 'everybody', i.e. where the square brackets end. To accurately deal with these situations, either the original file or the generated XML would need to be edited (e.g. to create two sets of parentheses instead of one, but please note the effect this would have on quantitative studies).
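
The nesting behaviour described above can be approximated with a stack. The following is only a sketch under stated assumptions and is not XTranscript's implementation: the element names overlap and uncertain are hypothetical (the real CA-as-XML schema may use different names), only two bracket types are handled, and a closing bracket with no matching opener is simply dropped with a warning rather than being extended over the longest possible span.

```python
from xml.sax.saxutils import escape

# Hypothetical element names; the real CA-as-XML schema may differ.
OPENERS = {"[": "overlap", "(": "uncertain"}
CLOSERS = {"]": "overlap", ")": "uncertain"}


def autoclose(text):
    """Turn bracketed text into properly nested XML; return (xml, warnings)."""
    out, stack, warnings = [], [], []
    for ch in text:
        if ch in OPENERS:
            stack.append(OPENERS[ch])
            out.append(f"<{OPENERS[ch]}>")
        elif ch in CLOSERS:
            name = CLOSERS[ch]
            if name not in stack:
                # Simplification: drop a closer that has no open annotation.
                warnings.append(f"unmatched '{ch}' ignored")
                continue
            # Close inner annotations first so the elements stay nested.
            while True:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
                warnings.append(f"autoclosed <{top}> before '{ch}'")
        else:
            out.append(escape(ch))
    # Anything still open is closed at the end of the text.
    while stack:
        top = stack.pop()
        out.append(f"</{top}>")
        warnings.append(f"autoclosed <{top}> at end of text")
    return "".join(out), warnings


result, notes = autoclose("still [beat (everybody] yes)")
print(result)
# still <overlap>beat <uncertain>everybody</uncertain></overlap> yes
print(notes)
```

As in the example above, the parenthesised annotation ends where the square brackets end, and the later closing parenthesis is reported as a warning.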

These changes are made to ensure that the resulting file is 'well-formed', which allows the XML to be used with other XML processing software. XTranscript checks the generated XML and informs the user whether it is well-formed. If this check reports an error, you will most likely need to contact us so that we can look into the problem in more detail.
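
A similar check can be run locally on a downloaded file. The sketch below uses Python's standard xml.etree.ElementTree parser, which raises ParseError for input that is not well-formed; the function name is_well_formed is an assumption for illustration, and this is not necessarily how XTranscript performs its own check.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Properly nested elements parse without error.
print(is_well_formed("<u><overlap>medal</overlap></u>"))

# Overlapping (crossed) elements are rejected by the parser.
print(is_well_formed("<u><a>beat <b>everybody</a> yes</b></u>"))
```

Note that this only tests well-formedness; it does not validate the file against the CA-as-XML schema.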

Useful Publications about XML

Ruehlemann, C. & M. Gee (2018) 'Conversation Analysis and the XML method'. Gesprächsforschung [Discourse and Conversation Analysis] (18) 274-296.

Ruehlemann, C., A. Bagoutdinov & M.B. O'Donnell (2015) 'Modest XPath and XQuery for corpora: Exploiting deep XML annotation'. ICAME Journal (39) 47-84. Companion website and resources.

Hardie, A. (2014) 'Modest XML for corpora: Not a standard, but a suggestion'. ICAME Journal (38) 73-103.

Useful Software for Working with XML

BaseX: An open-source database system for searching XML documents.

oXygen: A powerful commercial XML editor with search features.

W3C XML Validator: An online service to validate (and find errors in) XML documents.

XML can be edited using plain text editors, such as Notepad or TextEdit, which are built into Windows and Mac OSX respectively. Alternatively, the following plain text editors provide some additional features (such as syntax highlighting): Notepad++ (Windows), BBEdit (Mac OSX).

eXistdb: An open-source system (which runs as a web service) for editing and searching XML.

Related Projects

The Corpus of Academic Spoken English (CASE) is being compiled by a team of researchers at Trier University of Applied Sciences. Birmingham City University is a partner whose students are taking part in the study. The first iteration of XTranscript was developed for use in the CASE project.

Before XTranscript, programming scripts were developed to convert CA-style annotations to XML as part of a combined qualitative and quantitative study of overlaps and interruptions. The following paper shows examples of adding further annotations (in the form of attributes) to the converted XML.
Lawson, R. & U. Lutzky (2016) 'Not getting a word in edgeways? Language, gender, and identity in a British comedy panel show'. Discourse, Context & Media (13) 143-153.

Notes

Style (bold, italic, underline) can only be extracted from Microsoft Word and Open Document formats.
The maximum file size for uploads is 10MB.
XTranscript does not encrypt your data at any point and it will store the uploaded and converted files on the server for a period of time. Get in touch if you need to convert files securely.
XTranscript should be considered experimental software. We provide no guarantee of availability and users may experience errors. Please let us know if you do, so we can try to fix them.