Annotated Corpus of Hittite Clauses

(Letters, Instructions and Prayers)

Dr. Andrej V. Sideltsev
Maria Molina
Contact us: maria.molina@me.com

Russian Academy of Sciences
Institute of Linguistics
Department of Anatolian and Celtic Studies


The aim of the project is to develop an online syntactically annotated corpus of Hittite, an extinct Indo-European language of the 18th to 12th centuries BC, attested in cuneiform script on clay tablets.

New electronic and online corpora for various languages appear all the time. Hittite nevertheless remains practically the only major Indo-European language with a significant body of texts that lacks an electronic syntactically annotated corpus. As the oldest attested Indo-European language, Hittite is attracting growing interest from researchers. Many papers on its linguistics, including syntax and morphology, have been published over the last decade, so the need for an online Hittite corpus is increasingly pressing.

Work on a syntactically annotated Hittite corpus began in May 2014 at the Institute of Linguistics (RAS, Moscow). The corpus is based on two publications: Hittite letters [Hoffner 2009] and Hittite instructions [Miller 2013]. Hittite prayers are to be included as well, based on Gebete der Hethiter, an online corpus recently published by a project of Philipps-Universität Marburg.

All letters and instructions have been analyzed and digitized according to the project guidelines. Since January 2016, an MS SQL relational database has been under development for the corpus, with an online search interface and limited public access to the database materials.

The basic unit of the database is the clause (a simple sentence). By January 2016 the database contained 3897 clauses. All linguistic material is yet to be parsed, tokenized, and lemmatized. All clauses available online are to be annotated for focus-comment structure and word order (SOV). Syntactic annotation is based on the Penn Treebank tagset with language-specific additions; a tree generator is built into the site and works on the principles of the Stanford Tregex software.
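To illustrate the kind of tree query the paragraph above refers to, here is a minimal Python sketch that reads a Penn Treebank-style bracketed string and finds nodes that immediately dominate a given label, similar in spirit to a Tregex `A < B` pattern. It is not the project's tree generator and does not use Tregex itself. The function names and the example tree are assumptions made for this illustration, and the words `w1`, `w2`, `w3` are placeholders, not Hittite data.

```python
import re

def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Leaves are plain strings. Malformed input raises ValueError or IndexError.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree

def leaves(tree):
    """Return the words of the tree in surface order."""
    words = []
    for child in tree[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words

def find_immediately_dominating(tree, parent_label, child_label):
    """Return every subtree labelled parent_label with a direct child
    labelled child_label (exact label match)."""
    label, children = tree
    matches = []
    if label == parent_label and any(
        isinstance(c, tuple) and c[0] == child_label for c in children
    ):
        matches.append(tree)
    for c in children:
        if isinstance(c, tuple):
            matches.extend(find_immediately_dominating(c, parent_label, child_label))
    return matches

# Hypothetical verb-final clause; w1..w3 stand in for real word forms.
example = parse_bracketed("(S (NP-SBJ (NN w1)) (VP (NP (NN w2)) (VB w3)))")
vp_with_object = find_immediately_dominating(example, "VP", "NP")
```

Labels are compared exactly, so `NP-SBJ` does not match `NP` here; a real query language would normally offer a way to match on the base category.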

The future corpus structure is planned as follows:

  1. Lemma database and dependency treebank (the basic element is a lexeme).
    1. Metadata for the word (text index + line, clause index, lexeme index).
    2. Lemmatization table (syllabic transliteration of a lexeme, normalized spelling, translation, grammar, dependency rating).
  2. Clause database and constituency treebank (the basic element is a clause).
    1. Metadata for the text (publication, duplicate, time, text index).
    2. Metadata for the clause (text index, paragraph, lines, clause index).
    3. Clause table (syllabic transliteration, normalized spelling, English translation).
    4. Syntactic annotation (a number of relevant features, such as word order).
    5. Phrase structure (treebank).
    6. Pragmatic annotation (focus-comment).
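
The planned structure above could be modelled roughly as follows. This is a minimal Python sketch, not the project's actual MS SQL schema: all class names, field names, and types are assumptions chosen for illustration, and every value in the usage example is a placeholder, not corpus data.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Lexeme:
    # Metadata for the word
    text_index: str
    line: str
    clause_index: int
    lexeme_index: int
    # Lemmatization table
    transliteration: str
    normalized: str
    translation: str
    grammar: str
    dependency_rating: Optional[str] = None  # meaning assumed, not specified here

@dataclass
class Clause:
    # Metadata for the clause
    text_index: str
    paragraph: str
    lines: str
    clause_index: int
    # Clause table
    transliteration: str
    normalized: str
    translation: str
    # Annotation layers, all optional until filled in
    word_order: Optional[str] = None       # syntactic annotation feature
    phrase_structure: Optional[str] = None # bracketed tree string
    pragmatics: Optional[str] = None       # focus-comment annotation
    lexemes: List[Lexeme] = field(default_factory=list)

@dataclass
class Text:
    # Metadata for the text
    text_index: str
    publication: str
    duplicate: Optional[str] = None
    time: Optional[str] = None
    clauses: List[Clause] = field(default_factory=list)

def clauses_with_order(text: Text, order: str) -> List[Clause]:
    """Return the clauses of a text annotated with the given word order."""
    return [c for c in text.clauses if c.word_order == order]

# Usage with placeholder values only.
demo = Text(text_index="TEXT-001", publication="placeholder publication")
demo.clauses.append(Clause("TEXT-001", "1", "1-2", 1,
                           "<transliteration>", "<normalized>", "<translation>",
                           word_order="SOV"))
demo.clauses.append(Clause("TEXT-001", "1", "3", 2,
                           "<transliteration>", "<normalized>", "<translation>",
                           word_order="OSV"))
sov_clauses = clauses_with_order(demo, "SOV")
```

Keeping the text index and clause index on both the lexeme and the clause mirrors the outline, where the two databases share these keys; in a relational database they would presumably serve as join columns.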

In 2016-2017 the Hittite corpus project is being developed under the corpus program of the Department of Historical and Philological Sciences of the Russian Academy of Sciences: Development of a Hittite Corpus, #0182-2015-0003.



References:
Hoffner, H. A. Jr. (2009). Letters from the Hittite Kingdom. Atlanta: Society of Biblical Literature.
Miller, J. (2013). Royal Hittite Instructions and Related Administrative Texts. Atlanta: Society of Biblical Literature.
Beckman, G., Bryce, T. and Cline, E. (2011). The Ahhiyawa Texts. Atlanta: Society of Biblical Literature.
Gebete der Hethiter: http://www.hethport.uni-wuerzburg.de/txhet_gebet/textindex.php?g=gebet&x=x