Annotated Corpus of Hittite clauses

(Letters, Instructions and Prayers)

Dr. Andrej V. Sideltsev
Maria Molina
Contact us: maria.molina@me.com

Russian Academy of Sciences
Institute of Linguistics
Department of Anatolian and Celtic Studies


The aim of the project is to develop an online, syntactically annotated corpus of Hittite, an extinct Indo-European language of the 18th–12th centuries BC, attested in cuneiform script on clay tablets.

New electronic and online corpora for various languages appear all the time. Hittite remains practically the only major Indo-European language with a significant body of texts that has no electronic syntactically annotated corpus. As the oldest attested Indo-European language, Hittite is attracting growing interest from researchers. Many papers on its linguistics, including syntax and morphology, have been published during the last decade, so the need for an online Hittite corpus is increasingly pressing.

The syntactically annotated Hittite corpus project started in May 2014 at the Institute of Linguistics (RAS, Moscow). It is based on two publications: Hittite letters [Hoffner 2009] and Hittite instructions [Miller 2013]. Hittite prayers are planned for inclusion as well. All letters and instructions have been analyzed and digitized according to the project guidelines, and a database has been built in FileMaker Pro 13. Since January 2016, an MS SQL relational database has been under development for the corpus, with an online search interface and limited public access to the database materials.

The basic element of the database is the clause (a simple sentence). By August 2016 the database contained 3,800 clauses. All linguistic material is to be parsed, tokenized, and lemmatized. By the end of January 2017 all clauses available online are to be annotated for focus-comment structure and word order (SOV). Syntactic annotation is based on the Penn Treebank tagsets with language-specific additions; a tree generator is built into the site and works on the principles of the Stanford Tregex software.
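The idea of searching bracketed phrase-structure trees for a word-order pattern can be illustrated with a minimal Python sketch. It is not the project's tree generator and does not use Tregex itself; it only mimics a Tregex-style "sister precedes" relation. The labels (S, CONJ, NP-OBJ, VP) and the helper names are hypothetical, and the example clause is a well-known textbook sentence in simplified transliteration, not a record from the corpus.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: str = ""  # filled in only for leaf (preterminal) nodes


def parse(text: str) -> Node:
    """Parse a Penn Treebank-style bracketed string into a Node tree."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        node = Node(tokens[pos])
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())
            else:
                node.word = tokens[pos]
                pos += 1
        pos += 1  # consume the closing bracket
        return node

    return read()


def walk(node: Node) -> Iterator[Node]:
    """Yield every node of the tree in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def leaves(node: Node) -> List[str]:
    """Return the words of the clause in surface order."""
    return [n.word for n in walk(node) if n.word]


def sister_precedes(root: Node, first: str, second: str) -> bool:
    """True if some node has a child `first` followed later by a child `second`."""
    for node in walk(root):
        seen_first = False
        for child in node.children:
            if child.label == first:
                seen_first = True
            elif child.label == second and seen_first:
                return True
    return False


# Illustrative annotation only; the tag names are assumptions.
EXAMPLE = "(S (CONJ nu) (NP-OBJ (N NINDA-an)) (VP (V ezzatteni)))"
tree = parse(EXAMPLE)
words = leaves(tree)
is_ov = sister_precedes(tree, "NP-OBJ", "VP")
```

Here `is_ov` records whether the object phrase precedes the verb phrase within the clause, which is the kind of feature a word-order annotation would store.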

The future corpus structure is planned in the following manner:

  1. Lemma database and dependency treebank (the lexeme is the basic element).
    1. Metadata for the word (text index + line, clause index, lexeme index).
    2. Lemmatization table (syllable transliteration of a lexeme, normalized spelling, translation, grammar, dependency rating).
  2. Clause database and constituency treebank (the clause is the basic element).
    1. Metadata for the text (publication, duplicate, time, text index).
    2. Metadata for the clause (text index, paragraph, lines, clause index).
    3. Clause table (syllable transliteration, normalized spelling, English translation).
    4. Syntactic annotation (a number of relevant features, such as word order).
    5. Phrase structure (treebank).
    6. Pragmatic annotation (focus-comment).
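
The two-level structure outlined above, with lexeme records pointing back to their clause through shared indices, might be modeled roughly as follows. This is a sketch under stated assumptions rather than the project's actual schema: the class names, field names, index values, and grammar tags are all hypothetical, and the sample clause is a textbook sentence in simplified transliteration, not a corpus record.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Clause:
    text_index: str        # hypothetical text identifier
    clause_index: int
    paragraph: int
    lines: str
    transliteration: str
    translation: str


@dataclass(frozen=True)
class Lexeme:
    text_index: str        # together with clause_index, links back to Clause
    clause_index: int
    lexeme_index: int      # position of the word within the clause
    line: str
    transliteration: str
    translation: str
    grammar: str           # hypothetical tag format


ClauseKey = Tuple[str, int]


def lexemes_by_clause(lexemes: Iterable[Lexeme]) -> Dict[ClauseKey, List[Lexeme]]:
    """Group lexeme records under their clause, ordered by lexeme index."""
    grouped: Dict[ClauseKey, List[Lexeme]] = defaultdict(list)
    ordered = sorted(
        lexemes, key=lambda x: (x.text_index, x.clause_index, x.lexeme_index)
    )
    for lex in ordered:
        grouped[(lex.text_index, lex.clause_index)].append(lex)
    return dict(grouped)


# Invented sample records, deliberately stored out of order.
clause = Clause("T1", 1, 1, "1", "nu NINDA-an ezzatteni", "you will eat bread")
records = [
    Lexeme("T1", 1, 3, "1", "ezzatteni", "eat", "V.PRS.2PL"),
    Lexeme("T1", 1, 1, "1", "nu", "and", "CONJ"),
    Lexeme("T1", 1, 2, "1", "NINDA-an", "bread", "N.ACC.SG"),
]
grouped = lexemes_by_clause(records)
rebuilt = " ".join(
    lex.transliteration for lex in grouped[(clause.text_index, clause.clause_index)]
)
```

Keying both levels on the same text and clause indices lets a clause be rebuilt from its lexemes and lets a lexeme query return the surrounding clause.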

In 2016 the Hittite corpus project was developed under the corpus program of the Department of Historical and Philological Sciences of the Russian Academy of Sciences: Development of a Hittite Corpus, #0182-2015-0003.



References:
Hoffner, H. A. Jr. (2009). Letters from the Hittite Kingdom. Atlanta: Society of Biblical Literature.
Miller, J. (2013). Royal Hittite Instructions and Related Administrative Texts. Atlanta: Society of Biblical Literature.
Beckman, G., Bryce, T., and Cline, E. (2011). The Ahhiyawa Texts. Atlanta: Society of Biblical Literature.