Peter Grzybek

Word Length Frequencies and their Distribution in Slavic Texts
(Project Abstract)

The word, just like the sentence, is a central element of any (process of) text construction. Despite this central role, word length as a theoretical category in its own right has been largely neglected in linguistics and text-oriented disciplines. Only recently has the question of the frequency of occurrence of words of specific lengths ("word length frequencies") in texts (of a given language, a given author, a given genre, etc.) been integrated into a systematic theoretical context, and only recently has a particular theory of word length distribution(s) been developed. The empirical results available thus far indeed show that the frequency with which one-, two-, three-, etc. syllable words occur in texts is not chaotic but governed by specific laws; knowledge of these laws allows deep insights into text structure and processing. In contrast to earlier assumptions that a single, unique law might be responsible for the frequencies of word length in texts, current approaches consider a flexible system consisting of a superordinate basic model and particular modifications due to text-, author-, genre-, time-specific and other factors. Thus far, no systematic studies are available on word length frequencies in Slavic texts. Moreover, the problem of how these specific "peripheral" factors influence word length frequency (distributions) has never been studied in detail. In this research project, these questions shall be approached systematically, using ca. 3,000 texts in three Slavic languages (Russian, Croatian, Slovenian). For this purpose, it will be necessary to construct a comprehensive text data bank with the relevant meta-data; optionally, this data bank shall be made available for external use as well, as shall the text analysis software tools, which will have to be specially developed (if possible, in a way that allows for further development, possibly including non-Slavic languages).
The results of this project will be important not only for the narrower field of Slavic linguistics and text science; they will likely (a) provide profound theoretical insight into deep structural rules of text construction and processing, and (b) give relevant information about the modifying role of the specific factors influencing word length frequency. Furthermore, it can be expected that the systematic study of word length and its frequency (distribution) in texts will prove to be an important contribution to solving the problem of text classification and genre discrimination. Since the regularities to be observed can be understood to be of importance for information processing in general (i.e., not only for language processing), and given the statistical methods that will necessarily have to be applied in studying them, the present project represents an interdisciplinary attempt to bridge the "two cultures" of the natural and cultural sciences.

