Articles

Is the Distribution of L-Motifs Inherited from the Word Lengths Distribution?

The distribution of L-motifs (measured on a text T) is similar to the L-motifs distribution measured on the pseudotext T’ constructed by random transposition of all tokens within the text T. This suggests that the distribution of L-motifs is inherited from the word length distribution (in other words, that the word length distribution of a text implies the distribution of L-motifs). The paper shows that, despite this similarity, an L-motifs structure independent of the word length distribution can be detected.
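As a rough illustration of the comparison described above, the sketch below segments a text into L-motifs and builds the pseudotext T' by shuffling its tokens. It assumes the usual definition of an L-motif as a maximal run of non-decreasing word lengths, and it measures word length in characters purely for convenience. The function names are illustrative and are not taken from the paper.

```python
import random
from collections import Counter


def word_lengths(tokens):
    # Assumption: word length is counted in characters (syllables are also common).
    return [len(t) for t in tokens]


def l_motifs(lengths):
    # Assumption: an L-motif is a maximal run of equal or increasing lengths.
    motifs = []
    current = []
    for n in lengths:
        if current and n < current[-1]:
            # The length dropped, so the current motif ends here.
            motifs.append(tuple(current))
            current = []
        current.append(n)
    if current:
        motifs.append(tuple(current))
    return motifs


def motif_size_distribution(tokens):
    # How many L-motifs consist of 1, 2, 3, ... words.
    return Counter(len(m) for m in l_motifs(word_lengths(tokens)))


def pseudotext(tokens, seed=0):
    # T': the same tokens in random order, so the word length distribution is unchanged.
    rng = random.Random(seed)
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled
```

Comparing `motif_size_distribution(tokens)` with `motif_size_distribution(pseudotext(tokens))` over many seeds gives a feel for how much of the L-motif distribution survives the shuffling.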

Is the Distribution of L-Motifs Inherited from the Word Lengths Distribution?

Reinhard Köhler (1984) proposed the idea that the linguistic constructs which have to be processed by the human parser consist of plain information (which needs to be communicated) and structure information, and that this can explain Menzerath's law. Our paper assumes that the amount of plain information and the amount of structure information are mutually independent. A new model of the nested structure of text and of Menzerath's law can be based on this assumption. A formula derived from the model is successfully tested and the results are compared to the classical Menzerath-Altmann law.

Menzerath's Law: The whole is greater than the sum of its parts.

In Gabriel Altmann, Radek Čech, Ján Mačutek, Ludmila Uhlířová (eds.)

Examining a large corpus of Greek texts, we found that the average syllable length in disyllabic words is lower than the average syllable length in monosyllabic words and also lower than the average syllable length in trisyllabic words. This peculiar phenomenon can be interpreted as a counterexample to Menzerath's Law.

Distribution of the Menzerath’s Law on the Syllable Level in Greek texts.

This contribution deals with the use of quotations (repeated n-grams; the algorithm is tolerant of small lexical changes) in works of medieval Arabic literature. The analysis is based on a historical corpus of Arabic comprising 420 million words. Based on quotations repeated from work to work, a network is constructed and used to interpret various aspects of Arabic literature. Two short case studies are presented, concentrating on the centrality and relevance of individual works, and on the analysis of the time depth and resulting impact of a given work in various periods.

Quotations, Relevance and Time Depth.

This article deals with one of the oldest and most traditional fields in quantitative linguistics, the concept of vocabulary richness. Although there are several methods for measuring vocabulary richness, all of them are influenced by text size. Therefore, the authors propose a new way of measuring vocabulary richness that does not depend on text length. In the second part of the article, the new method is used for a genre analysis of texts written by the Czech writer Karel Čapek. Furthermore, differences between authors and between languages are studied with this method.

Vocabulary Richness Measure in Genres.

The software used in the paper is available here.

In Ivan Obradović, Emmerich Kelih and Reinhard Köhler (Eds.)

Presented at QUALICO 2012, Beograd

This paper shows that the type-token relation, the hapax-token relation and, generally, the relation between types of a certain frequency and tokens can be computed from the rank-frequency relation or from any type of frequency distribution, and that the type-token relation can be computed from the hapax-token relation. No approximation or additional assumptions are needed; the formulae can be derived purely algebraically. The second part of the paper observes that, for very large corpora, the ratio between the number of hapax legomena and types converges to a constant

Available here.

The software based on the model is available here.

The paper defines and shows how to use the

The software based on the metric is available here.

Rank-frequency Relation and Type-token Relation: Two Sides of the Same Coin

Presented at CL 2011, Birmingham

In Arabic, the mutual order of prepositional phrases syntactically dependent on one head is neither fixed nor random. This paper explores the factors affecting the order of prepositions

Available in the conference proceedings or on my website.

When comparing the use of two word types within one text, we can do so by comparing the contexts in which they occur. We pick all the tokens that occur, e.g., immediately to the right of the word A and immediately to the right of the word B, thus getting two multiple subsets of the text. This paper offers a method for comparing such subsets (its use is not limited to the field of linguistics). The method is based on comparing the cardinality of the intersection of the two multiple subsets with a model which characterizes the average cardinality of all possible subsets of a given length from the given text. The model is derived algebraically.
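A minimal sketch of the comparison described above, assuming that the "multiple subsets" can be represented as multisets (`collections.Counter`). The paper derives its reference model algebraically; the `random_baseline` function below is only a Monte Carlo stand-in, and drawing the two random subsets independently is an assumption made here. All function names are illustrative.

```python
import random
from collections import Counter


def right_contexts(tokens, word):
    # Multiset of tokens that occur immediately to the right of `word`.
    return Counter(
        tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] == word
    )


def multiset_intersection_size(a, b):
    # Cardinality of the multiset intersection (minimum count per type).
    return sum((a & b).values())


def random_baseline(tokens, size_a, size_b, trials=1000, seed=0):
    # Stand-in for the algebraic model: average intersection size of two
    # random multisets of the given sizes drawn from the text.
    # Assumption: each subset is drawn without replacement, independently
    # of the other one.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = Counter(rng.sample(tokens, size_a))
        b = Counter(rng.sample(tokens, size_b))
        total += multiset_intersection_size(a, b)
    return total / trials
```

An observed intersection that is clearly larger than the baseline for subsets of the same sizes would indicate that the two words share more contexts than chance alone would suggest.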

If we consider the type-token relation to be a feature of a text and not of a language, we can approach a theoretically based and precise description of this relation. Such a description will suit the demands of text linguistics better than the empirical laws that are used nowadays. This paper offers a model of the relation based on the combinatorial characterization of the distribution of types in a text. This method is subsequently used to formulate a model of the hapax-token relation, and the subject is then generalized.
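To illustrate what a combinatorial model of this kind can look like, the sketch below computes the expected number of types, and of types occurring exactly k times, in a random sample of n tokens drawn without replacement from a text with known type frequencies. This is the standard hypergeometric argument and is offered as an assumption about the general approach; it is not claimed to be the exact formula from the article, and the function names are illustrative.

```python
from math import comb


def expected_types_with_count(freqs, n, k):
    # Expected number of types that occur exactly k times in a random
    # sample of n tokens (k = 1 gives the expected number of hapax legomena).
    # `freqs` holds the frequency of every type in the whole text.
    if k > n:
        return 0.0
    total_tokens = sum(freqs)
    all_samples = comb(total_tokens, n)
    favourable = sum(comb(f, k) * comb(total_tokens - f, n - k) for f in freqs)
    return favourable / all_samples


def expected_types(freqs, n):
    # Expected number of distinct types in a random sample of n tokens:
    # each type is present unless all n tokens come from the other types.
    total_tokens = sum(freqs)
    all_samples = comb(total_tokens, n)
    return sum(1 - comb(total_tokens - f, n) / all_samples for f in freqs)
```

Evaluating `expected_types(freqs, n)` for n from 1 to the text length traces a type-token curve directly from the frequency distribution, with no fitted parameters.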

2. 8. 2008

Type-token & Hapax-token Relation: A Combinatorial Model. Software based on formulae from this article.

Published in

Czech Language

Konfidenční intervaly v empirické lingvistice.

Empirical linguistics and confidence intervals

The paper attempts to introduce confidence intervals to (Czech) empirical linguistics. First, classical inference tests are discussed, and it is argued that they cannot determine real-life significance. Then confidence intervals are defined and the basic idea underlying the method for computing confidence intervals for binary data is described. It is shown how the intervals can be useful when exploring binary quaternities and relations between two variables. The last section deals with the relevance of the method for the Czech linguistic discourse.
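The abstract does not say which interval the paper uses for binary data. As one common choice, the sketch below computes the Wilson score interval for a proportion; the choice of method, the function name and the default z = 1.96 (roughly a 95% interval) are assumptions made for illustration.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    # Wilson score interval for a binomial proportion.
    # Assumption: the Wilson interval stands in for the method used in the paper.
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width
```

For example, `wilson_interval(12, 200)` gives a range for the relative frequency of a construction observed 12 times in 200 sentences, which conveys the size and uncertainty of an effect more directly than a bare significance test.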

Konfidenční intervaly v empirické lingvistice.

(Experimental research on style-forming factors: first outcomes)

The paper introduces an experiment on the role of preparedness in writing. The experiment took place in 2010. Participants (N = 51; students of Charles University in Prague) were randomly divided into two groups: group N (N = 24) and group P (N = 27). Their main task was to describe the plot of the short animated film Quest. Group N started to write right after seeing each part of the film, whereas group P had 5 minutes to prepare. Significant differences in sentence length and number of revisions were found between the two groups. It is claimed that preparedness is a valid style-forming factor, i.e. it influences both the process and the result of writing. Furthermore, the same method could be used to analyse the role of other style-forming factors in the writing process.

In Naše řeč 95/4 (2012) pp 181–186 ISSN 0027-8203

(Typography and the Islamic culture)

The article examines the phenomenon of typography in the course of Islamic history. In the Islamic world, printing with movable type and printing blocks was unacceptable, and using such technology to copy a text written in Arabic script was illegal. The article asks how the society could resist the temptation of this innovation and describes the distressful influence of typography on the life of Muslims.

9. 5. 2007; Full version (Czech): Cesta k arabskému knihtisku na Blízkém východě.

2009: Abbreviated version published in Nový Orient (64/2), Knihtisk v dějinách islámské kultury.