Tuesday, March 28, 2017

Why we need alignments in historical linguistics

Alignments have been discussed quite a few times in this blog. They are so common in molecular biology that I doubt that there are any debates about their usefulness, apart from certain attempts to improve the modelling, especially in cases of non-colinear patterns (Kehr et al. 2014), or to speed up computation (Mathur and Adlakha 2016). In linguistics, on the other hand, alignments are rarely used, although initial attempts to arrange homologous words in a matrix go back to the early 20th century, as you can see from this example taken from Dixon and Kroeber (1919: 61):

Early alignment from Dixon and Kroeber (1919)

This example is rather difficult to read for those not familiar with the annotation. The authors group homologous words across different indigenous languages from California. The group labels of the languages under investigation are given in abbreviated form at the very left of the matrix, and the actual varieties are listed in the next column. What follows is the actual alignment, along with comments in the last column. Regarding the alignments, the authors note on page 55:
A number of sets of cognates have been taken from their numbered place in this list and put at the end to allow of their being printed in columnar form, with a view to bringing out parallelisms that otherwise might fail to impress without detailed analysis and discussion. (Dixon and Kroeber 1919: 55)
In my opinion, this expresses nicely why alignments should be used more often in linguistics. Because our "alphabets" (the sound systems of languages) are undergoing constant change (see this earlier post for details regarding this claim), we need to infer both the scoring function between different sounds across different languages and the alignment at the same time. If we look at the similarities the authors spotted, it should become obvious what I mean.
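To make the interplay between scoring function and alignment concrete, here is a minimal sketch of a standard global pairwise alignment (Needleman-Wunsch), in which the scoring function is passed in as a parameter. This is not the method used by Dixon and Kroeber or by any particular software package. The function names, the gap penalty, the toy sound classes, and the roughly segmented, simplified forms of English "daughter" and German "Tochter" are all assumptions made for illustration.

```python
def align(seq_a, seq_b, score, gap=-1):
    """Globally align two segment lists; return (score, row_a, row_b)."""
    n, m = len(seq_a), len(seq_b)
    # matrix[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j]
    matrix = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        matrix[i][0] = i * gap
    for j in range(1, m + 1):
        matrix[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                matrix[i - 1][j] + gap,  # gap in seq_b
                matrix[i][j - 1] + gap,  # gap in seq_a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and matrix[i][j]
                == matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and matrix[i][j] == matrix[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return matrix[n][m], row_a[::-1], row_b[::-1]


def identity_score(x, y):
    """Naive scorer: only identical segments count as similar."""
    return 1 if x == y else -1


# Hypothetical sound classes: treat d and t as members of one class.
SOUND_CLASS = {"d": "T", "t": "T"}


def class_score(x, y):
    """Scorer that also rewards segments from the same (toy) sound class."""
    return 1 if SOUND_CLASS.get(x, x) == SOUND_CLASS.get(y, y) else -1


# Simplified, illustrative segmentations (not careful transcriptions)
english = ["d", "ɔ", "t", "ə", "r"]
german = ["t", "ɔ", "x", "t", "ə", "r"]

naive, row_en, row_de = align(english, german, identity_score)
informed, _, _ = align(english, german, class_score)

print("\t".join(row_en))
print("\t".join(row_de))
print(naive, informed)
```

The two scorers happen to yield the same columns here but different scores. In general, changing the scoring function can change the alignment itself, which is why both have to be inferred together when the sound correspondences are not known in advance.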

I am not yet sure how to interpret the data exactly, but if I am not mistaken, the authors claim that each of the columns contains homologous material. So, they find a similarity between kaha in the first row (the language is Northern Wintun, according to the key to abbreviations in the book), and tu in the last row (Monterey Costanoan). The last column shows suffixes, which I think the authors exclude from their analysis, but I could not find additional information confirming this in their book.

The comment column illustrates another problem of representation, namely that the authors do not know how to handle cases of metathesis (or transpositions) consistently. The transposition of the parts of words is a process that is quite frequent in language evolution. It is very frequent in compounds consisting of modifier and modified, such as milk coffee in English, where milk modifies the coffee, while French, for example, puts the modifier after the main noun, expressing this as café au lait.

Nowadays, we can handle these cases consistently in linguistics, both in our data annotation and in the alignments, and we can even search for the structures automatically (see List et al. 2016). One hundred years ago, when Dixon and Kroeber worked out their comparison of the languages in California, they were pioneers who tried to increase the transparency of our discipline, and it is clear that their solutions are not completely satisfying from today's perspective.

It is extremely surprising to me that, despite these early attempts to make our homology judgments in linguistics more transparent, phonetic alignment is still rarely practised by historical linguists. Indeed, the majority of them even think that it is a waste of time, or only useful for the purpose of teaching.

I was reminded of this when I looked at a recent proposal by Bengtson (2017, see also this blog for details) for deep genetic connections between Basque and North Caucasian languages. Note that the Basque language is traditionally considered an isolate, i.e. a language whose nearest relatives we cannot find among the languages of the world. Many linguists have attempted to solve this puzzle by proposing various hypotheses (see Forni 2013 for an example of attempting to link Basque with Indo-European). Bengtson presents various types of evidence, which I cannot really judge, as I do not know the languages under comparison, but finally, he also shows a list with potential homologs between Basque and North Caucasian varieties, which you find below.

Potential homologs between Basque and North Caucasian languages (Bengtson 2017)

If you are not a trained historical linguist, and thus do not know what to do with this table, be assured that many historical linguists will feel similarly. As a rough explanation: the concepts are supposed to be very stable, being drawn from Sergey Yakhontov's list of 35 ultra-stable concepts, and I think that all words in one row are supposed to be etymologically related, that is, they should be potential homologs across all of the languages. If word forms are preceded by the asterisk symbol (*), this means that they are reconstructed, i.e. not reflected in written sources. But that is all I can tell you for the moment. Where I should start the comparison between the words remains a mystery to me, as I do not know which parts are supposed to be similar. Alignments would help us to see immediately where the author thinks that the historical similarities can be found; that is, we would see which parts of the words are supposed to be homologous.

At this point in the post, I originally planned to provide you with an alignment of Bengtson's table, in order to illustrate the benefits of alignment in linguistics. Unfortunately, I had to admit to myself that I cannot do this, as I simply do not know where to align the words (apart from some rare trivial cases in the table).

I really hope that this will change in the future. Too often, our hypotheses in linguistics suffer from insufficient transparency with regard to the "proofs" and the evidence. I agree that it is very difficult to come up with good alignments in linguistics, especially if one considers cases of metathesis, unrelated parts, and general uncertainty. However, instead of giving in to the problem, we should follow the pioneering work of Dixon and Kroeber, and try to improve the way we present our data to both our colleagues and a broader public.

Theories such as the link between Basque and the North Caucasian languages are usually highly disputed in historical linguistics, and I do not know of any long-range proposal that has gained broad acceptance during the last 50 years. Yet, maybe this is not because the proposals are not valid, but simply because those who are proposing these theories have failed to present their findings in a transparent and testable way.

  • Bengtson, J. (2017) The Euskaro-Caucasian Hypothesis. Current model. PDF.
  • Dixon, R. and A. Kroeber (1919) Linguistic families of California. University of California Press: Berkeley.
  • Forni, G. (2013) Evidence for Basque as an Indo-European language. The Journal of Indo-European Studies 41.1 & 2: 1-142.
  • Kehr, B., K. Trappe, M. Holtgrewe, and K. Reinert (2014) Genome alignment with graph data structures: a comparison. BMC Bioinformatics 15.1: 99.
  • List, J.-M., P. Lopez, and E. Bapteste (2016) Using sequence similarity networks to identify partial cognates in multilingual wordlists. In: Proceedings of the Association for Computational Linguistics 2016 (Volume 2: Short Papers). Association for Computational Linguistics, pp. 599-605.
  • Mathur, R. and N. Adlakha (2016) A graph theoretic model for prediction of reticulation events and phylogenetic networks for DNA sequences. Egyptian Journal of Basic and Applied Sciences 3.3: 263-271.