Tuesday, April 18, 2017

Multimedia phylogeny?


Evolutionary concepts have often been transferred to other fields of study, or derived independently in them, especially in anthropology in the broadest sense, covering all cultural products of the human mind. This includes phylogenetic studies of languages, texts, tales, artifacts, and so on — you will find many examples of such studies in this blog. One of the more recent applications has been to what is sometimes called multimedia phylogeny — the research field that "studies the problem of discovering phylogenetic dependencies in digital media".

I have noted before that phylogenetics in the biological sense is an analogy when applied to other fields, because only in biology is genetic information physically transferred between generations — in the other fields, cultural information transfer occurs in the minds of the people, not in their genes (see False analogies between anthropology and biology). The analogy often becomes problematic in practice, because the application of bioinformatics techniques separates the informatics from the bio, and the mathematical analyses then focus on implementing the informatics without any biological justification.


A recent paper that discusses the application of bioinformatics to multimedia phylogeny exemplifies the potential problems:
Guilherme D Marmerola, Marina A Oikawa, Zanoni Dias, Siome Goldenstein, Anderson Rocha (2016) On the reconstruction of text phylogeny trees: evaluation and analysis of textual relationships. PLoS One 11(12): e0167822.
The authors described their background information thus:
Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications or literary works, are some examples in which textual content can be subject to changes in an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and the history of changes that created the whole set. Such functionality would have immediate applications on news tracking services, detection of plagiarism, textual criticism, and copyright enforcement, for instance.
However, this is not an easy task, as textual features pointing to the documents' evolutionary direction may not be evident and are often dataset dependent. Moreover, side information, such as time stamps, are neither always available nor reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees, and seamlessly exploring new approaches on a wide range of scenarios of text reusage. We employ and evaluate distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework.
So, their solution to the separation of bio from informatics is to try a range of techniques, none of which are based on any particular model of how phylogenetic changes might occur in text documents. All of these methods involve distance-based tree-building.

The essential problem, as I see it, is that without a model of change there is no reliable way to separate phylogenetic information from any other type of information. For example, similarity can arise from many sources, only some of which provide information about phylogenetic history — phylogenetic similarity is a form of "special similarity". In biology, other sources of similarity, such as convergence and parallelism, are usually lumped together as chance similarities. Without this basic separation of phylogenetic and chance similarity, it does not matter how many distance measures you use, or how many tree-building methods you employ — if you can't separate phylogeny from chance, then you are wasting your time constructing a hypothetical evolutionary history.

The authors' only saving grace is their claim that: "In text phylogeny, unlike stemmatology [the analysis of hand-written rather than digital texts], the fundamental aim is to find the relationships among near-duplicate text documents through the analysis of their transformations over time." The expectation, then, is that the phylogenetic similarity of the texts will be high, which will thus reduce the possibility of chance similarities. Sadly, it will also reduce the probability that the similarities will contain any phylogenetic information at all — this is the classic short-branches-are-hard-to-reconstruct problem in phylogenetics.

For digital texts, the authors employ three distance measures: edit distance, normalized compression distance, and cosine similarity. None of these are model-based in any phylogenetic sense (although the first one is used in alignment programs such as Clustal) — I have discussed this in the post on Non-model distances in phylogenetics. Their tree-building methods include: parsimony, support vector machines (a machine-learning form of classification), and random forests (a decision-tree form of classification). Once again, none of these is model-based in terms of textual changes.
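To make these measures concrete, here is a minimal sketch of how the three dissimilarities could be computed for a pair of short strings. It is purely illustrative and is not the authors' implementation: zlib stands in as the compressor for the normalized compression distance, and a simple word-count vector is used for the cosine measure.

```python
# Illustrative sketches of the three dissimilarity measures discussed above.
import zlib
from collections import Counter
from math import sqrt

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # (mis)match
        prev = curr
    return prev[-1]

def ncd(a: str, b: str) -> float:
    """Normalized compression distance, approximated with zlib."""
    ca = len(zlib.compress(a.encode()))
    cb = len(zlib.compress(b.encode()))
    cab = len(zlib.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

def cosine_dissimilarity(a: str, b: str) -> float:
    """1 minus the cosine similarity of simple word-count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(x * x for x in va.values())) * sqrt(sum(x * x for x in vb.values()))
    return 1 - dot / norm if norm else 1.0

x = "the quick brown fox jumps over the lazy dog"
y = "the quick brown fox jumped over a lazy dog"
print(edit_distance(x, y), round(ncd(x, y), 3), round(cosine_dissimilarity(x, y), 3))
```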

A final issue is the insistence on trees as the model of a phylogeny. In stemmatology, for example, a network is a more obvious phylogenetic model, because hand-written texts can be copied from multiple sources. Indeed, this distinction plays an important role in the first application of phylogenetics to stemmatology (see the post on An outline history of phylogenetic trees and networks). Perhaps this is not an issue for "near-duplicate text documents", but it does seem like an unnecessary restriction. Moreover, one of the empirical examples used in the paper actually has a network history, which therefore does not match the authors' reconstructed tree.

Tuesday, April 11, 2017

Morgan Colman and English royal genealogies


I noted in an earlier post (Drawing family trees as trees) that from 1576 CE Scipione Ammirato, an Italian writer and historian, set up a cottage industry producing family trees for the nobility. Over the years, he was not the only person to try to make money this way.

In the English-speaking world, one of these was Morgan Colman (or Coleman), who produced an impressively large genealogy of King James I and Queen Anne, in 1608. Nathaniel Taylor has commented: "Of all the congratulatory heraldic and genealogical stuff prepared early in James’s reign, this might be the most impressive piece of genealogical diagrammatic typography."

Unfortunately, we do not have a complete copy of this family tree. It was published as a set of quarto-sized bifolded sheets that needed to be joined together. Below is a small image of the copy in the British Library, which gives you an idea of the intended arrangement, and its incompleteness (click to enlarge). Taylor has a larger PDF copy available here.


The WorldCat library catalog lists the work as "Most noble Henry ; heire (though not son)", which is the first line of the dedicatory verse at the top left. Elsewhere, I have seen it referred to as "The Genealogies of King James and Queen Anne his wife, from the Conquest".

It is usually described as "a genealogy of James I and Anne of Denmark in 10 folio sheets [sic], with their portraits in woodcut, accompanied by complimentary verses to Henry Prince of Wales, the Duke of York (Prince Charles) and Princess Elizabeth, and with the coats-of-arms of the nobles living in 1608 and of their wives."

A Christie's auction notes the sale of an illuminated manuscript of the "Genealogy of the Kings of England, from William the Conqueror to Elizabeth I", produced by Colman in 1592. The accompanying text reads (in part):
Colman, a scribe and heraldic painter, was steward and secretary to various eminent public figures, including successive Lord Keepers of the Great Seal, Sir John Puckering (1592-96) and Sir Thomas Egerton (1596-1603) who caused his election as MP for Newport, Cornwall in 1597. Heraldic and genealogical compositions were his speciality and in 1608 he had composed, and prepared for printing, genealogies of King James and his Queen published as ten large quarto sheets; in 1622 a payment records his work for James I in producing two large and beautiful tables for the King's lodgings in Whitehall and for making many of the genealogical tables for 'His Majesty's honour and service'. But these successes were a distant prospect in 1592 when he produced the present manuscript: in that year he petitioned for the post of York Herald and a second petition at about this date, possibly to Sir John Puckering, solicits the addressee's continued support for his advancement. This genealogy appears therefore to be part of a campaign to secure employment: the writer ends his summary of contents 'Wherein if the simplicity of well-meaning purpose, maie procure desired accept'on then rest persuaded that the industrious hand is fullie prepared spedelie to produce matter for more ample contentment.' The inclusion of Francis Bacon's arms at the end of his work shows that Colman had hopes of securing Bacon's patronage: by 1592 Bacon's political and legal career was well established, he was confidential adviser to the Earl of Essex, the Queen's favourite, and had hopes of high office. Colman, however, hedged his bets; another copy of this genealogy survives, though incomplete and lacking the arms of a recipient.
Colman apparently petitioned for the office of herald in the latter part of the reign of Queen Elizabeth I, but never obtained it.

Tuesday, April 4, 2017

Terry Gilliam's film career


Terence Vance Gilliam, the well-known film director, has been in the news recently, for trying yet again to film his movie The Man Who Killed Don Quixote. This project started back in the early 1990s, and has been up and down like a yo-yo for more than 25 years. Maybe he will complete it this time, which he failed to do last year, or in 2010 or 2008 — and what happened back in 2000 is the stuff of cinema legend (as shown in the documentary Lost in La Mancha).

It has been said of Gilliam that "his directorial vision has secured his rightful place within the pantheon of substantive filmmakers as well as appreciative, if selective, audiences throughout his career." This means that his films often do well, but not all that well; he is more than an art-film maker, but not quite a mainstream director. You either love his movies or you don't — there is little or no middle ground.

Gilliam is probably best known for wanting to make what are called "independent" films but which require studio-scale funding, and then fighting with the studio executives over the finished product. He clearly wants to be an independent auteur but without the tight budget that normally goes with it. In other words, he makes his own bed and then has trouble lying in it.


As a director of some renown, Gilliam has attracted plenty of people interested in providing retrospectives and commentaries on his career. After all, that sort of thing seems to be the principal activity in the arts world — you are either a creator or a commentator, or sometimes both (such as film commentator turned film director Peter Bogdanovich).

So, it might be worthwhile to look at what some of these commentators have thought about Gilliam's career, as represented by his directorial repertoire of completed films. This ignores his involvement with television animations and various commercials.

To date, the Gilliam directorial oeuvre consists of 12 feature-length movies:
  • Monty Python and the Holy Grail (1975)
  • Jabberwocky (1977)
  • Time Bandits (1981)
  • Brazil (1985)
  • The Adventures Of Baron Munchausen (1988)
  • The Fisher King (1991)
  • Twelve Monkeys (1995)
  • Fear And Loathing In Las Vegas (1998)
  • The Brothers Grimm (2005)
  • Tideland (2005)
  • The Imaginarium of Dr Parnassus (2009)
  • The Zero Theorem (2013)
and 5 short films:
  • Storytime (1968)
  • The Miracle of Flight (1974)
  • The Crimson Permanent Assurance (1983)
  • The Legend of Hallowdega (2010)
  • The Wholly Family (2011)
In the modern world, arts commentators tend to provide rankings of works of art, telling us which work is "best" and which "worst". If nothing else, this allows a mathematical analysis, although I am never quite sure how one goes about actually ranking works of art in some linear series.

The available commentaries that contain ranked lists of Gilliam's films include some personal choices:
some compilations from members of the public:
and some compilations from professional critics:
There is also a list based on the adjusted US box office grosses (Box Office Mojo); there is a combined score from multiple sources (Ultimate Movie Rankings); and the Top 10 Films site does not rank three of the films. I will ignore these latter three lists, since they are not directly comparable to the other lists.

Few commentators have included the short films in their discussion, and so I will start my analysis with the two sources who have done so. Here is a time-course graph of the 17 films as ranked independently by both IndieWire and IMDB.


Note that both lists agree that Gilliam was at his best (ie. he produced the top third of his works) during the middle period of his career; and that he hasn't produced anything of note this century. This does not bode well for the future success of The Man Who Killed Don Quixote. [Note: The failure of this movie to be made is responsible for the large gap between films from 1998 to 2005.]

We could now use a phylogenetic network as a tool of exploratory data analysis, to display the consensus rankings of the feature films (only) from all of the commentators listed above. As usual, I first used the Manhattan distance to calculate the similarity of the different films based on their rankings. This was followed by a neighbor-net analysis to display the between-film similarities as a network. Films that are closely connected in the network are similar to each other based on their critic rankings, and those that are further apart are progressively more different from each other.
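For anyone who wants to try something similar, here is a minimal sketch of the distance step only, using a handful of invented ranks; the neighbor-net itself would then be computed from the resulting matrix in a phylogenetics program (such as SplitsTree), which is not shown here.

```python
# Pairwise Manhattan distances between films, based on invented ranks
# (rows = films, columns = commentators). Purely illustrative.
import numpy as np

films = ["Brazil", "Time Bandits", "Tideland"]
ranks = np.array([
    [1, 1, 2],
    [3, 2, 1],
    [11, 12, 12],
])

n = len(films)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist[i, j] = np.abs(ranks[i] - ranks[j]).sum()   # Manhattan distance

print(films)
print(dist)   # this matrix would be fed to a neighbor-net implementation
```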


The network shows a straightforward pattern from the highest ranked films at the top-right to the lowest at the bottom-left. In the graph, the films are numbered in the order of their production (not their ranking!). So, six of Gilliam's first seven films as director are the highest-ranked ones, by consensus, with Jabberwocky plus his final five films as the lowest-ranked.

Most of the commentators selected Brazil as their number one film, with occasional votes for Monty Python and the Holy Grail. More than half of the commentators selected The Brothers Grimm as the worst film, with Tideland running a strong second.

There is nothing unusual about any of this, of course. It is a truism of social history that most people, whether they are artists or scientists, do their most interesting and influential work during the earlier part of their career. From Isaac Newton to Albert Einstein, most scientists coast through their careers after age 35, sometimes in their later years still collecting awards for the useful work they did 20 years before. The best-known exception was Louis Pasteur, who made significantly different major contributions to chemistry and biology during his 20s, 30s and 40s.

Well, artists are no different. Very few of them become famous during their later life, but instead continue to be "interesting" without being either as original or influential as they were in their earlier career. They are often well known and well respected, although just as often completely forgotten, or even unknown to later generations. Gilliam, at least, has not suffered the latter fate.

Tuesday, March 28, 2017

Why we need alignments in historical linguistics


Alignments have been discussed quite a few times in this blog. They are so extremely common in molecular biology that I doubt that there are any debates about their usefulness, apart from certain attempts to improve the modelling, especially in cases of non-colinear patterns (Kehr et al. 2014), or to speed up computation (Mathur and Adlakha 2016). In linguistics, on the other hand, alignments are rarely used, although initial attempts to arrange homologous words in a matrix go back to the early 20th century, as you can see from this example taken from Dixon and Kroeber (1919: 61):

Early alignment from Dixon and Kroeber (1919)

This example is rather difficult to read for those not familiar with the annotation. The authors group homologous words across different indigenous languages from California. The group labels of the languages under investigation are given in abbreviated form at the very left of the matrix, and the actual varieties are listed in the next column. What follows is the actual alignment, along with comments in the last column. Regarding the alignments, the authors note on page 55:
A number of sets of cognates have been taken from their numbered place in this list and put at the end to allow of their being printed in columnar form, with a view to bringing out parallelisms that otherwise might fail to impress without detailed analysis and discussion. (Dixon and Kroeber 1919: 55)
In my opinion, this expresses nicely why alignments should be used more often in linguistics — due to the problem that our "alphabets" (the sound systems of languages) are undergoing constant change (see this earlier post for details regarding this claim), we need to infer both the scoring function between different sounds across different languages, and the alignment at the same time. If we look at the similarities the authors spotted, it should become obvious what I mean.

I am not yet sure how to interpret the data exactly, but if I am not mistaken, the authors claim that each column contains homologous material. So, they find a similarity between kaha in the first row (the language is Northern Wintun, according to the key to abbreviations in the book) and tu in the last row (Monterey Costanoan). The last column shows suffixes, which I think the authors exclude from their analysis, but I could not find additional information confirming this in their book.

The comment column illustrates another problem of representation, namely that the authors do not know how to handle cases of metathesis (or transposition) consistently. The transposition of parts of words is quite frequent in language evolution. It is especially common in compounds consisting of a modifier and a modified element, such as milk coffee in English, where milk modifies coffee; French, for example, puts the modifier after the main noun instead, expressing this as café au lait.

Nowadays, we can handle these cases consistently in linguistics, both in our data annotation and in the alignments, and we can even search for the structures automatically (see List et al. 2016). One hundred years ago, when Dixon and Kroeber worked out their comparison of the languages of California, they were pioneers trying to increase the transparency of our discipline, so it is no surprise that their solutions are not completely satisfying from today's perspective.
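As a rough illustration of what such an analysis involves, here is a minimal sketch of pairwise alignment over sound segments, with an explicit but entirely invented scoring function over toy sound classes; it is not the algorithm of List et al. (2016), and the segmentation of the example words is deliberately crude.

```python
# Needleman-Wunsch alignment of two segmented word forms, using a toy
# scoring scheme: identical segments score 2, segments of the same
# (invented) sound class score 1, everything else scores -1; gaps cost -1.
SOUND_CLASS = {"k": "K", "g": "K", "t": "T", "d": "T",
               "a": "V", "u": "V", "o": "V", "h": "H"}

def score(x, y):
    if x == y:
        return 2
    if SOUND_CLASS.get(x) == SOUND_CLASS.get(y):
        return 1
    return -1

def align(a, b, gap=-1):
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * gap
    for j in range(1, m + 1):
        M[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          M[i - 1][j] + gap, M[i][j - 1] + gap)
    out_a, out_b, i, j = [], [], n, m          # traceback
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]

# the kaha / tu pair from Dixon and Kroeber's table, segmented very roughly
print(align(["k", "a", "h", "a"], ["t", "u"]))
```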

It is extremely surprising to me that, despite these early attempts to make our homology judgments in linguistics more transparent, phonetic alignment is still rarely practised by historical linguists. Indeed, the majority of them even think that it is a waste of time, or only useful for the purpose of teaching.

I was reminded of this when I looked at a recent proposal by Bengtson (2017, see also this blog for details) for deep genetic connections between Basque and North Caucasian languages. Note that Basque is traditionally considered an isolate, i.e. a language for which we cannot find any close relatives among the languages of the world. Many linguists have attempted to solve this puzzle by proposing various hypotheses (see Forni 2013 for an example of attempting to link Basque with Indo-European). Bengtson proposes various types of evidence, which I cannot really judge, as I do not know the languages under comparison; but, finally, he also shows a list of potential homologs between Basque and North Caucasian varieties, which you can find below.

Potential homologs between Basque and North Caucasian languages (Bengtson 2017)

If you are not a trained historical linguist, and thus do not know what to do with this table, be assured that many historical linguists will feel similarly. As a rough explanation: the concepts are supposed to be very, very stable, being drawn from Sergey Yakhontov's list of 35 ultra-stable concepts, and I think that all of the words in one row are supposed to be etymologically related — that is, they should be potential homologs across all of the languages. If a word form is preceded by an asterisk (*), this means that it is reconstructed, i.e. not reflected in written sources. But that is all I can tell you for the moment. Where I should start the comparison between the words remains a mystery to me, as I do not know which parts are supposed to be similar. Alignments would help us to see immediately where the author thinks that the historical similarities can be found — that is, we would see which parts of the words are supposed to be homologous.

At this point in the post, I originally planned to provide you with an alignment of Bengtson's table, in order to illustrate the benefits of alignment in linguistics. Unfortunately, I had to admit to myself that I cannot do this, as I simply do not know where to align the words (apart from some rare trivial cases in the table).

I really hope that this will change in the future. Too often, our hypotheses in linguistics suffer from insufficient transparency with regard to the "proofs" and the evidence. I agree that it is very difficult to come up with good alignments in linguistics, especially when one has to deal with cases of metathesis, unrelated word parts, and general uncertainty. However, instead of giving in to the problem, we should follow the pioneering work of Dixon and Kroeber, and try to improve the way we present our data to both our colleagues and a broader public.

Theories such as the link between Basque and the North Caucasian languages are usually highly disputed in historical linguistics, and I do not know of any long range proposal that has gained broad acceptance during the last 50 years. Yet, maybe this is not because the proposals are not valid, but simply because those who are proposing these theories have failed to present their findings in a transparent and testable way.

References
  • Bengtson, J. (2017) The Euskaro-Caucasian Hypothesis. Current model. PDF.
  • Dixon, R. and A. Kroeber (1919) Linguistic families of California. University of California Press: Berkeley.
  • Forni, G. (2013) Evidence for Basque as an Indo-European language. The Journal of Indo-European Studies 41.1 & 2: 1-142.
  • Kehr, B., K. Trappe, M. Holtgrewe, and K. Reinert (2014) Genome alignment with graph data structures: a comparison. BMC Bioinformatics 15.1: 99.
  • List, J.-M., P. Lopez, and E. Bapteste (2016) Using sequence similarity networks to identify partial cognates in multilingual wordlists. In: Proceedings of the Association for Computational Linguistics 2016 (Volume 2: Short Papers). Association for Computational Linguistics, pp. 599-605.
  • Mathur, R. and N. Adlakha (2016) A graph theoretic model for prediction of reticulation events and phylogenetic networks for DNA sequences. Egyptian Journal of Basic and Applied Sciences 3.3: 263-271.

Tuesday, March 21, 2017

Computer viruses and phylogenetic networks


I have written before about the Phylogenetics of computer viruses. This is an example of the use of phylogenetics as a metaphor for the history of non-biological objects. By analogy, computer viruses and other malware can be seen to be phylogenetically related, because new viruses are usually generated using existing malicious computer code — that is, one virus "begets" another virus due to changes in its intrinsic attributes. In this sense the analogy is helpful, although there is no actual copying of anything resembling a genome — this is phenotype evolution not genotype evolution.

Furthermore, the model of historical change in computer viruses is often the same as that for biological viruses — recombination rather than substitution. That is, like real viruses, new computer viruses are often created by recombining chunks of functional information from pre-existing viruses, rather than by an accumulation of small changes. Coherent subsets of the current computer code are combined to form the new programs.


From this perspective, it is unexpected that the principal phylogenetic model in the study of computer viruses has been a tree rather than a network — a recombinational history requires a network representation, not a tree, and thus malware evolution is not tree-like. As noted by Liu et al. (2016): "Although tree-based models are the mainstream direction, they are not suited to represent the reticulation events which have happened in malware generation."

In my previous (2014) post, I noted only two known papers that used a network rather than a tree to represent malware evolution:
  • Goldberg et al. (1996) analyzed their data using what they call a phyloDAG, which is a directed network that can have multiple roots (it appears to be a type of minimum-spanning network; described in more detail in Phylogenetics of computer viruses);
  • Khoo & Lió (2011) used splits graphs rather than unrooted trees to display their data, although they did not specify the algorithm for producing their networks.
Unfortunately, malware researchers have continued to pursue the idea that a phylogeny is simply a form of classification, and have therefore stuck to the idea of producing a tree-like phylogeny using some form of hierarchical agglomerative clustering algorithm (eg. Bernardi et al. 2016).

More positively, however, some papers have appeared that have instead pursued the idea of using a network model rather than a tree:
  • Liu et al. (2016) provided median-joining networks, which are unrooted splits graphs, to display relationships within each of three different virus groups;
  • Jang et al. (2013) inferred a directed acyclic graph using a minimum spanning tree algorithm, with a post-processing step to allow nodes to have multiple parents;
  • Anderson et al. (2014) presented a novel algorithm based on a graphical lasso, which builds the phylogeny as an undirected graph, to which directionality is then added using a post-hoc heuristic;
  • Oyen et al. (2016) "present a novel Bayesian network discovery algorithm for learning a DAG [directed acyclic graph] via statistical inference of conditional dependencies from observed data with an informative prior on the partial ordering of variables. Our approach leverages the information on edge direction that a human can provide and the edge presence inference which data can provide."
It is important to note that only the works producing a directed graph can represent a phylogeny — the other works produce unrooted graphs that may or may not reflect phylogenetic history. The Bayesian work of Oyen et al. (2016) is particularly interesting:
Directionality is inferred by the learning process, but in many cases it is difficult to infer, therefore prior information is included about the edge directions, either from human experts or a simple heuristic. This paper introduces a novel approach to combining human knowledge about the ordering of variables into a statistical learning algorithm for Bayesian structure discovery. The learning algorithm with our prior combines the complementary benefits of using statistical data to infer dependencies while leveraging human knowledge about the direction of dependencies.

References

Anderson B, Lane T, Hash C (2014) Malware phylogenetics based on the multiview graphical lasso. Lecture Notes in Computer Science 8819: 1-12.

Bernardi ML, Cimitile M, Mercaldo F (2016) Process mining meets malware evolution : a study of the behavior of malicious code. Proceedings of the 2016 Fourth International Symposium on Computing and Networking, pp 616-622. IEEE Computer Society Washington, DC.

Goldberg LA, Goldberg PW, Phillips CA, Sorkin GB (1996) Constructing computer virus phylogenies. Lecture Notes in Computer Science 1075: 253-270. [also Journal of Algorithms (1998) 26: 188-208]

Jang J, Woo M, Brumley D (2013) Towards automatic software lineage inference. Proceedings of the Twenty-Second USENIX Conference on Security, pp 81-96. USENIX Association, Berkeley, CA.

Khoo WM, Lió P (2011) Unity in diversity: phylogenetic-inspired techniques for reverse engineering and detection of malware families. Proceedings of the 2011 First Systems Security Workshop (SysSec'11), pp 3-10. IEEE Computer Society Washington, DC.

Liu J, Wang Y, Wang Y (2016) Inferring phylogenetic networks of malware families from API sequences. Proceedings of the 2016 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, pp 14-17. IEEE Computer Society Washington, DC.

Oyen D, Anderson B, Anderson-Cook C (2016) Bayesian networks with prior knowledge for malware phylogenetics. The Workshops of the Thirtieth AAAI Conference on Artificial Intelligence Artificial Intelligence for Cyber Security: Technical Report WS-16-03, pp 185-192. Association for the Advancement of Artificial Intelligence, Palo Alto, CA.

Tuesday, March 14, 2017

Detecting introgression versus hybridization


There has been considerable interest in recent years in developing methods that will detect hybridization in the presence of incomplete lineage sorting (ILS), which will allow the construction of a realistic hybridization network. Clearly, both ILS and hybridization create conflicting gene trees, which will lead to a very complex data-display network. However, if the ILS signals in the data can be used to construct a small collection of gene-tree groups, in which the gene trees within each group are congruent with a single species tree (under the ILS model), then the incongruence between groups can be used to construct a hybridization network. This network will then be an hypothesis for a realistic evolutionary network.

Recently, a paper has appeared that uses simulations to evaluate several of these methods:
Olga K. Kamneva and Noah A. Rosenberg (2017) Simulation-based evaluation of hybridization network reconstruction methods in the presence of incomplete lineage sorting. Evolutionary Bioinformatics 2017:13.
I am not a great fan of simulations, because they exist under very restricted and usually unrealistic mathematical conditions. They are, however, useful for exploring the mathematical properties of various methods, even if they are hard to connect to the biological properties.

My interpretations of the results from the particular scenarios explored by Kamneva and Rosenberg are:
  1. Most of the methods improve as the internal network edges increase in length.
  2. Most of the methods improve as the number of gene trees increases.
  3. Under good conditions the maximum-likelihood methods do better than the parsimony and consensus methods.
  4. The maximum-likelihood methods are more affected by gene-tree error than are the other methods.
  5. There are conditions under which none of the methods work well.
I doubt that any of this is controversial, in the sense that model-based methods usually work well when their models apply, but not necessarily otherwise. Reality is more complex than the models, and so the methods are likely to fail for real data.

For me, the most interesting part of the paper is the examination of balanced versus skewed parental contributions to the hybrid taxon. A balanced genetic contribution in the simulations is analogous to homoploid or polyploid hybridization, whereas a skewed contribution is analogous to introgression or horizontal gene transfer (HGT). The simulations seem to show that the methods examined do not deal very well with skewed contributions.

So, these methods may literally be hybridization-network methods only, with separate network methods needed for detecting introgression or HGT — for example, the admixture methods used for genomes (see the recent post on Producing admixture graphs).

This would mean that we cannot first produce networks with reticulations, and then afterwards explore what is causing the reticulations. Instead, we will need to decide on the possible biological mechanisms of reticulation before the analysis, and then mathematically explore possible networks that reflect those mechanisms.

This is not an issue for constructing trees, of course, since the only recognized mechanisms are speciation and extinction, both of which are explored post hoc rather than a priori. This is an important difference of networks versus trees.

Tuesday, March 7, 2017

Roundels and family trees


I have written before about the slow development of what has come to be known as the "family tree", including reducing human network relationships to a tree-like form (Reducing networks to trees), and presenting it as an actual tree (Drawing family trees as trees), rooted at the base (Does it matter which way up a tree is drawn?).

Most of the early representations of pedigrees had the people's names enclosed in a circle, called a "roundel", and it was these roundels that were connected to show the family relationships. One of the steps on the way to a tree was thus dropping this idea, so that the names could be connected directly.

Some of the diagrams with roundels that I have covered include:

c. 400 CE — The genealogy of Jesus Christ, Part I, Part II, Part III
c. 1000 CE — Genealogy of Cunigunde of Luxembourg
c. 1140 CE — Genealogy of the Carolingians
c. 1185 CE — Genealogy of the Welf dynasty
c. 1237 CE — Genealogy of the Ottonian dynasty

Interestingly, the earliest pedigrees that do not have roundels also date from this early period. As noted by Nathaniel Lane Taylor, the importance of this development is that: "the scribe relies on the power of the names themselves to anchor a diagram on the page, with lines simply taking the place of any syntax needed to describe the filiation." That is, no abstract iconography is needed.

I have already illustrated the earliest known example:
c. 1121 CE — The genealogy of Lambert of Saint-Omer


Taylor provides links to illustrations of the next known example:
   c. 1128, John of Worcester, Chronicle of World and English History (Corpus Christi College MS 157).
This book contains eight genealogies of Anglo-Saxon and Norman kings (pp. 47-54), one of which is shown above.

Taylor also refers to "one of the Arabic stemmata" illustrated in:
   Arthur Watson (1934) The Early Iconography of the Tree of Jesse. Oxford University Press.
I have not seen this book, but the illustrations are apparently confined to those from the 12th century, making the diagram contemporaneous with the two listed above. The Tree of Jesse normally appears in Medieval Christian art as a richly illustrated genealogy of Jesus in illuminated manuscripts, but apparently this one was an exception.

Tuesday, February 28, 2017

Models and processes in phylogenetic reconstruction


Since I started interdisciplinary work (linguistics and phylogenetics), I have repeatedly heard the expression "model-based". This expression often occurs in the context of parsimony vs. maximum likelihood and Bayesian inference, and it is usually embedded in statements like "the advantage of ML is that it is model-based", or "but parsimony is not model-based". By now I assume that I get the gist of these sentences, but I am afraid that I often still do not get their point. The problem is the ambiguity of the word "model", in both biology and linguistics.

What is a model? For me, a model is usually a formal way to describe a process that we deal with in our respective sciences, nothing more. If we talk about the phenomenon of lexical borrowing, for example, there are many distinct processes by which borrowing can happen.

A clear-cut case is Chinese kāfēi 咖啡 "coffee". This word was obviously borrowed from some Western language not that long ago. I do not know the exact details (which would require a rather lengthy literature review and an inspection of older sources), but it is obvious that the word is not very old in Chinese. The fact that the pronunciation comes close to the word for coffee in the largest European languages (French, English, German) is a further hint, since the longer a new word has survived after having been transplanted into another language, the more it comes to resemble other words in that language in its phonological structure; and the syllable does not occur in other words in Chinese. We can depict the process with the help of the following visualization:


Lexical borrowing: direct transfer
The visualization tells us a lot about a very rough and very basic idea as to how the borrowing of words proceeds in linguistics: Each word has a form and a function, and direct borrowing, as we could call this specific subprocess, proceeds by transferring both the form and the function from the donor language to the target language. This is a very specific type of borrowing, and many borrowing processes do not directly follow this pattern.

In the Chinese word xǐnǎo 洗脑 "brain-wash", for example, the form (the pronunciation) has not been transferred. But if we look at the morphological structure of xǐnǎo, being a compound consisting of the verb xǐ "to wash" and nǎo "the brain", it is clear that here Chinese borrowed only the meaning. We can visualize this as follows:
Lexical borrowing: meaning transfer

Unfortunately, I am already starting to simplify here. Chinese did not simply borrow the meaning; it borrowed the expression — that is, the motivation to express this specific meaning in a way analogous to the English expression. However, when borrowing meanings instead of full words, it is by no means guaranteed that the speakers will borrow exactly the same structure of expression that they find in the donor language. The German equivalent of skyscraper, for example, is Wolkenkratzer, which literally translates as "cloudscraper".

There are many different ways to coin a good equivalent for "brain-wash" in any language of the world but which are not analogous to the English expression. One could, for example, also call it "head-wash", "empty-head", "turn-head", or "screw-mind"; and the only reason we call it "brain-wash" (instead of these others) is that this word was chosen at some time when people felt the need to express this specific meaning, and the expression turned out to be successful (for whatever reason).

Thus, instead of just distinguishing between "form transfer" and "meaning transfer", as my visualizations above suggest, we can easily find many more fine-grained ways to describe the processes of lexical borrowing in language evolution. Long ago, I took the time to visualize the different types of borrowing process mentioned in the work of Weinreich (1953 [1974]) in the following graphic:

Lexical borrowing: hierarchy following Weinreich (1953[1974])

From my colleagues in biology, I know well that we find a similar situation in bacterial evolution, with different types of lateral gene transfer (Nelson-Sathi et al. 2013). We are not even sure whether the account by Weinreich, as displayed in the graphic, is actually exhaustive; and the same holds for evolutionary biology and bacterial evolution.

But it may be time to get back to the models at this point, as I assume that some of you who have read this far have begun to wonder why I am spending so many words and graphics on borrowing processes when I promised to talk about models. The reason is that, in my usage of the term "model" in scientific contexts, I usually have in mind exactly what I have described above. For me (and I suppose not only for me, but for many linguists, biologists, and scientists in general), models are attempts to formalize processes by classifying and distinguishing them; and flow-charts, typologies, descriptions, and the identification of distinctions are an informal way to communicate them.

If we use the term "model" in this broad sense, and look back at the discussion about parsimony, maximum likelihood, and Bayesian inference, it also becomes clear that it does not make immediate sense to say that parsimony lacks a model while the other approaches are model-based. I understand why one may want to make this strong distinction between parsimony and methods based on likelihood-thinking, but I do not understand why the term "model" needs to be employed in this context.

Nearly all recent phylogenetic analyses in linguistics use binary characters and describe their evolution with the help of simple birth-death processes. The only difference between parsimony and likelihood-based methods is how the birth-death processes are modelled stochastically. Unfortunately, we know very well that neither lexical borrowing nor "normal" lexical change can be realistically described as a birth-death process. We even know that these birth-death processes are essentially misleading (for details, see List 2016). Instead of investing our time in enhancing and discussing the stochastic models driving birth-death processes in linguistics, wouldn't it be worthwhile to have a closer look at the real processes that we want to describe?
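To make clear what such a birth-death (gain-loss) process amounts to, here is a minimal simulation sketch for a single binary character (a cognate class that is either present or absent) on a tiny invented tree, with invented gain and loss rates. It is not the model of any particular published analysis.

```python
# Simulate one 0/1 character under a two-state gain-loss process along the
# branches of a toy tree. Rates, branch lengths and the tree are invented.
import random

GAIN, LOSS = 0.1, 0.3   # assumed per-unit-time rates of gaining / losing the cognate

def evolve(state: int, branch_length: float) -> int:
    """Run the two-state Markov chain along one branch and return the end state."""
    t = 0.0
    while True:
        t += random.expovariate(LOSS if state == 1 else GAIN)
        if t > branch_length:
            return state
        state = 1 - state   # a gain or a loss event happened

def simulate(tree, state):
    """tree = (label, branch_length, children); returns {leaf_label: state}."""
    label, blen, children = tree
    state = evolve(state, blen)
    if not children:
        return {label: state}
    result = {}
    for child in children:
        result.update(simulate(child, state))
    return result

toy_tree = ("root", 0.0, [("German", 1.0, []),
                          ("English", 1.0, []),
                          ("Russian", 1.5, [])])
random.seed(1)
print(simulate(toy_tree, state=1))   # which tips still have the cognate
```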

References
  • List, J.-M. (2016) Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1.2. 119-136.
  • Nelson-Sathi, S., O. Popa, J.-M. List, H. Geisler, W. Martin, and T. Dagan (2013) Reconstructing the lateral component of language history and genome evolution using network approaches. In: Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Franz Steiner Verlag: Stuttgart, pp. 163-180.
  • Weinreich, U. (1974) Languages in contact. With a preface by André Martinet. Mouton: The Hague and Paris.

Saturday, February 25, 2017

Blog anniversary: 5 years


The first post was put up on this blog on Saturday, February 25, 2012, which makes today the fifth anniversary.

First blog header

By my reckoning, this is the 469th blog post, not all of them written by me, of course; this works out at an average of one post every 3.9 days over the 1,827 days. I have never counted the number of actual words, but if I had ever contemplated that number then I probably would never have started.

Second blog header

It is rather tricky to estimate the readership, because of the number of blog hits that clearly come from robots. However, even trying to take that into account, I get an estimate just short of 500,000 pageviews over the 5 years.

Third blog header

So, thanks to everyone for dropping by. If you ever feel inclined to re-read any of the old posts, then they are grouped roughly by topic in the "Pages" at the top of the right-hand column.

Monday, February 20, 2017

Producing admixture graphs


I have written before about admixture graphs, which are phylogenetic networks that represent reticulations due to introgression.
To date, these graphs have not really been incorporated into the mainstream network literature. Part of the problem has been the rather disparate nature of the admixture literature itself. A paper has recently appeared as a preprint in Bioinformatics that provides a brief introduction to this situation:
  • Kalle Leppälä, Svend Vendelbo Nielsen, Thomas Mailund (2017) admixturegraph: an R package for admixture graph manipulation and fitting. Bioinformatics

There are currently several quite different programs for producing admixture graphs:
  • qpgraph (Castelo and Roverato 2006)
  • TreeMix (Pickrell and Pritchard 2012)
  • AdmixTools (Patterson et al. 2012)
  • MixMapper (Lipson et al. 2013)
  • admixturegraph (see above)
These programs summarize the genetic data in different ways based on genetic drift (eg. the covariance matrix versus so-called f statistics), and construct the graphs in different ways (eg. sequential heuristic building versus a user specified graph). There are also different ways to evaluate the graphs, including fitting the graph parameters using likelihood, and comparing them, including the bootstrap, jackknife, and MCMC.
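For readers unfamiliar with the f statistics mentioned above, here is a minimal sketch of the basic f3 and f4 calculations from allele frequencies. The frequencies are invented, and the sketch omits the bias corrections and block-resampling machinery that the real packages provide.

```python
# Toy f3 and f4 statistics from invented allele frequencies at 1,000 SNPs.
import numpy as np

rng = np.random.default_rng(0)
A, B, C, D = (rng.uniform(0.05, 0.95, 1000) for _ in range(4))   # invented frequencies

def f3(c, a, b):
    """f3(C; A, B): markedly negative values can indicate that C is admixed from A and B."""
    return np.mean((c - a) * (c - b))

def f4(a, b, c, d):
    """f4(A, B; C, D): expected to be zero if no gene flow connects {A,B} with {C,D}."""
    return np.mean((a - b) * (c - d))

print(round(f3(C, A, B), 4), round(f4(A, B, C, D), 4))
```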

None of this diversity of approaches is ideal. Another problem has been that the graphs are often constructed by hand, and may be needed as input to the programs. However, the biggest limitation is that there are currently no algorithms for inferring the optimal graph topology. This is, of course, the basic problem that needs to be solved for all network construction. To quote the authors with regard to their own R package:
The set of all possible graphs, even when limited to one or two admixture events, grows super-exponentially in the number of leaves, and it is generally not computationally feasible to explore this set exhaustively. Still, we give graph libraries for searching through all possible topologies with not too many leaves and admixture events.
For larger graphs we provide functions for exploring all possible graphs that can be reached from a given graph by adding one extra admixture event or by adding one additional leaf. However, the best fitting admixture graphs are not necessarily extensions of best fitting smaller graphs, so we recommend that users not only expand the best smaller graph but a selected few best of them.
The world of graph-edge rearrangements (NNI, SPR) does not yet seem to have encountered the world of admixture graphs.

Tuesday, February 14, 2017

The evolution of women's clothing sizes


Several years ago I presented a piece about the Evolutionary history of Mazda motor cars, in which I pointed out that what is known in biology as Cope's Rule of phyletic size increase applies to manufactured objects as well as to biological organisms. This "rule" suggests that the size of the organisms within a species generally increases through evolutionary time. Human beings, for example, are on average larger now than they were a few thousand years ago. Furthermore, through time, new species arise to occupy the niches that have been vacated (because the previous organisms are now too big to fit).

This situation is easy to demonstrate for cars, because all successful car models get bigger through time — the customers indicate that the car is not quite big enough, and the manufacturer responds. Some examples are illustrated in Car sizes through the years.

Another simple example is women's clothing, which I will discuss here.

Women's clothing changes through time in response to two factors in the modern world: changes in the "desired" image of women (as discussed in the post on Changes in Playboy's women through 60 years), and increasing obesity in western society (see the post on Fast food and diet). Illustrating Cope's Rule in this case is thus easy.

There have been five voluntary "standards" developed over the past century for standardized clothing sizes in the USA, as discussed in Wikipedia. These standards describe, for example, what sized woman should fit into a Size 12 in terms of various of her dimensions. There is nothing mandatory about these standards, and they simply reflect societal recommendations at any given time. So, a Size 12 in 1958 is not the same as a Size 12 in 2008.

These three graphs illustrate the time course of the changes in each of the defined clothing sizes (Size 0 to Size 20), in terms of three female girth measurements.




This is blatantly Cope's Rule in all three cases. All of the sizes get bigger through time, at approximately the same rate. Furthermore, as the dimensions increase through time, new sizes appear to fit the smaller women — Size 8 did not exist in 1931, Size 6 did not exist in 1958, Sizes 2 and 4 did not exist in 1971, and Sizes 0 and 00 did not exist in 1995.

To put it another way, a Size 12 woman today is much larger than her Size 12 mother was, who in turn was bigger than the Size 12 grandmother. I believe that this is referred to in the clothing business as "vanity sizing", which it may well be, but it is also a natural example of Cope's Rule of phyletic size increase.

Finally, there is no reason to expect that this phyletic size increase will stop any time soon. Do cars or clothes have an upper limit on their size? Biological organisms do, mainly because of the effect of gravity, and so the phyletic size increase either ceases or the species becomes extinct. Manufactured objects are different.

Data sources
  • DuBarry / Woolworth (1931-1955) - see Wikipedia
  • National Institute of Standards and Appeals (1958) Commercial Standard CS215-58: Body Measurements for the Sizing of Women's Patterns and Apparel Table 4
  • National Institute of Standards and Appeals (1971) Commercial Standard PS42-70: Body Measurements for the Sizing of Women's Patterns and Apparel Table 4
  • ASTM International (1995, revised 2001) Standard D5585 95 (R2001)
  • ASTM International (2011) Active Standard D5585 11e1: Standard Tables of Body Measurements for Adult Female Misses Figure Type, Size Range 00–20

Tuesday, February 7, 2017

Networks, trees and sequence polymorphisms


One of the more obvious bits of evidence that an organismal history may not be entirely tree-like is the presence of sequence polymorphisms. For example, intra-individual site polymorphisms in ITS sequences create considerable conflict in a dataset, if we try to construct a tree-like phylogeny.

This means that people have adopted a range of strategies to try to get a nice neat tree out of their data. This topic is briefly reviewed in this recent paper:
Agnes Scheunert and Günther Heubl (2017) Against all odds: reconstructing the evolutionary history of Scrophularia (Scrophulariaceae) despite high levels of incongruence and reticulate evolution. Organisms Diversity and Evolution in press.
The authors discuss the following strategies, for which they also provide a few literature references.

1. Delete the offending taxa

Pruning the offending taxa is among the most-used tactics. This deletes part of the phylogeny, of course.

2. Delete the polymorphisms

Excluding the polymorphic alignment positions is probably the most common tactic. Similar strategies include the replacement of the polymorphisms with either a missing data code or the most common nucleotide at that position. All of these ideas resolve the polymorphisms in favor of the strongest phylogenetic signal, and thus sweep the conflicting signals under the carpet.

3. Select single gene copies

The polymorphisms become apparent because there are multiple copies of the gene(s) concerned, and therefore selecting a single copy removes the polymorphisms. This can be done by cloning the gene (at the time of data collection), or by statistical haplotype phasing methods (during the data analysis). This also sweeps the conflicting signals under the carpet.

4. Code the polymorphisms

As a preferred alternative, rather than discarding or substituting the sequence variabilities, we could include them as phylogenetically informative characters. This would allow the construction of a phylogenetic network, as well as a tree-like history.

One possibility, suggested by Fuertes Aguilar and Nieto Feliner (2003), concentrates on Additive Polymorphic Sites (APS). A sequence site is an APS when each of the nucleotides involved in the polymorphism can also be found separately at the same site in at least one other accession. Other intra-individual polymorphisms are ignored. This approach has been used to detect hybrids, for example.
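As a rough illustration of the APS idea, here is a minimal sketch that checks whether an ambiguous site is additive in a toy alignment. The alignment and the reduced IUPAC table are invented for the example, and this is not the actual procedure of Fuertes Aguilar and Nieto Feliner (2003).

```python
# Flag Additive Polymorphic Sites: an ambiguity code (e.g. R = A/G) at a site
# counts as additive if each of its underlying bases also occurs unambiguously
# at the same site in some other accession.
IUPAC = {"R": {"A", "G"}, "Y": {"C", "T"}, "S": {"G", "C"},
         "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"}}

alignment = {                      # toy ITS alignment
    "accession1": "ACGR",
    "accession2": "ACGA",
    "accession3": "ATGG",
    "accession4": "ATGY",
}

ncols = len(next(iter(alignment.values())))
for j in range(ncols):
    column = [seq[j] for seq in alignment.values()]
    unambiguous = {c for c in column if c not in IUPAC}
    for code in set(column) & set(IUPAC):
        additive = IUPAC[code] <= unambiguous   # are both underlying bases seen elsewhere?
        print(f"site {j + 1}: {code} is {'an APS' if additive else 'not an APS'}")
```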

An alternative, as used by Scheunert and Heubl to study reticulate evolution in their paper, uses 2ISPs (Intra-Individual Site Polymorphisms). All IUPAC codes, including those for polymorphic sites, are treated as unique characters, by recoding the complete alignment as a standard matrix, which is then analyzed using a multistate analysis option for categorical data. The authors actually use the ad hoc maximum-likelihood implementation from Potts et al. (2014), with additional adaptation of a method for Bayesian inference based on Grimm et al. (2007).
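And here is a correspondingly minimal sketch of the recoding idea behind the 2ISP approach: every symbol observed at a site, ambiguity codes included, gets its own state in a standard multistate matrix. Again, the toy alignment is invented, and this is not the actual pipeline of Potts et al. (2014).

```python
# Recode a toy alignment so that each IUPAC symbol (including ambiguity codes
# such as R) is treated as a character state in its own right.
alignment = {
    "taxon1": "ACGTR",
    "taxon2": "ACGTA",
    "taxon3": "ACTTG",
    "taxon4": "ATTTR",
}

ncols = len(next(iter(alignment.values())))
state_maps = []
for j in range(ncols):
    observed = sorted({seq[j] for seq in alignment.values()})
    state_maps.append({symbol: str(i) for i, symbol in enumerate(observed)})

recoded = {taxon: "".join(state_maps[j][seq[j]] for j in range(ncols))
           for taxon, seq in alignment.items()}

for taxon, row in recoded.items():
    print(taxon, row)   # the R in the last column is its own state, not A or G
```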

You can check out these papers for details.

References

Fuertes Aguilar J., Nieto Feliner G. (2003) Additive polymorphisms and reticulation in an ITS phylogeny of thrifts (Armeria, Plumbaginaceae). Molecular Phylogenetics and Evolution 28: 430-447.

Grimm G.W., Denk T., Hemleben V. (2007) Coding of intraspecific nucleotide polymorphisms: a tool to resolve reticulate evolutionary relationships in the ITS of beech trees (Fagus L., Fagaceae). Systematics and Biodiversity 5: 291-309.

Potts A.J., Hedderson T.A., Grimm G.W. (2014) Constructing phylogenies in the presence of intra-individual site polymorphisms (2ISPs) with a focus on the nuclear ribosomal cistron. Systematic Biology 63: 1-16.

Tuesday, January 31, 2017

Similarities and language relationship


There is a long-standing debate in linguistics regarding the best proof of deep relationships between languages. Scholars often break it down to the question of words vs. rules, or lexicon vs. grammar. However, this is essentially misleading, since it suggests that only one type of evidence could ever be used, whereas most of the time it is the accumulation of multiple pieces of evidence that convinces scholars. Even if this debate is misleading, it is interesting, since it reflects a general problem of historical linguistics: the problem of similarities between languages, and how to interpret them.

Unlike (or like?) biology, linguistics has a serious problem with similarities. Languages can be strikingly similar in various ways. They can share similar words, but also similar structures, similar ways of expressing things.

In Chinese, for example, new words can easily be created by compounding existing ones, and the word for 'train' is expressed by combining huǒ 火 'fire' and chē 車 'wagon'. The same can be done in languages like German and English, where the words Feuerwagen and fire wagon would be interpreted slightly differently by the speakers, but the constructions are nevertheless valid candidates for words in both languages. In Russian, on the other hand, it is not possible to simply put two nouns together to form a new word; instead, one needs to say something like огненная машина (ognyonnaya mašína), which could literally be translated as 'fiery wagon'.

Neither German nor English is historically closely related to Chinese, but German, English, and Russian go back to the same relatively recent ancestral language. We can see that whether or not a language allows the compounding of two words to form a new one is not really indicative of its history; the same holds for whether a language has an article, or whether it has a case system.

The problem with similarities between languages is that the apparent similarities may have different sources, and not all of them are due to historical development. Similarities can be:
  1. coincidental (simply due to chance),
  2. natural (being grounded in human cognition),
  3. genealogical (due to common inheritance), and
  4. contact-induced (due to lateral transfer).
As an example of the first type of similarity, consider the Modern Greek word θεός [θɛɔs] ‘god’ and the Spanish dios [diɔs] ‘god’. Both words look similar and sound similar, but this is sheer coincidence. This becomes clear when comparing the oldest ancestral forms of the words that are reflected in written sources, namely Old Latin deivos and Mycenaean Greek thehós (Meier-Brügger 2002: 57f).

As an example of the second type of similarity, consider the Chinese word māmā 媽媽 'mother' vs. the German Mama 'mother'. Both words are strikingly similar, not because they are related, but because they reflect the process of language acquisition by children, which usually starts with vowels like [a] and the nasal consonant [m] (Jakobson 1960).

An example of genealogical similarity is the German Zahn and the English tooth, both going back to a Proto-Germanic form *tanθ-. Contact-induced similarity (the fourth type) is reflected in the English mountain and the French montagne, since the former was borrowed from the latter.

We can display these similarities in the following decision tree, along with examples from the lexicon of different languages (see List 2014: 56):

Four basic types of similarity in linguistics

In this figure, I have highlighted the last two types of similarity (in a box) in order to indicate that they are historical similarities. They reflect individual language development, and allow us to investigate the evolutionary history of languages. Natural and coincidental similarities, on the other hand, are not indicative of history.

When trying to infer the evolutionary history of languages, it is thus crucial to first rule out the non-historical similarities, and then the contact-induced similarities. The non-historical similarities will only add noise to the historical signal, and the contact-induced similarities need to be separated from the genealogical similarities, in order to find out which languages share a common origin and which languages have merely influenced each other some time during their history.

Unfortunately, it is not trivial to disentangle these similarities. Coincidence, for example, seems easy to handle, but it is notoriously difficult to calculate the likelihood of chance similarities. Scholars have tried to model the probability of chance similarities mathematically, but their models are far too simple to provide good estimates, as they usually consider only the first consonant of a word, in no more than 200 words from each language (Ringe 1992, Baxter and Manaster Ramer 2000, Kessler 2001).

The problem here is that anything going beyond word-initial consonants would have to take the probability of word structures into account. However, since languages differ greatly regarding their so-called phonotactic structure (that is, the sound combinations they allow within a syllable or a word), an account of chance similarities would need to include a probabilistic model of possible, language-specific word structures. So far, I am not aware of anybody who has tried to tackle this problem.
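
To make the general idea concrete, here is a minimal sketch of the kind of permutation logic that underlies such tests: shuffle the word list of one language many times, and compare the observed number of word-initial consonant matches with the shuffled baseline. This is my own toy illustration, not the actual procedure used by Ringe, Baxter and Manaster Ramer, or Kessler; the five-item "word lists" and the crude first-consonant criterion are invented for the example.

import random

def initial_consonant(word, vowels="aeiou"):
    """Crude first-consonant extractor; real studies use sound classes."""
    for ch in word:
        if ch not in vowels:
            return ch
    return ""

def chance_matches(list1, list2, n_shuffles=10000):
    """Observed first-consonant matches versus a permutation baseline."""
    observed = sum(initial_consonant(a) == initial_consonant(b)
                   for a, b in zip(list1, list2))
    shuffled, baseline = list(list2), []
    for _ in range(n_shuffles):
        random.shuffle(shuffled)
        baseline.append(sum(initial_consonant(a) == initial_consonant(b)
                            for a, b in zip(list1, shuffled)))
    p = sum(b >= observed for b in baseline) / n_shuffles
    return observed, p

# Hypothetical word lists: the same five concepts, in the same order, for two languages
english = ["two", "tooth", "water", "name", "sun"]
german = ["zwei", "zahn", "wasser", "name", "sonne"]
print(chance_matches(english, german))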

Even more problematic is the second type of similarity. At first sight, it seems that one could capture natural similarities by searching for similarities that recur in very diverse locations of the world. If we compare, for example, which languages have tones, and we find that tones occur almost all over the world, we could argue that the existence of tone languages is not a good indicator of relatedness, since tonal systems can easily develop independently.

The problem with independent development, however, is again tricky, as we need to distinguish different aspects of independence. Independent development could be due to human cognition (the fact that many languages all over the world denote the bark of a tree with a compound 'tree-skin' is obviously grounded in our perception), or to language acquisition (as with the words for 'mother'), but potentially also to environmental factors, such as the size of the population of speakers (Lupyan and Dale 2010) or the location where the languages are spoken (see Everett et al. 2015, but also compare the critical assessment in Hammarström 2016).

Convergence (in linguistics, the term is used to denote similar development due to contact) is a very frequent phenomenon in language evolution, and can happen in all domains of language. Often we simply do not know enough to make a qualified assessment as to whether certain features that are similar among languages are inherited/borrowed or have developed independently.

Interestingly, this was first emphasized by Karl Brugmann (1849-1919), who is often credited as the "father of cladistic thinking" in linguistics. Linguists usually quote his 1884 paper in order to emphasize the crucial role that Brugmann attributed to shared innovations (synapomorphies, in cladistic terminology) for the purpose of subgrouping. When the paper is read thoroughly, however, it becomes obvious that Brugmann himself was much less concerned with the obscure and circular notion of shared innovations (a problem that also holds for cladistics in biology; see De Laet 2005) than with the fact that it is often impossible actually to find them, owing to our inability to disentangle independent development, inheritance, and borrowing.

So far, most linguistic research has concentrated on the problem of distinguishing borrowed from inherited traits, and it is here that the fight over whether the lexicon or the grammar provides the primary evidence for relatedness developed. Since certain aspects of grammar, like case inflection, are rarely transferred from one language to another, while words are easily borrowed, some linguists claim that only grammatical similarities constitute sufficient evidence of language relationship. This argument is not necessarily productive, since many languages simply lack grammatical structures like inflection, and would therefore not be amenable to any investigation if we accepted only inflectional morphology (grammar) as rigorous proof (for a full discussion, see Dybo and Starostin 2008). Luckily, we do not need to go that far. Aikhenvald (2007: 5) proposes the following borrowability scale:
Aikhenvald's (2007) scale of borrowability

As we can see from this scale, core lexicon (basic vocabulary) ranks second, right behind inflectional morphology. Pragmatically, we can thus say: if we have nothing but the words, it is better to compare words than anything else. More importantly, even when we compare what people label "grammar", we compare concrete form-meaning pairs (e.g., particular plural endings), never abstract features (e.g., whether languages have an article). We do so in order to avoid the "homoplasy problem" that causes so many headaches in our research. No biologist would group insects, birds, and bats on the basis of their wings; and no linguist would group Chinese and English on the basis of their lack of complex morphology and their preference for compound words.

Why do I mention all this in this blog post? For three main reasons. First, the problem of similarity still creates a lot of confusion in interdisciplinary dialogues between linguistics and biology. David is right: similarity between linguistic traits is more like similarity between morphological traits in biology (the phenotype), but too often scholars draw the analogy with genes (the genotype) (Morrison 2014).

Second, the problem of disentangling different kinds of similarities is not unique to linguistics, but is also present in biology (Gordon and Notar 2015), and comparing the problems that both disciplines face is interesting and may even be inspiring.

Third, the problem of similarities has direct implications for our null hypothesis when considering certain types of data. David asked in a recent blog post: "What is the null hypothesis for a phylogeny?" When dealing with observed similarity patterns across different languages, and recalling that we do not have the luxury to assume monogenesis in language evolution, we might want to know what the null hypothesis for these data should be. I have to admit, however, that I really don't know the answer.

References
  • Aikhenvald, A. (2007): Grammars in contact. A cross-linguistic perspective. In: Aikhenvald, A. and R. Dixon (eds.): Grammars in Contact. Oxford University Press: Oxford. 1-66.
  • Baxter, W. and A. Manaster Ramer (2000): Beyond lumping and splitting: Probabilistic issues in historical linguistics. In: Renfrew, C., A. McMahon, and L. Trask (eds.): Time depth in historical linguistics. McDonald Institute for Archaeological Research: Cambridge. 167-188.
  • Brugmann, K. (1884): Zur Frage nach den Verwandtschaftsverhältnissen der indogermanischen Sprachen [On the question of the relationships among the Indo-European languages]. Internationale Zeitschrift für allgemeine Sprachwissenschaft 1. 228-256.
  • De Laet, J. (2005): Parsimony and the problem of inapplicables in sequence data. In: Albert, V. (ed.): Parsimony, phylogeny, and genomics. Oxford University Press: Oxford. 81-116.
  • Dybo, A. and G. Starostin (2008): In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.): Aspekty komparativistiki 3. RGGU: Moscow. 119-258.
  • Everett, C., D. Blasi, and S. Roberts (2015): Climate, vocal folds, and tonal languages: Connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5. 1322-1327.
  • Gordon, M. and J. Notar (2015): Can systems biology help to separate evolutionary analogies (convergent homoplasies) from homologies? Progress in Biophysics and Molecular Biology 117. 19-29.
  • Hammarström, H. (2016): There is no demonstrable effect of desiccation. Journal of Language Evolution 1.1. 65–69.
  • Jakobson, R. (1960): Why ‘Mama’ and ‘Papa’? In: Perspectives in psychological theory: Essays in honor of Heinz Werner. 124-134.
  • Kessler, B. (2001): The significance of word lists. Statistical tests for investigating historical connections between languages. CSLI Publications: Stanford.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • Lupyan, G. and R. Dale (2010): Language structure is partly determined by social structure. PLoS ONE 5.1. e8559.
  • Meier-Brügger, M. (2002): Indogermanische Sprachwissenschaft. de Gruyter: Berlin and New York.
  • Morrison, D. (2014): Is the Tree of Life the best metaphor, model, or heuristic for phylogenetics? Systematic Biology 63.4. 628-638.
  • Ringe, D. (1992): On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82.1. 1-110.

Wednesday, January 25, 2017

Irony and duality in phylogenetics


Ruben E. Valas and Philip E. Bourne (2010. Save the tree of life or get lost in the woods. Biology Direct 5: 44) have an interesting discussion of the relationship between the Tree of Life and the Web of Life. They argue that:
Function follows more of a tree-like structure than genetic material, even in the presence of horizontal transfer ... We propose a duality where we must consider variation of genetic material in terms of networks and selection of cellular function in terms of trees. Otherwise one gets lost in the woods of neutral evolution.

As an aside, they also note:
We must keep in mind the humor of calling the central metaphor for evolution "the tree of life". The phrase first appears in Genesis 2:9 ... There is irony in using the name of a tree central to the creation story to argue against that very myth.

There is clearly a duality in Darwin's theory of descent with modification: the history of variation is well described by a network and the history of selection is well described by a tree.

Tuesday, January 17, 2017

What is the null hypothesis for a phylogeny?


As noted in the previous blog post (Why do we need Bayesian phylogenetic information content?), phylogeneticists rarely consider whether their data actually contain much phylogenetic information. Nevertheless, the existence of information content in a dataset implies the existence of a null hypothesis of "no information", relative to the objective of the data analysis.

In this regard, Alexander Suh (2016), in a paper on the phylogenetics of birds, makes two important general points:
  • Every phylogenetic tree hypothesis should be accompanied by a phylogenetic network for visualization of conflicts.
  • Hard polytomies exist in nature and should be treated as the null hypothesis in the absence of reproducible tree topologies.
It is difficult to argue with the first point, of course. However, the second point is also an interesting one, and deserves some consideration. Suh notes that: "In contrast to ‘soft polytomies’ that result from insufficient data, ‘hard polytomies’ reflect the biological limit of phylogenetic resolution because of near-simultaneous speciation". That is, the distinction is whether polytomies result from simultaneous branching events (hard) or from insufficient sequence information (soft).

The matter of a suitable null hypothesis in phylogenetics has been considered before, for example by Hoelzer and Melnick (1994) and Walsh et al. (1999), who come to essentially the same conclusion as Suh (2016). Clearly, a network cannot be the null hypothesis for a phylogeny, and nor can a resolved tree (even a partially resolved one); the only logical possibility is a polytomy.

However, it seems to me that the current null hypothesis is effectively a soft polytomy, although most workers never explicitly state any hypothesis at all. Any evidence that resolves a polytomy seems to be accepted, with the strongest evidence given precedence whenever lines of evidence conflict. This inevitably produces a tree that is at least partly resolved, which is the alternative hypothesis.

On the other hand, resolving a hard polytomy requires unambiguous evidence for each branch in the phylogeny. If there is substantial conflict then it can only be resolved as a reticulation, or it must remain a polytomy. The existence of a reticulation, of course, results in a network, not a tree, so that the alternative hypothesis is a network, which may in practice be very tree-like.

So, in phylogenetics we have: null hypothesis = hard polytomy, alternative = network, rather than null hypothesis = soft polytomy, alternative = partially resolved tree.
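
To illustrate the difference in practice, here is a toy sketch (my own, not taken from Suh or Walsh et al.) of what treating a hard polytomy as the null hypothesis might look like for a single internal branch: the numbers of informative sites supporting each of the three possible resolutions are compared against the equal-support expectation, and the branch is resolved only if that null can be rejected. The site counts are invented.

from scipy.stats import chisquare

def test_branch(counts, alpha=0.05):
    """counts: informative sites supporting each of the three possible
    resolutions of one internal branch. Null: equal support (hard polytomy)."""
    _, p = chisquare(counts)
    if p < alpha:
        best = counts.index(max(counts))
        return "reject the polytomy: resolution %d favoured (p = %.3f)" % (best, p)
    return "retain the hard polytomy (p = %.3f)" % p

print(test_branch([52, 48, 50]))  # balanced conflict: the polytomy stands
print(test_branch([90, 30, 30]))  # clear signal: the branch is resolved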

As a final point, Suh claims that: "Neoaves comprise, to my knowledge, the first empirical example for a hard polytomy in animals." This is incorrect. There is also a hard polytomy at the root of the Placental Mammals, as discussed in this blog post: Why are there conflicting placental roots?

References

Hoelzer G.A., Melnick D.J. (1994) Patterns of speciation and limits to phylogenetic resolution. Trends in Ecology & Evolution 9: 104-107.

Suh A. (2016) The phylogenomic forest of bird trees contains a hard polytomy at the root of Neoaves. Zoologica Scripta 45: 50-62.

Walsh H.E., Kidd M.G., Moum T., Friesen V.L. (1999) Polytomies and the power of phylogenetic inference. Evolution 53: 932-937.

Tuesday, January 10, 2017

Why do we need Bayesian phylogenetic information content?


There are many ways to construct a phylogenetic tree, and after we have done so we are usually expected to indicate something about "branch support", such as bootstrap values or Bayesian posterior probabilities. Rarely, however, do people indicate whether there is much tree-like phylogenetic information in their dataset in the first place — it is simply assumed that there must be (fingers crossed, touch wood).

Recently, this latter issue has been addressed for Bayesian analysis by:
Paul O. Lewis, Ming-Hui Chen, Lynn Kuo, Louise A. Lewis, Karolina Fučíková, Suman Neupane, Yu-Bo Wang, Daoyuan Shi (2016) Estimating Bayesian phylogenetic information content. Systematic Biology 65: 1009-1023.
They develop a methodology for "measuring information about tree topology using marginal posterior distributions of tree topologies", and apply it to two small empirical datasets. That is, we can now work out something about "[substitution] saturation and detecting conflict among data partitions that can negatively affect analyses of concatenated data."

However, we have long been able to do this with data-display phylogenetic networks. More to the point, we can do it in a second or two, without ever constructing a tree. Put simply, if the network construction produces a tree, then we know there is tree-like phylogenetic information in the dataset; if we get a network, then there is little such information. Equally importantly, the network might tell us something about the patterns of non-tree-likeness, which a single-number measurement cannot.
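
For those who would also like a number to go with the picture, here is a small sketch (my own illustration, not part of NeighborNet or of the Lewis et al. method) that computes uncorrected p-distances from an alignment and then the mean quartet "delta" score, which is 0 when the distances are perfectly tree-like and approaches 1 as the four-point condition is increasingly violated. The five-taxon alignment is invented.

from itertools import combinations

def p_distance(s1, s2):
    """Uncorrected distance: proportion of differing sites (gaps ignored)."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != '-' and b != '-']
    return sum(a != b for a, b in pairs) / len(pairs)

def mean_delta(seqs):
    """Mean quartet delta score of the p-distance matrix:
    0 = perfectly tree-like; values towards 1 = strongly conflicting signal."""
    taxa = list(seqs)
    D = {a: {b: p_distance(seqs[a], seqs[b]) for b in taxa} for a in taxa}
    deltas = []
    for i, j, k, l in combinations(taxa, 4):
        s = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]],
                   reverse=True)
        deltas.append(0.0 if s[0] == s[2] else (s[0] - s[1]) / (s[0] - s[2]))
    return sum(deltas) / len(deltas)

# Invented five-taxon toy alignment, just to show the mechanics
toy = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTC",
    "C": "ACGAACGTTC",
    "D": "TCGAACTTTC",
    "E": "TCGAACTTAC",
}
print(mean_delta(toy))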

Let's take the first empirical dataset, as described by the authors:
The five sequences of rps11 composing the data set BLOODROOT [three taxa from the angiosperm family Papaveraceae and two monocots] ... were chosen because they represent a case in which horizontal transfer of half of the gene results in different true tree topologies for the 5′ (219 nucleotide sites) and 3′ (237 nucleotide sites) subsets, which allows investigation of information content estimation in the presence of true conflicting phylogenetic signal. We analyzed each half of the data separately and measured phylogenetic dissonance, which is expected to be high in this case.
Here is the NeighborNet based on uncorrected distances. The idea that there is something non-tree-like about Sanguinaria seems hard to avoid. Indeed, the network pattern makes recombination an obvious first choice, with part of the sequence matching the Papaveraceae (on the left) and part matching the monocots (on the right). This recombination may be due to HGT.


Now for the second dataset:
The data set ALGAE comprises chloroplast psaB sequences from 33 taxa of green algae (phylum Chlorophyta, class Chlorophyceae, order Sphaeropleales) ... The alignments of just the psaB gene ... were chosen because of their deep divergence, which invites hasty judgements of saturation, especially of third codon position sites. We analyzed second and third codon position sites separately ... to assess which subset has more phylogenetic information.
Here are the two NeighborNets based on uncorrected distances. Once again, it is immediately obvious that the third-codon positions have almost no information at all, even for a network, let alone a tree — the terminal branches do not connect in any coherent way. The second-codon positions do have some information, but it is so contradictory that one could not construct a reliable tree. Saturation of nucleotide substitutions is a likely cause, and some correction for it would be needed even to construct a reasonable network from these data (a crude per-position distance check, sketched after the figures, makes the same point).

2nd positions:

3rd positions:
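
As a rough numerical counterpart to these two networks, one can simply compare mean uncorrected distances by codon position; values creeping towards 0.75, the random expectation for four equally frequent bases, are a warning sign of saturation. The sketch below is my own toy illustration with an invented alignment, not an analysis of the ALGAE data.

from itertools import combinations

def p_distance(s1, s2):
    """Uncorrected distance: proportion of differing sites (gaps ignored)."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != '-' and b != '-']
    return sum(a != b for a, b in pairs) / len(pairs)

def mean_distance_by_codon_position(seqs):
    """Mean pairwise p-distance at codon positions 1, 2 and 3."""
    result = {}
    for pos in range(3):
        sub = {name: seq[pos::3] for name, seq in seqs.items()}
        dists = [p_distance(sub[a], sub[b]) for a, b in combinations(sub, 2)]
        result[pos + 1] = sum(dists) / len(dists)
    return result

# Invented codon-aligned toy data (sequence length divisible by 3)
toy = {
    "t1": "ATGAAACCCGGGTTT",
    "t2": "ATGAAACCAGGCTTA",
    "t3": "ATGAAGCCTGGATTG",
}
print(mean_distance_by_codon_position(toy))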

Tuesday, January 3, 2017

Phylogenetics versus historical linguistics


Google Trends looks at recent trends in web searches, and it has been used to study patterns in web activity for many concepts. This is similar to The Ngram Viewer in Google Books (see the post Ngrams and phylogenetics). Google Trends aggregates the number of web searches that have been performed for any given search term (or terms), and it can display the results as a time graph, for any given geographical region. The Trends searches are somewhat restrictive, but they may show us something about the period 2004-2016 (inclusive).

So, I thought that it might be interesting to look at a few expressions of relevance to readers of this blog. The Trends graphs show changes in the relative proportion of searches for the given term (vertically) through time (horizontally). The vertical axis is scaled so that 100 marks the time at which the term reached its greatest popularity, measured as a proportion of all Google searches (i.e. the scale shows relative search interest, with the maximum always set to 100, no matter how many searches there actually were).
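
In other words, each value in the series is divided by the series maximum and multiplied by 100. The snippet below is my own toy rendering of that rescaling, with invented numbers; it is not Google's actual procedure.

def trends_scale(weekly_shares):
    """Rescale a series of (hypothetical) relative search frequencies so that
    the maximum becomes 100, mimicking the Google Trends vertical axis."""
    peak = max(weekly_shares)
    return [round(100 * x / peak) for x in weekly_shares]

print(trends_scale([0.012, 0.009, 0.018, 0.006]))  # -> [67, 50, 100, 33]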


As you can see, the term "phylogenetics" has maintained its popularity over "historical linguistics". However, it has decreased in popularity through time much more than has "historical linguistics". Nevertheless, both decreases are very small compared to that for the term "bioinformatics", as discussed in the blog post on Bioinformaticians look at bioinformatics.

It is not necessarily clear to me why many technical terms have decreased in Google searches through time, although there are several possibilities. First, it could be Google itself. The Trends numbers represent search volume for a keyword relative to the total search volume on Google. So, the actual number of searches for the technical terms could be increasing even while their share of the total is decreasing, simply because overall Google search volume keeps growing.

Alternatively, Business Insider has noted that "search is facing a huge challenge ... consumers are increasingly shifting [from desktop] to mobile. On mobile, consumers say they just don't search as much as they used to because they have apps that cater to their specific needs. They might still perform searches within those apps, but they're not doing as many searches on traditional search engines". Furthermore, "people are discovering content through social media. The top eight social networks drove more than 30% of traffic to sites in 2014".

The extra raggedness in search popularity in the first couple of years of the graph probably reflects inadequacies in the Google Trends dataset in the early years (as discussed by Wikipedia). The same is true for the next graph, as well.


The "phylogenetic tree" searches have been more popular than "evolutionary tree", just as was true for the Google Books usage discussed in the post Ngrams and phylogenetics. However, the "phylogenetic tree" searches show a distinctly bimodal pattern every year. This presumably reflects teaching semesters — few people search for technical terms out of term time!

Unfortunately, it is not possible to look at the term "phylogenetic network", because Google Trends tells me that there is "Not enough search volume to show results". How rude!