The Genealogical World of Phylogenetic Networks: April 2017

Tuesday, April 25, 2017

The siteswap annotation in juggling, and the power of annotation and modeling

I have been a juggler for more than 20 years now. It started when I was thirteen and primarily interested in doing magic tricks, but I quickly realized that there are more transparent ways of presenting one's manipulation skills. About 15 years ago, when I was starting my studies in Berlin, there was a booming juggling scene in that city, with many young people, including many geeks who studied mathematics, programming, or physics. I myself was studying Indo-European linguistics at the time, a field deprived of formalisms and formulas, devoted to the implicit as reflected in scientific prose that is not amenable to formalization, modeling, or transparent annotation.

It was at that time that some jugglers began to develop an annotation system for juggling patterns. The system was very simple, using numbers to denote the height and the direction of balls (or other objects) flying around from hand to hand. The 1 denoted the transfer of one ball from hand to hand without tossing it, the 2 denoted holding one ball in one hand, the 3 denoted throwing it from one hand to the other at the height required to juggle three balls, the 4 denoted throwing one ball up in the air so that one would catch it with the same hand, and the 5 denoted a crossing throw from one hand to the other, but this time slightly higher, as required when juggling five balls. Some of these numbers are indicated in these animated GIFs.

The people called this system siteswap, and they claimed that it was a good idea to formalize juggling to increase creativity, since one was not required to throw all of the balls with the same number, but one could combine them, following some basic mathematical ideas.
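The "basic mathematical ideas" can be sketched in a few lines of Python. The two rules used here are not spelled out in this post; they are the standard ones for simple (one hand throws at a time) siteswaps: a throw of value t made on beat i lands on beat i + t, so no two throws may land on the same beat, and the average of the throw values gives the number of balls. The function names are my own, and the parser assumes single-digit throws.

```python
def parse_siteswap(pattern: str) -> list[int]:
    """Turn a string such as '97531' into a list of throw values.

    Assumes every throw is a single digit (0-9).
    """
    return [int(ch) for ch in pattern]


def is_valid_siteswap(throws: list[int]) -> bool:
    """Check that no two throws land on the same beat.

    A throw of value t made on beat i lands on beat i + t. Because the
    pattern repeats, landing beats are compared modulo the pattern length.
    """
    n = len(throws)
    landings = {(beat + throw) % n for beat, throw in enumerate(throws)}
    return len(landings) == n


def ball_count(throws: list[int]) -> float:
    """For a valid pattern, the average throw value is the number of balls."""
    return sum(throws) / len(throws)


if __name__ == "__main__":
    for pattern in ["441", "531", "97531", "744", "432"]:
        throws = parse_siteswap(pattern)
        if is_valid_siteswap(throws):
            print(pattern, "is valid with", ball_count(throws), "balls")
        else:
            print(pattern, "is not jugglable: two throws land on the same beat")
```

Under these rules, 441 and 531 come out as valid three-ball patterns, 97531 and 744 as valid five-ball patterns, and 432 is rejected because its first two throws would land at the same time.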

When people told me about this, I was extremely skeptical, probably due to my classical education, which gave me the conviction that juggling is an art, and an art cannot be described in numbers. When people tried to teach me siteswaps, I ridiculed them, showing them some complicated patterns involving body movements (see the next GIFs), and told them they would never be able to describe all the creativity of all the jugglers in the world in numbers.

Only a couple of years later, I realized that the geeks had proven me wrong, when, after a longer break, I was again participating in one of the many juggling conventions that take place throughout the year, in different locations in Europe and the whole world. I saw people doing tricks with three balls that I had never thought of before, and I asked them what they were doing. They answered that these were siteswaps, and they were juggling patterns they called 441, or 531, respectively, as shown in these GIFs:

I gave in completely when I saw how they applied the same logic to routines with five and more balls, which they called 645, 97531, or 744, respectively. Especially the 97531 fascinated me. During this routine, all of the balls end up in one vertical line in the air, for just a moment, but long enough even for laymen to see the vertical line, which then immediately breaks down to a normal five-ball pattern, as shown here.

I realized how wrong it was to take the un-annotability of something for granted. But even more importantly, I also understood that models, as restrictive as they may seem to be at first sight, may open new pathways for creativity, showing us things we had been ignoring before.

Only recently, when I promised colleagues to juggle during a talk on linguistics, I detected the parallel with my own studies in historical linguistics. For a long time, the field has been held back by people claiming that things could not be handled formally, for various reasons.

But I am realizing more and more that this is not true. We just need to start with something, some kind of model, which may not be as ideal and as realistic as we might wish it to be, but that may eventually help us to detect things we did not see before. We just need to start doing it, walking in baby-steps, improving our models and our annotation, as well as improving our understanding of the limits and the chances of a given formalization.

Needless to say, the patterns that I deemed to be un-annotatable 10 years ago in juggling can now easily be handled by my colleagues. They did not stop with the normal number system, but kept (and keep) developing it, and they take a lot of inspiration from this.

Tuesday, April 18, 2017

Multimedia phylogeny?

Evolutionary concepts have often been transferred to other fields of study, or derived independently in them, especially in anthropology in the broadest sense, covering all cultural products of the human mind. This includes phylogenetic studies of languages, texts, tales, artifacts, and so on — you will find many examples of such studies in this blog. One of the more recent applications has been to what is sometimes called multimedia phylogeny — the research field that "studies the problem of discovering phylogenetic dependencies in digital media".

I have noted before that phylogenetics in the biological sense is an analogy when applied to other fields, because only in biology is genetic information physically transferred between generations — in the other fields, cultural information transfer is all in the minds of the people, not in their genes (see False analogies between anthropology and biology). This analogy often becomes problematic when applied to other fields, because the practical application of bioinformatics techniques separates the informatics from the bio, and the mathematical analyses focus on trying to implement the informatics without any biological justification.

A recent paper that discusses the application of bioinformatics to multimedia phylogeny exemplifies the potential problems:

Guilherme D Marmerola, Marina A Oikawa, Zanoni Dias, Siome Goldenstein, Anderson Rocha (2016) On the reconstruction of text phylogeny trees: evaluation and analysis of textual relationships. PLoS One 11(12): e0167822.

The authors described their background information thus:

Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications or literary works, are some examples in which textual content can be subject to changes in an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and the history of changes that created the whole set. Such functionality would have immediate applications on news tracking services, detection of plagiarism, textual criticism, and copyright enforcement, for instance.

However, this is not an easy task, as textual features pointing to the documents' evolutionary direction may not be evident and are often dataset dependent. Moreover, side information, such as time stamps, are neither always available nor reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees, and seamlessly exploring new approaches on a wide range of scenarios of text reusage. We employ and evaluate distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework.

So, their solution to the separation of bio from informatics is to try a range of techniques, none of which are based on any particular model of how phylogenetic changes might occur in text documents. All of these methods involve distance-based tree-building.

The essential problem, as I see it, is that without a model of change there is no reliable way to separate phylogenetic information from any other type of information. For example, similarity can arise from many sources, only some of which provide information about phylogenetic history — phylogenetic similarity is a form of "special similarity". In biology, other sources of similarity are usually lumped together as chance similarities, such as convergence, parallelism, etc. Without this basic separation of phylogenetic and chance similarity, it does not matter how many distance measures you use, or how many tree-building methods you employ — if you can't separate phylogeny from chance then you are wasting your time constructing a hypothetical evolutionary history.

The authors' only saving grace is their claim that: "In text phylogeny, unlike stemmatology [the analysis of hand-written rather than digital texts], the fundamental aim is to find the relationships among near-duplicate text documents through the analysis of their transformations over time." The expectation, then, is that the phylogenetic similarity of the texts will be high, which will thus reduce the possibility of chance similarities. Sadly, it will also reduce the probability that the similarities will contain any phylogenetic information at all — this is the classic short-branches-are-hard-to-reconstruct problem in phylogenetics.

For digital texts, the authors employ three distance measures: edit distance, normalized compression distance, and cosine similarity. None of these are model-based in any phylogenetic sense (although the first one is used in alignment programs such as Clustal) — I have discussed this in the post on Non-model distances in phylogenetics. Their tree-building methods include: parsimony, support vector machines (a machine-learning form of classification), and random forests (a decision-tree form of classification). Once again, none of these is model-based in terms of textual changes.
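To make the three measures concrete, here is a minimal Python sketch of each. It is not the authors' implementation: the choice of zlib as the compressor, the word-count vectors for the cosine measure, and all function names are my own assumptions. Note that each function compares two strings purely by surface form, with no notion of how a text might change over time, which is the point made above.

```python
import math
import zlib
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed text (compressor is an assumption)."""
    return len(zlib.compress(text.encode("utf-8")))


def normalized_compression_distance(a: str, b: str) -> float:
    """NCD = (C(ab) - min(C(a), C(b))) / max(C(a), C(b))."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def cosine_distance(a: str, b: str) -> float:
    """One minus the cosine similarity of whitespace-split word counts."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    original = "the cat sat on the mat"
    edited = "the cat sat on a mat"
    print(edit_distance(original, edited))
    print(normalized_compression_distance(original, edited))
    print(cosine_distance(original, edited))
```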

A final issue is the insistence on trees as the model of a phylogeny. In stemmatology, for example, a network is a more obvious phylogenetic model, because hand-written texts can be copied from multiple sources. Indeed, this distinction plays an important role in the first application of phylogenetics to stemmatology (see the post on An outline history of phylogenetic trees and networks). Perhaps this is not an issue for "near-duplicate text documents", but it does seem like an unnecessary restriction. Moreover, one of the empirical examples used in the paper actually has a network history, which therefore does not match the authors' reconstructed tree.

Tuesday, April 11, 2017

Morgan Colman and English royal genealogies

I noted in an earlier post (Drawing family trees as trees) that from 1576 CE Scipione Ammirato, an Italian writer and historian, set up a cottage industry producing family trees for the nobility. Over the years, he was not the only person to try to make money this way.

In the English-speaking world, one of these was Morgan Colman (or Coleman), who produced an impressively large genealogy of King James I and Queen Anne, in 1608. Nathaniel Taylor has commented: "Of all the congratulatory heraldic and genealogical stuff prepared early in James’s reign, this might be the most impressive piece of genealogical diagrammatic typography."

Unfortunately, we do not have a complete copy of this family tree. It was published as a set of quarto-sized bifolded sheets that needed to be joined together. Below is a small image of the copy in the British Library, which gives you an idea of the intended arrangement, and its incompleteness. Taylor has a larger PDF copy available here.

The WorldCat library catalog lists the work as "Most noble Henry ; heire (though not son)", which is the first line of the dedicatory verse at the top left. Elsewhere, I have seen it referred to as "The Genealogies of King James and Queen Anne his wife, from the Conquest".

It is usually described as "a genealogy of James I and Anne of Denmark in 10 folio sheets [sic], with their portraits in woodcut, accompanied by complimentary verses to Henry Prince of Wales, the Duke of York (Prince Charles) and Princess Elizabeth, and with the coats-of-arms of the nobles living in 1608 and of their wives."

A Christie's auction notes the sale of an illuminated manuscript of the "Genealogy of the Kings of England, from William the Conqueror to Elizabeth I", produced by Colman in 1592. The accompanying text reads (in part):

Colman, a scribe and heraldic painter, was steward and secretary to various eminent public figures, including successive Lord Keepers of the Great Seal, Sir John Puckering (1592-96) and Sir Thomas Egerton (1596-1603) who caused his election as MP for Newport, Cornwall in 1597. Heraldic and genealogical compositions were his speciality and in 1608 he had composed, and prepared for printing, genealogies of King James and his Queen published as ten large quarto sheets; in 1622 a payment records his work for James I in producing two large and beautiful tables for the King's lodgings in Whitehall and for making many of the genealogical tables for 'His Majesty's honour and service'. But these successes were a distant prospect in 1592 when he produced the present manuscript: in that year he petitioned for the post of York Herald and a second petition at about this date, possibly to Sir John Puckering, solicits the addressee's continued support for his advancement. This genealogy appears therefore to be part of a campaign to secure employment: the writer ends his summary of contents 'Wherein if the simplicity of well-meaning purpose, maie procure desired accept'on then rest persuaded that the industrious hand is fullie prepared spedelie to produce matter for more ample contentment.' The inclusion of Francis Bacon's arms at the end of his work shows that Colman had hopes of securing Bacon's patronage: by 1592 Bacon's political and legal career was well established, he was confidential adviser to the Earl of Essex, the Queen's favourite, and had hopes of high office. Colman, however, hedged his bets; another copy of this genealogy survives, though incomplete and lacking the arms of a recipient.

Colman apparently petitioned for the office of herald in the latter part of the reign of Queen Elizabeth I, but never obtained it.

Tuesday, April 4, 2017

Terry Gilliam's film career

Terence Vance Gilliam, the well-known film director, has been in the news recently, for trying yet again to film his movie The Man Who Killed Don Quixote. This movie started back in the early 1990s, and has now been up and down like a yo-yo for more than 25 years. Maybe he will complete it this time, which he didn't last year, or in 2010 or 2008 — and it is cinema legend what happened back in 2000 (as shown in the documentary Lost in La Mancha).

It has been said of Gilliam that "his directorial vision has secured his rightful place within the pantheon of substantive filmmakers as well as appreciative, if selective, audiences throughout his career." This means that his films often do well, but not all that well; he is more than an art-film maker, but not quite a mainstream director. You either love his movies or you don't — there is little or no middle ground.

Gilliam is probably best known for wanting to make what are called "independent" films but which require studio-scale funding, and then fighting with the studio executives over the finished product. He clearly wants to be an independent auteur but without the tight budget that normally goes with it. In other words, he makes his own bed and then has trouble lying in it.

Being a director of some renown, there are plenty of people who have been interested in providing retrospectives and commentaries on Gilliam's career. After all, that sort of thing seems to be the principal activity in the arts world — you are either a creator or a commentator, or sometimes both (such as film commentator turned film director Peter Bogdanovich).

So, it might be worthwhile to look at what some of these commentators have thought about Gilliam's career, as represented by his directorial repertoire of completed films. This ignores his involvement with television animations and various commercials.

To date, the Gilliam directorial oeuvre consists of 12 feature-length movies:

Monty Python and the Holy Grail (1975)
Jabberwocky (1977)
Time Bandits (1981)
Brazil (1985)
The Adventures of Baron Munchausen (1988)
The Fisher King (1991)
Twelve Monkeys (1995)
Fear And Loathing In Las Vegas (1998)
The Brothers Grimm (2005)
Tideland (2005)
The Imaginarium of Dr Parnassus (2009)
The Zero Theorem (2013)

and 5 short films:

Storytime (1968)
The Miracle of Flight (1974)
The Crimson Permanent Assurance (1983)
The Legend of Hallowdega (2010)
The Wholly Family (2011)

In the modern world, arts commentators tend to provide rankings of works of art, telling us which work is "best" and which "worst". If nothing else, this allows a mathematical analysis, although I am never quite sure how one goes about actually ranking works of art in some linear series.

The available commentaries that contain ranked lists of Gilliam's films include some personal choices:

some compilations of comments from members of the public:

Ranker
Internet Movie DataBase (IMDB)

and some compilations of opinions from professional critics:

There is also a list based on the adjusted US box office grosses (Box Office Mojo); there is a combined score from multiple sources (Ultimate Movie Rankings); plus the Top 10 Films site, which does not rank three of the films. I will ignore these latter three lists, since they are not directly comparable to the other lists.

Few commentators have included the short films in their discussion, and so I will start my analysis with the two sources that have done so. Here is a time-course graph of the 17 films as ranked independently by both IndieWire and IMDB.

Note that both lists agree that Gilliam was at his best (i.e. he produced the top third of his works) during the middle period of his career; and that he hasn't produced anything of note this century. This does not bode well for the future success of The Man Who Killed Don Quixote. [Note: The failure of this movie to be made is responsible for the large gap between films from 1998 to 2005.]

We could now use a phylogenetic network as an exploratory data analysis to display the consensus rankings of the feature films (only), from all of the commentators listed above. As usual, I first used the Manhattan distance to calculate the dissimilarity of the different films based on their rankings. This was followed by a neighbor-net analysis to display the between-film similarities as a network. Films that are closely connected in the network are similar to each other based on their critic rankings, and those that are further apart are progressively more different from each other.
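For readers unfamiliar with the distance step, here is a minimal Python sketch. The rankings are invented purely for illustration (they are not the commentators' actual rankings), the film labels are placeholders, and the function names are my own. The neighbor-net step is not shown; it would take a distance matrix like this one as its input.

```python
from itertools import combinations

# Hypothetical data, invented for illustration only: each placeholder film
# has one rank per commentator, in the same commentator order (1 = best).
rankings = {
    "film_a": [1, 2, 1, 1],
    "film_b": [2, 1, 3, 2],
    "film_c": [4, 3, 4, 4],
    "film_d": [3, 4, 2, 3],
}


def manhattan_distance(x: list[int], y: list[int]) -> int:
    """Sum of absolute rank differences across commentators."""
    return sum(abs(p - q) for p, q in zip(x, y))


def distance_matrix(ranks: dict[str, list[int]]) -> dict[tuple[str, str], int]:
    """Manhattan distance for every unordered pair of films."""
    return {
        (a, b): manhattan_distance(ranks[a], ranks[b])
        for a, b in combinations(sorted(ranks), 2)
    }


if __name__ == "__main__":
    for (a, b), distance in distance_matrix(rankings).items():
        print(a, b, distance)
```

Films that the commentators rank similarly get a small distance, and films on which they disagree, or which sit at opposite ends of the rankings, get a large one.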

The network shows a straightforward pattern from the highest ranked films at the top-right to the lowest at the bottom-left. In the graph, the films are numbered in the order of their production (not their ranking!). So, six of Gilliam's first seven films as director are the highest-ranked ones, by consensus, with Jabberwocky plus his final five films as the lowest-ranked.

Most of the commentators selected Brazil as their number one film, with occasional votes for Monty Python and the Holy Grail. More than half of the commentators selected The Brothers Grimm as the worst film, with Tideland running a strong second.

There is nothing unusual about any of this, of course. It is a truism of social history that most people, whether they are artists or scientists, do their most interesting and influential work during the earlier part of their career. From Isaac Newton to Albert Einstein, most scientists coast through their careers after age 35, although sometimes in their later years still collecting awards for the useful work they did 30 years before. Perhaps the best-known exception was Louis Pasteur, who made significantly different major contributions to chemistry and biology during his 20s, 30s and 40s.

Well, artists are no different. Very few of them become famous during their later life, but instead continue to be "interesting" without being either as original or influential as they were in their earlier career. They are often well known and well respected, although just as often they are completely forgotten, or even unknown to later generations. Gilliam, at least, has not suffered the latter fate.