Tuesday, March 21, 2017

Computer viruses and phylogenetic networks

I have written before about the Phylogenetics of computer viruses. This is an example of the use of phylogenetics as a metaphor for the history of non-biological objects. By analogy, computer viruses and other malware can be seen to be phylogenetically related, because new viruses are usually generated using existing malicious computer code — that is, one virus "begets" another virus due to changes in its intrinsic attributes. In this sense the analogy is helpful, although there is no actual copying of anything resembling a genome — this is phenotype evolution not genotype evolution.

Furthermore, the model of historical change in computer viruses is often the same as that for biological viruses — recombination rather than substitution. That is, like real viruses, new computer viruses are often created by recombining chunks of functional information from pre-existing viruses, rather than by an accumulation of small changes. Coherent subsets of the current computer code are combined to form the new programs.

From this perspective, it is unexpected that the principal phylogenetic model in the study of computer viruses has been a tree rather than a network — a recombinational history requires a network representation, not a tree, and thus malware evolution is not tree-like. As noted by Liu et al. (2016): "Although tree-based models are the mainstream direction, they are not suited to represent the reticulation events which have happened in malware generation."

In my previous (2014) post, I noted only two known papers that used a network rather than a tree to represent malware evolution:
  • Goldberg et al. (1996) analyzed their data using what they call a phyloDAG, which is a directed network that can have multiple roots (it appears to be a type of minimum-spanning network; described in more detail in Phylogenetics of computer viruses);
  • Khoo & Lió (2011) used splits graphs rather than unrooted trees to display their data, although they did not specify the algorithm for producing their networks.
Unfortunately, malware researchers have continued to pursue the idea that a phylogeny is simply a form of classification, and have therefore stuck to the idea of producing a tree-like phylogeny using some form of hierarchical agglomerative clustering algorithm (eg. Bernardi et al. 2016).

More positively, however, some papers have appeared that have instead pursued the idea of using a network model rather than a tree:
  • Liu et al. (2016) provided median-joining networks, which are unrooted splits graphs, to display relationships within each of three different virus groups;
  • Jang et al. (2013) infered a directed acyclic graph using a minimum spanning tree algorithm, with a post-processing step to allow nodes to have multiple parents;
  • Anderson et al. (2014) presented a novel algorithm based on a graphical lasso, which builds the phylogeny as an undirected graph, to which directionality is then added using a post-hoc heuristic;
  • Oyen et al. (2016) "present a novel Bayesian network discovery algorithm for learning a DAG [directed acyclic graph] via statistical inference of conditional dependencies from observed data with an informative prior on the partial ordering of variables. Our approach leverages the information on edge direction that a human can provide and the edge presence inference which data can provide."
It is important to note that only the works producing a directed graphs can represent a phylogeny — the other works produce unrooted graphs that may or may not reflect phylogenetic history. The bayesian work of Oyen et al. (2016) is particularly interesting:
Directionality is inferred by the learning process, but in many cases it is difficult to infer, therefore prior information is included about the edge directions, either from human experts or a simple heuristic. This paper introduces a novel approach to combining human knowledge about the ordering of variables into a statistical learning algorithm for Bayesian structure discovery. The learning algorithm with our prior combines the complementary benefits of using statistical data to infer dependencies while leveraging human knowledge about the direction of dependencies.


Anderson B, Lane T, Hash C (2014) Malware phylogenetics based on the multiview graphical lasso. Lecture Notes in Computer Science 8819: 1-12.

Bernardi ML, Cimitile M, Mercaldo F (2016) Process mining meets malware evolution : a study of the behavior of malicious code. Proceedings of the 2016 Fourth International Symposium on Computing and Networking, pp 616-622. IEEE Computer Society Washington, DC.

Goldberg LA, Goldberg PW, Phillips CA, Sorkin GB (1996) Constructing computer virus phylogenies. Lecture Notes in Computer Science 1075: 253-270. [also Journal of Algorithms (1998) 26: 188-208]

Jang J, Woo M, Brumley D (2013) Towards automatic software lineage inference. Proceedings of the Twenty-Second USENIX Conference on Security, pp 81-96. USENIX Association, Berkeley, CA.

Khoo WM, Lió P (2011) Unity in diversity: phylogenetic-inspired techniques for reverse engineering and detection of malware families. Proceedings of the 2011 First Systems Security Workshop (SysSec'11), pp 3-10. IEEE Computer Society Washington, DC.

Liu J, Wang Y, Wang Y (2016) Inferring phylogenetic networks of malware families from API sequences. Proceedings of the 2016 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, pp 14-17. IEEE Computer Society Washington, DC.

Oyen D, Anderson B, Anderson-Cook C (2016) Bayesian networks with prior knowledge for malware phylogenetics. The Workshops of the Thirtieth AAAI Conference on Artificial Intelligence Artificial Intelligence for Cyber Security: Technical Report WS-16-03, pp 185-192. Association for the Advancement of Artificial Intelligence, Palo Alto, CA.

Tuesday, March 14, 2017

Detecting introgression versus hybridization

There has been considerable interest in recent years in developing methods that will detect hybridization in the presence of incomplete lineage sorting (ILS), which will allow the construction of a realistic hybridization network. Clearly, both ILS and hybridization create conflicting gene trees, which will lead to a very complex data-display network. However, if the ILS signals in the data can be used to construct a small collection of gene-tree groups, in which the gene trees within each group are congruent with a single species tree (under the ILS model), then the incongruence between groups can be used to construct a hybridization network. This network will then be an hypothesis for a realistic evolutionary network.

Recently, a paper has appeared that uses simulations to evaluate several of these methods:
Olga K. Kamneva and Noah A. Rosenberg (2017) Simulation-based evaluation of hybridization network reconstruction methods in the presence of incomplete lineage sorting. Evolutionary Bioinformatics 2017:13.
I am not a great fan of simulations, because they exist under very restricted and usually unrealistic mathematical conditions. They are, however, useful for exploring the mathematical properties of various methods, even if they are hard to connect to the biological properties.

My interpretations of the results from the particular scenarios explored by Kamneva and Rosenberg are:
  1. Most of the methods improve as the internal network edges increase in length.
  2. Most of the methods improve as the number of gene trees increases.
  3. Under good conditions the maximum-likelihood methods do better than the parsimony and consensus methods.
  4. The maximum-likelihood methods are more affected by gene-tree error than are the other methods.
  5. There are conditions under which none of the methods work well.
I doubt that any of this is controversial, in the sense that model-based methods usually work well when their models apply, but not necessarily otherwise. Reality is more complex than the models, and so the methods are likely to fail for real data.

For me, the most interesting part of the paper is the examination of balanced versus skewed parental contributions to the hybrid taxon. A balanced genetic contribution in the simulations is analogous to homoploid or polyploid hybridization, whereas a skewed contribution is analogous to introgression or horizontal gene transfer (HGT). The simulations seem to show that the methods examined do not deal very well with skewed contributions.

So, these methods may literally be hybridization-network methods only, with separate network methods needed for detecting introgression or HGT — for example, the admixture methods used for genomes (see the recent post on Producing admixture graphs).

This would mean that we cannot first produce networks with reticulations, and then afterwards explore what is causing the reticulations. Instead, we will need to decide on the possible biological mechanisms of reticulation before the analysis, and then mathematically explore possible networks that reflect those mechanisms.

This is not an issue for constructing trees, of course, since the only recognized mechanisms are speciation and extinction, both of which are explored post hoc rather than a priori. This is an important difference of networks versus trees.

Tuesday, March 7, 2017

Roundels and family trees

I have written before about the slow development of what has come to be known as the "family tree", including reducing human network relationships to a tree-like form (Reducing networks to trees), and presenting it as an actual tree (Drawing family trees as trees), rooted at the base (Does it matter which way up a tree is drawn?).

Most of the early representations of pedigrees had the people's names enclosed in a circle, called a "roundel", and it was these roundels that were connected to show the family relationships. One of the steps on the way to a tree was thus dropping this idea, so that the names could be connected directly.

Some of the diagrams with roundels that I have covered include:

c. 400 CE — The genealogy of Jesus Christ, Part I, Part II, Part III
c. 1000 CE — Genealogy of Cunigunde of Luxembourg
c. 1140 CE — Genealogy of the Carolingians
c. 1185 CE — Genealogy of the Welf dynasty
c. 1237 CE — Genealogy of the Ottonian dynasty

Interestingly, the earliest pedigrees that do not have roundels also date from this early period. As noted by Nathaniel Lane Taylor, the importance of this development is that: "the scribe relies on the power of the names themselves to anchor a diagram on the page, with lines simply taking the place of any syntax needed to describe the filiation." That is, no abstract iconography is needed.

I have already illustrated the earliest known example:
c. 1121 CE — The genealogy of Lambert of Saint-Omer

Taylor provides links to illustrations of the next known example:
   c. 1128, John of Worcester, Chronicle of World and English History (Corpus Christi College MS 157).
This book contains eight genealogies of Anglo-Saxon and Norman kings (pp. 47-54), one of which is shown above.

Taylor also refers to "one of the Arabic stemmata" illustrated in:
   Arthur Watson (1934) The Early Iconography of the Tree of Jesse. Oxford University Press,
I have not seen this book, but the illustrations are apparently confined to those from the 12th century, making the diagram contemporaneous with the two listed above. The Tree of Jesse normally appears in Medieval Christian art as a richly illustrated genealogy of Jesus in illuminated manuscripts, but apparently this one was an exception.