Random binary tree

In computer science and probability theory, a random binary tree is a binary tree selected at random from some probability distribution on binary trees. Different distributions have been used, leading to different properties for these trees.

Random binary trees have been used for analyzing the average-case complexity of data structures based on binary search trees. For this application it is common to use random trees formed by inserting nodes one at a time according to a random permutation. Adding and removing nodes directly in a random binary tree will in general disrupt its random structure, but the treap and related randomized binary search tree data structures use the principle of binary trees formed from a random permutation in order to maintain a balanced binary search tree dynamically as nodes are inserted and deleted.

Other distributions on random binary trees include the uniform discrete distribution in which all distinct trees are equally likely, distributions on a given number of nodes obtained by repeated splitting, and trees generated by Galton–Watson processes, for which (unlike the other models) the number of nodes in the tree is not fixed.

For random trees that are not necessarily binary, see random tree.

Background

A binary tree is a rooted tree in which each node may have up to two children (the nodes directly below it in the tree), and those children are designated as being either left or right. It is sometimes convenient instead to consider extended binary trees in which each node is either an external node with zero children, or an internal node with exactly two children. A binary tree that is not in extended form may be converted into an extended binary tree, with all of the original nodes converted into the internal nodes of the extended binary tree, by adding additional external nodes as children of the nodes of the given tree, so that after this addition the internal nodes all have exactly two children. In the other direction, an extended binary tree with at least one internal node may be converted back into a non-extended binary tree by removing all its external nodes. In this way, these two forms are almost entirely equivalent for the purposes of mathematical analysis, except that the extended form allows a tree consisting of a single external node, which does not correspond to anything in the non-extended form. For the purposes of computer data structures, the two forms differ, as the external nodes of the first form may be represented explicitly as objects in a data structure.[1]
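
As a small illustration of this correspondence, the sketch below (in Python, using a tuple representation (key, left, right) of my own choosing, not from the article) converts a binary tree to its extended form by adding explicit external nodes, and back again:

```python
EXTERNAL = "external"


def extend(node):
    """Convert a binary tree (nested tuples, None for a missing child) into extended form."""
    if node is None:
        return EXTERNAL
    key, left, right = node
    return (key, extend(left), extend(right))


def contract(node):
    """Remove the external nodes again, recovering the non-extended tree."""
    if node == EXTERNAL:
        return None
    key, left, right = node
    return (key, contract(left), contract(right))


tree = (2, (1, None, None), None)          # node 2 with a single (left) child
assert contract(extend(tree)) == tree
```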

When the internal nodes of a binary tree are labeled by ordered keys of some type (such as distinct numbers), and the inorder traversal of the tree produces the sorted sequence of keys, the result is a binary search tree. Generally, the external nodes of such a tree remain unlabeled.[2] Binary trees may also be studied with all nodes unlabeled, or with labels that are not given in sorted order. For instance, the Cartesian tree data structure uses labeled binary trees that are not necessarily binary search trees.[3]

A random binary tree is a random tree drawn from a certain probability distribution on binary trees. In many cases, these probability distributions are defined using a given set of keys, and describe the probabilities of binary search trees having those keys. However, other distributions are possible, not necessarily generating binary search trees, and not necessarily giving a fixed number of nodes.[4]

From random permutations

For any set of $n$ numbers (or, more generally, values from some total order), one may form a binary search tree in which each number is inserted in sequence as a leaf of the tree, without changing the structure of the previously inserted numbers. The position into which each number should be inserted is uniquely determined by a binary search in the tree formed by the previous numbers. In the random permutation model of random binary trees, the insertion order is chosen uniformly at random, so that each of the $n!$ permutations of the given numbers is equally likely.

For instance, if the three numbers (1,3,2) are inserted into a tree in that sequence, the number 1 will sit at the root of the tree, the number 3 will be placed as its right child, and the number 2 as the left child of the number 3. There are six different permutations of the numbers (1,2,3), but only five trees may be constructed from them. That is because the permutations (2,1,3) and (2,3,1) form the same tree. Thus, this tree has probability $2/6 = 1/3$ of being generated, whereas the other four trees each have probability $1/6$.
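
The following short Python sketch (not part of the original article; the tuple representation is an assumption) makes this example concrete by building the binary search tree for every permutation of (1, 2, 3) and counting how often each distinct shape appears:

```python
from itertools import permutations


def insert(node, key):
    """Insert key into a BST represented as nested tuples (key, left, right)."""
    if node is None:
        return (key, None, None)
    k, left, right = node
    if key < k:
        return (k, insert(left, key), right)
    return (k, left, insert(right, key))


def build_bst(keys):
    node = None
    for key in keys:
        node = insert(node, key)
    return node


counts = {}
for perm in permutations((1, 2, 3)):
    tree = build_bst(perm)
    counts[tree] = counts.get(tree, 0) + 1

# Five distinct shapes appear; the tree rooted at 2 arises from two permutations
# (probability 2/6 = 1/3), each of the other four from one permutation (probability 1/6).
for tree, count in counts.items():
    print(tree, count / 6)
```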

Expected depth of a node

For any fixed choice of a value $x$ in a given set of $n$ numbers, if one randomly permutes the numbers and forms a binary tree from them as described above, the expected value of the length of the path from the root of the tree to $x$ is at most $2\ln n + O(1)$, where "$\ln$" denotes the natural logarithm function and the $O$ introduces big O notation. This follows because the expected number of ancestors of $x$ is by linearity of expectation equal to the sum, over all other values $y$ in the set, of the probability that $y$ is an ancestor of $x$. And a value $y$ is an ancestor of $x$ exactly when $y$ is the first element to be inserted from the elements in the interval $[x,y]$. Thus, the values that are adjacent to $x$ in the sorted sequence of values have probability $1/2$ of being an ancestor of $x$, the values one step away have probability $1/3$, etc. Adding these probabilities for all positions in the sorted sequence gives twice a harmonic number, leading to the bound above. A bound of this form holds also for the expected search length of a path to a fixed value $x$ that is not part of the given set.[5]
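
A rough simulation of this argument can be written as follows (illustrative only; the size $n = 1000$ and the chosen target key are arbitrary). It counts ancestors of a fixed key using the characterization above, namely that a previously inserted key becomes an ancestor exactly when it is the first key inserted from the interval between itself and the target:

```python
import math
import random


def depth_of(keys, target):
    """Number of ancestors of `target` in the BST built by inserting `keys` in order."""
    depth = 0
    low, high = -math.inf, math.inf    # open interval of keys that could still become ancestors
    for key in keys:
        if key == target:
            return depth
        if low < key < high:           # key is the first one inserted between itself and target
            depth += 1
            if key < target:
                low = key
            else:
                high = key
    return depth


n, trials = 1000, 2000
target = n // 2
total = 0
for _ in range(trials):
    keys = list(range(1, n + 1))
    random.shuffle(keys)
    total += depth_of(keys, target)

harmonic = sum(1 / k for k in range(1, n + 1))
print("average depth of target:", total / trials)
print("2 * H_n upper bound    :", 2 * harmonic)
```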

The longest path

Although not as easy to analyze as the average path length, there has also been much research on determining the expectation (or high probability bounds) of the length of the longest path in a binary search tree generated from a random insertion order. This length, for a tree with $n$ nodes, is almost surely

$$\frac{1}{\beta}\ln n \approx 4.311\ln n,$$

where $\beta$ is the unique number in the range $0<\beta<1$ satisfying the equation

$$2\beta e^{1-\beta} = 1.$$[6]

Expected number of leaves

In the random permutation model, each of the numbers from the set of numbers used to form the tree, except for the smallest and largest of the numbers, has probability $1/3$ of being a leaf in the tree, because it is a leaf exactly when it is inserted after its two neighbors, and any of the six orderings of it and these two neighbors are equally likely. By similar reasoning, the smallest and largest of the numbers have probability $1/2$ of being a leaf. Therefore, the expected number of leaves is the sum of these probabilities, which for $n \ge 2$ is exactly $(n+1)/3$.[7]
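
This claim can be checked numerically with a small, illustrative simulation (the parameters and the tuple representation are my own choices):

```python
import random


def insert(node, key):
    if node is None:
        return (key, None, None)
    k, left, right = node
    if key < k:
        return (k, insert(left, key), right)
    return (k, left, insert(right, key))


def count_leaves(node):
    if node is None:
        return 0
    _, left, right = node
    if left is None and right is None:
        return 1
    return count_leaves(left) + count_leaves(right)


n, trials = 200, 500
total = 0
for _ in range(trials):
    keys = list(range(n))
    random.shuffle(keys)
    tree = None
    for key in keys:
        tree = insert(tree, key)
    total += count_leaves(tree)

print("average number of leaves:", total / trials)
print("(n + 1) / 3             :", (n + 1) / 3)
```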

Strahler number

The Strahler number of a tree is a more sensitive measure of the distance from a leaf, in which a node has Strahler number $i$ whenever it has either a child with that number or two children with number $i-1$ (a leaf has Strahler number one). For $n$-node random binary search trees, simulations suggest that the expected Strahler number is $\log_2 n + O(1)$. However, only the upper bound has actually been proven.[8]
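
A direct way to compute this quantity, assuming the same tuple representation of binary trees as in the earlier sketches, is:

```python
def strahler(node):
    """Strahler number of a binary tree given as nested tuples (key, left, right)."""
    if node is None:                 # a missing child contributes nothing
        return 0
    _, left, right = node
    a, b = strahler(left), strahler(right)
    return a + 1 if a == b else max(a, b)


# A three-node path has Strahler number 1; a balanced three-node tree has 2.
print(strahler((1, None, (2, None, (3, None, None)))))   # 1
print(strahler((2, (1, None, None), (3, None, None))))   # 2
```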

Treaps and randomized binary search trees

In applications of binary search tree data structures, it is rare for the values in the tree to be inserted without deletion in a random order, limiting the direct applications of random binary trees. However, algorithm designers have devised data structures that allow insertions and deletions to be performed in a binary search tree, at each step maintaining as an invariant the property that the shape of the tree is a random variable with the same distribution as a random binary search tree.

If a given set of ordered numbers is assigned numeric priorities (distinct numbers unrelated to their values), these priorities may be used to construct a Cartesian tree for the numbers, a binary tree that has as its inorder traversal sequence the sorted sequence of the numbers and that is heap-ordered by priorities. Although more efficient construction algorithms are known, it is helpful to think of a Cartesian tree as being constructed by inserting the given numbers into a binary search tree in priority order. Thus, by choosing the priorities either to be a set of independent random real numbers in the unit interval, or by choosing them to be a random permutation of the numbers from $1$ to $n$ (where $n$ is the number of nodes in the tree), and by maintaining the heap ordering property using tree rotations after any insertion or deletion of a node, it is possible to maintain a data structure that behaves like a random binary search tree. Such a data structure is known as a treap or a randomized binary search tree.[9]
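
The following condensed treap sketch (names and structure are my own, not taken from the cited papers) shows the idea: each node stores a random priority, keys keep binary search tree order, and rotations restore the heap order on priorities after an insertion:

```python
import random


class Node:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()   # random priority from the unit interval
        self.left = None
        self.right = None


def rotate_right(node):
    left = node.left
    node.left, left.right = left.right, node
    return left


def rotate_left(node):
    right = node.right
    node.right, right.left = right.left, node
    return right


def insert(node, key):
    """BST insertion by key, then rotations to restore the (max-)heap order on priorities."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
        if node.left.priority > node.priority:
            node = rotate_right(node)
    else:
        node.right = insert(node.right, key)
        if node.right.priority > node.priority:
            node = rotate_left(node)
    return node


# Whatever order the keys arrive in, the shape is determined by the random priorities,
# so it is distributed like a random binary search tree.
root = None
for key in [5, 2, 8, 1, 9, 3]:
    root = insert(root, key)
```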

Uniformly random binary trees

The number of binary trees with $n$ nodes is a Catalan number: for $n = 1, 2, 3, \dots$ these numbers of trees are

1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, ... (sequence A000108 in the OEIS).

Thus, if one of these trees is selected uniformly at random, its probability is the reciprocal of a Catalan number. Trees generated from this distribution are sometimes called random binary Catalan trees.[4] They have expected depth proportional to the square root of $n$, rather than to the logarithm.[10] The expected Strahler number of a uniformly random $n$-node binary tree is $\log_4 n + O(1)$, lower than the expected Strahler number of random binary search trees.[11] In some cases the analysis of random binary trees under the random permutation model can be automatically transferred to the uniform model.[12]
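
A short check of these counts, using Python's math.comb (the helper name is my own), computes the Catalan numbers and the uniform probability of a single tree shape:

```python
from math import comb


def catalan(n):
    """C_n = binom(2n, n) / (n + 1), the number of binary trees with n nodes."""
    return comb(2 * n, n) // (n + 1)


print([catalan(n) for n in range(1, 11)])   # 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796
print(1 / catalan(3))                       # each of the five 3-node trees has probability 0.2
```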

Due to their large heights, this model of equiprobable random trees is not generally used for binary search trees. However, it has other applications, including the analysis of parse trees of algebraic expressions in compiler design[13] and of the number of registers needed to evaluate such expressions,[14] the modeling of river networks,[15] and the study of random cladograms in phylogenetics.[16]

Galton–Watson process

The Galton–Watson process describes a family of distributions on trees in which the number of children at each node is chosen randomly, independently of other nodes. For binary trees, two versions of the Galton–Watson process are in use, differing only in whether a binary tree with an external root node is allowed:

  • In the version where the root node may be external, it is chosen to be internal with some specified probability $p$ or external with probability $1-p$. If it is internal, its two children are trees generated recursively by the same process.
  • In the version where the root node must be internal, its left and right children are determined to be internal with probability $p$ or external with probability $1-p$, independently of each other. In the case where they are internal, they are the roots of trees that are generated recursively by the same process.

Trees generated in this way have been called binary Galton–Watson trees. In the special case where $p=\tfrac12$ they are called critical binary Galton–Watson trees.[17] This probability marks a phase transition for the binary Galton–Watson process: for $p\le\tfrac12$ the resulting tree is almost certainly finite, whereas for $p>\tfrac12$ it is infinite with positive probability.
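
A recursive sketch of the generation process (my own illustration, for the version in which the root must be internal; None marks an external node) looks like this:

```python
import random


def galton_watson(p):
    """Generate a tree whose root is internal; each child is internal with probability p."""
    left = galton_watson(p) if random.random() < p else None    # None marks an external node
    right = galton_watson(p) if random.random() < p else None
    return (left, right)


def size(node):
    """Number of internal nodes."""
    if node is None:
        return 0
    return 1 + size(node[0]) + size(node[1])


random.seed(0)
trees = [galton_watson(0.4) for _ in range(1000)]    # subcritical, so the trees stay small
print(sum(size(t) for t in trees) / len(trees))      # close to 1 / (1 - 2 * 0.4) = 5
```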

Another way to generate the same trees is to make a sequence of coin flips, with probability $p$ of heads and probability $1-p$ of tails, until the first flip at which the number of tails exceeds the number of heads (for the model in which an external root is allowed) or exceeds one plus the number of heads (when the root must be internal), and then use this sequence of coin flips to determine the choices made by the recursive generation process, in depth-first order.[18] Because the number of internal nodes equals the number of heads in this coin flip sequence, all trees with a given number of nodes are generated from (unique) coin flip sequences of the same length, and are equally likely, regardless of $p$. That is, the choice of $p$ affects the variation in the size of trees generated by this process, but for a given size the trees are generated uniformly at random.[19] Smaller values of $p$ below the critical probability will produce trees with a smaller expected size, while larger values of $p$ will produce trees with a larger expected size. For the critical probability $p=\tfrac12$ there is no finite bound on the expected size of trees generated by this process.
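
The coin-flip construction can also be sketched directly (again illustrative; the stopping rule shown is the one for the model in which an external root is allowed):

```python
import random


def flip_sequence(p):
    """Flip coins (True = heads) until the number of tails first exceeds the number of heads."""
    flips, heads, tails = [], 0, 0
    while tails <= heads:
        is_heads = random.random() < p
        flips.append(is_heads)
        heads += is_heads
        tails += not is_heads
    return flips


def decode(flips):
    """Read the flips in depth-first order: heads opens an internal node, tails is external."""
    if not flips.pop(0):
        return None
    left = decode(flips)
    right = decode(flips)
    return (left, right)


random.seed(1)
flips = flip_sequence(0.4)
tree = decode(list(flips))
print(len(flips), sum(flips))   # sequence length, and number of heads = number of internal nodes
```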

Galton–Watson processes were originally developed to study the spread and extinction of human surnames, and have been widely applied more generally to the dynamics of human or animal populations. They have been generalized to models where the probability of a node being internal or external at a given level of the tree (a generation, in the population dynamics application) is not fixed, but depends on the number of nodes at the previous level.[20]

Binary tries

Another form of binary tree, the binary trie or digital search tree, has a collection of binary numbers labeling some of its external nodes. The internal nodes of the tree represent prefixes of their binary representations that are shared by two or more of the numbers. The left and right children of an internal node are obtained by extending the corresponding prefix by one more bit, a zero or a one bit respectively. If this extension does not match any of the given numbers, or it matches only one of them, the result is an external node; otherwise it is another internal node. Random binary tries have been studied, for instance for sets of $n$ random real numbers generated independently in the unit interval. Despite the fact that these trees may have some empty external nodes, they tend to be more balanced than random binary search trees. More precisely, for uniformly random real numbers in the unit interval, or more generally for any square-integrable probability distribution on the unit interval, the average depth of a node is asymptotically $\log_2 n$, and the average height of the whole tree is asymptotically $2\log_2 n$. The analysis of these trees can be applied to the computational complexity of trie-based sorting algorithms.[21]
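
A compact sketch of this construction (the representation and helper names are my own) splits a set of random numbers on successive binary digits:

```python
import random


def build_trie(values, depth=0):
    """External node: None (empty) or a single number; internal node: a pair (left, right)."""
    if len(values) <= 1:
        return values[0] if values else None
    # Split on the next binary digit of each number (digit depth + 1 after the binary point).
    zeros = [v for v in values if int(v * 2 ** (depth + 1)) % 2 == 0]
    ones = [v for v in values if int(v * 2 ** (depth + 1)) % 2 == 1]
    return (build_trie(zeros, depth + 1), build_trie(ones, depth + 1))


def height(node):
    if not isinstance(node, tuple):
        return 0
    return 1 + max(height(node[0]), height(node[1]))


random.seed(2)
values = [random.random() for _ in range(1024)]
print(height(build_trie(values)))   # typically near 2 * log2(1024) = 20
```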

Random split trees

Luc Devroye and Paul Kruszewski describe a recursive process for constructing random binary trees with $n$ nodes. It generates a real-valued random variable $x$ in the unit interval $(0,1)$, assigns the first $xn$ nodes (rounded down to an integer number of nodes) to the left subtree, the next node to the root, and the remaining nodes to the right subtree. Then, it continues recursively using the same process in the left and right subtrees. If $x$ is chosen uniformly at random in the interval, the result is the same as the random binary search tree generated by a random permutation of the nodes, as any node is equally likely to be chosen as root. However, this formulation allows other distributions to be used instead. For instance, in the uniformly random binary tree model, once a root is fixed each of its two subtrees must also be uniformly random, so the uniformly random model may also be generated by a different choice of distribution (depending on $n$) for $x$. As they show, by choosing a beta distribution on $x$ and by using an appropriate choice of shape to draw each of the branches, the mathematical trees generated by this process can be used to create realistic-looking botanical trees.[22]
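
The splitting process can be sketched as follows (a paraphrase of the construction described above, not the authors' code; with a uniform split variable it reproduces the random binary search tree model):

```python
import random


def split_tree(n):
    """Return a tree shape as nested (left, right) pairs; None stands for an empty subtree."""
    if n == 0:
        return None
    x = random.random()          # a uniform split variable reproduces the random BST model
    left_size = int(x * n)       # floor(x * n) nodes go left; since x < 1, this is at most n - 1
    return (split_tree(left_size), split_tree(n - 1 - left_size))


def height(node):
    if node is None:
        return 0
    return 1 + max(height(node[0]), height(node[1]))


random.seed(3)
samples = [height(split_tree(1000)) for _ in range(20)]
print(sum(samples) / len(samples))   # compare with the leading term 4.311 * ln(1000) ≈ 30 above
```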

Notes

  1. Knuth (1997).
  2. Knuth (1973).
  3. Vuillemin (1980).
  4. Sedgewick & Flajolet (2013).
  5. Hibbard (1962); Knuth (1973); Mahmoud (1992), p. 75.
  6. Robson (1979); Pittel (1985); Devroye (1986); Mahmoud (1992), pp. 91–99; Reed (2003).
  7. Brown & Shubert (1984).
  8. Kruszewski (1999).
  9. Martínez & Roura (1998); Seidel & Aragon (1996).
  10. Knuth (2005), p. 15.
  11. Devroye & Kruszewski (1995).
  12. Mahmoud (1992), p. 70.
  13. Mahmoud (1992), p. 63.
  14. Flajolet, Raoult & Vuillemin (1979).
  15. Shreve (1966).
  16. Aldous (1996).
  17. Burd, Waymire & Winn (2000).
  18. For the connection between trees and random walks (as generated by random coin flips) see e.g. Section 6, "Walks and trees", pp. 483–486, of Harris (1952).
  19. Broutin, Devroye & Fraiman (2020). More generally, every Galton–Watson process, conditioned on producing trees of a certain size, produces the same probability distribution as a critical Galton–Watson process: see section 2 of Kennedy (1975).
  20. Jagers (2011).
  21. Devroye (1984).
  22. Devroye & Kruszewski (1996).

