Lesson 5: Hierarchical Data and Tree Shapes

Goal

After this lesson you should be able to describe a tree using the words node, root, branch, leaf, child, and subtree; explain why there is no single official Scheme tree representation; define a small tree abstract data type; and map a procedure over two different tree representations.

This lesson is about tree shape and representation. The next lesson will focus on traversal order, search, and expression trees. Keeping those apart makes the first tree programs easier to read.

A Tree Is A Hierarchy

A tree represents information arranged in levels. Each item can have smaller items below it. The top item is the root. The items below a node are its children. A node with no children is a leaf.

A node is a point in the tree, but it is also a tree of its own. If you focus on the node labelled Europe below, Europe is a child of World, but Europe is also the root of a smaller tree containing countries and cities.

Trees are useful whenever the information itself is hierarchical: directories contain files and subdirectories, a document contains sections and paragraphs, a country contains regions and cities, and a programming language expression contains subexpressions.

Tree Vocabulary

The vocabulary is small, but worth being precise about.

A node is one position in a tree.
The root is the top node of the tree currently being discussed.
A child is a node immediately below another node.
A branch is a connection from a node to a child, or informally a non-leaf part of a tree.
A leaf is a node with no children.
A subtree is any node considered together with all of the nodes below it.

The same object can be described in more than one way. Lisbon is a leaf in the geography tree. Portugal is a child of Europe, a parent of Lisbon, and also the root of the subtree containing Portugal and Lisbon.

No Single Official Tree Representation

Scheme has a standard notation and representation for proper lists. A proper list is a chain of pairs ending in the empty list.

Trees are different. There is no single tree shape that fits every problem.

Questions you have to answer include:

Does every node have data, or only the leaves?
Can a node have any number of children, or exactly two child positions?
Does the order of children matter?
Is an empty tree allowed?
Should a branch node store a value of its own?

Because the answers vary, Scheme gives you pairs and lists, and you build the tree representation that matches the problem.

N-Way And Binary Trees

An n-way tree allows a node to have any number of children. The geography tree above is n-way: World has three children in the diagram, and it could have more.

A binary tree has at most two child positions, conventionally called left and right. Binary trees are common when the representation itself has a left/right meaning, as in binary search trees or arithmetic expression trees.

This lesson will mostly use n-way trees. The next lesson will use binary trees when discussing inorder traversal and search trees.

A Tree ADT With Data At Every Node

One clean representation is a tree node with a datum and a list of children. Each child is itself a tree.

(define (make-tree datum children)
  (cons datum children))

(define (datum node)
  (car node))

(define (children node)
  (cdr node))

(define (leaf? node)
  (null? (children node)))

In this representation, a leaf is still a tree node. It simply has no children.

Try this expression

The expression above is intentionally awkward. It shows why selectors are necessary, but it also shows that deeply nested data often needs better helper procedures before programs become pleasant to read.

A Node Is Also A Subtree

With the ADT above, the Europe node from the geography tree can be selected and then treated as a complete tree.

(define europe-subtree
  (car (children geography-tree)))

The value named europe-subtree is a child of the world tree. It is also a tree whose root datum is europe.

Try this expression

This is the same idea as with lists: the rest of a nonempty proper list is itself a proper list. In a tree, each child is itself a tree.

Mapping Over A Tree ADT

In Lesson 2, map over a list applied a procedure to each element of a sequence. With a tree ADT, there is one datum at the current node and then a list of child trees.

That leads to this shape:

(define (tree-map proc tree)
  (make-tree (proc (datum tree))
             (map (lambda (child)
                    (tree-map proc child))
                  (children tree))))

The recursive call is not just on one rest list. It is made for each child. The ordinary list procedure map handles the sequence of children; tree-map handles each child as a tree.

Try this expression

The result uses the same tree shape, with each datum transformed.

Leaves-Only Trees

Another common representation treats ordinary nested lists as trees. In this view, branch nodes do not have data of their own. The leaves are the non-list values.

For example:

'(a (b c d) (e (f g) h))

This can be read as a tree with three top-level children: the leaf a, a branch containing b, c, and d, and a branch containing e, another branch, and h.

This is not the same ADT as the previous section. There is no separate constructor saying where the datum is and where the children are. The list structure itself is being interpreted as the tree.

Mapping Over Nested-List Trees

For nested-list trees, mapping over leaves means making a choice at each step:

If the tree is empty, return the empty list.
If the tree is a pair, recursively process both the car and the cdr.
Otherwise, the tree is a leaf, so apply the procedure to that leaf.

R7RS does not provide atom?, so define the idea directly or use the pair? test as below.

(define (nested-tree-map proc tree)
  (cond
    ((null? tree) '())
    ((pair? tree)
     (cons (nested-tree-map proc (car tree))
           (nested-tree-map proc (cdr tree))))
    (else (proc tree))))

This is tree recursion over the pair structure: one recursive call for the car and one recursive call for the cdr.

Try this expression

This direct pair-structure version is useful, but be clear about what it means. It treats every pair as two subtrees. That is different from treating a node as a datum plus a list of child trees.

Choosing A Representation

Both representations are reasonable, but they answer different questions.

Use a datum-plus-children ADT when every node should carry a value. This is natural for organization charts, taxonomies, file-system indexes, outlines, and syntax trees where each branch has a label.

Use a nested-list tree when the branch nodes are only grouping structure and the leaves carry the real data. This is natural for “apply this to every leaf” examples and for some symbolic data structures.

The important habit is to write down the representation before writing procedures over it. If you change the representation, procedures written above a clear abstraction barrier should not all need to change.

First Complete Example

The published example source for this lesson checks the datum-plus-children tree ADT, leaf detection, subtrees, tree mapping, and nested-list tree mapping.

(define (tree-map proc tree)
  (make-tree (proc (datum tree))
             (map (lambda (child)
                    (tree-map proc child))
                  (children tree))))

(define (nested-tree-map proc tree)
  (cond
    ((null? tree) '())
    ((pair? tree)
     (cons (nested-tree-map proc (car tree))
           (nested-tree-map proc (cdr tree))))
    (else (proc tree))))

The full source file has a highlighted view at /examples/r7rs/intro/lesson-05/ and a raw source link from that page.

When working from the source tree used to build this site, run the example suite with:

scripts/test_examples_chicken.sh

Exercises

These exercises are original and deliberately small.

In the geography diagram, identify the root, three leaves, and one subtree.
Explain why the Europe node can be viewed as both a child and a tree.
Define a tree for a small table of contents: book, chapters, and sections.
Write leaf-count for the datum-plus-children tree representation.
Use tree-map to add 1 to every datum in a numeric tree.
For '(a (b c) d), draw the leaves-only tree by hand.
Use nested-tree-map to convert every symbol in '(a (b c)) into the symbol x.
Explain why the datum-plus-children representation and the nested-list representation are not interchangeable.
Give an example where sibling order matters.
Give an example where sibling order does not matter.

What Comes Next

The next material will use the tree representations from this lesson to ask how a program should visit the nodes: depth first, breadth first, preorder, postorder, and inorder for binary trees.

Before moving on, make sure you can explain these ideas without guessing:

What node, root, child, leaf, and subtree mean.
Why a node can also be considered a tree.
Why Scheme does not impose one official tree representation.
How make-tree, datum, children, and leaf? work together.
Why tree-map uses ordinary map over the children.
Why nested-tree-map recurs on both car and cdr.
Why choosing a representation is part of designing the program.