Venti
Data Structures and Algorithms
It has been a week since my last post. Last week was the most stressful so far at 8th Light, the Coursera course I am currently taking took up a huge amount of my time, and unfortuantely I badly estimated it in my IPM. In doing so, some tasks slipped, and this blog was one of them. I’ve been trying to think about how to avoid the same situation in the future. I recognise that this blog is a learning mechanism for myself first, and information for others second. So by neglecting it, I am actually hindering my own development. Even if it means that my data structures progress suffers.
Making time to reflect is key, and last week I was so focused on getting things done, I didn’t look back once. I am sure this contributed to the stress I felt. Additionally, I have been asked to make this blog more “in depth” and move away from a daily model, so I probably would have learnt more than usual.
So, in the past week, what have I learnt? In total I have covered a huge number of different topics:
- Arrays
- Single Linked Lists
- Double Linked Lists
- Stacks
- Queues
- Trees
- Tree Traversal
- Dynamic Arrays
- Amortised Analysis
- Priority Queues
- Binary Trees
- Heap Sorting
- Disjoint Sets
- Big O Notation
I wouldn’t say that I am an expert in any of these areas yet, but I have at least got a basic understanding and awareness of the different concepts.
The course has a number of different assignments at the end of each weekly topic. All of which require the implementation of an algorithm to solve a specific problem. There have been two types of assessment so far, create an algorithm from scratch, or refine an existing algorithm in order to make it more efficient. The automated test suites that run the submissions operate with different timing constraints for different languages. So far I have been doing everything with Java.
In this blog post I’m going to address the one of the first assignments I had, how I tackled it, and what ended up being the final solution. This particular assignment, unlike some of the others, didn’t come with a “real world” problem to address. It was directly an implementation query, “how to find the height of a tree (with an arbitrary number of branches)”.
For those that are unfamiliar, a tree data structure consists of a collection of nodes, with a root node from which all other nodes are connected via branches. In fact, trees are typically depicted more like the roots of a tree rather than the above ground section, since the root is at the top, and the branches extend downwards. There is a special type of tree called a binary tree that has specifically no more than two branches coming from any node. These trees are particular useful for storing data, since they can be used for very efficient searching. The height of a tree is calculated as the number of nodes from the root to the lowest leaf node (node without any branches).
Side Note: there are actually two definitions of tree height/depth, those that count the number of nodes in a tree from highest to lowest, and those that count the branches between them. Hence some define a tree with only a root node as having a height of one, others would say zero. For the purposes of this assignment, a tree with only a root node will have a height of one!
The starter code for the assignment gave an iterative solution to the problem.
int computeHeight() {
// Replace this code with a faster implementation
int maxHeight = 0;
for (int vertex = 0; vertex < n; vertex++) {
int height = 0;
for (int i = vertex; i != -1; i = parent[i])
height++;
maxHeight = Math.max(maxHeight, height);
}
return maxHeight;
}
For clarity, n
is an integer representing the number of nodes in the tree.
And parent[]
is an int array representing the tree, where the value at each
index of the array corresponds to the index of that node’s parent.
I found getting my head around the parent array difficult, it seemed to be designed in the wrong way. It added to my confusion as a whole around this solution so I disregarded it at first. I recognised it at least as iterative, and so came to the presumptuous conclusion that the solution they were looking for must be recursive, because recursive algorithms are faster…
In order to create a recursive solution to the problem I knew that the solution
needed to allow a root node to be passed so I could recursively ask for the
heights of it’s children. In order to do so, I needed to create a Node class
that could represent the tree structure in a better way than the array.
I needed a way to link to children not parents of nodes. I iterated through the
existing parent array and created a tree structure using the Node
class. And
then used the recursive newComputeHeight
method to calculate the tree height
based on passing in the root node.
class Node {
private int data;
private Node parent;
private List<Node> children;
public Node(int data) {
this.data = data;
this.children = new ArrayList<Node>();
}
public void addChild(Node child) {
children.add(child);
}
public int countChildren() {
return children.size();
}
}
class TreeHeight {
...
int newComputeHeight(Node root) {
if (root.countChildren() == 0) return 1;
List<Integer> childHeights = new ArrayList<Integer>();
for (Node child : root.children) {
childHeights.add(newComputeHeight(child));
}
return 1 + Collections.max(childHeights);
}
}
This solution worked! Ish.
In producing this solution and submitting it online, I discovered that the algorithm would be tested against a tree with 100000 nodes. I discovered this because my algorithm was failing due to a stack overflow error! My first stack overflow error!
At this point in the story, Uku arrived to the rescue. He explained to me that recursive algorithms are almost never more efficient than iterative ones, generally they are slower. However, they are typically more elegant. So my assumption at the start head lead me down the wrong path. Although ultimately it was a good lesson learnt.
Armed with new information on the efficiency of different types of algorithm, I turned back to the given solution from the starter code. After investigating it a bit further (after a long time hitting my head against a wall) I discovered what their algorithm was doing, and what it was doing inefficiently.
The line for (int i = vertex; i != -1; i = parent[i])
confused me for
a while. What this line is doing in conjunction with for
loop above is
starting at a given node, follow the path up the tree via parents until you
reach the root node (denoted by a value of -1). Each time adding one to the
height calculation. I allowed myself to get distracted during this assignment
for various external reasons, and didn’t take the time to try and understand
this algorithm. However, when I did, I realised that it was duplicating a huge
amount of work. On travelling up the tree from a given node, it would pass
a large number of nodes on its way to the root, but would do nothing with the
information on their heights. So it was recalculating the height for nodes that
it had essentially already calculated! I knew it was inefficient, but I was
amazed when I saw the figures to back it up:
Note: I changed the word vertex to node, because it made reading the algorithm a little easier for me.
int computeHeight() {
// Replace this code with a faster implementation
int maxHeight = 0;
long numOperations = 0;
for (int node = 0; node < n; node++) {
int height = 0;
for (int i = node; i != -1; i = parent[i]) {
height++;
numOperations++;
}
maxHeight = Math.max(maxHeight, height);
}
System.out.println(numOperations);
return maxHeight;
}
numOperations: 2,500,100,000
running time: 22s 341ms
Notice that I had to make the
numOperations
along
type!
I implemented a simple solution using a new array int[] heights = new int[n]
that stored a height for values that it passed on the way up to the root during
other calculations. There was still some duplication in the algorithm, but far
less than before.
int computeHeight() {
int maxHeight = 0;
int numOperations = 0;
for (int node = 0; node < n; node++) {
int height = 0;
for (int i = node; i != -1; i = parent[i]) {
if (heights[i] != 0) {
height = height + heights[i];
break;
}
height++;
numOperations++;
}
heights[node] = height;
maxHeight = Math.max(maxHeight, height);
}
System.out.println(numOperations);
return maxHeight;
}
numOperations: 1,050,284
running time: 243ms
This solution actually turned out to be the fastest solution for this problem, but Uku rightly pointed out that using memoization to solve this assignment, probably wasn’t its purpose.
So, onto a more technically appropriate solution to the problem. Tree
traversal. I’m showing the full TreeHeight
class in this solution, so
a couple of notes on the new items. The read
method has always existed, and
just takes the string argument from terminal and uses it to define the number
of nodes, and parent array as described before. I piggybacked off that method
to implement my tree structure in order that I could traverse it.
class Node {
private List<Node> children;
public Node() {
this.children = new ArrayList<Node>();
}
public void addChild(Node child) {
children.add(child);
}
}
class TreeHeight {
int n;
int parent[];
Node root;
Node getRoot() {
return root;
}
void read(InputStreamReader input) throws IOException {
FastScanner in = new FastScanner(input);
n = in.nextInt();
parent = new int[n];
for (int i = 0; i < n; i++) {
parent[i] = in.nextInt();
}
createTree();
}
void createTree() {
List<Node> nodeList = new ArrayList<Node>();
for (int i = 0; i < n; i++) {
Node node = new Node();
nodeList.add(node);
}
for (int i = 0; i < n; i++) {
if (parent[i] == -1) {
root = nodeList.get(i);
} else {
nodeList.get(parent[i]).addChild(nodeList.get(i));
}
}
}
int computeHeight(Node root) {
int height = 1;
List<Node> todo = root.children;
while(true) {
int nodeCount = todo.size();
if (nodeCount == 0) {
return height;
} else {
height++;
}
while(nodeCount > 0) {
Node thisNode = todo.remove(0);
todo.addAll(thisNode.children);
nodeCount--;
}
}
}
}
This version of the computeHeight
method uses Breadth First traversal to travel
down the tree from the root node, passing all the children on each level of the
tree. Whenever all the children from a particular level are added to the queue
the height total is incremented. This is really an iterative version of the
recursive algorithm from before, so it is quite nice in that sense to see how
the different algorithms stack up.
The algorithm above works nicely on any type of tree, since it doesn’t care how many children there are on each node. It is slightly slower than the memoized version of the solution, but is preferable since it is a standard algorithm that can be reused. The delay is mostly due to the creation of the underlying tree data structure used for the iteration. The starter code algorithm is tied heavily to the way data is presented to the programme from the terminal, so is not a preferable solution in that sense.