Graphs

Graph Representations:

To solve graph problems, the first step is to store the graph in a suitable data structure; the representation we choose also shapes how we solve the problem. Generally, there are four ways to represent graphs, each with its advantages and disadvantages.

  • Nodes as objects and edges as pointers.

Most problems about (binary) trees use this pattern; a minimal sketch follows the complexity notes below.

Time complexity: O(V)

Memory complexity: O(V)
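For example, a node-as-object representation in Java might look like the following sketch (the class and field names are my own, just for illustration):

import java.util.ArrayList;
import java.util.List;

// Each node stores its own value and pointers to its neighbors.
class GraphNode {
    int val;
    List<GraphNode> neighbors = new ArrayList<>();

    GraphNode(int val) {
        this.val = val;
    }
}

For a binary tree, the neighbor list is simply replaced by left and right pointers, as in the TreeNode class used in the LeetCode problem later in these notes.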

  • A list of edges represented by two nodes.

For example,

[ [0,1], [0,6], [0,8], [1,4], [1,6], [1,9], [2,4], [2,6], [3,4], [3,5],
[3,8], [4,5], [4,9], [7,8], [7,9] ]

Graph problems in general are likely to use this pattern. Sometimes we would like to convert it to an adjacency list, as in the sketch below.

Time complexity: O(E)

Memory complexity: O(E)
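Here is a small sketch (class and method names are my own) that stores the edge list above as an int[][] and converts it to an adjacency list:

import java.util.ArrayList;
import java.util.List;

public class EdgeListDemo {
    // Convert an edge list into an adjacency list for an undirected graph with n nodes.
    static List<List<Integer>> toAdjList(int[][] edges, int n) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]); // drop this line for a directed graph
        }
        return adj;
    }

    public static void main(String[] args) {
        int[][] edges = { {0,1}, {0,6}, {0,8}, {1,4}, {1,6}, {1,9}, {2,4}, {2,6},
                          {3,4}, {3,5}, {3,8}, {4,5}, {4,9}, {7,8}, {7,9} };
        System.out.println(toAdjList(edges, 10)); // e.g. node 0 -> [1, 6, 8]
    }
}

The resulting list of lists is exactly the adjacency list representation described in the next bullet.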

  • Adjacency lists

Representing a graph with adjacency lists combines adjacency matrices with edge lists: for each node, we keep a list of the nodes it is adjacent to.

[Figure: adjacency list]

Time complexity: O(deg(V)) to check whether a particular edge exists

Memory complexity: O(V + E)

  • Adjacency Matrices

If there is an edge from the node of a given column to the node of a given row, the corresponding entry is 1; otherwise it is 0 (a small sketch follows the complexity notes below).

[Figure: adjacency matrix]

Time complexity: O(1)

Memory complexity: O(V²)
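A similar sketch (again with my own names) builds the adjacency matrix for the edge list shown earlier; checking whether an edge exists is then a single array lookup, which is why access is O(1):

public class AdjacencyMatrixDemo {
    // matrix[i][j] == 1 means there is an edge between node i and node j.
    static int[][] toMatrix(int[][] edges, int n) {
        int[][] matrix = new int[n][n];
        for (int[] e : edges) {
            matrix[e[0]][e[1]] = 1;
            matrix[e[1]][e[0]] = 1; // mirror the entry for an undirected graph
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[][] edges = { {0,1}, {0,6}, {0,8} };
        int[][] m = toMatrix(edges, 10);
        System.out.println(m[0][6]); // 1: there is an edge between 0 and 6
        System.out.println(m[0][2]); // 0: there is no edge between 0 and 2
    }
}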

Reference

[1] Khan Academy

Distributed System Study

What are distributed systems?

As I understand it, the concept of a distributed system is best understood in contrast to a single-machine computer system.

When the demand for complicated computation, faster speed, and parallel computing grew, people came up with the idea of the multi-core processor in a single machine. When that demand grew even further, distributed systems became more important for storing and computing over large amounts of data.

All computers or computing systems need three basic functions: storage, computation, and communication. When we learn how a distributed system works, we are really learning how the system stores data, what strategies or algorithms it uses for computing, and what mechanisms it uses for communication between the parts of the system.

Thus, to study a distributed system, we need to figure out how the system stores data in a distributed way, how it computes in a distributed way, and how the nodes (machines) communicate with each other.

The paragraph below is cited from Wikipedia: https://en.wikipedia.org/wiki/Distributed_computing

Distributed computing is a field of computer science that studies distributed systems. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components.[1] Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.

How does a distributed system work?

This is a big question and not easy to answer in a few words, but we can use some examples to reveal how a distributed system works. Here I would like to use Hadoop as an example to elaborate on this point.

Before digging into a specific distributed system, answering the following questions will help a lot.

What should we care about most when designing a distributed system?

To answer this question, we should understand the CAP theorem.

(cited from: UCI, EECS219 lecture)

CAP represents three properties of a distributed system:

1. Consistency: Every node in the system contains the same data

(e.g. replicas are never out of date)

2. Availability: Every request to a non-failing node in the system returns a response

3. Partition Tolerance: System properties (consistency and/or availability) hold even when the system is partitioned (communication is lost) and data is lost (a node is lost)

You can have at most two of these three properties for any shared-data system

• To scale out, you have to partition. That leaves either consistency or availability to choose from

– In many cases, you would choose availability over consistency.

 

How to store data in a distributed system?

As we all know, files are the basic form of data storage in a computer system. A distributed system stores data in a way very similar to the file system on a single computer. To guarantee consistency, a distributed system usually keeps replicas of the data on multiple nodes.

How to do computation in a distributed system?

One of the most popular frameworks for distributed computation is MapReduce.
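To make the idea concrete, here is a rough word-count sketch of the map and reduce phases in plain Java. This is only an illustration of the programming model, not the real Hadoop API, and all names in it are my own:

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Map phase: turn each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce phase: sum the counts per word (the framework's shuffle step groups pairs by key).
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = { "to be or not to be", "to do is to be" };
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (String line : lines) allPairs.addAll(map(line)); // in a real cluster, map runs on many nodes in parallel
        System.out.println(reduce(allPairs)); // e.g. {to=4, be=3, or=1, not=1, do=1, is=1}
    }
}

In a real MapReduce system, the map calls run in parallel on the nodes that hold the data, and the framework shuffles the intermediate pairs to the reducers by key.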

How to deal with node failure? Master node? Slave node?

Let's use Hadoop as an example: it has a master-slave architecture. To detect the failure of a slave node, the system uses a heartbeat mechanism, which is used to prove that a slave node is still alive. The master node, however, is a single point of failure.

cited from :

http://stackoverflow.com/questions/33494697/secondary-namenode-usage-in-hadoop-2-x/33719153#33719153

http://stackoverflow.com/questions/33311585/how-does-hadoop-namenode-failover-process-works/33313804#33313804

The master nodes in distributed Hadoop clusters host the various storage and processing management services, described in this list, for the entire Hadoop cluster. Redundancy is critical in avoiding single points of failure, though single points of failure can not be totally avoided.

  • NameNode: Manages HDFS storage. To ensure high availability, you have both an active NameNode and a standby NameNode. Each runs on its own, dedicated master node.

  • Checkpoint node (or backup node): Provides checkpointing services for the NameNode. This involves reading the NameNode’s edit log for changes to files in HDFS (new, deleted, and appended files) since the last checkpoint, and applying them to the NameNode’s master file that maps files to data blocks.

    In addition, the Backup Node keeps a copy of the file system namespace in memory and keeps it in sync with the state of the NameNode. For high availability deployments, do not use a checkpoint node or backup node — use a Standby NameNode instead. In addition to being an active standby for the NameNode, the Standby NameNode maintains the checkpointing services and keeps an up-to-date copy of the file system namespace in memory.

  • JournalNode: Receives edit log modifications indicating changes to files in HDFS from the NameNode. At least three JournalNode services (and it’s always an odd number) must be running in a cluster, and they’re lightweight enough that they can be colocated with other services on the master nodes.

In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs).

When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node reads these edits from the JNs and applies them to its own namespace.

In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

It is vital for an HA cluster that only one of the NameNodes be Active at a time. ZooKeeper is used to avoid the split-brain scenario, so that the NameNode state does not diverge during a failover.

The ZKFailoverController (ZKFC) is a new component: a ZooKeeper client that also monitors and manages the state of the NameNode. Each machine that runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:

Health monitoring – the ZKFC pings its local NameNode on a periodic basis with a health-check command. As long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.

ZooKeeper session management – when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special “lock” znode. This lock uses ZooKeeper’s support for “ephemeral” nodes; if the session expires, the lock node will be automatically deleted.

ZooKeeper-based election – if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has “won the election”, and is responsible for running a failover to make its local NameNode active.


Leetcode. 257. Binary Tree Paths

Given a binary tree, return all root-to-leaf paths.

For example, given the following binary tree:

   1
 /   \
2     3
 \
  5

All root-to-leaf paths are

["1->2->5", "1->3"]

Solution 1:

Basically, to solve this problem we need to traverse all the nodes in the tree. Hence, no matter in what order we visit them, the time complexity is O(n).

This is my first solution, which can still be improved. First, we can change the way we concatenate the strings to avoid deleting the trailing "->" at a leaf node. Second, string concatenation may be too expensive, so we can use a StringBuilder to improve performance.

/**
 * Definition for a binary tree node.
 * public class TreeNode {
 *     int val;
 *     TreeNode left;
 *     TreeNode right;
 *     TreeNode(int x) { val = x; }
 * }
 */
public class Solution {
    public List<String> binaryTreePaths(TreeNode root) {
        List<String> result = new ArrayList<>();
        if (root == null) return result;
        getPath(root, "", result);
        return result;
    }

    private void getPath(TreeNode node, String path, List<String> result) {
        path += node.val + "->";
        if (node.left != null) getPath(node.left, path, result);
        if (node.right != null) getPath(node.right, path, result);
        if (node.left == null && node.right == null) {
            // Leaf node: drop the trailing "->" before adding the path.
            path = path.substring(0, path.length() - 2);
            result.add(path);
        }
    }
}

Change the way of string concatenation.


public List<String> binaryTreePaths(TreeNode root) {
    List<String> answer = new ArrayList<String>();
    if (root != null) searchBT(root, "", answer);
    return answer;
}
private void searchBT(TreeNode root, String path, List<String> answer) {
    if (root.left == null && root.right == null) answer.add(path + root.val);
    if (root.left != null) searchBT(root.left, path + root.val + "->", answer);
    if (root.right != null) searchBT(root.right, path + root.val + "->", answer);
}

Use a StringBuilder. The tricky part is deciding when to append node.val and "->", and resetting the builder's length when backtracking.


public List<String> binaryTreePaths(TreeNode root) {
    List<String> res = new ArrayList<>();
    StringBuilder sb = new StringBuilder();
    helper(res, root, sb);
    return res;
}

private void helper(List<String> res, TreeNode root, StringBuilder sb) {
    if (root == null) {
        return;
    }
    int len = sb.length();
    sb.append(root.val);
    if (root.left == null && root.right == null) {
        res.add(sb.toString());
    } else {
        sb.append("->");
        helper(res, root.left, sb);
        helper(res, root.right, sb);
    }
    sb.setLength(len); // backtrack: restore the builder before returning
}

Hash Table

Learn from these links:

http://stackoverflow.com/questions/730620/how-does-a-hash-table-work

https://en.wikipedia.org/wiki/Hash_table

http://www.cs.cornell.edu/courses/cs3110/2011sp/lectures/lec20-amortized/amortized.htm

What is a hash algorithm (hash function)?

A hash function takes a group of characters (called a key) and maps it to a value of a certain length (called a hash value or hash). The hash value is representative of the original string of characters, but is normally smaller than the original.

Hashing is done for indexing and locating items in databases because it is easier to find the shorter hash value than the longer string. Hashing is also used in encryption.

This term is also known as a hashing algorithm or message digest function.

Hashing is used with a database to enable items to be retrieved more quickly. Hashing can also be used in the encryption and decryption of digital signatures. The hash function transforms the digital signature, then both the hash value and signature are sent to the receiver. The receiver uses the same hash function to generate the hash value and then compares it to that received with the message. If the hash values are the same, it is likely that the message was transmitted without errors.

One example of a hash function is called folding. This takes an original value, divides it into several parts, then adds the parts and uses the last four remaining digits as the hashed value or key.

Another example is called digit rearrangement. This takes the digits in certain positions of the original value, such as the third and sixth numbers, and reverses their order. It then uses the number left over as the hashed value.

It is nearly impossible to determine the original number based on a hashed value, unless the algorithm that was used is known.
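As a toy illustration of the folding idea described above (this is my own example, not a real-world hash function), we can split the key's digits into groups, add the groups, and keep only the last four digits of the sum:

public class FoldingHashDemo {
    // Toy folding hash: split the key into groups of three digits,
    // add the groups, and keep the last four digits of the sum.
    static int foldingHash(long key) {
        long sum = 0;
        while (key > 0) {
            sum += key % 1000; // take the lowest three digits as one part
            key /= 1000;
        }
        return (int) (sum % 10000); // keep only the last four digits
    }

    public static void main(String[] args) {
        // 4245679887 is split into 4, 245, 679, 887; their sum is 1815.
        System.out.println(foldingHash(4245679887L)); // 1815
    }
}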

How is a hash table stored in memory?

A hash table is stored in memory as an array of buckets. Each bucket has a unique index, and each entry (key-value pair) is stored in the bucket whose index is derived from the hash value computed from the entry's key, as shown in the picture below.

[Figure: hash table with keys mapped to buckets and collisions chained (source: Wikipedia)]

Why are there collisions in a hash table?

Because the hash value is representative of the original key but the range of hash values is normally smaller than the range of possible keys, two different keys may hash to the same value. When that happens, a collision occurs.

How to deal with collision?

When a collision happens in a bucket, we can chain the entries as a linked list, as shown in the picture above and in the sketch below. There are also many other data structures that can be used to chain entries within a bucket. For example, we can use a balanced binary tree; in that case the worst-case lookup time improves, but the space overhead increases.

The number of buckets in a hash table is pre-defined, which determines how many entries it can hold before it gets too full. If the array becomes too full, we simply resize it. There are many resizing strategies; one is to double the size of the array, recompute the hash codes for the existing keys (because the size of the array has changed), copy the entries to the new array, and delete the old one.

This operation is not constant time, but rather linear in the number of elements at the time the table is grown.
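Here is a minimal sketch of a hash table with linked-list chaining (the class names and the fixed bucket count are my own; a real implementation would also resize as described above):

import java.util.LinkedList;

public class ChainedHashMap<K, V> {
    // Each bucket is a linked list of entries; keys that collide share a bucket.
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashMap(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    // Map a key's hash code to a bucket index.
    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    public void put(K key, V value) {
        LinkedList<Entry<K, V>> bucket = buckets[indexFor(key)];
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; } // key already present: update it
        }
        bucket.add(new Entry<>(key, value)); // empty bucket or collision: chain a new entry
    }

    public V get(K key) {
        for (Entry<K, V> e : buckets[indexFor(key)]) {
            if (e.key.equals(key)) return e.value;
        }
        return null; // key not found
    }

    public static void main(String[] args) {
        ChainedHashMap<String, Integer> map = new ChainedHashMap<>(16);
        map.put("apple", 1);
        map.put("banana", 2);
        System.out.println(map.get("apple"));  // 1
        System.out.println(map.get("cherry")); // null
    }
}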

What are the time and space complexities of using a hash table?

The following is for the linked-list chaining method.

Time complexity:

Average: O(1).

Because in most cases collisions do not happen frequently, we can compute the hash code of a key and retrieve its value in constant time. We say that the insertion operation has O(1) amortized run time because the time required to insert an element is O(1) on average, even though some insertions trigger a lengthy rehashing of all the elements of the hash table.

Worst case: O(n).

Because in the worst case all entries end up in a single bucket, which means every newly created entry causes a collision. In this case, accessing or searching for an entry in the hash table is the same as in a linked list.

 

Space complexity:

O(n).

Because the space required grows linearly as the number of entries increases.


Array

Learn from these links:

https://en.wikipedia.org/wiki/Array_data_structure


What is an Array?

An array is a data structure consisting of a collection of elements, each identified by one index or key.

In most cases, the indexes of an array are a sequence of consecutive integers, starting from 0 or 1, as we see in most programming languages such as C, Java, Python, and JavaScript.

How are arrays stored in Memory?

Basically, an array is stored in a sequence of contiguous addresses in memory. How much space each element occupies depends on the type of the elements in the array.

What are the time and space complexities of using arrays?

Space complexity: O(n)

Arrays are stored in contiguous memory, and memory itself is a series of contiguous addresses. We need k*n space to store an array, where n is the length of the array and k is the size of each element.

Time complexity:

Access: O(1)

Because an array is stored at contiguous addresses, to find any element in a given array we just calculate the address of the wanted element from the starting address, the index, and the element size (address = base + index * elementSize). Thus the time to get an element of an array is constant.

Search: O(n)

To search for a value in an array, we need to traverse the whole array; see the sketch below.
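A small sketch contrasting the two operations (the array contents here are arbitrary, just for illustration):

public class ArrayAccessVsSearch {
    // Linear search: scan the elements until the target value is found.
    static int indexOf(int[] arr, int target) {
        for (int i = 0; i < arr.length; i++) {
            if (arr[i] == target) return i;
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        int[] arr = { 7, 3, 9, 1, 5 };
        // Access by index: O(1); the runtime computes base address + index * element size.
        System.out.println(arr[2]);          // 9
        // Search by value: O(n); we may have to look at every element.
        System.out.println(indexOf(arr, 1)); // 3
    }
}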


Memory basics

memory hierarchy

In a computer, memory is not just one thing but a multi-level system: the memory hierarchy. It goes from the processor, to several levels of RAM, to the hard drive. As the size of each level increases, its speed and cost usually decrease. The two pictures below show the memory hierarchy.

[Figures: computer memory pyramid; memory hierarchy levels]

(source: http://computer.howstuffworks.com/computer-memory1.htm)

 

How is data stored in memory?

To answer this question, we first need to figure out the physical architecture of a memory medium. There are many types of memory media: semiconductor, magnetic, optical, and paper tape. Let me just focus on semiconductor here, which is the medium for RAM, flash, and SSDs.

A semiconductor memory medium (chip) can contain millions of transistors and capacitors; these elements are the units that store the 1s and 0s. The units are aligned in a matrix, and each unit's location can be addressed: a row selector and a column selector locate a unit, the address information is sent between the memory chips and the processor over the address bus (lines), and the address is translated into high or low signals for the row and column selectors, as shown in the picture below.

[Figure: memory cell matrix addressed by row and column selectors]

As we all know, in computer science all data types (integers, strings, images, etc.) are stored as 1s and 0s. A group of these units can be used as a basic block for representing data, so we have a series of 1s and 0s representing each piece of data. I think this is why we need to convert any data into a binary code.

Let’s say the minimum unit of data is stored in a row; for example, each row is a word. Hence, each row stores one data unit, and we access the memory row by row.

These basic ideas about memory will help us understand how all kinds of data structures are stored in memory.