## Statistical Traversals

DSE Version: 6.0

Video

WIth a statistical traversal, we move around the graph and collect statistics about the structure or properties in some way. We are computing aggregates or some kind of analysis that will summarize what is in the graph. In this unit, you will learn about statistical traversals.

Transcript:

Hey everyone, I'm Denise gosnell these are statistical traversals.

In a statistical traversal, we are going to move around the graph and collect statistics about the structure or properties in some way. We are computing aggregates or some kind of analysis that will summarize what is in the graph. So, instead of returning a subgraph or a collection of vertices or edges, what the statistical traversal is doing is probably returning a scalar or collection of numbers that describe the statistics in the graph.

You can see in this example that we are counting the different colors of the vertices. One is green, two are orange, and 4 of the vertices are that dark teal.  This could be the result of a real statistical traversal.

Of course, any new traversal type will give us some new steps. Let’s look at the steps for doing simple aggregates. We have count, sum, mean, min, and max. And all of those things do about what you would expect them to do. They will count things, calculate an average, find the minimum or find the maximum.

Now for each one, we can optionally give them a scope. By default that scope is goign to be global. But, we can also ask for it to be local. The deafult global scope will apply the aggregation to the entire stream. The local scope will apply the aggregation to each collection type object in the traversal stream.

You may want to provide an aggregation function of your own where none of the triial mean, min, max things fit the build. For this, there is fold.

Now, you can call fold and you can give it an intial value, and then a function that will be applied on all of the objects in a traversal stream.

Fold is best explained in an example, that is coming up in a little bit. There is also an inverse of fold called unfold. Now, an aggregation like sum, there is no way that you can un-sum numbers, but potentially -- we are able to unfold after we have folded.

When we are aggregating, it is pretty common for us to also group. Group basically creates a map of objects that organizes from the traversal stream into groups, based on some known characteristic of those objects coming in. Now, we can name the group -- that is the optional map label, if we want to be able to refer to the map later on. But, much more importantly, is the modulator by. We are usually grouping by some particular thing. We are grouping by age, or gender, or director, or something else. So, the typical use is group and then by with the category we want to group by.

Now, the by modular here is very important. Calling the by modulator once will indicate what we want to group on. For example age ,directores, etc. The second time you call the by modulator, you are indicating how you want to represent each object in the group. So, you may want to pick out some property in the object and use that property to represent the object in the map that group is creating.

Now, the distinction between those two calls to by is very important and an example will go a long way in making that make sense.

If the purpose of grouping is simply to count, then we want to use group count.

If you want to group by age and then count up the people who have various ages, then you will want to say group count by age. It is more effecient and more expressive to use this step.

Let’s look at some examples. As always, we are using our killr video graph schema.

We have movies, they are made by various kind of people. The movies belong to genres, and the movies are rated by users. And, Users know each other in a social network.

Let’s see what we can do with teh statisical traversals with our sample data in this schema.

Let’s start with the basics. Let’s count some vertices and then count some edges. We see 10,797 vertices in this graph, and 69,054 edges in this sample graph.

We can be a bit more specific and we can count only those vertices with the user label --which there are  100 of those.

And, we see that there are 3500 edges with the “knows label”.

Next, let’s calculate some statistical values about the properties we have on the vertices in our graph.

First, “Look at all of the vertices labeled users, get their age, and calculate their average age”.

The youngest user is 12 and the oldest user is 25.

Some more simple analysis.

Let’s get all of the movie vertices, get their duration property, and add it up. We have 106,358 minutes worth of movies in the kilrr video dataset.

Now, we mentioned fold before, so here is an example that uses fold. We are using fold to do the same computation as the sum step, just so that it is an easy function to wrap you mind around. Now, the call to fold gives you a seed value of 0 -- when you are trying to do a sum, 0 is likely t he seed value you want to start with. Next, we pass it a function and we use the closure syntax from the groovy language. That function takes two parameters, we are naming them x and y -- the stuff to the left of the arrow. To the right of the arrow is the body of the function. It is a very trivial function, just adding x and y. Then, the value is returned from the closure.

Now, important to note--- that function can be much more complex. We can put arbitrary code to the right of the arrow that can do more complex things.

Now, if we have the name of the method, you can also pass that directly into the fold step.  You give it the seed value of 0 and then pass it the name of the method sum. We are passing a native groovy method there, but it could be some other closure identifier that we are passing in there. We could define that function elsewhere and pass it into the traversal, if it is a more convenient way to factor our code.

Now, this is pretty trivial of an example. But, if you have many lines of code for your function,  you might not want to put that inline in your closure in the middle of your traversal. This gives us a cleaner way to factor our code when the need arises.

So, three ways to answer the question: how long will it take to watch all of our movies in our test data.

Now, when we were looking at the group step, we mentioned that it returne a map. Let’s look at this in an example.

What we want to do is to calculate the average age of the men and the women, separately, who use killr video.

So, we will get all of the user vertices and we will group them by the gender property. That is the first 3 lines: all of the users and group them by gender. The 2nd call to the by modulator where it says by age means in the resulting map, I want to represent the values by that property.  If we had just grouped by gender: we would get a map with 2 keys: one key is M and one key is F. And, the values would be the user vertices.The whole vertex would be the value. Here, we are saying that we know we want to simplify that. We just want the ages. Here, we want the collection of ages for the M key, and a collection of ages for the F key.

And, that is what we see in the sample output. The average age of the women is 30.56. The average age of the men is 32.01.

Next, suppose we want to find the number of different kinds of vertices and the different kinds of edges in this graph. We will start by getting all of the vertices and simply group counting them by label.

We see that there are 4 labels that show up in our graph. There are movies, people, genres, and users. And you see the counts for each one of those, giving us a distribution of vertex labels for this sample graph.

Next, we will group count the edges by their label. We will look at all of the unique edge labels in this graphs, put them into groups, and count them. There are 10,678 actors. 48,094 ratings and so forth.

And, as you look at these numbers, you might see things like there are 10,678 actors but only 8759 people. But, actor is an edge label, so it doesn’t really refer to a person doing acting. In reality, one person can be an actor in more than one movie. So, one person can have multiple edges actor edges going to movie vertices.

These numbers may seem odd at first, but once you think about them, they make sense.

Next, let’s look at the distribution of ages of our users.

To do this, let’s grab all of the user vertices and group count them by age. We will get a map that almost produces our histogram for us. From here, we could plot how many people are of each age.

Lastly, we can break down the movies by year. We can look at how many movies were made in each year. let’s grab all of the movie vertices and group count them by year.

Again, this kind of gives you a histogram of how many films were made in each year.

Now, up until now, the examples have been pretty simple. This next one will calculate the centrality for the vertices in our graph.

Now, centrality isn’t quite the same by looking at the total number of the edges on a vertex -- it is a little bit more complicated than that.

To calculate centrality, we are going to walk through all of the vertices in our graph and we are going to order them by their centrality. Then, the first vertex in the list will have the highest centrality.

We are going to use a recursive traversal, that is the repeat step. Let’s look at the steps inside repeat. We group count, and give the map a name “m”. So, every time this group count is repeated, the results are going to be put into the same map. So, we are going to walk around the graph counting things, and put all of the counts in this m map.

Now, notice we are not counting by anything in particular. We are counting simply the vertices that we find in the graph. Everytime we visit one, we count it once. From that group, we traverse edges in both directions. So, we are trying to walk all over the graph.

We will repeat that a certain number of times. We picked 30 -- and we picked it for a reason. When doing this type of calculation to estimate centrality, if you can imagine the list changing order every time you do this, you can imagine that you will have a certain ordering after the first repetition. On the 2nd traversal, the order will shift around the bit. Eventually, the order will converge on an estimation of the centrality of the vertices in the graph.  5 times is likely not enough for this estimation, but after 30, you might not see many changes in the centrality ordering for most graphs.

Now, that repeat group count both times 30 can do a lot of work. We are goign to force that work to complete by calling cap on m. So, all of those traversers need to do their computation and give us this map m to work with. The next 4 lines are important for giving us the result we want, but computationally, they are less interesting. Here, we are going to locally order those objects by value, decreasing.

What is that value? It is count of the number of times we have encountered each one of those vertices.

Now -- let’s look at the select keys line. Remember, the object we have up until this point is a map. The keys in the map are the vertices themselves, the values are counts. We have already ordered them by their value. So, we have the centrality ordering we need and we just want the keys. This returns the now ordered list of vertices. We would like to operate on those vertices individually, not as a collection -- that is what unfold does. It takes a big collection of vertices and let’s us deal with them individually.

Then, we choose how we want to observe the output. We choose to look at movies by their title, and otherwise we just want to get their name.  And you can see what we get: the central vertex is the comedy genre followed by the drama comedy. Skipping a few elements, we see the most central movie is Forrest Gump.

Estimating centrality is a complex traversal, it isn’t as simple as group or group by. But, it is a great example of a way to use the statistical traversal steps to build up to a more complex mathematical estimation.

No write up.
No Exercises.
No FAQs.
No resources.