Subgraph Traversals

DSE Version: 6.0

Video

The subgraph traversal extracts a full subset of the larger graph for analysis, visualization, or some other similar purpose. In this unit, you will learn about subgraph traversals.

Transcript: 



Hey everyone, I am Denise Gosnell and this is subgraph traversals.

Many of the traversals that we focus on walk around the graph, calculating some statistical values, selecting a list of vertices or edges, picking a value or something like that.  But, the subgraph traversal extracts a full subset of the larger graph for analysis, visualization, or some other purpose like that. Specifically, it generates an edge induced subgraph based on edges we give it in the traversal stream.

An edge induced subgraph is subgraph formed by the subset of edges and their endpoint vertices. In a graph, it doesn’t make sense for an edge not to have a vertex. Therefore, once we have an edge, we also have the vertices it is connected to. also, all of the properties from the vertices and edges are copied into the subgraph. So, the subgraph would have all of the properties we would expect it to have.

For subgraphs, we have a new step. That step is subgraph. We give it a label; this label is the name of the induced subgraph. This generates an edge induced subgraph from the edges it receives in the traversal stream. Now, this subgraph is not passed on to subsequent steps in the stream. Instead, it is produced as a side effect, that is why the label is so important. That is how we are going to identify it. It is sort of stored off to the side in a map, if you will.  Its label is the key and the subgraph itself is the object in the sideeffect map.

Now, we mentioned that the subgraph step doesn’t pass the subgraph into the subsequent steps in the traversal. But you can force that. If you use the cap step, and you tell it the label of the subgraph of interest, what cap will do is force all of the traversers that you have defined so far to complete their work, that is to completely define the subgraph, and then pass that subgraph onto the next step in the traversal.  Or, if you don’t want to pass the subgraph onto the other steps in the traversal, you could just call iterate. Now, iterate will also, like cap, force all of the traversers to do their work. So, you will know that when iterate is done, the subgraph has been completely computed. But the subgraph will just be stored into the sideffect area and not passed on into subsequent steps in the traversal.

A reminder as always – of our KillrVideo movie schema.

Movies are related to people and movies belong to genres. Movies also have directors, actors, and other creative team members involved in the production of the movie.

Lastly, Users rate movies and know each other. That will be important for the upcoming examples.

Let’s extract the complete social graph of all killr video users in our example data. This graph will be the users who know other users.

In the traversal, we begin with our graph and we look at all of its edges, and then only those edges which have the label “knows”. We know from our schema,that those edges will be connecting our users. Then, we ask for the subgraph and we give it the label social graph.

Then, we cap the social graph subgraph, which means -- make sure all of the traversers run and pass those traversers on into the next step in the traversal. Then, we will call next which will access the actual subgraph object that we have computed, and that is what we will return. That is what this traversal returns into the identifier social graph.

Then, we create a new traversal on that subgraph and we are going to do 2 things to it. First, we will get all of the vertices and group count them. group count is statistical step that we went over in a different video. It is going to find all of the vertices that have a common label, as shown in the by modulator. We will group up the vertices by their label, and count them. But, in this subgraph, we only extracted users. So we will find 1079 users.

Next, we will do the same thing with the edges. We will group them by label and count them up. Now, this is our social graph, so the only kind of edge we are going to see has the knows label.

Lastly, because this is an edge induced subgraph, there is a consequence. We will never see vertices that do not have edges. It is perfectly fine for there to be isolate or disconnected vertices in our graph, a property graph supports that idea. But, those vertices will not show up in our edge induced subgraph. And, this last traversal counts up the total number of isolate vertices. In this traversal, we are looking for all of the vertices labeled user, but do not have a “knows” edge coming in or out. We count them up, and we have 21 of them.  That is, we have 21 disconnected users, or loaner users who do not know any body yet.

Now, let’s find a subgraph based on the immediate social connections of a particular user. That user in this case will be good ole U1.  so we are going to look for the vertex that has the user ID of U1. then we're going to get both the incoming and outgoing edges with the “knows” labels -- that's all the social relationships for this user and compute the subgraph based on those edges. In Edge induced subgraphs -- we start with edges. We will call that the U1 graph, we will cap it to force those traversers to do the work, and then we will return the graph. We will do the same kind of statistical analysis on this graph, as we did in the last example.

We are goign to group count by label the vertices and the edges. We get 6 users; of course they are only going to be users because in our social network schema, we know that this edge only connects users. Keep in mind, this will be the original u1 plus 5 other people that u1 knows. Then, we will count the edges by label, and we find 5 of those. 1 u1 vertex, and 5 edges to the 5 people it knows.

Let’s expand our search a little bit. Let’s look at the friends of u1’s friends.

To do this, we will need to use a recursive traversal. That is covered in a whole other topic, but that is the repeat step is on the second like.

To start, we get user u1 and then we do a recursive traversal looking for both incoming and outgoing edges looking for the “knows” edge label, and we compute the subgraph based on those edges.

Now, note that the subgraph step is inside the repeat step. So, we are repeating that -- we are repeating the subgraph step. And, each time, we give it the same label “u1graph3”.

How many times are we doing this? You see that in the 4th line with the times step where we say to do this 3 times.  So, as that traversal is repeated, the state of that side effect -- the u1graph3 object -- it is changed as the repitition proceeds. As we run this traversal, u1graph3 is growing as a subgraph.  We we are done, we cap it to make sure those traversers are doing their work and we get the graph back in the u1graph3 identifier that we specify in the top line.

Then we do the same statistical analysis. We group the vertices by label and the edges by label. We see there are 195 users. That is 1 user for u1, and then 194 friends, friends of their friends, and friends friends of their friends. And then 240 edges that connect all of the people in this increasingly large graph.

Now, a subgraph traversal doesn’t have to be limited to only extracting one subgraph. In this example, we are going to extract two. We are going to start with all users and the edges that connect them. Get all of the knows edges in the graph.  That is kind of the whole social network.

Then, we extract that subgraph calling it “social graph”.

Now, when we are done with that subgraph step we call both V. Now, remember subgraph doesn’t pass it subgraph onto the next step in the traversal. It generates a seide effect. So, the bothV call on the second line is taking its input from the bothE on the first line. Now, in our schema, the endpoint of a knows edge is always a user. So, both e will give us user vertices. From there, we will look at outgoing edges that have the rated label and we know from our schema that those are going to be movies. So, we will get the subgraph of movies. So now we have created that sideeffect called “movie graph” on the 3rd line with that subgraph of movies.

From there, recall that we are on an edge -- the most recent step and relevant step was  outE “rated”, so we will go to the incoming vertex with the inV call. Again, we are at movies. We will go from there to the outgoing edges and in our schema, we know that those outgoing edges coming from movies are things like “director”, actor, belongs to, and so forth. Then, we call subgraph using the same movie graph label. So, we already have a subgraph called movie graph, so in effect adding stuff into that graph by using that same label again.  So, we are kind of enriching our movies subgraph with all of the people and genres associated with the movies.

Then, we want to do things with those two subgraphs. So, we can cap them both. We call cap, and we ask for the social graph and movie graph labels. Those traversers will finish, we will return those subgraphs in the graph identifiers we have named on the first line.

Now, graphs will be a map of two subgraphs: social grpah, and movie graph. We will define traversals on those and again do very similar analysis. We see the vertex and edge statsitics of the social graph. There are 1079 users and 3500 relationships relating those users. Then, we see the label distribution of the movie graph: 920 movies, 8759 people, 18 genres, and of course those 1079 users who are rating them.

Likewise, we can count the number of edges relating those things and we can get interesting facts. Like, that there are 1668 screenwriters or 1061 directors in our database.

Now, it is worth noting, if we were merge these two subgraphs back into one, we actually wouldn’t have our original graph because we have lost the disconnected people. We extracted the social graph and the users with no social relationships didn’t have a chance to show up. We lost them, and their ratings.

So, the fact that we are extracting edge induced subgraphs has consequences like that.

And here is one more way of doing the same thing. Now, our subgraph definitions are exactly the same as the previous example. We are getting the social graph and then the movie graph. We get the movie graph by starting wth movies that people have rated and we add onto that the people who make movies, and the genres that movies belong to. Instead of calling cap, we just call iterate. That is it. The traversers do their work, and the traversal is done.

Now, to access the subgraph, we have to look at the sideEffect collection. We treat it like a map and we get things out of it by label. We get the social graph, and the movie graph, and we create new traversals on those things.

Then, the rest of our analysis is exactly the same.

The key point in this last example is the side effect map and how to use that. When we make a subgraph, we are creating a side effect. We are putting an entry into this sideeffect map, whose key is the subgraph label and the value is the subgraph object itself. We have the pretty easy API taht we can use to access those.

That is subgraph traversals in a nutshell.  We hope these examples have been helpful as you dig into the following exercise.

No write up.
No Exercises.
No FAQs.
No resources.
Comments are closed.