Sack Traversals

DSE Version: 6.0

Video

The sack traversal in Gremlin is kind of like giving an empty bag (or sack) to every gremlin traverser in your graph. The gremlins carry their bag as they walk around the graph, and they collect properties about the graph to store in the bag. In this unit, you will learn about the sack traversal.

Transcript: 

Hey everyone, I am Denise Gosnell and this is sack traversals.

As I like to think of it, the sack step in gremlin is kind of like giving an empty bag (or sack) to every gremlin traverser in your graph. The gremlins carry their bag as they walk around the graph, and they collect properties about the graph to store in the bag. Possibly, using this sack, you could instruct your gremlin, say, to pick up, calculate, and store something in the bag as it walks through the graph structure. For example, the gremlin could collect the total number of vertices or edges it walks through by summing up an integer in its sack. Then, At the end of the traversal, you could inspect the contents of the sack to see what the gremlin traverser collected during its walk.

Let’s visualize this.

You can imagine that the gremlins on this graph represent the full journey of one gremlin traverser from left to right. That is, the gremlin started on the vertex at the far left of the graph and walked the full path all the way to the right.

To start, let’s imagine that the gremlin on the very left vertex began his journey with an empty bag. Then, as the gremlin walked to the right, he added one for every edge he walked over. So, when you inspect the sack for the gremlin at any vertex, it will contain the number of edges the gremlin walked to get there. At the end, we would inspect the value in the sack and find a value of 5, representing the 5 edges our gremlin walked from left to right in our graph.

While as a visual, that may make sense, let’s dive a little deeper and create a more concrete definition.

In its most simplest form, a sack is just an object or a data structure that is local to the traverser. You are required to initialize the sack with a value or an object, such a zero or maybe an empty map.  As your traverser moves throughout the graph structure, you can update the object in the sack or you can read the values of the object to make a decision about your traversal.

It is important to note two things:

  1. Unlike side effects, which are global data structures visible to all traversers, sacks are local data structures visible to only their respective traversers.

  2. Unless explicitly initialized, traversers do not have sacks.

Unless you happen to be a handbag aficionado, why would you want give your gremlin a sack?

For starters, sacks are really useful in collecting path information or computing decaying energy algorithms throughout your graph structure. Also, in general, sack()-based traversals have a smaller memory footprint than path()-based traversals for two reasons.

(1) sacks usually contain values like the path distance rather than full path histories and

(2) traversers with sacks can be merged while traversers with path data structures cannot be bulked.

While they are extremely useful, sack()-based traversals may not outperform path()-based traversals 100% of the time. A lot depends on path characteristics, the graph’s branching factor, and how much bulking is possible

We will get into the differences between bulking in merging in a later example. But for now, it is important to note that bulking is something you can do with any traverser whereas merging is something that is specific to the sack step.

As with any new concept, let’s introduce the new steps, their parameters, and common usage patterns.

First, we have the withSack step. The withSack step enables sacks in your graph; it is the way that you create a sack for each gremlin traverser. There is one required parameter and two optional parameters. The first parameter, initialValueSupplier,  is required, and it provides the initial value for each traverser’s sack.

Next, you can provide a split operator. The split operator is a UnaryOperator that clones the traverser’s sack when the traverser splits. If no split operator is provided, then UnaryOperator.identity() is assumed. This essential tells the traverser what to do when it moves from a vertex and out to multiple edges. The split operator defines what to do with the sack and its values when the traversers split from one to many.

After a split, we need to consider what to do on a merge. That is the third parameter of the withSack step.  Merge operator is a BinaryOperator that unites two traversers' sacks when they are merged from multiple traversers into one. If no merge operator is provided, then traversers with sacks cannot be merged. You can picture this when multiple gremlins take different routes to end up on the same vertex. The merge operator indicates what the traversers are to do with the values in each traverser’s sack.

The next steps to introduce are those which allow us to mutate, update, or inspect a traverser’s sack. The sack step is used for these functions: reading the sack value and writing to it. The sack() step with no parameters reads the sack value and places it in the traversal stream.

The sack step also can take an update function which will update the sack value based on two things:

  1. the existing object in the sack, and

  2. the parameters supplied by the by step modulator.

Common update functions that I use, or have seen used, are sum, min, and mean. That will become more clear as we look at examples.

And, before digging into the examples, let’s take a look at the by modulator. We have seen the by modulator in a few different gremlin topics in this series so far. As far as its use within the upcoming examples, we will see that The by step modulator is frequently used to pass additional parameters for computing a new sack value. Since we have seen the by modulator a few times, let’s dig into the examples to get an idea of exactly how to use it with the sack step.

As before, we will be using the KillrVideo graph schema where users know users. Users rate movies and movies belong to genres. We also have various people of the production team which act in, direct, compose, etc the movies in this sample data set.

To begin understanding the sack step, Let’s start with a visual example. You can imagine that this graph represents you, shown there in the middle next to the start arrow, and your social network.  And if you are not 17, maybe you can imagine some other 17 year old in your life that this graph can represent.

What we want to do, is use the sack step to start from the center vertex, and add up all of the ages of your friends, and your friend’s friends.

To do that, we would drop one traverser at the starting vertex with a sack initialized to zero. Then, the sack would grab the age on the current vertex, which is 17, and add it into the sack. Next, this traverser would split into 4 total traversers and move to the adjacent vertices. Now, each sack will have a local copy of the integer, which at this point is still 17.

Let’s follow the traverser in this graph that moved up, or to the vertex directly above the starting point. This traverser arrives at the new vertex, grabs the age of the vertex, and adds it into the sack. At this point, the current value in the sack is the previous value, 17, plus the new value 17, for a total for 34. From here, the traverser splits again and continues to move up. The traverser that followed the branch up and to the left will arrive at a vertex with an age of 16. And as you see right above, the total value in this traverser’s sack will be 17 + 17 + 16 which is 50. This means that the total age of the friends observed on this path is 50.

The other traverser which moved up along the branch to the right will reach a vertex with an age of 17.  And as you see right above the vertex, the total value in this traverser’s sack will be 17 + 17 + 17 which is 51. This means that the total age of the friends observed on this path is 51.

As you look around this graph, you will see that there are a total of 13 different paths and therefore 13 total traversers that would be generated throughout this process.

Given this visual example, let’s take a look at some code.

We have two different ways to write a gremlin traversal that does the very walk that we just stepped through on the last slide.

The first traversal shows you how to do the aggregation as you walk through the graph.  To do this, the first line of the traversal calls g.withSack(0) to create a sack for the traversal and initialize its value to zero. On the second line, we find our favorite user, u1. Starting at this user, we add it’s age into the sack. If you remember the mental walk we did on the previous slide, this is the point at which we added 17 into the sack value. We do this by calling the sack step, passing the sack step the “sum” update function, and then indicated what values to sum with the subsequent by modulator. Here, we say to sum the value age into the sack for the traverser.

We want to do this process for our friends, and the friends of our friends. To do this, we will need the recursive step repeat, and you see at the end of the repeat step, we say to only do the repetition 2 times. Meaning, we are only going to do some work for two neighborhoods out from our starting location. Inside the repeat step, we tell the traverser to walk out the knows edge to the next vertex. Once at the adjacent vertex, we call the sack step, give it the sum update function, and tell the traverser to add in the age value into gremlin’s sack. Once we have done this for our friends, and the friends of our friends, we inspect the value in gremlin’s ack by calling the sack() function, and we see the full list of all 13 values previously seen on the last slide.

In the second example shows us how to do this without using the sack command. However, we will then need to use the path command so that we preserve the full history of every traverser so that we have the right data for the sum.

To do this, we find our favorite user, u1. Next, we use the repeat step to walk out the “knows” edges two times. This is the same as before and traverses to the second neighborhood. Lastly, we call path, and represent all of the objects in the traversal stream by their age, and apply a sum to each traverser, locally. The use of local here makes sure that each of the 13 traversers sum up the ages in their individual path, instead of summing up all of the ages observed across the full graph.

And, as we hoped, we see the same 13 ages as before.

We can also use the value in gremlin’s sack to make a decision within the traversal. For example, maybe we want to keep walking through the graph until we collect a specific amount of data. So, in this example, let’s find a path that starts with our user u1 and eventually adds up to a total of 200 or greater.

To initialize the traversers to have a sack, we start with g.withSack(0). Then, we find user u1 and sum u1’s age into the sack. Next, we see a repeat until pattern. Here, we want to continue to walk out the “knows” edges and summing up the observed age into the sack. We do this until the sack value is greater or equal to 200.

Then, we call the path step on the traversers that pass through the repeat, until filter. We inspect the path objects, which are all users, by their user id. We limit the results to one path, and unfold the path. Here, we see that the collection of u1, u74, u750, u775, u572, and u277 had a total age of 204, which is greater than 200.

Let’s dig into an example that is a little more complicated than keeping track of a number. Let’s walk through the graph and instead, keep track of an object, like a map.

When we start using more complex data structures like maps or lists, it is vital to make sure we define a split operator. We need to define a split operator because the default operator is the identity operator, and that probably will not give you the behavior you expect.

In this example, we want the same behavior we have been using in the previous examples. We want to create a copy of the sack and its value that is to be passed onto the traversers when they split. Then, we want each traverser to have their own instance of the sack object and value so that the traverser can update the object during its walk throughout the graph.

To do this with a complex data structures like a map, we need to define the split operator to be clone, as we see here at the beginning of this example.  Using the clone split operator gives each traverser it’s own copy of the map. Otherwise, Without defining the split operator, it would use the default behavior, which is the identity function. When using a more complex data structure like a map, this would mean that each traverser would be updating and reading from the same global object through the traverser’s local reference to that object.

To initialize the sack function to be a map, we call g.withSack and instanciate the sack to be a map with the map notation seen here. Next, we want to define the split function to be clone. This guarantees that each traverser will receive its own local copy of the object.

We want this object to be a map of users and their ages. To do that, we find our favorite user user 1. On the second line, call the sack function to update the map in the traverser’s sack. What we want to do is to create a key in the map that is u1, and the value of that key is u1’s age. We can do that by creating a custom function. This function is a lambda function that takes two parameters: the map and the user vertex. These parameters are shown to the left of the arrow in the lambda function notation. To the right of the arrow, we define how we want to use those parameters. We want the map to have a key, and we want that key to be the user’s ID. We want the value to be the user’s age. Then, we return the map into the traverser’s sack.

All of that was just for putting in the very first value of the map. At this time, if we stopped the traversal and inspected the map, we would see u1: 17.

Next, we want to repeat the same process as before. We want to traverse to u1’s friends, and the friends of u1’s friends. Along the way, we want each traverser collect the userID, the user’s age, and insert that data in to the map in their sack.

We do this, just as before, with a repeat step with a limit on the number of times: we only want to go 2 times. Then, inside of the repeat step, we walk out the “knows” edges to get to another user vertex. On that user vertex, we want to update the traverser’s sack. We do this update with a lambda function inside the sack call, as we saw before. We have two parameters: the sack’s map and the object the traverser is currently on: the user vertex. We create a key in the map that is the user’s id, and the key’s value is the user’s age. Then, we return the map into the traverser’s sack.

After repeating that process two times, we want to see the object within the sack, so we call the sack() step to read the contents of the traverser’s sack. We limit to only 1 traverser and look at the results.

We see a map with {u1=17, u74=15, u40=12}

While this was a pretty big example, the main things to remember are:

  1. With split, each traverser gets a copy of the starting/initialized map.

  2. The updates to the map are local to the individual traverser’s sack

  3. With complex objects, you probably want to define the split function to be clone, because without split, the traversers would be updating a global object through a local reference.

Now once we have instructed our traversers on how to split, naturally we need to tell them how to merge.

In our last example here, we want to dig into the merge operator and how it enables bulking optimization for better performance.

Before digging into this example, it is important to clarify the difference between bulking and merging. Bulking is the word for combining traversers that reach the same object in a traversal. You can picture this as 10 gremlins arriving to the same vertex, and then with bulking enabled, the 10 gremlins are all combined into one larger gremlin with a bulk value of 10.  Bulking is something you can do with any traversal, not just the sack step.

Conceptually, bulking is very similar to merging. However, the word merging is reserved to an operation with the sack step. When using gremlin sacks, merging defines how the values in the traverser’s sacks are to be combined or merged when two or more traversers are bulking together. This of the same example when 10 gremlins arrived to the same vertex. If those gremlins all have sacks with objects or values, we need to tell them how they are to combine their data into one sack.  That process is called merging. Merging is something specific to the sack operator, which tells the gremlin traversers what to do with the values in the sack when they are being bulked, or combined together.

To illustrate the difference between these concepts, we are going to walk the graph shown on the right part of this slide. We are going to start at U1 at the top. Then, we are going to walk down each path to get to u493.  The first example does not have a merge operator, so we are going to have two traversers with data to inspect. The second example introduces a merge operator, to the two traversers that arrive at u493 are going to bulk together and merge the data in their sacks. The output will only have one traverser.

To really dig into this, let’s walk through the code.

In this first example, we start with g.withSack and initalize the value to zero. Then, we start our traversal at user u1. We add u1’s age into the sack. As we have seen before, we repeat walking out the “knows” edges, and we add up the ages of the vertices into the sack on the way. We traverse 3 knows edges and reach u493 at the bottom of the image. Here, we call the sack function, and see that we have two traversers. One traverser with 65, and teh second traverser with 72. We expected to see two traversers because we did not define a merging operation and so we saw each traverser, even though they ended on the same result vertex.

In the second example, we want to define a merge operator. We start by saying g.withBulk(false).WithSack(0, sum). I am going to cover what is occuring with the initialization of withBulk(false) later. The important part to note here is the definition of the merging operator within the withSack statement. We are indicating that the merging operator here will be sum. That means that when two traversers need to bulk (or combine from two into one), we will merge the sack values by adding them up. Next, we want to start our traversal at user u1 and add u1’s age into the sack. Then, we repeat, 3 times, walking out the “knows” edge, and adding the user’s age into the sack.

Then, once we have reached user u493, we inspect the value of the object in the sack by calling the sack function.

Here, because we had a merging function, we only see one value, which is 116. Now, if you look at the path values from the last example, you will see that 65 + 72  = 137. However, we were using the bulking optimization. Therefore, when the two traversers arrived at u493, they bulked together, added their sack values, and then augmented the sack with the age from u493. This technique keeps you from adding u493’s age into the sack twice.

Now, back to the withBulk(false) initialization.  We used withBulk(false) so that we did not output the same result multiple times for the same traverser.  this is the default behavior. In other words, withBulk only affects output formatting and not actual execution. If we had withBulk(true), then when we were processing the traversal stream at the end of this query, we have one traverser with the sack of 116 and bulk of 2. withBulk(true) would result in outputting 116 two times because there were two traversers that combined into the one traverser at the end. The “bulk” value keeps track of the total number of traversers that are combined together into the single traverser so that we know how many had been removed through the bulking optimization.

That is it for the sack traversal, bulking and merging. I know that was a significant amount of information, so give the exercise a try and we will be here ready to dig into the next topic.

No write up.
No Exercises.
No FAQs.
No resources.
Comments are closed.