Gremlin OLAP Traversals

DSE Version: 6.0

Video

A Gremlin traversal can be executed using either the real-time (OLTP) engine or the analytic (OLAP) engine. The latter execution results in a Gremlin OLAP traversal. In this unit, you will learn about Gremlin OLAP traversals.

Transcript: 

 

Hi, I am Artem Chebotko and this is Gremlin OLAP Traversals.

A Gremlin traversal can be executed using either the real-time (OLTP) engine or the analytic (OLAP) engine. The latter execution results in a Gremlin OLAP traversal.

 

Gremlin OLAP traversals are still written using the same standard Gremlin graph traversal language, but they do start with an OLAP traversal source and execute as Spark jobs on DSE Analytics nodes.

 

Currently, the OLAP engine may not support all Gremlin traversal steps, but additional steps are constantly being added. Please refer to the documentation for the most current list of supported traversal steps.

Here are some of the characteristics that may help you to recognize an OLAP traversal and choose an appropriate execution engine.

 

Any traversal that takes longer to execute, involves broader-scope data analysis, requires expensive graph scans, deals with large subgraphs, or traverses longer paths with many branches should be strongly considered for execution by the OLAP engine instead of the OLTP engine.

It is very straightforward to switch between the OLTP and OLAP engines: simply switch between OLTP and OLAP traversal sources to start your traversal.

 

In DataStax Studio, you can use the execution drop-down menu available for each Gremlin cell. You have two options: “Execute using real-time engine” and “Execute using analytic engine”.

 

In Gremlin Console, you can use the :remote command to configure an alias g that refers to either KillrVideo.g or KillrVideo.a, which are the OLTP and OLAP traversal sources, respectively, for the KillrVideo graph.
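For reference, the commands might look like this (a sketch based on the standard Gremlin Console :remote syntax; check the DSE documentation for your version):

To use the analytic (OLAP) engine:
gremlin> :remote config alias g KillrVideo.a

To switch back to the real-time (OLTP) engine:
gremlin> :remote config alias g KillrVideo.g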

 

The Gremlin traversal itself does not have to change, which is great news!

Gremlin OLAP traversals are translated into Spark jobs that are executed on DSE Analytics nodes. Therefore, you can monitor their execution using DSE Analytics tools, such as Spark Web UI. You simply need to point your Web browser to a node in your cluster by supplying its public IP address and port 7080.  
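For example, if one of your DSE Analytics nodes had the (hypothetical) public IP address 203.0.113.10, you would open http://203.0.113.10:7080 in your browser.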

 

Among running applications, you should be able to see “Apache TinkerPop’s Spark-Gremlin” that executes Gremlin OLAP traversals. You can always click on the application ID to drill down to specific Spark jobs and execution stages to see more details.

Let’s look at a couple of examples.

 

Since you already know Gremlin, they should look familiar. The only difference is that we use an OLAP traversal source to execute them.

 

Here we find the vertex distribution by label: we have 920 movies, 8759 persons, 18 genres, and 1100 users in our KillrVideo graph.
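Such a traversal might be written as follows (an illustrative sketch, assuming g is aliased to the OLAP traversal source as described above; the counts cited apply only to this KillrVideo dataset):

g.V().groupCount().by(label)    // group and count all vertices by their label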

 

The screenshot of Spark Application Web UI shows the completed Spark job that corresponds to this Gremlin traversal.

 

It may not be obvious to you, but this particular traversal was routed to the DSE GraphFrames engine. I can tell based on the DseGraphTraversal.scala file in the job description. Indeed, this traversal is a simple scan query and meets all the requirements for the OLAP query routing optimization we discussed earlier.

This traversal starts with a random user, traverses knows edges in any direction six times, and counts the users that we are able to reach. We get 1077 users out of 1100 total users in the graph, which suggests that almost any two users are connected by a path with at most six edges. Our user social network has six degrees of separation.
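One way to sketch such a traversal (illustrative only; the label names come from the KillrVideo example, and the exact query on the slide may differ):

g.V().hasLabel('user').sample(1).        // start from one random user
  repeat(both('knows')).emit().times(6). // follow knows edges in either direction, up to six hops
  dedup().count()                        // count the distinct users reached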

 

The screenshot (part of Spark Application Web UI) shows some of the completed Spark jobs that correspond to this single Gremlin traversal. This time, the SparkGraphComputer engine is used for execution.

Besides the Intelligent Analytics Query Routing optimization, which is applied automatically, snapshots are another optimization that you have to introduce yourself.

 

Snapshots can be used to optimize performance. If you need to execute many OLAP traversals over the same subgraph, you can create a snapshot, which is essentially a copy of that subgraph, in Spark to query it faster. A snapshot is a persisted dataset in Spark.

Then, you can create an OLAP traversal source to query the snapshot and execute multiple OLAP traversals using that traversal source.

--

The Snapshot API includes the following methods.

 

The snapshot() method is used to start a snapshot definition.

 

The vertices() and edges() methods are used to specify which vertices and edges to add to the snapshot based on their labels.

 

The conf() method is used to specify configuration properties for the snapshot. The configuration properties are defined in the Apache TinkerPop™ documentation for SparkGraphComputer.

And the create() method is used to create the snapshot and return its corresponding OLAP traversal source.

 

It is actually quite straightforward. Let’s look at an example.

We extract a social subgraph from the KillrVideo graph as a snapshot containing all user vertices and all knows edges, and we cache this snapshot in the memory of the Spark executors.
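Sketched with the Snapshot API described above (assuming g is aliased to the OLAP traversal source; the configuration key gremlin.spark.persistStorageLevel is a standard SparkGraphComputer property, but treat the exact call as an assumption and verify it against the DSE documentation):

social = g.snapshot().
           vertices('user').                                         // include all user vertices
           edges('knows').                                           // include all knows edges
           conf('gremlin.spark.persistStorageLevel', 'MEMORY_ONLY'). // cache the snapshot in executor memory
           create()                                                  // returns an OLAP traversal source for the snapshot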

 

We then execute four traversals over the snapshot using the respective OLAP traversal source called “social” in this example.

 

All these OLAP traversals analyze the same subgraph, counting the number of vertices and their distribution by age, gender, and degree. The first traversal will take longer to execute because that is when the snapshot is actually materialized in memory; all subsequent traversals will be very fast.
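The four traversals could be sketched as follows (the property keys age and gender and the knows edge label are assumptions based on the KillrVideo schema):

social.V().count()                                  // total number of users in the snapshot
social.V().groupCount().by('age')                   // distribution by age
social.V().groupCount().by('gender')                // distribution by gender
social.V().groupCount().by(bothE('knows').count())  // distribution by degree (number of knows edges)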

 

It is that simple!

Now that you know what Gremlin OLAP traversals are and how to execute them with Spark, let’s talk a bit more about the Spark running environment itself.

 

Since OLAP traversals create many intermediate objects during execution, and those objects are garbage-collected by the JVM, it is better to have a larger pool of executors, each with smaller memory and CPU resources.

 

Note that this is quite different from non-graph Spark jobs, which typically perform better with fewer executors that have higher memory and CPU resources.

 

Also, to reduce garbage collection pauses and improve OLAP traversal performance, we recommend allocating executors with no more than 8 cores; in most cases, just 1 core per executor is a good choice. The memory available to Spark should be spread equally among the cores.

There is a convenient Gremlin-Spark configuration API to control Spark settings for an OLAP traversal source. g.graph.configuration.setProperty() allows you to change both Spark and Spark-Cassandra Connector properties for your particular use case. There are many dozens of these configuration properties; we will see a few in our examples, but please refer to the documentation to learn about many more.

 

One thing to keep in mind is that once you change a configuration property, you may need to kill and restart the Apache TinkerPop’s Spark-Gremlin application via the Spark Web UI for the change to take effect.

Here is an example of how to change three Spark properties before executing your statistical Gremlin OLAP traversal.

 

The spark.cores.max property sets the maximum number of cores used by the Apache TinkerPop’s Spark-Gremlin application. Setting this property lower than the total number of cores in your cluster limits the number of nodes on which the traversals will be run.

 

The spark.executor.memory property sets the amount of memory used for each executor.

 

The spark.executor.cores property sets the number of cores used for each executor.
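Put together, the three changes might look like this (the values 16, 4g, and 1 are placeholders; pick values that fit your cluster):

g.graph.configuration.setProperty('spark.cores.max', 16)         // cap the total number of cores used
g.graph.configuration.setProperty('spark.executor.memory', '4g') // memory per executor
g.graph.configuration.setProperty('spark.executor.cores', 1)     // one core per executor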

 

Again, if the Apache TinkerPop’s Spark-Gremlin application was already running before you changed these settings, you should restart it.

And this is an example of changing the Spark-Cassandra Connector setting called spark.cassandra.input.split.size_in_mb.

 

This property sets the approximate size of data the Spark-Cassandra Connector will request with each individual CQL query.

 

In this example, we change this setting because of tombstones: when deleting many edges or vertices from a graph, we may end up with many tombstones, and as a result, we may get errors in subsequent queries due to the large number of tombstones left in the database.

 

To avoid these errors, we reduce the number of tombstones per request by setting the spark.cassandra.input.split.size_in_mb property to a smaller size than the default of 64 MB.

In particular, we set the property to 1 MB before dropping all our users.
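As a sketch (assuming g is the OLAP traversal source and user vertices carry the label user):

g.graph.configuration.setProperty('spark.cassandra.input.split.size_in_mb', 1) // request smaller chunks per CQL query
g.V().hasLabel('user').drop().iterate()                                        // drop all user vertices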

 

Of course, changing this property also has other important implications for how data is read from Cassandra and into how many Spark partitions it is loaded.

Finally, it is time for some hands-on practice.
