Traversing Graph Data Preliminaries

DSE Version: 6.0

Video

We are going to explore how we can traverse graph data in DataStax Enterprise Graph. We will begin with some important preliminaries.

Transcript: 

Hi, I am Artem Chebotko. Let’s talk about how we can traverse graph data in DataStax Enterprise Graph.

Traversing Graph Data Preliminaries

We begin with some important preliminaries.

In DataStax Enterprise Graph, graph traversals are expressed in Gremlin, which is a standard graph traversal language defined in Apache TinkerPop. We cover Gremlin extensively in this course.

Traversals happen in one of two schema modes: Production and Development. These modes can have serious implications for traversal execution.

We do need to talk about default read and write consistency levels, as well as how to change them.

Finally, we want to understand the difference between OLTP (or transactional) traversals and OLAP (or analytical) traversals, so that we can choose an appropriate traversal engine to execute our traversal. Sometimes it is easy to classify a traversal as transactional or analytical and in other cases the boundary can be somewhat fuzzy. 

Gremlin is our graph traversal language.

Again, it is part of the open-source Apache TinkerPop project.

It is a very expressive (Turing-complete) and easy-to-use functional language with bindings in many programming languages, including Groovy, Java, Python, Scala, and many others. Groovy is kind of a native binding for Gremlin and that is what we will be using in our examples. The differences between the Groovy and Java versions are quite minimal, so almost all of our examples are going to look the same when expressed in Gremlin-Groovy or Gremlin-Java.

Here is an example of a Gremlin traversal to look up a movie and get its title and release year properties back. As you can see, in our test data, this happens to be the movie Gremlins from 1984.
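
That traversal might look roughly like this (a sketch in Gremlin-Groovy, assuming a movie vertex label with title and year properties; the property names in your schema may differ):

// Look up a movie by title and return its title and release year.
g.V().has('movie', 'title', 'Gremlins').valueMap('title', 'year')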

You might have learned about the Production and Development schema modes when we discussed graph schemas. It turns out that these modes affect not only schemas but also traversals.

Production is the default mode for all graphs, as specified in dse.yaml. It is the recommended mode and ensures that you will not be able to execute potentially expensive traversals. Production mode requires an explicit graph schema and proper graph indexes for each traversal you want to execute. This enables important performance optimizations and eliminates expensive graph scans.
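
For example, a materialized-view index supporting lookups of movies by title might be declared like this (a sketch using the DSE Graph schema API; the label, property, and index names here are illustrative):

// Materialized-view index so that equality lookups on movie titles
// can run in Production mode without a graph scan.
schema.vertexLabel('movie').index('moviesByTitle').materialized().by('title').add()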

Now, for development or education purposes, it may be OK to use the Development schema mode. This mode is good for playing around … when you are trying to get to know the product, when you are first sketching out what your graph may look like, or when your schema is in a period of very rapid evolution as your understanding of the data improves. This is where you want to be agile and be able to try some traversals without worrying about indexes or even an explicit schema.

The Development mode allows graph scans and can even generate a graph schema for you on the fly as data is being inserted. Just do not expect that things will be super efficient in such scenarios.

This example shows how to change the schema mode from Production to Development for an individual graph. It is pretty easy to do … just do not forget to switch back when you go to production.
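
In Gremlin Console, the change might look like this (a sketch, assuming the graph.schema_mode configuration option):

// Switch the current graph to Development mode ...
schema.config().option('graph.schema_mode').set('Development')
// ... and back to Production once your schema and indexes are in place.
schema.config().option('graph.schema_mode').set('Production')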

Graph scans are disabled by default in Production mode, but there might be cases when you want them. You may intentionally want to scan a small subset of data, or you may occasionally need to compute some aggregates.

So, in this example, we are retrieving the names of all genres, which is equivalent to scanning all genre-vertices and getting their name-properties. As you can see, this results in an error in Production mode: it tells us that we need to use an index. But we really do want to scan, and no index will help here. We know that, in our graph, there are only 18 genres, and scanning them is not a big deal. By setting the graph.allow_scan option, we can execute our traversal and list all 18 genres.
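
Put together, the interaction could look like this sketch (assuming genre vertices with a name property):

// Rejected in Production mode because it requires a full graph scan:
g.V().hasLabel('genre').values('name')

// Allow scans for this graph (use with caution) and retry:
schema.config().option('graph.allow_scan').set('true')
g.V().hasLabel('genre').values('name')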

Of course, caution is advised here. Enabling scans also enables you to write traversals that may perform poorly.

You may naturally think that traversals are like database queries: they find and retrieve data from a graph. However, there are also traversals that find and mutate data in a graph. Since all graph data is internally stored in and retrieved from Cassandra, we need to be aware of the read and write consistency levels we use for our traversals.

Supported consistency levels are the same as in Cassandra: ONE, LOCAL_ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL, and a few others.

The default read and write consistency levels are ONE and QUORUM, respectively. The code shown here changes both of them to LOCAL_ONE, which will result in higher availability but also a higher chance of retrieving stale data. Of course, consistency levels should meet the needs of your particular application.
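
A sketch of that change, assuming DSE Graph's tx_groups configuration options:

// Lower both consistency levels to LOCAL_ONE for this graph
// (option names assume the default transaction group).
schema.config().option('graph.tx_groups.default.read_consistency').set('LOCAL_ONE')
schema.config().option('graph.tx_groups.default.write_consistency').set('LOCAL_ONE')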

Since Gremlin is a very expressive language, traversals can be used for both online transaction processing and analytical processing. It is useful to distinguish OLTP and OLAP traversals because DataStax Enterprise Graph has different engines that are specifically designed to efficiently execute those types of traversals.

In many cases, it is very straightforward to see whether a traversal is transactional or analytical, but not always. Traversal performance depends on the complexity of both the traversal and the data, and drawing a clear line is sometimes hard and requires experimentation.

To start with, I do want to give you some guidelines to help you decide whether a traversal is OLTP or OLAP.

OLTP traversals return results quickly because they target one or a few things in a large graph. They use vertex IDs or indexes to access those few things and may further explore a small subgraph by traversing relatively short paths.

In contrast, OLAP traversals may take time to execute and usually touch many things in a graph, traverse larger subgraphs and longer paths, as well as rely on scan access patterns.

When not sure, you can always profile a traversal to see what it takes to execute it.

In this OLTP traversal example, we are profiling a traversal that retrieves all movie-vertices with the title “Alice in Wonderland”. From the profile, we can see that there happen to be two movies with the given title. They are retrieved using a materialized-view-based vertex index, and the whole traversal executes in under 2 ms. You may even be interested to take a closer look at the CQL query used to retrieve data from Cassandra, as well as the applied execution options, such as the consistency level. This is clearly an OLTP traversal.
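
The profiled traversal itself might be as simple as this sketch:

// Indexed lookup of movie vertices by title, with profiling enabled.
g.V().has('movie', 'title', 'Alice in Wonderland').profile()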

This traversal finds all users in the graph, groups them by age and counts users in each group.
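
A sketch of such a traversal (assuming user vertices with an age property):

// Group all user vertices by age and count the users in each group.
g.V().hasLabel('user').groupCount().by('age').profile()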

From the profile, we can see that the traversal touches 1100 elements (user-vertices in this case) and then reduces the result to a single map whose keys and values are user ages and counts. This traversal scans a substantial subset of the vertices in our graph and executes in roughly 150 ms. While that does not seem too slow, we are playing with a small graph here. Because of the scan, this traversal's response time will not scale well with larger graphs. Clearly, this is an example of an OLAP traversal that may analyze a lot of data.

The previous two examples used Gremlin Console to display traversal profiles. You can also visualize traversal profiles in DataStax Studio. You get a nice view with the time for each stage displayed as a bar, and you can click on a bar or stage name to display other relevant information.

It is a really good idea to execute OLTP traversals using the OLTP engine and OLAP traversals using the OLAP engine. The OLTP engine is very efficient for transactional processing, and the OLAP engine can handle computationally and memory-intensive traversals because it uses Spark under the covers for analytical processing. Of course, to use the OLAP engine, you have to enable Spark on the Graph nodes in your cluster.

The examples here show how to use an OLTP traversal source g and an OLAP traversal source a for the graph KillrVideo. In both cases, the traversal source is aliased to the same name “g”, so the traversal syntax does not have to change when you switch between the engines. This works in Gremlin Console.
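
In Gremlin Console, the aliasing could look like this sketch (the graph name is assumed):

// Point the alias g at the OLTP traversal source of KillrVideo ...
:remote config alias g KillrVideo.g
// ... or at its OLAP (Spark-backed) traversal source.
:remote config alias g KillrVideo.a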

In DataStax Studio, things are even easier: each Gremlin cell has a drop-down execution menu to choose one engine or the other.

We are done with the preliminaries of traversing graph data, so let the fun begin! In this exercise, you will profile some traversals that return identical results but whose response times are drastically different. We need to learn how to write efficient traversals!
