Using Gremlin with DSE GraphFrames

DSE Version: 6.0

Video

Gremlin is the preferred language when it comes to defining traversals in DataStax Enterprise Graph. This is also true for DSE GraphFrames. In this unit, we will describe a general strategy for using the Gremlin API with DSE GraphFrames.

Transcript: 

 

Hi, I am Artem Chebotko and this is Using Gremlin with DSE GraphFrames.

Gremlin is the preferred language when it comes to defining traversals in DataStax Enterprise Graph. This is also true for DSE GraphFrames.

Here we describe a general strategy for using the Gremlin API with DSE GraphFrames. The following code is executable, for example, in the Spark shell.

First, we create a DSE GraphFrame for a particular graph that exists in DSE Graph. We call the “dseGraph” method on the SparkSession object and pass it the name of the KillrVideo graph.

Second, we begin our traversal with methods V() or E() just like in a regular Gremlin traversal.

Next, we use Gremlin steps, such as in(), out(), both(), has(), and so forth.

Finally, we use a method that triggers computation in Spark, such as DataFrame’s show() in this example.
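The four steps above can be sketched as follows for the Spark shell. This is a minimal sketch rather than the code shown in the video: the graph name "killrvideo" and the filter on the "person" label are assumptions for illustration, and it presumes a DSE Analytics node where the DSE GraphFrames package is on the classpath.

```scala
// Assumes a DSE Spark shell, where `spark` (the SparkSession) is predefined.
import com.datastax.bdp.graph.spark.graphframe._

// 1. Create a DSE GraphFrame for an existing graph.
//    The graph name "killrvideo" is an assumption; use your own graph name.
val g = spark.dseGraph("killrvideo")

// 2. Begin the traversal with V() or E().
// 3. Add Gremlin steps such as hasLabel(), has(), in(), out().
val people = g.V().hasLabel("person")

// 4. Trigger computation in Spark with a DataFrame method such as show().
//    df() makes the conversion to a DataFrame explicit.
people.df().show()
```

Nothing is computed until the last line runs; the earlier lines only describe the traversal.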

This is the list of Gremlin steps supported by DSE GraphFrames. I assume that you have the required prerequisites and are already familiar with Gremlin, so I will not go through individual steps.

As you may recognize, the types of traversals that we can define with these steps include simple, projecting, statistical, and ordering traversals. Of course, there are a number of other Gremlin steps that may not be supported yet. You should check the documentation for the most current list.

If a certain Gremlin step is required for your traversal and is not currently supported by DSE GraphFrames, there are always other Spark APIs, such as Spark GraphFrames and DataFrames, that can complement your traversal.

All Gremlin predicates are supported by DSE GraphFrames. To use a predicate, you have to qualify it with P.

There are also four tokens that can be used to refer to a label, id, map key or map value. For example, if you need to groupCount by label and order by key, you will need to use those. They are qualified with T.

Let’s look at some examples!

These two traversals compute simple statistical information about our KillrVideo graph. We count how many vertices we have in the graph and find the vertex distribution by vertex label.

Because we use the DSE GraphFrame API, all computation is done by Spark.
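The two statistical traversals described above might look like this in the Spark shell. Treat it as a hedged sketch: the graph name "killrvideo" is an assumption, and the variable names are hypothetical. Note how the label token is qualified with T.

```scala
// Assumes a DSE Spark shell, where `spark` (the SparkSession) is predefined.
import com.datastax.bdp.graph.spark.graphframe._
import org.apache.tinkerpop.gremlin.structure.T

// The graph name is an assumption for illustration.
val g = spark.dseGraph("killrvideo")

// Count all vertices in the graph.
val vertexCount = g.V().count().df()

// Vertex distribution by vertex label; T.label is the label token.
val verticesByLabel = g.V().groupCount().by(T.label).df()

// show() triggers the Spark jobs and prints each result as a table.
vertexCount.show()
verticesByLabel.show()
```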

In this example, we drop all knows-edges in the graph and output the edge distribution by label. As you can see, there are composer, director, belongsTo, actor, cinematographer, rated, and screenwriter edges. There are no longer any knows-edges.

drop() can also be used to drop vertices and properties. This is one way to do bulk dropping with the Gremlin API of DSE GraphFrames. There is another way that uses the DSE GraphFrame API directly, and it also supports bulk insertions and updates. We will talk about bulk operations in more detail in a separate presentation.
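A hedged sketch of the drop example follows. The graph name "killrvideo" is an assumption, and ending the drop traversal with iterate() is the usual TinkerPop way to execute a traversal that returns no results; check the DSE documentation for the exact form your version expects. This permanently deletes data, so try it only on a graph you can afford to change.

```scala
// Assumes a DSE Spark shell, where `spark` (the SparkSession) is predefined.
import com.datastax.bdp.graph.spark.graphframe._
import org.apache.tinkerpop.gremlin.structure.T

// The graph name is an assumption for illustration.
val g = spark.dseGraph("killrvideo")

// Destructive: drop every edge labeled "knows".
// iterate() is assumed here as the call that executes the drop.
g.E().hasLabel("knows").drop().iterate()

// Edge distribution by label; "knows" should no longer appear.
val edgesByLabel = g.E().groupCount().by(T.label).df()
edgesByLabel.show()
```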

This is a simple traversal that finds Robert De Niro’s movies from the 1970s.

We start with the vertex with label “person” and name Robert De Niro, traverse all incoming actor-edges and get to the movie-vertices. We then verify that the year-property is greater than or equal to 1970 and less than 1980. We get to the title and year values of such movies and output them using DataFrame’s method show(). The result contains three fine movies.
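The traversal just described could be written roughly as below. It is a sketch under stated assumptions: the graph name "killrvideo" is assumed, and the two has() filters on year are one way to express "greater than or equal to 1970 and less than 1980"; the video may use a different but equivalent form. The predicates are qualified with P.

```scala
// Assumes a DSE Spark shell, where `spark` (the SparkSession) is predefined.
import com.datastax.bdp.graph.spark.graphframe._
import org.apache.tinkerpop.gremlin.process.traversal.P

// The graph name is an assumption for illustration.
val g = spark.dseGraph("killrvideo")

// Person vertex -> incoming actor-edges -> movie vertices from the 1970s.
val movies = g.V().hasLabel("person").has("name", "Robert De Niro").
  in("actor").
  has("year", P.gte(1970)).
  has("year", P.lt(1980)).
  values("title", "year").
  df()

// Trigger execution and print the title and year of each matching movie.
movies.show()
```

The trailing dots let the multi-line expression be typed line by line in the shell without paste mode.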

Our next example of Gremlin traversals combines steps of both simple traversals and statistical traversals.

We find the person-vertex for Robert De Niro, hop to his movies and get to rated-edges so that we can retrieve movie ratings. We then compute the rating distribution for all of Robert De Niro’s movies.

Notice that the results are not ordered. It would be nice to improve this traversal to output the result in ascending order of rating. Even though the Gremlin order() step is supported by DSE GraphFrames, the local scope needed to order a map is not. For now, we will have to use another API to do the ordering.

In many cases, the API that readily complements Gremlin when using DSE GraphFrames is the DataFrame API.

We have already seen DataFrame’s show() method, which triggers execution and displays the result as a table. Other methods can be used to implement operations that are not available in the Gremlin API and even to combine graph and non-graph data in a single query.

As you will experience soon, Gremlin + DataFrames is a very powerful union!

So here we find the rating distribution for Robert De Niro’s movies and order the results by rating.

As before, we find the person-vertex for Robert De Niro, traverse incoming actor-edges to his movies and get to rated-edges so that we can retrieve movie ratings. We then compute the rating distribution for all of Robert De Niro’s movies. Finally, we order the resulting DataFrame by its rating-column and output the result.

Notice the df() method that is used to convert a DseGraphTraversal object to a DataFrame object. We can omit it in Scala and the Spark shell because the conversion happens implicitly. However, it can be convenient to keep it for better readability and an easier distinction between the Gremlin and DataFrame APIs.
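As a hedged sketch, the ordered rating distribution could be expressed like this, with the Gremlin part ending at df() and the DataFrame part doing the ordering. The graph name "killrvideo" is an assumption; the "rating" column name follows the description above.

```scala
// Assumes a DSE Spark shell, where `spark` (the SparkSession) is predefined.
import com.datastax.bdp.graph.spark.graphframe._

// The graph name is an assumption for illustration.
val g = spark.dseGraph("killrvideo")

// Gremlin part: person -> movies (incoming actor-edges) -> incoming
// rated-edges -> rating values, grouped and counted.
val ratings = g.V().hasLabel("person").has("name", "Robert De Niro").
  in("actor").
  inE("rated").
  values("rating").
  groupCount().
  df()

// DataFrame part: order by the rating column and print the result.
ratings.orderBy("rating").show()
```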

Now let’s go a bit further. Let’s save the result of our analysis of Robert De Niro’s movie ratings into a Cassandra table.

We assume that the Cassandra keyspace “killr_video” already exists and contains an empty table “de_niro_ratings” with columns “rating” and “count”.

The traversal itself should look familiar. There is the Gremlin part, where we compute the rating distribution, and the DataFrame part, which uses the Spark Cassandra Connector to save the results to Cassandra. If the saving-to-Cassandra code looks new to you, consider taking our course on DataStax Enterprise Analytics with Spark on DataStax Academy to learn more.

Lastly, we use Spark SQL to retrieve data back from the Cassandra table to see that we indeed achieved our goal. And it worked!
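The save-and-read-back flow might look like the sketch below. It assumes the keyspace "killr_video" and the table "de_niro_ratings" already exist as described, that the DataFrame columns are named "rating" and "count" to match the table, and that the graph is named "killrvideo". The append mode is a choice made here for illustration; the video may use a different save mode.

```scala
// Assumes a DSE Spark shell, where `spark` (the SparkSession) is predefined.
import com.datastax.bdp.graph.spark.graphframe._
import org.apache.spark.sql.cassandra._ // adds cassandraFormat to the writer

// The graph name is an assumption for illustration.
val g = spark.dseGraph("killrvideo")

// Gremlin part: rating distribution for Robert De Niro's movies.
val ratings = g.V().hasLabel("person").has("name", "Robert De Niro").
  in("actor").
  inE("rated").
  values("rating").
  groupCount().
  df()

// DataFrame part: save to the existing Cassandra table
// via the Spark Cassandra Connector.
ratings.write.
  cassandraFormat("de_niro_ratings", "killr_video").
  mode("append").
  save()

// Read the rows back with Spark SQL to confirm the write.
val saved = spark.sql("SELECT * FROM killr_video.de_niro_ratings")
saved.show()
```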

And finally, our most complex example: we find and compare the rating distributions for Al Pacino and Robert De Niro by retrieving and combining data from Graph and Cassandra. That should be interesting!

We get Al Pacino’s rating distribution from the graph and Robert De Niro’s rating distribution from the table in Cassandra we populated in the previous example. We then take the two resulting DataFrames or tables and perform a full outer join on the rating column. The full outer join is required to preserve any rows from both tables that cannot satisfy the join condition. We order the resulting table by rating and show the result.
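One possible shape for this comparison is sketched below. The graph name "killrvideo", the renamed count columns "al_pacino" and "robert_de_niro", and the variable names are all assumptions introduced here so the two count columns do not collide after the join; the video may structure the query differently.

```scala
// Assumes a DSE Spark shell, where `spark` (the SparkSession) is predefined,
// and that killr_video.de_niro_ratings was populated in the previous example.
import com.datastax.bdp.graph.spark.graphframe._

// The graph name is an assumption for illustration.
val g = spark.dseGraph("killrvideo")

// Al Pacino's rating distribution, computed from the graph.
val pacino = g.V().hasLabel("person").has("name", "Al Pacino").
  in("actor").
  inE("rated").
  values("rating").
  groupCount().
  df().
  withColumnRenamed("count", "al_pacino")

// Robert De Niro's rating distribution, read from Cassandra with Spark SQL.
val deNiro = spark.sql("SELECT * FROM killr_video.de_niro_ratings").
  withColumnRenamed("count", "robert_de_niro")

// Full outer join on rating keeps ratings that appear on only one side.
val comparison = pacino.
  join(deNiro, Seq("rating"), "full_outer").
  orderBy("rating")

comparison.show()
```

A rating that occurs for only one actor shows null in the other actor's column, which is why an inner join would not be enough here.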

You should be able to recognize that this traversal uses Gremlin, DSE GraphFrames, DataFrames, the Spark Cassandra Connector and Spark SQL. Now that's what I call powerful!

And it is time for you to embrace the power of the DSE GraphFrames Gremlin side.
