Using Spark GraphFrame API with DSE GraphFrames

DSE Version: 6.0


In this unit, we will introduce some methods of the original GraphFrame API from Spark. We will work on the same examples that we also used in our presentation called Using Gremlin with DSE GraphFrames so that you can easily compare Gremlin and GraphFrame APIs.



Hi, I am Artem Chebotko and this is Using Spark GraphFrame API with DSE GraphFrames.

As you may have already noticed, there are DSE GraphFrames and there are Spark GraphFrames. Are they related and how?

DseGraphFrame is an extension of GraphFrame. It uses a GraphFrame-compatible data format and can be converted to a GraphFrame either implicitly or by calling the gf() method.
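As a rough sketch of the two conversions, the following can be pasted into the DSE Spark shell. It assumes that a graph named killr_video exists in DSE Graph; the graph name and the variable names are illustrative assumptions, not taken from the slides.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame
import com.datastax.bdp.graph.spark.graphframe._

// Reuses the session that the DSE Spark shell already provides.
val spark = SparkSession.builder().getOrCreate()

// DseGraphFrame for a graph that is assumed to be named "killr_video".
val g = spark.dseGraph("killr_video")

// Explicit conversion with the gf method.
val explicitGf: GraphFrame = g.gf

// Implicit conversion: the expected type GraphFrame triggers it.
val implicitGf: GraphFrame = g
```

Either value can then be used with the regular GraphFrame methods.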

In this presentation, we will introduce some methods of the original GraphFrame API from Spark. We will work on the same examples that we also used in our presentation called Using Gremlin with DSE GraphFrames so that you can easily compare Gremlin and GraphFrame APIs. I personally prefer using Gremlin but it is always good to have options.

These are the main categories of methods available in the GraphFrame API: motif finding, graph topology, graph structure, graph algorithms, and data persistence. It would take a lot of time to learn about all of these methods, so we will focus only on the one that is most interesting for us, the method called “find”. You can always find more information about the other methods in the Spark documentation.

Motif finding is an interesting way to express graph queries using graph pattern matching. This is similar to declarative Gremlin traversals, but the syntax is, of course, quite different, and it also relies heavily on the DataFrame API.

So the general strategy is to first create a DSE GraphFrame for a particular graph that exists in DSE Graph. We call method “dseGraph” for KillrVideo on the SparkSession object.

Second, we call method “find” on the DSE GraphFrame and pass it a graph pattern that needs to be matched against our KillrVideo graph.

Next, we augment motif finding with DataFrame API methods like “filter” and “select”.

Finally, we use a method that can trigger computation in Spark, such as DataFrame’s show() in this example.
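The four steps can be sketched roughly as follows. This is a minimal outline rather than the code from the slides: the graph name killr_video, the label column `~label`, and the property names title and name are assumptions about the KillrVideo schema.

```scala
import org.apache.spark.sql.SparkSession
import com.datastax.bdp.graph.spark.graphframe._

val spark = SparkSession.builder().getOrCreate()

// Step 1: create a DSE GraphFrame for an existing graph (name assumed).
val g = spark.dseGraph("killr_video")

// Step 2: match a graph pattern with find; the result is a DataFrame.
val matches = g.find("(a)-[e]->(b)")

// Step 3: narrow and shape the result with DataFrame methods.
// The "~label" column and the title/name properties are assumed names.
val actors = matches
  .filter("e.`~label` = 'actor'")
  .select("a.title", "b.name")

// Step 4: trigger the computation in Spark.
actors.show(5)
```

Nothing is computed until step 4, because find, filter, and select only build up the query.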

Here are some rules about defining graph patterns.

A unit pattern is simply a directed edge with two endpoints. All three components can be given arbitrary names, such as “a”, “e”, and “b” in this example, denoting that the pattern must match an edge e from vertex a to vertex b.

Multiple unit patterns can be combined into a larger pattern using semicolons, which serve as conjunction operators. In this example, we have two edges from a to b to c.

If different pattern components share the same name, they must match the same graph element. Here, “b” is used as an endpoint of one edge and also as an endpoint of the other edge, so “b” must match a vertex that has both an incoming and an outgoing edge according to the pattern.

Vertex and edge names become column names in the resulting DataFrame.
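A small sketch of these rules, assuming a graph named killr_video (an assumption) and using arbitrary element names:

```scala
import org.apache.spark.sql.SparkSession
import com.datastax.bdp.graph.spark.graphframe._

val spark = SparkSession.builder().getOrCreate()
val g = spark.dseGraph("killr_video") // graph name is an assumption

// Unit pattern: an edge e from vertex a to vertex b.
val edges = g.find("(a)-[e]->(b)")

// Two unit patterns joined with a semicolon; the shared name b forces
// both edges to meet at the same vertex.
val paths = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")

// The element names used in each pattern become the DataFrame columns.
println(edges.columns.mkString(", "))
println(paths.columns.mkString(", "))
```

Each column is a struct that holds the fields of the matched vertex or edge, so nested fields can be addressed with dot notation, for example a.id.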

Also, vertex and edge names can be omitted if they are not needed in the resulting DataFrame.

Finally, an edge can be negated to indicate that the edge should NOT be present in the graph. This is done using the exclamation mark in front of a unit pattern. In this example, we find edges from a to b, such that there exists no edge from b to a.
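Omitted names and negation can be sketched in the same way. The graph name is again an assumption. Note that in GraphFrames the edge inside a negated term has to stay unnamed.

```scala
import org.apache.spark.sql.SparkSession
import com.datastax.bdp.graph.spark.graphframe._

val spark = SparkSession.builder().getOrCreate()
val g = spark.dseGraph("killr_video") // graph name is an assumption

// Anonymous edge: only the vertex columns a and b are returned.
val connected = g.find("(a)-[]->(b)")

// Anonymous destination vertex: only the columns a and e are returned.
val outgoing = g.find("(a)-[e]->()")

// Negation: edges from a to b such that no edge goes back from b to a.
val oneWay = g.find("(a)-[e]->(b); !(b)-[]->(a)")

oneWay.show(5)
```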

This simple motif finding example uses a unit pattern to match all edges in our graph and returns a DataFrame with three columns. We show the first five found edges.
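A query along these lines might look as follows; the graph name killr_video is an assumption.

```scala
import org.apache.spark.sql.SparkSession
import com.datastax.bdp.graph.spark.graphframe._

val spark = SparkSession.builder().getOrCreate()
val g = spark.dseGraph("killr_video") // graph name is an assumption

// Match every edge in the graph and print the first five matches.
// The resulting DataFrame has one column per named pattern element.
g.find("(a)-[e]->(b)").show(5)
```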

This is likely not the most useful query you can think of. To make things more interesting, we need a way to restrict vertex and edge labels, as well as property values. This can be done with the DataFrame API.
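For instance, a filter on labels and on a property value could look like the sketch below. It assumes that DSE GraphFrames exposes labels in a column named `~label` and that the KillrVideo graph has user vertices and rated edges with a rating property; treat these names, and the threshold of 8, as illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import com.datastax.bdp.graph.spark.graphframe._

val spark = SparkSession.builder().getOrCreate()
val g = spark.dseGraph("killr_video") // graph name is an assumption

// Backticks are needed because the column name starts with "~".
g.find("(a)-[e]->(b)")
  .filter("a.`~label` = 'user'")   // restrict the vertex label
  .filter("e.`~label` = 'rated'")  // restrict the edge label
  .filter("e.rating >= 8")         // restrict a property value
  .show(5)
```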

Here is an incomplete list of DataFrame methods that you may find to be useful.

Even if you are not very familiar with Spark’s DataFrames, the names of these methods are descriptive enough to infer their purpose. We will see some of them used in the following examples.

Furthermore, there are many predefined functions that can help you define even more complex graph queries. You can sort things, do mathematical computation, compute aggregates, extract substrings, work with dates, and so forth.

Again, this is an incomplete list and you can read the Spark documentation for more details.
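As one hypothetical illustration of combining motif finding with predefined functions, the sketch below computes an average rating per movie title and sorts the result. The query itself is not from the presentation, and the graph name, the `~label` column, and the title and rating properties are assumptions about the KillrVideo schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, desc, round}
import com.datastax.bdp.graph.spark.graphframe._

val spark = SparkSession.builder().getOrCreate()
val g = spark.dseGraph("killr_video") // graph name is an assumption

g.find("(user)-[r]->(movie)")
  .filter("r.`~label` = 'rated'")
  .groupBy("movie.title")                            // one group per title
  .agg(round(avg("r.rating"), 2).as("avg_rating"))   // aggregate + math
  .orderBy(desc("avg_rating"))                       // sort descending
  .show(5)
```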

Now we are ready to do some damage!

This is a simple query that finds Robert De Niro’s movies from the 1970s.

We match edges with pattern "(movie)-[e]->(person)" and further require a person’s name to be Robert De Niro, edge label to be “actor”, movie year to be greater than or equal to 1970 and less than 1980. We select title and year values of such movies and output them using DataFrame’s method show(). The result contains three movies.
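The query described above could be written roughly like this. It is a reconstruction from the narration, not the original slide code, and the graph name and the `~label` column name are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import com.datastax.bdp.graph.spark.graphframe._

val spark = SparkSession.builder().getOrCreate()
val g = spark.dseGraph("killr_video") // graph name is an assumption

g.find("(movie)-[e]->(person)")
  .filter("person.name = 'Robert De Niro'")
  .filter("e.`~label` = 'actor'")
  .filter("movie.year >= 1970 AND movie.year < 1980")
  .select("movie.title", "movie.year")
  .show()
```

The separate filter calls are equivalent to a single filter with all conditions joined by AND.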

Next, we find rating distribution for Robert De Niro’s movies and order the results by rating.

Our graph pattern consists of two unit patterns … user-rated-movie and movie-actor-person. The person’s name must be Robert De Niro and the two edges must have labels “rated” and “actor”.

Since we are only interested in ratings, we select the rating property values, group them by rating, and count how many values we have per group.

Finally, we rename column “count(1)” to simply “count”, order by the rating column, and show the results.
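A sketch of this query, reconstructed from the narration. The element names r and a, the graph name, and the `~label` column are assumptions; counting a literal 1 is used here because it is expected to produce the column name “count(1)” mentioned above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}
import com.datastax.bdp.graph.spark.graphframe._

val spark = SparkSession.builder().getOrCreate()
val g = spark.dseGraph("killr_video") // graph name is an assumption

g.find("(user)-[r]->(movie); (movie)-[a]->(person)")
  .filter("person.name = 'Robert De Niro'")
  .filter("r.`~label` = 'rated' AND a.`~label` = 'actor'")
  .select("r.rating")
  .groupBy("rating")
  .agg(count(lit(1)))                      // one count per rating value
  .withColumnRenamed("count(1)", "count")  // friendlier column name
  .orderBy("rating")
  .show()
```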

Now let’s go a bit further. Let’s save the result of our analysis of Robert De Niro’s movie ratings into a Cassandra table.

We assume that the Cassandra keyspace “killr_video” and the empty table “de_niro_ratings” with columns “rating” and “count” already exist.

The query itself should look familiar as it only differs from the previous example in the last two lines. Instead of ordering and displaying the result, we use Spark-Cassandra Connector to save the results into Cassandra. If the saving-to-Cassandra code looks new to you, consider taking our course on DataStax Enterprise Analytics with Spark on DataStax Academy to learn more.
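One way to write the save step is sketched below, using the connector’s generic DataFrame data source. The exact calls on the slides may differ; the graph name and the `~label` column are assumptions, and the table’s column types are assumed to be compatible with the DataFrame’s integer rating and long count.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}
import com.datastax.bdp.graph.spark.graphframe._

val spark = SparkSession.builder().getOrCreate()
val g = spark.dseGraph("killr_video") // graph name is an assumption

val deNiroRatings = g
  .find("(user)-[r]->(movie); (movie)-[a]->(person)")
  .filter("person.name = 'Robert De Niro'")
  .filter("r.`~label` = 'rated' AND a.`~label` = 'actor'")
  .select("r.rating")
  .groupBy("rating")
  .agg(count(lit(1)))
  .withColumnRenamed("count(1)", "count")

// Write the rows into the existing table killr_video.de_niro_ratings.
deNiroRatings.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "killr_video", "table" -> "de_niro_ratings"))
  .mode("append")
  .save()
```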

Lastly, we use Spark SQL to retrieve data back from the Cassandra table to see that we indeed achieved our goal.
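Reading the data back might look like this. The sketch assumes that DSE exposes Cassandra keyspaces to Spark SQL as databases, so that the table can be addressed as killr_video.de_niro_ratings without extra registration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Query the Cassandra table through Spark SQL and display its rows.
spark
  .sql("SELECT * FROM killr_video.de_niro_ratings ORDER BY rating")
  .show()
```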

And finally, here is our most complex example, where we find and compare rating distributions for Al Pacino and Robert De Niro by retrieving and combining data from Graph and Cassandra. That should be interesting!

We get Al Pacino’s rating distribution from the graph and Robert De Niro’s rating distribution from the table in Cassandra we populated in the previous example. We then take the two resulting DataFrames or tables and perform a full outer join on the rating column. The full outer join is required to preserve any rows from both tables that cannot satisfy the join condition. We order the resulting table by rating and show the result.
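A sketch of the combined query follows. The output column names pacino_count and de_niro_count are hypothetical, and the graph name, the `~label` column, and the Spark SQL access to the Cassandra table are the same assumptions as before.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}
import com.datastax.bdp.graph.spark.graphframe._

val spark = SparkSession.builder().getOrCreate()
val g = spark.dseGraph("killr_video") // graph name is an assumption

// Al Pacino's rating distribution, computed from the graph.
val pacino = g
  .find("(user)-[r]->(movie); (movie)-[a]->(person)")
  .filter("person.name = 'Al Pacino'")
  .filter("r.`~label` = 'rated' AND a.`~label` = 'actor'")
  .select("r.rating")
  .groupBy("rating")
  .agg(count(lit(1)).as("pacino_count"))

// Robert De Niro's rating distribution, read back from Cassandra.
val deNiro = spark
  .sql("SELECT rating, `count` FROM killr_video.de_niro_ratings")
  .withColumnRenamed("count", "de_niro_count")

// Full outer join keeps ratings that appear on only one side.
pacino
  .join(deNiro, Seq("rating"), "fullouter")
  .orderBy("rating")
  .show()
```

A rating that only one of the two actors received shows up with a null in the other actor’s count column.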

You should be able to recognize that this query uses a number of APIs ... DSE GraphFrames, Spark GraphFrames, DataFrames, Spark-Cassandra Connector and Spark SQL. There are so many possibilities.

Perfect! Now you have an opportunity to practice some skills.
