Introduction to DSE GraphFrames

DSE Version: 6.0

Video

In this unit, you will be introduced to DSE GraphFrames. DSE GraphFrames is an alternative OLAP engine for DataStax Enterprise Graph. It is designed around our DataStax Enterprise Analytics offering and it enables a unique way of interacting with DataStax Enterprise Graph using various Spark APIs and tools.

Transcript: 

 

Hi, I am Artem Chebotko and this is Introduction to DSE GraphFrames.

DSE GraphFrames is an alternative OLAP engine for DSE Graph. It is designed around our DSE Analytics offering and it enables a unique way of interacting with DSE Graph using various Spark APIs and tools.

You can think of a DSE GraphFrame as a representation of your graph in Spark. This representation is based on the original notion of GraphFrames in Spark.

We will soon learn more about the specifics of DSE GraphFrames based on our KillrVideo graph example.

There are various ways you can work with a DSE GraphFrame. You can use a subset of Gremlin to define OLAP graph traversals. There is also a complementary DSE GraphFrame API that is primarily useful for bulk mutations. You can search for structural graph patterns using the original Spark GraphFrame API. Finally, you can use any other Spark API that may be suitable for your application. That includes SparkSQL, DataFrames, and Spark Streaming ... just to name a few.

When it comes to DSE GraphFrame use cases, there are many … due to the versatility and power of this OLAP engine. Of course, you can do Gremlin traversals just like in the other OLAP traversal engine based on the SparkGraphComputer. But ... DSE GraphFrames are unique as they also support non-Gremlin graph data manipulation and accessing data directly from Spark.

As a result, you can combine data stored in DSE Graph, with data stored in Cassandra, with data stored in a relational database, with data stored in a file, with data coming in a stream, and so forth. With this kind of power, there are virtually unlimited possibilities for the type of analysis you may need.
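As a hedged sketch of this kind of combination (the CSV path, its columns, and the box-office data are illustrative assumptions, not from this lesson), joining movie vertices with data from a file might look like:

```scala
// Sketch only: assumes a running DSE Analytics cluster ("dse spark" shell)
// and a hypothetical CSV file with "title" and "grossUsd" columns.
val g = spark.dseGraph("KillrVideo")

// Movie vertices as a plain Spark DataFrame
val movies = g.V().hasLabel("movie").df

// External, non-graph data read straight from a file
val boxOffice = spark.read
  .option("header", "true")
  .csv("dsefs:///data/box_office.csv")   // hypothetical path

// Combine graph data with file data using an ordinary Spark join
// (assumes movie vertices carry a "title" property)
movies.join(boxOffice, "title").show(5)
```

Because both sides of the join are ordinary DataFrames, the same pattern works for Cassandra tables, JDBC sources, or streaming inputs.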

It is also worth mentioning that DSE GraphFrames is currently the most efficient way to do bulk loading and bulk mutations in DSE Graph. If you have an existing large dataset that you want to load into DSE Graph, DSE GraphFrames are here to help.

Let’s create our first DSE GraphFrame in Spark Shell.

To do that, we start Spark Shell by executing command “dse spark”. And then call the dseGraph method of the SparkSession object with the name of the graph as a parameter … KillrVideo in this case. We assign the result to “g” which is an instance of DseGraphFrame as you can see in the output.
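The sequence just described looks roughly like this at the shell prompt (the fully qualified class name in the output may differ by DSE version):

```scala
// $ dse spark          <-- start the Spark shell from the command line
scala> val g = spark.dseGraph("KillrVideo")
g: com.datastax.bdp.graph.spark.graphframe.DseGraphFrame = ...
```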

While it is convenient to think about a DseGraphFrame as simply a representation of a graph stored in DSE Graph, there is a bit more to it. Internally, in Spark, a DseGraphFrame is represented by two virtual tables or DataFrames: one for vertices and one for edges.

The Vertex DataFrame has columns for vertex ID, label, and properties. The Edge DataFrame has columns for source and destination vertex IDs, the internal edge ID, label, and properties.

It is important to note that even though vertex IDs in DSE Graph can be composed of values from multiple properties, the DseGraphFrame representation always serializes a vertex ID as a single value through hashing and concatenation. This is done on purpose to make the DseGraphFrame format fully compatible with the original Spark GraphFrame format so that we can take full advantage of various Spark APIs. In general, this ID conversion is done implicitly and automatically, but we will also learn how to do it explicitly, which will be useful in some cases.

In the context of our KillrVideo graph, this is what the vertex table schema looks like.

“g” is our DseGraphFrame … we call its method V() to get the vertex DataFrame and then DataFrame’s method printSchema to output the result.
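In code, that is simply (assuming the DseGraphFrame “g” created earlier in the Spark shell):

```scala
// Print the schema of the vertex DataFrame
g.V().printSchema()
```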

We see columns “id”, “label”, “userId”, “age”, “gender”, “genreId”, “name”, “personId”, “movieId”, and so forth. The “production” column is a bit unique as it corresponds to a multi-property, so its data type is an array of strings.

Based on the columns, you can see that this DataFrame is used for vertices of different types … in our case … users, movies, people, and genres.

Using DataFrame’s method “show” we can take a look at the actual column values in the first five rows. We have rows corresponding to user u185, genre Adventure, person Johnny Depp, and two Alice in Wonderland movies.
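That call looks like the following, where the argument to show limits the output to the first five rows:

```scala
// Inspect the first five vertex rows of the KillrVideo graph
g.V().show(5)
```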

Similarly, we can output the edge DataFrame schema for the KillrVideo graph.

It does look simpler because we only have one edge property called “rating” and it is of type “int”.

As you may remember, “src” and “dst” are columns that store edge source and destination vertex IDs. And “id” is a column that stores edge IDs that are generated and used internally by DSE Graph.

Here are some concrete values to look at. Besides the label and rating, there is not much for us to comprehend.
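The edge-side equivalents of the calls shown for vertices are (again assuming the DseGraphFrame “g”):

```scala
// Print the edge DataFrame schema, then look at a few rows
g.E().printSchema()
g.E().show(5)
```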

Finally, we conclude this presentation with a summary of the DSE GraphFrame APIs that we cover in more detail in subsequent videos.

First, we have the Gremlin API to define Gremlin traversals, starting with either method V() or method E(), followed by traversal steps like in(), out(), count(), and so on. We assume that you are already familiar with the Gremlin traversal language.
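A minimal hedged example of such a traversal, assuming the “movie” vertex label from KillrVideo:

```scala
// OLAP Gremlin-style traversal: count all movie vertices
g.V().hasLabel("movie").count()
```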

Second, there is the Spark GraphFrame API that, among other things, has some interesting structural pattern matching capabilities. In Scala, a DseGraphFrame is converted to a Spark GraphFrame implicitly. Explicit conversion is done using the gf() method.
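A sketch of the explicit conversion followed by a simple structural pattern (the motif string follows the Spark GraphFrames find() syntax):

```scala
import org.graphframes.GraphFrame

// Explicit conversion from DseGraphFrame to a Spark GraphFrame
val sparkGf: GraphFrame = g.gf()

// Structural pattern matching: every pair of vertices connected by an edge
sparkGf.find("(a)-[e]->(b)").show(5)
```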

Third, we have very efficient methods for bulk mutations. We can insert, update and delete graph elements with those.
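As a sketch of a bulk update (the JSON path and its columns are hypothetical; updateVertices is part of the DseGraphFrame bulk-mutation API, but check the label-column naming convention for your DSE version):

```scala
import org.apache.spark.sql.functions.lit

// Hypothetical source of new user records with columns matching
// the "user" vertex schema (e.g. userId, age, gender)
val newUsers = spark.read.json("dsefs:///data/new_users.json")
  .withColumn("~label", lit("user"))   // label column; "~label" is an assumption

// Bulk-insert/update vertices through the DseGraphFrame API
g.updateVertices(newUsers)
```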

And last but not least, we have the data persistence methods cache and persist, which you may already be familiar with since they serve the same purpose as the similarly named methods elsewhere in Spark.
