Bulk Mutations with DSE Graphframes

DSE Version: 6.0


DSE GraphFrames are great for performing bulk mutations. When you need to insert, update or delete a large number of vertices, edges, and properties, you can use DSE GraphFrames for the best efficiency. That includes bulk loading an existing dataset into DataStax Enterprise Graph initially. In this unit, you will learn about bulk mutations with DSE GraphFrames.



Hi, I am Artem Chebotko and this is Bulk Mutations with DSE GraphFrames.

DSE GraphFrames are great for performing bulk mutations. When you need to insert, update or delete a large number of vertices, edges, and properties, you can use DSE GraphFrames for the best efficiency. That includes bulk loading an existing dataset into DSE Graph initially.

Here is the API for inserting, updating and deleting graph elements using DSE GraphFrames.

Notice that the same methods updateVertices and updateEdges are used for both inserting and updating data in a graph. The way it works is as follows ... DSE Graph does not do a read before a write to verify that something already exists in a graph. Instead, for efficiency, we simply write data, which results in inserting a vertex or an edge if it did not exist before or updating an existing vertex or an edge with new property values. This is also consistent with how upserts work in Cassandra, if you are familiar with the notion of upsert.

Also notice that all methods take a DataFrame as a parameter. DataFrame df must have columns and values that uniquely identify vertices and edges that are being inserted, updated or deleted. For example, in case of a vertex, we need a vertex label and vertex ID properties. In case of an edge, we need an edge label, as well as source and destination vertex IDs.

In addition, the DataFrame should have columns representing properties that are being inserted or updated.

When inserting or updating vertices, An optional set of vertex labels can be specified for better efficiency.

When deleting vertex or edge properties, to-be-deleted property names are passed to deleteVertexProperties and deleteEdgeProperties.

That may be a lot of information for now so let’s look as some examples that use these methods.

First, we need a data source or some existing dataset we can insert into DSE Graph. Anything that is accessible via Spark will work. That includes files in local and distributed file systems, NoSQL databases, Relational databases, Data streams, and so forth.

For the sake of our examples, it makes sense to choose a Cassandra database as a data source. This is the Cassandra database schema we can use.

We have three tables … users, movies and ratings_by_user. Take a look at the columns in those tables. You should be able to recognize that tables users and movies have data that belongs to vertices in our KillrVideo domain. And table ratings_by_user has information about rated-edges, such as users rated movies with particular ratings.

Here is how some sample data in those tables may look like. We insert one user u1234, who is a 32 y.o. Female and two movies … m1111 Moana and m2222 Beauty and Beasts. Our user left the ratings of 9 and 8 for those movies, respectively.

You can of course assume that there are many more rows in each table.

We can assume that we have an empty KillrVideo graph created in DSE Graph and this is its schema.

We first define property keys, such as userId, age, gender, movieId, title, year, duration, country, production, and rating.

We then define two vertex labels and one edge label. By now, you are likely familiar with these definitions and the syntax.

Finally, interesting things start happening here.

We initialize a DSE GraphFrame for the KillrVideo graph whose schema you have just seen.

We retrieve data from Cassandra table users into a DataFrame using a Spark SQL query.

We then call updateVertices to insert each row in the DataFrame as a user-vertex with userId, age and gender properties into the KillrVideo graph in DSE Graph. The optional Seq("user") parameter is used to make data insertion faster.

We display the contents of the DataFrame just for the reference. The data has actually been added to the graph.

That was almost too easy. The only data transformation we had to do is hardcoding the user label in the SQL SELECT clause because the user table has no column that stores a vertex label, which is to be expected.

Next, we use the same strategy to insert our two movie.

SQL query over the movies Cassandra table ....

updateVertices to insert into the KillrVideo graph …

Show the DataFrame content for reference … Done.

Now let’s see how we can insert rated-edges into the graph.

We still retrieve data using a Spark SQL query but then we also post-process the resulting DataFrame. We need to add columns for edge endpoints, namely source and destination, which we do in the select method call.

To generate values for the source and destination columns, we use method idColumn of DSE GraphFrame and pass it a vertex label and any properties that constitute a vertex ID.

We use method updateEdges to insert new edges in the graph.

And you can see how the data in the DataFrame looks like to understand what it takes to insert edges.

When we visualize our newly populated graph in DataStax Studio, we can see that we did a great job. The user-vertex, two movie-vertices, and two rated-edges are all present and their properties are there as well.

So now we have a graph with existing vertices, edges, and properties and it is a good time to do some updated.

Let’s change the user age from 32 to 23.

We still use the updateVertices method but we do create a DataFrame manually using the createDataFrame method of the SparkSession object. We pass an RDD with a single row with values u1234 and 23 … and also specify the DataFrame schema with fields userId and age. In addition, we add another column for label with value “user”.

You can see the resulting DataFrame in the output. Notice that we do not have a column for user gender because we do not need to update it.

Let’s also change the movie rating from 8 to 5 for one of the movies.

This time, instead of creating a DataFrame with the edge information manually, we can retrieve it from the graph itself.

We access edges through the DSE GraphFrame and filter them to have specific source and destination vertex IDs. Then we select DataFrame columns that we need to identify an edge and add the rating column with the new value of 5 that we are updating to.

The updateEdges method does the rest of the work for us.

We are done with insertions and updates. What about deletions?

Here is how we can delete all rated-edges with rating 5 and all movie-vertices with title Beauty and the Beast.

That should leave us with one user-vertex, one movie-vertex and the edge between them as shown in the illustration.

Finally, our last example shows how to delete vertex and edge properties using methods deleteVertexProperties and deleteEdgeProperties.

We delete country and duration properties of the Moana movie and the rating property on all rated-edges.

Looks straightforward to me.

In this exercise, you will bulk load data from CSV files into the KillrVideo graph. This is a pretty common scenario … let’s do it!

No write up.
No Exercises.
No FAQs.
No resources.
Comments are closed.