Understanding Graph Partitioning and Data Locality

DSE Version: 6.0

Video

In this unit, we will introduce the concepts of graph partitioning and data locality. Graph partitioning involves the grouping of vertices and edges into sections called subgraphs. Data locality can be controlled through selective data partitioning.

Transcript: 

Hi, it’s Jonathan Lacefield here and I’m back to talk about partitioning and data locality in DSE Graph.

In case you’d like more information on this subject, you can always read my blog post from the DataStax Academy blog titled Understanding Graph Locality.



Let’s first introduce the concept of graph partitioning.



For those who are familiar with DSE, you’ll understand partitioning in terms of Apache Cassandra. That is, Apache Cassandra stores data in partitions. We have partition keys. One of the main tricks to achieving those low-millisecond p99s at scale with Apache Cassandra is to set up your data model properly around a good partition access pattern.



For graphs, partitioning means something slightly different. With graphs, partitioning refers to how one groups vertices and edges of a whole graph into specific sections called subgraphs. This is done for performance, analysis, security or data management purposes.



The trick of graph partitioning is to balance graph access patterns, write patterns, and data size constraints. Like most things in distributed architecture, we’re talking about trade-offs.



Theoretically, graph partitioning is an incredibly hard problem. This is because of the complexity introduced by highly connected components.



There is no magic algorithm that can generate optimal graph partitions for you.  At least not yet. For the best results, you will have to consider each individual use case and be creative.



In DSE Graph, we can actually combine the concepts of partitions in both Apache Cassandra and graph theory to come up with practical implementation choices that solve real-world graph challenges at scale.



This is because of the way DSE Graph projects the graph model onto Cassandra.



In this diagram, we’re looking at an oversimplified example to illustrate the concept of how graph partitioning is applied to Cassandra in DSE Graph.



In this example, subgraphs become partitions in Apache Cassandra™. However, it is usually a one-to-many mapping because vertices with different labels are stored in different partitions (and tables) in Apache Cassandra™.



In the illustration, Subgraph 1 has vertices with three distinct labels, resulting in three partitions. Subgraph 3 has all vertices with the same label, resulting in only one Apache Cassandra™ partition. The number of vertices per partition is determined by the vertex ID design.



You can control graph partitioning in DSE Graph by carefully choosing vertex labels and vertex ID design. This will feel very familiar to those who work with the Cassandra components of DSE.



Choosing vertex labels, especially in the presence of entity type hierarchies, can affect graph partitioning in DSE Graph.



Under the covers, DSE Graph creates two Cassandra tables for each vertex label (type) that’s created. These Cassandra tables use a primary key that corresponds to the vertex identifier.



This means that vertex property and edge data is always stored together on a node.



And we can use Cassandra’s wide rows, via partition and clustering keys, to further control data placement in DSE Graph. Data placement is the central implementation concept of graph partitioning.



In DSE Graph, this technique is named User Defined Vertex IDs.



It’s worth mentioning that in older versions of DSE Graph, DSE Graph used to generate IDs for users. This, it turns out, resulted in suboptimal data placement and impacted a graph’s ability to scale. Using User Defined IDs is the recommended approach for working with graphs in DSE Graph.



We have 3 types of User Defined Vertex IDs.



The first is a vertex ID based on simple partition keys. This is the most basic implementation, in which every new vertex is created using a unique Cassandra partition. Cassandra will hash the partition key and store the data randomly throughout the cluster. This is the easiest method to implement and it ensures an even distribution of data across your cluster, but it also doesn’t provide users with any control over data placement.
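To make this concrete, here is a minimal schema sketch in Gremlin (Groovy). The 'person' label and 'personId' property are hypothetical names used only for illustration:

// Define the identifying property, then use it alone as the partition key.
// Every distinct personId value hashes to its own Cassandra partition.
schema.propertyKey('personId').Uuid().single().create()
schema.vertexLabel('person').partitionKey('personId').create()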



In other words, this is the most simplistic approach to graph partitioning. You could even say this is graph without partitioning.



The next User Defined Vertex ID type is to use composite partition keys. This means that instead of just using one property to identify a unique vertex, two or more properties are used together to identify a unique vertex.
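As a sketch, assuming hypothetical 'make' and 'model' properties on a 'sensor' label, a composite partition key could be declared like this:

// Both properties together form the composite partition key. Cassandra
// hashes the combined values, so both are needed to locate a vertex.
schema.propertyKey('make').Text().single().create()
schema.propertyKey('model').Text().single().create()
schema.vertexLabel('sensor').partitionKey('make', 'model').create()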



The results of this approach are very similar to the simple partition key approach we described on the previous slide.



The final User Defined Vertex ID type is a vertex ID using partition and clustering keys. This method provides the most control over graph partitioning and enables graph users to achieve data locality by using the same partition key property to identify multiple vertices that belong to the same subgraph.



This is the most powerful design pattern as it allows multiple vertices to reside in the same partition and share the same node.
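Here is a hedged sketch of this pattern, using the movie example coming up next (the 'year' clustering property is an assumption for illustration):

// Movies sharing a title land in the same Cassandra partition; the
// year clustering key distinguishes each version within that partition.
schema.propertyKey('title').Text().single().create()
schema.propertyKey('year').Int().single().create()
schema.vertexLabel('movie').partitionKey('title').clusteringKey('year').create()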



Here you can see that we have two vertices with a partition key of Alice in Wonderland. It would be very efficient to retrieve all movies with a given title by simply traversing on the partition key only. DSE Graph will translate these operations into Cassandra partition access patterns, which are very efficient.
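For example, assuming the 'movie' schema sketched above, the two versions of the film could be added and then fetched with a single-partition lookup (the years are illustrative):

// Both vertices share the partition key 'Alice in Wonderland'.
graph.addVertex(label, 'movie', 'title', 'Alice in Wonderland', 'year', 1951)
graph.addVertex(label, 'movie', 'title', 'Alice in Wonderland', 'year', 2010)

// A has() on the partition key resolves to one Cassandra partition read.
g.V().has('movie', 'title', 'Alice in Wonderland').values('year')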



Okay, so there you have it: we can leverage the power of Cassandra’s wide-row data model to achieve a practical solution to graph partitioning. Now it’s your turn to give this a shot. Give Exercise 4 a go and we’ll be right here when you’re ready to learn more about DSE Graph.



 
