Defining Graph Schemas

DSE Version: 6.0

Video

In this unit, you will learn everything you need to get started with creating awesome graph schemas with DataStax Enterprise Graph.

Transcript: 

Welcome to working with graph schemas! My name is Marc Selwan, a member of the product team, and in this section, we’ll cover everything you need to know to get started with creating awesome graph schemas.

To start things off, let’s talk about how we actually define schemas...

Let’s start by level setting on what a graph schema actually is in the world of DataStax Enterprise. It’s important to note that graph schema is a concept introduced in DSE Graph for efficiency.

Now, graph schema is a combination of property keys, vertex labels, edge labels, and the semantics that connect them.

Graph schema also enables the definition of indices before data is inserted into the system.

Graph schema is managed by the graph schema API, which is accessible to you, the end user, via the schema object in DataStax Studio or the Gremlin Console.

The Schema API provides methods such as propertyKey, vertexLabel, and edgeLabel, and it supports a fluent interface.

The Schema API provides two schema modes that let users choose whether to run in production mode or development mode.

Production mode is the default and recommended mode to run as it requires you to define your schema upfront to make sure you’re getting the best performance possible out of your graph.

Development mode allows DSE Graph to infer schema as data is inserted. While this makes development easy, it is best used as a tool for exploration in development environments; use production mode wherever performance and scalability are key.

DSE Graph allows you to globally set the schema mode via the dse.yaml file, but it’s recommended to leave it as is, defaulting to production mode.

You can dynamically specify schema mode with the syntax on the bottom here, should you want to switch to development mode for testing and educational purposes, on the respective graph you’re working with.
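The slide with this syntax is not reproduced in the transcript. As a rough sketch, switching the schema mode for the current graph in DSE Graph 6.0 typically looks like the following in DataStax Studio or the Gremlin Console, assuming the schema object is already bound to the graph you are working with:

```groovy
// Switch the current graph to development mode (exploration and testing only).
schema.config().option('graph.schema_mode').set('Development')

// Switch back to the default mode before running production workloads.
schema.config().option('graph.schema_mode').set('Production')
```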

There are 4 main elements to building graph schema. In order, you define property keys, vertex labels, edge labels, and indices. Let’s take a deeper look into the first 3 elements.

Property Keys are the first step in your schema definition.

You can think of these as your attributes that define uniqueness, data type, cardinality, and meta-properties. These attributes can then be associated with vertices and edges; however, they are defined independently of them.

Here’s how you typically define a property - specifically a single-cardinality property key which is the most frequently used.  

You can see that we define the propertyKey name, then the data type, and then the cardinality, though single cardinality is assumed by default.
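The slide itself is not included in this transcript, so here is a minimal sketch of what such a definition usually looks like in the DSE Graph 6.0 schema API. The property key names title and duration are assumptions chosen for illustration:

```groovy
// Name first, then the data type, then the cardinality.
schema.propertyKey('title').Text().single().create()

// single() is the default cardinality, so it can be left out.
schema.propertyKey('duration').Int().create()
```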

Here’s an example of defining a multi-property key. A multi-property key allows for many text values to be stored for a respective property. In this case, it’s possible that there are multiple film production companies for a given production.
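A sketch of such a multi-property key, assuming the key is named production (the name is an assumption, since the slide is not reproduced here):

```groovy
// multiple() lets one vertex hold many values for this property,
// for example several production companies for the same film.
schema.propertyKey('production').Text().multiple().create()
```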

DSE Graph also allows for the creation of meta-properties, or, multiple properties within a given propertyKey.

You can see that we define a source and date property key, each with a different data type, and we can include them inside our “budget” property key. It’s also worth noting that meta-properties can only be used with vertex properties and can be applied to single or multiple cardinality properties.
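The definition described here might look roughly like this. The data types (Text and Timestamp) and the choice of multiple() for budget are assumptions, since the transcript only says that source and date have different types:

```groovy
// Step 1: define the property keys that will act as meta-properties.
schema.propertyKey('source').Text().create()
schema.propertyKey('date').Timestamp().create()

// Step 2: attach them to the budget property key via properties().
// Dropping multiple() would make budget a single-cardinality property instead.
schema.propertyKey('budget').Text().multiple().properties('source', 'date').create()
```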

Here’s a list of supported property data types. As you can see, everything you expect is here - everything from small ints to decimals to text and even geospatial objects.

So now that we understand how property keys work, we’re ready to talk about vertex labels.

Vertex Labels are your entities in the entity-relationship model. They define the types of vertices in a respective graph.

Defining vertex labels is the second step in the schema definition process. They contain the vertex ID definition, a unique label, as well as the associated property keys we already defined in step 1.

Thinking about the vertex ID is really important. There are two types of Vertex IDs in DSE Graph, user defined IDs and system generated IDs.

User defined vertex IDs are the recommended approach as they are much more performant and give the user the ability to control graph partitioning.

System generated vertex IDs are to be deprecated and have performance limitations with larger graphs.

So let’s dive into user-defined vertex IDs. They are defined via two components:

The partition key - which determines the partition a vertex belongs to.

And the clustering key - which helps uniquely identify a vertex within a partition.

This may sound familiar because these are the exact same concepts taken out of data modeling with CQL. As such, the partition key is defined as a non-empty set of property keys and a clustering key is defined as a set of property keys which can be empty. Both can be made up of single or composite property keys.

Now that we understand how vertex IDs work, let’s create our vertex labels.

Remember, step 1 is to define our property keys, in this case our userID, age, and gender, all of which are single cardinality.

Then we define our vertex label. We’re creating a vertex label called “user”, that’s our entity, and we’re defining the partition key as our userID property. Then we define the other properties or attributes that are associated with our user. In this case, age and gender.
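Put together, the two steps could be sketched as follows. The data types are assumptions; the transcript only names the keys and says they are single cardinality:

```groovy
// Step 1: property keys (single cardinality by default).
schema.propertyKey('userID').Text().create()
schema.propertyKey('age').Int().create()
schema.propertyKey('gender').Text().create()

// Step 2: the vertex label, partitioned by userID, with age and gender as properties.
schema.vertexLabel('user').partitionKey('userID').properties('age', 'gender').create()
```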

The resulting graph partitioning may or may not make sense depending on the types of traversals you require for a graph. For example, it would not be the best partitioning strategy for traversals that analyze all users with a particular age. On the other hand, this partitioning strategy does work well for queries that retrieve one user at a time by a known userID.

Let’s take a look at another example. This time, we’ll be defining our movie vertex label. Notice how we have two properties, year and country, as our partition key. This is known as a composite partition key as we’re combining two or more values to define data locality.

It’s important to remember that just like in CQL data modeling, you must always provide both values of a composite partition key when querying the data.

Now because a movie might have multiple releases or versions, defined by year and country, we’re going to use movieID as our clustering key to define uniqueness for a given movie.

This data model allows us to start by finding all movies for a given year and country and then narrow that down by movieID.
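A sketch of the movie vertex label described above. The data types, the extra title property, and the sample values in the traversal are assumptions added for illustration:

```groovy
// Property keys used by the movie vertex label.
schema.propertyKey('year').Int().create()
schema.propertyKey('country').Text().create()
schema.propertyKey('movieID').Text().create()
schema.propertyKey('title').Text().create()

// Composite partition key (year, country) plus movieID as the clustering key.
schema.vertexLabel('movie').
    partitionKey('year', 'country').
    clusteringKey('movieID').
    properties('title').
    create()

// Supply both partition key values, then optionally narrow by the clustering key.
g.V().has('movie', 'year', 2010).has('country', 'United States').has('movieID', 'm267')
```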

Lastly, let’s take a look at an example that uses system generated IDs. Remember, system-generated vertex IDs are being deprecated.

So let’s take a closer look - notice that in our vertexLabel definition, we have not specified a partition key nor clustering key. In this case, DSE Graph will generate an ID on insertion that consists of community and member IDs. While this sounds convenient in theory, it also means we lose all control of data locality and uniqueness.
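For comparison, a definition that relies on a system-generated ID simply leaves out both key clauses. The label genre and the property key name below are hypothetical stand-ins for whatever the slide shows:

```groovy
schema.propertyKey('name').Text().create()

// No partitionKey() or clusteringKey(): DSE Graph generates the vertex ID on insertion.
schema.vertexLabel('genre').properties('name').create()
```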

The last thing to note about creating vertexLabels is that you can reuse property keys across vertexLabel definitions. This means there’s no benefit to having separate genreName and personName property keys. Simply define the properties associated with a vertex label and the values of the properties will always remain local to the respective vertex.

Now that we know how to define our property keys and vertexLabels, let’s talk about how we actually create the relationships between them. These are known as edges and as such we need to define edgeLabels as the third step in our schema definition process.

Similarly to vertexLabels, an edgeLabel contains a unique label name, cardinality, associated property keys, and the domain and range. You’ll notice that there is no such thing as user-defined edgeIDs. This is because the edgeIDs are automatically derived from the vertexIDs. We’ll take a closer look at that when we talk about inserting graph data.

Let’s talk about creating edgeLabels. In this example, we’re creating a single cardinality edge label where a user rates a movie. In this case, we only want at most one rating between a respective user and movie so we add single() to the edgeLabel definition.

We then define our properties. It’s worth reminding you that edge labels cannot have multi-properties or meta-properties associated with them.

Then, we define our connection. Logically, the directionality of the relationship is that a user rates a movie so we first define the user, then the movie. Even though we define directionality this way, relationships are stored bi-directionally on disk to make traversing in and out of relationships easier.
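The three parts just described (cardinality, properties, connection) could be sketched like this. The label name rated and the rating property with its Int type are assumptions, and the user and movie vertex labels are assumed to exist already:

```groovy
// Property key carried by the edge (single cardinality, no meta-properties).
schema.propertyKey('rating').Int().create()

// At most one rated edge between a given user and movie.
// connection(outgoing label, incoming label): a user rates a movie.
schema.edgeLabel('rated').single().properties('rating').connection('user', 'movie').create()
```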

Let’s take a look at a multi-cardinality edge label definition. In our example here, we have a movie vertex label and a person vertex label. While a person can be anyone, we are relating movies and persons by the roles that they played in a respective movie.

Because it’s possible that a single person played many roles in a movie, we define this as a multi-cardinality edge label. Now multi-cardinality is assumed by default so it can be omitted from the edgeLabel definition as you can see in the bottom example.
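Both forms could look roughly as follows, assuming the edge label is called actor and that the movie and person vertex labels already exist. Only one of the two statements would be run, since they create the same label:

```groovy
// Explicit multi-cardinality: many actor edges may connect the same movie and person.
schema.edgeLabel('actor').multiple().connection('movie', 'person').create()

// Equivalent shorter form, relying on multiple() being the default:
// schema.edgeLabel('actor').connection('movie', 'person').create()
```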

Finally, it’s time to talk about our 4th and final step in the schema definition process: creating indices. Because there’s a lot of depth to indices in DSE Graph, we’ll cover them separately, but for now, know that there are 3 types of indices: vertex, property, and edge indices.

Now that you know the schema definition process, let’s talk about some ways you can find information about your schema, starting with the schema.describe() command. This will list all of your schema definitions.

If you want to print schema elements for a specific edge or vertex label, you can do that by appending describe() to schema.vertexLabel or schema.edgeLabel.
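As a short sketch, the inspection commands look like this. The label names user and rated are assumptions standing in for labels that exist in your graph:

```groovy
// List every schema definition in the current graph.
schema.describe()

// Print the definition of one vertex label or one edge label.
schema.vertexLabel('user').describe()
schema.edgeLabel('rated').describe()
```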

Lastly, if you want to drop all of your schema definitions, you can simply run the schema.drop command but please be careful as this will cause you to lose all of the data in your respective graph.

This is a larger example of a graph schema. In all our graph-related training units, we will be relying on this KillrVideo graph schema.

The graph schema contains 4 types of vertices, labeled movie, user, genre, and person, and a number of labeled edges that may connect vertices of specific types. Study the vertex ids and properties (their keys/names and value types) shown in the schema graph.

The edge cardinalities are omitted for visual clarity. The only multiple-cardinality edges are actor and screenwriter, which is based on the properties of our specific dataset.

There is one multi-property and no meta-properties.
