About this Tutorial
Getting Started with Graph Databases contains a brief overview of RDBMS architecture compared with graph databases, basic graph terminology, a real-world use case for graphs, and an overview of Gremlin, the standard graph query language found in TinkerPop.
We're going to take a look at the fundamentals of graph databases. But first, let’s take a look at something most database developers are familiar with: the relational database.
Relational Databases Explained
Relational databases have been around since the 1970s and drive an absolutely massive market. The most fundamental concept of the relational database is a table. Tables are where we store our data, and they consist of rows and columns. Each row is a record and has the same set of columns, or fields. For instance, here's a table storing information on a few members of the evangelist team at DataStax.
To query tables, we're going to use Structured Query Language (SQL), often pronounced “sequel”. The most basic operation we can do is a filter. In the example below, we're querying for all the people who reside in Canada. The database looks at our table and returns all rows matching the filter.
select * from people where country = 'canada';
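To try the filter above without installing a database server, here is a minimal sketch using Python's built-in sqlite3 module. The column layout and the names alice, bob, and carol are made up for illustration; they are not the rows from the table above.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical people table: the columns and rows are illustrative only.
conn.execute("create table people (id integer primary key, name text, country text)")
conn.executemany(
    "insert into people (name, country) values (?, ?)",
    [("alice", "canada"), ("bob", "usa"), ("carol", "canada")],
)

# The same filter as in the text, narrowed to the name column and sorted
# so the result order is predictable.
rows = conn.execute(
    "select name from people where country = 'canada' order by name"
).fetchall()
canadians = [name for (name,) in rows]
print(canadians)
```

Only the rows whose country column matches the filter come back; the row for bob is skipped.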
If you have a relational background, you've probably heard of foreign keys. Foreign keys let us break up our tables, so we don't need to store the same data several times. In this example, we've moved the country to another table, and have set up a foreign key in our people table.
If we make a change to a country name, we don't need to copy it to every record in the people table, which is very convenient.
In order to bring the data together from different tables we can use something called a join. This is effectively merging two tables together using the foreign keys we've defined.
select * from people join country on people.country = country.id;
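Here is a small sketch of the foreign key and join in action, again using Python's built-in sqlite3 module. The rows are hypothetical. It also shows the convenience mentioned earlier: renaming a country in one place changes what every joined row displays.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: people.country is a foreign key into country.id.
conn.executescript("""
    create table country (id integer primary key, name text);
    create table people (
        id integer primary key,
        name text,
        country integer references country(id)
    );
    insert into country (id, name) values (1, 'canada'), (2, 'usa');
    insert into people (name, country) values ('alice', 1), ('bob', 2), ('carol', 1);
""")

join_sql = """
    select people.name, country.name
    from people join country on people.country = country.id
    order by people.name
"""

before = conn.execute(join_sql).fetchall()

# Rename the country once; no rows in people are touched.
conn.execute("update country set name = 'Canada' where id = 1")

after = conn.execute(join_sql).fetchall()
print(before)
print(after)
```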
Sometimes our relationships can get a bit more complex. For instance, what if a person can have residency in more than one country? We'll need to create a new table. This is sometimes referred to as a mapping table. It helps us manage this relationship, which is called many-to-many.
We can join our three tables together to get a view of all the countries a person is a resident in. This seems easy enough, and pretty straightforward. The difficulty arises when the relationships are more complex and nuanced than they appear on the surface.
select * from people join people_country on people.id = people_country.user join country on people_country.country = country.id;
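The three-table join can be sketched the same way. The table and column names follow the query above (people_country.user and people_country.country); the rows themselves are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# people_country is the mapping table: one row per (person, country) pair.
conn.executescript("""
    create table people (id integer primary key, name text);
    create table country (id integer primary key, name text);
    create table people_country (user integer, country integer);
    insert into people values (1, 'alice'), (2, 'bob');
    insert into country values (1, 'canada'), (2, 'usa');
    insert into people_country values (1, 1), (1, 2), (2, 2);
""")

residencies = conn.execute("""
    select people.name, country.name
    from people
    join people_country on people.id = people_country.user
    join country on people_country.country = country.id
    order by people.name, country.name
""").fetchall()
print(residencies)
```

In this made-up data alice is a resident of two countries, so she appears in two result rows.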
Real-World Complex Relational Database Example
Let's take a look at a real-world example: a media database.
At a trivial level, a media database may contain people, who may be actors, directors, producers, and so on. We also have movies and TV shows. We know that many actors appear in many movies, so we'll need a mapping table. We also know that many actors appear in many TV shows, so we'll need another mapping table to keep track of that.
Wait a second, we need to get more granular for TV shows. We really need to get down to the episode level, since actors may guest star in only one episode, or they may be regulars in multiple seasons.
What if you're Eddie Murphy and you're in a movie as multiple characters? We haven't even covered tracking directors, producers, assistants. How can we easily query for all people involved in a single movie if that data is spread across many tables? I don't have enough room in the slide to draw out the nightmare that this will turn into.
Correctly modeling this media database can't be done in an iterative fashion; you either need to get it right the first time or you're stuck with a lot of data migrations and workarounds just to make things “sort-of” work. Every small detail you missed can turn into a new mapping table. Relational tables are too rigid for this kind of data, and that rigidity makes it difficult to get any work done.
Graph Database Terminology
OK, so relational in the real world is a pain, and graph databases provide a suitable alternative. Let's begin by taking a look at how a graph database works.
We're going to have to learn a few new concepts before we get started. The first element of a graph is called a vertex. A vertex represents a “thing” in the world. In our media database, it may be an actor (like Jean-Claude Van Damme) or a movie (like Timecop). Each real-world thing gets a vertex in our database.
The next element is called an edge. An edge represents a relationship between two vertices. For instance, we may add an edge between “Van Damme” and “Timecop”.
An edge has a label that describes the relationship; on the edge we just created, we'll use the label acted in. Edges also have direction, indicated by the arrow. The direction lets us more clearly express the relationship; you can see that Jean-Claude acted in Timecop, not the other way around. Here we'd say the edge comes out of Jean-Claude and into Timecop.
One thing that's nice about a graph is that the relationships between two vertices can always be many-to-many. We don't need to worry about making extra tables to manage the more complex relationships. For instance, if Van Damme had both acted in and directed Bloodsport, we'd have two edges.
On both vertices and edges we can store additional information, called properties. Let's take a look at Jean-Claude: he has four properties set on his vertex. We can see his status is amazing and his charm is infinite. We already saw a special instance of a property: the label on our acted in edge.
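To make the terminology concrete, here is a rough sketch of vertices, edges, labels, direction, and properties as plain Python data structures. This is not how Titan or any graph database stores data internally; the Vertex and Edge classes are hypothetical names invented for this illustration, and the directed edge is the "what if" from the text above, not a fact.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str                                   # kind of thing, e.g. "actor"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str                                   # describes the relationship
    out_vertex: Vertex                           # vertex the edge comes out of
    in_vertex: Vertex                            # vertex the edge goes into
    properties: dict = field(default_factory=dict)


jcvd = Vertex("actor", {"name": "Jean-Claude Van Damme"})
bloodsport = Vertex("movie", {"name": "Bloodsport", "year": 1988})

# Two edges between the same pair of vertices: no mapping table required.
edges = [
    Edge("acted_in", jcvd, bloodsport),
    Edge("directed", jcvd, bloodsport),   # hypothetical, as in the text
]

labels = [
    e.label for e in edges
    if e.out_vertex is jcvd and e.in_vertex is bloodsport
]
print(labels)
```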
Gremlin Code Example
Let's take a look at a short code sample using the Gremlin Query Language. Lines 1 and 3 are basic setup code to create a graph in our graph console. On line 5 we create a new vertex with the label actor and the name "jean claude van damme". On lines 8 and 11 we create vertices for two movies, Kickboxer and Bloodsport. On lines 14 and 15 we create edges from Van Damme to his two movies to indicate he acted in them, brilliantly.
 1  graph = TitanFactory.build().set('storage.backend', 'inmemory').open()
 2
 3  g = graph.traversal()
 4
 5  jcvd = graph.addVertex(label, "actor",
 6                         "name", "jean claude van damme")
 7
 8  kick = graph.addVertex(label, "movie", "name", "Kickboxer",
 9                         "year", 1989)
10
11  blood = graph.addVertex(label, "movie", "name", "Bloodsport",
12                          "year", 1988)
13
14  jcvd.addEdge("acted_in", kick)
15  jcvd.addEdge("acted_in", blood)
Let's take a second to summarize what we've seen so far:
• Graph databases don't define tables
• Graph databases use vertices to represent real world objects
• Graph databases use edges to represent relationships
• Adding new objects and relationships is easy: we simply create new vertices and edges
Now that we've got our data stored in a graph, how do we access it?
First we'll look at how we can select vertices that match certain criteria. Like a relational database, it's easy to select a single object. In the first example, we're finding a vertex based on its internal ID. In the second example, we're looking for a vertex with a name property that has the value "jean claude van damme". We can select vertices based on any of their properties, and just like a relational database, it's going to be faster if we have an index. The last example shows selecting all vertices matching a range query. As you can see, there’s lots of flexibility in finding vertices.
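The three kinds of lookup described above (by ID, by property value, and by range) can be sketched in plain Python. This is a stand-in for the idea, not Gremlin syntax: the helper names by_id and has, the numeric IDs, and the name_index dictionary are all assumptions made for this illustration.

```python
# Hypothetical vertex store keyed by an internal ID.
vertices = {
    1: {"label": "actor", "name": "jean claude van damme"},
    2: {"label": "movie", "name": "Kickboxer", "year": 1989},
    3: {"label": "movie", "name": "Bloodsport", "year": 1988},
    4: {"label": "movie", "name": "Timecop", "year": 1994},
}


def by_id(vertex_id):
    """Select a single vertex by its internal ID."""
    return vertices[vertex_id]


def has(key, predicate):
    """Select the IDs of all vertices whose property `key` satisfies `predicate`.

    This scans every vertex, which is what happens without an index.
    """
    return [
        vid for vid, props in vertices.items()
        if key in props and predicate(props[key])
    ]


# A toy index on name: one dictionary lookup instead of a full scan.
name_index = {props["name"]: vid for vid, props in vertices.items()}

first = by_id(2)["name"]
named = has("name", lambda v: v == "jean claude van damme")
eighties = has("year", lambda y: 1980 <= y < 1990)
indexed = name_index["Kickboxer"]
print(first, named, eighties, indexed)
```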
Now that we've selected a vertex, what can we do with it? This is where traversals come in. A graph traversal is when we follow the edges of a vertex. In this example, we're starting at Jean-Claude. We issue the command "out", which tells the database we want to follow all the out edges to the other vertices. Once this step is completed, the state of our traversal is that we're simultaneously visiting all of the resulting vertices. Any further traversals would exhibit the same behavior.
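Here is a rough plain-Python sketch of what following out edges means. It is a toy model, not the Gremlin API: the functions out and in_ and the numeric IDs are hypothetical. The key idea is that a step takes the whole list of vertices we're currently visiting and returns the whole list we visit next.

```python
# Hypothetical IDs: 1 = Jean-Claude, 2 = Kickboxer, 3 = Bloodsport.
# Each edge is (out vertex, label, in vertex).
edges = [
    (1, "acted_in", 2),
    (1, "acted_in", 3),
]


def out(current, label=None):
    """Follow outgoing edges from every vertex in `current`."""
    return [
        dst
        for vertex in current
        for (src, lbl, dst) in edges
        if src == vertex and (label is None or lbl == label)
    ]


def in_(current, label=None):
    """Follow incoming edges backwards from every vertex in `current`."""
    return [
        src
        for vertex in current
        for (src, lbl, dst) in edges
        if dst == vertex and (label is None or lbl == label)
    ]


movies = out([1])          # start at Jean-Claude, follow all out edges
back_again = in_(movies)   # the next step applies to both movies at once
print(movies, back_again)
```

In this sketch, stepping back from both movies reaches Jean-Claude twice, once per movie, because the step is applied to every vertex being visited.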
Alternatively, we can also visit the neighboring edges. This is useful if we want to filter by a property on the edge before completing the traversal to the vertices.
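As a sketch of stopping on the edges first, the plain-Python model below filters on an edge property before moving on to the vertices. The helper names out_e and in_v, the role property, its values, and the third movie are all invented for illustration.

```python
# Each edge is a dict; "role" is a hypothetical edge property.
edges = [
    {"out": "jcvd", "label": "acted_in", "in": "Kickboxer", "role": "lead"},
    {"out": "jcvd", "label": "acted_in", "in": "Bloodsport", "role": "lead"},
    {"out": "jcvd", "label": "acted_in", "in": "Some Other Movie", "role": "cameo"},
]


def out_e(vertex, label):
    """Visit the outgoing edges of `vertex` that carry `label`."""
    return [e for e in edges if e["out"] == vertex and e["label"] == label]


def in_v(edge_list):
    """Complete the traversal: move from each edge to the vertex it points into."""
    return [e["in"] for e in edge_list]


# Filter on the edge property, then finish the hop to the movie vertices.
lead_edges = [e for e in out_e("jcvd", "acted_in") if e["role"] == "lead"]
lead_movies = in_v(lead_edges)
print(lead_movies)
```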
Here are a few examples of traversals. You can see we can mix filtering operations with traversing, giving us very flexible querying. Aside from filtering, you can also do complex operations, like aggregations, sorting, and limiting.
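To illustrate mixing traversal with filtering, aggregation, sorting, and limiting, here is one more plain-Python sketch. It mimics the shape of such queries rather than using Gremlin itself; the vertex keys and the out helper are hypothetical.

```python
vertices = {
    "jcvd": {"label": "actor", "name": "jean claude van damme"},
    "kick": {"label": "movie", "name": "Kickboxer", "year": 1989},
    "blood": {"label": "movie", "name": "Bloodsport", "year": 1988},
    "time": {"label": "movie", "name": "Timecop", "year": 1994},
}
edges = [
    ("jcvd", "acted_in", "kick"),
    ("jcvd", "acted_in", "blood"),
    ("jcvd", "acted_in", "time"),
]


def out(vertex_id, label):
    """Follow the out edges with `label` from a single vertex."""
    return [dst for (src, lbl, dst) in edges if src == vertex_id and lbl == label]


# Traverse first, then work with the vertices we landed on.
movies = [vertices[v] for v in out("jcvd", "acted_in")]

# Filter: only movies from the 1980s.
eighties = [m["name"] for m in movies if 1980 <= m["year"] < 1990]

# Sort by year, newest first, then limit to two results.
newest_first = sorted(movies, key=lambda m: m["year"], reverse=True)
latest_two = [m["name"] for m in newest_first[:2]]

# Aggregate: how many movies did we reach?
movie_count = len(movies)
print(eighties, latest_two, movie_count)
```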
About Gremlin Query Language
Similar to how SQL is the standard for relational databases, the TinkerPop stack is the standard in the graph world. The language you've seen is Gremlin, an expressive, powerful language for writing graph traversals. TinkerPop is currently incubating as an Apache project. To learn more, see the SQL-to-Gremlin tutorial and the documentation for the Gremlin Query Language.
I've been doing all of my examples with Titan. Titan is a massively scalable, open source graph database which can leverage several backends, one of which is Apache Cassandra. By leveraging Apache Cassandra, we can store billions of vertices and edges.
The next evolution of this technology is the graph we're building as part of DataStax Enterprise. We'll be tightly coupling the graph design to the Cassandra data model, leveraging the integrated search as well as integrated Spark for global graph analytics. We're really excited to get this rolled out and see what people can build.