Loading Data Using DSE Graph Loader

DSE Version: 6.0

Video

In this section, you will learn all about using the DataStax Enterprise Graph Loader. The DSE Graph Loader is a customizable command-line utility that’s great for loading small to medium size graph datasets.

Transcript: 

In this section, we’ll be talking all about using the DSE Graph Loader.

As discussed in previous sections, these are the ways to create and populate new and existing graphs. For this section, we’re going to talk about using the DataStax Graph Loader.

The DSE Graph Loader is a customizable command-line utility that’s great for loading small to medium size graph datasets. It can load data from a variety of sources, including flat files and databases through a JDBC connection.

It is a Groovy-based tool, so you can incorporate transformations into your data mappings; however, this method is deprecated and should not be used.

This is an overview of how the DSE Graph Loader works.

First, a user must create a loading script in groovy. This script tells the graph loader about preferred configurations, data sources, and data mappings. The Graph Loader will use the mapper script to read data from the respective data sources and load into DSE Graph.

There are 3 data processing stages that the DSE Graph Loader goes through.

First, there’s the preparation stage. This is where the Graph Loader will validate the data from the data sources, check to make sure you have the proper schema in place, and of course, validate the loading script for any possible issues.

Then, the Graph Loader will begin loading new vertices or retrieving existing ones, caching them for use in the third stage, which is edge and property loading.

The edge and property loading stage is generally the most time-consuming. When loading multi-properties, the schema should reflect that fact to avoid upserts.

Let’s talk about all the different data sources and file formats you can use with the DataStax Graph Loader.

You have support for all the formats you would expect: CSV, JSON, and delimited text, as well as graph-specific formats such as GraphSON, GraphML, and Gryo.

What’s also interesting is that DSE Graph Loader can load data directly from JDBC-compatible databases, and this includes other graph databases such as Titan, JanusGraph, and Neo4j.

Next, we’re going to be discussing loading script design for the DSE Graph Loader. As we talked about previously, the loading script defines the Graph Loader configuration, the data sources, and the data mappings.

And because it’s a Groovy-based tool, it also supports Groovy-based transformations, though this functionality is deprecated.

The address and graph options are straightforward and required.

create_graph enables creation of a new graph by default.

If you want performance, you should be loading data into a graph with well-defined schema that has indexes to support useful lookups during data loading. Therefore, create_schema should be set to false.

Another performance-related optimization is possible with load_new = true. If you are loading a new dataset, DSE Graph Loader will rely on this option and will not waste time verifying if a vertex or edge already exists in the graph; it will assume to-be-imported graph elements are new and have unique IDs. There is also a finer-grained mechanism to specify existence of specific graph elements when defining a data mapping (covered later in this presentation).

There are many other options for customizations so please take a look at the documentation for an expansive list.

Here are a few parameters you can adjust for increasing (or decreasing) performance.

read_threads is the number of threads used to read data from the data input.

load_vertex_threads is the number of threads used for loading vertices. By default it’s set to 0, which translates to the number of physical cores divided by 2.

load_edge_threads is the number of threads used for loading edges and properties. It defaults to 0, which translates to the number of nodes in the data center times 6.

Now that we know some of the important configurations for the DataStax Graph Loader, let’s put them to good use.

Here’s how you’d define configurations within your loading script. Notice the config declaration, followed by the parameter and the value. You can also see an alternative syntax on the bottom, using a comma to separate parameters on a single line.
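The slide itself isn’t reproduced in the transcript, so here is a minimal sketch of both config syntaxes, using option names discussed above:

```groovy
// One option per config declaration
config create_schema: false
config load_new: true

// Alternative syntax: multiple options on a single line, comma-separated
config create_schema: false, load_new: true, load_vertex_threads: 4
```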

Once we have our configurations defined, it’s time to define our input data sources. You can see that from the File method, we can specify which type of file format we’re loading and whether or not there’s a specific delimiter or header associated with it.
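As a sketch of what such input definitions might look like; the file names and fields are hypothetical, and the exact chaining of methods should be checked against the documentation:

```groovy
// CSV input with an explicit delimiter and header (hypothetical file and fields)
personInput = File.csv("persons.csv").delimiter("|").header("personId", "name", "age")

// JSON input needs no delimiter or header
reviewInput = File.json("reviews.json")
```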

Now if your input data is one of the three main graph file formats, the syntax is a little different. We call the Graph method, specify the filename, and then the type of graph format we’re reading from.
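Following that description, a sketch might look like the following; the method names here are assumptions based on the transcript, so verify them against the documentation:

```groovy
// Read a GraphSON file: Graph method, filename, then the graph format
movieGraphInput = Graph.file("movie_graph.json").graphson()
```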

Lastly, here’s how we can load data from a JDBC compatible database.

You first need to create a database connection, which we can do by calling the database method and providing a connection string, database type, and authentication credentials.

Then, we can use that database connection to query for our data in SQL format.
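As a sketch only, with hypothetical connection details and method names (consult the documentation for the exact builder API):

```groovy
// Create the connection: connection string, database type, and credentials (all placeholders)
moviesDB = Database.connect("jdbc:mysql://localhost:3306/movies", "mysql", "dse_user", "secret")

// Query the connection in SQL to produce an input dataset
movieInput = moviesDB.query("SELECT movieId, title, year FROM movies")
```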

Once we have our inputs defined, we can then use the transformation API. Please note that this is deprecated and that this is for informational purposes only.

The main methods you have access to are the map, flatmap and filter transformations - each will create a new dataset that’s formed by applying a function on each element of the input dataset.

With a map, there’s a one to one correspondence between input and output elements.

With a flatmap, it’s a one to many correspondence.

As you would expect with filter, the new dataset is formed by the elements in the input dataset that return true with a function, f.

Here’s an example on how to create transformations in your loading script.

As you can see, we define our input, then filter on it to keep users who are over the age of 21. Then we use the map method to add a “u” in front of the userid value so for example, 25 would become u25.
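The slide’s code isn’t included in the transcript; a minimal sketch of that filter-and-map chain, with hypothetical input and field names (and recall that this API is deprecated), might be:

```groovy
userInput = File.csv("users.csv").delimiter(",").header("userId", "age")

users = userInput
    .filter { row -> row["age"].toInteger() > 21 }            // keep users over 21
    .map { row -> row["userId"] = "u" + row["userId"]; row }  // e.g., 25 becomes u25
```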

In this example, flatMap adds "m" in front of a movie id (e.g., 44 becomes m44). Next, it splits the genres literal into a list of genres and modifies each element of the list to be a map with fields movieId and genre.

This returns a new belongsTo dataset created by expanding lists returned by the lambda function. This new dataset can be later mapped to edges in a graph.
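A sketch of that flatMap, again with hypothetical field names and the deprecated API:

```groovy
belongsTo = movieInput.flatMap { row ->
    // Split the genres literal and emit one map per genre; movie id 44 becomes m44
    row["genres"].split(",").collect { genre ->
        [movieId: "m" + row["movieId"], genre: genre]
    }
}
```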

This is how you map input data to graph constituents. Simply put, you call the load method, which takes input data, and specify whether we’re loading vertices or edges. The asVertices or asEdges method takes a mapper input which we’ll discuss next.

Before we create our mappings, let’s get introduced to the mapping API.

Each vertex or edge must have a label - use label or labelField.

A key is what uniquely identifies a graph element; it is not necessarily a vertex or edge id.

vertexProperty and property can map fields to properties. They do not need to be used explicitly if the field and property names are the same, because any data field becomes a vertex or edge property by default.

Similarly, value may not need to be used explicitly.

Use ignore to exclude fields that would otherwise become properties by default.

isNew and exists provide a finer-grained mechanism to specify the existence of specific graph elements when defining a data mapping. They override the behavior specified by the load_new configuration option.

inV and outV help to define neighbors of a vertex.

inE and outE help to define incident edges of a vertex.

inV and outV (with different signatures) help to define endpoints of an edge.

There are more methods - please consult the documentation.

Let’s put what we’ve just learned to good use. Here’s an example of mapping our person data to vertices.

We call the load method, which takes our persons data input. Since we’re loading vertices, we use the asVertices method and begin to map our data.

The label of this vertex is Person and the key within the input data is personId. Both examples here are valid; they just use slightly different syntax.
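Based on that description, the two variants might look like this (the label and field names are illustrative):

```groovy
// Block syntax
load(personInput).asVertices {
    label "Person"
    key "personId"
}

// Single-line syntax
load(personInput).asVertices { label "Person"; key "personId" }
```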

Here’s an example of mapping actors to edges. As you can see, this mapping is a little more in depth because we need to tell the Graph Loader which adjacent vertices to create an edge between.

Because these edges are new, we can make the Graph Loader a little more efficient by calling the isNew() method. Because we’re loading new edges to existing vertices, we can add the exists() method to our vertex mappings, again for optimization.

Both examples do the same thing with different syntax and both are valid.
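A sketch of such an edge mapping, with hypothetical labels and field names; note isNew() on the new edges and exists() on the existing endpoint vertices:

```groovy
load(actorInput).asEdges {
    label "actor"
    isNew()              // edges are new; skip existence checks
    outV "personId", {
        label "Person"
        key "personId"
        exists()         // endpoint vertices already exist
    }
    inV "movieId", {
        label "Movie"
        key "movieId"
        exists()
    }
}
```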

Now that we’ve learned how to create our loading script, it’s time to learn how to run the graph loader.

All you need to do is run the graphloader utility, specify your loading script and any additional configuration. It’s also worth noting that command line options can be overridden by loading script options.
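An invocation might look like the following; the script name, graph name, and address are placeholders:

```shell
graphloader actorMapping.groovy -graph movies -address 10.0.0.1
```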

Make sure you’re using the DataStax Graph Loader that matches your DataStax Enterprise version.

It’s time to put your new graph loader knowledge to use with this hands on exercise.
