Introduction to Graph Analytics

DSE Version: 6.0


Welcome to the Graph Analytics with DataStax Enterprise Graph course. Graph analytics is about delivering value from your graph data by combining and applying well-known analysis steps into a data analysis workflow, with the goal of doing data-driven decision making, information discovery, graph exploration, and all kinds of ad-hoc analysis.


Hi, I am Artem Chebotko and this is Introduction to Graph Analytics with DataStax Enterprise Graph.

It is a short course but it relies on many, many concepts covered in two prerequisite courses. Please make sure you have prior knowledge of the property graph data model, graph schema, Gremlin traversal language, DSE Graph architecture, DSE Analytics offering, Spark-Cassandra Connector, SparkSQL and DataFrames. You can review these and other prerequisite topics in our DataStax Enterprise Graph and DataStax Enterprise Analytics with Apache Spark courses.


We begin with some important preliminary notions about Graph Analytics.

Graph analytics is about delivering value from your graph data by combining and applying well-known analysis steps into a data analysis workflow, with the goal of doing data-driven decision making, information discovery, graph exploration, and all kinds of ad-hoc analysis.

Such data analysis workflows are built using well-understood steps and algorithms like classification, clustering, regression, similarity matching and so forth.

These algorithms are frequently studied in one or more related disciplines, such as data engineering, data mining, statistics, and visualization.

In general, graph analytics is a very broad (and complex) field that requires knowledge and understanding of many disciplines. For example:

  • Graph analytics benefits from data engineering and processing to extract, acquire, and prepare data

  • Graph analytics benefits from data mining to find patterns and build (predictive, causal, descriptive) models

  • Graph analytics benefits from statistics to analyze data characteristics

  • Graph analytics benefits from visualization to improve sharing, reporting, and evaluation

Many real-life problems and datasets are naturally represented as networks, webs, or graphs. They also primarily focus on connections, links, relationships, and dependencies. Complex relationships, such as those captured in the domains of Customer 360, Recommendations, Fraud Detection, and so forth, are best modeled and traversed using graphs.

Of course, this is not a complete list of graph analytics applications, but it is quite representative of how DSE Graph is used.

It is useful to distinguish between real-time and batch analytics and understand how those are supported by DSE Graph. 

Some analytical applications have real-time, user-facing components that require fast responses. Examples would be recommendations, personalization, and fraud prevention. It is quite possible that a recommendation system has batch analytics components, too. It may precompute recommendations overnight, but eventually the recommendations must be served in real time by a recommender subsystem.

Similarly, you may have a real-time fraud-prevention component that makes a decision about a current transaction and a batch-analytics fraud-detection component that analyzes many historical transactions to find fraud that was committed in the past.

In DSE Graph, in some cases, real-time analytical steps can be implemented using Gremlin OLTP traversals with response times measured in milliseconds or, in the case of infrequent traversals, in seconds. Alternatively, one can implement near real-time analytics in DSE Graph using Gremlin OLAP traversals. For near real-time analytics, we expect response times measured in seconds.

Batch analytics should always rely on DSE Graph OLAP capabilities. Those are Gremlin OLAP traversals and DSE GraphFrames. The two approaches complement each other in their capabilities.

Gremlin OLAP traversals are automatically translated into Spark code and executed as Spark jobs. They are suitable for long-running traversals that are computationally expensive and memory-intensive.

DSE GraphFrames, while they also have limited support for Gremlin, are most frequently valued for their ability to perform non-Gremlin graph data manipulation via Spark: combining graph and non-graph data for analysis, ingesting a data stream into a graph, and performing bulk data mutations.

To decide between Gremlin OLAP traversals and DSE GraphFrames, it is useful to distinguish between deep and scan queries.

Deep queries are targeted traversals that hit a large number of vertices due to a high graph density and longer traversal paths. Gremlin OLAP traversals will do best here.

Scan queries are traversals that touch either an entire graph or its large subgraphs. DSE GraphFrames will be a better choice for scan queries.
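To make the distinction concrete, here is a minimal Python sketch on a toy in-memory graph. It is not DSE Graph code: the adjacency dictionary, the vertex names, and the helper functions `deep_query` and `scan_query` are hypothetical and only illustrate the two access patterns. The deep query starts at one vertex and follows edges for several hops, while the scan query visits every vertex once.

```python
from collections import Counter, deque

# Toy directed graph: vertex -> (label, outgoing neighbors).
# All names are illustrative assumptions, not DSE Graph data.
GRAPH = {
    "alice": ("person", ["bob", "carol"]),
    "bob": ("person", ["carol", "laptop"]),
    "carol": ("person", ["dave", "phone"]),
    "dave": ("person", ["laptop"]),
    "laptop": ("product", []),
    "phone": ("product", []),
}


def deep_query(graph, start, max_hops):
    """Targeted traversal: vertices reachable from `start` within `max_hops`."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[vertex][1]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    seen.discard(start)
    return seen


def scan_query(graph):
    """Scan: touch every vertex once and count vertices per label."""
    return Counter(label for label, _ in graph.values())


reachable = deep_query(GRAPH, "alice", max_hops=2)
label_counts = scan_query(GRAPH)
print(sorted(reachable))
print(dict(label_counts))
```

In this toy model the cost of the deep query depends on how densely the start vertex is connected and on how many hops it follows, whereas the cost of the scan query depends only on the size of the graph.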

We cover both Gremlin OLAP traversals and DSE GraphFrames in much greater detail in separate videos.

As a result, in DSE Graph, there are two OLAP engines with different APIs and capabilities. These engines rely on Apache Spark™ but use different data abstractions.

The SparkGraphComputer engine is the primary engine for executing Gremlin OLAP traversals, which we already know are great for deep queries. 

The DSE GraphFrames engine is the primary engine for executing DSE GraphFrames queries, which are a good choice for scan queries.

The two OLAP engines are starting to converge via Intelligent Analytics Query Routing.

If you have a Gremlin OLAP traversal that performs a simple scan query, the SparkGraphComputer engine will automatically delegate execution to the DSE GraphFrames engine.

Currently, intelligent OLAP query routing works for count, groupCount, and drop queries that involve no more than three hops and use steps like has(), hasLabel(), out(), in(), both(), and outE(). The feature will keep evolving, so always check the documentation for the most recent list of requirements.
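The requirements just listed can be read as a simple checklist. The sketch below models them in plain Python. It illustrates the rules as stated in this video and is not the actual DSE implementation: the function name `is_routable`, the representation of a traversal as a list of step names following V(), and the assumption that the allowed steps are exactly the six named above are all hypothetical. The video says "steps like", so the real list may well be longer.

```python
# Hypothetical model of the routing requirements described in the video.
HOP_STEPS = {"out", "in", "both", "outE"}
FILTER_STEPS = {"has", "hasLabel"}
TERMINAL_STEPS = {"count", "groupCount", "drop"}
MAX_HOPS = 3


def is_routable(steps):
    """Return True if the step names (those after V()) fit the stated rules."""
    # The traversal must end in count, groupCount, or drop.
    if not steps or steps[-1] not in TERMINAL_STEPS:
        return False
    body = steps[:-1]
    # Every other step must come from the (assumed) allowed set.
    if any(step not in HOP_STEPS | FILTER_STEPS for step in body):
        return False
    # No more than three hops are allowed.
    hops = sum(1 for step in body if step in HOP_STEPS)
    return hops <= MAX_HOPS


simple_scan = ["hasLabel", "out", "count"]
too_deep = ["out", "out", "out", "out", "count"]
unsupported = ["hasLabel", "repeat", "count"]
print(is_routable(simple_scan), is_routable(too_deep), is_routable(unsupported))
```

A traversal that fails any one of the three checks would, in this model, simply stay on the SparkGraphComputer engine.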

Intelligent Analytics Query Routing is an automatic optimization feature that is implemented as a DseGraphFrameInterceptorStrategy. Therefore, if for any reason you need to disable rerouting for a particular traversal, it is quite easy to do by disabling this strategy.
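In Apache TinkerPop, a traversal source carries a set of strategies, and its `withoutStrategies(...)` method returns a source that no longer applies the named ones. The toy Python model below sketches that idea under stated assumptions: `ToyTraversalSource`, `without_strategies`, and `engine_for` are hypothetical names invented for illustration, and only the strategy name `DseGraphFrameInterceptorStrategy` comes from the video. Check the DSE documentation for the exact syntax to use in a real traversal.

```python
# Toy model only: class and method names are hypothetical, not DSE or TinkerPop APIs.
INTERCEPTOR = "DseGraphFrameInterceptorStrategy"


class ToyTraversalSource:
    """Holds the set of optimization strategies applied to traversals."""

    def __init__(self, strategies):
        self.strategies = frozenset(strategies)

    def without_strategies(self, *names):
        """Return a new source with the named strategies removed."""
        return ToyTraversalSource(self.strategies - set(names))

    def engine_for(self, is_simple_scan):
        """Pick the engine that would run an OLAP traversal in this model."""
        if is_simple_scan and INTERCEPTOR in self.strategies:
            return "DSE GraphFrames"
        return "SparkGraphComputer"


g = ToyTraversalSource({INTERCEPTOR})
g_no_routing = g.without_strategies(INTERCEPTOR)
print(g.engine_for(is_simple_scan=True))
print(g_no_routing.engine_for(is_simple_scan=True))
```

Because the original source is left unchanged, rerouting is disabled only for traversals spawned from the new source, which matches the idea of turning it off for a particular traversal.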
