Apache Spark Streaming


What is Spark Streaming?

Spark Streaming is an extension of the core Spark API. It allows us to easily integrate real-time data from disparate event streams (Akka Actors, Kafka, S3 directories, and Twitter, for instance) into event-driven, asynchronous, scalable, type-safe and fault-tolerant applications. This is a huge difference from the single-threaded, procedural data analysis and computation of the past decade or earlier. APIs like Cascading for Java and Scalding for Scala allowed only batch workflows, though they did let us keep working with our own existing APIs. Spark Streaming takes this to a whole new level, adding the ability to do both batch and streaming in the same application. No other framework offers this in one lightweight API that is also so easy to use.

Continue Learning with DS320: DataStax Enterprise Analytics with Apache Spark

 

When Would You Use Spark Streaming?

Spark with Spark Streaming is a perfect fit for any real-time data statistics and response. Here are just some of its use cases:

  • Real-time adaptive analysis, statistics and response: sensors, alarms, cluster or fleet management, security / cyber security, telemetry, diagnostics

  • Internet of Things, finance, online ads / conversion, supply chain, campaigns

Making this even more powerful, you can easily add Spark MLlib machine learning pipelines to the streaming data pathways, as sketched below.
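As a minimal sketch of that idea, the snippet below feeds DStreams into one of MLlib's streaming algorithms, StreamingLinearRegressionWithSGD. The HDFS paths and the feature count are illustrative assumptions, and ssc is assumed to be an existing StreamingContext:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

// Text streams of LabeledPoint strings landing in two (hypothetical) HDFS directories
val trainingData = ssc.textFileStream("hdfs:///streaming/training").map(LabeledPoint.parse)
val testData = ssc.textFileStream("hdfs:///streaming/test").map(LabeledPoint.parse)

// A model with 3 features (purely illustrative), updated as each micro-batch arrives
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()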

 

Data Sources And Sinks

Data can originate from almost any source, including Amazon Kinesis, Kafka, Cassandra, Twitter and HDFS (as well as custom sources), and be stored in almost any sink you might want, such as a file system (S3, HDFS…) or a database (Cassandra, HBase…).

 

Primary Features of Spark Streaming

Spark Streaming enables high-throughput, reliable processing of live data streams. With it we can express sophisticated algorithms easily, using the high-level functions of the API to process the data streams. Perhaps one of its more important features is its exactly-once message guarantee, a vital component of mission-critical data flows. In contrast, Storm provides exactly-once processing only in conjunction with the Trident add-on, which is not trivial to set up. Spark keeps consistent, immutable state in the in-memory cache of its clustered, distributed nodes, and that caching and storage are easily and highly configurable, as sketched below. RDDs themselves are immutable and provide the basis of Spark's fault-tolerance model.
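For instance, a small sketch of configuring storage for a stream, assuming `stream` is an existing DStream (the chosen storage level is just an illustration):

import org.apache.spark.storage.StorageLevel

// Keep this stream's micro-batches serialized in memory, spilling to disk if the cache fills up
stream.persist(StorageLevel.MEMORY_AND_DISK_SER)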

 

DStream

A Discretized Stream, or DStream, is the base abstraction in Spark Streaming. A DStream is a stream of RDDs, a continuous sequence of micro-batches of a particular type. We can think of Spark Streaming computations as a series of micro-batch computations over set windows of time, or intervals of a set duration. DStreams are created from live data streams or by transforming existing DStreams, which makes it possible to build more complex processing models with less effort. The input data received during each interval is stored across the cluster to form an input dataset (the micro-batch) for that interval. Once the time interval completes, this dataset is processed through parallel operations such as map, reduce, window and join. As with RDDs, DStreams handle key-value pairs implicitly (reduceByKeyAndWindow, groupByKeyAndWindow), which is very useful with a Kafka stream. These operations are made implicitly available simply by importing

org.apache.spark.streaming.StreamingContext._
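As a small sketch of those pair operations, assuming `lines` is an existing DStream[String] (from Kafka, a socket, or any other source):

import org.apache.spark.streaming.StreamingContext._   // brings the key-value (pair) operations into scope

val wordCounts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // a pair operation made available by the import above

wordCounts.print()      // prints a sample of each micro-batch's results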

 

InputDStream

InputDStreams are DStreams that read raw data from external data sources, such as Kafka, into Spark Streaming. Each InputDStream[T] has a Receiver[T], where T is the data type that will be received. For Kafka, as an example, the Receiver is a

Receiver[(K, V)]

representing the stream of raw data received from streaming sources such as Kafka, Flume or Twitter. There are two types of InputDStreams. Basic InputDStreams (file systems such as an S3 bucket or an HDFS directory, socket connections, and Akka actors) are included in the spark-streaming API. More advanced InputDStreams are included in the ‘external’ module of the Spark repo and published as their own separate artifacts: Kafka, Flume, Amazon Kinesis, Twitter, ZeroMQ and others. Use any or none. You can stream from one or more Kafka topics, move the data through your Spark computations, and write the results of multiple child streams, post-computation, to any of your Cassandra tables in any keyspace (with the Spark Cassandra Connector), all with just a few lines of code.
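For the basic type, a minimal sketch, assuming an existing StreamingContext `ssc` (the host, port and directory are purely illustrative):

val socketLines = ssc.socketTextStream("localhost", 9999)      // raw text lines over a TCP socket
val fileLines   = ssc.textFileStream("hdfs:///incoming/data")  // new files landing in a directory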

A snippet of Scala code which integrates Spark Streaming, Kafka, and Cassandra:

 

import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStreams
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf(true)
    .setMaster("local[*]")
    .setAppName(getClass.getSimpleName)
    .set("spark.executor.memory", "1g")
    .set("spark.cores.max", "1")
    .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(conf)

/** Creates the Spark Streaming Context */
val ssc = new StreamingContext(sc, Seconds(1))

/** Creates an input stream that pulls messages from a Kafka broker.
  * `kafka.KafkaParams` and `topic` come from the full example's configuration (see the GitHub link below). */
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.KafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

/** Counts each word per batch interval and writes the counts to the demo.wordcount table */
stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount")

ssc.start()
ssc.awaitTermination()

 

See the full source code on GitHub

Twitter with Spark Streaming example on GitHub

Some of the ‘external’ module’s Spark Streaming InputDStream Receiver APIs:

spark-streaming-kafka, spark-streaming-flume, spark-streaming-kinesis-asl, spark-streaming-twitter, spark-streaming-zeromq, spark-streaming-mqtt

 

Window Operations

By applying windowed computations on streams with Spark Streaming, we can execute transformations over sliding windows of data. Two parameters define a windowed computation: the window length, which is the duration of the window, and the sliding interval, which is the interval at which the window operation is performed. With each slide over the DStream, the RDDs that fall within the window are combined and operated on. Here, we reduce the last 30 seconds of data every 10 seconds:

words.map(word => (word,1)).reduceByKeyAndWindow((a:Int,b:Int) => (a+b), Seconds(30), Seconds(10))

This is time series data, bucketed by interval rather than by specific calendar periods such as hour of day, or day of month or year. Both can easily be done with Spark Streaming, and because time series data is an area Cassandra excels in, particularly with time series data modeling, the two are a natural fit. We can apply a basic MapReduce operation, the word count, to Spark Streaming with windowing by modifying it to iterate over the last 30 seconds of data every 10 seconds. We still have the usual (word, 1) tuple but apply Spark Streaming’s reduceByKeyAndWindow operation on the DStream.

Common Streaming Window Operations

window(windowLength, slideInterval)
countByWindow(windowLength, slideInterval) 
reduceByWindow(func, windowLength, slideInterval)
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) 
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) 
countByValueAndWindow(windowLength, slideInterval, [numTasks])
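As a sketch of the incremental (inverse-function) variant listed above, assuming `pairs` is a DStream[(String, Int)]; this form requires checkpointing so that values leaving the window can be subtracted:

ssc.checkpoint("hdfs:///spark/checkpoints")   // illustrative checkpoint directory, required for the inverse form

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // add values entering the window
  (a: Int, b: Int) => a - b,   // subtract values leaving the window
  Seconds(30),                 // window length
  Seconds(10)                  // slide interval
)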

Other notable features and capabilities:

  • Spark Streaming fault tolerance (see the driver-recovery sketch below)

  • Spark Streaming supports Kerberos security from Spark

  • Easy connection of Spark Streaming to Hive via Spark SQL as of Spark 1.0

  • Streaming ML algorithms available as of Spark 1.1
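On the fault-tolerance point, one piece of the story is driver recovery from a checkpoint. A minimal sketch, assuming an existing SparkContext `sc` and an illustrative checkpoint directory:

import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(1))
  ssc.checkpoint("hdfs:///spark/checkpoints")
  // ... define the input streams and computations here ...
  ssc
}

// Rebuild the context (and its DStream lineage) from the checkpoint if one exists,
// otherwise create a fresh one with createContext()
val context = StreamingContext.getOrCreate("hdfs:///spark/checkpoints", createContext _)
context.start()
context.awaitTermination()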

Because of the pace of development of Spark and Spark Streaming, it is nearly future proof compared to other options such as Storm. One of the benefits of using the Spark API is that it bridges the gap for those with less experience in the domain, for instance non data scientists, letting them leverage sophisticated algorithms in a data pipeline. For data scientists, using Spark means becoming much more efficient.

Take the full Apache Spark course — DS320: Analytics with Apache Spark

Author

Helena Edelson, Senior Engineer at DataStax. Helena is a committer on several open source projects including Akka, the Spark Cassandra Connector and previously Spring Integration and Spring AMQP. She has been working with Scala since 2010 and is currently a Senior Software Engineer on the DSE Analytics team at DataStax, working with Apache Spark, Cassandra, Kafka, Scala and Akka. Most recently she has been a speaker at Big Data and Scala conferences.
 

Continue Learning with DS320: DataStax Enterprise Analytics with Apache Spark
