Data Streaming with Spark

Written by Prithvi Singh|6 min read|

Hey guys, earlier last week we talked about streaming platform known as Kafka which serves the purpose of streaming the data from the system to system.

But what happens when you want to perform operations like aggregation on the streaming data. Well, that’s where Apache Spark comes in. 

Apache Spark is an opensource framework which provides high fault tolerance, low latency computational platform. Spark streaming takes the streamed data as input and operates on them in the form of batches. With a spark, you have two ways of doing streaming:-

  • Spark Streaming
  • Structured Streaming

But before we see what they actually are first, let’s look into two data structures of Spark. Spark’s core APIs works on these data structure.

RDDs: – 

RDD stands for Resilient Distributed Dataset. RDDs are a fundamental data structure of spark. They are the immutable distributed set of objects. 

Each dataset in RDD is logically processed on different nodes in the cluster. This dataset contains java, scala, python objects based on your coding language. These RDDs are read-only, partition collection of object. They can be created either by parallelising the current dataset in spark or by reading through some external system like file system, Kafka topics, database.

Spark makes use of RDDs to perform fast and efficient MapReduce operations. Now let’s see some benefits of using RDDs.

  • Data sharing in MapReduce is slow due to replication, serialization and disk IO. 90% of the Hadoop time is spent doing HDFS read-write operations. So to avoid these bottlenecks the researchers came up with Spark. The key idea of RDD is that it supports in-memory processing. It stores the state of the memory as an object across the jobs and each job share objects between each other. And the sharing of objects is faster in-memory than network or disk.
  • The intermediate results are stored in distributed memory instead of disk. But if there isn’t sufficient space to store the intermediate results then it is stored in the disk.
  • If you are continuously querying different queries on the same dataset then it is kept in memory for better execution just like caching. By default, the RDD will be recomputed if you perform different action on it. But if you persist that RDD in memory it will be available faster for computation the next time you query it.

DataFrame: – 

A Spark DataFrame is a collection of data organised into named columns which can provide operations like filter, aggregate, group and they can also be used with Spark SQL. Conceptually, it is equivalent to the relational table with good optimization techniques.

They can be constructed using the existing RDDs, external database systems, tables in HIVE or structured data files. Let’s see some features of DataFrame.

  • Its ability to compute data of kilobytes to petabytes on a single node cluster or large cluster is remarkable.
  • It supports different data formats such as Avro, CSV, elastic search, Cassandra and storage systems like HDFS, Hive tables, MySQL etc.
  • It can be easily integrated with all the Big data tools and frameworks using Spark-core.
  • State of art optimization and code generation through spark SQL Catalyst optimizer. 

These are the underlying data structures used by a spark. But if you need to go more in-depth of these terms I will provide some reference at the end of the article.

Now let’s jump into Spark streaming and Structure streaming in a bit of detail. 

Spark Streaming: – 

Spark streaming was added as an extension of core Spark API in 2013 to provide scalable, high-throughput and fault-tolerant stream processing of live data streams. 

The data can be ingested from multiple sources like Kafka, zume, kinesis and then operations can be performed on them. These operations can be complex algorithms which are expressed as high-level functions such as a map, flat map, reduce, aggregate. Finally, the RDDs generated by Dstream are sent to different platforms like the live dashboard, file system, database etc.

Lets see goals of spark streaming:-

  • Dividing data into small batches allows fine-grained allocations of computations on different nodes. Cause in traditional approach if one partition is more computationally intensive than the others then the node handling that partition will become a bottleneck for the pipeline. But in spark streaming each task is divided among different workers so some workers perform a longer task while others perform a shorter task.
  • In case of failures, traditional systems recompute the failed data on a single node. While in spark streaming they are distributed evenly among all the nodes due to which the recovery from failures becomes faster.
  • Spark extends to rich libraries like Mlib, SQL, DataFrames. RDDs generated by DStreams can be converted into DataFrames and can be queried using SQL.

Problems with Dstreams are as follows: – 

  • Processing with event-time and latency.
  • Reasoning about end-to-end guarantees.
  • Interoperate with batch and interactive.

Structured Streaming: –

Structured Streaming is a scalable, fault-tolerant stream process engine built on the Spark SQL engine. You can express your stream processing computations in the same way you express your batch computation on static data. 

Structure streaming uses DataFrame API rather than RDD in Scala, JAVA,  python or R to express aggregations, event-time windows joins etc. The computations are executed on the same SQL engine. In shorts, Structure Streaming provides fast, scalable, fault-tolerant stream processing without the user having to reason about it.

The two main difference between spark streaming and structure streaming is that first spark streaming uses Dstream APIs while structure streaming uses DataFrame APIs. And the second is that structure streaming provides exactly-once guarantees and Spark streaming provides at-least-once guarantees.

Few benefits of using Structured streaming: – 

  • The main goal is to perform end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way.
  • Users just describe the query they want to run, input and output locations and few more extra details and the system then run the query in an incremental fashion.
  • And also it guarantees exactly once approach which is hard to achieve especially in distributed systems. What it means is that each message is processed once and only once. This avoids duplicates on a large scale. 

Wrapping up: –

So in this article, we saw things related to streaming with spark and also the data structures of the spark. In the next section, we will be implementing Spark Streaming using pyspark which is a popular library for spark with python.

For further reference: –