Spark data abstractions

In Spark, RDDs, DataFrames, and Datasets are three different abstractions for working with distributed data.

RDD is the fundamental data structure in Spark and stands for Resilient Distributed Dataset.

  • RDDs are fault-tolerant, immutable distributed collections of objects.
  • They can be created from data stored in HDFS, local file systems, or other data sources (a text-file sketch follows the example below).
  • RDDs provide low-level transformations and actions, allowing for fine-grained control over data.
  • In PySpark, RDDs are dynamically typed: they hold arbitrary Python objects, so the type of data can vary across elements.
  • RDDs are suitable when you need low-level control or want to perform complex operations not supported by higher-level abstractions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-abstractions").getOrCreate()

# RDD from a list
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Transformation: square each element (lazy, nothing runs yet)
squared_rdd = rdd.map(lambda x: x**2)
# Action: reduce triggers the computation and returns 55
sum_of_squares = squared_rdd.reduce(lambda x, y: x + y)
print(sum_of_squares)
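RDDs can also be built from files rather than in-memory collections. The sketch below assumes a local text file at a made-up path; with HDFS you would pass an hdfs:// URI instead.

# Sketch: creating an RDD from a text file (the path here is hypothetical)
lines = spark.sparkContext.textFile("/tmp/words.txt")
# Transformation: split each line into words
words = lines.flatMap(lambda line: line.split())
# Action: count the words, triggering the computation
print(words.count())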

DataFrame is a distributed collection of data organized into named columns.

  • It represents a structured and tabular data format with schema information.
  • DataFrames are built on top of RDDs, providing a more efficient and optimized way to work with structured data.
  • DataFrames allow for high-level abstractions and optimizations, including query optimization and execution plans.
  • DataFrames provide a SQL-like API for querying and manipulating data (a SQL sketch follows the example below).
  • A DataFrame's schema is checked when a query is analyzed and run, not at compile time; compile-time type checking requires the Dataset API in Scala or Java.
  • DataFrames are suitable for most general-purpose data processing and analysis tasks.
# Creating a DataFrame from a list of tuples (reuses the SparkSession created above)
data = [(1, "raj", 25), (2, "veer", 30), (3, "prag", 35)]
df = spark.createDataFrame(data, ["id", "name", "age"])
# Transformation: keep rows where age > 30
filtered_df = df.filter(df.age > 30)
# Action: count triggers the computation
count = filtered_df.count()
print(count)
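Because a DataFrame carries a schema, you can also query it with SQL and inspect the plan Spark builds for the query. The snippet below is a small sketch; the view name "people" is arbitrary.

# Sketch: the SQL side of the DataFrame API (the view name "people" is arbitrary)
df.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age > 30")
adults.show()
# explain() prints the logical and physical plans the optimizer produced
adults.explain()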

Dataset is an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. The Dataset API is available in Scala and Java; PySpark exposes only DataFrames.

  • Datasets combine the best features of RDDs and DataFrames, offering strong typing and high-level abstractions.
  • Datasets allow you to work with structured and semi-structured data while benefiting from the same optimizer and execution engine as DataFrames.
  • Datasets provide compile-time type safety for data manipulation operations, catching errors at compile time rather than at runtime (a rough PySpark stand-in is sketched after the example below).
  • Datasets are suitable when you want the benefits of strong typing and object-oriented programming, and need better performance than RDDs.
# PySpark does not expose a Dataset type, so the closest equivalent is a DataFrame built from Rows
from pyspark.sql import Row
data = [Row(id=1, name="raj", age=25), Row(id=2, name="veer", age=30), Row(id=3, name="prag", age=35)]
dataset = spark.createDataFrame(data).alias("dataset")
# Transformation: keep rows where age > 30
filtered_dataset = dataset.filter(dataset.age > 30)
# Action: count triggers the computation
count = filtered_dataset.count()
print(count)
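PySpark has no typed Dataset, but you can get some of the "declare the types up front" flavor by supplying an explicit schema when creating a DataFrame. This is only a sketch: the schema is enforced when the DataFrame is built, not at compile time.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Sketch: an explicit schema fixes the column types up front,
# the closest PySpark comes to a Scala Dataset's declared types
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
people = [(1, "raj", 25), (2, "veer", 30), (3, "prag", 35)]
typed_df = spark.createDataFrame(people, schema)
typed_df.printSchema()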
