In Spark, RDDs, DataFrames, and Datasets are three different abstractions for working with distributed data.
RDD is the fundamental data structure in Spark and stands for Resilient Distributed Dataset.
- RDDs are fault-tolerant, immutable distributed collections of objects.
- They can be created from data stored in HDFS, local file systems, or other data sources.
- RDDs provide low-level transformations and actions, allowing for fine-grained control over data.
- RDDs carry no schema: elements are arbitrary objects, and in PySpark their types are only known at runtime, so Spark cannot inspect or optimize their structure.
- RDDs are suitable when you need low-level control or want to perform complex operations not supported by higher-level abstractions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()

# RDD from a list
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Transformation: square each element (lazy, nothing runs yet)
squared_rdd = rdd.map(lambda x: x**2)
# Action: reduce triggers execution and returns the sum of squares
sum_of_squares = squared_rdd.reduce(lambda x, y: x + y)
print(sum_of_squares)  # 55
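RDDs can also be created from files, as noted above. The sketch below loads a text file and runs a classic low-level word count; the HDFS path is a placeholder, not a real dataset.
# Hypothetical path; replace with a real HDFS or local file location
lines_rdd = spark.sparkContext.textFile("hdfs:///data/example.txt")
# Low-level pipeline: split lines into words, pair each with 1, sum per word
word_counts = (lines_rdd
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(word_counts.take(5))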
DataFrame is a distributed collection of data organized into named columns.
- It represents a structured and tabular data format with schema information.
- DataFrames are built on top of RDDs, providing a more efficient and optimized way to work with structured data.
- DataFrames allow for high-level abstractions and optimizations, including query optimization and execution plans.
- DataFrames provide a SQL-like API for querying and manipulating data.
- DataFrames are not statically typed: columns and their types are resolved at runtime against the schema, and it is the schema that enables the optimizer to plan queries efficiently; compile-time type safety is what the Dataset API adds on top.
- DataFrames are suitable for most general-purpose data processing and analysis tasks.
# Creating a DataFrame from a list of tuples with explicit column names
data = [(1, "raj", 25), (2, "veer", 30), (3, "prag", 35)]
df = spark.createDataFrame(data, ["id", "name", "age"])
# Transformation: keep only rows with age > 30 (lazy)
filtered_df = df.filter(df.age > 30)
# Action: count triggers execution
count = filtered_df.count()
print(count)  # 1
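To illustrate the SQL-style API and the query optimization mentioned above, here is a minimal sketch: the view name "people" is arbitrary, and explain() simply prints the plan the optimizer produced for the filter.
# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("people")
sql_result = spark.sql("SELECT name FROM people WHERE age > 30")
sql_result.show()
# Inspect the physical plan produced by the Catalyst optimizer
filtered_df.explain()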
Dataset is an extension of the DataFrame API, providing a type-safe, object-oriented programming interface; it is available in Scala and Java, while in Python the untyped DataFrame fills this role.
- Datasets combine the best features of RDDs and DataFrames, offering strong typing and high-level abstractions.
- Datasets allow you to work with structured and unstructured data, benefiting from the optimization and query engine of DataFrames.
- Datasets provide compile-time type safety for data manipulation operations, catching errors at compile-time rather than runtime.
- Datasets are suitable when you want the benefits of strong typing and object-oriented programming, and need better performance than RDDs.
# PySpark has no typed Dataset API; the closest equivalent is a DataFrame of Row objects
from pyspark.sql import Row

data = [Row(id=1, name="raj", age=25), Row(id=2, name="veer", age=30), Row(id=3, name="prag", age=35)]
dataset = spark.createDataFrame(data)
# Transformation: keep only rows with age > 30
filtered_dataset = dataset.filter(dataset.age > 30)
# Action: count triggers execution
count = filtered_dataset.count()
print(count)  # 1
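Because the typed Dataset API is not exposed in Python, the closest PySpark gets to combining RDD flexibility with DataFrame optimization is dropping down to the underlying RDD of Row objects and converting back; a rough sketch under that assumption:
# Every DataFrame is backed by an RDD of Row objects
rows_rdd = filtered_dataset.rdd
# Arbitrary Python logic at the RDD level
names = rows_rdd.map(lambda row: row.name.upper())
print(names.collect())
# Convert back to a DataFrame to regain a schema and optimizer support
names_df = spark.createDataFrame(names.map(lambda n: Row(name=n)))
names_df.show()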