Spark data abstractions
In Spark, RDDs, DataFrames, and Datasets are three abstractions for working with distributed data. The RDD (Resilient Distributed Dataset) is Spark's fundamental data structure: an immutable, partitioned collection of records that is processed in parallel. A DataFrame is a distributed collection of data organized into named columns, much like a table in a relational database. A Dataset extends the DataFrame API with a type-safe, object-oriented programming interface, available in Scala and Java; since Spark 2.0, a DataFrame in Scala is simply a Dataset[Row].
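To make the distinction concrete, here is a minimal Scala sketch that runs the same filter through each of the three abstractions. The `Person` case class, the sample records, the age threshold, and the `AbstractionsSketch` object name are assumptions made up for this illustration. It also assumes Spark 2.x or later with the `spark-sql` dependency on the classpath and a local master.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object AbstractionsSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a real job would get its master from the cluster.
    val spark = SparkSession.builder()
      .appName("abstractions-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables toDF(), as[T], and the $"col" syntax

    val people = Seq(Person("Ana", 34), Person("Bo", 19))

    // RDD: a distributed collection of plain objects, with no schema attached.
    val rdd: RDD[Person] = spark.sparkContext.parallelize(people)
    val adultsRdd: RDD[Person] = rdd.filter(_.age >= 21)

    // DataFrame: rows with named columns; a misspelled column name fails at runtime.
    val df: DataFrame = people.toDF()
    val adultsDf: DataFrame = df.filter($"age" >= 21)

    // Dataset: the same data with a static type; a misspelled field fails at compile time.
    val ds: Dataset[Person] = df.as[Person]
    val adultsDs: Dataset[Person] = ds.filter(_.age >= 21)

    adultsRdd.collect().foreach(println)
    adultsDf.show()
    adultsDs.show()

    spark.stop()
  }
}
```

The three filters express the same logic. What changes is how much Spark knows about the data (nothing for the RDD, a schema for the DataFrame, a schema plus a JVM type for the Dataset) and whether mistakes surface at compile time or at runtime.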