Some teams run lean Spark workloads while other teams doing similar work use 2-3x more compute and their jobs run 5-10x longer. There is no rocket science behind this; the lean teams simply follow a few basic steps consistently.
Minimize data shuffling: Partition your data on columns that are frequently used for joins so that matching rows are co-located and joins trigger less shuffling.
Avoid compute waste: Leverage Spark’s caching mechanism to avoid repeating expensive computations. Cache RDDs or DataFrames that are reused in multiple stages of the job. Remember to cache only if you plan to reuse the result, and keep in mind that cache() is lazy: the data is materialized only when an action runs. Both ideas are sketched below.
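A minimal PySpark sketch of both ideas, assuming hypothetical orders and customers datasets joined on a customer_id column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-and-cache-demo").getOrCreate()

# Hypothetical inputs; substitute your own sources and join key.
orders = spark.read.parquet("/data/orders")
customers = spark.read.parquet("/data/customers")

# Hash-partition both sides on the join key so matching rows are co-located
# and the join itself does not require a further shuffle.
orders_by_key = orders.repartition("customer_id")
customers_by_key = customers.repartition("customer_id")

joined = orders_by_key.join(customers_by_key, on="customer_id")

# cache() is lazy: nothing is stored until the first action below runs.
# We cache only because the joined result is reused by two separate actions.
joined.cache()
print(joined.count())                                    # first action materializes the cache
print(joined.select("customer_id").distinct().count())   # second action reuses it
```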
Increase parallelism: You can improve the parallelism of your Spark jobs by tuning the number of executors, the cores per executor, and the memory allocated to each.
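For illustration, a sketch of these knobs set through the SparkSession builder (the same settings can be passed to spark-submit as --num-executors, --executor-cores and --executor-memory). The numbers are placeholders, not recommendations; the right values depend on your cluster and data volume.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parallelism-demo")
    .config("spark.executor.instances", "10")       # number of executors
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.executor.memory", "8g")          # memory per executor
    .config("spark.sql.shuffle.partitions", "400")  # partitions produced after shuffles
    .getOrCreate()
)
```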
Transmission overhead: Use broadcast variables to share read-only data efficiently across Spark executors instead of shipping it with every task, reducing data transmission overhead.
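A small sketch of both flavors of broadcasting, using a hypothetical country-code lookup: a broadcast variable consumed inside a UDF, and a broadcast join hint for a small DataFrame.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# Hypothetical small lookup table; substitute your own.
country_names = {"IN": "India", "US": "United States", "DE": "Germany"}

# Ship the read-only dict to every executor once instead of with every task.
bc_countries = spark.sparkContext.broadcast(country_names)

to_name = udf(lambda code: bc_countries.value.get(code, "Unknown"), StringType())

events = spark.createDataFrame([("IN",), ("US",), ("FR",)], ["country_code"])
events.withColumn("country", to_name("country_code")).show()

# For small DataFrames, a broadcast join hint avoids shuffling the large side.
small_df = spark.createDataFrame(list(country_names.items()), ["country_code", "country"])
events.join(broadcast(small_df), "country_code", "left").show()
```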
Query optimization: Prefer Spark’s SQL engine (DataFrames and Spark SQL) over the low-level RDD API; the Catalyst optimizer can rewrite declarative queries in ways it cannot do for opaque RDD lambdas.
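A side-by-side sketch with hypothetical sales data: the same aggregation written against the RDD API and against the DataFrame API, where Catalyst can optimize the plan.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-vs-rdd-demo").getOrCreate()

# Hypothetical sales data; substitute your own source.
sales = spark.createDataFrame(
    [("books", 12.0), ("games", 30.0), ("books", 5.0)],
    ["category", "amount"],
)

# RDD-style aggregation: opaque lambdas that Catalyst cannot optimize.
rdd_totals = (
    sales.rdd
    .map(lambda row: (row.category, row.amount))
    .reduceByKey(lambda a, b: a + b)
    .collect()
)

# DataFrame version: the same aggregation expressed declaratively,
# so Catalyst can optimize the plan before execution.
df_totals = sales.groupBy("category").sum("amount")
df_totals.explain()   # inspect the optimized physical plan
df_totals.show()
```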
Bottlenecks: This is the most important step. Once a job is scheduled, profile and monitor it regularly to identify performance bottlenecks and adjust configurations accordingly. You can use Spark’s built-in monitoring tools (the Spark UI and History Server) as well as third-party solutions like Sparklens.
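As a hedged sketch, enabling Spark’s event log lets the History Server replay completed jobs for offline analysis. Attaching Sparklens as an extra listener is shown commented out as an assumption; check the project’s documentation for the exact package coordinates and listener class before enabling it.

```python
from pyspark.sql import SparkSession

# Enable event logging so completed jobs can be inspected in the Spark
# History Server, not just the live Spark UI.
spark = (
    SparkSession.builder
    .appName("monitoring-demo")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-logs")   # illustrative path
    # Assumption: Sparklens attaches as an extra listener; verify the class
    # name and package against the Sparklens docs before uncommenting.
    # .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
    .getOrCreate()
)
```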
Follow these steps religiously, especially the final review step: have a second member of your team review every pipeline planned for deployment, and, most importantly, hold an open discussion within your team when you run into issues.