PySpark vs. Pandas

Pandas Transformations

  • df.count() – Returns the number of non-null values in each column.
  • df.corr() – Returns the pairwise correlation between the numeric columns of a data frame.
  • df.head(n) – Returns the first n rows (5 by default).
  • df.max() – Returns the maximum of each column.
  • df.mean() – Returns the mean of each column.
  • df.median() – Returns the median of each column.
  • df.min() – Returns the minimum value in each column.
  • df.std() – Returns the standard deviation of each column.
  • df.tail(n) – Returns the last n rows (5 by default).
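
The methods listed above can be tried on a small data frame. The following is a minimal sketch, not a reference implementation: the column names (price, qty) and the values are made up for illustration, and one price is deliberately left missing to show how null values are handled.

```python
import pandas as pd

# Hypothetical sample data; the column names and values are illustrative only.
df = pd.DataFrame({
    "price": [10.0, 20.0, None, 40.0],  # one missing value on purpose
    "qty": [1, 2, 3, 4],
})

counts = df.count()      # non-null values per column (price has one fewer than qty)
corr = df.corr()         # pairwise correlation matrix of the numeric columns
first_two = df.head(2)   # first two rows
last_two = df.tail(2)    # last two rows

# Column-wise statistics; missing values are skipped by default.
maxima = df.max()
minima = df.min()
means = df.mean()
medians = df.median()
stds = df.std()

print(counts)
print(means)
print(corr)
```

Each of the statistics above returns one value per column, while df.corr() returns a square table with one row and one column per numeric column.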

PySpark Transformations

  • df.select() – Choose specific columns from a DataFrame.
  • df.filter() – Filter rows based on a condition.
  • df.groupBy() – Group rows based on one or more columns.
  • df.agg() – Compute aggregate functions (e.g., sum, average) over the whole DataFrame, or per group when chained after groupBy().
  • df.orderBy() – Sort rows based on one or more columns.
  • df.dropDuplicates() – Remove duplicate rows from the DataFrame.
  • df.withColumn() – Add a new column or replace an existing column with modified data.
  • df.drop() – Remove one or more columns from the DataFrame.
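  • df.join() – Merge two DataFrames based on one or more common columns or a join condition (PySpark DataFrames have no index).
  • df.groupBy().pivot() – Pivot grouped data so that the distinct values of a column become new columns; pivot() is called on the result of groupBy() and is followed by an aggregation.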