Pandas Transformations
df.count()
– Returns the number of non-null values in each column.
df.corr()
– Returns the pairwise correlation between columns in a DataFrame (Pearson by default).
df.head(n)
– Returns the first n rows (5 if n is omitted).
df.max()
– Returns the maximum of each column.
df.mean()
– Returns the mean of each column.
df.median()
– Returns the median of each column.
df.min()
– Returns the minimum value in each column.
df.std()
– Returns the standard deviation of each column.
df.tail(n)
– Returns the last n rows (5 if n is omitted).
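The pandas methods listed above can be tried on a small DataFrame. This is a minimal sketch: the column names `age` and `score` and all values are made up for illustration, and one missing value is included to show how `count()` and the statistics skip nulls.

```python
import pandas as pd

# Hypothetical sample data; the None in "score" is stored as NaN.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "score": [88.0, 92.5, None, 79.0],
})

counts = df.count()     # non-null values per column: age 4, score 3
means = df.mean()       # NaN is skipped: age 38.75, score 86.5
medians = df.median()   # age 39.5, score 88.0
maxima = df.max()       # age 51, score 92.5
minima = df.min()       # age 25, score 79.0
spread = df.std()       # sample standard deviation per column
corr = df.corr()        # 2x2 matrix of pairwise Pearson correlations

first_two = df.head(2)  # rows with index 0 and 1
last_two = df.tail(2)   # rows with index 2 and 3

print(counts, means, medians, sep="\n\n")
```

Each of the statistics returns a Series indexed by column name, so a single value can be read with, for example, `means["age"]`.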
PySpark Transformations
df.select()
– Choose specific columns from a DataFrame.
df.filter()
– Filter rows based on a condition.
df.groupBy()
– Group rows based on one or more columns.
df.agg()
– Perform aggregate functions (e.g., sum, average) on the whole DataFrame, or on grouped data when called after groupBy().
df.orderBy()
– Sort rows based on one or more columns.
df.dropDuplicates()
– Remove duplicate rows from the DataFrame.
df.withColumn()
– Add a new column or replace an existing column with modified data.
df.drop()
– Remove one or more columns from the DataFrame.
df.join()
– Merge two DataFrames based on one or more common columns or a join expression.
df.groupBy().pivot()
– Turn the distinct values of a column into separate columns. It is called on grouped data and followed by an aggregation.