PySpark – Prince PARK

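Note: the descriptions below match the pandas API on Spark (pyspark.pandas). On a native pyspark.sql DataFrame, count() returns the total number of rows, corr(col1, col2) returns the correlation of two named columns, head(n) and tail(n) return lists of Row objects, and max(), mean(), median(), min() and std() are not DataFrame methods; use df.agg() with functions from pyspark.sql.functions instead.

df.count() – Returns the number of non-null values in each column.
df.corr() – Returns the pairwise correlation between the numeric columns of a DataFrame.
df.head(n) – Returns the first n rows.
df.max() – Returns the maximum of each column.
df.mean() – Returns the mean of each column.
df.median() – Returns the (approximate) median of each column.
df.min() – Returns the minimum of each column.
df.std() – Returns the standard deviation of each column.
df.tail(n) – Returns the last n rows.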

df.select() – Choose specific columns from a DataFrame.
df.filter() – Filter rows based on a condition.
df.groupBy() – Group rows based on one or more columns.
df.agg() – Compute aggregates (e.g., sum, average) over the whole DataFrame; call it after groupBy() to aggregate each group.
df.orderBy() – Sort rows based on one or more columns.
df.dropDuplicates() – Remove duplicate rows from the DataFrame.
df.withColumn() – Add a new column or replace an existing column with modified data.
df.drop() – Remove one or more columns from the DataFrame.
df.join() – Merge two DataFrames on one or more common columns or on a join expression.
df.groupBy().pivot() – Turn the distinct values of a column into separate columns; pivot() is called on grouped data and must be followed by an aggregation.
