PySpark Vs Pandas

Pandas Transformations

  • df.count() – Returns the number of non-null values in each column.
  • df.corr() – Returns the correlation between columns in a data frame.
  • df.head(n) – Returns first n rows from the top.
  • df.max() – Returns the maximum of each column.
  • df.mean() – Returns the mean of each column.
  • df.median() – Returns the median of each column.
  • df.min() – Returns the minimum value in each column.
  • df.std() – Returns the standard deviation of each column.
  • df.tail(n) – Returns last n rows.
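
The Pandas methods listed above can be tried on a small DataFrame. This is a minimal sketch: the column names and values are made up for illustration, and one value is left missing to show that count() and mean() skip nulls.

```python
import pandas as pd

# Hypothetical sample data; the None in "score" becomes NaN (a null value).
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "score": [88.0, None, 75.5, 92.0],
})

non_null_counts = df.count()    # non-null values per column
column_means = df.mean()        # NaN values are skipped by default
column_medians = df.median()
column_max = df.max()
column_min = df.min()
column_std = df.std()           # sample standard deviation (ddof=1)
first_two = df.head(2)          # first 2 rows
last_two = df.tail(2)           # last 2 rows
correlations = df.corr()        # pairwise Pearson correlation between columns

print(non_null_counts)
print(column_means)
print(correlations)
```

Each of these calls returns a new Series or DataFrame and leaves df unchanged.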

PySpark Transformations

  • df.select() – Choose specific columns from a DataFrame.
  • df.filter() – Filter rows based on a condition.
  • df.groupBy() – Group rows based on one or more columns.
  • df.agg() – Perform aggregate functions (e.g., sum, average) on grouped data.
  • df.orderBy() – Sort rows based on one or more columns.
  • df.dropDuplicates() – Remove duplicate rows from the DataFrame.
  • df.withColumn() – Add a new column or replace an existing column with modified data.
  • df.drop() – Remove one or more columns from the DataFrame.
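  • df.join() – Merge two DataFrames based on a common column or a join expression.
  • df.groupBy().pivot() – Pivot grouped data so that the distinct values of a column become new columns. Note that pivot() is called on the result of groupBy(), not on the DataFrame itself.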

Connecting to Kaggle and Downloading a Dataset from Colab

Register at https://www.kaggle.com and generate an API token via https://www.kaggle.com/settings

# Run this cell and select the kaggle.json file downloaded
# from the Kaggle account settings page.

from google.colab import files
files.upload()

# The call above opens a file upload control so that we can upload the file to the temporary workspace.
# Next, install the Kaggle API client.
!pip install -q kaggle

# The Kaggle API client expects this file to be in ~/.kaggle, so move it there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json

# Searching for dataset
!kaggle datasets list -s dogbreedidfromcomp

# Downloading dataset in the current directory
!kaggle datasets download catherinehorng/dogbreedidfromcomp

# Unzipping the downloaded file (saved in the current directory) into dog_dataset
!unzip dogbreedidfromcomp.zip -d dog_dataset
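
The credential setup steps above (mkdir, cp, chmod) can also be done in plain Python, which may be handy when shell commands are not available. This is only a sketch: install_kaggle_token is a hypothetical helper, not part of the Kaggle client, and the demo below writes a dummy token into a temporary directory instead of the real home directory.

```python
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_bytes: bytes, home: Path) -> Path:
    """Write kaggle.json under <home>/.kaggle with owner-only permissions."""
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # like: mkdir -p ~/.kaggle
    token_path = kaggle_dir / "kaggle.json"
    token_path.write_bytes(token_bytes)             # like: cp kaggle.json ~/.kaggle/
    token_path.chmod(0o600)                         # like: chmod 600 ~/.kaggle/kaggle.json
    return token_path


# Demo with a dummy token and a temporary "home" directory.
demo_home = Path(tempfile.mkdtemp())
demo_path = install_kaggle_token(
    b'{"username": "demo", "key": "not-a-real-key"}', demo_home
)
demo_mode = stat.S_IMODE(demo_path.stat().st_mode)
print(demo_path, oct(demo_mode))
```

In Colab, files.upload() returns a dictionary mapping each uploaded filename to its bytes, so the real call would look like install_kaggle_token(uploaded["kaggle.json"], Path.home()), where uploaded is the value returned by files.upload().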