PySpark Vs Pandas

PySpark is the Python library for Spark programming. It allows you to use the powerful and efficient data processing capabilities of Apache Spark from within the Python programming language. PySpark provides a high-level API for distributed data processing that can be used to perform common data analysis tasks, such as filtering, aggregation, and transformation of large datasets.

Pandas is a Python library for data manipulation and analysis. It provides powerful data structures, such as the DataFrame and Series, that are designed to make it easy to work with structured data in Python. With pandas, you can perform a wide range of data analysis tasks, such as filtering, aggregation, and transformation of data, as well as data cleaning and preparation.

| PySpark | Pandas |
|---|---|
| A library for working with large datasets in a distributed computing environment. | A library for working with smaller, tabular datasets on a single machine. |
| Built on the Apache Spark framework. Its DataFrame API sits on top of Resilient Distributed Datasets (RDDs). | Uses the in-memory DataFrame and Series data structures. |
| Designed for data processing tasks that are not feasible with pandas due to memory constraints, such as iterative algorithms and machine learning on large datasets. | Limited by the memory of a single machine. |
| Processes data in parallel across the cores and nodes of a cluster. | Runs most operations on a single core and does not parallelize across machines. |
| Reads data from a variety of sources, including Hadoop Distributed File System (HDFS), Amazon S3, and local file systems. | Reads local files and, with optional dependencies, remote sources such as URLs and Amazon S3, but always loads the data into local memory. |
| Integrates directly with other big data tools like Hadoop and Hive. | Has no built-in integration with Hadoop or Hive; data usually has to be exported to a file or query result first. |
| A Python API over Spark, which is written in Scala and runs on the Java Virtual Machine (JVM). | Written in Python, with performance-critical parts in Cython and C. |
| Has a steeper learning curve, due to the additional concepts and technologies involved (e.g. distributed computing, RDDs, Spark SQL, Spark Streaming). | Has a gentler learning curve for anyone already familiar with Python. |

How to decide which library to use — PySpark vs Pandas

The decision of whether to use PySpark or pandas depends on the size and complexity of the dataset and the specific task you want to perform.

  • Size of the dataset: PySpark is designed to handle large datasets that are not feasible to work with on a single machine using pandas. If you have a dataset that is too large to fit in memory, or if you need to perform iterative or distributed computations, PySpark is the better choice.
  • Complexity of the task: PySpark is a powerful tool for big data processing and allows you to perform a wide range of data processing tasks, such as machine learning, graph processing, and stream processing. If you need to perform any of these tasks, PySpark is the better choice.
  • Learning Curve: PySpark has a steeper learning curve than pandas, as it requires knowledge of distributed computing, RDDs, and Spark SQL. If you are new to big data processing and want to get started quickly, pandas may be the better choice.
  • Resources available: PySpark can run in local mode on a single machine, but its benefits only show on a cluster or distributed system, so you will need access to the appropriate infrastructure and resources. If you do not have access to these resources, then pandas is a good choice.

Resources

  • PySpark documentation is a great resource for learning PySpark, as it provides detailed information on the library’s API and includes examples of common use cases.
  • Databricks - PySpark tutorials are a good resource for learning PySpark, as they provide hands-on examples and explanations of how to use the library.

Pandas DataFrame

  • Pandas is an open-source Python library built on NumPy, a library that provides fast n-dimensional arrays and numerical operations.
  • Pandas DataFrame is a potentially heterogeneous two-dimensional size-mutable tabular data structure with labeled axes (rows and columns).
  • The data, rows, and columns are the three main components of a Pandas DataFrame.

Advantages:

  • A Pandas DataFrame supports data manipulation such as indexing, renaming, sorting, and merging.
  • Updating, adding, and deleting columns is straightforward in Pandas.
  • Pandas supports multiple file formats, such as CSV, Excel, JSON, and Parquet.
  • Processing time is low for data that fits in memory, thanks to optimized built-in functions.

Disadvantages:

  • Manipulation becomes difficult with a huge dataset, because the whole DataFrame must fit in memory.
  • Processing can be slow on large data, since most operations run on a single core.

Spark DataFrame

  • Spark is a system for cluster computing.
  • Spark is typically faster than disk-based cluster computing systems (such as Hadoop MapReduce).
  • Spark is written in Scala and provides high-level APIs in Python, Scala, Java, and R.
  • In Spark, writing parallel jobs is simple.
  • In Spark, DataFrames are distributed data collections that are organized into rows and columns. Each column in a DataFrame is given a name and a type.

Advantages:

  • Spark offers an easy-to-use API for operating on large datasets.
  • It supports not only 'map' and 'reduce' but also machine learning (ML), graph algorithms, streaming data, SQL queries, and more.
  • Spark performs computation in memory (RAM) where possible.
  • It offers more than 80 high-level operators for developing parallel applications.

Disadvantages:

  • Limited automatic optimization: the low-level RDD API is not optimized automatically, and jobs often need manual tuning.
  • Fewer built-in machine learning algorithms than single-machine libraries.
  • Many small files hurt performance (the small-files problem).

| Spark DataFrame | Pandas DataFrame |
|---|---|
| Supports parallelization. | Does not support parallelization out of the box. |
| Runs across multiple nodes. | Runs on a single node. |
| Follows lazy execution, which means that a task is not executed until an action is performed. | Follows eager execution, which means that each task is executed immediately. |
| Immutable. | Mutable. |
| Complex operations are harder to express than in a Pandas DataFrame. | Complex operations are easier to express than in a Spark DataFrame. |
| Distributed, so processing is faster for a large amount of data. | Not distributed, so processing is slower for a large amount of data. |
| sparkDataFrame.count() returns the number of rows. | pandasDataFrame.count() returns the number of non-NA/null observations for each column. |
| Well suited to building a scalable application. | Not suited to applications that must scale beyond one machine. |
| Provides fault tolerance. | Provides no fault tolerance; you need to implement your own mechanism to get it. |
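The lazy versus eager distinction in the comparison above can be illustrated without Spark at all. The following is only a toy analogy in plain Python, using generators to stand in for a lazily built plan; the function names (`eager_pipeline`, `lazy_pipeline`) and the `log` list are made up for this sketch and are not part of either library.

```python
def eager_pipeline(values):
    """Pandas-style analogy: each step runs immediately and builds a full list."""
    doubled = [v * 2 for v in values]        # runs now
    return [v for v in doubled if v > 4]     # runs now


def lazy_pipeline(values, log):
    """Spark-style analogy: the steps only describe work; nothing runs yet."""
    def double(vs):
        for v in vs:
            log.append(f"double({v})")       # record when work really happens
            yield v * 2

    def keep_large(vs):
        for v in vs:
            if v > 4:
                yield v

    return keep_large(double(values))


log = []
plan = lazy_pipeline([1, 2, 3, 4], log)
calls_before_action = len(log)   # no element has been processed yet
result = list(plan)              # the "action": forces the whole plan to run
calls_after_action = len(log)

print(eager_pipeline([1, 2, 3, 4]))
print(calls_before_action, result, calls_after_action)
```

Building `plan` does no work; only consuming it with `list(...)` triggers the computation, which is roughly how a Spark action such as `count()` triggers the transformations queued before it.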

Pandas Transformations

  • df.count() – Returns the count of each column (the count includes only non-null values).
  • df.corr() – Returns the correlation between columns in a data frame.
  • df.head(n) – Returns first n rows from the top.
  • df.max() – Returns the maximum of each column.
  • df.mean() – Returns the mean of each column.
  • df.median() – Returns the median of each column.
  • df.min() – Returns the minimum value in each column.
  • df.std() – Returns the standard deviation of each column
  • df.tail(n) – Returns last n rows.
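A small sketch of how a few of these methods behave when a column has missing values. The DataFrame contents are invented for illustration; the point is that `count()` reports non-null values per column, while `len(df)` reports rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one score is missing.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [10.0, np.nan, 30.0, 50.0],
})

non_null_per_column = df.count()      # Series: non-null values in each column
total_rows = len(df)                  # rows, including those with missing values
mean_score = df["score"].mean()       # NaN is skipped by default
median_score = df["score"].median()   # NaN is skipped by default

print(non_null_per_column)
print(total_rows, mean_score, median_score)
```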

PySpark Transformations

  • df.select() – Choose specific columns from a DataFrame.
  • df.filter() – Filter rows based on a condition.
  • df.groupBy() – Group rows based on one or more columns.
  • df.agg() – Perform aggregate functions (e.g., sum, average) on grouped data.
  • df.orderBy() – Sort rows based on one or more columns.
  • df.dropDuplicates() – Remove duplicate rows from the DataFrame.
  • df.withColumn() – Add a new column or replace an existing column with modified data.
  • df.drop() – Remove one or more columns from the DataFrame.
  • df.join() – Merge two DataFrames on one or more common columns or a join expression.
  • df.groupBy().pivot() – Pivot grouped data so that the distinct values of a column become new columns.

Connecting to Kaggle and Downloading a Dataset from Colab

Register at https://www.kaggle.com and generate an API token at https://www.kaggle.com/settings

# Run this cell and select the kaggle.json file downloaded
# from the Kaggle account settings page.
 
from google.colab import files
files.upload()
 
# This opens the file upload control so that we can upload the file to the temporary workspace.

# Next, install the Kaggle API client.
!pip install -q kaggle
 
# The Kaggle API client expects this file to be in ~/.kaggle, so move it there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
 
# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json
 
# Searching for dataset
!kaggle datasets list -s dogbreedidfromcomp
 
# Downloading dataset in the current directory
!kaggle datasets download catherinehorng/dogbreedidfromcomp
 
# Unzip the downloaded archive (saved in the current directory) into dog_dataset
!unzip dogbreedidfromcomp.zip -d dog_dataset
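Outside a notebook, the three shell steps that place the token (mkdir, cp, chmod) can be done from Python. This is a minimal sketch, assuming a POSIX file system; `install_kaggle_token` is a hypothetical helper written for this note, not part of the Kaggle client, and the demo uses a temporary directory with a fake token instead of a real home directory.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle and make it owner read/write only."""
    home = Path(home_dir) if home_dir is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p ~/.kaggle
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)             # cp kaggle.json ~/.kaggle/
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)       # chmod 600
    return target


# Demo against a throwaway directory so the real home directory is untouched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "kaggle.json"
    src.write_text('{"username": "example", "key": "not-a-real-key"}')
    installed = install_kaggle_token(src, home_dir=Path(tmp) / "home")
    installed_name = installed.name
    installed_mode = stat.S_IMODE(installed.stat().st_mode)

print(installed_name, oct(installed_mode))
```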

Data Scientist With Microsoft

https://learn.microsoft.com/en-us/users/princeparkyohannanhotmail-8262/transcript/dlmplcnz8w96op1

ASSOCIATE CERTIFICATION: Microsoft Certified: Azure Data Scientist Associate

CERTIFICATION EXAM: Designing and Implementing a Data Science Solution on Azure (Exam DP-100)

Data Scientist Career Path

COURSES

DP-090T00: Implementing a Machine Learning Solution with Microsoft Azure Databricks – Training

Azure Databricks is a cloud-scale platform for data analytics and machine learning. In this course, you’ll learn how to use Azure Databricks to explore, prepare, and model data; and integrate Databricks machine learning processes with Azure Machine Learning.

DP-100T01: Designing and Implementing a Data Science Solution on Azure

This course teaches you to leverage your existing knowledge of Python and machine learning to manage data ingestion and preparation, model training and deployment, and machine learning solution monitoring with Azure Machine Learning and MLflow.

My Learnings.

# Calculate the number of empty cells in each column
# The following line consists of three commands. Try
# to think about how they work together to calculate
# the number of missing entries per column
missing_data = dataset.isnull().sum().to_frame()
 
# Rename column holding the sums
missing_data = missing_data.rename(columns={0:'Empty Cells'})
 
# Print the results
print(missing_data)
 
## OR
print(dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))
 
# Show the missing value rows
dataset[dataset.isnull().any(axis=1)]

EDA

import pandas as pd
import matplotlib.pyplot as plt
 
# Load data from a text file
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv
df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')
 
# Remove any rows with missing data
df_students = df_students.dropna(axis=0, how='any')
 
# Calculate who passed, assuming '60' is the grade needed to pass
passes  = pd.Series(df_students['Grade'] >= 60)
 
# Save who passed to the Pandas dataframe
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)
 
# Create a figure for 2 subplots (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize = (10,4))
 
# Create a bar plot of name vs grade on the first axis
ax[0].bar(x=df_students.Name, height=df_students.Grade, color='orange')
ax[0].set_title('Grades')
ax[0].set_xticklabels(df_students.Name, rotation=90)
 
# Create a pie chart of pass counts on the second axis
pass_counts = df_students['Pass'].value_counts()
ax[1].pie(pass_counts, labels=pass_counts)
ax[1].set_title('Passing Grades')
ax[1].legend(pass_counts.keys().tolist())
 
# Add a title to the Figure
fig.suptitle('Student Data')
 
# Show the figure
fig.show()

# Create a function that we can re-use
def show_distribution_with_quantile(var_data, quantile = 0):
    '''
    This function will make a distribution (graph) and display it
    '''

    if quantile > 0:
        # Calculate the requested quantile and keep only the values above it
        q01 = var_data.quantile(quantile)
        print(f"quantile = {q01}")

        var_data = var_data[var_data>q01]
 
    # Get statistics
    min_val = var_data.min()
    max_val = var_data.max()
    mean_val = var_data.mean()
    med_val = var_data.median()
    mod_val = var_data.mode()[0]
 
    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val,
                                                                                            mean_val,
                                                                                            med_val,
                                                                                            mod_val,
                                                                                            max_val))
 
    # Create a figure for 2 subplots (2 rows, 1 column)
    fig, ax = plt.subplots(2, 1, figsize = (10,4))
 
    # Plot the histogram  
    ax[0].hist(var_data)
    ax[0].set_ylabel('Frequency')
 
    # Add lines for the minimum, mean, median, mode, and maximum
    ax[0].axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)
 
    # Plot the boxplot  
    ax[1].boxplot(var_data, vert=False)
    ax[1].set_xlabel('Value')
 
    # Add a title to the Figure
    fig.suptitle('Data Distribution')
 
    # Show the figure
    fig.show()
 
# Get the variable to examine
col = df_students['Grade']
# Call the function
show_distribution_with_quantile(col)

def show_density(var_data):
    fig = plt.figure(figsize=(10,4))
 
    # Plot density
    var_data.plot.density()
 
    # Add titles and labels
    plt.title('Data Density')
 
    # Show the mean, median, and mode
    plt.axvline(x=var_data.mean(), color = 'cyan', linestyle='dashed', linewidth = 2)
    plt.axvline(x=var_data.median(), color = 'red', linestyle='dashed', linewidth = 2)
    plt.axvline(x=var_data.mode()[0], color = 'yellow', linestyle='dashed', linewidth = 2)
 
    # Show the figure
    plt.show()
 
# Get the density of Grade
show_density(col)

Azure Databricks

Mount a remote Azure storage account as a DBFS folder, using the dbutils module:

data_storage_account_name = '<data_storage_account_name>'
data_storage_account_key = '<data_storage_account_key>'
 
data_mount_point = '/mnt/data'
 
data_file_path = '/bronze/wwi-factsale.csv'
 
dbutils.fs.mount(
  source = f"wasbs://dev@{data_storage_account_name}.blob.core.windows.net",
  mount_point = data_mount_point,
  extra_configs = {f"fs.azure.account.key.{data_storage_account_name}.blob.core.windows.net": data_storage_account_key})
 
display(dbutils.fs.ls("/mnt/data"))
#this path is available as dbfs:/mnt/data for spark APIs, e.g. spark.read
#this path is available as file:/dbfs/mnt/data for regular APIs, e.g. os.listdir
 
# %fs magic command - for accessing the dbutils filesystem module. Most dbutils.fs commands are available using %fs magic commands

We can override the cell’s default programming language by using one of the following magic commands at the start of the cell:

  • %python – for cells running python code
  • %scala – for cells running scala code
  • %r – for cells running R code
  • %sql – for cells running sql code

Additional magic commands are available:

  • %md – for descriptive cells using markdown
  • %sh – for cells running shell commands
  • %run – for cells running code defined in a separate notebook
  • %fs – for cells running dbutils filesystem commands

OpenCV(cv2) Vs Pillow(PIL)

OpenCV is 1.4 times faster than PIL

An image is simply a matrix of pixels, and each pixel is a single, square-shaped point of colored light. This is easiest to see with a grayscale image, in which each pixel holds a shade of gray.

Figure: Difference between OpenCV and PIL (image by author)

I mostly use OpenCV to complete my tasks as I find it 1.4 times quicker than PIL.

Let’s see how an image can be processed using both OpenCV and PIL.

## Installation & importing

# cv2
!pip install opencv-python
import cv2

# PIL
!pip install Pillow
from PIL import Image, ImageEnhance

## Read the image

# Read/open a color image
pil_img = Image.open("your_image.jpg")  # RGB
cv2_img = cv2.imread("your_image.jpg")  # BGR
 
# Read/open a grayscale image:
pil_img = Image.open("your_image.jpg").convert("L")
cv2_img = cv2.imread("your_image.jpg", cv2.IMREAD_GRAYSCALE)

## Write/save an image

pil_img.save("new_image.jpg")
cv2.imwrite("new_image.jpg", cv2_img)
 
# Write/save a JPEG image with specific quality:
pil_img.save("new_image.jpg", quality=95)
cv2.imwrite("new_image.jpg", cv2_img, [int(cv2.IMWRITE_JPEG_QUALITY), 95])

## Conversion between both

import numpy as np

# Pillow image to OpenCV image:
cv2_img = np.array(pil_img)
cv2_img = cv2.cvtColor(cv2_img, cv2.COLOR_RGB2BGR)

# OpenCV image to Pillow image:
cv2_img = cv2.cvtColor(cv2_img, cv2.COLOR_BGR2RGB)
pil_img = Image.fromarray(cv2_img)

Note: OpenCV stores color images in BGR channel order, while Pillow uses RGB, so the channel order has to be converted explicitly when moving an image from one library to the other.
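What the color conversion does to the underlying array can be seen with NumPy alone. The sketch below uses a hand-made two-pixel array rather than a real image, and reverses the last axis as a stand-in for `cv2.cvtColor`; it applies to 3-channel images only.

```python
import numpy as np

# A 1x2 "image" in RGB order: one pure-red pixel and one pure-blue pixel.
rgb = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the R and B channels, giving BGR order.
bgr = rgb[..., ::-1]

# Reversing again restores the original RGB order.
rgb_again = bgr[..., ::-1]

print(bgr[0, 0].tolist(), bgr[0, 1].tolist())
```

The reversed slice is a view with a negative stride; wrapping it in `np.ascontiguousarray(...)` is a reasonable precaution before handing it to functions that expect contiguous memory.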

## Shape / Size

# cv2
if cv2_img.ndim == 2:
  height, width = cv2_img.shape
  depth = 1
else:
  height, width, depth = cv2_img.shape
 
# PIL
width, height = pil_img.size
cv2_img = np.array(pil_img)
if cv2_img.ndim == 2:
  depth = 1
else:
  depth = cv2_img.shape[-1]

Note: A Pillow image object does not expose a shape tuple. Converting it to an ndarray, as above, gives the channel count; len(pil_img.getbands()) is an alternative that avoids the conversion.

## Resize

# Resize without preserving the aspect ratio:
pil_img_resized = pil_img.resize((NEW_WIDTH, NEW_HEIGHT))
cv2_img_resized = cv2.resize(cv2_img, (NEW_WIDTH, NEW_HEIGHT))

# Resize and preserve the aspect ratio:

# OpenCV:
scale_ratio = 0.6
width = int(cv2_img.shape[1] * scale_ratio)
height = int(cv2_img.shape[0] * scale_ratio)
dim = (width, height)
cv2_img_resized = cv2.resize(cv2_img, dim, interpolation=cv2.INTER_AREA)

# Pillow:
# scale ratio = min(max_width/width, max_height/height)
max_width = 256
max_height = 256
# Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter.
# thumbnail() resizes the image in place and never enlarges it.
pil_img.thumbnail((max_width, max_height), Image.LANCZOS)
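The scale-ratio rule in the Pillow comment above can be written as a small standalone function. This is a sketch of the arithmetic only: `fit_within` is a hypothetical helper, and the library's own rounding may differ by a pixel.

```python
def fit_within(width, height, max_width, max_height):
    """Return (new_width, new_height) that fits the box and keeps the aspect ratio."""
    scale = min(max_width / width, max_height / height)
    scale = min(scale, 1.0)  # assumption: never upscale, as with thumbnail()
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1024, 512, 256, 256))  # wide image: width is the limiting side
print(fit_within(100, 50, 256, 256))    # already fits: size is unchanged
```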

## RGBA to RGB

# Convert transparent pixels to white pixels (by pasting the RGBA image on a white RGB image).
 
 
#cv2
def cv2_RGBA2RGB(img):
  b, g, r, a = cv2.split(img)
  alpha = a / 255
  r = (255 * (1 - alpha) + r * alpha).astype(np.uint8)
  g = (255 * (1 - alpha) + g * alpha).astype(np.uint8)
  b = (255 * (1 - alpha) + b * alpha).astype(np.uint8)
  new_img = cv2.merge((b, g, r))
  return new_img
 
# PIL
def pil_RGBA2RGB(img):
  img.load() # for png.split()
  bg = Image.new("RGB", img.size, (255, 255, 255))
  bg.paste(img, mask=img.split()[3]) # 3 is the alpha channel
  return bg

## Read an image from a URL.

import pathlib

import requests
from PIL import Image


def download_image(url):
    """Download url into the current directory unless the file already exists."""
    filename = url.split('/')[-1]
    print(f"Filename : {filename}")
    if not pathlib.Path(filename).is_file():
        print(f"Downloading image since '{filename}' was not found locally...")
        response = requests.get(url, stream=True)
        if not response.ok:
            print(response)
            return
        with open(filename, 'wb') as handle:
            for block in response.iter_content(1024):
                if not block:
                    break
                handle.write(block)


def download_N_return_image(url):
    """Return the image at url as a Pillow image, downloading it first if needed."""
    filename = url.split('/')[-1]
    if not pathlib.Path(filename).is_file():
        download_image(url)
    if not pathlib.Path(filename).is_file():
        print(f"Could not download '{filename}'")
        return None
    print(f"Filename : {filename}")
    return Image.open(filename)


img = download_N_return_image('https://coderslegacy.com/wp-content/uploads/2020/12/kitten.jpg')
img

# without request headers

url = ''

# cv2
import cv2
import numpy as np
import requests
# imdecode expects a 1-D uint8 array, so wrap the raw bytes with np.frombuffer
cv2_img = cv2.imdecode(np.frombuffer(requests.get(url).content, dtype=np.uint8), cv2.IMREAD_UNCHANGED)

# PIL
import io
import requests
from PIL import Image
pil_img = Image.open(io.BytesIO(requests.get(url).content))

## Base64 Conversions

# Read image file as base64:
import base64
with open("your_image.jpg", "rb") as f:
  base64_str = base64.b64encode(f.read())
 
# Conversion between Pillow & base64:
import base64
from io import BytesIO
from PIL import Image
def pil_to_base64(pil_img):
  img_buffer = BytesIO()
  pil_img.save(img_buffer, format='JPEG')
  byte_data = img_buffer.getvalue()
  base64_str = base64.b64encode(byte_data)
  return base64_str
def base64_to_pil(base64_str):
  pil_img = base64.b64decode(base64_str)
  pil_img = BytesIO(pil_img)
  pil_img = Image.open(pil_img)
  return pil_img
 
# Conversion between OpenCV & base64:
import base64
import numpy as np
import cv2
def cv2_base64(cv2_img):
  # tobytes() replaces the deprecated ndarray.tostring()
  base64_str = cv2.imencode('.jpg', cv2_img)[1].tobytes()
  base64_str = base64.b64encode(base64_str)
  return base64_str
def base64_cv2(base64_str):
  imgString = base64.b64decode(base64_str)
  # np.frombuffer() replaces the deprecated binary mode of np.fromstring()
  nparr = np.frombuffer(imgString, np.uint8)
  cv2_img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
  return cv2_img
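All of the helpers above rest on the same standard-library round trip. The sketch below shows it on a short byte string that stands in for encoded image data (the payload is made up), including the step from bytes to text that is usually needed before embedding the result in JSON.

```python
import base64

raw = b"\x89PNG\r\n\x1a\n example payload"   # stand-in for encoded image bytes

encoded = base64.b64encode(raw)       # bytes object containing ASCII characters
as_text = encoded.decode("ascii")     # str, safe to embed in JSON or HTML
decoded = base64.b64decode(as_text)   # back to the original bytes

print(type(encoded).__name__, type(as_text).__name__, decoded == raw)
```

Note that `b64encode` returns `bytes`, not `str`, so the helpers above return bytes as well.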

Code Implementation of Machine Learning, Deep Learning & Artificial Intelligence Functions

Activation Functions

Sigmoid / Logistic Function

import math
 
def sigmoid(x):
  return 1 / (1 + math.exp(-x))

tanh Function

import math
 
def tanh(x):
  return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

ReLU

# max() is a built-in, so no import is needed
def relu(x):
    return max(0,x)

Leaky ReLU

# Leaky ReLU with a slope of 0.1 for negative inputs
def leaky_relu(x):
    return max(0.1*x,x)
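One practical caveat with the sigmoid above is that `math.exp(-x)` raises `OverflowError` for large negative `x`. A common workaround, sketched here under the assumption of scalar float inputs, is to branch on the sign so the exponent is never positive. The names `stable_sigmoid` and the `slope` parameter are introduced for this sketch only.

```python
import math


def stable_sigmoid(x):
    """Sigmoid that avoids overflow by only exponentiating non-positive values."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so z is in (0, 1) and cannot overflow
    return z / (1.0 + z)


def leaky_relu(x, slope=0.1):
    """Leaky ReLU with a configurable slope for negative inputs."""
    return x if x > 0 else slope * x


print(stable_sigmoid(0.0), stable_sigmoid(-1000.0), stable_sigmoid(1000.0))
print(leaky_relu(3.0), leaky_relu(-2.0))
```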

For more info, visit ‘Activation Functions in Neural Networks [12 Types & Use Cases]’ By Pragati Baheti

Loss / Cost Functions

Mean Absolute Error/L1 Loss (Regression Losses)

Mean absolute error (MAE) is measured as the average of the absolute differences between predictions and actual observations. Like mean squared error (MSE), it measures the magnitude of the error without considering its direction. Unlike MSE, MAE is not differentiable where the error is zero, so gradient-based optimization needs subgradients or more complicated tools such as linear programming. MAE is also more robust to outliers, since it does not square the errors.

# Plain implementation
 
import numpy as np
y_hat = np.array([0.000, 0.166, 0.333])
y_true = np.array([0.000, 0.254, 0.998])
 
print("predictions: " + str(["%.8f" % elem for elem in y_hat]))
print("targets: " + str(["%.8f" % elem for elem in y_true]))
 
def mae(predictions, targets):
    total_error = 0
    for yp, yt in zip(predictions, targets):
        total_error += abs(yp - yt)
    print("Total error is:",total_error)
    mae = total_error/len(predictions)
    print("Mean absolute error is:",mae)
    return mae
 
# Usage : mae(y_hat, y_true)

# Implementation using numpy
import numpy as np
y_hat = np.array([0.000, 0.166, 0.333])
y_true = np.array([0.000, 0.254, 0.998])
 
print("predictions: " + str(["%.8f" % elem for elem in y_hat]))
print("targets: " + str(["%.8f" % elem for elem in y_true]))
 
def mae_np(predictions, targets):
    return np.mean(np.abs(predictions-targets))
 
mae_val = mae_np(y_hat, y_true)
print ("mae error is: " + str(mae_val))

Mean Square Error/Quadratic Loss/L2 Loss (Regression Losses)

Mean square error is measured as the average of the squared difference between predictions and actual observations. It’s only concerned with the average magnitude of error irrespective of their direction. However, due to squaring, predictions that are far away from actual values are penalized heavily in comparison to less deviated predictions. Plus MSE has nice mathematical properties which make it easier to calculate gradients.
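Written out, with $y_i$ the actual observation, $\hat{y}_i$ the prediction and $n$ the number of samples (notation chosen here, not taken from the original), the two regression losses and their derivatives with respect to a single prediction are:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2

\frac{\partial\,\mathrm{MSE}}{\partial \hat{y}_i} = \frac{2}{n}\left( \hat{y}_i - y_i \right),
\qquad
\frac{\partial\,\mathrm{MAE}}{\partial \hat{y}_i} = \frac{1}{n}\,\operatorname{sign}\left( \hat{y}_i - y_i \right)
\quad \text{for } \hat{y}_i \neq y_i
```

The MSE gradient shrinks smoothly as the error shrinks, while the MAE gradient has constant magnitude and is undefined at zero error, which is the practical difference the two paragraphs describe.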

# Plain implementation
 
import numpy as np
 
y_hat = np.array([0.000, 0.166, 0.333])
y_true = np.array([0.000, 0.254, 0.998])
 
# Note: this computes the mean squared error (MSE), not its square root
def mse(predictions, targets):
    total_error = 0
    for yt, yp in zip(targets, predictions):
        total_error += (yt-yp)**2
    print("Total Squared Error:",total_error)
    mse_val = total_error/len(targets)
    print("Mean Squared Error:",mse_val)
    return mse_val

print("predictions: " + str(["%.8f" % elem for elem in y_hat]))
print("targets: " + str(["%.8f" % elem for elem in y_true]))

mse_val = mse(y_hat, y_true)
print("mean squared error is: " + str(mse_val))

# Implementation using numpy
 
import numpy as np
 
y_hat = np.array([0.000, 0.166, 0.333])
y_true = np.array([0.000, 0.254, 0.998])
 
# Note: this computes the mean squared error (MSE), not its square root
def mse(predictions, targets):
    return np.mean(np.square(targets-predictions))

print("predictions: " + str(["%.8f" % elem for elem in y_hat]))
print("targets: " + str(["%.8f" % elem for elem in y_true]))

mse_val = mse(y_hat, y_true)
print("mean squared error is: " + str(mse_val))
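The implementations above return the mean squared error; the root mean squared error (RMSE) is simply its square root, which brings the value back to the units of the target. A minimal standard-library sketch on the same sample values:

```python
import math

y_hat = [0.000, 0.166, 0.333]
y_true = [0.000, 0.254, 0.998]


def mse(predictions, targets):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def rmse(predictions, targets):
    """Square root of the MSE."""
    return math.sqrt(mse(predictions, targets))


mse_val = mse(y_hat, y_true)
rmse_val = rmse(y_hat, y_true)
print(mse_val, rmse_val)
```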

Log Loss or Binary Cross Entropy

import numpy as np
 
y_predicted = np.array([[0.25,0.25,0.25,0.25],[0.01,0.01,0.01,0.96]])
y_true = np.array([[0,0,0,1],[0,0,0,1]])
 
def cross_entropy(predictions, targets, epsilon=1e-10):
    # Clip so that log() never receives exactly 0 or 1
    predictions = np.clip(predictions, epsilon, 1. - epsilon)
    N = predictions.shape[0]
    ce_loss = -np.sum(targets * np.log(predictions))/N
    return ce_loss
cross_entropy_loss = cross_entropy(y_predicted, y_true)
print ("Cross entropy loss is: " + str(cross_entropy_loss))

# OR

def log_loss(predictions, targets, epsilon=1e-10):
    predicted_new = np.clip(predictions, epsilon, 1 - epsilon)
    return -np.mean(targets*np.log(predicted_new)+(1-targets)*np.log(1-predicted_new))
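To get a feel for the scale of log loss, it helps to compare confident correct predictions with confident wrong ones. The sketch below is a plain-Python version of the binary formula with invented probabilities; `binary_cross_entropy` is a name chosen for this example.

```python
import math


def binary_cross_entropy(predictions, targets, epsilon=1e-10):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p = min(max(p, epsilon), 1 - epsilon)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(targets)


confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])   # equals -ln(0.9)
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])   # equals -ln(0.1)
clipped = binary_cross_entropy([0.0], [1])                   # finite thanks to clipping

print(confident_right, confident_wrong, clipped)
```

The wrong-but-confident case costs roughly twenty times more than the right-and-confident one, which is the behavior that makes log loss useful for probabilistic classifiers.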

For more functions, refer to ‘Common Loss functions in machine learning’ by Ravindra Parmar.

Gradient Descent

# Single Feature
import numpy as np
import matplotlib.pyplot as plt
 
%matplotlib inline
 
def gradient_descent(x, y, epochs = 10000, loss_thresold = 0.5, rate = 0.01):
    w1 = bias = 0
    n = len(x)
    plt.scatter(x, y, color='red', marker='+', linewidth=5)
    for i in range(epochs):
        y_predicted = (w1 * x)+ bias
        plt.plot(x, y_predicted, color='green')
        cost = (1/n)*sum((y-y_predicted)**2)  # mean squared error
        md = -(2/n)*sum(x*(y-y_predicted))
        yd = -(2/n)*sum(y-y_predicted)
        w1 = w1 - rate * md
        bias = bias - rate * yd
        print ("w1 {}, bias {}, cost {} iteration {}".format(w1, bias, cost, i))
    return w1, bias
 
# Usage
x = np.array([1,2,3,4,5])
y = np.array([5,7,9,11,13])
gradient_descent(x, y, 500)
 
# ---

# Multiple Feature
def sigmoid_numpy(x):
    return 1 / (1 + np.exp(-x))

def gradient_descent(x1, x2, y, epochs = 10000, loss_thresold = 0.5, rate = 0.5):
    w1 = w2 = bias = 1
    n = len(x1)
    for i in range(epochs):
        weighted_sum = (w1 * x1) + (w2 * x2) + bias
        y_predicted = sigmoid_numpy(weighted_sum)
        # log_loss(predictions, targets) is the function from the Log Loss section
        loss = log_loss(y_predicted, y)
 
        w1d = (1/n)*np.dot(np.transpose(x1),(y_predicted-y))
        w2d = (1/n)*np.dot(np.transpose(x2),(y_predicted-y))
 
        bias_d = np.mean(y_predicted-y)
        w1 = w1 - rate * w1d
        w2 = w2 - rate * w2d
        bias = bias - rate * bias_d
 
        print (f'Epoch:{i}, w1:{w1}, w2:{w2}, bias:{bias}, loss:{loss}')
 
        if loss<=loss_thresold:
            break
 
    return w1, w2, bias
 
# Usage
gradient_descent(
    X_train_scaled['age'],
    X_train_scaled['affordibility'],
    y_train,
    1000,
    0.4631
)

# custom neural network class
 
class myNN:
 
    def __init__(self):
        self.w1 = 1
        self.w2 = 1
        self.bias = 0
     
    def sigmoid_numpy(self, X):
        import numpy as np;
        return 1/(1+np.exp(-X))
 
    def log_loss(self, y_true, y_predicted):
        import numpy as np;
        epsilon = 1e-15
        y_predicted_new = [max(i,epsilon) for i in y_predicted]
        y_predicted_new = [min(i,1-epsilon) for i in y_predicted_new]
        y_predicted_new = np.array(y_predicted_new)
        return -np.mean(y_true*np.log(y_predicted_new)+(1-y_true)*np.log(1-y_predicted_new))
     
    def fit(self, X, y, epochs, loss_thresold):
        self.w1, self.w2, self.bias = self.gradient_descent(X['age'],X['affordibility'],y, epochs, loss_thresold)
        print(f"Final weights and bias: w1: {self.w1}, w2: {self.w2}, bias: {self.bias}")
     
    def predict(self, X_test):
        weighted_sum = self.w1*X_test['age'] + self.w2*X_test['affordibility'] + self.bias
        return self.sigmoid_numpy(weighted_sum)
 
    def gradient_descent(self, age,affordability, y_true, epochs, loss_thresold):
        import numpy as np;
        w1 = w2 = 1
        bias = 0
        rate = 0.5
        n = len(age)
        for i in range(epochs):
            weighted_sum = w1 * age + w2 * affordability + bias
            y_predicted = self.sigmoid_numpy(weighted_sum)
            loss = self.log_loss(y_true, y_predicted)
             
            w1d = (1/n)*np.dot(np.transpose(age),(y_predicted-y_true))
            w2d = (1/n)*np.dot(np.transpose(affordability),(y_predicted-y_true))
 
            bias_d = np.mean(y_predicted-y_true)
            w1 = w1 - rate * w1d
            w2 = w2 - rate * w2d
            bias = bias - rate * bias_d
             
            if i%50==0:
                print (f'Epoch:{i}, w1:{w1}, w2:{w2}, bias:{bias}, loss:{loss}')
             
            if loss<=loss_thresold:
                print (f'Epoch:{i}, w1:{w1}, w2:{w2}, bias:{bias}, loss:{loss}')
                break
 
        return w1, w2, bias
   
 
 
# Usage
customModel = myNN()
customModel.fit(
        X_train_scaled,
        y_train,
        epochs=8000,
        loss_thresold=0.4631
)

Batch Gradient Descent

import numpy as np

def batch_gradient_descent(X, y_true, epochs, learning_rate = 0.01):
 
    number_of_features = X.shape[1]
    # numpy array with 1 row and columns equal to number of features. In
    # our case number_of_features = 2 (area, bedroom)
    w = np.ones(shape=(number_of_features))
    b = 0
    total_samples = X.shape[0] # number of rows in X
     
    cost_list = []
    epoch_list = []
     
    for i in range(epochs):       
        y_predicted = np.dot(w, X.T) + b
 
        w_grad = -(2/total_samples)*(X.T.dot(y_true-y_predicted))
        b_grad = -(2/total_samples)*np.sum(y_true-y_predicted)
         
        w = w - learning_rate * w_grad
        b = b - learning_rate * b_grad
         
        cost = np.mean(np.square(y_true-y_predicted)) # MSE (Mean Squared Error)
         
        if i%10==0:
            cost_list.append(cost)
            epoch_list.append(i)
         
    return w, b, cost, cost_list, epoch_list
 
w, b, cost, cost_list, epoch_list = batch_gradient_descent(
    scaled_X,
    scaled_y.reshape(scaled_y.shape[0],),
    500
)
w, b, cost
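A quick way to check a batch gradient descent update rule is to run it on data with a known answer. The sketch below is a dependency-free, single-feature version of the same MSE update, applied to the toy points from the single-feature example (which lie on y = 2x + 3). The function name `fit_line` and the chosen `epochs` and `rate` values are assumptions made for this check.

```python
def fit_line(xs, ys, epochs=2000, rate=0.05):
    """Batch gradient descent on MSE for the model y = w * x + b."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Every sample contributes to each update: that is what makes it "batch".
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        w_grad = -(2 / n) * sum(x * e for x, e in zip(xs, errors))
        b_grad = -(2 / n) * sum(errors)
        w -= rate * w_grad
        b -= rate * b_grad
    return w, b


slope, intercept = fit_line([1, 2, 3, 4, 5], [5, 7, 9, 11, 13])
print(round(slope, 4), round(intercept, 4))
```

With unscaled inputs like these, the learning rate has to stay small for the updates to remain stable, which is one reason the examples in this section work on scaled features.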

Mini Batch Gradient Descent

def mini_batch_gradient_descent(X, y_true, epochs = 100, batch_size = 5, learning_rate = 0.01):
     
    number_of_features = X.shape[1]
    # numpy array with 1 row and columns equal to number of features. In
    # our case number_of_features = 3 (area, bedroom and age)
    w = np.ones(shape=(number_of_features))
    b = 0
    total_samples = X.shape[0] # number of rows in X
     
    if batch_size > total_samples: # In this case mini batch becomes same as batch gradient descent
        batch_size = total_samples
         
    cost_list = []
    epoch_list = []
     
    num_batches = int(total_samples/batch_size)
     
    for i in range(epochs):   
        random_indices = np.random.permutation(total_samples)
        X_tmp = X[random_indices]
        y_tmp = y_true[random_indices]
         
        for j in range(0,total_samples,batch_size):
            Xj = X_tmp[j:j+batch_size]
            yj = y_tmp[j:j+batch_size]
            y_predicted = np.dot(w, Xj.T) + b
             
            w_grad = -(2/len(Xj))*(Xj.T.dot(yj-y_predicted))
            b_grad = -(2/len(Xj))*np.sum(yj-y_predicted)
             
            w = w - learning_rate * w_grad
            b = b - learning_rate * b_grad
                 
            cost = np.mean(np.square(yj-y_predicted)) # MSE (Mean Squared Error)
         
        if i%10==0:
            cost_list.append(cost)
            epoch_list.append(i)
         
    return w, b, cost, cost_list, epoch_list
 
w, b, cost, cost_list, epoch_list = mini_batch_gradient_descent(
    scaled_X,
    scaled_y.reshape(scaled_y.shape[0],),
    epochs = 120,
    batch_size = 5
)
w, b, cost
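The batching logic above (shuffle once per epoch, then step through the data in slices) can be looked at in isolation. This sketch uses only the standard library; `minibatch_indices` is a hypothetical helper, and the sample count and seed are arbitrary.

```python
import random


def minibatch_indices(total_samples, batch_size, seed=None):
    """Shuffle the sample indices and split them into consecutive mini-batches."""
    rng = random.Random(seed)
    order = list(range(total_samples))
    rng.shuffle(order)
    return [order[j:j + batch_size] for j in range(0, total_samples, batch_size)]


batches = minibatch_indices(total_samples=12, batch_size=5, seed=0)
batch_sizes = [len(batch) for batch in batches]
all_indices = sorted(i for batch in batches for i in batch)

print(batch_sizes)
```

When the sample count is not a multiple of the batch size, the last batch is smaller, so the loop runs one more batch than `int(total_samples/batch_size)` suggests. Dividing the gradient by `len(Xj)` rather than by `batch_size`, as the function above does, keeps that short batch correctly scaled.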

Stochastic Gradient Descent

import random

import numpy as np

def stochastic_gradient_descent(X, y_true, epochs, learning_rate = 0.01):
  
    number_of_features = X.shape[1]
    # numpy array with 1 row and columns equal to number of features. In
    # our case number_of_features = 3 (area, bedroom and age)
    w = np.ones(shape=(number_of_features))
    b = 0
    total_samples = X.shape[0]
     
    cost_list = []
    epoch_list = []
     
    for i in range(epochs):   
        random_index = random.randint(0,total_samples-1) # random index from total samples
        sample_x = X[random_index]
        sample_y = y_true[random_index]
         
        y_predicted = np.dot(w, sample_x.T) + b
     
        w_grad = -(2/total_samples)*(sample_x.T.dot(sample_y-y_predicted))
        b_grad = -(2/total_samples)*(sample_y-y_predicted)
         
        w = w - learning_rate * w_grad
        b = b - learning_rate * b_grad
         
        cost = np.square(sample_y-y_predicted)
         
        if i%100==0: # at every 100th iteration record the cost and epoch value
            cost_list.append(cost)
            epoch_list.append(i)
         
    return w, b, cost, cost_list, epoch_list
 
w_sgd, b_sgd, cost_sgd, cost_list_sgd, epoch_list_sgd = stochastic_gradient_descent(
    scaled_X,
    scaled_y.reshape(scaled_y.shape[0],),
    10000
)
w_sgd, b_sgd, cost_sgd