Restoring SQL Server BACPAC Files on Mac M3

Overview

This guide covers how to restore a .bacpac database backup file to SQL Server running in Docker on Mac M3 (Apple Silicon) using SqlPackage.

Prerequisites

  • SQL Server running in Docker
  • .bacpac backup file
  • Terminal access
  • Docker container running and accessible

Installation Steps

1. Download SqlPackage for macOS

SqlPackage is Microsoft’s command-line utility for importing and exporting SQL Server databases.

# Navigate to Downloads folder
cd ~/Downloads

# Download SqlPackage for macOS (universal binary - works on ARM64/M3)
curl -L -o sqlpackage.zip https://aka.ms/sqlpackage-macos

# Extract the package
unzip sqlpackage.zip -d sqlpackage

# Make the binary executable
chmod +x sqlpackage/sqlpackage

Restoring a BACPAC File

Basic Restore Command

./sqlpackage/sqlpackage /Action:Import \
  /SourceFile:"your-database.bacpac" \
  /TargetServerName:"localhost,PORT" \
  /TargetDatabaseName:"YourDatabaseName" \
  /TargetUser:"sa" \
  /TargetPassword:'YourPassword' \
  /TargetTrustServerCertificate:True

Example Usage

./sqlpackage/sqlpackage /Action:Import \
  /SourceFile:"SampleDatabase.bacpac" \
  /TargetServerName:"localhost,5433" \
  /TargetDatabaseName:"ESOLAut0mater12" \
  /TargetUser:"sa" \
  /TargetPassword:'SecureP@ssword' \
  /TargetTrustServerCertificate:True

Example with Extended Timeout (for large databases)

./sqlpackage/sqlpackage /Action:Import \
  /SourceFile:"SampleDatabase.bacpac" \
  /TargetServerName:"localhost,5433" \
  /TargetDatabaseName:"ESOLAut0mater12" \
  /TargetUser:"sa" \
  /TargetPassword:'SecureP@ssword' \
  /TargetTrustServerCertificate:True \
  /TargetTimeout:600

Important Parameters Explained

  • /Action:Import – Specifies that we're importing a BACPAC file
  • /SourceFile – Path to your .bacpac file
  • /TargetServerName – Server address with port (format: hostname,port)
  • /TargetDatabaseName – Name for the restored database
  • /TargetUser – SQL Server username (typically sa)
  • /TargetPassword – SQL Server password
  • /TargetTrustServerCertificate:True – Required for Docker SQL Server; trusts self-signed certificates
  • /TargetTimeout:600 – Optional; connection timeout in seconds (use for large databases)
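As a sketch, the command line above can also be assembled and launched from Python. The helper below is hypothetical (not part of SqlPackage); it only builds the argument list and leaves the actual run commented out:

```python
import subprocess  # only needed if you uncomment the run() call below

def build_import_args(source_file, server, database, user, password,
                      timeout=None, sqlpackage="./sqlpackage/sqlpackage"):
    """Assemble the argument list for a sqlpackage /Action:Import call."""
    args = [
        sqlpackage,
        "/Action:Import",
        f"/SourceFile:{source_file}",
        f"/TargetServerName:{server}",
        f"/TargetDatabaseName:{database}",
        f"/TargetUser:{user}",
        f"/TargetPassword:{password}",
        "/TargetTrustServerCertificate:True",
    ]
    if timeout is not None:
        args.append(f"/TargetTimeout:{timeout}")
    return args

args = build_import_args("SampleDatabase.bacpac", "localhost,5433",
                         "MyDatabase", "sa", "SecureP@ssword", timeout=600)
# subprocess.run(args, check=True)  # uncomment to actually start the import
```

Because the arguments are passed as a list, no shell quoting of special password characters is needed.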

Common Issues and Solutions

1. SSL Certificate Error

Error: The remote certificate was rejected by the provided RemoteCertificateValidationCallback

Solution: Add /TargetTrustServerCertificate:True parameter (already included in the example above)

2. Password with Special Characters

If your password contains special characters like !, @, #, etc., wrap it in single quotes:

/TargetPassword:'SecureP@ssword'

3. Connection Format

  • ✅ Correct: localhost,5433 (comma separator)
  • ❌ Incorrect: localhost;5433 (semicolon separator)
  • ❌ Incorrect: localhost:5433 (colon separator)
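
To catch the wrong separator before SqlPackage does, you can validate the target string up front; parse_target_server is an illustrative helper, not a SqlPackage feature:

```python
def parse_target_server(server):
    """Split a 'hostname,port' target string, rejecting ';' and ':' separators."""
    if ";" in server or ":" in server:
        raise ValueError(f"use a comma between host and port, got: {server!r}")
    host, _, port = server.partition(",")
    return host, int(port) if port else 1433  # 1433 is the SQL Server default

print(parse_target_server("localhost,5433"))  # ('localhost', 5433)
```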

4. Port Numbers

Common SQL Server ports:

  • Default: 1433
  • Custom Docker: 5433 (or whatever you mapped in your Docker run command)

Find your Docker port mapping:

docker ps
# Look for something like: 0.0.0.0:5433->1433/tcp
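
If you want to script this lookup, the mapping string printed by docker ps can be parsed with a regular expression (illustrative helper, not a Docker CLI feature):

```python
import re

def mapped_host_port(mapping, container_port=1433):
    """Extract the host port from a Docker mapping like '0.0.0.0:5433->1433/tcp'."""
    m = re.search(rf":(\d+)->{container_port}/tcp", mapping)
    return int(m.group(1)) if m else None

print(mapped_host_port("0.0.0.0:5433->1433/tcp"))  # 5433
```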

5. Transport-Level Errors with Large Tables

Error: A transport-level error has occurred when receiving results from the server

Symptoms:

  • Import fails consistently at the same tables (often audit or log tables)
  • Error message: Transport-level error or Invalid argument
  • Import works up to a certain point, then crashes

Root Cause: Some tables (like AbpAuditLogs, AbpEntityChanges, AbpEntityPropertyChanges) can contain massive amounts of data that cause timeout or memory issues during bulk import on Mac M3 with Docker SQL Server.

Solution: Manually Remove Large Table Data from BACPAC

Since BACPAC is essentially a ZIP file, you can extract it, remove problematic table data, and re-package it.
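
The unzip/filter/re-zip cycle can also be done in one pass with Python's standard zipfile module. This is a sketch under the assumption that data files live under Data/<schema>.<table>/*.BCP inside the archive; the table names are examples, so adjust them to your error messages:

```python
import zipfile

LARGE_TABLES = {"dbo.AbpAuditLogs", "dbo.AbpEntityChanges",
                "dbo.AbpEntityPropertyChanges"}

def strip_table_data(src_bacpac, dst_bacpac, tables=LARGE_TABLES):
    """Copy a BACPAC archive, dropping the .BCP data files of the given tables."""
    with zipfile.ZipFile(src_bacpac) as src, \
         zipfile.ZipFile(dst_bacpac, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            parts = item.filename.split("/")
            if (len(parts) >= 3 and parts[0] == "Data" and parts[1] in tables
                    and item.filename.upper().endswith(".BCP")):
                continue  # skip this table's data, keep its schema
            dst.writestr(item, src.read(item.filename))
    return dst_bacpac
```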

Step-by-Step Process
# Navigate to your BACPAC location
cd ~/Downloads

# Remove artifacts from previous attempts if they exist
rm -f YourDatabase_Modified.bacpac
rm -rf bacpac_extracted

# Unzip the BACPAC
unzip YourDatabase.bacpac -d bacpac_extracted

# Navigate to the Data folder
cd bacpac_extracted/Data

# Remove the large table data files (BCP files contain the actual data)
# Common problematic tables (adjust based on your error messages):
rm -rf dbo.AbpAuditLogs/*.BCP
rm -rf dbo.AbpEntityChanges/*.BCP
rm -rf dbo.AbpEntityPropertyChanges/*.BCP

# Add more tables if needed:
# rm -rf dbo.YourLargeTable/*.BCP

# Return to parent directory and re-zip
cd ..
zip -r ../YourDatabase_Modified.bacpac *

# Go back to Downloads folder
cd ..

# Import the modified BACPAC
./sqlpackage/sqlpackage /Action:Import \
  /SourceFile:"YourDatabase_Modified.bacpac" \
  /TargetServerName:"localhost,5433" \
  /TargetDatabaseName:"YourDatabaseName" \
  /TargetUser:"sa" \
  /TargetPassword:'YourPassword' \
  /TargetTrustServerCertificate:True \
  /TargetTimeout:600

Complete Example

cd ~/Downloads

# Clean up previous attempts
rm -rf ESOLAut0mater_Modified.bacpac
rm -rf bacpac_extracted

# Unzip the BACPAC
unzip SampleDatabase.bacpac -d bacpac_extracted

# Remove problematic table data
cd bacpac_extracted/Data
rm -rf dbo.AbpAuditLogs/*.BCP
rm -rf dbo.AbpEntityChanges/*.BCP
rm -rf dbo.AbpEntityPropertyChanges/*.BCP

# Re-zip it
cd ..
zip -r ../ESOLAut0mater_Modified.bacpac *

# Import the modified BACPAC
cd ..
./sqlpackage/sqlpackage /Action:Import \
  /SourceFile:"ESOLAut0mater_Modified.bacpac" \
  /TargetServerName:"localhost,5433" \
  /TargetDatabaseName:"ESOLAut0mater12" \
  /TargetUser:"sa" \
  /TargetPassword:'SecureP@ssword' \
  /TargetTrustServerCertificate:True \
  /TargetTimeout:600

How to Identify Problematic Tables

  1. Look at the import output – it shows which table was processing when it failed:

    Processing Table '[dbo].[AbpAuditLogs]'.
    *** A transport-level error has occurred...
    
  2. Check the Data folder structure in the extracted BACPAC:

    cd bacpac_extracted/Data
    du -sh */ | sort -h
    # This shows folder sizes - large folders are likely culprits
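
If du is unavailable (for example from a script), the same size check works portably in Python (illustrative helper):

```python
import os

def data_folder_sizes(data_dir):
    """Total bytes per table folder under Data/, largest first."""
    sizes = {}
    for entry in os.scandir(data_dir):
        if entry.is_dir():
            total = 0
            for dirpath, _dirs, files in os.walk(entry.path):
                total += sum(os.path.getsize(os.path.join(dirpath, f))
                             for f in files)
            sizes[entry.name] = total
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))
```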
    

Important Notes

  • Schema Preservation: This method only removes table DATA, not the table structure. The tables will exist but will be empty.
  • Foreign Keys: If other tables reference the data you’re removing, you may need to handle those dependencies.
  • Audit Tables: Tables like AbpAuditLogs, AbpEntityChanges, AbpEntityPropertyChanges are typically audit/log tables and are safe to empty for development environments.
  • Production Warning: Do not use this method for production imports where you need complete data integrity.
  • Alternative: If you need the data from these tables, consider using Azure Data Studio which sometimes handles large imports better, or increase Docker memory allocation significantly (8GB+).

Verify After Import

After importing the modified BACPAC, verify your data:

USE YourDatabaseName;
GO

-- Check table count
SELECT COUNT(*) AS TotalTables FROM sys.tables;

-- Check row counts for all tables
SELECT 
    SCHEMA_NAME(t.schema_id) + '.' + t.name AS TableName,
    SUM(p.rows) AS RowCount
FROM sys.tables t
INNER JOIN sys.partitions p ON t.object_id = p.object_id
WHERE p.index_id IN (0, 1)
GROUP BY t.schema_id, t.name
ORDER BY RowCount DESC;

-- Verify important business tables have data
SELECT COUNT(*) FROM [YourSchema].[YourImportantTable];

Optional: Install Globally

To use SqlPackage from any directory:

# Move to permanent location
sudo mkdir -p /usr/local/sqlpackage
sudo cp -r sqlpackage/* /usr/local/sqlpackage/

# Create symbolic link
sudo ln -s /usr/local/sqlpackage/sqlpackage /usr/local/bin/sqlpackage

# Now you can run from anywhere
sqlpackage /Action:Import \
  /SourceFile:"$HOME/path/to/database.bacpac" \
  /TargetServerName:"localhost,5433" \
  /TargetDatabaseName:"DatabaseName" \
  /TargetUser:"sa" \
  /TargetPassword:'YourPassword' \
  /TargetTrustServerCertificate:True

Verifying the Restore

After successful import, verify the database using VS Code with mssql extension or any SQL client:

-- List all databases
SELECT name FROM sys.databases;

-- Check table count in restored database
USE YourDatabaseName;
SELECT COUNT(*) AS TableCount 
FROM INFORMATION_SCHEMA.TABLES 
WHERE TABLE_TYPE = 'BASE TABLE';

Alternative Tools

If SqlPackage doesn’t work for your use case:

  1. Azure Data Studio (Recommended GUI option)

    brew install --cask azure-data-studio
    
    • Right-click “Databases” → “Import Data-tier Application”
    • Browse to your .bacpac file
  2. Direct Docker approach (if SqlPackage is available in container)

    docker cp database.bacpac container_name:/tmp/
    docker exec -it container_name /path/to/sqlpackage /Action:Import ...
    

Additional SqlPackage Actions

SqlPackage supports other database operations:

# Export database to BACPAC
./sqlpackage/sqlpackage /Action:Export \
  /SourceServerName:"localhost,5433" \
  /SourceDatabaseName:"DatabaseName" \
  /SourceUser:"sa" \
  /SourcePassword:'Password' \
  /TargetFile:"output.bacpac" \
  /SourceTrustServerCertificate:True

# Extract to DACPAC (schema only)
./sqlpackage/sqlpackage /Action:Extract \
  /SourceServerName:"localhost,5433" \
  /SourceDatabaseName:"DatabaseName" \
  /SourceUser:"sa" \
  /SourcePassword:'Password' \
  /TargetFile:"schema.dacpac" \
  /SourceTrustServerCertificate:True

Notes

  • BACPAC files contain both schema and data
  • DACPAC files contain only schema (no data)
  • SqlPackage on Mac M3 runs natively on ARM64 architecture
  • Self-signed certificates in Docker SQL Server require the TrustServerCertificate parameter
  • For production environments, consider using proper SSL certificates
  • Large Tables: If import fails with transport-level errors on specific tables (especially audit/log tables), you can manually remove their data from the BACPAC by unzipping it, deleting the problematic .BCP files, and re-zipping (see section 5 in Common Issues)
  • Docker Resources: For large database imports, ensure Docker has sufficient resources allocated (4GB+ RAM, 2+ CPU cores)
  • Development vs Production: The manual BACPAC modification method is suitable for development environments where audit history is not critical

🚀 Installing Odoo 16/17/18 on a Free Cloud Server (AWS Lightsail, DigitalOcean, etc.)

🔹 Scenario

If you need to install Odoo 16, 17, or 18 on a free cloud server like AWS Lightsail, DigitalOcean Droplets, or similar, this guide will help you set up an Odoo instance at zero cost. This setup is perfect for testing functionalities, running demos, or short-term development.

🛠 Supported Versions

  • Odoo Versions: 16, 17, 18, 19 (tested)
  • Ubuntu Version: 24.04 LTS

✅ Step-by-Step Installation Guide

1️⃣ Create a Free Ubuntu 24.04 Server

  • Sign up for AWS Lightsail and create a 90-day free Ubuntu 24.04 instance.
  • Choose a basic server configuration (e.g., 1GB RAM, 1vCPU, 20GB SSD).

2️⃣ Apply the Launch Script

During the instance creation process, paste the following launch script in the “Launch Script” section:

https://github.com/princeppy/odoo-install-scripts/blob/main/lightsail.aws/launch_script.sh

This script automates the initial setup, including system updates, package installations, and preparing the Odoo environment.

3️⃣ Access the Server via Browser-Based SSH

Once your instance is up and running:

  • Open AWS Lightsail and select your instance.
  • Click “Connect using SSH” to access the terminal.

4️⃣ Monitor Installation Progress

Run the following command to track installation logs in real time:

tail -f /tmp/launchscript.log

• Wait until you see:

Preinstallation Completed........

This indicates that the server setup is complete.
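
Rather than watching the tail output by hand, the wait can be scripted; this is a small illustrative helper (the log path and marker text come from the launch script above):

```python
def preinstall_done(log_path="/tmp/launchscript.log",
                    marker="Preinstallation Completed"):
    """Return True once the completion marker appears in the launch-script log."""
    try:
        with open(log_path, encoding="utf-8", errors="replace") as fh:
            return any(marker in line for line in fh)
    except FileNotFoundError:
        return False
```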

5️⃣ Elevate to Root User

Once the installation completes, switch to the root user to run administrative commands:

sudo su

6️⃣ Run the Odoo Installation Script

Now, execute the Odoo installation script:

bash /InstallScript/install_odoo.sh

• The script will download, install, and configure Odoo on your server.
• Once completed, look for the confirmation message:

Done

• Your Odoo instance is now ready to use! 🎉

📌 References & Additional Resources

For further reading and alternative installation scripts, check out these resources:

• Odoo Install Script by Yenthe666
• Odoo Install Script by Moaaz
• Odoo Install Script by Ventor Tech

🚀 Conclusion

By following this guide, you can quickly deploy Odoo 16/17/18/19 on a free Ubuntu 24.04 server using AWS Lightsail or similar platforms. This setup allows you to test Odoo functionalities, run demos, or perform short-term development—all without any cost.

💡 Got questions or need help? Drop a comment below! 🚀

Azure Data Lake Storage Gen1 vs. Gen2

  • Architecture – Gen1: standalone hierarchical file system; Gen2: built on Azure Blob Storage with Hierarchical Namespace (HNS)
  • Performance – Gen1: slower due to standalone architecture; Gen2: optimized performance with tiered storage & caching
  • Security – Gen1: ACLs (Access Control Lists) & RBAC; Gen2: RBAC, ACLs, Azure AD (more granular access control)
  • Cost Efficiency – Gen1: higher cost, no tiered storage; Gen2: lower cost with hot, cool, and archive tiers
  • Integration – Gen1: limited compatibility with Azure services; Gen2: fully compatible with Blob APIs, Synapse, Databricks, Spark
  • Scalability – Gen1: limited to single-region storage; Gen2: globally distributed, supports geo-redundancy (GRS)
  • Protocol Support – Gen1: proprietary protocol, limited interoperability; Gen2: supports HDFS, Blob APIs, better integration with analytics tools
  • Availability – Gen1: regional storage only; Gen2: supports multi-region & geo-redundant storage
  • Migration – Gen1: no easy migration to Blob storage; Gen2: integrates with Azure Blob Storage, simplifying migration
  • Support Status – Gen1: deprecated (support ends Feb 29, 2024); Gen2: actively developed & recommended for new workloads

PySpark Vs Pandas

Pandas Transformations

  • df.count() – Returns the count of each column (the count includes only non-null values).
  • df.corr() – Returns the correlation between columns in a data frame.
  • df.head(n) – Returns first n rows from the top.
  • df.max() – Returns the maximum of each column.
  • df.mean() – Returns the mean of each column.
  • df.median() – Returns the median of each column.
  • df.min() – Returns the minimum value in each column.
  • df.std() – Returns the standard deviation of each column
  • df.tail(n) – Returns last n rows.
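
A tiny runnable illustration of a few of these calls (assuming pandas is installed; the DataFrame is made up for the demo):

```python
import pandas as pd

df = pd.DataFrame({"score": [70, 80, 90], "hours": [5.0, None, 7.0]})

print(df.count())          # non-null count per column: score 3, hours 2
print(df["score"].mean())  # 80.0
print(df.head(2))          # first two rows
print(df["hours"].min())   # 5.0
```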

PySpark Transformations

  • df.select() – Choose specific columns from a DataFrame.
  • df.filter() – Filter rows based on a condition.
  • df.groupBy() – Group rows based on one or more columns.
  • df.agg() – Perform aggregate functions (e.g., sum, average) on grouped data.
  • df.orderBy() – Sort rows based on one or more columns.
  • df.dropDuplicates() – Remove duplicate rows from the DataFrame.
  • df.withColumn() – Add a new column or replace an existing column with modified data.
  • df.drop() – Remove one or more columns from the DataFrame.
  • df.join() – Merge two DataFrames based on a common column or index.
  • df.pivot() – Pivot the DataFrame to reorganize data based on column values.
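
For a side-by-side feel, several of the PySpark transformations above have close pandas counterparts. The snippet below sketches that mapping using pandas only (no Spark session required) and an invented example frame:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["A", "A", "B"], "salary": [10, 20, 30]})

selected = df[["dept"]]                          # like df.select("dept")
filtered = df[df["salary"] > 15]                 # like df.filter(col("salary") > 15)
grouped = df.groupby("dept", as_index=False)["salary"].sum()  # like groupBy + agg
deduped = df.drop_duplicates()                   # like df.dropDuplicates()
with_col = df.assign(bonus=df["salary"] * 0.1)   # like df.withColumn("bonus", ...)
```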

How to Use Let’s Encrypt on Windows Server with IIS

This guide explains how to enable a FREE SSL certificate using Let’s Encrypt on a Windows Server running IIS. Specifically, it addresses the challenges of using wildcard certificates for multiple websites with different domains and subdomains.

Scenario:

You have a Windows Server 2019 with IIS 10, a single IP address, and multiple HTTPS websites hosted with different domain names. For subdomains under the same primary domain (like *.example.com), a wildcard certificate works perfectly. However, complications arise when adding websites from different domains, such as mydomain.com.

Initial Setup Example:

IIS 10 hosts the following sites:

  • ABC Server (Website)
    • abc.api.example.com – HTTPS @ 443
  • ABC Client (Website)
    • abc.example.com – HTTPS @ 443
    • bcd.example.com – HTTPS @ 443
    • cde.example.com – HTTPS @ 443
    • admin.example.com – HTTPS @ 443
  • XYZ App (Website)
    • xyz.example.com – HTTPS @ 443
  • SEQ (Website)
    • seq.mydomain.com – HTTPS @ 443

Managing these SSL certificates with multiple domain names can be tricky, but Let’s Encrypt simplifies the process.

Steps to Enable Let’s Encrypt SSL on IIS

  • Enable IIS and Create the .well-known Folder
    • Follow this guide to create the .well-known directory for SSL validation
      • Create a folder on the C drive named well-known. Inside, create another folder called pki-validation. Example: C:\well-known\pki-validation.
      • Place the required validation file in the pki-validation folder.
      • Open IIS Manager and for each site, right-click and select Add Virtual Directory.
      • In the Alias field, enter .well-known. In the Physical Path field, enter the path to the folder you created, e.g., C:\well-known\pki-validation.
      • Confirm with OK. The folder and files should now be accessible via the web.
  • Set Proper Permissions for the C:\well-known\pki-validation Folder
    • Follow this IIS 403 Forbidden solution:
      • Right-click the C:\well-known folder and select Properties.
      • Navigate to the Security tab.
      • Click Edit and ensure IIS_IUSRS is listed. If not, click Add.
      • In the Enter the object names box, type IIS_IUSRS and click OK.
      • Set Read & execute, List folder contents, and Read permissions for IIS_IUSRS.
  • Validate DNS Entries for Each Domain/Subdomain
    • Use a tool like Google Dig to validate DNS entries for the following domains:
      • abc.api.example.com
      • abc.example.com
      • bcd.example.com
      • cde.example.com
      • admin.example.com
      • xyz.example.com
      • seq.mydomain.com
  • Download and Install win-acme
    • Download win-acme from https://www.win-acme.com.
    • After downloading, unblock the files and extract them to C:\win-acme.
  • Run win-acme to Generate SSL Certificates
    • Navigate to C:\win-acme and run win-acme.exe as Administrator.
    • Follow the prompts to select the appropriate site for which you want to generate the SSL certificate.
    • Once complete, your sites will be secured with Let’s Encrypt SSL certificates.

By following these steps, you can manage multiple websites with different domains and subdomains on a single IIS server with Let’s Encrypt SSL certificates, solving the issues typically associated with wildcard certificates for different domains.
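
During HTTP validation the certificate authority requests a file under the alias configured above; a tiny helper makes the expected URL shape explicit (validation_url and the token value are illustrative, not part of win-acme):

```python
def validation_url(domain, token):
    """URL the certificate authority fetches for the setup described above."""
    return f"http://{domain}/.well-known/pki-validation/{token}"

print(validation_url("abc.example.com", "token123"))
# http://abc.example.com/.well-known/pki-validation/token123
```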

Other Resources

Connecting and Downloading Kaggle Datasets from Colab

Register at https://www.kaggle.com and generate an API token via https://www.kaggle.com/settings

# Run this cell and select the kaggle.json file downloaded
# from the Kaggle account settings page.

from google.colab import files
files.upload()

# This will prompt the file upload control, so that we can upload the file to the temporary workspace.
# Next, install the Kaggle API client.
!pip install -q kaggle

# The Kaggle API client expects this file to be in ~/.kaggle, so move it there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# This permissions change avoids a warning on Kaggle tool startup.
!chmod 600 ~/.kaggle/kaggle.json

# Searching for dataset
!kaggle datasets list -s dogbreedidfromcomp

# Downloading dataset in the current directory
!kaggle datasets download catherinehorng/dogbreedidfromcomp

# Unzipping the downloaded file into the dog_dataset folder
!unzip dogbreedidfromcomp.zip -d dog_dataset
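
The mkdir/cp/chmod steps above can also be written as a small Python function (install_kaggle_token is a made-up name; the ~/.kaggle location and 600 permissions are what the Kaggle client expects):

```python
import os
import shutil
import stat

def install_kaggle_token(src_json, kaggle_dir):
    """Copy kaggle.json into the client's config dir and restrict it to chmod 600."""
    os.makedirs(kaggle_dir, exist_ok=True)
    dest = os.path.join(kaggle_dir, "kaggle.json")
    shutil.copy(src_json, dest)
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)  # owner read/write only (600)
    return dest

# install_kaggle_token("kaggle.json", os.path.expanduser("~/.kaggle"))
```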

Self-Signed Certificate for Web Applications (PowerSchool Test Server)

When setting up PowerSchool on-premises, one of the difficult parts is generating a proper self-signed certificate. This post is for all the PowerSchool administrators out there who need to build a PowerSchool test server or restore the latest database backup to it.

OpenSSL

We commonly use OpenSSL to generate the certificate, but OpenSSL is not available on Windows Server by default. Below are the steps (credits to https://tecadmin.net/install-openssl-on-windows).

Step 1 – Download OpenSSL Binary

You need to download the latest OpenSSL Windows installer. Visit the OpenSSL download page: https://slproweb.com/products/Win32OpenSSL.html

Step 2 – Run OpenSSL Installer

Now run the OpenSSL installer on your system. OpenSSL requires Microsoft Visual C++ to be installed; if your system doesn't have it, the installer will show a message like:

Click Yes to download and install the required Microsoft Visual C++ package on your system.

Then again run the OpenSSL installer and follow the wizard.

Make sure you change the installation folder to "C:\OpenSSL-Win64" for easier handling at a later stage.

Step 3 – Setup Environment Variables

Now set the environment variables so that OpenSSL functions properly on your system. You need to set the OPENSSL_CONF and Path environment variables.

Use the following PowerShell commands to set them permanently:

[System.Environment]::SetEnvironmentVariable('OPENSSL_CONF','C:\OpenSSL-Win64\bin\openssl.cfg', 'Machine')
$Path = [Environment]::GetEnvironmentVariable('PATH', 'Machine') + [IO.Path]::PathSeparator + 'C:\OpenSSL-Win64\bin'
[System.Environment]::SetEnvironmentVariable('Path',$Path, 'Machine')

Step 4 – Run OpenSSL Binary

Open cmd or PowerShell and run openssl to verify the installation.

Generating Self-Signed certificate via OpenSSL

Please change the names below to match your situation
(credits to https://stackoverflow.com/questions/10175812/how-to-generate-a-self-signed-ssl-certificate-using-openssl):

  • powerschool.local.school => your test server FQDN
  • -subj "/C=BH/ST=Riffa/……CN=powerschool.local.school" => your own organization details
  • subjectAltName => alter the DNS and IP entries to match your server settings

# generate selfsigned certificate and key for 10 years with specific Subject and Additional subjectAltName

openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 \
  -nodes -keyout powerschool.local.school.key -out powerschool.local.school.crt -subj "/C=BH/ST=Riffa/L=Riffa/O=School/OU=PowerSchool/emailAddress=prince@xxxxxxx.com/CN=powerschool.local.school" \
  -addext "subjectAltName=DNS:powerschool.local.school,DNS:powerschool-altername.local.school,IP:172.10.1.151,IP:172.10.1.152"

The above command generates the key file in the modern format (PKCS#8), which PowerSchool doesn't like. We need to convert the key file to the traditional format (PKCS#1).

# traditional format (pkcs1)
$ cat pkcs1.pem
-----BEGIN RSA PRIVATE KEY-----
...
-----END RSA PRIVATE KEY-----
# modern format (pkcs8)
$ cat pkcs8.pem
-----BEGIN PRIVATE KEY-----
...
-----END PRIVATE KEY-----
# Convert from PKCS#8 to PKCS#1:
openssl pkey -in powerschool.local.school.key -traditional -out powerschool.local.school.rsa.key

The above code will generate a new key file (‘powerschool.local.school.rsa.key’) from your available modern key file (‘powerschool.local.school.key’)
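
Since the two formats differ only in their PEM header line, a quick check can tell which one you have before feeding the key to PowerSchool (key_format is an illustrative helper):

```python
def key_format(pem_text):
    """Classify an unencrypted private key PEM by its header line."""
    if "BEGIN RSA PRIVATE KEY" in pem_text:
        return "PKCS#1"  # traditional format, the one PowerSchool accepts
    if "BEGIN PRIVATE KEY" in pem_text:
        return "PKCS#8"  # modern format produced by openssl req above
    return "unknown"

print(key_format("-----BEGIN RSA PRIVATE KEY-----"))  # PKCS#1
```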

Additional useful commands

# Convert from PKCS#1 to PKCS#8:
openssl pkey -in pkcs1.pem -out pkcs8.pem

# Convert from PKCS#8 to PKCS#1:
openssl pkey -in pkcs8.pem -traditional -out pkcs1.pem

# RSA private key - To convert from PKCS#1 to PKCS#8:
openssl pkcs8 -topk8 -inform pem -in private_pkcs1.pem -outform pem -nocrypt \
 -out private_pkcs8.pem

# RSA private key - To convert from PKCS#8 to PKCS#1:
openssl rsa -in private_pkcs8.pem -out private_pkcs1.pem

# RSA public key - To convert from PKCS#8 to PKCS#1:
openssl rsa -pubin -in public_pkcs8.pem -RSAPublicKey_out -out public_pkcs1.pem

# RSA public key - To convert from PKCS#1 to PKCS#8:
openssl rsa -RSAPublicKey_in -in public_pkcs1.pem -pubout -out public_pkcs8.pem

# Create SSL identity file in PKCS12
openssl pkcs12 -export -in example.com.crt -inkey example.com.key -out example.com.pfx

# Convert P12 into PEM
openssl pkcs12 -in powerschool.local.school.pfx -nocerts -nodes -out powerschool.local.school.key
openssl pkcs12 -in powerschool.local.school.pfx -clcerts -nokeys -out powerschool.local.school.crt
openssl pkey -in powerschool.local.school.key -traditional -out powerschool.local.school.rsa.key

Additional Readings

Data Scientist With Microsoft

https://learn.microsoft.com/en-us/users/princeparkyohannanhotmail-8262/transcript/dlmplcnz8w96op1

ASSOCIATE CERTIFICATION: Microsoft Certified: Azure Data Scientist Associate

CERTIFICATION EXAM: Designing and Implementing a Data Science Solution on Azure (Exam DP-100)

Data Scientist Career Path

COURSES

DP-090T00: Implementing a Machine Learning Solution with Microsoft Azure Databricks – Training

Azure Databricks is a cloud-scale platform for data analytics and machine learning. In this course, you’ll learn how to use Azure Databricks to explore, prepare, and model data; and integrate Databricks machine learning processes with Azure Machine Learning.

DP-100T01: Designing and Implementing a Data Science Solution on Azure

This course teaches you to leverage your existing knowledge of Python and machine learning to manage data ingestion and preparation, model training and deployment, and machine learning solution monitoring with Azure Machine Learning and MLflow.

My Learnings.

# Calculate the number of empty cells in each column
# The following line consists of three commands. Try
# to think about how they work together to calculate
# the number of missing entries per column
missing_data = dataset.isnull().sum().to_frame()

# Rename column holding the sums
missing_data = missing_data.rename(columns={0:'Empty Cells'})

# Print the results
print(missing_data)

## OR 
print(dataset.isnull().sum().to_frame().rename(columns={0:'Empty Cells'}))

# Show the missing value rows
dataset[dataset.isnull().any(axis=1)]

EDA

import pandas as pd
import matplotlib.pyplot as plt

# Load data from a text file
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv
df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')

# Remove any rows with missing data
df_students = df_students.dropna(axis=0, how='any')

# Calculate who passed, assuming '60' is the grade needed to pass
passes  = pd.Series(df_students['Grade'] >= 60)

# Save who passed to the Pandas dataframe
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

# Create a figure for 2 subplots (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize = (10,4))

# Create a bar plot of name vs grade on the first axis
ax[0].bar(x=df_students.Name, height=df_students.Grade, color='orange')
ax[0].set_title('Grades')
ax[0].set_xticklabels(df_students.Name, rotation=90)

# Create a pie chart of pass counts on the second axis
pass_counts = df_students['Pass'].value_counts()
ax[1].pie(pass_counts, labels=pass_counts)
ax[1].set_title('Passing Grades')
ax[1].legend(pass_counts.keys().tolist())

# Add a title to the Figure
fig.suptitle('Student Data')

# Show the figure
fig.show()

# Create a function that we can re-use
def show_distribution_with_quantile(var_data, quantile = 0):
    '''
    This function will make a distribution (graph) and display it,
    optionally excluding values at or below the given quantile
    '''

    if quantile > 0:
        # calculate the quantile threshold
        q01 = var_data.quantile(quantile)
        print(f"quantile = {q01}")

        # keep only the values above the quantile threshold
        var_data = var_data[var_data > q01]

    # Get statistics
    min_val = var_data.min()
    max_val = var_data.max()
    mean_val = var_data.mean()
    med_val = var_data.median()
    mod_val = var_data.mode()[0]

    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'.format(min_val,
                                                                                            mean_val,
                                                                                            med_val,
                                                                                            mod_val,
                                                                                            max_val))

    # Create a figure for 2 subplots (2 rows, 1 column)
    fig, ax = plt.subplots(2, 1, figsize = (10,4))

    # Plot the histogram   
    ax[0].hist(var_data)
    ax[0].set_ylabel('Frequency')

    # Add lines for the mean, median, and mode
    ax[0].axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)
    ax[0].axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)

    # Plot the boxplot   
    ax[1].boxplot(var_data, vert=False)
    ax[1].set_xlabel('Value')

    # Add a title to the Figure
    fig.suptitle('Data Distribution')

    # Show the figure
    fig.show()

# Get the variable to examine
col = df_students['Grade']
# Call the function
show_distribution_with_quantile(col)

def show_density(var_data):
    fig = plt.figure(figsize=(10,4))

    # Plot density
    var_data.plot.density()

    # Add titles and labels
    plt.title('Data Density')

    # Show the mean, median, and mode
    plt.axvline(x=var_data.mean(), color = 'cyan', linestyle='dashed', linewidth = 2)
    plt.axvline(x=var_data.median(), color = 'red', linestyle='dashed', linewidth = 2)
    plt.axvline(x=var_data.mode()[0], color = 'yellow', linestyle='dashed', linewidth = 2)

    # Show the figure
    plt.show()

# Show the density of the Grade column
show_density(col)

Azure Databricks

Mount a remote Azure storage account as a DBFS folder, using the dbutils module:

data_storage_account_name = '<data_storage_account_name>'
data_storage_account_key = '<data_storage_account_key>'

data_mount_point = '/mnt/data'

data_file_path = '/bronze/wwi-factsale.csv'

dbutils.fs.mount(
  source = f"wasbs://dev@{data_storage_account_name}.blob.core.windows.net",
  mount_point = data_mount_point,
  extra_configs = {f"fs.azure.account.key.{data_storage_account_name}.blob.core.windows.net": data_storage_account_key})

display(dbutils.fs.ls("/mnt/data"))
#this path is available as dbfs:/mnt/data for spark APIs, e.g. spark.read
#this path is available as file:/dbfs/mnt/data for regular APIs, e.g. os.listdir

# %fs magic command - for accessing the dbutils filesystem module. Most dbutils.fs commands are available using %fs magic commands

We can override the cell’s default programming language by using one of the following magic commands at the start of the cell:

  • %python – for cells running python code
  • %scala – for cells running scala code
  • %r – for cells running R code
  • %sql – for cells running sql code

Additional magic commands are available:

  • %md – for descriptive cells using markdown
  • %sh – for cells running shell commands
  • %run – for cells running code defined in a separate notebook
  • %fs – for cells running code that uses dbutils commands

Update RDP Source IP For Azure Virtual Machine

Remove-Variable * -ErrorAction SilentlyContinue; Remove-Module *; $error.Clear();
$config = @{
	SubscriptionName = "";
	VMName           = ""
}
$myip = (Invoke-WebRequest -uri "http://ifconfig.me/ip").Content
Write-Host "My Public IP is", $myip;

if (-Not(Get-InstalledModule -Name "Az")) { 
	Write-Host "Installing Azure Package from PS Gallery...";
	Install-Module -Name Az -Scope CurrentUser -Repository PSGallery -Force; 
}

if (-Not(Get-Module -Name "Az") -and -Not(Get-Command -Name Get-AZContext)) {
	Write-Host "Importing Azure Module..."
	Import-Module -Name Az -Force; 
}

if (-Not (Get-AZContext)) {
	Write-Host "Connecting to Azure..."
	Connect-AzAccount #-Credential $azCredF
}

if ([bool]((Get-AZContext))) {	Write-Host "Connected to Azure using :", ((Get-AZContext).Account) }
else {	Write-Error "Failed to connect to Azure"; return; }

if (((Get-AZContext).Subscription.Name) -ne $config.SubscriptionName) {
	Write-Host "Switching Subscription Context...";
	(Get-AzSubscription -SubscriptionName $config.SubscriptionName) | 	Set-AzContext | Out-Null;
}
Write-Host "Subscription =", ((Get-AZContext).Subscription.Name);

$vm = Get-AzVM -VMName $config.VMName
Write-Host "Located Azure VM '$($vm.Name)' within '$($vm.ResourceGroupName)' ResourceGroup";

$nic = $vm.NetworkProfile.NetworkInterfaces;
$networkinterface = ($nic.id -split '/')[-1];
$nicdetails = Get-AzNetworkInterface -Name $networkinterface;

$nsg = Get-AzNetworkSecurityGroup -Name (($nicdetails.NetworkSecurityGroup.Id -split '/')[-1]) -ResourceGroupName ($vm.ResourceGroupName)

# $rules = Get-AzNetworkSecurityRuleConfig -NetworkSecurityGroup $networkSecurityGroup
# $rdpRule = $rules  | Where-Object { $_.Protocol -eq 'TCP' -and $_.DestinationPortRange -contains 3389 }

Write-Host "Before Updating public ip";
($nsg.SecurityRules | Where-Object { $_.Name -eq "RDP" }).SourceAddressPrefix

Write-Host ("*" * 10)
($nsg.SecurityRules | Where-Object { $_.Name -eq "RDP" }).SourceAddressPrefix = ([System.String[]] @("xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx", $myip))
$nsg | Set-AzNetworkSecurityGroup | Get-AzNetworkSecurityRuleConfig -Name "RDP" | Out-Null

Write-Host "After Updating public ip";
($nsg.SecurityRules | Where-Object { $_.Name -eq "RDP" }).SourceAddressPrefix

Disconnect-AzAccount | Out-Null