Azure SQL Data Warehouse

Data: https://rbdevdwhstorage.blob.core.windows.net/rbdevdwhtestdata/FlightData.csv

In my third example, I explored SQL Data Warehouse on Azure.

The idea was to take a CSV with half a million rows, process it with SQL Data Warehouse, and see how it performs.

I provisioned one Azure Storage account, where I saved my test data, and one Azure SQL Data Warehouse at 100 DWUs (the smallest size available for a data warehouse).

Snapshot of the data stored in Azure Storage:

[screenshot: test data in Azure Storage]

From the Azure SQL Data Warehouse blade, copy the server name and open SQL Server Management Studio.

[screenshot: Azure SQL Data Warehouse blade in the portal]

Using SQL Server Management Studio, connect to this data warehouse as you would to any SQL Server database.

[screenshot: connecting to the data warehouse in SSMS]

Once you are connected, open a new query window.

The first step is to securely connect to the source data, here a CSV with half a million rows.

CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCreds
WITH IDENTITY = '[identityName]',
SECRET = '[azureStorageAccountKey]';

CREATE EXTERNAL DATA SOURCE azure_storage
WITH
(
TYPE = HADOOP,
LOCATION = 'wasbs://[containername]@[accountname].blob.core.windows.net/[path]',
CREDENTIAL = AzureStorageCreds
);

Next, define the format of the file we will be reading:

CREATE EXTERNAL FILE FORMAT text_file_format
WITH
(
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
USE_TYPE_DEFAULT = TRUE
)
);

The next step is to create an external table, which acts as a view over our source data:

CREATE EXTERNAL TABLE FlightData
(
[Year] varchar(255),
[Quarter] varchar(255),
[Month] varchar(255),
[DayofMonth] varchar(255),
[DayOfWeek] varchar(255),
FlightDate varchar(255),
UniqueCarrier varchar(255),
AirlineID varchar(255),
Carrier varchar(255),
TailNum varchar(255),
FlightNum varchar(255),
OriginAirportID varchar(255),
OriginAirportSeqID varchar(255),
OriginCityMarketID varchar(255),
Origin varchar(255),
OriginCityName varchar(255),
OriginState varchar(255),
OriginStateFips varchar(255),
OriginStateName varchar(255),
OriginWac varchar(255),
DestAirportID varchar(255),
DestAirportSeqID varchar(255),
DestCityMarketID varchar(255),
Dest varchar(255),
DestCityName varchar(255),
DestState varchar(255),
DestStateFips varchar(255),
DestStateName varchar(255),
DestWac varchar(255),
CRSDepTime varchar(128),
DepTime varchar(128),
DepDelay varchar(128),
DepDelayMinutes varchar(128),
DepDel15 varchar(255),
DepartureDelayGroups varchar(255),
DepTimeBlk varchar(255),
TaxiOut varchar(255),
WheelsOff varchar(255),
WheelsOn varchar(255),
TaxiIn varchar(255),
CRSArrTime varchar(128),
ArrTime varchar(128),
ArrDelay varchar(255),
ArrDelayMinutes varchar(255),
ArrDel15 varchar(255),
ArrivalDelayGroups varchar(255),
ArrTimeBlk varchar(255),
Cancelled varchar(255),
CancellationCode varchar(255),
Diverted varchar(255),
CRSElapsedTime varchar(255),
ActualElapsedTime varchar(255),
AirTime varchar(255),
Flights varchar(255),
Distance varchar(255),
DistanceGroup varchar(255),
CarrierDelay varchar(255),
WeatherDelay varchar(255),
NASDelay varchar(255),
SecurityDelay varchar(255),
LateAircraftDelay varchar(255),
FirstDepTime varchar(255),
TotalAddGTime varchar(255),
LongestAddGTime varchar(255),
DivAirportLandings varchar(255),
DivReachedDest varchar(255),
DivActualElapsedTime varchar(255),
DivArrDelay varchar(255),
DivDistance varchar(255),
Div1Airport varchar(255),
Div1AirportID varchar(255),
Div1AirportSeqID varchar(255),
Div1WheelsOn varchar(255),
Div1TotalGTime varchar(255),
Div1LongestGTime varchar(255),
Div1WheelsOff varchar(255),
Div1TailNum varchar(255),
Div2Airport varchar(255),
Div2AirportID varchar(255),
Div2AirportSeqID varchar(255),
Div2WheelsOn varchar(255),
Div2TotalGTime varchar(255),
Div2LongestGTime varchar(255),
Div2WheelsOff varchar(255),
Div2TailNum varchar(255),
Div3Airport varchar(255),
Div3AirportID varchar(255),
Div3AirportSeqID varchar(255),
Div3WheelsOn varchar(255),
Div3TotalGTime varchar(255),
Div3LongestGTime varchar(255),
Div3WheelsOff varchar(255),
Div3TailNum varchar(255),
Div4Airport varchar(255),
Div4AirportID varchar(255),
Div4AirportSeqID varchar(255),
Div4WheelsOn varchar(255),
Div4TotalGTime varchar(255),
Div4LongestGTime varchar(255),
Div4WheelsOff varchar(255),
Div4TailNum varchar(255),
Div5Airport varchar(255),
Div5AirportID varchar(255),
Div5AirportSeqID varchar(255),
Div5WheelsOn varchar(255),
Div5TotalGTime varchar(255),
Div5LongestGTime varchar(255),
Div5WheelsOff varchar(255),
Div5TailNum varchar(255)
)
WITH
(
LOCATION = '/',
DATA_SOURCE = azure_storage,
FILE_FORMAT = text_file_format,
REJECT_TYPE = VALUE,
REJECT_VALUE = 100000
);
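
At this point the external table can be queried directly, just like a view. As a quick sanity check (this query is my addition, not part of the original walkthrough), you can count the rows PolyBase reads from the CSV:

-- Sanity check: count rows read straight from the CSV in blob storage
SELECT COUNT(*) FROM FlightData;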

Once this is done, we run one final command to pull the data into the data warehouse. This is a CTAS (CREATE TABLE AS SELECT); ROUND_ROBIN distribution simply spreads the rows evenly across the distributions, which is a reasonable default for a staging table.

CREATE TABLE FlightDataStaging
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM FlightData

Once this is done, let's run a GROUP BY query over all the data and see how long it takes.
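
The exact query lives in the screenshot below; it was along these lines (a sketch only, with column names taken from the table definition above, and the delay column cast because every column was loaded as varchar):

-- Aggregate over all rows: flight count and average departure delay per carrier
SELECT Carrier,
COUNT(*) AS TotalFlights,
AVG(CAST(DepDelayMinutes AS float)) AS AvgDepDelayMinutes
FROM FlightDataStaging
GROUP BY Carrier
ORDER BY TotalFlights DESC;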

[screenshot: GROUP BY query results and timing in SSMS]

You will notice it took barely 2 seconds.

If we want, we can also connect our data warehouse to Power BI and display the results there.

I ran a report similar to the query above, and it came up in no time.

[screenshot: Power BI report over the data warehouse]


Azure Real-Time Analytics

I had been thinking of exploring Azure's PaaS offerings in data warehousing/analytics for a while, and finally decided to go ahead with 2-3 examples that cover most of them.

In the first example, I use Event Hubs and see if I can do some real-time analytics on top of them with Azure Stream Analytics jobs. The scenario: I imagined sitting in a big mall where thermometers frequently send temperature readings to Azure Event Hubs. I then use this data to check the temperature every 2 minutes and flag any drop or rise that indicates a problem with the cooling system or heaters.

I started off by provisioning an Event Hub with 4 partitions in Azure.

[screenshot: Event Hub with 4 partitions]

Then I wrote some C# code that sends seed data to this Azure Event Hub in the same pattern a real thermometer/thermostat would.

[screenshot: C# sender code]

Data was sent in JSON format. Here is an example of the data sent:

[screenshot: sample JSON event payload]
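
A representative payload looked something like the snippet below. The field names here are my reconstruction for illustration, not necessarily the exact ones in the screenshot:

{ "DeviceId": "thermostat-01", "Temperature": 22.4, "EventTime": "2017-05-01T10:00:00Z" }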

Once I was sure data was being received by Event Hubs, I wanted to do some real-time stream analytics and get updates every few minutes for each device.

For stream analytics in this example, I went with the Azure PaaS offering, Stream Analytics jobs.

[screenshot: Stream Analytics job in the portal]

Every Stream Analytics job has three important pieces of setup: an input, where the real-time data is read from (here an Event Hub); an output, where the results of the analysis are written (here Power BI); and a query/function that transforms the input data.

See below for how these three are set up in Stream Analytics.

On the Stream Analytics job dashboard in the portal, click Inputs and add these details to attach the Event Hub as the input.

[screenshot: Event Hub input configuration]

Click Test to make sure the input endpoint is valid. Then go back to the Stream Analytics job blade, click Outputs, and add Power BI as the output.

[screenshot: Power BI output configuration]

In this case we will set up a query that computes the min, max, and average temperature every 2 minutes and publishes the results to Power BI, so the end user can check whether the temperature statistics look OK.

See below for the query:

[screenshot: Stream Analytics query]
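
For reference, the query has the general shape of a 2-minute tumbling-window aggregation. The input/output aliases and field names below are my assumptions based on the setup above, not the verbatim query from the screenshot:

-- Min/max/avg temperature per device over 2-minute tumbling windows
-- [powerbi-output] and [eventhub-input] are assumed alias names for the
-- output and input configured above
SELECT
DeviceId,
MIN(Temperature) AS MinTemperature,
MAX(Temperature) AS MaxTemperature,
AVG(Temperature) AS AvgTemperature,
System.Timestamp AS WindowEnd
INTO [powerbi-output]
FROM [eventhub-input] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(minute, 2)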

Save the query, then go back to the Stream Analytics job blade. Start the job and check the graph to make sure input is being processed and output is being generated.

[screenshot: Stream Analytics job monitoring graph]

Once output is generated, go to Power BI and check that a dataset has been created for this exercise and that it is getting refreshed.

You should see a new Dataset created.

[screenshot: new streaming dataset in Power BI]

We can now use this dataset to create any report. I made a simple table to show the data produced every 2 minutes.

[screenshot: Power BI table of 2-minute aggregates]

Azure SQL Backup to Azure Storage

Use the following script to back up an Azure SQL database to Azure Storage.
Replace the database and storage details in the script below.
You will also need to import your Azure publish settings file to run this script.
Refer to this article for how to download and import the publish settings file:

https://devopsandcloud.wordpress.com/2017/01/21/download-and-import-publish-settings-and-subscription-information/

# Check if Windows Azure PowerShell is available
try {
    Import-Module Azure -ErrorAction Stop
} catch {
    throw "Windows Azure PowerShell not found! Please make sure to install it from http://www.windowsazure.com/en-us/downloads/#cmd-line-tools"
}

Import-AzurePublishSettingsFile "C:\jenkinsjobs\Pay-As-You-Go.publishsettings" # replace with your publish settings file path

$DatabaseServerName = "azureservername.database.windows.net"
$DatabaseName = "DBName"
$DatabasePassword = "azure sql password"
$DatabaseUsername = "azure sql user"
$StorageName = "storage name"
$StorageKey = "storage key"
$StorageContainerName = "containername"
$dateTime = Get-Date -Format u
$blobName = "$DatabaseName.$dateTime.bacpac"
Write-Host "Using blobName: $blobName"

# Create database connection
$securedPassword = ConvertTo-SecureString -String $DatabasePassword -AsPlainText -Force
$serverCredential = New-Object System.Management.Automation.PSCredential($DatabaseUsername, $securedPassword)
$databaseContext = New-AzureSqlDatabaseServerContext -FullyQualifiedServerName $DatabaseServerName -Credential $serverCredential

# Create storage connection
$storageContext = New-AzureStorageContext -StorageAccountName $StorageName -StorageAccountKey $StorageKey

# Initiate the export
$operationStatus = Start-AzureSqlDatabaseExport -StorageContext $storageContext -SqlConnectionContext $databaseContext -BlobName $blobName -DatabaseName $DatabaseName -StorageContainerName $StorageContainerName

# Poll until the export operation finishes
do {
    if ($operationStatus) {
        $status = Get-AzureSqlDatabaseImportExportStatus -Request $operationStatus
        if ($status) {
            Start-Sleep -s 3
            $progress = $status.Status.ToString()
            Write-Host "Waiting for database export completion. Operation status: $progress"
        } else {
            Write-Host "Null status. Awaiting updates."
        }
    }
} until ($status.Status -eq "Completed")
Write-Host "Database export is complete"