Azure SQL Data Warehouse

data : https://rbdevdwhstorage.blob.core.windows.net/rbdevdwhtestdata/FlightData.csv

In my third example, I explored SQL Data Warehouse on Azure.

The idea was to take a CSV with half a million rows, process it using SQL Data Warehouse, and see how it performs.

I provisioned an Azure Storage account where I saved my test data, and an Azure SQL Data Warehouse at 100 DWUs (the smallest data warehouse tier).

Snapshot of the data stored in Azure Storage –

w1

Azure SQL Data Warehouse. Copy the server name from here and open SQL Server Management Studio.

w2.jpg

Using SQL Server Management Studio, connect to this data warehouse like any other MSSQL database.

w3.jpg

 

Once you are connected, open a new query window.

The first step is to securely connect to the source data, here a CSV with half a million rows.

CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCreds
WITH IDENTITY = '[identityName]'
, SECRET = '[azureStorageAccountKey]'
;

CREATE EXTERNAL DATA SOURCE azure_storage
WITH
(
TYPE = HADOOP
, LOCATION = 'wasbs://[containername]@[accountname].blob.core.windows.net/[path]'
, CREDENTIAL = AzureStorageCreds
);

Next, define the format of the file we will be reading:

CREATE EXTERNAL FILE FORMAT text_file_format
WITH
(
FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
USE_TYPE_DEFAULT = TRUE
)
);

The next step is to create an external table, which acts as a view over our source data:

CREATE EXTERNAL TABLE FlightData
(
[Year] varchar(255),
[Quarter] varchar(255),
[Month] varchar(255),
[DayofMonth] varchar(255),
[DayOfWeek] varchar(255),
FlightDate varchar(255),
UniqueCarrier varchar(255),
AirlineID varchar(255),
Carrier varchar(255),
TailNum varchar(255),
FlightNum varchar(255),
OriginAirportID varchar(255),
OriginAirportSeqID varchar(255),
OriginCityMarketID varchar(255),
Origin varchar(255),
OriginCityName varchar(255),
OriginState varchar(255),
OriginStateFips varchar(255),
OriginStateName varchar(255),
OriginWac varchar(255),
DestAirportID varchar(255),
DestAirportSeqID varchar(255),
DestCityMarketID varchar(255),
Dest varchar(255),
DestCityName varchar(255),
DestState varchar(255),
DestStateFips varchar(255),
DestStateName varchar(255),
DestWac varchar(255),
CRSDepTime varchar(128),
DepTime varchar(128),
DepDelay varchar(128),
DepDelayMinutes varchar(128),
DepDel15 varchar(255),
DepartureDelayGroups varchar(255),
DepTimeBlk varchar(255),
TaxiOut varchar(255),
WheelsOff varchar(255),
WheelsOn varchar(255),
TaxiIn varchar(255),
CRSArrTime varchar(128),
ArrTime varchar(128),
ArrDelay varchar(255),
ArrDelayMinutes varchar(255),
ArrDel15 varchar(255),
ArrivalDelayGroups varchar(255),
ArrTimeBlk varchar(255),
Cancelled varchar(255),
CancellationCode varchar(255),
Diverted varchar(255),
CRSElapsedTime varchar(255),
ActualElapsedTime varchar(255),
AirTime varchar(255),
Flights varchar(255),
Distance varchar(255),
DistanceGroup varchar(255),
CarrierDelay varchar(255),
WeatherDelay varchar(255),
NASDelay varchar(255),
SecurityDelay varchar(255),
LateAircraftDelay varchar(255),
FirstDepTime varchar(255),
TotalAddGTime varchar(255),
LongestAddGTime varchar(255),
DivAirportLandings varchar(255),
DivReachedDest varchar(255),
DivActualElapsedTime varchar(255),
DivArrDelay varchar(255),
DivDistance varchar(255),
Div1Airport varchar(255),
Div1AirportID varchar(255),
Div1AirportSeqID varchar(255),
Div1WheelsOn varchar(255),
Div1TotalGTime varchar(255),
Div1LongestGTime varchar(255),
Div1WheelsOff varchar(255),
Div1TailNum varchar(255),
Div2Airport varchar(255),
Div2AirportID varchar(255),
Div2AirportSeqID varchar(255),
Div2WheelsOn varchar(255),
Div2TotalGTime varchar(255),
Div2LongestGTime varchar(255),
Div2WheelsOff varchar(255),
Div2TailNum varchar(255),
Div3Airport varchar(255),
Div3AirportID varchar(255),
Div3AirportSeqID varchar(255),
Div3WheelsOn varchar(255),
Div3TotalGTime varchar(255),
Div3LongestGTime varchar(255),
Div3WheelsOff varchar(255),
Div3TailNum varchar(255),
Div4Airport varchar(255),
Div4AirportID varchar(255),
Div4AirportSeqID varchar(255),
Div4WheelsOn varchar(255),
Div4TotalGTime varchar(255),
Div4LongestGTime varchar(255),
Div4WheelsOff varchar(255),
Div4TailNum varchar(255),
Div5Airport varchar(255),
Div5AirportID varchar(255),
Div5AirportSeqID varchar(255),
Div5WheelsOn varchar(255),
Div5TotalGTime varchar(255),
Div5LongestGTime varchar(255),
Div5WheelsOff varchar(255),
Div5TailNum varchar(255)
)
WITH
(
LOCATION = '/',
DATA_SOURCE = azure_storage,
FILE_FORMAT = text_file_format,
REJECT_TYPE = value,
REJECT_VALUE = 100000
);

Once this is done, we run one final command (a CTAS) to pull the data into the data warehouse:

CREATE TABLE FlightDataStaging
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM FlightData

Once this is done, let's run a GROUP BY query over all the data and see how long it takes.
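For illustration, the kind of aggregation I ran is roughly the following (the exact query is the one in the screenshot below; the choice of Origin here is just an example column):

SELECT Origin, COUNT(*) AS FlightCount
FROM FlightDataStaging
GROUP BY Origin
ORDER BY FlightCount DESC;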

w4.jpg

You will notice it barely took 2 seconds.

If we want, we can also link our data warehouse to Power BI and display the results.

I ran a report similar to the one above and it came up in no time.

w5.jpg

Azure Data Lake

data : https://rbdevdwhstorage.blob.core.windows.net/rbdevdwhtestdata/FlightData.csv

 

In my second example, I wanted to use the Azure Data Lake PaaS offering and see how it can help me batch process data quickly.

In order to check that, I downloaded a test CSV file with info on flight departures and delays from major US airports. The file had close to half a million rows, and I wanted to see how quickly Data Lake could parse it and then run simple/complex queries.

The first step is to provision Azure Data Lake Storage to hold the CSV and the results, and Azure Data Lake Analytics to run the batch jobs and generate the results.

Once the Data Lake Storage has been provisioned, click on the Data Explorer button and Upload the test CSV data.

b1

Once the upload is done, go to Azure Data Lake Analytics and Click on New Job.

b2.jpg

We will focus on 3 jobs here.

  • The first job will create a view that reads from the CSV file we have uploaded.
  • The second will then read all the data from the view and save it to a new table on Azure Data Lake Storage that we can query.
  • The third is a test query that goes through half a million rows and produces results in a few seconds.

All we need to do is copy-paste the jobs below one by one and submit them. If the commands are OK, you should see screens similar to those below.

Step 1-

Below is the command to create the view.

DROP VIEW IF EXISTS FlightDataView_2014_1;

CREATE VIEW FlightDataView_2014_1
AS
EXTRACT Year int,
Quarter int,
Month int,
DayofMonth int,
DayOfWeek int,
FlightDate string,
UniqueCarrier string,
AirlineID int,
Carrier string,
TailNum string,
FlightNum string,
OriginAirportID int,
OriginAirportSeqID int,
OriginCityMarketID int,
Origin string,
OriginCityName string,
OriginState string,
OriginStateFips string,
OriginStateName string,
OriginWac int,
DestAirportID int,
DestAirportSeqID int,
DestCityMarketID int,
Dest string,
DestCityName string,
DestState string,
DestStateFips string,
DestStateName string,
DestWac int,
CRSDepTime string,
DepTime string,
DepDelay double?,
DepDelayMinutes double?,
DepDel15 double?,
DepartureDelayGroups string,
DepTimeBlk string,
TaxiOut double?,
WheelsOff string,
WheelsOn string,
TaxiIn double?,
CRSArrTime string,
ArrTime string,
ArrDelay double?,
ArrDelayMinutes double?,
ArrDel15 double?,
ArrivalDelayGroups string,
ArrTimeBlk string,
Cancelled double?,
CancellationCode string,
Diverted double?,
CRSElapsedTime double?,
ActualElapsedTime double?,
AirTime double?,
Flights double?,
Distance double?,
DistanceGroup double?,
CarrierDelay double?,
WeatherDelay double?,
NASDelay double?,
SecurityDelay double?,
LateAircraftDelay double?,
FirstDepTime string,
TotalAddGTime string,
LongestAddGTime string,
DivAirportLandings string,
DivReachedDest string,
DivActualElapsedTime string,
DivArrDelay string,
DivDistance string,
Div1Airport string,
Div1AirportID string,
Div1AirportSeqID string,
Div1WheelsOn string,
Div1TotalGTime string,
Div1LongestGTime string,
Div1WheelsOff string,
Div1TailNum string,
Div2Airport string,
Div2AirportID string,
Div2AirportSeqID string,
Div2WheelsOn string,
Div2TotalGTime string,
Div2LongestGTime string,
Div2WheelsOff string,
Div2TailNum string,
Div3Airport string,
Div3AirportID string,
Div3AirportSeqID string,
Div3WheelsOn string,
Div3TotalGTime string,
Div3LongestGTime string,
Div3WheelsOff string,
Div3TailNum string,
Div4Airport string,
Div4AirportID string,
Div4AirportSeqID string,
Div4WheelsOn string,
Div4TotalGTime string,
Div4LongestGTime string,
Div4WheelsOff string,
Div4TailNum string,
Div5Airport string,
Div5AirportID string,
Div5AirportSeqID string,
Div5WheelsOn string,
Div5TotalGTime string,
Div5LongestGTime string,
Div5WheelsOff string,
Div5TailNum string
FROM "/rbdevdls/FlightData.csv"
USING Extractors.Text(',', null, null, null,
System.Text.Encoding.UTF8, true, false, 1);

Upon execution, you should see a screen similar to this:

b3

Once the view over the CSV file is created, we will dump this data into a Data Lake catalog table using the commands below:

DROP TABLE IF EXISTS FlightData_2014_1;

CREATE TABLE FlightData_2014_1(
INDEX idx_year CLUSTERED (Year)
DISTRIBUTED BY RANGE (Year)
) AS
SELECT *
FROM FlightDataView_2014_1;

You will note that Azure Data Lake quickly dumps the half a million rows of data from the CSV into the table.

b4.jpg

If you want to see where this data is, click on Data Explorer in Data Lake, then click on Catalog. Under Catalog, expand master -> Tables and you should see it there.

Now the last part, to check the performance of Azure Data Lake. I wrote the query below to read all half a million rows and produce a list of airports ordered by number of flights. We will save the result of this job to a CSV file, busiestairports.csv.

@results =
SELECT Origin, COUNT(*) AS Counted
FROM FlightData_2014_1
GROUP BY Origin
ORDER BY Counted DESC
FETCH 30 ROWS;

OUTPUT @results
TO "/rbdevdls/busiestairports.csv"
USING Outputters.Csv();

You will note that Azure Data Lake wraps that up in 46 seconds and creates the output file for you too.

b5.jpg

Now we can go back to Azure Data Lake Storage and view the results.

b6

Azure Real-Time Analytics

I have been thinking of exploring the Azure PaaS offerings in data warehousing/analytics for a while, and finally decided to go ahead with 2-3 examples that will help me cover most of them.

In the first example, I am using Event Hubs to see if I can try some real-time analytics using Azure Stream Analytics jobs. I assumed that I am sitting in a big mall where thermometers keep sending temperature readings frequently to Azure Event Hubs. I then use this data to check the temperature every 2 minutes and verify there is no drop/rise that would indicate an issue with the cooling system/heaters.

I started off by provisioning an event hub with 4 partitions in Azure.

s1

Then I wrote some C# code which sends seed data to this Azure Event Hub in the same pattern a real thermometer/thermostat would.

s2.jpg

Data was sent in JSON format. Here is an example of the data sent.

s3
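A payload of roughly this shape is what the simulator would send (the field names here are illustrative assumptions; the actual sample is in the screenshot above):

{ "deviceid": "thermostat-01", "temperature": 22.4, "readingtime": "2017-06-01T10:02:00Z" }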

Once I am sure data is being received by Event Hubs, I want to do some real-time stream analytics and make sure I get updates every few minutes for each device.

For stream analytics in this example, I went ahead with the Azure PaaS offering, Stream Analytics jobs.

s4

Every Stream Analytics job has three important parts: the input, where our real-time data is picked up from (here an Event Hub); the output, where the results of the analysis are saved (here Power BI); and a query/function which transforms the input data.

See below for how we set up these three parts in Stream Analytics.

On the Stream Analytics job dashboard in the portal, click Inputs and add these details to attach it to an event hub as the input.

s5.jpg

Click on Test to make sure the input endpoint is valid. Then go back to the Stream Analytics job, click Outputs, and add Power BI as the output.

s6

In this case we will set up a query to compute the min, max and average temperature every 2 minutes and publish the results to Power BI so that the end user can check whether the temperature stats are OK.

See below for the query –

s7
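For reference, a tumbling-window query along these lines produces min/max/average per device every 2 minutes (the field names and the input/output aliases below are assumptions; use whatever names you gave your input, output and JSON fields):

SELECT
    deviceid,
    System.Timestamp AS WindowEnd,
    MIN(temperature) AS MinTemperature,
    MAX(temperature) AS MaxTemperature,
    AVG(temperature) AS AvgTemperature
INTO
    [powerbi-output]
FROM
    [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
    deviceid,
    TumblingWindow(minute, 2)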

Save the query and then go back to the Stream Analytics job. Start the job and check the graph to make sure input is being processed and output is being generated.

s8.jpg

Once output is generated, go to Power BI and see if a dataset has been created for this exercise and whether it is getting refreshed.

You should see a new Dataset created.

s9.jpg

We can now use this dataset to create any report. I have made a simple table to show the data produced every 2 minutes.

s11.jpg

Deploying a Web Deploy Package to AWS ElasticBeanstalk

AWS provides an extension to Visual Studio to make interacting with your AWS services easy, including deploying to a Beanstalk environment, which is the recommended way of deploying to a Beanstalk.

This works great, and if you are able to, you should obviously use the recommended approach, but there may be times you don't have the extension available, or already have a build system set up to use Web Deploy packages. As far as I can tell, Beanstalk just uses MSDeploy packages under the hood, making it easy to deploy these without the extension!

1. Create a Package

If you don't already have a Web Deploy package, create one. This is simple in Visual Studio: open your web application, right-click on the Web Application project and select Publish.

This will open the Publish Web dialog:

Deploy1

Select Custom, and give your profile a name (for example the Beanstalk environment name). On the Connection screen, change the Publish method to Web Deploy Package.

Deploy2

Enter Default Web Site for the Site name, and choose a location on your machine to create the package.

Check the Settings are correct on the next screen, confirm your publish location on the Preview screen, and then click Publish.

Navigate to the folder where the package was created; you should see 5 files, the only one of interest in this case being the ZIP file.

 

2. Deploying the package

Browse to your environment in the AWS Management Console

Deploy3

Select Upload and Deploy.

 

Deploy4

Choose the ZIP file created earlier, and give this version a label (these should be unique amongst the labels used for this application).

Clicking Deploy will start the deployment of the code to this environment; you can monitor the logs in the console to follow the status of your deployment.

If everything goes to plan you should see a message in the logs saying "New application version was deployed to running EC2 instances."

Next Steps

Just as with the AWS Visual Studio Toolkit and the Deployment Tool command line program, these steps can be automated without the extension.

The package can be created using MSBuild and the Package target.
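For example, something roughly like this from a developer command prompt (the project and output paths are placeholders):

msbuild MyWebApp.csproj /t:Package /p:Configuration=Release /p:PackageLocation=C:\packages\MyWebApp.zip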

The deployment to AWS can be automated using either the CLI tools or the PowerShell tools.
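For instance, with the AWS CLI the upload and deployment could look roughly like this (bucket, application, environment and version names are placeholders):

aws s3 cp C:\packages\MyWebApp.zip s3://my-deploy-bucket/MyWebApp-v1.zip
aws elasticbeanstalk create-application-version --application-name MyWebApp --version-label v1 --source-bundle S3Bucket=my-deploy-bucket,S3Key=MyWebApp-v1.zip
aws elasticbeanstalk update-environment --environment-name MyWebApp-env --version-label v1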

Minimum configuration to use CodeDeploy with .NET MVC in AWS

Trying CodeDeploy for the first time in AWS and looking for the minimum to get going? Using CodeDeploy in AWS needs three minimum steps:

  1. Install the CodeDeploy agent on the target instances.
  2. Set up an appspec.yml file to direct how CodeDeploy will work, where it will deploy using the agents, etc.
  3. Set up CodeDeploy in the console to trigger a push to the target.

To set up the CodeDeploy agent on the target instance where you wish to deploy code, use the following commands (for Windows Server, as we are targeting MVC on the IIS web server):

Set-ExecutionPolicy RemoteSigned
Import-Module AWSPowerShell

powershell.exe -Command Set-AWSCredentials -AccessKey 'xxxxx' -SecretKey 'xxxx'
powershell.exe -Command Initialize-AWSDefaults

(there are better ways to handle credentials, but just going with this for the demo)
New-Item -Path "c:\temp" -ItemType "directory" -Force
powershell.exe -Command Read-S3Object -BucketName bucket-name -Key latest/codedeploy-agent.msi -File c:\temp\codedeploy-agent.msi -AccessKey 'xxxxx' -SecretKey 'xxxxxx'
c:\temp\codedeploy-agent.msi /quiet /l c:\temp\host-agent-install-log.txt
powershell.exe -Command Get-Service -Name codedeployagent

bucket-name represents one of the following:

  • aws-codedeploy-us-east-1 for instances in the US East (N. Virginia) region
  • aws-codedeploy-us-east-2 for instances in the US East (Ohio) region
  • aws-codedeploy-us-west-1 for instances in the US West (N. California) region
  • aws-codedeploy-us-west-2 for instances in the US West (Oregon) region
  • aws-codedeploy-ca-central-1 for instances in the Canada (Central) region
  • aws-codedeploy-eu-west-1 for instances in the EU (Ireland) region
  • aws-codedeploy-eu-west-2 for instances in the EU (London) region
  • aws-codedeploy-eu-central-1 for instances in the EU (Frankfurt) region
  • aws-codedeploy-ap-northeast-1 for instances in the Asia Pacific (Tokyo) region
  • aws-codedeploy-ap-northeast-2 for instances in the Asia Pacific (Seoul) region
  • aws-codedeploy-ap-southeast-1 for instances in the Asia Pacific (Singapore) region
  • aws-codedeploy-ap-southeast-2 for instances in the Asia Pacific (Sydney) region
  • aws-codedeploy-ap-south-1 for instances in the Asia Pacific (Mumbai) region
  • aws-codedeploy-sa-east-1 for instances in the South America (São Paulo) region

If the AWS CodeDeploy agent is installed and running, after the Get-Service command call, you should see output similar to the following:


Status   Name                DisplayName
------   ----                -----------
Running  codedeployagent    CodeDeploy Host Agent Service

Use the code below to make a simple appspec.yml file and place it in the root of the zip file, along with the entire published output. Make sure you published your code using the File System option, not the Web Deploy option used for Beanstalk deployments.

Just upload this zip to S3 and use CodeDeploy to push the file to the target.

version: 0.0
os: windows
files:
  - source: \
    destination: c:\inetpub\wwwroot\

My final zip file structure looked like this, and it worked successfully.

appconfig
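For reference, the upload and push can also be done from the AWS CLI, roughly like this (application, deployment group, bucket and source folder names are placeholders):

aws deploy push --application-name MyMvcApp --s3-location s3://my-codedeploy-bucket/MyMvcApp.zip --source .\publish
aws deploy create-deployment --application-name MyMvcApp --deployment-group-name MyMvcApp-Group --s3-location bucket=my-codedeploy-bucket,key=MyMvcApp.zip,bundleType=zip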

 

List all Azure resources in a CSV / Excel

Run the following PowerShell script to list all the Azure resources across your subscriptions.

# settings
$defaultPath = "c:\Temp\azureresources.csv"
$csvDelimiter = ';'
# set azure account
[void](Login-AzureRmAccount)
# receive all subscriptions
$subscriptions = Get-AzureRmSubscription
$subscriptions | ft SubscriptionId, SubscriptionName
# select azure subscriptions that you want to export
"Please enter subscription ids (comma separated, leave empty to use all subscriptions)"
$subscriptionIds = read-host
if ([String]::IsNullOrWhiteSpace($subscriptionIds)) {
    $subscriptionIds = @($subscriptions | select -ExpandProperty SubscriptionId)
}
elseif ($subscriptionIds.Contains(',')) {
    $subscriptionIds = $subscriptionIds.Split(',')
}
else {
    $subscriptionIds = @($subscriptionIds)
}
# configure csv output
"Enter destination path - leave it empty to use $defaultPath"
$path = read-host
if ([String]::IsNullOrWhiteSpace($path)) {
    $path = $defaultPath
}
if (Test-Path $path) {
    "File $path already exists. Delete? y/n [Default: y]"
    $remove = read-host
    if ([String]::IsNullOrWhiteSpace($remove) -or $remove.ToLower().Equals('y')) {
        Remove-Item $path
    }
}
"Start exporting data..."
foreach ($subscriptionId in $subscriptionIds) {
    # change azure subscription
    [void](Set-AzureRmContext -SubscriptionId $subscriptionId)
    # read subscription name as we want to see it in the exported csv
    $subscriptionName = ($subscriptions | Where { $_.SubscriptionId -eq $subscriptionId }).SubscriptionName

    $subscriptionSelector = @{ Label = "SubscriptionName"; Expression = { $subscriptionName } }
    $tagSelector = @{ Name = "Tags"; Expression = { if ($_.Tags -ne $null) { $x = $_.Tags | % { "{ `"" + $_.Name + "`" : `"" + $_.Value + "`" }, " }; ("{ " + ([string]$x).TrimEnd(", ") + " }") } } }
    # get resources from azure subscription
    $export = Get-AzureRmResource | select *, $subscriptionSelector, $tagSelector -ExcludeProperty "Tags"
    $export | Export-CSV $path -Delimiter $csvDelimiter -Append -Force -NoTypeInformation
    "Exported " + $subscriptionId + " - " + $subscriptionName
}
"Export done!"
If you want to run this script from a scheduler, you need to save an Azure profile so that the script can pick it up. Run the following commands:

Add-AzureRmAccount
Save-AzureRmProfile -Path "c:\temp\azureprofile.json"

After checking that the file exists, the following (in place of the Login-AzureRmAccount call near the top of the previous script) should load the Azure profile:

Select-AzureRmProfile -Path $azureProfilePath
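If it helps, registering the script with Task Scheduler from an elevated command prompt could look roughly like this (the script path and schedule are placeholders):

schtasks /Create /SC DAILY /ST 06:00 /TN "ExportAzureResources" /TR "powershell.exe -ExecutionPolicy Bypass -File c:\scripts\export-azureresources.ps1"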

Replicate AWS S3 to another S3

To replicate one S3 bucket to another, one of the easiest and best ways is to use the s3cmd command.

To install this command, use:
apt-get install s3cmd

After installation, run the following command to set up s3cmd; this will also ask for your access and secret keys:
s3cmd --configure

Then run the following commands to create a new backup bucket and replicate from the existing one:
s3cmd mb s3://mybucket_backup
s3cmd --recursive cp s3://mybucket s3://mybucket_backup
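For ongoing replication rather than a one-off copy, newer versions of s3cmd can also sync bucket to bucket, skipping objects that already exist in the destination:
s3cmd sync s3://mybucket s3://mybucket_backup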