Azure Data Lake

data : https://rbdevdwhstorage.blob.core.windows.net/rbdevdwhtestdata/FlightData.csv

 

In my second example I wanted to use Azure Data Lake PAAS offering and see how it can help me batch process data quickly.

Inorder to check that, I downloaded a test excel file which had info of flight departures and delys from many USA major airports. Excel had close to half a million rows and i want to see how quickly data lake can parse this file for me and then run simple/complex queries.

The first step is to provision Azure Data Lake Storage to save the CSV and results. And to provision Azure Data Lake Analytics to run all the batch jobs and generate results.

Once the Data Lake Storage has been provisioned, click on the Data Explorer button and Upload the test CSV data.

b1

Once the upload is done, go to Azure Data Lake Analytics and Click on New Job.

b2.jpg

We will focus on 3 jobs here.

  • First Job will make a view that will read from the CSV file we have uploaded.
  • Second with then read all the data from the view and save it a new table on azure data lake storage that we can query.
  • we will run a test query that will go through half a milion rows and produce results in few seconds.

All we need to do it copy paste the jobs below one by one and submit them. If commands are ok, you should see screens similar to below

Step 1-

below is the command to create the view.

DROP VIEW IF EXISTS FlightDataView_2014_1;

CREATE VIEW FlightDataView_2014_1
AS
EXTRACT Year int,
Quarter int,
Month int,
DayofMonth int,
DayOfWeek int,
FlightDate string,
UniqueCarrier string,
AirlineID int,
Carrier string,
TailNum string,
FlightNum string,
OriginAirportID int,
OriginAirportSeqID int,
OriginCityMarketID int,
Origin string,
OriginCityName string,
OriginState string,
OriginStateFips string,
OriginStateName string,
OriginWac int,
DestAirportID int,
DestAirportSeqID int,
DestCityMarketID int,
Dest string,
DestCityName string,
DestState string,
DestStateFips string,
DestStateName string,
DestWac int,
CRSDepTime string,
DepTime string,
DepDelay double?,
DepDelayMinutes double?,
DepDel15 double?,
DepartureDelayGroups string,
DepTimeBlk string,
TaxiOut double?,
WheelsOff string,
WheelsOn string,
TaxiIn double?,
CRSArrTime string,
ArrTime string,
ArrDelay double?,
ArrDelayMinutes double?,
ArrDel15 double?,
ArrivalDelayGroups string,
ArrTimeBlk string,
Cancelled double?,
CancellationCode string,
Diverted double?,
CRSElapsedTime double?,
ActualElapsedTime double?,
AirTime double?,
Flights double?,
Distance double?,
DistanceGroup double?,
CarrierDelay double?,
WeatherDelay double?,
NASDelay double?,
SecurityDelay double?,
LateAircraftDelay double?,
FirstDepTime string,
TotalAddGTime string,
LongestAddGTime string,
DivAirportLandings string,
DivReachedDest string,
DivActualElapsedTime string,
DivArrDelay string,
DivDistance string,
Div1Airport string,
Div1AirportID string,
Div1AirportSeqID string,
Div1WheelsOn string,
Div1TotalGTime string,
Div1LongestGTime string,
Div1WheelsOff string,
Div1TailNum string,
Div2Airport string,
Div2AirportID string,
Div2AirportSeqID string,
Div2WheelsOn string,
Div2TotalGTime string,
Div2LongestGTime string,
Div2WheelsOff string,
Div2TailNum string,
Div3Airport string,
Div3AirportID string,
Div3AirportSeqID string,
Div3WheelsOn string,
Div3TotalGTime string,
Div3LongestGTime string,
Div3WheelsOff string,
Div3TailNum string,
Div4Airport string,
Div4AirportID string,
Div4AirportSeqID string,
Div4WheelsOn string,
Div4TotalGTime string,
Div4LongestGTime string,
Div4WheelsOff string,
Div4TailNum string,
Div5Airport string,
Div5AirportID string,
Div5AirportSeqID string,
Div5WheelsOn string,
Div5TotalGTime string,
Div5LongestGTime string,
Div5WheelsOff string,
Div5TailNum string
FROM “/rbdevdls/FlightData.csv”
USING Extractors.Text(‘,’, null, null, null,
System.Text.Encoding.UTF8, true, false, 1);

Upon execution, you should see similar screen

b3

Once the view is created to the CSV file, we will dump this data to a Data Lake catalog table using command belows

DROP TABLE IF EXISTS FlightData_2014_1;

CREATE TABLE FlightData_2014_1(
INDEX idx_year CLUSTERED (Year)
DISTRIBUTED BY RANGE (Year)
) AS
SELECT *
FROM FlightDataView_2014_1;

you will note that Azure Data Lake will quickly dump half a million rows data to the table from excel.

b4.jpg

If you want to see where this data is, Click on Data Explorer in Data Lake, and then click on catalog. Under Catalog, expand Master->Tables and you should see it there.

Now last part to see the performance of azure Data lake. I write the query below to read all the half a millions rows and produce airport list, ordered by number of flights. We will save the result of this job to a CSV file ->busiestairports.csv

@results =
SELECT Origin, COUNT(*) AS Counted
FROM FlightData_2014_1
GROUP BY Origin
ORDER BY Counted DESC
FETCH 30 ROWS;

OUTPUT @results
TO “/rbdevdls/busiestairports.csv”
USING Outputters.Csv();

you will note that azure data lake will wrap that up in 46 secs and create the file for you too.

b5.jpg

Now we can go back to azure data lake storage and view the results

b6

Azure real time analytics

I have been thinking of exploring Azure PAAS offering in Data warehousing/analytics for a while and finally decided to go ahead with 2-3 examples that will help me cover most of the PAAS offering.

In the first example, I am using Event hubs and see if i can try some real time analytics there using azure stream analytics jobs. I assumed that I am sitting in a big mall where thermometers keep sending temperature readings frequently to Azure Event Hubs. Then i am using this data to check temperature every 2 minutes and see if there is no drop/rise in data that indicate some issue with cooling system/heaters.

I started off by provisioning a event hub with 4 partitions in Azure.s1

Then I made a code in c# which sends seed data to this Azure Event Hub in the same pattern a real thermometer/thermostat would do.s2.jpg

Data was send in JSON format. Here is the example of data send.

s3

Once I am sure data is being received by Events Hubs, I want to do some real time stream analytics and make sure that i get updates for every few minutes for each device.

For Stream Analytics in this example, I Went ahead with azure PAAS offering, Stream Analytics Jobs.

s4

Every Stream Analytics has 3 important setups. Input, thats where our real time data will be picked from (here an Event Hub). Output, where the results of analysis will be saved, (here Power BI) and a query/function which manipulates input data.

See below on how we have setup these 3 in stream analysis.

On the Event Hub Dashboard in portal click input and add these details to attach it to an event hub for input.

s5.jpg

Click on test to make sure input endpoint is valid. Then go back to Event Hub Portal and click Output. And then add Power BI as output.

s6

In this case we will setup a query to check Min, Max and Average Temperature every 2 minutes and publish results on power bi so that end user can check if temperature stats are ok.

See below for query –

s7

Save the Query and Then go back to Event Hub Portal. Start the Stream Analytics Service and Check The graph to make sure Input is processed and Output is generated.

s8.jpg

Once Output is generated, go to Power BI and see if dataset is created for this exercise and if its getting refreshed.

You should see a new Dataset created.

s9.jpg

We can now use this dataset to created any report. I have made a simple table to show data produced every 2 minutes.

s11.jpg