Azure Data Lake

data : https://rbdevdwhstorage.blob.core.windows.net/rbdevdwhtestdata/FlightData.csv

 

In my second example I wanted to use Azure Data Lake PAAS offering and see how it can help me batch process data quickly.

Inorder to check that, I downloaded a test excel file which had info of flight departures and delys from many USA major airports. Excel had close to half a million rows and i want to see how quickly data lake can parse this file for me and then run simple/complex queries.

The first step is to provision Azure Data Lake Storage to save the CSV and results. And to provision Azure Data Lake Analytics to run all the batch jobs and generate results.

Once the Data Lake Storage has been provisioned, click on the Data Explorer button and Upload the test CSV data.

b1

Once the upload is done, go to Azure Data Lake Analytics and Click on New Job.

b2.jpg

We will focus on 3 jobs here.

  • First Job will make a view that will read from the CSV file we have uploaded.
  • Second with then read all the data from the view and save it a new table on azure data lake storage that we can query.
  • we will run a test query that will go through half a milion rows and produce results in few seconds.

All we need to do it copy paste the jobs below one by one and submit them. If commands are ok, you should see screens similar to below

Step 1-

below is the command to create the view.

DROP VIEW IF EXISTS FlightDataView_2014_1;

CREATE VIEW FlightDataView_2014_1
AS
EXTRACT Year int,
Quarter int,
Month int,
DayofMonth int,
DayOfWeek int,
FlightDate string,
UniqueCarrier string,
AirlineID int,
Carrier string,
TailNum string,
FlightNum string,
OriginAirportID int,
OriginAirportSeqID int,
OriginCityMarketID int,
Origin string,
OriginCityName string,
OriginState string,
OriginStateFips string,
OriginStateName string,
OriginWac int,
DestAirportID int,
DestAirportSeqID int,
DestCityMarketID int,
Dest string,
DestCityName string,
DestState string,
DestStateFips string,
DestStateName string,
DestWac int,
CRSDepTime string,
DepTime string,
DepDelay double?,
DepDelayMinutes double?,
DepDel15 double?,
DepartureDelayGroups string,
DepTimeBlk string,
TaxiOut double?,
WheelsOff string,
WheelsOn string,
TaxiIn double?,
CRSArrTime string,
ArrTime string,
ArrDelay double?,
ArrDelayMinutes double?,
ArrDel15 double?,
ArrivalDelayGroups string,
ArrTimeBlk string,
Cancelled double?,
CancellationCode string,
Diverted double?,
CRSElapsedTime double?,
ActualElapsedTime double?,
AirTime double?,
Flights double?,
Distance double?,
DistanceGroup double?,
CarrierDelay double?,
WeatherDelay double?,
NASDelay double?,
SecurityDelay double?,
LateAircraftDelay double?,
FirstDepTime string,
TotalAddGTime string,
LongestAddGTime string,
DivAirportLandings string,
DivReachedDest string,
DivActualElapsedTime string,
DivArrDelay string,
DivDistance string,
Div1Airport string,
Div1AirportID string,
Div1AirportSeqID string,
Div1WheelsOn string,
Div1TotalGTime string,
Div1LongestGTime string,
Div1WheelsOff string,
Div1TailNum string,
Div2Airport string,
Div2AirportID string,
Div2AirportSeqID string,
Div2WheelsOn string,
Div2TotalGTime string,
Div2LongestGTime string,
Div2WheelsOff string,
Div2TailNum string,
Div3Airport string,
Div3AirportID string,
Div3AirportSeqID string,
Div3WheelsOn string,
Div3TotalGTime string,
Div3LongestGTime string,
Div3WheelsOff string,
Div3TailNum string,
Div4Airport string,
Div4AirportID string,
Div4AirportSeqID string,
Div4WheelsOn string,
Div4TotalGTime string,
Div4LongestGTime string,
Div4WheelsOff string,
Div4TailNum string,
Div5Airport string,
Div5AirportID string,
Div5AirportSeqID string,
Div5WheelsOn string,
Div5TotalGTime string,
Div5LongestGTime string,
Div5WheelsOff string,
Div5TailNum string
FROM “/rbdevdls/FlightData.csv”
USING Extractors.Text(‘,’, null, null, null,
System.Text.Encoding.UTF8, true, false, 1);

Upon execution, you should see similar screen

b3

Once the view is created to the CSV file, we will dump this data to a Data Lake catalog table using command belows

DROP TABLE IF EXISTS FlightData_2014_1;

CREATE TABLE FlightData_2014_1(
INDEX idx_year CLUSTERED (Year)
DISTRIBUTED BY RANGE (Year)
) AS
SELECT *
FROM FlightDataView_2014_1;

you will note that Azure Data Lake will quickly dump half a million rows data to the table from excel.

b4.jpg

If you want to see where this data is, Click on Data Explorer in Data Lake, and then click on catalog. Under Catalog, expand Master->Tables and you should see it there.

Now last part to see the performance of azure Data lake. I write the query below to read all the half a millions rows and produce airport list, ordered by number of flights. We will save the result of this job to a CSV file ->busiestairports.csv

@results =
SELECT Origin, COUNT(*) AS Counted
FROM FlightData_2014_1
GROUP BY Origin
ORDER BY Counted DESC
FETCH 30 ROWS;

OUTPUT @results
TO “/rbdevdls/busiestairports.csv”
USING Outputters.Csv();

you will note that azure data lake will wrap that up in 46 secs and create the file for you too.

b5.jpg

Now we can go back to azure data lake storage and view the results

b6