In my second example, I wanted to try the Azure Data Lake PaaS offering and see how quickly it can batch process data.
To check that, I downloaded a test CSV file containing flight departure and delay information for many major US airports. The file has close to half a million rows, and I wanted to see how quickly Data Lake could parse it and then run simple and complex queries.
The first step is to provision Azure Data Lake Storage to hold the CSV and the results, and Azure Data Lake Analytics to run the batch jobs and generate those results.
Once Data Lake Storage has been provisioned, click the Data Explorer button and upload the test CSV data.
Once the upload is done, go to Azure Data Lake Analytics and click New Job.
We will focus on three jobs here.
- The first job creates a view that reads from the CSV file we uploaded.
- The second job reads all the data from the view and saves it to a new table on Azure Data Lake Storage that we can query.
- The third job runs a test query that goes through half a million rows and produces results in well under a minute.
All we need to do is copy and paste the jobs below one by one and submit them. If the commands are OK, you should see screens similar to those below.
Below is the command to create the view.
DROP VIEW IF EXISTS FlightDataView_2014_1;
CREATE VIEW FlightDataView_2014_1
EXTRACT Year int,
USING Extractors.Text(',', null, null, null,
System.Text.Encoding.UTF8, true, false, 1);
Upon execution, you should see a similar screen.
Once the view over the CSV file is created, we will load this data into a Data Lake catalog table using the command below.
DROP TABLE IF EXISTS FlightData_2014_1;
CREATE TABLE FlightData_2014_1(
INDEX idx_year CLUSTERED (Year)
DISTRIBUTED BY RANGE (Year)
You will note that Azure Data Lake quickly loads the half a million rows from the file into the table.
If you want to see where this data is, click Data Explorer in Data Lake, and then click Catalog. Under Catalog, expand master -> Tables and you should see it there.
Now for the last part: checking the performance of Azure Data Lake. I wrote the query below to read all half a million rows and produce a list of the 30 busiest origin airports, ordered by number of flights. We will save the result of this job to a CSV file, busiestairports.csv.
SELECT Origin, COUNT(*) AS Counted
GROUP BY Origin
ORDER BY Counted DESC
FETCH 30 ROWS;
You will note that Azure Data Lake wraps that up in 46 seconds and creates the file for you too.
Now we can go back to Azure Data Lake Storage and view the results.
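As a quick cross-check, the same aggregation can be sketched locally in Python on a small sample. This is only an illustration and not part of the Data Lake jobs: the function name busiest_airports and the sample rows are made up, and only the Origin and Year column names come from the commands above.

```python
import csv
import io
from collections import Counter


def busiest_airports(rows, limit=30):
    """Count flights per origin airport and return the top `limit` pairs."""
    counts = Counter(row["Origin"] for row in rows)
    # most_common sorts by count, highest first, like ORDER BY Counted DESC.
    return counts.most_common(limit)


# Tiny made-up sample standing in for the real flight file.
sample = io.StringIO(
    "Year,Origin\n"
    "2014,ATL\n"
    "2014,ORD\n"
    "2014,ATL\n"
    "2014,DFW\n"
    "2014,ATL\n"
    "2014,ORD\n"
)

top = busiest_airports(csv.DictReader(sample), limit=2)
print(top)
```

Running the same function over a downloaded copy of busiestairports.csv's source data (by opening the real file instead of the sample) should give counts that match the job output, assuming the file has a header row with an Origin column.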