Using Windows Azure HDInsight

(Azure HDInsight uses the Hortonworks Data Platform)

  1. Create an Azure storage account (requires an Azure subscription).
    • Using the Azure management portal, PowerShell, or the .NET SDK (a PowerShell sketch follows step 2).
    • Create a new storage account: name and location. Take note of the storage account name and key.
  2. Provision an HDInsight cluster (the cluster page has Dashboard, Monitor, and Configuration tabs).
    • Quick create: cluster name, cluster size (1 or more data nodes), username: admin, password: xxx. Storage and cluster must be in the same location.
    • Linked resources: storage account.
    • Application: /example/jars/hadoop-examples.jar. Input file: davinci.txt
    • Configuration: enable remote desktop (RDP to connect to the head node)
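
A minimal PowerShell sketch of steps 1 and 2 using the classic Azure module cmdlets. The account, container, and cluster names ("mystorage", "mycontainer", "mycluster") are placeholders, and parameter details may vary by module version:

# Step 1: create the storage account and capture its key
New-AzureStorageAccount -StorageAccountName "mystorage" -Location "East US"
$key = (Get-AzureStorageKey -StorageAccountName "mystorage").Primary

# Create the default blob container the cluster will use
$ctx = New-AzureStorageContext -StorageAccountName "mystorage" -StorageAccountKey $key
New-AzureStorageContainer -Name "mycontainer" -Context $ctx

# Step 2: provision the cluster in the same location as the storage account
$cred = Get-Credential    # admin user name and password for the cluster
New-AzureHDInsightCluster -Name "mycluster" -Location "East US" `
    -DefaultStorageAccountName "mystorage.blob.core.windows.net" `
    -DefaultStorageAccountKey $key `
    -DefaultStorageContainerName "mycontainer" `
    -Credential $cred -ClusterSizeInNodes 1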
  3. Submit a MapReduce job
    • Submit the MapReduce job. Options: Hadoop .NET SDK, RESTful APIs, Windows Azure PowerShell
    • Info needed:
      • Cluster name, storage account, blob container
      • MapReduce job jar file, job input folder, job output folder (wasb:///example/data/WordCountOutput/)
    • PowerShell ISE (see the sketch after this step)
      • Connect-to-subscription options: Get-AzurePublishSettingsFile, Import-AzurePublishSettingsFile, Add-AzureAccount (valid for 12 hours), (optional) Select-AzureSubscription
      • Data retrieval options: Power Query, Hive ODBC, Apache Sqoop, .NET SDK, PowerShell
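
A minimal PowerShell ISE sketch of submitting the WordCount job with the classic Azure HDInsight cmdlets. The cluster name, subscription name, and the gutenberg input path are assumptions; adjust them to your own environment:

# Connect the subscription (alternatively Get-/Import-AzurePublishSettingsFile)
Add-AzureAccount
Select-AzureSubscription "MySubscription"   # optional, when you have several subscriptions

# Define the WordCount job: jar, driver class, input file, output folder
$clusterName = "mycluster"                  # placeholder cluster name
$jobDef = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb:///example/jars/hadoop-examples.jar" `
    -ClassName "wordcount" `
    -Arguments "wasb:///example/data/gutenberg/davinci.txt", "wasb:///example/data/WordCountOutput"

# Submit the job, wait for completion, and show the job's standard error log
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardError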
  4. Import the job output into Excel using Power Query
    • Info needed: storage account, storage account key, default container, the path of the output data file
    • Power Query: From Other Sources -> From Windows Azure HDInsight; enter the account name and key. Navigate to the container and the output file, click Binary, rename the column headers, and click 'Apply and Close' to start the import (a PowerShell alternative is sketched below).
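
As noted in the data retrieval options above, the output can also be pulled down with PowerShell instead of Power Query. A minimal sketch, assuming the storage names from the earlier sketches and the conventional part-r-00000 reducer output file name:

# Download the WordCount output blob to a local file
$ctx = New-AzureStorageContext -StorageAccountName "mystorage" -StorageAccountKey $key
Get-AzureStorageBlobContent -Container "mycontainer" `
    -Blob "example/data/WordCountOutput/part-r-00000" `
    -Context $ctx -Destination "C:\WordCountOutput.txt" -Force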

from http://azure.microsoft.com/en-us/

Informatica, SSIS, or Talend? Case Study: Comparing Three ETL Tools. Part 2: Informatica PowerCenter

Task: Load a flat file into an MSSQL table and split a malformed fullName into firstName, middleName, and lastName. Like many BI requirements, the task looks deceptively simple, but the implementation is quite tricky due to multiple patterns: various leading spaces, trailing spaces, extra spaces between names, and some names with no middle name.

Source: malformed full name field in CSV format

[Image: malformed full-name sample]

Target: Normalized firstName, middleName, and lastName in three columns of MS SQL table

[Image: normalized firstName, middleName, and lastName columns]

Study focus: availability of string functions, tool usability
Tools: SQL Server 2012 SSIS, Informatica PowerCenter Express 9.6.1, Talend

Rationale for tool selection
SSIS is widely available as part of Microsoft SQL Server.
Informatica PowerCenter is the most recognized ETL tool.
Talend is the most recognized open source ETL tool.

Source: Malformed employee.csv file from PowerCenter
Target: Normalized columns in employee table of MSSQL 2012

Procedure


1. Start Informatica Services and wait until the processes have started; this takes around 10 minutes. Use netstat to check that the processes listening on ports 7006 and 9092 are up.
2. Launch Informatica Developer
3. Connect to 'ModelRepository' by providing the password for the default user name 'Administrator'.
4. Right click 'ModelRepository', click 'New' and then 'Project'. Name the project 'flatFile2Mssql_splitName'.
5. Right click 'flatFile2Mssql_splitName' under ModelRepository, click 'New'/'Data Object'/'Physical Data Objects'/'Flat File Data Object'.
Follow the wizard to browse to employee.csv under C:\Informatica\PCExpress\tomcat\bin\source. Name the object 'src_employee'.
Code page: MS Windows Latin 1 (ANSI), superset of Latin 1.
Check 'Import column names from first line'.
6. Right click 'flatFile2Mssql_splitName' under ModelRepository, click 'New'/'Transformation'/'Expression'. Name it 'exp_splitFullname'.
7. Right click 'flatFile2Mssql_splitName' under ModelRepository, click 'New'/'Data Objects'/'Physical Data Objects'/'Relational Data Objects'.
Browse to the right MSSQL connection and check 'Create data object from existing resource'. When selecting a resource, uncheck 'Show selected' and 'Show Default Schema Only'.
Then check the table [dbo].[test_employee]. Name the object 'tgt_test_employee'.

Use database authentication for simplicity.
Metadata access:
Connection String: jdbc:informatica:sqlserver://localhost:1433;SelectMethod=cursor;databaseName=TestData
Data access (make sure this setting is correct):
Connection String: localhost@TestData
Use Window/Preferences/Connections to double-check the connection settings.
8. Right click 'flatFile2Mssql_splitName' under ModelRepository, click 'New'/'Mapping'. Name the mapping 'm_splitNameMapping'.
9. Double click 'm_splitNameMapping' under Object Explorer/ModelRepository. The mapping canvas should be displayed.
10. Drag 'src_employee' from Object Explorer to the canvas. Add to Mapping as 'Read', 'As independent data object'.
11. Drag 'tgt_test_employee' from Object Explorer to the canvas. Add to Mapping as 'Write'.
12. Drag 'exp_splitFullname' from Object Explorer to the canvas.
13. Click 'exp_splitFullname' on the canvas. In the Properties view, click the General tab, and then click the transformation link 'exp_splitFullname'.
Under 'Ports', highlight 'Input Only', then click the Insert/New icon on the far right. Change the name to 'name' and the length to '50'.

Highlight 'Variable' and then click the 'Add' icon five times and define the variables with the corresponding expressions:
v_nameRemSpace:  REG_REPLACE(name, '\s+', ' ')
v_nameRemSpace1: IIF(REG_MATCH(v_nameRemSpace,'\s.*')=1, REG_REPLACE(v_nameRemSpace, '^\s',''), v_nameRemSpace)
v_firstName: REG_EXTRACT(v_nameRemSpace1, '(\w+)\s*(\w+)\s*(\w*)',1)
v_middleName: REG_EXTRACT(v_nameRemSpace1, '(\w+)\s*(\w+)\s*(\w*)',2)
v_lastName: REG_EXTRACT(v_nameRemSpace1, '(\w+)\s*(\w+)\s*(\w*)',3)

Highlight 'Output Only', then click the 'Add' icon three times and define the output ports with the corresponding expressions:
firstName:  v_firstName
middleName:  IIF(LTRIM(RTRIM(v_lastName))='','',v_middleName)
lastName: IIF(LTRIM(RTRIM(v_lastName))='',v_middleName,v_lastName) 

14. Save the mapping and resolve any issues. Make sure the fields are mapped correctly.
15. Validate each component. Right click each component and then click 'Run Data Viewer' to check whether the transformation is performed as expected.
16. Right click on the canvas background and click 'Run Mapping' to load the data into the destination.

[Image: Informatica expression ports]

[Image: Informatica mapping]

Comments:
Pros
1. The PowerCenter GUI is based on Eclipse. Its interface is simple and clear.
2. It has its own repository (ModelRepository) for metadata, which is important for team collaboration.
3. Its string functions, such as SUBSTR and INSTR, closely match the Oracle SQL functions. However, it lacks equivalents of SSIS's TOKEN and TOKENCOUNT.
4. It supports regular expressions, while SSIS does not support them natively.
5. The 'Run Data Viewer' on each mapping component is very helpful when debugging a mapping.
6. You can map pass-through data directly from source to target, while SSIS does this implicitly.
7. Its web-based Administrator interface is useful for monitoring processes and scheduling jobs.

Cons
1. It needs more resources to run both the server and the client, and starting the processes takes many minutes.
2. Its regular expression functions are tricky and hard to debug.

Informatica, SSIS, or Talend? Case Study: Comparing Three ETL Tools. Part 1: SSIS

Task: Load a flat file into an MSSQL table and split a malformed full name into firstName, middleName, and lastName. Like many BI requirements, the task looks deceptively simple, but the implementation is quite tricky due to multiple patterns: various leading spaces, trailing spaces, extra spaces between names, and some names with no middle name.

[Image: employee full-name preview]
Study Focus:
String functions and tool usability
Tools: SQL Server 2012 SSIS, Informatica PowerCenter Express 9.6.1, Talend
Rationale for tool selection:
SSIS is widely available as part of Microsoft SQL Server.
Informatica PowerCenter is the most recognized ETL tool.
Talend is the most recognized open source ETL tool.

Source: Employee.csv with malformed name
Target: MSSQL 2012 relational table.

Procedure


1. Start SQL Server Data Tools (2012)
2. Create a new project: choose the template Business Intelligence/Integration Services/Integration Services Project. Name the project 'flatFile2Mssql_splitFullName'.
3. A package Package.dtsx is created
4. Drag and drop a Data Flow Task from the SSIS Toolbox to the Control Flow tab of the design area.
5. Double click the Data Flow Task and the tab 'Data Flow' should be activated
6. Drag and drop 'Flat File Source' under 'Other Sources' from Toolbox to 'Data Flow'
7. Drag and drop 'Derived Column' under 'Common' from the Toolbox to 'Data Flow'.
8. Drag and drop 'OLE DB Destination' under 'Other Destinations' from the Toolbox to 'Data Flow'.
9. Right click 'Flat File Source' and select 'Edit', then click 'New' to define the Flat File Connection: browse to the Employee.csv file.
10. Right click 'Flat File Source' and select 'Show Advanced Editor'. Click 'Input and Output Properties'. Make sure the data types match the MSSQL table definition (Unicode string DT_WSTR, nvarchar).
11. Right click 'OLE DB Destination' and select 'Edit'. Make sure the target table [dbo].[test_employee] is identified.
12. Connect 'Flat File Source' to 'Derived Column' to 'OLE DB Destination'.
13. Right click 'Derived Column' and select 'Edit'. Add three new columns with the following expressions:

firstName   TOKEN(TRIM(name),' ',1)
middleName  TOKENCOUNT(TRIM(name),' ') == 3 ? TOKEN(TRIM(name),' ',2) : ''
lastName    TOKENCOUNT(TRIM(name),' ') == 2 ? TOKEN(TRIM(name),' ',2) : TOKEN(TRIM(name),' ',3)

14. Right click 'OLE DB Destination' and make sure the fields are mapped correctly.
15. Right click on the Data Flow canvas and select 'Execute Task'.
16. Verify that firstName, middleName, and lastName are properly loaded.

Comments:

Pros: SQL Server Data Tools is based on Microsoft Visual Studio. The interface is straightforward and simple to follow.
Cons: SSIS expressions lack some key string capabilities, such as regular expressions. However, with the ?: conditional operator and the combination of TOKEN and TOKENCOUNT, the transformation from the malformed full name to well-formatted names can still be accomplished.
There are many other ways to implement this transformation, but I found the combination of TOKEN and TOKENCOUNT the simplest.
[Image: SSIS data flow with string functions]

[Image: SSIS Derived Column splitting the full name]

WhereScape RED

Data Warehousing and Business Intelligence

About 4 years ago I had an idea (well, more of a dream really) of “building a data warehouse at the press of a button”. This means that everything is automated, including the development of the reports/cubes and the development of the ETL. I understand that we can build SSIS, SSRS, and SSAS programmatically (and Informatica PowerCenter mappings too). I also understand that it is a painfully slow and tedious process. But we can make the process easier by creating a few templates and then, at run time, choosing the nearest matching template and modifying it.

This works for ETL, reports, and cubes. But how about tables? Corresponding dim and fact tables could be generated on the fly by analysing the source tables based on their data types: numerical columns become measures and textual columns become attributes. Yet, I admit it is a difficult process that definitely requires human involvement. Still…


Houston, we’ve got a problem! Impala reveals no transactions for an advertised item

Tools

Cloudera CDH
Impala
Hive
HDFS
Hue

Procedure

1. Upload the web log data to HDFS (or use Flume to feed live clickstream data).

sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_files/access.log.2 /user/hive/warehouse/original_access_logs
hadoop fs -ls /user/hive/warehouse/original_access_logs

Here are some entries in the log:

79.133.215.123 - - [14/Jun/2014:10:30:13 -0400] "GET /home HTTP/1.1" 200 1671 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
162.235.161.200 - - [14/Jun/2014:10:30:13 -0400] "GET /department/apparel/category/featured%20shops/product/adidas%20Kids'%20RG%20III%20Mid%20Football%20Cleat HTTP/1.1" 200 1175 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.76.4 (KHTML, like Gecko) Version/7.0.4 Safari/537.76.4"
39.244.91.133 - - [14/Jun/2014:10:30:14 -0400] "GET /department/fitness HTTP/1.1" 200 1435 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
150.47.54.136 - - [14/Jun/2014:10:30:14 -0400] "GET /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%20Kayak/add_to_cart HTTP/1.1" 200 1932 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
217.89.36.129 - - [14/Jun/2014:10:30:14 -0400] "GET /view_cart HTTP/1.1" 200 1401 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0"
36.44.59.115 - - [14/Jun/2014:10:30:15 -0400] "GET /department/footwear/category/cardio%20equipment HTTP/1.1" 200 386 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0"
11.252.83.179 - - [14/Jun/2014:10:30:15 -0400] "GET /view_cart HTTP/1.1" 200 1726 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
56.251.19.230 - - [14/Jun/2014:10:30:15 -0400] "GET /department/footwear/category/fitness%20accessories HTTP/1.1" 200 2076 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
137.95.229.186 - - [14/Jun/2014:10:30:16 -0400] "GET /department/fan%20shop/category/fishing/product/Field%20&%20Stream%20Sportsman%2016%20Gun%20Fire%20Safe HTTP/1.1" 200 1413 "-" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
100.98.159.99 - - [14/Jun/2014:10:30:16 -0400] "GET /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%20Kayak HTTP/1.1" 200 396 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

2. Create an intermediate table: use Hive’s SerDes (serializers / deserializers) to parse the logs into individual fields using a regular expression.

hive>CREATE EXTERNAL TABLE intermediate_access_logs (
ip STRING,
date STRING,
method STRING,
url STRING,
http_version STRING,
code1 STRING,
code2 STRING,
dash STRING,
user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) - - \\[([^\\]]*)\\] \"([^\ ]*) ([^\ ]*) ([^\ ]*)\" (\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
LOCATION '/user/hive/warehouse/original_access_logs';

3. Create a table to hold the processed result

hive>CREATE EXTERNAL TABLE tokenized_access_logs (
ip STRING,
date STRING,
method STRING,
url STRING,
http_version STRING,
code1 STRING,
code2 STRING,
dash STRING,
user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

4. Populate the table using data from the intermediate table

hive>INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419929171445_0014, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1419929171445_0014/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1419929171445_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-30 18:25:05,638 Stage-1 map = 0%,  reduce = 0%
2014-12-30 18:25:16,897 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.64 sec
MapReduce Total cumulative CPU time: 3 seconds 640 msec
Ended Job = job_1419929171445_0014
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/tokenized_access_logs/.hive-staging_hive_2014-12-30_18-24-54_818_2944563302602072732-1/-ext-10000
Loading data to table default.tokenized_access_logs
Table default.tokenized_access_logs stats: [numFiles=1, numRows=180000, totalSize=37433868, rawDataSize=37253868]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.64 sec   HDFS Read: 39594107 HDFS Write: 37433963 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 640 msec
OK
Time taken: 24.707 seconds


hive>select * from intermediate_access_logs limit 3;
OK
79.133.215.123	14/Jun/2014:10:30:13 -0400	GET	/home	HTTP/1.1	200	1671	-	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36
162.235.161.200	14/Jun/2014:10:30:13 -0400	GET	/department/apparel/category/featured%20shops/product/adidas%20Kids'%20RG%20III%20Mid%20Football%20Cleat	HTTP/1.1	200	1175	-	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.76.4 (KHTML, like Gecko) Version/7.0.4 Safari/537.76.4
39.244.91.133	14/Jun/2014:10:30:14 -0400	GET	/department/fitness	HTTP/1.1	200	1435	-Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36
Time taken: 0.536 seconds, Fetched: 3 row(s)

hive> select * from tokenized_access_logs limit 3;
OK
79.133.215.123	14/Jun/2014:10:30:13 -0400	GET	/home	HTTP/1.1	200	1671	-	Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML
162.235.161.200	14/Jun/2014:10:30:13 -0400	GET	/department/apparel/category/featured%20shops/product/adidas%20Kids'%20RG%20III%20Mid%20Football%20Cleat	HTTP/1.1	200	1175	-	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.76.4 (KHTML
39.244.91.133	14/Jun/2014:10:30:14 -0400	GET	/department/fitness	HTTP/1.1	200	1435	-Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML
Time taken: 0.079 seconds, Fetched: 3 row(s)

5. Check Impala version

 
> version();
Shell version: Impala Shell v2.1.0-cdh5 (e48c2b4) built on Tue Dec 16 19:00:35 PST 2014
Server version: impala version 2.1.0-cdh5 RELEASE (build e48c2b48c53ea9601b8f47a39373aa83ff7ca6e2)
 
> show tables;
Query: show tables
+-------------+
| name        |
+-------------+
| categories  |
| customers   |
| departments |
| order_items |
| orders      |
| products    |
+-------------+

6. Flush Impala's metadata cache so the new Hive tables become visible


> invalidate metadata;
Query: invalidate metadata

Fetched 0 row(s) in 0.91s 
> show tables;
Query: show tables
+--------------------------+
| name                     |
+--------------------------+
| categories               |
| customers                |
| departments              |
| intermediate_access_logs |
| order_items              |
| orders                   |
| products                 |
| tokenized_access_logs    |
+--------------------------+
Fetched 8 row(s) in 0.01s

7. Find the top 10 products with the most clicks


Query: select count(*),url from tokenized_access_logs
where url like '%\/product\/%'
group by url order by count(*) desc
limit 10
+----------+----------------------------------------------------------------------------------------------------------------------+
| count(*) | url                                                                                                                  |
+----------+----------------------------------------------------------------------------------------------------------------------+
| 1926     | /department/apparel/category/cleats/product/Perfect%20Fitness%20Perfect%20Rip%20Deck                                 |
| 1793     | /department/apparel/category/featured%20shops/product/adidas%20Kids'%20RG%20III%20Mid%20Football%20Cleat             |
| 1780     | /department/golf/category/women's%20apparel/product/Nike%20Men's%20Dri-FIT%20Victory%20Golf%20Polo                   |
| 1757     | /department/apparel/category/men's%20footwear/product/Nike%20Men's%20CJ%20Elite%202%20TD%20Football%20Cleat          |
| 1104     | /department/fan%20shop/category/water%20sports/product/Pelican%20Sunstream%20100%20Kayak                             |
| 1084     | /department/fan%20shop/category/indoor/outdoor%20games/product/O'Brien%20Men's%20Neoprene%20Life%20Vest              |
| 1059     | /department/fan%20shop/category/camping%20&%20hiking/product/Diamondback%20Women's%20Serene%20Classic%20Comfort%20Bi |
| 1028     | /department/fan%20shop/category/fishing/product/Field%20&%20Stream%20Sportsman%2016%20Gun%20Fire%20Safe              |
| 1004     | /department/footwear/category/cardio%20equipment/product/Nike%20Men's%20Free%205.0+%20Running%20Shoe                 |
| 939      | /department/footwear/category/fitness%20accessories/product/Under%20Armour%20Hustle%20Storm%20Medium%20Duffle%20Bag  |
+----------+----------------------------------------------------------------------------------------------------------------------+

8. Find the 10 products with the largest order quantity

Query: select p.product_name, sum(o.order_item_quantity)
from products p join order_items o
on (p.product_id = o.order_item_product_id)
group by p.product_name
order by sum(o.order_item_quantity) desc
limit 10
+-----------------------------------------------+----------------------------+
| product_name                                  | sum(o.order_item_quantity) |
+-----------------------------------------------+----------------------------+
| Perfect Fitness Perfect Rip Deck              | 73698                      |
| Nike Men's Dri-FIT Victory Golf Polo          | 62956                      |
| O'Brien Men's Neoprene Life Vest              | 57803                      |
| Nike Men's Free 5.0+ Running Shoe             | 36680                      |
| Under Armour Girls' Toddler Spine Surge Runni | 31735                      |
| Nike Men's CJ Elite 2 TD Football Cleat       | 22246                      |
| Field & Stream Sportsman 16 Gun Fire Safe     | 17325                      |
| Pelican Sunstream 100 Kayak                   | 15500                      |
| Diamondback Women's Serene Classic Comfort Bi | 13729                      |
| ENO Atlas Hammock Straps                      | 998                        |
+-----------------------------------------------+----------------------------+
Fetched 10 row(s) in 1.17s

9. More digging…

Comparing the top clicks with the top order quantities, you will notice that the ‘adidas Kids’ RG III Mid Football Cleat’ has the second-largest click count, but it is not on the top-seller list. Further querying indicates that the item did not generate any transactions at all.

Houston, we’ve got a problem!

Query: select p.product_name, sum(o.order_item_quantity)
from products p join order_items o
on (p.product_id = o.order_item_product_id)
where p.product_name like 'adidas%'
group by p.product_name
order by sum(o.order_item_quantity) desc
+-----------------------------------------------+----------------------------+
| product_name                                  | sum(o.order_item_quantity) |
+-----------------------------------------------+----------------------------+
| adidas Youth Germany Black/Red Away Match Soc | 969                        |
| adidas Men's F10 Messi TRX FG Soccer Cleat    | 939                        |
| adidas Men's Germany Black Crest Away Tee     | 859                        |
| adidas Kids' F5 Messi FG Soccer Cleat         | 781                        |
| adidas Brazuca 2014 Official Match Ball       | 65                         |
+-----------------------------------------------+----------------------------+
Fetched 5 row(s) in 1.20s

Adapted from Cloudera

From RDBMS to HDFS: Convert MySQL data tables to HDFS Avro files

Tools

Cloudera CDH 5.3
Sqoop: Load data from MySQL to HDFS while preserving the structure
Avro: Hadoop-optimized file format
Hive: Use MapReduce to query the data on HDFS

Procedure

1. Log in to the Master Node of the cluster using SSH

2. Transform the relational tables from MySQL into Apache Avro format files in HDFS, and create the Avro schemas


sqoop import-all-tables \
-m 12 \
--connect jdbc:mysql://{$MASTER_NODE_NAME}:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--compression-codec=snappy \
--as-avrodatafile \
--warehouse-dir=/user/hive/warehouse

3. Verify the data files (.avro) are created in Avro format on HDFS


[cloudera@quickstart Desktop]$ hadoop fs -ls /user/hive/warehouse
Found 6 items
drwxr-xr-x   - cloudera hive          0 2014-12-30 10:41 /user/hive/warehouse/categories
drwxr-xr-x   - cloudera hive          0 2014-12-30 10:30 /user/hive/warehouse/customers
drwxr-xr-x   - cloudera hive          0 2014-12-30 00:53 /user/hive/warehouse/departments
drwxr-xr-x   - cloudera hive          0 2014-12-30 00:54 /user/hive/warehouse/order_items
drwxr-xr-x   - cloudera hive          0 2014-12-30 00:54 /user/hive/warehouse/orders
drwxr-xr-x   - cloudera hive          0 2014-12-30 00:55 /user/hive/warehouse/products

[cloudera@quickstart Desktop]$ hadoop fs -ls /user/hive/warehouse/categories/
-rw-r--r--   1 cloudera hive          0 2014-12-30 00:53 /user/hive/warehouse/categories/_SUCCESS
-rw-r--r--   1 cloudera hive       1344 2014-12-30 00:53 /user/hive/warehouse/categories/part-m-00000.avro

4. Verify the Avro schema files (.avsc) are created locally. Notice that the data files are kept separate from the schema files.


[cloudera@quickstart Desktop]$ ls -lat *.avsc
-rw-rw-r-- 1 cloudera cloudera  922 Dec 30 00:54 sqoop_import_products.avsc
-rw-rw-r-- 1 cloudera cloudera  632 Dec 30 00:54 sqoop_import_orders.avsc
-rw-rw-r-- 1 cloudera cloudera  980 Dec 30 00:54 sqoop_import_order_items.avsc
-rw-rw-r-- 1 cloudera cloudera  409 Dec 30 00:53 sqoop_import_departments.avsc
-rw-rw-r-- 1 cloudera cloudera 1324 Dec 30 00:53 sqoop_import_customers.avsc
-rw-rw-r-- 1 cloudera cloudera  541 Dec 30 00:52 sqoop_import_categories.avsc

The Avro schema files are in JSON format:


{
  "type" : "record",
  "name" : "sqoop_import_customers",
  "doc" : "Sqoop import of customers",
  "fields" : [ {
    "name" : "customer_id",
    "type" : [ "int", "null" ],
    "columnName" : "customer_id",
    "sqlType" : "4"
  }, {
    "name" : "customer_fname",
    "type" : [ "string", "null" ],
    "columnName" : "customer_fname",
    "sqlType" : "12"
  }, {
    "name" : "customer_lname",
    "type" : [ "string", "null" ],
    "columnName" : "customer_lname",
    "sqlType" : "12"
  }, {
    "name" : "customer_email",
    "type" : [ "string", "null" ],
    "columnName" : "customer_email",
    "sqlType" : "12"
  }, {
    "name" : "customer_password",
    "type" : [ "string", "null" ],
    "columnName" : "customer_password",
    "sqlType" : "12"
  }, {
    "name" : "customer_street",
    "type" : [ "string", "null" ],
    "columnName" : "customer_street",
    "sqlType" : "12"
  }, {
    "name" : "customer_city",
    "type" : [ "string", "null" ],
    "columnName" : "customer_city",
    "sqlType" : "12"
  }, {
    "name" : "customer_state",
    "type" : [ "string", "null" ],
    "columnName" : "customer_state",
    "sqlType" : "12"
  }, {
    "name" : "customer_zipcode",
    "type" : [ "string", "null" ],
    "columnName" : "customer_zipcode",
    "sqlType" : "12"
  } ],
  "tableName" : "customers"
}

5. Copy schema files to HDFS so Hive and Impala can access them


sudo -u hdfs hadoop fs -mkdir /user/examples
sudo -u hdfs hadoop fs -chmod +rw /user/examples
hadoop fs -copyFromLocal *.avsc /user/examples/
hadoop fs -ls /user/examples/

[cloudera@quickstart Desktop]$ hadoop fs -ls /user/examples/
Found 6 items
-rw-r--r--   1 cloudera supergroup        541 2014-12-30 09:38 /user/examples/sqoop_import_categories.avsc
-rw-r--r--   1 cloudera supergroup       1324 2014-12-30 09:38 /user/examples/sqoop_import_customers.avsc
-rw-r--r--   1 cloudera supergroup        409 2014-12-30 09:38 /user/examples/sqoop_import_departments.avsc
-rw-r--r--   1 cloudera supergroup        980 2014-12-30 09:38 /user/examples/sqoop_import_order_items.avsc
-rw-r--r--   1 cloudera supergroup        632 2014-12-30 09:38 /user/examples/sqoop_import_orders.avsc
-rw-r--r--   1 cloudera supergroup        922 2014-12-30 09:38 /user/examples/sqoop_import_products.avsc

6. Create tables using the Hive CLI or the Hue Hive GUI. The metadata for these tables is used by both Hive and Impala; however, the metadata needs to be refreshed (invalidate metadata) before Impala can access the tables.


hive>
CREATE EXTERNAL TABLE categories
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///user/hive/warehouse/categories'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/examples/sqoop_import_categories.avsc');

CREATE EXTERNAL TABLE customers
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///user/hive/warehouse/customers'
TBLPROPERTIES ('avro.schema.url'='hdfs://{$MASTER_NODE_NAME}/user/examples/sqoop_import_customers.avsc');

CREATE EXTERNAL TABLE departments
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///user/hive/warehouse/departments'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/examples/sqoop_import_departments.avsc');

CREATE EXTERNAL TABLE orders
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///user/hive/warehouse/orders'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/examples/sqoop_import_orders.avsc');

CREATE EXTERNAL TABLE order_items
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///user/hive/warehouse/order_items'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/examples/sqoop_import_order_items.avsc');

CREATE EXTERNAL TABLE products
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///user/hive/warehouse/products'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/examples/sqoop_import_products.avsc');

hive>show tables;
OK
categories
customers
departments
order_items
orders
products
Time taken: 0.015 seconds, Fetched: 6 row(s)

hive>desc categories;
OK
category_id int from deserializer
category_department_id int from deserializer
category_name string from deserializer
Time taken: 0.244 seconds, Fetched: 3 row(s)

7. Query the data using Hive

There is no MapReduce job involved. The response is very quick.

hive>select * from categories;
OK
1	2	Football
2	2	Soccer
3	2	Baseball & Softball
4	2	Basketball
5	2	Lacrosse
...

For any query involving aggregation, filtering, or sorting, MapReduce jobs are launched. The response is slow even for a very small table/file.

hive>select count(*) from categories;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1419929171445_0007, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1419929171445_0007/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1419929171445_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-12-30 10:34:11,134 Stage-1 map = 0%,  reduce = 0%
2014-12-30 10:34:20,099 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.89 sec
2014-12-30 10:34:30,772 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.14 sec
MapReduce Total cumulative CPU time: 3 seconds 140 msec
Ended Job = job_1419929171445_0007
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.14 sec   HDFS Read: 5104 HDFS Write: 3 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 140 msec
OK
58
Time taken: 31.67 seconds, Fetched: 1 row(s)

hive>select * from categories where category_id <10;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419929171445_0009, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1419929171445_0009/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1419929171445_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-30 10:38:02,492 Stage-1 map = 0%,  reduce = 0%
2014-12-30 10:38:11,271 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.31 sec
MapReduce Total cumulative CPU time: 2 seconds 310 msec
Ended Job = job_1419929171445_0009
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.31 sec   HDFS Read: 5104 HDFS Write: 145 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 310 msec
OK
1	2	Football
2	2	Soccer
3	2	Baseball & Softball
4	2	Basketball
5	2	Lacrosse
6	2	Tennis & Racquet
7	2	Hockey
8	2	More Sports
9	3	Cardio Equipment
Time taken: 20.234 seconds, Fetched: 9 row(s)

hive>select category_name from categories;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419929171445_0011, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1419929171445_0011/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1419929171445_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-30 10:41:44,418 Stage-1 map = 0%,  reduce = 0%
2014-12-30 10:41:53,186 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.05 sec
MapReduce Total cumulative CPU time: 2 seconds 50 msec
Ended Job = job_1419929171445_0011
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.05 sec   HDFS Read: 5104 HDFS Write: 748 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 50 msec
OK
Football
Soccer
Baseball & Softball
Basketball
Lacrosse
...



Pig, Hive, or Impala?

These are just a few of the jungle animals in the Hadoop ecosystem. Hadoop was originally designed for heavy-duty distributed batch processing, so latency was not a major concern when it was used for ETL. As more people attempt to use Hadoop as a data platform, query performance is becoming the big challenge.

Pig was created at Yahoo to allow non-Java developers to perform ETL jobs. Pig is a high-level abstraction over the underlying MapReduce jobs, which are otherwise coded mainly in Java or Scala. The main purpose of Pig scripting is data transformation. You can use Pig to achieve the same results as Hive, but it takes much more effort.

Hive was developed at Facebook. Its SQL-like interface provides an even higher level of abstraction over the MapReduce jobs. You can pass a query like the one below to the Hive editor, but it will take quite a while to get the result set. If you are like me, you may be excited to learn that you can use SQL to get results back from Hadoop. But waiting more than 5 seconds for a simple query like this? There should be a better way.

The better way is Impala, a real SQL engine designed from the ground up by Cloudera without any dependency on MapReduce! According to Mike Olson of Cloudera, Impala is installed on every Hadoop node and lives right alongside the MapReduce engine, the Apache HBase engine, and the Search engine. The result? Performance is dramatically improved. So if you are looking for real-time responses, Impala is the way to go!

SELECT description, salary
FROM sample_07
WHERE salary > 100000
ORDER BY salary DESC
LIMIT 10

Looking at the log (from Cloudera), you will find that many steps are involved to execute even this simple query: semantic analysis -> metadata -> plan generation -> query driver retrieving the Hive schema -> query driver launching the job -> setting up parameters -> submitting the job -> starting the job.

Hive logs:

14/12/29 21:15:13 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
14/12/29 21:15:13 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic Analysis
14/12/29 21:15:13 INFO parse.SemanticAnalyzer: Get metadata for source tables
14/12/29 21:15:13 INFO parse.SemanticAnalyzer: Get metadata for subqueries
14/12/29 21:15:13 INFO parse.SemanticAnalyzer: Get metadata for destination tables
14/12/29 21:15:13 INFO parse.SemanticAnalyzer: Completed getting MetaData in Semantic Analysis
14/12/29 21:15:13 INFO parse.SemanticAnalyzer: Set stats collection dir : hdfs://quickstart.cloudera:8020/user/hive/warehouse/sample_07/.hive-staging_hive_2014-12-29_21-15-13_402_2743441118894875056-4/-ext-10002
14/12/29 21:15:13 INFO parse.SemanticAnalyzer: Completed plan generation
14/12/29 21:15:13 INFO ql.Driver: Semantic Analysis Completed
14/12/29 21:15:13 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:sample_07.description, type:string, comment:null), FieldSchema(name:sample_07.salary, type:int, comment:null)], properties:null)
14/12/29 21:15:13 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager
14/12/29 21:15:13 INFO ql.Driver: Starting command: SELECT sample_07.description, sample_07.salary
FROM
  sample_07
WHERE
( sample_07.salary > 100000)
ORDER BY sample_07.salary DESC
LIMIT 1000
14/12/29 21:15:13 INFO ql.Driver: Total jobs = 1
14/12/29 21:15:13 INFO ql.Driver: Launching Job 1 out of 1
14/12/29 21:15:13 INFO exec.Task: Number of reduce tasks determined at compile time: 1
14/12/29 21:15:13 INFO exec.Task: In order to change the average load for a reducer (in bytes):
14/12/29 21:15:13 INFO exec.Task:   set hive.exec.reducers.bytes.per.reducer=<number>
14/12/29 21:15:13 INFO exec.Task: In order to limit the maximum number of reducers:
14/12/29 21:15:13 INFO exec.Task:   set hive.exec.reducers.max=<number>
14/12/29 21:15:13 INFO exec.Task: In order to set a constant number of reducers:
14/12/29 21:15:13 INFO exec.Task:   set mapreduce.job.reduces=<number>
14/12/29 21:15:13 INFO mr.ExecDriver: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
14/12/29 21:15:13 INFO mr.ExecDriver: adding libjars: file:///usr/lib/hive/lib/hive-contrib.jar
14/12/29 21:15:13 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/12/29 21:15:13 INFO exec.Task: Starting Job = job_1419908967835_0010, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1419908967835_0010/