Archive

Posts Tagged ‘ETL’

Loading data in Hadoop with Hive

January 31, 2018 Leave a comment

It’s been a busy month but 2018 begins with a new post.

 

A common Big Data scenario is to use Hadoop for transforming data and data ingestion – in other words using Hadoop for ETL.

In this post, I will show an example of how to load a comma separated values text file into HDFS. Once the file is moved in HDFS, use Apache Hive to create a table and load the data into a Hive warehouse. In this case Hive is used as an ETL tool so to speak. Once the data is loaded, it can be analysed with SQL queries in Hive. Then data is available to be provisioned to any BI tool that supports Hadoop Hive connectors like Qlik or Tableau. You can even connect Excel to Hadoop by using Microsoft Power Query for Excel add-in. Exciting, isn’t it !?!

Well, let’s get started.

 

1. Move file into HDFS

First, create an input directory in HDFS and copy the file from the local file system.

I have created a sample person.csv file where the first two records are myself and my lovely daughter Alexandra. Who knows she may teach me some supa dupa technologies one day, though she is more into art and dance for now 🙂

10,Maria,F,London

20,Alexandra,F,London

30,Kate,F,Birmingham

40,Matt,M,Kent

50,John,M,London

1_1

[maria_dev@sandbox ~]$ hdfs dfs -mkdir /user/maria_dev/maria_test

[maria_dev@sandbox ~]$ hdfs dfs -copyFromLocal /home/maria_dev/person.csv /user/maria_dev/maria_test

1

For the purposes of this demo I will use Hortonworks sandbox. If you now log in Ambari and navigate to Files View, you should be able to see the file just copied into HDFS.

2

2. Create an external table

Very quickly here, Hive supports two types of tables:

  • Internal – it is managed by Hive and if deleted both the definition and data will be deleted
  • External – as the name suggests, it is not managed by Hive and if deleted only the metadata is deleted but the data remains

Run the command below either from Hive command line or Hive View in Ambari. I use Ambari.

CREATE EXTERNAL TABLE IF NOT EXISTS PERSON_EXT(

PersonId INT,

FirstName STRING,

Gender CHAR(1),

City STRING)

COMMENT ‘Person external table’

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ‘,’

STORED AS TEXTFILE

location ‘/user/maria_dev/maria_test’;

3

You can find the external table now in the list of tables.

4

Verify the import is successful by querying the data with HSQL.

5

Before moving to the next step, let me quickly mention Tez engine here. Tez has some great improvements in terms of performance of the Hive queries in comparison with the standard MapReduce execution engine. To get speed improvements, make sure you have Tez enabled which is done via the Hive View Settings tab as show below.

6

3. Create internal table stored as ORC

ORC is an optimized row columnar format that significantly improves Hive performance. There are other formats that can be used such as sequence or text file but ORC and Parquet are most commonly used mainly because of the performance and compression benefits.

Run the create command below.

CREATE TABLE IF NOT EXISTS PERSON(

PersonId INT,

FirstName STRING,

Gender CHAR(1),

City STRING)

COMMENT ‘Person’

STORED AS ORC;

After successfully ran the command you will see person table on the right.

7

So far so good. It is time now to load the actual data.

4. Copy data from the external table into the internal Hive table

Use INSERT SELECT to load the data.

8

Verify data has been successfully loaded.

9

 

Fab, good stuff – nice and simple.

Hope you enjoyed this!

 

Cheers,

Maria

 

 

Advertisements
Categories: Big Data Tags: , , , ,
%d bloggers like this: