
Ingesting data into Hive using Spark

February 6, 2018 1 comment

Before heading off for my hols at the end of this week to soak up some sun with my gorgeous princess, I thought I would write another blog post. Two posts in a row – that has not happened for ages 🙂


My previous post was about loading a text file into Hadoop using Hive. Today what I am looking to do is to load the same file into a Hive table but using Spark this time.

I like it when you have options to choose from, so you can use the most appropriate technology that best fits the customer's requirements and needs with respect to technical architecture and design. You obviously need to bear in mind performance, as always!

You would usually consider Spark when the requirements call for fast interactive processing; conversely, for batch processing you would go for Hive or Pig. So going back to Apache Spark: it is a powerful engine for large-scale in-memory data processing, and that is where it fits among the data access tools in the Hadoop ecosystem.

In a nutshell, Spark is fast and runs programs up to 100x faster than MapReduce jobs when processing in memory, or 10x faster on disk. You can use it from the Scala, Python and R shells.

I am not going into much more detail here, just a quick intro to flesh out the concepts so that we all stay on the same page. Spark's data abstractions are:

  • RDD – resilient distributed dataset, e.g. RDDs can be created from HDFS files
  • DataFrame – built on top of RDD or created from Hive tables or external SQL/NoSQL databases. Similar to the RDBMS world, data is organized into columns


Alright, let’s get started. Butterflies in my stomach again.

I will use PySpark – Spark via Python, as you can guess.

[maria_dev@sandbox ~]$ pyspark


1. First import the local raw csv file into a Spark RDD

>>> csv_person = sc.textFile("file:///home/maria_dev/person.csv")

>>> type(csv_person)

By using the type command above, you can quickly double-check that the import into the RDD was successful – you should see something like <class 'pyspark.rdd.RDD'>.


2. Use Spark’s map( ) function to split csv data into a new csv_person RDD

>>> csv_person = p: p.split(","))
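To see what this transformation does outside Spark, here is a plain-Python sketch of the same split applied to a couple of lines; the sample values are made up for illustration:

```python
# Plain-Python equivalent of csv_person.map(lambda p: p.split(","))
# applied to two hypothetical lines of person.csv.
lines = ["1,Maria,F,London", "2,Alexandra,F,London"]

split_records = [p.split(",") for p in lines]

print(split_records[0])  # ['1', 'Maria', 'F', 'London']
```

In Spark the same lambda runs lazily across the cluster, one partition at a time, rather than eagerly on a local list.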


3. Use the toDF() function to put the data from the new RDD into a Spark DataFrame

Notice the use of the map() function to associate each RDD item with a Row object in the DataFrame. Row() captures the mapping of the individual values into named columns of a row to be saved into the new DataFrame.

>>> from pyspark.sql import Row

>>> df_csv = p: Row(PersonId=int(p[0]), FirstName=p[1], Gender=p[2], City=p[3])).toDF()
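Outside Spark, the effect of Row() can be mimicked with a plain namedtuple, just to show how the split string fields end up as named columns (the sample record is hypothetical):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row: gives each CSV field a column name.
Person = namedtuple("Person", ["PersonId", "FirstName", "Gender", "City"])

fields = "2,Alexandra,F,London".split(",")
row = Person(PersonId=int(fields[0]), FirstName=fields[1],
             Gender=fields[2], City=fields[3])

print(row.PersonId, row.FirstName)  # 2 Alexandra
```

Note the int(fields[0]) conversion – just like int(p[0]) above, it is what makes PersonId a numeric column rather than a string.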


4. Verify all 6 rows of data in the df_csv DataFrame with the show() command

>>> df_csv.show()



5. Finally, use saveAsTable() to store the data from the DataFrame into a Hive table in ORC format

>>> from pyspark.sql import HiveContext
>>> hc = HiveContext(sc)
>>> df_csv.write.format("orc").saveAsTable("person_spark")


6. Log in to the Ambari Hive View to check your table



There you go. Happy days 🙂





Categories: Big Data

Loading data in Hadoop with Hive

January 31, 2018 2 comments

It’s been a busy month but 2018 begins with a new post.


A common Big Data scenario is to use Hadoop for transforming data and data ingestion – in other words using Hadoop for ETL.

In this post, I will show an example of how to load a comma-separated values text file into HDFS. Once the file is in HDFS, Apache Hive is used to create a table and load the data into the Hive warehouse – in this case Hive acts as an ETL tool, so to speak. Once the data is loaded, it can be analysed with SQL queries in Hive, and it is then available to be provisioned to any BI tool that supports Hadoop Hive connectors, like Qlik or Tableau. You can even connect Excel to Hadoop by using the Microsoft Power Query for Excel add-in. Exciting, isn't it!?

Well, let’s get started.


1. Move file into HDFS

First, create an input directory in HDFS and copy the file from the local file system.

I have created a sample person.csv file where the first two records are myself and my lovely daughter Alexandra. Who knows, she may teach me some supa dupa technologies one day, though she is more into art and dance for now 🙂
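For reference, with the columns PersonId, FirstName, Gender and City, person.csv would look something like this (the values below are made up for illustration):

```
1,Maria,F,London
2,Alexandra,F,London
3,John,M,Manchester
```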







[maria_dev@sandbox ~]$ hdfs dfs -mkdir /user/maria_dev/maria_test

[maria_dev@sandbox ~]$ hdfs dfs -copyFromLocal /home/maria_dev/person.csv /user/maria_dev/maria_test


For the purposes of this demo I will use the Hortonworks sandbox. If you now log in to Ambari and navigate to the Files View, you should be able to see the file just copied into HDFS.


2. Create an external table

Very quickly here, Hive supports two types of tables:

  • Internal – managed by Hive; if the table is dropped, both the definition and the data are deleted
  • External – as the name suggests, not managed by Hive; if the table is dropped, only the metadata is deleted and the data remains

Run the command below either from the Hive command line or the Hive View in Ambari. I use Ambari.


CREATE EXTERNAL TABLE person_ext (

PersonId INT,

FirstName STRING,

Gender CHAR(1),

City STRING

)

COMMENT 'Person external table'

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE

LOCATION '/user/maria_dev/maria_test';


You can find the external table now in the list of tables.


Verify the import is successful by querying the data with HiveQL, for example:

select * from person_ext;


Before moving to the next step, let me quickly mention the Tez engine here. Tez brings some great performance improvements to Hive queries in comparison with the standard MapReduce execution engine. To get the speed improvements, make sure you have Tez enabled, which is done via the Hive View Settings tab as shown below.


3. Create internal table stored as ORC

ORC (Optimized Row Columnar) is a format that significantly improves Hive performance. There are other formats that can be used, such as sequence or text files, but ORC and Parquet are the most commonly used, mainly because of their performance and compression benefits.

Run the create command below.


CREATE TABLE person (

PersonId INT,

FirstName STRING,

Gender CHAR(1),

City STRING

)

COMMENT 'Person'

STORED AS ORC;

After the command runs successfully, you will see the person table on the right.


So far so good. It is time now to load the actual data.

4. Copy data from the external table into the internal Hive table

Use INSERT SELECT to load the data:

INSERT INTO TABLE person SELECT * FROM person_ext;


Verify data has been successfully loaded.



Fab, good stuff – nice and simple.

Hope you enjoyed this!






Categories: Big Data

Big Data Marathon

November 16, 2017 Leave a comment

This week there is a Big Data event in London, gathering Big Data clients, geeks and vendors from all over to speak on the latest trends, projects, platforms and products. It helps everyone stay on the same page and align the steering wheel, as well as get a feel for where the fast-paced technology world is going. The event is massive, but I am glad I could make it, even if only for one hour, and have some fruitful chats with the exhibitors. The ones of particular interest to me are Cloudera, Hortonworks, Informatica with Big Data Management and ICS, MapR, and Datalytyx. Check here for more information.


Nowadays, with the massive explosion of data in the digital era we all live in, it is not difficult to notice the evolution of Data into Big Data. With the new sources of data, the variety of data has changed: unstructured (text, images, videos), semi-structured (JSON, Facebook graphs) and structured. Big Data drives asking new questions, which require the use of predictive and behavioural models and machine learning. In a nutshell, in terms of physical storage the options are Cloud or on-premises, and in terms of logical storage – RDBMS, Hadoop or NoSQL.


The use of Hadoop may include the use of libraries that layer NoSQL implementations on top of HDFS data. Hadoop is a datastore or a data lake for a lot of customers and, in some ways, a replacement for a relational database. On this point, I would like to stress that RDBMSs like Oracle are still around and are not going to disappear from the Big Data picture, as they are designed to solve a specific set of data problems that Hadoop is not designed to solve. Having said that, bringing in Hadoop as a solution will be an addition to an existing RDBMS, not a replacement of customers' EDW.




Categories: Big Data