
Ingesting data into Hive using Spark

February 6, 2018

Before heading off for my hols at the end of this week to soak up some sun with my gorgeous princess, I thought I would write another blog post. Two posts in a row – that has not happened for ages 🙂

 

My previous post was about loading a text file into Hadoop using Hive. Today I am looking to load the same file into a Hive table, but using Spark this time.

I like having options to choose from, so you can pick the technology that best fits the customer's requirements and needs with respect to technical architecture and design. You obviously need to bear performance in mind, as always!

You would usually consider Spark when the requirements call for fast, interactive processing; conversely, for batch processing you would go for Hive or Pig. So, back to Apache Spark: it is a powerful engine for large-scale in-memory data processing, and that is where it fits among the data access tools in the Hadoop ecosystem.

In a nutshell, Spark is fast and runs programs up to 100x faster than MapReduce jobs in memory, or 10x faster on disk. You can use it from the Scala, Python and R shells.

I am not going into much more detail here, just a quick intro to flesh out the concepts so that we all stay on the same page (there is a small sketch right after the list). Spark's data abstractions are:

  • RDD – resilient distributed dataset, e.g. RDDs can be created from HDFS files
  • DataFrame – built on top of an RDD, or created from Hive tables or external SQL/NoSQL databases. As in the RDBMS world, data is organized into columns
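To make the two abstractions a bit more concrete, here is a minimal PySpark sketch. The file path and table name are just placeholders for illustration, not part of this walkthrough:

>>> lines = sc.textFile("file:///tmp/example.txt")   # RDD: a distributed collection of raw strings
>>> from pyspark.sql import HiveContext
>>> hc = HiveContext(sc)
>>> df = hc.table("some_existing_table")             # DataFrame: named, typed columns, like a table
>>> df.printSchema()                                 # columns and types, RDBMS-style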

 

Alright, let’s get started. Butterflies in my stomach again.

I will use PySpark – Spark via Python, as you can guess.

[maria_dev@sandbox ~]$ pyspark
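On the sandbox, the PySpark shell starts with a SparkContext already created for you and exposed as sc, so there is nothing to instantiate before step 1. A quick way to confirm it is there:

>>> sc           # the shell's pre-built SparkContext
>>> sc.version   # prints the Spark version running on the sandbox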


1. First import the local raw csv file into a Spark RDD

>>> csv_person = sc.textFile("file:///home/maria_dev/person.csv")

>>> type(csv_person)

By using the type command above, you can quickly double-check that the import into the RDD was successful.
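For reference, the output should look roughly like the below – the exact RDD class name may vary slightly by Spark version – and take() lets you peek at the first raw line to confirm the file really was read:

>>> type(csv_person)
<class 'pyspark.rdd.RDD'>
>>> csv_person.take(1)   # returns a list containing the first line of the file as one string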


2. Use Spark’s map( ) function to split csv data into a new csv_person RDD

>>> csv_person = csv_person.map(lambda p: p.split(","))
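It is worth peeking at the RDD after the split to confirm that each element is now a Python list of field values rather than a single string. A minimal check, assuming person.csv has no header row (if it did, the header would need to be filtered out before step 3):

>>> csv_person.take(2)   # each element should now be a list of column values
>>> # hypothetical header removal, only needed if the file had one:
>>> # header = csv_person.first(); csv_person = csv_person.filter(lambda p: p != header)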


3. Use toDF() function to put the data from the new RDD into a Spark DataFrame

Notice the use of the map() function to associate each RDD item with a Row object in the DataFrame. Row() captures the mapping of the individual values into named columns of a row to be saved into the new DataFrame. Row lives in pyspark.sql, so it needs to be imported first.

>>> from pyspark.sql import Row
>>> df_csv = csv_person.map(lambda p: Row(PersonId=int(p[0]), FirstName=p[1], Gender=p[2], City=p[3])).toDF()
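A quick sanity check on the new DataFrame is to print its schema. Note that when Row() is built from keyword arguments, Spark orders the columns alphabetically, so the output should look roughly like this:

>>> df_csv.printSchema()
root
 |-- City: string (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- PersonId: long (nullable = true)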


4. Verify all 6 rows of data in the df_csv DataFrame with the show() command

>>> df_csv.show(6)
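show() prints the rows themselves; if you only want to confirm the row count matches the source file, count() does the trick:

>>> df_csv.count()   # should return 6 for this file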


5. Finally, use saveAsTable() to store the data from the DataFrame into a Hive table in ORC format

>>> from pyspark.sql import HiveContext
>>> hc = HiveContext(sc)
>>> df_csv.write.format("orc").saveAsTable("person_spark")
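Once the table has been written, you can query it straight back from the same shell through the HiveContext – a quick check, assuming the table landed in the default database:

>>> hc.sql("SELECT * FROM person_spark").show()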


6. Log in to the Ambari Hive view to check your table


 

There you go. Happy days 🙂

 

Cheers,

Maria

 
