Drilling into Big Data — Data Querying and Analysis (6)

By Priyankaa Arunachalam, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud’s incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

In the previous article, we discussed Spark for big data and showed you how to set it up on Alibaba Cloud.

In this article of the blog series, we will walk you through the basics of Hive, including table creation and other underlying concepts for big data applications.

“Our ability to do great things with data will make a real difference in every aspect of our lives,” Jennifer Pahlka

There are different ways of executing MapReduce operations. The first is the traditional approach, where we write Java MapReduce programs for all types of data. The second is the scripting approach, which uses Pig to process structured and semi-structured data. The third is the Hive Query Language (HiveQL or HQL), which Hive uses to run MapReduce over structured data.

The Case for Hive

As discussed in our previous article, Hadoop is a vast ecosystem of tools and technologies, and at this point it is convenient to deploy Hive and Pig on top of it. Hive has its advantages over Pig, especially because its warehousing model makes data reporting and analysis easier.

Hive is built on top of Hadoop and is used for querying and analyzing data stored in HDFS. It helps programmers analyze large data sets and access the data easily through a query language called HiveQL. Hive internally converts these SQL-like queries into MapReduce jobs that run on Hadoop.
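To see this translation for yourself, HiveQL provides an EXPLAIN statement that prints the execution plan for a query instead of running it. The sketch below assumes a hypothetical table named page_views with a country column; neither appears elsewhere in this article. On a MapReduce execution engine, the plan would list the map and reduce stages that Hive generates.

```sql
-- Hypothetical table, used only for illustration
CREATE TABLE IF NOT EXISTS page_views (user_id STRING, country STRING);

-- Print the plan Hive would run for this aggregation instead of executing it
EXPLAIN
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country;
```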

Impala is often mentioned alongside Hive. It is also a query engine built on top of Hadoop, but Hive has its own place in the market and is more widely supported. Impala makes use of an existing Hive installation, since many Hadoop users already have Hive in place for batch-oriented jobs.

The main goal of Impala is to provide fast and efficient operations through SQL. Integrating Hive with Impala lets users choose either Hive or Impala for processing or for creating tables. Impala uses a language called ImpalaQL, which is a subset of HiveQL. In this article, we will focus on Hive.

Features of Hive

  • Hive is designed for managing and querying only structured data

Hive and Relational Databases

Relational databases follow “Schema on Write”: data is validated against the table schema as it is written, and operations such as insertions, updates, and modifications can be performed on individual rows. By borrowing the concept of “write once read many (WORM)”, Hive was designed around “Schema on Read”, where the structure is applied only when the data is queried. A typical Hive query runs on multiple Data Nodes, so updating and modifying data across multiple nodes was difficult. Recent versions of Hive address this by supporting row-level inserts, updates, and deletes on transactional tables.
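As a rough sketch of the update support mentioned above, the statements below create a transactional table and modify individual rows. The table name museum_fees and its columns are hypothetical. The two SET lines are session settings that transactional tables commonly need when they are not already configured cluster-wide; depending on your Hive version, further settings may be required.

```sql
-- Assumed session settings; your cluster may already define them
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical table. Transactional tables are stored as ORC,
-- and older Hive releases also require them to be bucketed.
CREATE TABLE museum_fees (museum_id INT, fee STRING)
CLUSTERED BY (museum_id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE museum_fees VALUES (1, 'Yes'), (2, 'No');

-- Row-level changes, which non-transactional tables reject
UPDATE museum_fees SET fee = 'No' WHERE museum_id = 1;
DELETE FROM museum_fees WHERE museum_id = 2;
```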

File Formats

Hive supports various file formats, including flat or text files, SequenceFiles, RCFile and ORC files, Avro files, Parquet, and custom input and output formats. Text file is the default file format in Hive.
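The file format is chosen per table with the STORED AS clause. The following is a minimal sketch using hypothetical table names (visits_text, visits_orc, visits_parquet) to show the same columns kept in three formats, with the last table populated by converting the text data.

```sql
-- Hypothetical tables holding the same columns in different file formats
CREATE TABLE visits_text (museum STRING, visits INT)
STORED AS TEXTFILE;

CREATE TABLE visits_orc (museum STRING, visits INT)
STORED AS ORC;

-- Create-table-as-select copies the rows and rewrites them as Parquet
CREATE TABLE visits_parquet
STORED AS PARQUET
AS SELECT museum, visits FROM visits_text;
```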

Storage options in Hive

  • Metastore: The metastore is a major component that keeps track of all the metadata for databases, tables, data types, and so on.
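The metadata held in the metastore can be inspected from the Hive shell. The commands below are a small sketch; the last one assumes that the tripadvisor_museum_USonly table created later in this article already exists.

```sql
-- List the databases and tables registered in the metastore
SHOW DATABASES;
USE default;
SHOW TABLES;

-- Show columns plus storage details such as location, SerDe, and input/output format
DESCRIBE FORMATTED tripadvisor_museum_USonly;
```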

Hive SerDe

The Serializer/Deserializer in Hive, or Hive SerDe, is used for IO and handles both serialization and deserialization. There are different types of SerDe, such as native SerDe and custom SerDe, with which you can create tables. If ROW FORMAT is not specified, Hive uses a native SerDe. Apart from the SerDe types that ship with Hive, we can also write our own SerDe for our own data formats. At this initial stage, we will just get familiar with the concept, as it is an important one in Hive.

To recollect, whatever file you write to HDFS is simply stored as a file there. This is where Hive comes in: it can impose structure on different data formats. The points below describe how to start a Hive shell, how to use it, and some basic queries that help in understanding how Hive works.
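The difference between relying on a native SerDe and naming one explicitly can be sketched as follows. The table names ratings_delimited and ratings_csv are hypothetical. The second statement uses OpenCSVSerde, a CSV SerDe bundled with recent Hive releases, rather than the third-party CSV SerDe used later in this article.

```sql
-- No SerDe named: ROW FORMAT DELIMITED is handled by Hive's native SerDe
CREATE TABLE ratings_delimited (museum STRING, rating FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- SerDe named explicitly. OpenCSVSerde reads every column as STRING,
-- so numeric columns need a CAST at query time.
CREATE TABLE ratings_csv (museum STRING, rating STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE;
```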

Initiating Hive Shell

Log in to the Alibaba Cloud master Elastic Compute Service (ECS) instance and type “hive” to open the Hive shell. If you have configured Hive to run on a worker node rather than the master, log in to that host and open the Hive shell there.


Let’s look through some basic queries in HiveQL. The very first is creating a table in Hive.

Syntax

CREATE TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

Output

OK
Time taken: 5.905 seconds

In our case, let’s try creating a table with the columns present in the tripadvisor_merged sheet.

create table tripadvisor_museum_USonly (
  Sno string, Address string, Description string, Fee string,
  Longitude string, Latitude string, Lengthofvisit string, Museumname string,
  Phonenum string, Rank string, Rating float, Reviewcount string,
  TotalThingsToDo string, Country string, State string, Rankpercentage string,
  Art_Galleries string, Art_Museums string, Auto_Race_Tracks string,
  Families_Count string, Couple_Count string, Solo_Count string, Friends_Count string
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ("separatorChar" = ",", "quoteChar" = "\"")
stored as textfile
tblproperties ("skip.header.line.count"="1");

For the same case, let’s try creating an Avro table, since we have highlighted this file format throughout the blog series for its performance.

create table tripadvisor_museum_USonly_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "com.trip.avro",
  "name": "tripadvisor",
  "type": "record",
  "fields": [
    {"name":"Sno","type":"string"},
    {"name":"Address","type":"string"},
    {"name":"Description","type":"string"},
    {"name":"FeatureCount","type":"string"},
    {"name":"Fee","type":"string"},
    {"name":"Longitude","type":"string"},
    {"name":"Latitude","type":"string"},
    {"name":"Length_of_visit","type":"string"},
    {"name":"Museumname","type":"string"},
    {"name":"descri_sub","type":"string"}
  ]
}');

Now let’s insert the data into the Avro table using an insert statement. Below is a sample insert query for the Avro table.

insert overwrite table tripadvisor_museum_USonly_avro
select Sno, Address, Description, FeatureCount, Fee, Longitude, Latitude, Lengthofvisit, Museumname, descri_sub
from tripadvisor_museum_USonly;

Similarly, define all the columns present in the file and write a corresponding insert statement to load the data into the created tables.


Once done, list the table’s folder to view the data files.
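One way to do this without leaving the Hive shell is sketched below. The path in the second command is an assumption based on the common default warehouse directory; use the location reported by the first command for your own cluster.

```sql
-- Find the table's storage location in the "Location:" row of the output
DESCRIBE FORMATTED tripadvisor_museum_USonly_avro;

-- List the data files from inside the Hive shell (assumed default warehouse path)
dfs -ls /user/hive/warehouse/tripadvisor_museum_usonly_avro;
```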


Load Data

Generally, after creating a table in SQL, we insert data using the INSERT statement. But in Hive, we can load an entire dataset using the “LOAD DATA” statement.

Syntax

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;
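As an illustration of the syntax above, the two statements below load a CSV file into the tripadvisor_museum_USonly table created earlier. Both file paths are hypothetical placeholders; substitute the location of your own file.

```sql
-- LOCAL: copy a file from the local file system of the machine running the Hive shell.
-- OVERWRITE replaces any data already in the table.
LOAD DATA LOCAL INPATH '/tmp/tripadvisor_merged.csv'
OVERWRITE INTO TABLE tripadvisor_museum_USonly;

-- Without LOCAL: move a file that is already in HDFS into the table's directory.
-- Without OVERWRITE, the file is added to the existing data.
LOAD DATA INPATH '/user/hadoop/tripadvisor_merged.csv'
INTO TABLE tripadvisor_museum_USonly;
```

Note that LOAD DATA only copies or moves files; it does not parse or validate them, so the file layout has to match the table’s row format.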

Using OSS in Hive

To use Alibaba Cloud Object Storage Service (OSS) as storage in Hive, create an external table as follows:

CREATE EXTERNAL TABLE demo_table (userid INT, name STRING)
LOCATION 'oss://demo1bucket/users';

For example, write a script for creating an external table and save it as hiveDemo.sql. Once done, upload it to OSS.

USE DEFAULT;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.stats.autogather=false;
CREATE EXTERNAL TABLE demo_table (
  userid INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'oss://demo1bucket/users';

Create a new job in E-MapReduce based on the following configuration:

-f ossref://${bucket}/yourpath/hiveDemo.sql

Specify the bucket name in “${bucket}” and, in “yourpath”, the location where you saved the Hive script.

When Is Hive Not Suitable?

  • Hive may not be the best fit for OnLine Transaction Processing (OLTP)

Best Practices

  • Use Partitioning to improve query performance
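A minimal sketch of partitioning, assuming a hypothetical table museums_by_state that is filled from the tripadvisor_museum_USonly table created earlier. The state value 'CA' in the final query is only an example.

```sql
-- Hypothetical table partitioned by state; each state gets its own directory
CREATE TABLE museums_by_state (museumname STRING, rating FLOAT)
PARTITIONED BY (state STRING)
STORED AS ORC;

-- Allow Hive to derive partition values from the data being inserted
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column goes last in the SELECT list
INSERT OVERWRITE TABLE museums_by_state PARTITION (state)
SELECT Museumname, Rating, State
FROM tripadvisor_museum_USonly;

-- A filter on the partition column lets Hive read only the matching directory
SELECT museumname, rating
FROM museums_by_state
WHERE state = 'CA';
```

Partitioning helps most when queries regularly filter on the partition column and the column has a moderate number of distinct values.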

Reference: https://www.alibabacloud.com/blog/drilling-into-big-data-data-querying-and-analysis_594662?spm=a2c41.12741469.0.0
