Drilling into Big Data — Data Querying and Analysis (6)

By Priyankaa Arunachalam, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud’s incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

In the previous article, we have discussed about Spark for big data and show you how to set it up on Alibaba Cloud.

In this blog series, we will walk you through the basics of Hive, including table creation and other underlying concepts for big data applications.

“Our ability to do great things with data will make a real difference in every aspect of our lives,” Jennifer Pahlka

There are different ways of executing MapReduce operations. First is the traditional approach, where we use Java MapReduce program for all types of data. The second approach is the scripting approach for MapReduce to process structured and semi-structured data. This approach is achieved by using Pig. Then comes the Hive Query Language, HiveQL or HQL, for MapReduce to process structured data. This is achieved by Hive.

The Case for Hive

As discussed in our previous article, Hadoop is a vast array of tools and technologies and at this point, it is more convenient to deploy Hive and Pig. Hive has its advantages over Pig, especially since it can make data reporting and analyzing easier through warehousing.

Hive is built on top of Hadoop and used for querying and analysis of data that is stored in HDFS. It is a tool which helps programmers analyze large data sets and access the data easily with the help of a query language called HiveQL. This language internally converts the SQL-like queries into MapReduce jobs for deploying it on Hadoop.

We also have Impala at this standpoint, which is quite commonly heard along with Hive, but if you watch keenly, Hive has its own space in the market place and hence it has better support too. Impala is also a query engine built on top of Hadoop. It makes use of existing Hive as many Hadoop users already have it in place to perform batch oriented jobs.

The main goal of Impala is to make fast and efficient operations through SQL. Integrating Hive with Impala gives users an advantage to use either Hive or Impala for processing or to create tables. Impala uses a language called ImpalaQL which is a subset of HiveQL. In this article, we will focus on Hive.

Features of Hive

  • Hive is designed for managing and querying only structured data
  • While dealing with these structured data, Map Reduce doesn’t have optimization features like UDFs but Hive framework can help you better in terms of optimizations
  • The complexity of Map Reduce programming can be reduced using Hive as it uses HQL instead of Java. Hence the concepts are similar to SQL.
  • Hive uses partitioning to improve performance on certain queries.
  • A significant component of Hive is Metastore which resides in a relational database.

Hive and Relational Databases

Relational databases are of “Schema on Reaad and Schema on Write”, where functions like Insertions, Updates, and Modifications can be performed. By borrowing the concept of “write once read many (WORM)”, Hive was designed based on “Schema on Read only”. A typical Hive query runs on multiple Data Nodes and hence it was tough to update and modify data across multiple nodes. But this has been sorted out in the latest versions of Hive.

File Formats

Hive supports various file formats like the flat Files or text files, SequenceFiles, RC and ORC Files, Avro Files, Parquet and custom input and output formats. Text file is the default file format of Hive.

Storage options in Hive

  • Metastore: Metastore is a major component to look at, that keeps a track of all the metadata of database, tables, datatypes, etc.
  • Tables: There are two different types of tables available in Hive. They are normal tables and external tables. Both are similar to common tables in a Database but the word EXTERNAL lets you create a table and provide a LOCATION so that Hive does not use the default location.
    Another difference is that when the external table is deleted, you still have the data residing in HDFS. On the other hand, the data in the normal table gets deleted on deleting the table.
  • Partitions: Partitioning is slicing the tables, which are stored in different subdirectory within a table’s directory. This helps in boosting the query performance especially in cases like select statements with “WHERE” clause.
  • Buckets: Buckets are hashed partitions and they help in accelerating the speed of joins and sampling of data. To achieve this bucketing concept in hive, use clustered by command while creating a table. Each bucket will be stored as a file under the Table directory. At the time of table creation, fix the number of buckets. Based on the hash value, the data will be distributed across various buckets.

Hive SerDe

Serializer/Deserializer in Hive or Hive SerDe is used for the purpose of IO which can handle both serialization and deserialization in Hive. There are different types of SerDe like native SerDe and custom SerDe with which you can create tables. If the ROW FORMAT is not specified, then use native SerDe. Apart from different types of SerDe and we can also write our own SerDe for our own data formats. At this initial stage, we will just get familiar with this concept, as it is something important to concentrate in Hive.

To recollect, whatever file you write to HDFS, it is just stored as a file over there. Here comes Hive, which can impose structure on different data formats. The points below elaborates on initiating a hive shell, its usage and some basic queries to start with and understand the working of hive.

Initiating Hive Shell

Log in to the Alibaba Cloud master Elastic Compue Service (ECS) Instance, and just type “hive” which leads to a screen as shown below. If you have configured Hive to run on a different worker node, rather than the master, then login to that particular host and open the hive shell.

Image for post
Image for post

Let’s look through some basic queries in HiveQL. The very first is creating a table in Hive.

Syntax

[(col_name data_type [COMMENT col_comment], ...)][COMMENT table_comment][ROW FORMAT row_format][STORED AS file_format]

Output

Time taken: 5.905 seconds

In our case, let’s try creating a table with the columns present in the tripadvisor_merged sheet.

Friends_Count string)row format serde 'com.bizo.hive.serde.csv.CSVSerde'with serdeproperties ("separatorChar" = ",","quoteChar"  = "\"")stored as textfiletblproperties ("skip.header.line.count"="1");
Image for post
Image for post

For the same case, let’s try creating an Avro table, as we have been speaking about this file format in the entire blog series for its best performance.

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'TBLPROPERTIES ('avro.schema.literal'='{"namespace": "com.trip.avro","name": "tripadvisor","type": "record","fields": [{"name":"Sno","type":"string"},{"name":"Address","type":"string"},{"name":"Description","type":"string"},{"name":"FeatureCount","type":"string"},{"name":"Fee","type":"string"},{"name":"Longitude","type":"string"},{"name":"Latitude","type":"string"},{"name":"Length_of_visit","type":"string"},{"name":"Museumname","type":"string},{"name":"descri_sub","type":"string"}]}');

Now let’s insert the data into the Avro table using insert statement. Below is a sample insert query for Avro table.

Longitude, Latitude, Lengthofvisit, Museumname, descri_sub fromtripadvisor_museum_USonly;

Similarly create all columns for the columns in the file and create a corresponding insert statement to insert the data into the created tables

Image for post
Image for post

Once done, list the folder to view it

Image for post
Image for post

Load Data

Generally, after creating a table in SQL, we used to insert data using the Insert statement. But in Hive, we can insert an entire dataset using the “LOAD DATA” statement.

Syntax

[OVERWRITE] INTO TABLE tablename ;

Using OSS in Hive

To use Alibaba Cloud Object Storage Service (OSS) as storage in Hive, create an External table as follows

userid INT, name STRING)LOCATION 'oss://demo1bucket/users';

For example, write a script for creating an external table and save it as hiveDemo.sql. Once done, upload it to OSS

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;set hive.stats.autogather=false;CREATE EXTERNAL TABLE demo_table (userid INT,name STRING)ROW FORMAT DELIMITEDFIELDS TERMINATED BY '\t'STORED AS TEXTFILELOCATION 'oss://demo1bucket/users';

Create a new job in E-MapReduce based on the following configuration

Specify the bucket name in “${bucket} and mention the location where you have saved the hive script in “yourpath”

When Is Hive Not Suitable?

  • Hive may not be the best fit for OnLine Transaction Processing (OLTP)
  • Real-time queries
  • Hive is not designed for updates at row level

Best Practices

  • Use Partitioning to improve query performance
  • If the bucket key and join keys are common, then bucketing can help improve the join performance
  • Selecting an input format plays a critical role which we are talking about in all our articles. Hive is not an exception in this aspect. Choosing appropriate file format can make Hive performance better
  • Sampling allows users to take a subset of dataset and analyze it, without having to struggle with the entire data set. Hive offers a built-in TABLESAMPLE clause which allows sampling of tables

Reference: https://www.alibabacloud.com/blog/drilling-into-big-data-data-querying-and-analysis_594662?spm=a2c41.12741469.0.0

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store