Best Uses of Hadoop Hive and the HiveQL Language for Greater Performance

To handle a massive volume of data, Facebook adopted Apache Hadoop in 2007. Facebook's data was growing very fast, and its earlier infrastructure, built on a commercial relational database, was taking a long time to complete its daily jobs. Hadoop was brought in so the platform could scale along with the data.

Apache Hive sits on top of Hadoop. It helps with data summarization, querying, and analysis. You can create databases, tables, and views, and access the data through queries.

Hive has made analysis easier. It provides an SQL-like query language called HiveQL (Hive Query Language), which has a well-defined data definition language and data manipulation language, just like SQL.

You can use Apache Hadoop to store data and run applications on clusters of commodity hardware. Hive provides a mechanism to project structure onto Hadoop data sets, so you neither write complex MapReduce programs nor need to learn Java. Apache Hadoop is one of Apache's flagship projects.
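For example, an aggregation that would otherwise require a hand-written MapReduce job takes only a few lines of HiveQL. This is a minimal sketch; the page_views table and its columns are hypothetical:

-- count page views per user; page_views and its columns are illustrative
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;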

Characteristics of Hive on Hadoop

Hive is one of the primary components of the Apache Hadoop ecosystem. It essentially removes the difficulty of writing complex Java MapReduce programs. With Hadoop you can store and organize huge amounts of data of any size or shape. Hive covers vital verticals of the Hadoop ecosystem. If you look at its features, you will find the characteristics of Hive listed below.

  • Hive is budget-friendly, open-source software.
  • You can access it through a browser from a PC, smartphone, or tablet in any location, with no software to download or install.
  • Using Hive, you can keep your information safe, secure, confidential, and up to date.
  • Hive supports online analytical processing (OLAP).
  • It stores schemas in a database (the metastore) and the processed data in HDFS.
  • It handles the low-level interface requirements of Apache Hadoop well.
  • Hive helps improve overall performance.
  • It supports client applications written in Java, PHP, Python, C++, and Ruby.
  • Hive helps with data summarization and is a capable Extract, Transform, Load (ETL) tool.
  • Basic SQL knowledge is enough to work with HiveQL.
  • It manages internal storage through tables, partitions, and buckets to improve performance.
  • Hive is applied in areas such as log processing, text mining, business analytics, predictive modeling, document indexing, and data mining.

Basic HiveQL Commands

Like SQL, Hive has both DDL and DML commands, through which you can create tables and databases, or use the show tables command to list the tables present in a database. HiveQL is built for querying large data sets, and it is managed through Hive's simple SQL-like query language. If you are familiar with SQL, you do not need to learn another programming language, and you can perform rich analysis with it. Below is basic information about the HiveQL commands of the Data Definition Language and the Data Manipulation Language.

Data Definition Language (DDL)

DDL commands are used to build and modify tables and other objects in the database, for instance the CREATE, DROP, TRUNCATE, ALTER, SHOW, and DESCRIBE statements.

To CREATE a database:

  • First, open the Hive shell with the command sudo hive, then enter the command create database <database name>; to create a new database in Hive.
  • Type the command show databases; to list the databases in Hive.
  • You can see that a default location has been created for the Hive warehouse; in Cloudera, Hive databases are stored under /user/hive/warehouse.
  • To use a database, type the command use <database name>;
  • Copy the input data into HDFS with the copyFromLocal command; you can then load the data from HDFS into a Hive table.
  • A table created in the retail database will be stored under /user/hive/warehouse/retail.db (see the sketch after this list).
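Put together, the steps above look roughly like the following sketch. The retail database name comes from the warehouse path above; the table definition is illustrative:

CREATE DATABASE retail;
SHOW DATABASES;
USE retail;
-- a simple managed table, stored under /user/hive/warehouse/retail.db
CREATE TABLE txnrecords (txn_id INT, category STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';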

Data Manipulation Language (DML)

DML commands are used to retrieve, store, delete, insert, and update data in the database, for instance the LOAD and INSERT statements.

Syntax: LOAD DATA [LOCAL] INPATH '<file path>' INTO TABLE <table name>;
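For instance, this sketch loads a local comma-separated file into a table; the file path and the txnrecords table are assumptions:

-- loads the local file into the assumed txnrecords table
LOAD DATA LOCAL INPATH '/home/cloudera/txns.txt' INTO TABLE txnrecords;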

Insert Command

This is used to load data into a Hive table. You can insert into a table or into a partition.

  • To overwrite the data in a table or partition, use INSERT OVERWRITE.
  • To append data to the existing data in a table, use INSERT INTO (see the sketch after this list).
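A minimal sketch of both variants, assuming the txnrecords table from earlier and a hypothetical category_totals summary table:

-- replaces whatever category_totals currently holds
INSERT OVERWRITE TABLE category_totals
SELECT category, SUM(amount) FROM txnrecords GROUP BY category;

-- appends to the existing contents instead
INSERT INTO TABLE category_totals
SELECT category, SUM(amount) FROM txnrecords GROUP BY category;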

Create External Table Command

To create a table and specify the location where it will be stored, we use the CREATE EXTERNAL TABLE command, so that Hive does not choose a default location for the table. It basically points to an HDFS location for its storage.
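A sketch, with an illustrative column list and HDFS path:

-- Hive uses the given LOCATION instead of the default warehouse path
CREATE EXTERNAL TABLE ext_txnrecords (txn_id INT, category STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/external/retail';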

Use Partitioned By and Clustered By Commands

To partition a table, you can use the PARTITIONED BY clause, and to divide it into buckets, use the CLUSTERED BY clause, as in the sketch below.
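A sketch of a table partitioned by category and bucketed on the transaction id; the table name, columns, and bucket count are assumptions:

-- the partition column goes in PARTITIONED BY, not in the main column list
CREATE TABLE txnrecords_part (txn_id INT, amount DOUBLE)
PARTITIONED BY (category STRING)
CLUSTERED BY (txn_id) INTO 4 BUCKETS;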

To enable dynamic partitioning so that Hive does not throw an error, type:

set hive.exec.dynamic.partition=true;

To set the dynamic partition mode to nonstrict, type:

set hive.exec.dynamic.partition.mode=nonstrict;
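With both settings in place, a dynamic-partition insert can route each row to its partition automatically. A sketch, assuming the txnrecords and txnrecords_part tables from the earlier examples:

-- the dynamic partition column (category) must come last in the SELECT list
INSERT OVERWRITE TABLE txnrecords_part PARTITION (category)
SELECT txn_id, amount, category FROM txnrecords;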

Aggregation

By using aggregation you can, for example, count how many distinct categories a table contains. You need to select count(DISTINCT category) from the table, as in the sketch below.
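A sketch against the assumed txnrecords table:

-- returns the number of distinct values in the category column
SELECT COUNT(DISTINCT category) FROM txnrecords;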

Grouping

By using the GROUP BY command you can aggregate the result set by one or more columns. For example, to sum the amount for each category in a transaction-records table, follow the sketch below.
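A sketch, again assuming the txnrecords table:

-- one row per category, with the summed amount for that category
SELECT category, SUM(amount) AS total
FROM txnrecords
GROUP BY category;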

Join Command

Join commands include the left outer join, right outer join, and full join. By using them you can combine fields from two tables based on values common to each table.

A left outer join keeps all records of the left table, and a right outer join is its mirror image: every row of the right table will appear at least once in the joined result. In a full join you get all records from both tables. Once you have finished, type the quit command to exit the Hive shell.
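A sketch of a left outer join, assuming a hypothetical customers table alongside txnrecords:

-- every customer appears; amounts are NULL for customers with no transactions
SELECT c.name, t.amount
FROM customers c
LEFT OUTER JOIN txnrecords t ON (c.id = t.customer_id);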

Concluding Notes

Apache Hive is a tool that makes querying and analysis simple. This article has demonstrated how you can use the HiveQL language to work with data in the Hadoop file system. Data scientists must be able to work with data at scale, which is why programmers should use Hadoop for data science.