Hive
- Is data warehouse
- Uses HQL
Services
- Hive
- Metastore
- Stores metadata of tables and partitions
- Hive in remote mode:
- Metastore use its own JVM process and other services communite with it using Thrift API
- HiveServer2
Tuning
- Memory configuration
- GC limits
- Partitions number limits
HA
- Configure load balancer for HiveServer2
- Enable HA for metastore
Replication
Security
- Authentication: LDAP and Kerberos
- If Kerberos is not enabled no need to provide password when connecting using Beeline CLI
- Authorization: Ranger
- Encryption SASL QOP or SSL/TLS
About
- Hive do not enforce schema on write, but it
hive-site.xml
- hive.metastore.warehouse.dir : default dir in hdfs where data is stored
- Metastore
- Driver: javax.jdo.option.ConnectionDriverName
- Password: javax.jdo.option.ConnectionPassword
- Url: javax.jdo.option.ConnectionURL
core-site.xml
Data Model
Table
-
Show
show tables in <namespace>show tables 'emp*'
-
Describe
describe extended <tablename>describe formatted <tablename>
-
Describe column
describe extended <tablename.columnname>
-
Internal (managed)
- Data stored in default location
- The data is dropped when the table is dropped
-
External
- Location is not the default one
- Dropping table will not drop data, although table's metadata will be dropped
Partition
-
strict: prohibit queries without where clause in partitioned tables
-
nonstrict: allow running queries without where clause
set hive.mapred.mode=strict; -
We can describe and show partitions
show partitions <tablename>show partitions <tablename> PARTITION(col_name='value')
-
We can alter table by adding partition
-
no support for insert, update and delete
-
doesn't support transactions
Create table if not exists mydb.employee(
name string,
salary float,
adgress struct<street: string, city: string, state: string, zip: int>
)
PARTITIONED BY (country string, state string);- Load data from file to partitioned table
LOAD DATA LOCAL INPATH '${env:HOME}/path/to/dir'
INTO TABLE employees
PARTITION (col_part_name1 = 'value1', col_part_name2 = 'value2');
Bucket
- Each bucket is stored as file in partition directory
Types
- Java types (int, string, float, ...)
- Complex types: Array, Map, Struct
Internal vs external table
- Hive manages interanl but non external
- Drop internal delete metadata and path, but drop external remove only metadata
- Internal support archive, unarchive, ACID/transactional and result caching ?? but none of there features are supported by an external