Common Hive Commands: A Brief Introduction to MSCK REPAIR TABLE

2024-07-13 16:40:49 | Source: compiled from the web

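At work I noticed that many colleagues do not know even the basic Hive commands, so I plan to write a series summarizing the commonly used ones. The first command covered is MSCK REPAIR TABLE.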

What does MSCK REPAIR TABLE do?

MSCK REPAIR TABLE mainly solves one problem: data written into a Hive partitioned table with hdfs dfs -put or through the HDFS API cannot be queried from Hive.

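Hive has a service called the metastore, which stores metadata such as database names, table names, and the partitions of each table. When data is not written through Hive statements such as INSERT, the corresponding partitions are never registered in the metastore. If many partitions were written this way, adding them one by one with ALTER TABLE table_name ADD PARTITION is tedious. This is where MSCK REPAIR TABLE comes in: run it once, and Hive scans the table's files on HDFS and writes any partition that is missing from the metastore into the metastore.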
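To make the contrast concrete, here is a minimal sketch of the two approaches. It assumes a table named repair_test partitioned by par, and that the directories par=partition_2 and par=partition_3 already exist under the table location on HDFS; partition_3 is a hypothetical extra partition used only for illustration.

```sql
-- Option 1: register each partition by hand, one PARTITION clause per partition.
-- partition_3 is hypothetical and only shows how the list grows.
ALTER TABLE repair_test ADD IF NOT EXISTS
  PARTITION (par='partition_2')
  PARTITION (par='partition_3');

-- Option 2: let Hive scan the table location and register every
-- partition directory that the metastore does not know about yet.
MSCK REPAIR TABLE repair_test;
```

With only a couple of partitions either form is fine; the second one pays off when the number of untracked partition directories is large or not known in advance.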

Example

First we create a partitioned table, insert one row into one of its partitions, and then look at the partition information.

CREATE TABLE repair_test (col_a STRING) PARTITIONED BY (par STRING);
INSERT INTO TABLE repair_test PARTITION(par="partition_1") VALUES ("test");
SHOW PARTITIONS repair_test;

The partition listing looks like this:

0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO : Compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.029 seconds
INFO : Executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21): show partitions repair_test
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20180810175151_5260f52e-10bb-4589-ad48-31ba72a81c21); Time taken: 0.017 seconds
INFO : OK
+------------------+--+
|    partition     |
+------------------+--+
| par=partition_1  |
+------------------+--+
1 row selected (0.073 seconds)
0: jdbc:hive2://localhost:10000>

Next we create some data by hand with the hdfs put command:

[ericsson@h3cnamenode1 pcc]$ echo "123123" > test.txt
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -mkdir -p /user/hive/warehouse/test.db/repair_test/par=partition_2/
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -put -f test.txt /user/hive/warehouse/test.db/repair_test/par=partition_2/
[ericsson@h3cnamenode1 pcc]$ hdfs dfs -ls -R /user/hive/warehouse/test.db/repair_test
drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1
drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1
drwxrwxrwt - ericsson hive 0 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/.hive-staging_hive_2018-08-10_17-45-59_029_1594310228554990949-1/-ext-10000
-rwxrwxrwt 3 ericsson hive 5 2018-08-10 17:46 /user/hive/warehouse/test.db/repair_test/par=partition_1/000000_0
drwxr-xr-x - ericsson hive 0 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2
-rw-r--r-- 3 ericsson hive 7 2018-08-10 17:57 /user/hive/warehouse/test.db/repair_test/par=partition_2/test.txt
[ericsson@h3cnamenode1 pcc]$

If we list the partitions now, partition_2 has not been registered in Hive:

0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO : Compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.029 seconds
INFO : Executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79): show partitions repair_test
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20180810175959_e7cefe8c-57b5-486c-8e03-b1201dac4d79); Time taken: 0.02 seconds
INFO : OK
+------------------+--+
|    partition     |
+------------------+--+
| par=partition_1  |
+------------------+--+
1 row selected (0.079 seconds)
0: jdbc:hive2://localhost:10000>

After running MSCK REPAIR TABLE and listing the partitions again, the partition that was added with put is registered and its data can be queried:

0: jdbc:hive2://localhost:10000> MSCK REPAIR TABLE repair_test;
INFO : Compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.004 seconds
INFO : Executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f): MSCK REPAIR TABLE repair_test
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20180810180000_7099daf2-6fde-44dd-8938-d2a02589358f); Time taken: 0.138 seconds
INFO : OK
No rows affected (0.154 seconds)
0: jdbc:hive2://localhost:10000> show partitions repair_test;
INFO : Compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:partition, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.045 seconds
INFO : Executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25): show partitions repair_test
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20180810180000_ff711820-6f41-4d5d-9fee-b6e1cdbe1e25); Time taken: 0.016 seconds
INFO : OK
+------------------+--+
|    partition     |
+------------------+--+
| par=partition_1  |
| par=partition_2  |
+------------------+--+
2 rows selected (0.088 seconds)
0: jdbc:hive2://localhost:10000> select * from repair_test;
INFO : Compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:repair_test.col_a, type:string, comment:null), FieldSchema(name:repair_test.par, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.059 seconds
INFO : Executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38): select * from repair_test
INFO : Completed executing command(queryId=hive_20180810180101_1225075e-43c8-4a49-b8ef-a12f72544a38); Time taken: 0.001 seconds
INFO : OK
+--------------------+------------------+--+
| repair_test.col_a  | repair_test.par  |
+--------------------+------------------+--+
| test               | partition_1      |
| 123123             | partition_2      |
+--------------------+------------------+--+
2 rows selected (0.121 seconds)
0: jdbc:hive2://localhost:10000>

Follow-up

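Something more interesting happened later. Many people assumed that ALTER TABLE ... DROP PARTITION can only remove the data of a single partition, so they deleted the HDFS files of Hive partitioned tables with hdfs dfs -rmr instead. That created a problem: the files on HDFS were gone, but the metadata in the Hive metastore was not. SHOW PARTITIONS table_name still listed those partitions, so the stale partition metadata had to be cleaned up.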
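As a counterpoint to that assumption, a single ALTER TABLE ... DROP PARTITION statement can name several partitions, and it updates the metastore as part of the same operation. A minimal sketch, assuming a managed table named repair_test partitioned by par; the partition names are illustrative:

```sql
-- Drop several partitions in one statement. For a managed table this removes
-- the metastore entries and the data under each partition directory.
ALTER TABLE repair_test DROP IF EXISTS
  PARTITION (par='partition_1'),
  PARTITION (par='partition_2');

-- Confirm that the dropped partitions are no longer listed.
SHOW PARTITIONS repair_test;
```

The same statement should also work as a cleanup for partitions whose directories were already deleted from HDFS, since IF EXISTS keeps it from failing on partitions that are not registered; it is worth verifying this on a test table for your Hive version first.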

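Later I wanted to see whether MSCK REPAIR TABLE could remove metadata for partitions that no longer exist on HDFS. It could not. I checked JIRA and found "Fix Version/s: 3.0.0, 2.4.0, 3.1.0", meaning only those Hive versions support this. Since our Hive version is 1.1.0-cdh5.11.0, this approach was not available to us.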

For reference, here is the relevant section of the official documentation, Recover Partitions (MSCK REPAIR TABLE):

Recover Partitions (MSCK REPAIR TABLE)

Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command) or removed from HDFS, the metastore (and hence Hive) will not be aware of these changes to partition information unless the user runs ALTER TABLE table_name ADD/DROP PARTITION commands on each of the newly added or removed partitions, respectively.

However, users can run a metastore check command with the repair table option:

MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

which will update metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. The default option for MSCK command is ADD PARTITIONS. With this option, it will add any partitions that exist on HDFS but not in metastore to the metastore. The DROP PARTITIONS option will remove the partition information from metastore, that is already removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS. See HIVE-874 and HIVE-17824 for more details.

When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once.

MSCK command without the REPAIR option can be used to find details about metadata mismatch metastore.

The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:

ALTER TABLE table_name RECOVER PARTITIONS;

Starting with Hive 1.3, MSCK will throw exceptions if directories with disallowed characters in partition values are found on HDFS. Use hive.msck.path.validation setting on the client to alter this behavior; "skip" will simply skip the directories. "ignore" will try to create partitions anyway (old behavior). This may or may not work.
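The options described in the quoted documentation can be sketched as follows. This assumes a Hive version that includes the HIVE-17824 change (the DROP and SYNC forms are not available on older releases such as 1.1.0), a table named repair_test, and a batch size of 1000 that is an arbitrary illustrative value rather than a recommendation.

```sql
-- Process untracked partitions in batches instead of all at once
-- (the documented default of 0 means no batching). 1000 is illustrative.
SET hive.msck.repair.batch.size=1000;

-- Skip directories whose partition values contain disallowed characters
-- instead of throwing an exception.
SET hive.msck.path.validation=skip;

-- Report mismatches between HDFS and the metastore without repairing them.
MSCK TABLE repair_test;

-- Default behaviour: add partitions found on HDFS but missing from the metastore.
MSCK REPAIR TABLE repair_test;

-- Remove metastore entries whose directories no longer exist on HDFS.
MSCK REPAIR TABLE repair_test DROP PARTITIONS;

-- Add missing partitions and drop stale ones in a single pass.
MSCK REPAIR TABLE repair_test SYNC PARTITIONS;
```

Running the plain MSCK TABLE form first is a cautious way to see what a repair would touch before any metadata is changed.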

HIVE-17824 adds to MSCK REPAIR the ability to clean up partition metadata in the metastore for partitions that no longer exist on HDFS.

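Author: 润土1030
Link: https://www.jianshu.com/p/c1b0dc86f9b0
Source: 简书 (Jianshu)
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, cite the source.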


