Getting File Metadata

Let us see how to get metadata for the files stored in HDFS using the hdfs fsck command.

  • We have files copied under the HDFS location /user/${USER}/retail_db, and some large sample files under /public/randomtextwriter. We will run hdfs fsck against both.

  • We will first see how to get metadata of these files and then try to interpret it in subsequent topics.

  • HDFS stands for Hadoop Distributed File System. Files are stored in a distributed fashion across the nodes of the cluster.

  • Our cluster has master nodes and worker nodes. The files are physically stored on the worker nodes, where the data node process runs. We will cover this as part of the HDFS architecture.

  • Here are the worker nodes along with their private IP addresses.

Private IP      Full DNS              Short DNS
172.16.1.102    wn01.itversity.com    wn01
172.16.1.103    wn02.itversity.com    wn02
172.16.1.104    wn03.itversity.com    wn03
172.16.1.107    wn04.itversity.com    wn04
172.16.1.108    wn05.itversity.com    wn05
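
The table above can also be kept as a small lookup, which helps when reading the IP addresses that hdfs fsck reports. This is a minimal sketch in Python, assuming the notebook runs a Python kernel (which the %%sh cells suggest). The names WORKER_NODES and short_name are illustrative and not part of any Hadoop tool.

```python
# Worker-node table from this section, keyed by private IP.
WORKER_NODES = {
    "172.16.1.102": "wn01.itversity.com",
    "172.16.1.103": "wn02.itversity.com",
    "172.16.1.104": "wn03.itversity.com",
    "172.16.1.107": "wn04.itversity.com",
    "172.16.1.108": "wn05.itversity.com",
}


def short_name(ip):
    """Return the short DNS name (e.g. 'wn01') for a worker IP, or None if unknown."""
    fqdn = WORKER_NODES.get(ip)
    return fqdn.split(".")[0] if fqdn else None


print(short_name("172.16.1.107"))  # wn04
```

An IP that is not in the table, such as the name node address 172.16.1.101, returns None.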

%%sh

hdfs fsck -help
Usage: hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]] [-includeSnapshots] [-storagepolicies] [-blockId <blk_Id>]
	<path>	start checking from this path
	-move	move corrupted files to /lost+found
	-delete	delete corrupted files
	-files	print out files being checked
	-openforwrite	print out files opened for write
	-includeSnapshots	include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
	-list-corruptfileblocks	print out list of missing blocks and files they belong to
	-blocks	print out block report
	-locations	print out locations for every block
	-racks	print out network topology for data-node locations
	-storagepolicies	print out storage policy summary for the blocks
	-blockId	print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)
	-replicaDetails	print out each replica details 

Please Note:
	1. By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually  tagged CORRUPT or HEALTHY depending on their block allocation status
	2. Option -includeSnapshots should not be used for comparing stats, should be used only for HEALTH check, as this may contain duplicates if the same file present in both original fs tree and inside snapshots.

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

  • We can get a high-level overview of the retail_db folder by running hdfs fsck on its path.

%%sh

hdfs fsck /user/${USER}/retail_db
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db at Thu Jan 21 05:34:39 EST 2021
......Status: HEALTHY
 Total size:	9537787 B
 Total dirs:	7
 Total files:	6
 Total symlinks:		0
 Total blocks (validated):	6 (avg. block size 1589631 B)
 Minimally replicated blocks:	6 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Thu Jan 21 05:34:39 EST 2021 in 1 milliseconds


The filesystem under path '/user/itversity/retail_db' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&path=%2Fuser%2Fitversity%2Fretail_db
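
The summary figures above are related by simple arithmetic, which the following sketch checks. It parses a few summary lines copied from the output into a dictionary, then recomputes the average block size and the raw disk usage implied by the replication. The sample text is pasted by hand and parse_summary is a hypothetical helper, so treat this as an illustration, not a general fsck parser.

```python
import re

# A few summary lines copied from the fsck output above.
SAMPLE_SUMMARY = """\
 Total size:    9537787 B
 Total dirs:    7
 Total files:   6
 Total blocks (validated):      6 (avg. block size 1589631 B)
 Default replication factor:    2
 Average block replication:     2.0
"""


def parse_summary(text):
    """Collect 'Label: value' pairs from an fsck summary into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s+(.*\S)", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


summary = parse_summary(SAMPLE_SUMMARY)
total_size = int(summary["Total size"].split()[0])
total_blocks = int(summary["Total blocks (validated)"].split()[0])
replication = float(summary["Average block replication"])

# fsck reports the average block size as a whole number of bytes.
avg_block_size = total_size // total_blocks
# Every block is stored `replication` times across the worker nodes.
raw_bytes = int(total_size * replication)

print(avg_block_size)  # 1589631
print(raw_bytes)       # 19075574
```

With two replicas per block, the 9537787 bytes of data occupy 19075574 bytes on the worker nodes' disks.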

  • We can list the files under the path, with their sizes and block counts, using the -files option.

%%sh

hdfs fsck /user/${USER}/retail_db -files
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db at Thu Jan 21 05:35:17 EST 2021
/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, 1 block(s):  OK
/user/itversity/retail_db/customers <dir>
/user/itversity/retail_db/customers/part-00000 953719 bytes, 1 block(s):  OK
/user/itversity/retail_db/departments <dir>
/user/itversity/retail_db/departments/part-00000 60 bytes, 1 block(s):  OK
/user/itversity/retail_db/order_items <dir>
/user/itversity/retail_db/order_items/part-00000 5408880 bytes, 1 block(s):  OK
/user/itversity/retail_db/orders <dir>
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s):  OK
/user/itversity/retail_db/products <dir>
/user/itversity/retail_db/products/part-00000 174155 bytes, 1 block(s):  OK
Status: HEALTHY
 Total size:	9537787 B
 Total dirs:	7
 Total files:	6
 Total symlinks:		0
 Total blocks (validated):	6 (avg. block size 1589631 B)
 Minimally replicated blocks:	6 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Thu Jan 21 05:35:17 EST 2021 in 1 milliseconds


The filesystem under path '/user/itversity/retail_db' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&path=%2Fuser%2Fitversity%2Fretail_db
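
As a cross-check, the per-file sizes in the listing should add up to the Total size in the summary. The sketch below parses the file lines copied from the output above and sums them. The regular expression matches the line format shown here; other Hadoop versions may print these lines differently, and parse_files is a hypothetical helper.

```python
import re

# Listing copied from the -files output above.
SAMPLE_FILES_OUTPUT = """\
/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, 1 block(s):  OK
/user/itversity/retail_db/customers <dir>
/user/itversity/retail_db/customers/part-00000 953719 bytes, 1 block(s):  OK
/user/itversity/retail_db/departments <dir>
/user/itversity/retail_db/departments/part-00000 60 bytes, 1 block(s):  OK
/user/itversity/retail_db/order_items <dir>
/user/itversity/retail_db/order_items/part-00000 5408880 bytes, 1 block(s):  OK
/user/itversity/retail_db/orders <dir>
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s):  OK
/user/itversity/retail_db/products <dir>
/user/itversity/retail_db/products/part-00000 174155 bytes, 1 block(s):  OK
"""

# path, size in bytes, number of blocks, status
FILE_LINE = re.compile(r"^(\S+) (\d+) bytes, (\d+) block\(s\):\s+(\S+)$")


def parse_files(text):
    """Return (path, size, blocks, status) tuples for file lines; directories are skipped."""
    files = []
    for line in text.splitlines():
        match = FILE_LINE.match(line)
        if match:
            path, size, blocks, status = match.groups()
            files.append((path, int(size), int(blocks), status))
    return files


files = parse_files(SAMPLE_FILES_OUTPUT)
total_bytes = sum(size for _, size, _, _ in files)

print(len(files))   # 6
print(total_bytes)  # 9537787
```

Both values agree with Total files and Total size in the summary.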

  • Files in HDFS are physically stored on the worker nodes as blocks. We can get details of the blocks associated with each file by adding the -blocks option to -files.

%%sh

hdfs fsck /user/${USER}/retail_db -files -blocks
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db at Thu Jan 21 05:36:09 EST 2021
/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455898_41737435 len=1029 repl=2

/user/itversity/retail_db/customers <dir>
/user/itversity/retail_db/customers/part-00000 953719 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455899_41737436 len=953719 repl=2

/user/itversity/retail_db/departments <dir>
/user/itversity/retail_db/departments/part-00000 60 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455900_41737437 len=60 repl=2

/user/itversity/retail_db/order_items <dir>
/user/itversity/retail_db/order_items/part-00000 5408880 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455901_41737438 len=5408880 repl=2

/user/itversity/retail_db/orders <dir>
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455902_41737439 len=2999944 repl=2

/user/itversity/retail_db/products <dir>
/user/itversity/retail_db/products/part-00000 174155 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455903_41737440 len=174155 repl=2

Status: HEALTHY
 Total size:	9537787 B
 Total dirs:	7
 Total files:	6
 Total symlinks:		0
 Total blocks (validated):	6 (avg. block size 1589631 B)
 Minimally replicated blocks:	6 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Thu Jan 21 05:36:09 EST 2021 in 1 milliseconds


The filesystem under path '/user/itversity/retail_db' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&path=%2Fuser%2Fitversity%2Fretail_db

  • -blocks only provides the names of the blocks. We need to add -locations to get the details of the worker nodes where the blocks are physically stored.

  • A block is nothing but a physical file stored on the worker nodes. We will understand more about blocks as part of the subsequent topics.

  • To find where a block is physically stored, look at the DatanodeInfoWithStorage part of the output. It contains the IP address of the data node, and we can look up the corresponding DNS name in the table above.

%%sh

hdfs fsck /user/${USER}/retail_db -files -blocks -locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db at Thu Jan 21 05:38:08 EST 2021
/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455898_41737435 len=1029 repl=2 [DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]

/user/itversity/retail_db/customers <dir>
/user/itversity/retail_db/customers/part-00000 953719 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455899_41737436 len=953719 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK]]

/user/itversity/retail_db/departments <dir>
/user/itversity/retail_db/departments/part-00000 60 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455900_41737437 len=60 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]

/user/itversity/retail_db/order_items <dir>
/user/itversity/retail_db/order_items/part-00000 5408880 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455901_41737438 len=5408880 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]

/user/itversity/retail_db/orders <dir>
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455902_41737439 len=2999944 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]

/user/itversity/retail_db/products <dir>
/user/itversity/retail_db/products/part-00000 174155 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455903_41737440 len=174155 repl=2 [DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]

Status: HEALTHY
 Total size:	9537787 B
 Total dirs:	7
 Total files:	6
 Total symlinks:		0
 Total blocks (validated):	6 (avg. block size 1589631 B)
 Minimally replicated blocks:	6 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Thu Jan 21 05:38:08 EST 2021 in 1 milliseconds


The filesystem under path '/user/itversity/retail_db' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fuser%2Fitversity%2Fretail_db
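
Reading the replica locations by eye gets tedious, so here is a small sketch that pulls the data node IPs out of one block line and translates them to short host names. The block line is copied from the categories file in the output above, and the lookup repeats the worker-node table from this section. WORKER_NODES and replica_hosts are illustrative names, not part of Hadoop.

```python
import re

# Short DNS names from the worker-node table, keyed by private IP.
WORKER_NODES = {
    "172.16.1.102": "wn01",
    "172.16.1.103": "wn02",
    "172.16.1.104": "wn03",
    "172.16.1.107": "wn04",
    "172.16.1.108": "wn05",
}

# Block line of categories/part-00000, copied from the output above.
BLOCK_LINE = (
    "0. BP-292116404-172.16.1.101-1479167821718:blk_1115455898_41737435 "
    "len=1029 repl=2 "
    "[DatanodeInfoWithStorage[172.16.1.108:50010,"
    "DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK], "
    "DatanodeInfoWithStorage[172.16.1.103:50010,"
    "DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]"
)


def replica_hosts(block_line):
    """Return the host of each replica, falling back to the raw IP if it is not in the table."""
    ips = re.findall(r"DatanodeInfoWithStorage\[(\d+(?:\.\d+){3}):\d+", block_line)
    return [WORKER_NODES.get(ip, ip) for ip in ips]


print(replica_hosts(BLOCK_LINE))  # ['wn05', 'wn02']
```

The pattern only matches addresses that directly follow DatanodeInfoWithStorage[, so the name node IP embedded in the block pool ID (BP-...) is ignored.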
%%sh

hdfs dfs -ls -h /public/randomtextwriter/part-m-00000
-rw-r--r--   3 hdfs hdfs      1.0 G 2017-01-18 20:24 /public/randomtextwriter/part-m-00000
%%sh

hdfs fsck /public/randomtextwriter/part-m-00000 -files -blocks -locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/randomtextwriter/part-m-00000 at Thu Jan 21 05:39:53 EST 2021
/public/randomtextwriter/part-m-00000 1102230331 bytes, 9 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1074171511_431441 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1074171524_431454 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1074171559_431489 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
3. BP-292116404-172.16.1.101-1479167821718:blk_1074171609_431539 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
4. BP-292116404-172.16.1.101-1479167821718:blk_1074171657_431587 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]
5. BP-292116404-172.16.1.101-1479167821718:blk_1074171691_431621 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
6. BP-292116404-172.16.1.101-1479167821718:blk_1074171721_431651 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
7. BP-292116404-172.16.1.101-1479167821718:blk_1074171731_431661 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
8. BP-292116404-172.16.1.101-1479167821718:blk_1074171736_431666 len=28488507 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK]]

Status: HEALTHY
 Total size:	1102230331 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	9 (avg. block size 122470036 B)
 Minimally replicated blocks:	9 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	3.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Thu Jan 21 05:39:53 EST 2021 in 0 milliseconds


The filesystem under path '/public/randomtextwriter/part-m-00000' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Frandomtextwriter%2Fpart-m-00000
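
The listing above shows eight blocks of 134217728 bytes (128 MiB) followed by a shorter final block. The sketch below reproduces that split with plain arithmetic. The block size is read off the output, not from the cluster configuration, and block_lengths is a hypothetical helper.

```python
BLOCK_SIZE = 134217728   # 128 MiB, the length of blocks 0-7 in the output above
FILE_SIZE = 1102230331   # size of part-m-00000 in bytes
REPLICATION = 3          # repl=3 on every block of this file


def block_lengths(file_size, block_size):
    """Split a file into full blocks plus one final partial block, if any."""
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        lengths.append(remainder)
    return lengths


lengths = block_lengths(FILE_SIZE, BLOCK_SIZE)
raw_bytes = FILE_SIZE * REPLICATION

print(len(lengths))  # 9
print(lengths[-1])   # 28488507
print(raw_bytes)     # 3306690993
```

The nine lengths match the len= values reported by fsck. With three replicas per block, the file of roughly 1 GB occupies about 3.3 GB across the worker nodes.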