Getting File Metadata
Let us see how to get metadata for files stored in HDFS using the hdfs fsck command.
We have files copied under the HDFS location /user/${USER}/retail_db.
We also have some sample large files copied under the HDFS location /public/randomtextwriter.
We can inspect both locations with the hdfs fsck command.
We will first see how to get the metadata of these files and then interpret it in subsequent topics.
HDFS stands for Hadoop Distributed File System. It means files are stored in a distributed fashion.
Our cluster has master nodes and worker nodes. The files are physically stored on the worker nodes, where the DataNode process runs. We will cover this as part of the HDFS architecture.
Here are the details of the worker nodes along with their private IPs.
Private IP | Full DNS | Short DNS |
---|---|---|
172.16.1.102 | wn01.itversity.com | wn01 |
172.16.1.103 | wn02.itversity.com | wn02 |
172.16.1.104 | wn03.itversity.com | wn03 |
172.16.1.107 | wn04.itversity.com | wn04 |
172.16.1.108 | wn05.itversity.com | wn05 |
%%sh
hdfs fsck -help
Usage: hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]] [-includeSnapshots] [-storagepolicies] [-blockId <blk_Id>]
<path> start checking from this path
-move move corrupted files to /lost+found
-delete delete corrupted files
-files print out files being checked
-openforwrite print out files opened for write
-includeSnapshots include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
-list-corruptfileblocks print out list of missing blocks and files they belong to
-blocks print out block report
-locations print out locations for every block
-racks print out network topology for data-node locations
-storagepolicies print out storage policy summary for the blocks
-blockId print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)
-replicaDetails print out each replica details
Please Note:
1. By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually tagged CORRUPT or HEALTHY depending on their block allocation status
2. Option -includeSnapshots should not be used for comparing stats, should be used only for HEALTH check, as this may contain duplicates if the same file present in both original fs tree and inside snapshots.
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
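Since these cells run in a notebook, the same command can also be launched from Python. The following is a minimal sketch, assuming the hdfs client is on the PATH and the cluster is reachable. The helper names build_fsck_command and run_fsck are hypothetical and not part of Hadoop.

```python
import subprocess


def build_fsck_command(path, files=False, blocks=False, locations=False):
    """Build the argument list for `hdfs fsck` (hypothetical helper)."""
    cmd = ["hdfs", "fsck", path]
    # The usage text nests the options: [-files [-blocks [-locations | -racks]]],
    # so asking for a deeper level also adds the levels above it.
    if files or blocks or locations:
        cmd.append("-files")
    if blocks or locations:
        cmd.append("-blocks")
    if locations:
        cmd.append("-locations")
    return cmd


def run_fsck(path, **options):
    """Run fsck and return its standard output as text.

    Assumption: this only works on a machine with a configured hdfs client.
    """
    result = subprocess.run(
        build_fsck_command(path, **options),
        capture_output=True,
        text=True,
        check=False,  # inspect the report ourselves instead of raising on a non-zero exit
    )
    return result.stdout
```

For example, build_fsck_command("/public/randomtextwriter", locations=True) yields the argument list for hdfs fsck /public/randomtextwriter -files -blocks -locations.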
We can get a high-level overview of the retail_db folder by running hdfs fsck against its path.
%%sh
hdfs fsck /user/${USER}/retail_db
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db at Thu Jan 21 05:34:39 EST 2021
......Status: HEALTHY
Total size: 9537787 B
Total dirs: 7
Total files: 6
Total symlinks: 0
Total blocks (validated): 6 (avg. block size 1589631 B)
Minimally replicated blocks: 6 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu Jan 21 05:34:39 EST 2021 in 1 milliseconds
The filesystem under path '/user/itversity/retail_db' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&path=%2Fuser%2Fitversity%2Fretail_db
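If we want to use the summary programmatically, we can turn it into a dictionary. This is only a sketch: the field list is taken from the output above, and parse_summary is a hypothetical helper name. Other Hadoop versions may print different fields.

```python
# A few summary lines copied from the output above.
SAMPLE_SUMMARY = """\
......Status: HEALTHY
Total size: 9537787 B
Total dirs: 7
Total files: 6
Total blocks (validated): 6 (avg. block size 1589631 B)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Number of data-nodes: 5
"""

# Fields we care about; assumption: the names match the fsck version in use.
FIELDS = (
    "Status",
    "Total size",
    "Total dirs",
    "Total files",
    "Total blocks (validated)",
    "Default replication factor",
    "Average block replication",
    "Corrupt blocks",
    "Number of data-nodes",
)


def parse_summary(text):
    """Return the known 'Key: value' summary lines as a dict of strings."""
    summary = {}
    for line in text.splitlines():
        line = line.lstrip(". \t")  # drop progress dots and indentation
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            summary[key] = value.strip()
    return summary


summary = parse_summary(SAMPLE_SUMMARY)
total_size = int(summary["Total size"].split()[0])
total_blocks = int(summary["Total blocks (validated)"].split()[0])
# The average block size reported by fsck is total size divided by block count.
print(summary["Status"], total_size // total_blocks)
```

With the sample above, the integer division gives 1589631, which agrees with the avg. block size in the report.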
We can get details about file names using the -files option.
%%sh
hdfs fsck /user/${USER}/retail_db -files
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db at Thu Jan 21 05:35:17 EST 2021
/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, 1 block(s): OK
/user/itversity/retail_db/customers <dir>
/user/itversity/retail_db/customers/part-00000 953719 bytes, 1 block(s): OK
/user/itversity/retail_db/departments <dir>
/user/itversity/retail_db/departments/part-00000 60 bytes, 1 block(s): OK
/user/itversity/retail_db/order_items <dir>
/user/itversity/retail_db/order_items/part-00000 5408880 bytes, 1 block(s): OK
/user/itversity/retail_db/orders <dir>
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s): OK
/user/itversity/retail_db/products <dir>
/user/itversity/retail_db/products/part-00000 174155 bytes, 1 block(s): OK
Status: HEALTHY
Total size: 9537787 B
Total dirs: 7
Total files: 6
Total symlinks: 0
Total blocks (validated): 6 (avg. block size 1589631 B)
Minimally replicated blocks: 6 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu Jan 21 05:35:17 EST 2021 in 1 milliseconds
The filesystem under path '/user/itversity/retail_db' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&path=%2Fuser%2Fitversity%2Fretail_db
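The per-file lines can be cross-checked against the summary. Here is a small sketch that extracts the path, size and block count of each file and adds up the sizes. It assumes paths contain no spaces, and parse_files is a hypothetical helper name.

```python
import re

# File and directory lines copied from the output above.
SAMPLE_FILES = """\
/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, 1 block(s): OK
/user/itversity/retail_db/customers/part-00000 953719 bytes, 1 block(s): OK
/user/itversity/retail_db/departments/part-00000 60 bytes, 1 block(s): OK
/user/itversity/retail_db/order_items/part-00000 5408880 bytes, 1 block(s): OK
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s): OK
/user/itversity/retail_db/products/part-00000 174155 bytes, 1 block(s): OK
"""

# Matches "<path> <size> bytes, <n> block(s): <status>"; <dir> lines do not match.
FILE_LINE = re.compile(
    r"^(?P<path>/\S+) (?P<size>\d+) bytes, (?P<blocks>\d+) block\(s\):\s+(?P<status>.+)$"
)


def parse_files(text):
    """Return a list of (path, size_in_bytes, block_count) tuples."""
    files = []
    for line in text.splitlines():
        match = FILE_LINE.match(line.strip())
        if match:
            files.append(
                (match.group("path"), int(match.group("size")), int(match.group("blocks")))
            )
    return files


files = parse_files(SAMPLE_FILES)
print(len(files), sum(size for _path, size, _blocks in files))
```

The six file sizes add up to 9537787 bytes, the same value fsck reports as Total size.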
Files in HDFS are physically stored on the worker nodes as blocks. We can get details of the blocks associated with each file using the -blocks option.
%%sh
hdfs fsck /user/${USER}/retail_db -files -blocks
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db at Thu Jan 21 05:36:09 EST 2021
/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455898_41737435 len=1029 repl=2
/user/itversity/retail_db/customers <dir>
/user/itversity/retail_db/customers/part-00000 953719 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455899_41737436 len=953719 repl=2
/user/itversity/retail_db/departments <dir>
/user/itversity/retail_db/departments/part-00000 60 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455900_41737437 len=60 repl=2
/user/itversity/retail_db/order_items <dir>
/user/itversity/retail_db/order_items/part-00000 5408880 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455901_41737438 len=5408880 repl=2
/user/itversity/retail_db/orders <dir>
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455902_41737439 len=2999944 repl=2
/user/itversity/retail_db/products <dir>
/user/itversity/retail_db/products/part-00000 174155 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455903_41737440 len=174155 repl=2
Status: HEALTHY
Total size: 9537787 B
Total dirs: 7
Total files: 6
Total symlinks: 0
Total blocks (validated): 6 (avg. block size 1589631 B)
Minimally replicated blocks: 6 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu Jan 21 05:36:09 EST 2021 in 1 milliseconds
The filesystem under path '/user/itversity/retail_db' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&path=%2Fuser%2Fitversity%2Fretail_db
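Each block line packs several pieces of information into one string: the position of the block within the file, the block pool ID, the block ID, the generation stamp, the length in bytes and the replication. The sketch below splits one such line into named fields. The regular expression is written against the output shown above, and parse_block is a hypothetical helper name.

```python
import re

# "<index>. <block pool id>:blk_<block id>_<generation stamp> len=<bytes> repl=<n>"
BLOCK_LINE = re.compile(
    r"^(?P<index>\d+)\. (?P<pool>BP-\S+?):blk_(?P<block_id>\d+)_(?P<gen_stamp>\d+)"
    r" len=(?P<length>\d+) repl=(?P<repl>\d+)"
)


def parse_block(line):
    """Return the fields of one block line as a dict, or None if it does not match."""
    match = BLOCK_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Everything except the block pool ID is numeric.
    for name in ("index", "block_id", "gen_stamp", "length", "repl"):
        fields[name] = int(fields[name])
    return fields


line = "0. BP-292116404-172.16.1.101-1479167821718:blk_1115455898_41737435 len=1029 repl=2"
print(parse_block(line))
```

Because the pattern only anchors the start of the line, it also matches lines that carry the extra location details printed by -locations.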
The -blocks option only provides the names of the blocks. We also need to use -locations to get details about the worker nodes where the blocks are physically stored. A block is stored as a physical file on the local file system of a worker node. We will understand more about blocks as part of the subsequent topics.
To understand where a block is physically stored, look at the DatanodeInfoWithStorage part of the output. It contains the IP address of the worker node, and we can get the corresponding DNS name from the table above.
%%sh
hdfs fsck /user/${USER}/retail_db -files -blocks -locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db at Thu Jan 21 05:38:08 EST 2021
/user/itversity/retail_db <dir>
/user/itversity/retail_db/categories <dir>
/user/itversity/retail_db/categories/part-00000 1029 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455898_41737435 len=1029 repl=2 [DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
/user/itversity/retail_db/customers <dir>
/user/itversity/retail_db/customers/part-00000 953719 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455899_41737436 len=953719 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK]]
/user/itversity/retail_db/departments <dir>
/user/itversity/retail_db/departments/part-00000 60 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455900_41737437 len=60 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]
/user/itversity/retail_db/order_items <dir>
/user/itversity/retail_db/order_items/part-00000 5408880 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455901_41737438 len=5408880 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
/user/itversity/retail_db/orders <dir>
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455902_41737439 len=2999944 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
/user/itversity/retail_db/products <dir>
/user/itversity/retail_db/products/part-00000 174155 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455903_41737440 len=174155 repl=2 [DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
Status: HEALTHY
Total size: 9537787 B
Total dirs: 7
Total files: 6
Total symlinks: 0
Total blocks (validated): 6 (avg. block size 1589631 B)
Minimally replicated blocks: 6 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu Jan 21 05:38:08 EST 2021 in 1 milliseconds
The filesystem under path '/user/itversity/retail_db' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fuser%2Fitversity%2Fretail_db
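Looking up each IP address in the worker node table by hand gets tedious. The following sketch pulls the IP addresses out of the DatanodeInfoWithStorage entries and translates them to the short DNS names listed earlier. The dictionary is copied from the table on this page, and block_hosts is a hypothetical helper name.

```python
import re

# Private IP to short DNS, copied from the worker node table on this page.
WORKER_NODES = {
    "172.16.1.102": "wn01",
    "172.16.1.103": "wn02",
    "172.16.1.104": "wn03",
    "172.16.1.107": "wn04",
    "172.16.1.108": "wn05",
}

# DatanodeInfoWithStorage[<ip>:<port>,<storage id>,<storage type>]
LOCATION = re.compile(
    r"DatanodeInfoWithStorage\[(\d+\.\d+\.\d+\.\d+):(\d+),(DS-[0-9a-f-]+),(\w+)\]"
)

SAMPLE_LINE = (
    "0. BP-292116404-172.16.1.101-1479167821718:blk_1115455898_41737435 len=1029 repl=2 "
    "[DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK], "
    "DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]"
)


def block_hosts(line):
    """Return the short DNS names of the nodes holding the block on this line.

    IPs that are not in the table are returned unchanged.
    """
    return [
        WORKER_NODES.get(ip, ip)
        for ip, _port, _storage_id, _storage_type in LOCATION.findall(line)
    ]


print(block_hosts(SAMPLE_LINE))  # ['wn05', 'wn02']
```

So the only block of categories/part-00000 has one copy on wn05 and another on wn02, which is consistent with repl=2.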
%%sh
hdfs dfs -ls -h /public/randomtextwriter/part-m-00000
-rw-r--r-- 3 hdfs hdfs 1.0 G 2017-01-18 20:24 /public/randomtextwriter/part-m-00000
%%sh
hdfs fsck /public/randomtextwriter/part-m-00000 -files -blocks -locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/randomtextwriter/part-m-00000 at Thu Jan 21 05:39:53 EST 2021
/public/randomtextwriter/part-m-00000 1102230331 bytes, 9 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1074171511_431441 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1074171524_431454 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1074171559_431489 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
3. BP-292116404-172.16.1.101-1479167821718:blk_1074171609_431539 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
4. BP-292116404-172.16.1.101-1479167821718:blk_1074171657_431587 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]
5. BP-292116404-172.16.1.101-1479167821718:blk_1074171691_431621 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
6. BP-292116404-172.16.1.101-1479167821718:blk_1074171721_431651 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
7. BP-292116404-172.16.1.101-1479167821718:blk_1074171731_431661 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
8. BP-292116404-172.16.1.101-1479167821718:blk_1074171736_431666 len=28488507 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK]]
Status: HEALTHY
Total size: 1102230331 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 9 (avg. block size 122470036 B)
Minimally replicated blocks: 9 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu Jan 21 05:39:53 EST 2021 in 0 milliseconds
The filesystem under path '/public/randomtextwriter/part-m-00000' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Frandomtextwriter%2Fpart-m-00000
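The block count and block lengths in this output follow from simple arithmetic. The sketch below assumes a block size of 134217728 bytes (128 MB), which is inferred from the len values above rather than read from the cluster configuration. The function name split_into_blocks is hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 134217728 bytes, inferred from the len values above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the list of block lengths for a file of the given size."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover bytes
    return sizes


sizes = split_into_blocks(1102230331)
print(len(sizes), sizes[-1])  # 9 28488507

# Every block of this file is stored 3 times (repl=3),
# so the raw space used across the worker nodes is three times the file size.
print(3 * sum(sizes))  # 3306690993
```

This matches the report: eight full blocks of 134217728 bytes followed by a ninth block of 28488507 bytes. Note that the file has repl=3 even though the default replication factor reported by fsck is 2. The same value 3 appears in the second column of the hdfs dfs -ls output.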