Copying files from local to HDFS

We can copy files from the local file system to HDFS using either the copyFromLocal or the put command.

  • hdfs dfs -copyFromLocal or hdfs dfs -put copies files or directories from the local file system into HDFS. We can also use hadoop fs in place of hdfs dfs.

  • However, we cannot update or fix data in place once the files are in HDFS. To fix any data, we have to copy the file to the local file system, fix the data, and then copy it back to HDFS.

  • Files are divided into blocks based on the block size, and each block is stored on multiple DataNodes according to the replication factor. We will get into the details later.
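To make the last point concrete, here is a rough sketch of the arithmetic behind blocks and replication. The function names are hypothetical, and the 128 MB block size, replication factor of 3, and 300 MB file size are assumed values for illustration; the actual settings depend on how the cluster is configured.

```shell
# Number of HDFS blocks needed for a file: ceil(file_size / block_size).
num_blocks() {
  file_size=$1
  block_size=$2
  echo $(( (file_size + block_size - 1) / block_size ))
}

# Raw bytes stored across the cluster: every byte is kept once per replica.
raw_storage() {
  file_size=$1
  replication=$2
  echo $(( file_size * replication ))
}

MB=$(( 1024 * 1024 ))

# Assumed values: 300 MB file, 128 MB block size, replication factor 3.
num_blocks $(( 300 * MB )) $(( 128 * MB ))   # prints 3
raw_storage $(( 300 * MB )) 3                # prints 943718400
```

The 300 MB file needs three blocks (two full blocks and one partial block), and with three replicas it occupies roughly 900 MB of raw storage across the DataNodes.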


%%sh

hdfs dfs -ls /user/${USER}
%%sh

hdfs dfs -mkdir /user/${USER}/retail_db
%%sh

hdfs dfs -ls /user/${USER}
%%sh

hdfs dfs -ls /user/${USER}/retail_db
%%sh

hdfs dfs -help put
%%sh

hdfs dfs -help copyFromLocal

Warning

The put command below copies the entire folder into /user/${USER}/retail_db. Because that directory already exists, the files end up under /user/${USER}/retail_db/retail_db. The commands that follow it show how to get the files into the expected location.

%%sh

ls -ltr /data/retail_db
%%sh

hdfs dfs -put /data/retail_db /user/${USER}/retail_db
%%sh

hdfs dfs -ls /user/${USER}/retail_db
%%sh

hdfs dfs -ls /user/${USER}/retail_db/retail_db
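The nesting seen above follows a simple rule: when the target already exists as a directory, put places the source folder inside it; otherwise the target itself becomes the copy. The helper below is a hypothetical sketch of that rule using plain string handling, and the user name alice is only an example.

```shell
# Predict where `hdfs dfs -put SRC TARGET` places a local folder SRC.
# target_is_dir: "yes" if TARGET already exists in HDFS as a directory.
put_destination() {
  src=$1
  target=$2
  target_is_dir=$3
  if [ "$target_is_dir" = "yes" ]; then
    # Existing directory: SRC is copied inside it under its own name.
    echo "${target%/}/$(basename "$src")"
  else
    # Missing target: TARGET itself becomes the copy of SRC.
    echo "$target"
  fi
}

put_destination /data/retail_db /user/alice/retail_db yes  # /user/alice/retail_db/retail_db
put_destination /data/retail_db /user/alice/retail_db no   # /user/alice/retail_db
put_destination /data/retail_db /user/alice yes            # /user/alice/retail_db
```

On a real cluster, hdfs dfs -test -d with the target path exits with status 0 when the target is an existing directory, which is one way to supply the third argument.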

Note

Let us delete the nested folder and copy the files to the expected location. Because /user/${USER}/retail_db already exists, we can use a pattern such as /data/retail_db/* to copy the sub folders directly into it.

%%sh

hdfs dfs -help rm
%%sh

hdfs dfs -rm -R -skipTrash /user/${USER}/retail_db/retail_db
%%sh

hdfs dfs -ls /user/${USER}/retail_db/
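Because -skipTrash bypasses the trash (when trash is enabled) and deletes immediately, it can be worth checking the path before running a recursive delete. The function below is a hypothetical guard, not part of the HDFS CLI; it accepts only paths strictly below /user/<name>, and the user names are examples.

```shell
# Hypothetical guard: succeed only for paths strictly below /user/<user>.
is_safe_to_delete() {
  path=${1%/}          # ignore one trailing slash
  home="/user/$2"
  case "$path" in
    *..*)       return 1 ;;   # reject parent-directory tricks
    "$home"/?*) return 0 ;;   # something below the home directory
    *)          return 1 ;;   # the home directory itself or anything else
  esac
}

is_safe_to_delete /user/alice/retail_db/retail_db alice && echo "ok to delete"
is_safe_to_delete /user/alice alice || echo "refusing to delete the home directory"
```

A delete could then be written as is_safe_to_delete "$path" "$USER" && hdfs dfs -rm -R -skipTrash "$path".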
%%sh

hdfs dfs -put /data/retail_db/order* /user/${USER}/retail_db
%%sh

hdfs dfs -ls /user/${USER}/retail_db/
%%sh

hdfs dfs -put -f /data/retail_db/* /user/${USER}/retail_db
%%sh

hdfs dfs -ls /user/${USER}/retail_db/
%%sh

hdfs dfs -ls -R /user/${USER}/retail_db/
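The patterns in the put commands above, such as order* and *, refer to local paths, so the local shell expands them before hdfs receives its arguments. A small sketch with a throwaway local directory shows this; the folder names are illustrative and not taken from /data/retail_db.

```shell
# Build a temporary local directory with a few sub folders.
# The folder names are made up for this demonstration.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/orders" "$demo_dir/order_items" "$demo_dir/customers"

# echo shows exactly what the shell would hand to a command such as put.
matched=$(cd "$demo_dir" && echo order*)
all=$(cd "$demo_dir" && echo *)

echo "order* expands to: $matched"   # the two folders starting with "order"
echo "* expands to: $all"            # all three folders

# Clean up the temporary directory.
rm -rf "$demo_dir"
```

This is why put with /data/retail_db/order* copies only the matching sub folders, while put with /data/retail_db/* copies all of them.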

Note

Alternatively, we can use copyFromLocal, which works the same way as put for local sources.

%%sh

hdfs dfs -rm -R -skipTrash /user/${USER}/retail_db
%%sh

hdfs dfs -mkdir /user/${USER}/retail_db
%%sh

hdfs dfs -ls /user/${USER}/retail_db/
%%sh

hdfs dfs -copyFromLocal /data/retail_db/* /user/${USER}/retail_db
%%sh

hdfs dfs -ls /user/${USER}/retail_db

Note

Alternatively, we can copy the folder /data/retail_db itself so that it becomes /user/${USER}/retail_db. Let us first delete /user/${USER}/retail_db, using -skipTrash to bypass the trash and delete it immediately.

%%sh

hdfs dfs -rm -R -skipTrash /user/${USER}/retail_db

Note

If we specify /user/${USER} as the target, put creates the retail_db folder under it along with its contents.

%%sh

hdfs dfs -put /data/retail_db /user/${USER}
%%sh

hdfs dfs -ls /user/${USER}/retail_db

  • If we run hdfs dfs -put /data/retail_db /user/${USER} again, it fails because the files already exist in the target folder.

%%sh

hdfs dfs -put /data/retail_db /user/${USER}

  • We can pass -f to put or copyFromLocal to overwrite the files that already exist in the target folder.

%%sh

hdfs dfs -put -f /data/retail_db /user/${USER}
%%sh

hdfs dfs -ls /user/${USER}/retail_db
%%sh

hdfs dfs -ls -R /user/${USER}/retail_db
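The two behaviours above (a repeated put fails, and -f overwrites) can be combined into a small helper. This is a minimal sketch, assuming the hdfs command is on the PATH; upload_folder and its third argument are hypothetical names introduced for illustration, while -test -d, -put and -f are existing hdfs dfs options.

```shell
# Hypothetical helper: copy a local folder into an HDFS parent directory,
# overwriting existing files only when explicitly requested.
# Usage: upload_folder LOCAL_DIR HDFS_PARENT [yes|no]
upload_folder() {
  src=$1
  parent=$2
  overwrite=${3:-no}

  # Bail out early on machines without the Hadoop client.
  if ! command -v hdfs >/dev/null 2>&1; then
    echo "hdfs command not found" >&2
    return 127
  fi

  dest="${parent%/}/$(basename "$src")"
  if hdfs dfs -test -d "$dest"; then
    if [ "$overwrite" = "yes" ]; then
      # -f overwrites files that already exist at the destination.
      hdfs dfs -put -f "$src" "$parent"
    else
      echo "$dest already exists; pass yes as the third argument to overwrite" >&2
      return 1
    fi
  else
    hdfs dfs -put "$src" "$parent"
  fi
}
```

For example, upload_folder /data/retail_db /user/${USER} would perform the first copy, and upload_folder /data/retail_db /user/${USER} yes would repeat it with -f.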