Search code examples
hiveloadimport-from-csv

Loading Data from Remote Machine to Hive Database


I have a CSV file stored on a remote machine. I need to load this data into my Hive Database which is installed in different machine. Is there any way to do this?

note: I am using Hive 0.12.


Solution

  • Since Hive basically applies a schema to data that resides in HDFS, you'll want to create a location in HDFS, move your data there, and then create a Hive table that points to that location. If you're using a commercial distribution, this may be possible from Hue (the Hadoop User Environment web UI).

    Here's an example from the command line.

    Create csv file on local machine:

    $ vi famous_dictators.csv
    

    ... and this is what the file looks like:

    $ cat famous_dictators.csv 
    1,Mao Zedong,63000000
    2,Jozef Stalin,23000000
    3,Adolf Hitler,17000000
    4,Leopold II of Belgium,8000000
    5,Hideki Tojo,5000000
    6,Ismail Enver Pasha,2500000
    7,Pol Pot,1700000
    8,Kim Il Sung,1600000
    9,Mengistu Haile Mariam,950000
    10,Yakubu Gowon,1100000
    

    Then scp the csv file to a cluster node:

    $ scp famous_dictators.csv hadoop01:/tmp/
    

    ssh into the node:

    $ ssh hadoop01
    

    Create a folder in HDFS:

    [awoolford@hadoop01 ~]$ hdfs dfs -mkdir /tmp/famous_dictators/
    

    Copy the csv file from the local filesystem into the HDFS folder:

    [awoolford@hadoop01 ~]$ hdfs dfs -copyFromLocal /tmp/famous_dictators.csv /tmp/famous_dictators/
    

    Then login to hive and create the table:

    [awoolford@hadoop01 ~]$ hive
    
    hive> CREATE  TABLE `famous_dictators`(
        >   `rank` int, 
        >   `name` string,
        >   `deaths` int)
        > ROW FORMAT DELIMITED 
        >   FIELDS TERMINATED BY ',' 
        >   LINES TERMINATED BY '\n' 
        > LOCATION
        >   'hdfs:///tmp/famous_dictators';
    

    You should now be able to query your data in Hive:

    hive> select * from famous_dictators;
    OK
    1   Mao Zedong  63000000
    2   Jozef Stalin    23000000
    3   Adolf Hitler    17000000
    4   Leopold II of Belgium   8000000
    5   Hideki Tojo 5000000
    6   Ismail Enver Pasha  2500000
    7   Pol Pot 1700000
    8   Kim Il Sung 1600000
    9   Mengistu Haile Mariam   950000
    10  Yakubu Gowon    1100000
    Time taken: 0.789 seconds, Fetched: 10 row(s)