pandas, dataframe, word2vec

Remove the first word of each row and use it as the index, like a one-hot encoded vector DataFrame, in pandas


I have word2vec vectors that were saved to a txt file with Gensim's save_word2vec_format, and I read that file with pandas. How can I remove the first word from each row and use it as the index? I want a DataFrame laid out like a one-hot encoding vector DataFrame. This is my txt file: https://drive.google.com/file/d/1O206N93hPSmvMjwc0W5ATyqQMdMwhRlF/view?usp=sharing


Solution

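  • I think you need read_csv with these options: skip the first row (the word2vec header line with the vocabulary size and vector dimension), change the separator to \s+ to match one or more whitespace characters, set the first column as the index, and pass header=None so the columns get default integer names. Finally, transpose with T so that each word becomes a column: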

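    import pandas as pd
    # raw string so that \s is passed to pandas as a regex, not treated as a Python escape
    df = pd.read_csv('model.txt', sep=r'\s+', index_col=0, header=None, skiprows=1).T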
    print (df.head())
    
    0       the       and        of         a        to        in        he  \
    1 -0.058613  0.015442 -0.158179  0.140175  0.093452  0.018156  0.119811   
    2 -0.167606 -0.107773 -0.029066 -0.206769 -0.091758 -0.089092 -0.154339   
    3  0.050763 -0.017081 -0.124401  0.155085  0.175548 -0.029413  0.246189   
    4  0.283456  0.208988  0.110836 -0.007077  0.265104  0.023497 -0.027724   
    5  0.152869 -0.006580 -0.009774  0.116188  0.039773  0.047682  0.008068   
    
    0        wa        it         i    ...         ammy       mim  candyman  \
    1  0.044857  0.351965  0.480889    ...     0.036848  0.060897  0.072883   
    2 -0.113168 -0.195455 -0.007680    ...    -0.008903 -0.024123 -0.023799   
    3  0.039933  0.143591  0.205823    ...     0.002832  0.014112  0.011426   
    4 -0.074092  0.075550 -0.089214    ...     0.003451  0.012912  0.016158   
    5 -0.107139  0.040009 -0.013390    ...    -0.000931 -0.006203  0.000539   
    
    0  washboiler  mincepie     ruben    croome    mamlet  postnotes   bettina  
    1    0.040233  0.048775  0.059252  0.029014  0.047536   0.034878  0.043068  
    2   -0.013842 -0.015706 -0.023821 -0.014749 -0.013498  -0.011608 -0.019654  
    3    0.006556  0.012816  0.004323 -0.006120  0.006841   0.008062  0.006986  
    4    0.011206  0.010511  0.012700  0.006781  0.007779   0.008678  0.016355  
    5   -0.003435 -0.003693 -0.003387 -0.002963 -0.003910  -0.001301 -0.003683  
    
    [5 rows x 31849 columns]
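
To make the steps easier to check without the full model file, here is a minimal sketch on a tiny in-memory sample. The sample text, the variable names (`sample`, `vectors`, `by_word`) and the `quoting=csv.QUOTE_NONE` argument are my own assumptions and not part of the original answer; the quoting option is only a precaution in case some tokens in the vocabulary contain quote characters.

```python
import csv
import io

import pandas as pd

# Hypothetical miniature file in word2vec text format:
# a header line "<vocab_size> <vector_size>", then one word and its vector per line.
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "and 0.5 0.6 0.7 0.8\n"
    "of -0.1 -0.2 -0.3 -0.4\n"
)

vectors = pd.read_csv(
    sample,
    sep=r"\s+",              # one or more whitespace characters
    header=None,             # the file has no column names
    skiprows=1,              # drop the "<vocab_size> <vector_size>" line
    index_col=0,             # the word becomes the row index
    quoting=csv.QUOTE_NONE,  # assumption: treat quote characters as ordinary text
)
vectors.index.name = "word"

# One row per word, one column per vector dimension.
print(vectors)

# Transposed layout from the answer: one column per word.
by_word = vectors.T
print(by_word)
```

If you would rather look vectors up by word, keeping `vectors` untransposed lets you write `vectors.loc['the']`; the transposed `by_word` gives the same values as `by_word['the']`.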