python, python-2.7, python-docx

docx to list in python


I am trying to read a docx file and add its text to a list, so that each line of the docx file becomes one item in the list.

example:

docx file:

"Hello, my name is blabla,
I am 30 years old.
I have two kids."

result:

['Hello, my name is blabla', 'I am 30 years old', 'I have two kids']

I can't get it to work.

I am using the docx2txt module from GitHub.

It has only one function, process, and it returns all the text from the docx file as a single string.

I would also like to keep special characters such as ":\-\.\,"


Solution

  • The docx2txt module reads a docx file and converts it to plain text.

    Split that text with splitlines() and store the resulting lines in a list.

    Code (comments inline):

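    from __future__ import print_function  # print() behaves the same on Python 2.7 and 3

    import docx2txt

    # Extract all text from the docx file as one string
    text = docx2txt.process("a.docx")

    # Print the text after conversion
    print("After converting text is ", text)

    content = []
    for line in text.splitlines():
        # Skip empty and whitespace-only lines
        if line.strip():
            # Append the line to the list
            content.append(line)

    print("List is ", content)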
    

    Output:

    C:\Users\dinesh_pundkar\Desktop>python c.py
    After converting text is
     Hello, my name is blabla.
    
    I am 30 years old.
    
    I have two kids.
    
     List is  ['Hello, my name is blabla.', 'I am 30 years old. ', 'I have two kids.']
    
    C:\Users\dinesh_pundkar\Desktop>
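As the output shows, lines can keep trailing spaces (for example 'I am 30 years old. '). If that is not wanted, the loop can be wrapped in a small helper that also trims each line. This is only a sketch: text_to_lines is a hypothetical name, not part of docx2txt, and the sample string merely stands in for whatever docx2txt.process() returns. Punctuation such as ":", "-", "." and "," is left untouched, because only leading and trailing whitespace is removed.

```python
def text_to_lines(text):
    """Split text into a list of non-blank lines, trimming outer whitespace."""
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:  # skip empty and whitespace-only lines
            lines.append(stripped)
    return lines


# Assumption: this sample imitates the string returned by docx2txt.process(),
# with blank lines between paragraphs and a stray trailing space.
sample = "Hello, my name is blabla,\n\nI am 30 years old. \n\nI have two kids.\n"
result = text_to_lines(sample)
print(result)
```

With a real file, the call would be text_to_lines(docx2txt.process("a.docx")).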