python pdf python-3.7 pypdf

I need to extract the text from a PDF file and write it to a new .txt file


I need help with a Python script that reads a PDF file, copies every word in it, and writes the words to a new .txt file (one word per line). It should then delete the repeated words, count them, and print the count on the last line.


Solution

  • Install these libraries.

    PyPDF2 (To convert simple, text-based PDF files into text readable by Python)

    textract (To convert non-trivial, scanned PDF files into text readable by Python)

    nltk (To clean and convert phrases into keywords)

    Each of these libraries can be installed from a terminal (on macOS) with the following command, replacing LibraryName with PyPDF2, textract, or nltk:

    pip install LibraryName
    

    See this tutorial: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f

    Alternatively, use textract. It supports many file types, including PDF, so it may be the better choice.

    Follow these links:

    https://github.com/deanmalmgren/textract

    https://textract.readthedocs.io/en/latest/
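
The steps in the question could be sketched roughly as follows. This is only a minimal sketch under some assumptions: it uses the `PdfReader` class from PyPDF2 2.x or later (older releases and the linked tutorial use a different class name), it treats a "word" as any run of letters, digits, or underscores, it ignores case, and it reads "count" as the number of unique words left after removing repeats. The function names and the "Unique words:" label are made up for illustration.

```python
import re
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Return the text of every page of a text-based PDF as one string."""
    # Imported here so the word helpers below can be used without PyPDF2.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 2.x or later

    reader = PdfReader(pdf_path)
    # Scanned pages may yield no text, so fall back to an empty string.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def split_words(text):
    """Lowercase the text and return its words in order of appearance."""
    # Assumption: a word is a run of letters, digits, or underscores.
    return re.findall(r"\w+", text.lower())


def unique_words(words):
    """Drop repeated words while keeping first-seen order."""
    # dict keys keep insertion order in Python 3.7+.
    return list(dict.fromkeys(words))


def pdf_to_word_list(pdf_path, txt_path):
    """Write one unique word per line, then the count on the last line."""
    words = unique_words(split_words(extract_pdf_text(pdf_path)))
    lines = words + [f"Unique words: {len(words)}"]
    Path(txt_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(words)
```

Calling `pdf_to_word_list("input.pdf", "words.txt")` (both file names are placeholders) would then produce the .txt file. For scanned PDFs, where PyPDF2 finds no text, the body of `extract_pdf_text` could instead decode the bytes returned by `textract.process(pdf_path)`.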