
Split function when writing an opened file in Python


So I have a program in which I am supposed to take an external file, open it in Python, and then separate each word and each punctuation mark, including commas, apostrophes and full stops. Then I am supposed to save this file as the integer positions at which each word and punctuation mark occurs in the text.

For example:- I like to code, because to code is fun. A computer's skeleton.

In my program, I have to save this as:-

1,2,3,4,5,6,3,4,7,8,9,10,11,12,13,14

(Help for those who do not understand) 1-I, 2-like, 3-to, 4-code, 5-(,), 6-because, 7-is, 8-fun, 9-(.), 10-A, 11-computer, 12-('), 13-s, 14-skeleton

So this displays the position of each word; if a word repeats, it shows the position at which that word first occurred.

Sorry for the long explanation but here is my actual question. I have done this so far:-

    file = open('newfiles.txt', 'r')
    with open('newfiles.txt','r') as file:
        for line in file:
            for word in line.split():
                 print(word)  

And here is the result:-

  They
  say
  it's
  a
  dog's
  life,.....

Unfortunately, splitting the file this way does not separate words from punctuation, and it does not print the output horizontally. .split does not work on a file object either. Does anyone know a more effective way in which I can split the file, separating words from punctuation, and then store the separated words and punctuation together in a list?


Solution

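  • The built-in string method .split only works with simple delimiters. Without an argument, it splits on whitespace. For more complex splitting behavior, the easiest approach is a regular expression. re.split keeps any text matched by a capturing group, so the punctuation below survives as its own token while whitespace is dropped: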

    >>> s = "I like to code, because to code is fun. A computer's skeleton."
    >>> import re
    >>> delim = re.compile(r"""\s|([,.;':"])""")
    >>> tokens = filter(None, delim.split(s))
    >>> idx = {}
    >>> result = []
    >>> i = 1
    >>> for token in tokens:
    ...     if token in idx:
    ...         result.append(idx[token])
    ...     else:
    ...         result.append(i)
    ...         idx[token] = i
    ...         i += 1
    ...
    >>> result
    [1, 2, 3, 4, 5, 6, 3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 9]
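
To make the difference between the two kinds of split concrete, here is a small sketch that runs both on the sample line from the question. The helper name `tokenize` is made up for illustration, and the punctuation set is just the one used in the pattern above; extend it if your file contains other marks.

```python
import re

# Same delimiter as above: whitespace is discarded, while punctuation is
# captured by the group and is therefore kept in the output of re.split.
DELIM = re.compile(r"""\s|([,.;':"])""")


def tokenize(text):
    """Hypothetical helper: split text into words and punctuation marks."""
    # re.split yields None when the group did not take part in the match,
    # and '' between two adjacent delimiters; drop both.
    return [tok for tok in DELIM.split(text) if tok]


sample = "They say it's a dog's life,"

# Plain str.split leaves punctuation glued to the words.
whitespace_only = sample.split()

# The regex split separates apostrophes and commas into their own tokens.
tokens = tokenize(sample)

print(whitespace_only)
print(tokens)
```

Joining the tokens with `" ".join(tokens)` is one way to print them on a single line instead of one per line.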
    

    Also, going by your specification, I don't think you need to iterate over the file line by line. You can just do something like:

    with open('my file.txt') as f:
        s = f.read()
    

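    This reads the entire file into s as a single string. Note that I did not call open before the with statement: the with statement already opens the file (and closes it afterwards), so the extra open call in your code is redundant.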
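
Putting the pieces together, a minimal end-to-end sketch might look like the following. The function names `first_positions` and `convert_file` are hypothetical, and writing the result as one comma-separated line is an assumption based on the format shown in the question.

```python
import re

# Whitespace is dropped; punctuation is captured and kept as its own token.
DELIM = re.compile(r"""\s|([,.;':"])""")


def first_positions(text):
    """Return, for each token, the 1-based index at which it first appeared."""
    seen = {}
    result = []
    for token in filter(None, DELIM.split(text)):
        # setdefault stores the next free index only for tokens not seen yet,
        # and returns the stored index either way.
        result.append(seen.setdefault(token, len(seen) + 1))
    return result


def convert_file(src_path, dst_path):
    """Read src_path and write its token positions to dst_path."""
    with open(src_path) as src:
        positions = first_positions(src.read())
    with open(dst_path, "w") as dst:
        # Assumed output format: a single comma-separated line.
        dst.write(",".join(map(str, positions)))


print(first_positions("I like to code, because to code is fun. A computer's skeleton."))
```

Calling `convert_file('newfiles.txt', 'positions.txt')` would then process the file from the question; the output file name here is only a placeholder.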