String analysis: splitting strings into n parts by percentage of words


I need to calculate the length of each string in this list:

list_strings=["I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best","So many books, so little time.","In three words I can sum up everything I've learned about life: it goes on.","if you tell the truth, you don't have to remember anything.","Always forgive your enemies; nothing annoys them so much."]

and then split each string into three parts:

  • 30 % (first part)
  • 30 % (second part)
  • 40 % (third part)

I can calculate the length of each string in the list, but I do not know how to split each string into three parts and save them. For example, the first sentence "I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best" has length 201 (tokenisation), so I would need to take

  • the first 30% of 201 and save these words into an array (approximately the first 60 words);
  • the next 30% (i.e. the next 60 words);
  • finally the last 40%, i.e. the last 80 words.

I have read about chunking, but I have no idea how to apply it here. I also need a condition that ensures I take a whole number of words (a word cannot be split in half) and that I do not go beyond the length of the string.


Solution

  • Splitting text by percentage of characters, moving each cut back to the nearest non-letter character (a space or punctuation mark)

    def split_text(s):
      """Partition text into three parts
         in the proportion 30%, 40%, 30% (by character count)."""

      i1 = int(0.3*len(s))  # first part is s[:i1]
      i2 = int(0.7*len(s))  # second part is s[i1:i2], third is s[i2:]

      # str.isalpha() is False for spaces, punctuation and digits,
      # so those characters act as boundaries between words.
      # Back up while the cut falls inside a run of letters.
      # Testing the index first avoids an IndexError on an empty string.
      while i1 > 0 and s[i1].isalpha():
        i1 -= 1

      # Same for the second cut, never moving back past the first one
      while i2 > i1 and s[i2].isalpha():
        i2 -= 1

      # Return the three parts
      return s[:i1], s[i1:i2], s[i2:]
    
    
    for s in list_strings:
      # Loop over list reporting lengths of parts
      # Three parts are a, b, c
      a, b, c = split_text(s)
      print(f'{s}\nLengths: {len(a)}, {len(b)}, {len(c)}')
      print()
    

    Output

    I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best
    Lengths: 52, 86, 63
    
    So many books, so little time.
    Lengths: 7, 10, 13
    
    In three words I can sum up everything I've learned about life: it goes on.
    Lengths: 20, 31, 24
    
    if you tell the truth, you don't have to remember anything.
    Lengths: 15, 25, 19
    
    Always forgive your enemies; nothing annoys them so much.
    Lengths: 14, 22, 21
    

    Output of split_text

    Code

    for s in list_strings:
        a, b, c = split_text(s)
        print(a)
        print(b)
        print(c)
        print()
    

    Result

    I'm selfish, impatient and a little insecure. I make
     mistakes, I am out of control and at times hard to handle. But if you can't handle me
     at my worst, then you sure as hell don't deserve me at my best
    
    So many
     books, so
     little time.
    
    In three words I can
     sum up everything I've learned
     about life: it goes on.
    
    if you tell the
     truth, you don't have to
     remember anything.
    
    Always forgive
     your enemies; nothing
     annoys them so much.
    

    To Capture the Partitions

    result_a, result_b, result_c = [], [], []
    for s in list_strings:
        # Split each string and collect the parts in three parallel lists
        a, b, c = split_text(s)
        result_a.append(a)
        result_b.append(b)
        result_c.append(c)
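
Word-based variant

split_text cuts by character count in the proportion 30/40/30, whereas the question asks for 30/30/40 of the words. Below is a minimal sketch of a word-based split, assuming a "word" is whatever str.split() yields (so punctuation stays attached to the neighbouring word). The names split_words and fractions are hypothetical and not part of the original answer. Rounding the cumulative cut points to integers and clamping them to the word count keeps every word whole and never reads past the end of the list.

```python
def split_words(s, fractions=(0.3, 0.3, 0.4)):
    """Split s into len(fractions) lists of words, sized by the given fractions.

    Assumes the fractions sum to roughly 1; the last part takes
    whatever words remain.
    """
    words = s.split()
    n = len(words)
    parts = []
    start = 0
    cumulative = 0.0
    for frac in fractions[:-1]:
        cumulative += frac
        # Whole-word cut point, clamped so it never moves backwards
        # and never goes beyond the number of words.
        end = min(n, max(start, round(cumulative * n)))
        parts.append(words[start:end])
        start = end
    # The final part is everything left over
    parts.append(words[start:])
    return parts


quote = "So many books, so little time."
for part in split_words(quote):
    print(" ".join(part))
```

For short strings the parts can only approximate the requested percentages: the six words above end up split 2/2/2. Passing a different tuple, for example fractions=(0.3, 0.4, 0.3), changes the proportions without touching the rest of the code.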