python-3.x regex text-processing

Split a big text file into multiple smaller files at markers matched by a regex


I have a large text file looking like:

....
sdsdsd
..........

asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

......
ddss
................

asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

.....
xxxx
.......
asdfghjkl

I want to split the text file into multiple smaller text files and save them as .txt files on my system, splitting at each occurrence of ..... (a line of multiple periods). The files should be saved like this:

group1_sdsdsd.txt

....
sdsdsd
..........

asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

group1_ddss.txt

ddss
................

asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.

and

group1_xxxx.txt

.....
xxxx
.......

asdfghjkl

I have figured out that this can probably be done with a regex along the lines of the following:

txt = re.sub(r'(([^\w\s])\2+)', r' ', txt).strip()  # replace runs of a repeated punctuation character with a space

but I have not been able to work it out completely.

The saved text files should be named group1_sdsdsd.txt, group1_ddss.txt and group1_xxxx.txt. Here group1 is an identifier for the specific big text file: I have multiple big text files, need to do the same on all of them, and want to know which big file each part came from.


Solution

  • If you want to match the marker lines (three or more dots on a line of their own) and get the separate parts, you might use a pattern like:

    ^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*
    

    Explanation

    • ^ Start of the string, or the start of any line when re.MULTILINE is set (as in the example code)
    • \.{3,}\n Match 3 or more dots and a newline
    • (\S+)\n Capture 1+ non whitespace chars in group 1 for the filename and match a newline
    • \.{3,} Match 3 or more dots
    • (?: Non capture group to repeat as a whole part
      • \n Match a newline
      • (?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
      • .* Match the whole line
    • )* Close the non capture group and optionally repeat it
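
To check what the pattern captures before writing any files, it can be run on a small in-memory sample. This is only a sketch: `sample` is a hypothetical, shortened stand-in for the real file, and `names` and `sections` are illustrative variable names, not part of the original answer.

```python
import re

# Same pattern as above; re.MULTILINE lets ^ match at the start of each line.
pattern = re.compile(
    r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*",
    re.MULTILINE,
)

# Hypothetical, shortened stand-in for the big text file.
sample = (
    "....\n"
    "sdsdsd\n"
    "..........\n"
    "\n"
    "first body line.\n"
    "\n"
    "......\n"
    "ddss\n"
    "................\n"
    "\n"
    "second body line.\n"
    "\n"
    ".....\n"
    "xxxx\n"
    ".......\n"
    "asdfghjkl\n"
)

matches = list(pattern.finditer(sample))
names = [m.group(1) for m in matches]    # text between the two dotted lines
sections = [m.group() for m in matches]  # full section, marker lines included

print(names)  # ['sdsdsd', 'ddss', 'xxxx']
```

Each entry in `sections` starts at its own opening dotted line and stops just before the next dots/name/dots header, which is what the negative lookahead is there to guarantee.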

    Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.

    See a regex demo and a Python demo with the separate parts.

    Example code

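    import re

    pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"

    s = ("....your data here")

    your_path = "/your/path/"

    # re.MULTILINE makes ^ match at the start of every line, not only at the start of s
    for match in re.finditer(pattern, s, re.MULTILINE):
        # group 1 holds the name found between the two dotted lines
        filename = your_path + "group1_{}.txt".format(match.group(1))
        with open(filename, "w") as f:
            # match.group() is the whole section, marker lines included
            f.write(match.group())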
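
Since the question mentions several big files, each needing its own identifier, the same idea could be wrapped in a small helper. The following is a minimal sketch under stated assumptions: the helper name `split_file` is hypothetical, the identifier is assumed to be the big file's name without its extension (so `group1.txt` gives the `group1_` prefix), and the files are assumed to be UTF-8.

```python
import re
from pathlib import Path

# Same pattern as in the answer, compiled once for reuse.
SECTION = re.compile(
    r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*",
    re.MULTILINE,
)


def split_file(source, out_dir):
    """Write one <stem>_<name>.txt per section of `source`; return the paths.

    Hypothetical helper: the prefix is assumed to be the source file's stem.
    """
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    text = source.read_text(encoding="utf-8")  # assumed encoding
    written = []
    for match in SECTION.finditer(text):
        # group 1 is the name between the dotted lines
        target = out_dir / f"{source.stem}_{match.group(1)}.txt"
        target.write_text(match.group(), encoding="utf-8")
        written.append(target)
    return written
```

It might then be called once per big file, for example `split_file("/your/path/group1.txt", "/your/path/parts")`. Note that two sections with the same name in one file would overwrite each other, and a name containing a path separator would need sanitizing first; neither case appears in the sample data.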