python regex regex-group cjk regex-greedy

RegEx for capturing Korean alphabets


The name column of my data frame looks like this:

'가락시장(340)',
'가락시장(8)',
'가산디지털단지(7)',
'강남(222)',
'강남구청',
'강동',
'강동구청',
'강변(214)',
'개롱',
'개화산',
'거여',
'건대입구(212)',
'건대입구(7)',
'경복궁(317)',
'경찰병원(341)',
'고덕',
'고려대',
'고속터미널(329)',
'고속터미널(7)',
'공덕(5)',
'공덕(6)',
'공릉',
'광나루',
...

The full list is in this link.

Desired Output:

가락시장
가산디지털단지
강남
강남구청
강동
강동구청
강변
개롱
개화산
거여
건대입구
경복궁
경찰병원
고덕
고려대
고속터미널
공덕
공릉
광나루

Attempt

import re

for i in df['name']:
    i = re.match('^[가-힣]*$', i)  # '^[가-힣]*$' is meant to extract Hangul (Korean alphabet)

But df['name'] does not change.

How do I solve this problem?


Solution

  • The desired output can likely be captured with a simple expression that uses the opening ' as a left boundary and then collects the letters that follow, similar to:

    '([\p{L}]+)
    

    Test

    # coding=utf8
    # the above tag defines encoding for this document and is for Python 2.x compatibility
    
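    import regex as re  # third-party "regex" package (pip install regex); the standard re module does not support \p{L}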
    
    regex = r"'([\p{L}]+)"
    
    test_str = ("'가락시장(340)',\n"
        " '가락시장(8)',\n"
        " '가산디지털단지(7)',\n"
        " '강남(222)',\n"
        " '강남구청',\n"
        " '강동',\n"
        " '강동구청',\n"
        " '강변(214)',\n"
        " '개롱',\n"
        " '개화산',\n"
        " '거여',\n"
        " '건대입구(212)',\n"
        " '건대입구(7)',\n"
        " '경복궁(317)',\n"
        " '경찰병원(341)',\n"
        " '고덕',\n"
        " '고려대',\n"
        " '고속터미널(329)',\n"
        " '고속터미널(7)',\n"
        " '공덕(5)',\n"
        " '공덕(6)',\n"
        " '공릉',\n"
        " '광나루',")
    
    matches = re.finditer(regex, test_str, re.MULTILINE)
    
    for match_num, match in enumerate(matches, start=1):
        # Report the full match, then each capture group with its span.
        print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))

        for group_num in range(1, len(match.groups()) + 1):
            print("Group {} found at {}-{}: {}".format(group_num, match.start(group_num), match.end(group_num), match.group(group_num)))
    
    # Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
    

    RegEx

    If this expression does not fit your data, it can be modified and tested on regex101.com.

    RegEx Circuit

    jex.im visualizes regular expressions.

    Reference

    How to implement \p{L} in python regex
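
The loop in the question cannot change df['name'] for two reasons. First, '^[가-힣]*$' must match the whole string, so re.match returns None for any name that carries a suffix such as (340). Second, rebinding the loop variable i never writes anything back to the column. Below is a minimal sketch of one possible fix, using only the standard re module and a plain list standing in for the column. The names HANGUL_PREFIX, strip_suffix, and unique_station_names are hypothetical helpers introduced here for illustration; they are not part of the original answer.

```python
import re

# Leading run of precomposed Hangul syllables (U+AC00..U+D7A3).
# Unlike \p{L}, this character class works with the standard re module.
HANGUL_PREFIX = re.compile(r"[가-힣]+")


def strip_suffix(name):
    """Return the leading Hangul part of a name, or the name unchanged if there is none."""
    m = HANGUL_PREFIX.match(name)
    return m.group(0) if m else name


def unique_station_names(names):
    """Strip suffixes and drop duplicates, keeping first-seen order."""
    # dict preserves insertion order (Python 3.7+), so it doubles as an ordered set.
    return list(dict.fromkeys(strip_suffix(n) for n in names))


# A short sample of the names from the question.
names = ["가락시장(340)", "가락시장(8)", "가산디지털단지(7)", "강남(222)", "강남구청"]
print(unique_station_names(names))
# → ['가락시장', '가산디지털단지', '강남', '강남구청']
```

With a pandas column, the same pattern could probably be applied and assigned back in one step, for example df['name'] = df['name'].str.extract(r'^([가-힣]+)', expand=False), followed by drop_duplicates() if repeated names should collapse as in the desired output. Note that str.extract yields NaN for rows where the pattern does not match.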