Tags: python, text, word-cloud

Removing chars/signs from string


I'm preparing text for a word cloud, but I'm stuck.

I need to remove all digits and all signs such as . , - ? = / ! @, but I don't know how. I don't want to call replace again and again for each character. Is there a method for that?

Here is my concept and what I have to do:

  • Concatenate the texts into one string
  • Set chars to lowercase <--- I'm here
  • Now I want to delete specific signs and split the text into words (a list)
  • Calculate the frequency of the words
  • Next, run the stopwords script...

# Read the whole file of abstracts into one string and close the file afterwards
with open('new', 'r') as abstracts_file:
    allab = abstracts_file.read()
Lower = allab.lower()

Text example:

MicroRNAs (miRNAs) are a class of noncoding RNA molecules approximately 19 to 25 nucleotides in length that downregulate the expression of target genes at the post-transcriptional level by binding to the 3'-untranslated region (3'-UTR). Epstein-Barr virus (EBV) generates at least 44 miRNAs, but the functions of most of these miRNAs have not yet been identified. Previously, we reported BRUCE as a target of miR-BART15-3p, a miRNA produced by EBV, but our data suggested that there might be other apoptosis-associated target genes of miR-BART15-3p. Thus, in this study, we searched for new target genes of miR-BART15-3p using in silico analyses. We found a possible seed match site in the 3'-UTR of Tax1-binding protein 1 (TAX1BP1). The luciferase activity of a reporter vector including the 3'-UTR of TAX1BP1 was decreased by miR-BART15-3p. MiR-BART15-3p downregulated the expression of TAX1BP1 mRNA and protein in AGS cells, while an inhibitor against miR-BART15-3p upregulated the expression of TAX1BP1 mRNA and protein in AGS-EBV cells. Mir-BART15-3p modulated NF-κB activity in gastric cancer cell lines. Moreover, miR-BART15-3p strongly promoted chemosensitivity to 5-fluorouracil (5-FU). Our results suggest that miR-BART15-3p targets the anti-apoptotic TAX1BP1 gene in cancer cells, causing increased apoptosis and chemosensitivity to 5-FU.


Solution

  • To convert uppercase characters to lowercase, store your text in a string variable, for example STRING, and call its lower() method. No regular expression is needed for this step:

    STRING = STRING.lower()
    

    Now your string is free of capital letters.

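    To remove the digits and special characters, the sub function of the re module can help (import re first). Because the string is already lowercase, replace every run of characters that is not a letter a-z with a single space:

    STRING = re.sub('[^a-z]+', ' ', STRING)
    

    After this command your string contains only lowercase letters and spaces. Note that non-ASCII letters, such as the κ in NF-κB, are removed as well.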

    To determine the word frequency, you can use Counter from the collections module (from collections import Counter).

    Then use the following command to get the words ordered by how often they occur, most frequent first:

    Counter(STRING.split()).most_common()
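Putting the steps together, a minimal sketch of the whole pipeline might look like the following. The function name word_frequencies, the pattern [^a-z]+ and the tiny STOPWORDS set are assumptions made for illustration, not part of the question; a real stopword list would be much longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stopword list used only for this example.
STOPWORDS = {"a", "an", "and", "are", "by", "in", "of", "that", "the", "to"}


def word_frequencies(text, stopwords=STOPWORDS):
    """Lowercase the text, keep only the letters a-z, and count the words."""
    text = text.lower()
    # Replace every run of characters that is not a-z with one space.
    # This removes digits, punctuation and non-ASCII letters in one pass.
    text = re.sub(r"[^a-z]+", " ", text)
    # Split on whitespace and drop the stopwords before counting.
    words = [word for word in text.split() if word not in stopwords]
    return Counter(words)


sample = ("The virus miR-BART15-3p targets TAX1BP1, "
          "and miR-BART15-3p downregulated TAX1BP1 mRNA.")
freq = word_frequencies(sample)
print(freq.most_common())
```

Be aware that stripping digits and hyphens splits identifiers: miR-BART15-3p becomes the three words mir, bart and p. If such names should stay intact in the word cloud, the character class would need to be adjusted to keep digits and hyphens.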