
Pyspark - How to get capitalized names?



from pyspark.sql import types as T

test = spark.createDataFrame(
    [
        (1, '2021-10-04 09:05:14', "For the 2nd copy of the ticket, access the link: wa.me/11223332211 (Whats) use ID and our number(1112222333344455). Duvidas, www.abtech.com . AB Tech"),
        (2, '2021-10-04 09:10:05', ". MARCIOG, let's try again? Get in touch to rectify your situation. For WhatsApp Link: ab-ab.ab.com/n/12345467. AB Tech"),
        (3, '2021-10-04 09:27:27', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/99998-88822 (Whats) ou 0800-999-9999. AB Tech"),
        (4, '2021-10-04 14:55:26', "Mr, SUELI. enjoy the holiday with money in your account. AB has great conditions for you. Call now and hire 0800899-9999 (Mon to Fri from 12pm to 6pm)"),
        (5, '2021-10-06 09:15:11', ". DEPREZC, let's try again? Get in touch to rectify your situation. For whatsapp Link: csi-csi.abtech.com/n/12345467. AB Tech"),
        (6, '2022-02-03 08:00:12', "Mr. SARA. We have great discount options. Regularize your situation with AB! Link: wa.me/25544-8855 (Whats) ou 0800-999-9999. AB."),
        (7, '2021-10-04 09:26:00', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/999999999 (Whats) or 0800-999-9999. AB Tech"),
        (8, '2018-10-09 12:31:33', "Mr.(a) ANTONI, regularize your situation with the Ammmm Bhhhh. Ligue 0800-729-2406 or access the CHAT www.abtech.com. AB Tech."),
        (9, '2018-10-09 15:14:51', "Follow code of bars of your updated deal for today (11111.111111 1111.11111 11111.111111 1 11111111111). Doubts call 0800-999-9999. AB Tech.")
    ],
    T.StructType(
        [
            # the sample ids are Python ints, so LongType (not StringType) matches the data
            T.StructField("id_mt", T.LongType(), True),
            T.StructField("date_send", T.StringType(), True),
            T.StructField("message", T.StringType(), True),
        ]
    ),
)

Could you tell me the logic to extract the uppercase names?

The expected answer is a new column named 'names' that holds the extracted uppercase name, or null when the message has none.



Solution

  • We made the Fugue project to port native Python or Pandas code to Spark or Dask. It lets you keep the logic very readable by expressing it in native Python, and Fugue then ports it to Spark for you with one function call.

    I think this specific case is hard in Spark but easy in native Python. I'll walk through the solution.

    First we make a Pandas DataFrame for quick testing:

    import pandas as pd

    df = pd.DataFrame(
        [
            (1, '2021-10-04 09:05:14', "For the 2nd copy of the ticket, access the link: wa.me/11223332211 (Whats) use ID and our number(1112222333344455). Duvidas, www.abtech.com . AB Tech"),
            (2, '2021-10-04 09:10:05', ". MARCIOG, let's try again? Get in touch to rectify your situation. For WhatsApp Link: ab-ab.ab.com/n/12345467. AB Tech"),
            (3, '2021-10-04 09:27:27', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/99998-88822 (Whats) ou 0800-999-9999. AB Tech"),
            (4, '2021-10-04 14:55:26', "Mr, SUELI. enjoy the holiday with money in your account. AB has great conditions for you. Call now and hire 0800899-9999 (Mon to Fri from 12pm to 6pm)"),
            (5, '2021-10-06 09:15:11', ". DEPREZC, let's try again? Get in touch to rectify your situation. For whatsapp Link: csi-csi.abtech.com/n/12345467. AB Tech"),
            (6, '2022-02-03 08:00:12', "Mr. SARA. We have great discount options. Regularize your situation with AB! Link: wa.me/25544-8855 (Whats) ou 0800-999-9999. AB."),
            (7, '2021-10-04 09:26:00', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/999999999 (Whats) or 0800-999-9999. AB Tech"),
            (8, '2018-10-09 12:31:33', "Mr.(a) ANTONI, regularize your situation with the Ammmm Bhhhh. Ligue 0800-729-2406 or access the CHAT www.abtech.com. AB Tech."),
            (9, '2018-10-09 15:14:51', "Follow code of bars of your updated deal for today (11111.111111 1111.11111 11111.111111 1 11111111111). Doubts call 0800-999-9999. AB Tech.")
        ],
        columns=["id_mt", "date_send", "message"],
    )
    

    Now we create native Python functions to extract the name. get_name_for_one_string operates on a single string, while get_names takes in the whole DataFrame.

    from typing import List, Dict, Any
    import re
    
    def get_name_for_one_string(message: str) -> str:
        # replace every run of non-letter characters with a single space
        message = re.sub(r"\s*[^A-Za-z]+\s*", " ", message)
        # split into individual words
        items = message.split(" ")
        # keep the words that are all caps and longer than 2 characters
        candidates = [x for x in items if (x.upper() == x and len(x) > 2)]
        # return the first match, or None when there is none
        return candidates[0] if len(candidates) > 0 else None
                                     
    # Fugue reads these type annotations and converts each partition of the
    # DataFrame to a list of dicts before calling the function
    def get_names(df: List[Dict[str,Any]]) -> List[Dict[str,Any]]:
        for row in df:
            row["names"] = get_name_for_one_string(row["message"])
        return df
    
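    As a quick sanity check, we can call the helper on one of the sample messages:

    get_name_for_one_string(". MARCIOG, let's try again?")
    # returns 'MARCIOG'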

    Now we can run this on a Pandas DataFrame using the Fugue transform function, and Fugue will handle the conversions.

    from fugue import transform
    transform(df, get_names, schema="*,names:str")
    
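    Because no engine is specified, this runs on Pandas and returns a Pandas DataFrame. The extracted values match the Spark output further down:

    transform(df, get_names, schema="*,names:str")["names"].tolist()
    # [None, 'MARCIOG', None, 'SUELI', 'DEPREZC', 'SARA', None, 'ANTONI', None]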

    This works, so now we can bring it to Spark just by specifying the engine.

    import fugue_spark
    transform(df, get_names, schema="*,names:str", engine="spark").show()
    
    +-----+-------------------+--------------------+-------+
    |id_mt|          date_send|             message|  names|
    +-----+-------------------+--------------------+-------+
    |    1|2021-10-04 09:05:14|For the 2nd copy ...|   null|
    |    2|2021-10-04 09:10:05|. MARCIOG, let's ...|MARCIOG|
    |    3|2021-10-04 09:27:27|, we do not ident...|   null|
    |    4|2021-10-04 14:55:26|Mr, SUELI. enjoy ...|  SUELI|
    |    5|2021-10-06 09:15:11|. DEPREZC, let's ...|DEPREZC|
    |    6|2022-02-03 08:00:12|Mr. SARA. We have...|   SARA|
    |    7|2021-10-04 09:26:00|, we do not ident...|   null|
    |    8|2018-10-09 12:31:33|Mr.(a) ANTONI, re...| ANTONI|
    |    9|2018-10-09 15:14:51|Follow code of ba...|   null|
    +-----+-------------------+--------------------+-------+
    

    Note that you need .show() because Spark evaluates lazily. The transform function accepts both Pandas and Spark DataFrames; if you use the Spark engine, the output will be a Spark DataFrame as well.
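
    For example, you can pass the test Spark DataFrame from the question directly and get the same table:

    transform(test, get_names, schema="*,names:str", engine="spark").show()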