How to get capitalized names?
from pyspark.sql import types as T
import pyspark.sql.functions as F
from datetime import datetime
from pyspark.sql.functions import to_timestamp
test = spark.createDataFrame(
    [
        (1, '2021-10-04 09:05:14', "For the 2nd copy of the ticket, access the link: wa.me/11223332211 (Whats) use ID and our number(1112222333344455). Duvidas, www.abtech.com . AB Tech"),
        (2, '2021-10-04 09:10:05', ". MARCIOG, let's try again? Get in touch to rectify your situation. For WhatsApp Link: ab-ab.ab.com/n/12345467. AB Tech"),
        (3, '2021-10-04 09:27:27', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/99998-88822 (Whats) ou 0800-999-9999. AB Tech"),
        (4, '2021-10-04 14:55:26', "Mr, SUELI. enjoy the holiday with money in your account. AB has great conditions for you. Call now and hire 0800899-9999 (Mon to Fri from 12pm to 6pm)"),
        (5, '2021-10-06 09:15:11', ". DEPREZC, let's try again? Get in touch to rectify your situation. For whatsapp Link: csi-csi.abtech.com/n/12345467. AB Tech"),
        (6, '2022-02-03 08:00:12', "Mr. SARA. We have great discount options. Regularize your situation with AB! Link: wa.me/25544-8855 (Whats) ou 0800-999-9999. AB."),
        (7, '2021-10-04 09:26:00', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/999999999 (Whats) or 0800-999-9999. AB Tech"),
        (8, '2018-10-09 12:31:33', "Mr.(a) ANTONI, regularize your situation with the Ammmm Bhhhh. Ligue 0800-729-2406 or access the CHAT www.abtech.com. AB Tech."),
        (9, '2018-10-09 15:14:51', "Follow code of bars of your updated deal for today (11111.111111 1111.11111 11111.111111 1 11111111111). Doubts call 0800-999-9999. AB Tech.")
    ],
    T.StructType(
        [
            # ids are Python ints, so LongType (StringType would fail schema verification)
            T.StructField("id_mt", T.LongType(), True),
            T.StructField("date_send", T.StringType(), True),
            T.StructField("message", T.StringType(), True),
        ]
    ),
)
Could you tell me what the logic would be to extract these uppercase names? The goal is a new column 'names' holding the extracted name (or null when the message has none).
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python, and Fugue can then port it to Spark for you with one function call. I think this specific case is hard in Spark but easy in native Python, so I'll walk through the solution.
First we make a Pandas DataFrame for quick testing:
import pandas as pd

df = pd.DataFrame(
    [
        (1, '2021-10-04 09:05:14', "For the 2nd copy of the ticket, access the link: wa.me/11223332211 (Whats) use ID and our number(1112222333344455). Duvidas, www.abtech.com . AB Tech"),
        (2, '2021-10-04 09:10:05', ". MARCIOG, let's try again? Get in touch to rectify your situation. For WhatsApp Link: ab-ab.ab.com/n/12345467. AB Tech"),
        (3, '2021-10-04 09:27:27', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/99998-88822 (Whats) ou 0800-999-9999. AB Tech"),
        (4, '2021-10-04 14:55:26', "Mr, SUELI. enjoy the holiday with money in your account. AB has great conditions for you. Call now and hire 0800899-9999 (Mon to Fri from 12pm to 6pm)"),
        (5, '2021-10-06 09:15:11', ". DEPREZC, let's try again? Get in touch to rectify your situation. For whatsapp Link: csi-csi.abtech.com/n/12345467. AB Tech"),
        (6, '2022-02-03 08:00:12', "Mr. SARA. We have great discount options. Regularize your situation with AB! Link: wa.me/25544-8855 (Whats) ou 0800-999-9999. AB."),
        (7, '2021-10-04 09:26:00', ", we do not identify the payment of the installment of your agreement, if paid disregard. You doubt, link: wa.me/999999999 (Whats) or 0800-999-9999. AB Tech"),
        (8, '2018-10-09 12:31:33', "Mr.(a) ANTONI, regularize your situation with the Ammmm Bhhhh. Ligue 0800-729-2406 or access the CHAT www.abtech.com. AB Tech."),
        (9, '2018-10-09 15:14:51', "Follow code of bars of your updated deal for today (11111.111111 1111.11111 11111.111111 1 11111111111). Doubts call 0800-999-9999. AB Tech.")
    ],
    columns=["id_mt", "date_send", "message"],
)
Now we create native Python functions to extract the name. get_name_for_one_string operates on a single string, while get_names takes in the whole DataFrame (as a list of dicts).
from typing import Any, Dict, List, Optional
import re

def get_name_for_one_string(message: str) -> Optional[str]:
    # replace runs of non-letter characters with a single space
    message = re.sub(r"\s*[^A-Za-z]+\s*", " ", message)
    # split into tokens
    items = message.split(" ")
    # keep tokens that are all caps and longer than 2 characters
    candidates = [x for x in items if x.upper() == x and len(x) > 2]
    if candidates:
        return candidates[0]
    return None

def get_names(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        row["names"] = get_name_for_one_string(row["message"])
    return df
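Because these are plain Python functions, they can be sanity-checked without Spark at all. A quick check on a couple of the sample messages (the helper is repeated here so the snippet runs on its own):

```python
import re
from typing import Optional

def get_name_for_one_string(message: str) -> Optional[str]:
    # replace runs of non-letter characters with a single space,
    # then keep the first all-caps token longer than 2 characters
    items = re.sub(r"\s*[^A-Za-z]+\s*", " ", message).split(" ")
    candidates = [x for x in items if x.upper() == x and len(x) > 2]
    return candidates[0] if candidates else None

print(get_name_for_one_string(". MARCIOG, let's try again?"))       # MARCIOG
print(get_name_for_one_string("we do not identify the payment"))    # None
print(get_name_for_one_string("Mr, SUELI. enjoy the holiday"))      # SUELI
```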
Now we can use this on a Pandas DataFrame with the Fugue transform function, and Fugue will handle the conversions:
from fugue import transform
transform(df, get_names, schema="*,names:str")
This works, so now we can bring it to Spark just by specifying the engine.
import fugue_spark
transform(df, get_names, schema="*,names:str", engine="spark").show()
+-----+-------------------+--------------------+-------+
|id_mt| date_send| message| names|
+-----+-------------------+--------------------+-------+
| 1|2021-10-04 09:05:14|For the 2nd copy ...| null|
| 2|2021-10-04 09:10:05|. MARCIOG, let's ...|MARCIOG|
| 3|2021-10-04 09:27:27|, we do not ident...| null|
| 4|2021-10-04 14:55:26|Mr, SUELI. enjoy ...| SUELI|
| 5|2021-10-06 09:15:11|. DEPREZC, let's ...|DEPREZC|
| 6|2022-02-03 08:00:12|Mr. SARA. We have...| SARA|
| 7|2021-10-04 09:26:00|, we do not ident...| null|
| 8|2018-10-09 12:31:33|Mr.(a) ANTONI, re...| ANTONI|
| 9|2018-10-09 15:14:51|Follow code of ba...| null|
+-----+-------------------+--------------------+-------+
Note you need .show() because Spark evaluates lazily. The transform function can take in both Pandas and Spark DataFrames, and if you use the Spark engine the output will be a Spark DataFrame as well.
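As an aside, for this particular pattern (a standalone run of 3 or more uppercase letters) a single regular expression also works, which would let you stay in native Spark with the built-in regexp_extract function. The pattern below is an assumption based on the sample messages, so verify it against your real data; a sketch in plain Python:

```python
import re
from typing import Optional

# assumed pattern: a standalone run of 3+ uppercase letters
NAME_PATTERN = r"\b[A-Z]{3,}\b"

def first_caps_word(message: str) -> Optional[str]:
    # return the first match, or None if the message has no all-caps name
    match = re.search(NAME_PATTERN, message)
    return match.group(0) if match else None

print(first_caps_word(". MARCIOG, let's try again?"))      # MARCIOG
print(first_caps_word("For the 2nd copy of the ticket"))   # None

# The same pattern could be used directly in PySpark, e.g.:
#   test.withColumn("names", F.regexp_extract("message", r"\b[A-Z]{3,}\b", 0))
# (note regexp_extract returns an empty string, not null, when nothing matches)
```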