I am trying to extract the highlighted "numeric information" from a Pandas DataFrame column:
Text |
---|
Dimensions: 23"/60 |
Dimensions: 23" / 60 |
Dimensions: 48" |
Dimensions: 22.5X8.25 |
Dimensions: 80IN |
Dimensions: 567 S |
Dimensions: 22.5X8.25 |
Dimensions: 26INNP |
Dimensions: 24" x 55" with pipe 16 x 7 |
I am using regex and is as follows: regex = r"(\d([^\s]*)\s.\s\d*[^\s])|(\d([^\s])*)"
I am using to the below script to create a new column with all the numeric information extracted from each sentence and combined with '_'
df2['Numeric_Info'] = df2['Text'].apply(lambda x: '_'.join([i[0] for i in re.findall(regex, str(x))]))
Can someone tell me what I am doing wrong?
Edit:
Specifying rules to the pattern to solidify the regex:
short A20-6014 complete
, A20-6014
should be extracted)A20-6014 short complete
, A20-6014
should be extracted)short complete A20-6014
, A20-6014
should be extracted)ELBOW- 2.5 IN 90 SIL
, 2.5 IN 90
should be extracted)_
. This also applies for alphanumeric words occurring at two different places in a sentence (ie. for sentence ELBOW- 2.5 IND 90 SIL
, 2.5_90
should be extracted. )90 CONN 3/4 ST-#8 FL
, 90_3/4_ST-#8
should be extractedYou can use
regex = r'\d+(?:\.\d+)?"?(?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)?(?:\s*[A-Z]+\b)?'
df2['Numeric_Info'] = df2['Text'].str.findall(regex).str.join('_')
See the regex demo.
Details:
\d+(?:\.\d+)?
- an int or float value"?
- an optional "
char(?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)?
- an optional sequence of
\s*[Xx/.]\s*
- X
, x
, /
or .
enclosed with zero or more whitespaces\d+(?:\.\d+)?
- an int or float value"?
- an optional "
char(?:\s*[A-Z]+\b)?
- an optional sequence of
\s*
- zero or more whitespaces[A-Z]+
- one or more ASCII uppercase letters\b
- a word boundary.Here is a Pandas test:
import pandas as pd
df = pd.DataFrame({'Text':['No data here','Dimensions: 23"/60', 'Dimensions: 23" / 60', 'Dimensions: 48"', 'Dimensions: 22.5X8.25', 'Dimensions: 80IN', 'Dimensions: 567 S','Dimensions: 22.5X8.25', 'Dimensions: 26INNP','Dimensions: 24" x 55" with pipe 16 x 7']})
regex = r'\d+(?:\.\d+)?"?(?:\s*[Xx/.]\s*\d+(?:\.\d+)?"?)?(?:\s*[A-Z]+\b)?'
df['Numeric_Info'] = df['Text'].str.findall(regex).str.join('_')
Output:
>>> df
Text Numeric_Info
0 No data here
1 Dimensions: 23"/60 23"/60
2 Dimensions: 23" / 60 23" / 60
3 Dimensions: 48" 48"
4 Dimensions: 22.5X8.25 22.5X8.25
5 Dimensions: 80IN 80IN
6 Dimensions: 567 S 567 S
7 Dimensions: 22.5X8.25 22.5X8.25
8 Dimensions: 26INNP 26INNP
9 Dimensions: 24" x 55" with pipe 16 x 7 24" x 55"_16 x 7