
Modify sign of Pandas dataframe's column based off another column's mask?


I'm working with imported lat/lon data from NASA's fireball data API (https://cneos.jpl.nasa.gov/fireballs/):

  • lat/lon data only have positive values
  • their directions (N/S and E/W) are in separate columns called lat-dir/lon-dir
  • the dataframe is as shown below

Now I want to:

  • convert any lat values to negative (multiply by -1) if "lat-dir" == 'S'
  • convert lon values to negative if "lon-dir" == 'W'

Below is roughly how I created my dataframe:

import requests 
import pandas as pd

response = requests.get('https://ssd-api.jpl.nasa.gov/fireball.api')
j = response.json()
df = pd.DataFrame.from_dict(j[u'data'])

print( j[u'fields'] )

[u'date', u'energy', u'impact-e', u'lat', u'lat-dir', u'lon', u'lon-dir', u'alt', u'vel']

print( df.head() )

0    1      2     3     4     5     6     7     8
0  2019-12-06 10:19:57  4.6   0.15   3.3     S  37.7     W  19.5  None
1  2019-12-03 06:46:27  4.2   0.14   5.6     N  52.2     W  61.5  None
2  2019-11-28 20:30:54  2.7  0.095  35.7     N  31.7     W    35  13.0
3  2019-11-28 13:22:10  2.6  0.092  None  None  None  None  None  None
4  2019-11-28 11:55:02  2.5  0.089  22.1     S  25.7     E  22.5  24.7

Lines of code I've attempted:

Attempted to use df.apply() - though through my searching, I don't think you can easily reference two columns in this manner...

    df['lat'] = df['lat'].apply(lambda x: x * -1 if (df['lat-dir'][x] == 'S'))

    for i, row in df.iterrows():
        if (row['lat-dir'] == 'S'):
            df['lat'][i].apply(lambda x: x*-1)

For this, I get 'numpy.float64' object has no attribute 'apply' ?


Attempted to use masking:

    if( df['lon-dir'] == 'W'):
         df['lon'] * -1

But frankly, I'm stumped on what to do next regarding applying the mask.

EDIT:

dfDate['lat'] = dfDate['lat'].apply(lambda row: row['lon'] * -1 , axis = 1 )

Attempted this as well per comments.


Solution

  • Yes, by either of the following:

    A) using a vectorized mask. A Series comparison is already vectorized: dfDate['lon-dir'].eq('W') (the method form of == 'W') gives a boolean Series marking the rows whose 'lon' should be negated. You can't use that Series in a plain if statement (that's what tripped up your masking attempt), but you can use it to negate the 'lon' column on exactly those rows - see the sketch after this list.

    B) using apply() row-wise: dfDate['lon'] = dfDate.apply(lambda row: ..., axis=1), with the lambda selectively negating row['lon'] based on row['lon-dir']. The reason your apply calls failed: axis=1 only exists on DataFrame.apply(), so you have to call it on the whole dataframe to get row access - and df['lat'][i] is a single numpy.float64 scalar, which is why it has no .apply at all.

    lat-dir/lon-dir are essentially sign columns; you could convert them to +1/-1 when you read them in.
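    Here's a minimal sketch of option A, assuming the lat/lon columns have already been converted to float and the dataframe has its real column names (both covered below):

        # boolean masks: True on rows whose sign should flip
        south = df['lat-dir'].eq('S')
        west = df['lon-dir'].eq('W')

        # negate just those rows
        df.loc[south, 'lat'] = -df.loc[south, 'lat']
        df.loc[west, 'lon'] = -df.loc[west, 'lon']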

    Code:

    First some issues with your code that you'll want to fix:

    1. Don't use the u'...' notation. Assuming you're using Python 3.x, you don't need u'...'; text is unicode by default in 3.x. And if you're not on Python 3.x, you really should switch over now, as 2.x is being sunsetted Jan 1, 2020.
    2. Pass the JSON column names onto the dataframe to make your life easy:
      • df.columns = j['fields']
    3. Reading in the JSON by passing response.json() into pd.DataFrame.from_dict() is a pain: your dataframe columns all come in as string/'object' instead of the numeric ones being converted to float. Ideally we'd use pandas.read_json(..., dtype=...) for this and other convenience reasons.
    4. You're going to want to convert the dtypes (e.g. string -> float) on the numeric columns, which also automatically converts Python None -> pandas/numpy nan (so the vectorized code we're about to write handles nan gracefully instead of constantly throwing TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'). You can do this with astype(...), pd.to_numeric(), or df.fillna(value=np.nan, inplace=True).
    5. Really, those nan entries are going to keep being a pain for multiple reasons (e.g. integers keep getting coerced back to float), so you'll probably want to drop or at least temporarily ignore the nan rows:
      • df2 = df.dropna(how='any', inplace=False) # probably not with ..., inplace=True. Note that this preserves the row indices, so you can always insert the result of processing df2 back into df at the end. Read the dropna doc and figure out at what exact point you want to drop the nan's.
      • Note that the 'vel' column has additional nans which we want to ignore; you'll need to figure that out, or for now sidestep them: e.g. df2 = df[['date','energy','impact-e','lat','lat-dir','lon','lon-dir']].dropna(how='any', inplace=False). (Steps 2-5 are pulled together in the sketch below.)
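    As a rough end-to-end sketch of steps 2-5 (the numeric column list and the name df2 are just illustrative; adjust to taste):

        import pandas as pd
        import requests

        response = requests.get('https://ssd-api.jpl.nasa.gov/fireball.api')
        j = response.json()

        # step 2: build the dataframe with its real column names
        df = pd.DataFrame(j['data'], columns=j['fields'])

        # step 4: convert numeric columns; errors='coerce' turns None into nan
        for col in ['energy', 'impact-e', 'lat', 'lon', 'alt', 'vel']:
            df[col] = pd.to_numeric(df[col], errors='coerce')

        # step 5: keep only the rows with complete lat/lon information
        df2 = df.dropna(subset=['lat', 'lat-dir', 'lon', 'lon-dir'], how='any')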

    Solution

    A) Several ways to translate the lat/lon-dir columns to +/-1 signs and fold them in:

      A1) if you want the 'correct', nan-aware way which doesn't choke on nans, map the direction to a sign and multiply it in:

        df2['lat'] *= df2['lat-dir'].map({'N': +1, 'S': -1})
        df2['lon'] *= df2['lon-dir'].map({'E': +1, 'W': -1})

      A2) ...or a fast-and-dirty way ((-1) ** False == +1, (-1) ** True == -1):

        df2['lat'] *= (-1) ** df2['lat-dir'].eq('S')
        df2['lon'] *= (-1) ** df2['lon-dir'].eq('W')
    

    B) But you can do this all in one row-wise apply() function:

    def fixup_latlon_signs(row):
        # (-1) ** False == +1, (-1) ** True == -1
        row['lat'] = row['lat'] * (-1) ** (row['lat-dir'] == 'S')
        row['lon'] = row['lon'] * (-1) ** (row['lon-dir'] == 'W')
        return row

    # apply() returns a new dataframe, so assign the result back
    df2 = df2.apply(fixup_latlon_signs, axis=1)

    # Then insert the non-NA rows we processed back into the parent dataframe:
    df.update(df2)

    # Strictly we can drop 'lat-dir'/'lon-dir' now...
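    # e.g. (hypothetical final step - the signs are now folded into lat/lon,
    # so the direction columns are redundant):
    df = df.drop(columns=['lat-dir', 'lon-dir'])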