python pandas group-by data-science jupyter

How to count text event type and transform it into country-year data using pandas?


I am trying to convert a dataframe in which each row is a specific event and each column holds information about that event. I want to turn this into data in which each row is a country and year, with information about the number and characteristics of the events in that year. In this data set, each event is an occurrence of terrorism, and I want to count the number of events where the "target" is a government building. One of the columns is called "targettype" or "targettype_txt", and there are 5 different entries in this column that I want to count (government building, military, police, diplomatic building, etc.). The targettype is also coded as a number if that is easier (i.e. there is another column where gov't building is 2, military installation is 4, etc.).

FYI This data set has 16 countries in West Africa and is looking at years 2000-2020 with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).

Right now my data looks like this (there are a ton of other columns but they aren't important for this):

eventID iyear country_txt nkill nwounded nhostages targettype_txt
10000102 2000 Nigeria 3 10 0 government building
10000103 2000 Mali 1 3 15 military installation
10000103 2000 Nigeria 15 0 0 government building
10000103 2001 Benin 1 0 0 police
10000103 2001 Nigeria 1 3 15 private business

. . .

And I would like it to look like this:

country_txt iyear total_nkill total_nwounded total_nhostages total_public_target
Nigeria 2000 200 300 300 15
Nigeria 2001 250 450 15 17

I was able to get the totals for nkill, nwounded, and nhostages using this super simple line:

df2 = cdf.groupby(['country','country_txt', 'iyear'])[['nkill', 'nwound','nhostkid']].sum()

But this is a little different because I want to only count certain entries and sum up the total number of times they occur. Any thoughts or suggestions are really appreciated!


Solution

  • Try:

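    # Flag events whose target is one of the types of interest.
    # The outer parentheses let the expression span several lines.
    cdf['CountCondition'] = ((cdf['targettype_txt']=='government building') |
        (cdf['targettype_txt']=='military installation') |
        (cdf['targettype_txt']=='police'))
    # Keep only the flagged rows, then count them per country and year.
    df2 = cdf[cdf['CountCondition']].groupby(['country','country_txt', 'iyear', 'CountCondition']).count()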
    

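    This creates a new boolean column, 'CountCondition', which is True when the condition holds and False otherwise. Filtering on that column and calling count() then gives the number of matching events for each country and year. Note that count() reports the number of non-null values in every remaining column, so any column with no missing values can be read as the event count. Hope this makes sense.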

    It is possible to combine all of this into one statement without creating an additional column, but the statement becomes quite convoluted and harder to follow:

    df2 = cdf[(cdf['targettype_txt']=='government building') | 
        (cdf['targettype_txt']=='military installation') | 
        (cdf['targettype_txt']=='police')].groupby(['country','country_txt', 'iyear']).count()
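
As a possible alternative, the casualty totals and the target count can be produced in a single table shaped like the one the question asks for. The sketch below is only an illustration under some assumptions: the toy data copies the sample rows from the question and uses the column names shown there (nwounded, nhostages), the list `public_targets` is a guess at which categories should be counted, and the output names such as `total_public_target` are chosen here for illustration. It relies on `Series.isin` to build the flag and on the fact that summing a boolean column counts its True values.

```python
import pandas as pd

# Toy data mirroring the sample rows in the question (illustrative only).
cdf = pd.DataFrame({
    "country_txt": ["Nigeria", "Mali", "Nigeria", "Benin", "Nigeria"],
    "iyear": [2000, 2000, 2000, 2001, 2001],
    "nkill": [3, 1, 15, 1, 1],
    "nwounded": [10, 3, 0, 0, 3],
    "nhostages": [0, 15, 0, 0, 15],
    "targettype_txt": [
        "government building", "military installation",
        "government building", "police", "private business",
    ],
})

# Assumed set of categories to count; extend it with the other target types.
public_targets = ["government building", "military installation", "police"]

# Boolean flag: True where the event hit one of the listed target types.
cdf["public_target"] = cdf["targettype_txt"].isin(public_targets)

# Summing a boolean column counts its True values, so one groupby yields
# both the casualty totals and the number of flagged events.
country_year = (
    cdf.groupby(["country_txt", "iyear"], as_index=False)
       .agg(
           total_nkill=("nkill", "sum"),
           total_nwounded=("nwounded", "sum"),
           total_nhostages=("nhostages", "sum"),
           total_public_target=("public_target", "sum"),
       )
)
print(country_year)
```

Unlike filtering before grouping, this keeps country-years that had events but no flagged targets, with a count of 0. Country-years with no events at all still do not appear in the result.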