I am trying to convert a dataframe where each row is a specific event, and each column has information about the event. I want to turn this into data in which each row is a country and year with information about the number and characteristics about the events in the given year.In this data set, each event is an occurrence of terrorism, and I want to count the number of events where the "target" is a government building. One of the columns is called "targettype" or "targettype_txt" and there are 5 different entries in this column I want to count (government building, military, police, diplomatic building etc). The targettype is also coded as a number if that is easier (i.e. there is another column where gov't building is 2, military installation is 4 etc..)
FYI This data set has 16 countries in West Africa and is looking at years 2000-2020 with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).
Right now my data looks like this (there are a ton of other columns but they aren't important for this):
eventID | iyear | country_txt | nkill | nwounded | nhostages | targettype_txt |
---|---|---|---|---|---|---|
10000102 | 2000 | Nigeria | 3 | 10 | 0 | government building |
10000103 | 2000 | Mali | 1 | 3 | 15 | military installation |
10000103 | 2000 | Nigeria | 15 | 0 | 0 | government building |
10000103 | 2001 | Benin | 1 | 0 | 0 | police |
10000103 | 2001 | Nigeria | 1 | 3 | 15 | private business |
. . .
And I would like it to look like this:
country_txt | iyear | total_nkill | total_nwounded | total_nhostages | total public_target |
---|---|---|---|---|---|
Nigeria | 2000 | 200 | 300 | 300 | 15 |
Nigeria | 2001 | 250 | 450 | 15 | 17 |
I was able to get the total number for nkill,nwounded, and nhostages using this super simple line:
df2 = cdf.groupby(['country','country_txt', 'iyear'])['nkill', 'nwound','nhostkid'].sum()
But this is a little different because I want to only count certain entries and sum up the total number of times they occur. Any thoughts or suggestions are really appreciated!
Try:
cdf['CountCondition'] = (cdf['targettype_txt']=='government building') |
(cdf['targettype_txt']=='military installation') |
(cdf['targettype_txt']=='police')
df2 = cdf[cdf['CountCondition']].groupby(['country','country_txt', 'iyear', 'CountCondition']).count()
You create a new column 'CountCondition' which just marks as true or false if the condition in the statement holds. Then you just count the number of times the CountCondition is True. Hope this makes sense.
It is possible to combine all this into one statement and NOT create an additional column but the statement gets quite convaluted and more difficult to understand how it works:
df2 = cdf[(cdf['targettype_txt']=='government building') |
(cdf['targettype_txt']=='military installation') |
(cdf['targettype_txt']=='police')].groupby(['country','country_txt', 'iyear']).count()