Search code examples
pythonpython-3.xpandaspython-3.7

Python getting value from dataframe using pandas by specific condition & no duplicates


Given the following dataframe, I am trying to use pandas to extract the column "Total" only where "Year" is 2015 - 2020.

    Year  Yield     Total  ...      ExDate     PayDate                  
0   2020  3.09%  0.66  ...  2020-05-12  2020-05-26  
1   2020  3.09%  0.66  ...  2020-05-12  2020-05-26  
2   2019  7.02%   1.5  ...  2019-11-18  2019-11-29   
3   2019  7.02%   1.5  ...  2019-08-05  2019-08-20   
4   2019  7.02%   1.5  ...  2019-05-17  2019-05-31   

The code that I am using works if the data for 2020 exists:

counter = 5
for index, row in df.iterrows():
    if counter == 0:
        break
    if row['Year'] == end + counter:
        print(row['Total'])
        counter -= 1

But it has no output if the data for 2020 does not exist like so:

   Year   Yield  ...     PayDate                   
0  2019   2.00%  ...  2019-05-31  
1  2018   7.87%  ...  2018-09-28     
2  2018   7.87%  ...  2018-06-29  
3  2017  12.27%  ...  2017-09-29       
4  2017  12.27%  ...  2017-05-31     
5  2016   4.67%  ...  2016-09-30

Question:

How can I make sure that the code continues to run giving the output for 2019 to 2015 with 2020 being shown as 'unavailable'?

Expected output (one value from each year - no duplicate):

0.66
1.5
.
.
.

Solution

  • Edit: after you comment I understand you want only one from each year. So fixed line:

    print(df.loc[(df["Year"]>=2015) & (df["Year"]<=2020) ,:].drop_duplicates(subset="Year", keep="First")["Total"].values.tolist())
    

    Instead of filter Total column directly, I first use drop_duplicates to remove all duplicated year. Then with the result I filter only Total column.

    Original Answer:

    When working with pandas, usually only in rare cases you will need to iterate through dataframe with for loop directly. Pandas gives you many option to perform actions on whole dataframe, with no need to iterate through it.

    In your case you want to extract data based on some condition. You can achieve this using pandas.loc:

    print(df.loc[(df["Year"]>=2015) & (df["Year"]<=2020) , "Total"])
    

    loc allows you to select only specific data based on condition. Format is .loc[rows, columns]. So for rows I used a complex condition to make sure Year is between 2015 and 2020. For columns you want to filter only "Total".

    It does not matter if 2020 exist or not that way. It just filters all rows with year 2015-2020 and look on "Total" column in it.

    Also, I don't really understand your code. You decrease 1 from counter each time your encounter one row with year as desired. Why not simply as for a range inside a year?

    for index, row in df.iterrows():
        if row['Year'] >=2015 and row['Year'] <= 2020:
            print(row["Total"])
    

    If I misunderstood your intention (desired output), please edit your post to clarift and let me know