Search code examples
rdata-cleaning

How to look back a few rows higher to see if a condition is satisfied and then replace a cell accordingly


I have a dataframe whereby if a condition is satisfied, it looks a couple rows above to find a specific row with another string and replaces itself with that entire cell which had the string.

Here is an Example:

enter image description here

If "test" is detected in the type column, it looks up to 3 rows above for a row which contains the string "Subject". If that condition is satisfied, it then changes it's cell name to the entire cell it found.

The data is :

Type    Score
Ignore  Ignore
Subject_English NA
test    58
test    6
test    87
Subject_science NA
test    23
test    47
test    89
test    45
test    23
ignore  ignore
ignore  ignore
ignore  ignore
test    54
Subject_Math    NA
Ignore  ignore
Ignore  ignore
test    88

Lastly, if multiple rows satisfy the initial condition it takes the latest. Secondly, once a cell has been changed, that cell can be used to satisfy the first condition for another cell. Thirdly, there might be a few rows in between the "test" row and the row Im looking to satisfy the condition I need to ignore. Further notes: The number of tests can vary, it could be 3 or 10. You can have ignores between the subjects. If possible, I'd like to add a threshold that if there wasn't a subject in the last 10 rows, then leave it as test


Solution

  • How to look back a few rows higher to see if a condition is satisfied and then replace a cell accordingly

    Trying to "look back" is certainly possible, but it adds a lot of complexity. Instead, iterate over the rows from the top down. Whenever you come across a cell whose value starts with "Subject", set a variable current_subject to that value. Whenever you come across a cell whose value is "test", set the value of that cell to the value in current_subject.

    I'd like to add a threshold that if there wasn't a subject in the last 10 rows, then leave it as test

    OK, so create another variable called nonsubject_rows. Set it to 0 whenever you come across a subject row, and perhaps also whenever you set a test row to the current subject; otherwise, increment it for each row you visit. Before you set the value of a test row, check to see whether nonsubject_rows is greater than the threshold; if it is, don't change the row.

    In short, you can keep track of the information you need in order to process each row as you go. That will be a lot simpler and faster than searching for the information every time you need it.