I'm using the titanic csv and i've been trying to substitute the elements of the 5th and 12th column, sex and embarked, so the elements of sex instead of being male/female should be m/f and in the 12th column instead of being the first letter of the port have to be the full name of the port.
The csv originally looks like this:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,Nan,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.28,C85,C
...
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39,0,5,382652,29.12,Nan,Q
890,1,1,"Behr, Mr. Karl Howell",male,26,0,0,111369,30.00,C148,C
891,0,3,"Dooley, Mr. Patrick",male,32,0,0,370376,7.75,Nan,Q
And it should look like this after the modifications:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",m,22,1,0,A/5 21171,7.25,Nan,Southampton
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",f,38,1,0,PC 17599,71.28,C85,Cherbourg
...
886,0,3,"Rice, Mrs. William (Margaret Norton)",f,39,0,5,382652,29.12,Nan,Queenstown
890,1,1,"Behr, Mr. Karl Howell",m,26,0,0,111369,30.00,C148,Cherbourg
891,0,3,"Dooley, Mr. Patrick",m,32,0,0,370376,7.75,Nan,Queenstown
But it doesn't substitute the elements of the 12th column except in the last row, the column sex is substituted correctly:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",m,22,1,0,A/5 21171,7.25,Nan,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",f,38,1,0,PC 17599,71.28,C85,C
...
886,0,3,"Rice, Mrs. William (Margaret Norton)",f,39,0,5,382652,29.12,Nan,Q
890,1,1,"Behr, Mr. Karl Howell",m,26,0,0,111369,30.00,C148,C
891,0,3,"Dooley, Mr. Patrick",m,32,0,0,370376,7.75,Nan,Queenstown
The script is the following:
BEGIN {
FPAT = "([^,]*)|(\"[^\"]+\")"
OFS = ","
}
{
# Cambiar el valor de la columna sexo a 0 si es "female" o a 1 si es "male"
if ($5 == "female")
$5 = "f"
else if ($5 == "male")
$5 = "m"
# Realizar la sustitución en la columna embarked
if ($12 == "C")
$12 = "Cherbourg"
else if ($12 == "Q")
$12 = "Queenstown"
else if ($12 == "S")
$12 = "Southampton"
print $0
}
To clarify, none of the elements of the 12th row has spaces or characters that would make the match fail, in python the substitution works fine.
Since a change in the middle of the row works, but not at the end, I suspect line-endings are your problem. Trailing \r
would explain your symptoms.
A more robust approach to manipulate CSV is to use a tool that already has a full CSV parser built into it.
Python has CSV support or, for example, sqlite3
is widely available:
#!/bin/sh
sqlite3 >"new.csv" <<'EOD'
.mode csv
.headers on
.import "orig.csv" t
update t set
sex = case
when sex="female" then "f"
when sex="male" then "m"
else sex
end,
embarked = case
when embarked="C" then "Cherbourg"
when embarked="Q" then "Queenstown"
when embarked="S" then "Southampton"
else embarked
end
;
select * from t;
EOD