Search code examples
sql-serverapache-pig

Filtering columns on PIG LATIN script inserts data from other columns on cells that are null


Okay I have done all of the googling and documentation reading for this and still cannot find a solution.

I am downloading a CSV file export from here: Traffic Crashes Chicago

I wrote a PIG script which will filter out columns I do not want so that the output will only contain the desired columns. The issue I am having is that when I use a (FOREACH GENERATE) PIG will take data from the previous column and insert it into the following field if that field was null.

I do not want to filter out null values since I need every row in this dataset. How can I prevent PIG from inserting incorrect data, or is there a way to insert a " " if PIG encounters a null value instead of it inserting stuff from previous columns?

Just a couple of explanations:

I am creating sort of a schema on the foreach ('alias' as 'alias') because if I don't the output will delete commas on null values and the table will be messed up since I am going to be importing this into SQL Server and I need all the PK and FK to be present in order to match other tables.

I am using DISTINCT in order to keep the output to one file otherwise the output will be spread around to multiple files.

Here is the output I get after running the script:

Output of PIG script

Here is my PIG script:

A = LOAD 'Traffic_Crashes.csv' 
USING PigStorage (',') 
AS (RD_NO: chararray, CRASH_DATE_EST_I: chararray, CRASH_DATE: chararray, POSTED_SPEED_LIMIT: chararray, TRAFFIC_CONTROL_DEVICE: chararray, DEVICE_CONDITION: chararray, WEATHER_CONDITION: chararray, LIGHTING_CONDITION: chararray, FIRST_CRASH_TYPE: chararray, TRAFFICWAY_TYPE: chararray, LANE_CNT: chararray, ALIGNMENT: chararray, ROADWAY_SURFACE_COND: chararray, ROAD_DEFECT: chararray, REPORT_TYPE: chararray, CRASH_TYPE: chararray, INTERSECTION_RELATED_I: chararray, NOT_RIGHT_OF_WAY_I: chararray, HIT_AND_RUN_I: chararray, DAMAGE: chararray, DATE_POLICE_IFIED: chararray, PRIM_CONTRIBUTORY_CAUSE: chararray, SEC_CONTRIBUTORY_CAUSE: chararray, STREET_NO: chararray, STREET_DIRECTION: chararray, STREET_NAME: chararray, BEAT_OF_OCCURRENCE: chararray, PHOTOS_TAKEN_I: chararray, STATEMENTS_TAKEN_I: chararray, DOORING_I: chararray, WORK_ZONE_I: chararray, WORK_ZONE_TYPE: chararray, WORKERS_PRESENT_I: chararray, NUM_UNITS: chararray, MOST_SEVERE_INJURY: chararray, INJURIES_TOTAL: chararray, INJURIES_FATAL: chararray, INJURIES_INCAPACITATING: chararray, INJURIES_NON_INCAPACITATING: chararray, INJURIES_REPORTED__EVIDENT: chararray, INJURIES_NO_INDICATION: chararray, INJURIES_UNKNOWN: chararray, CRASH_HOUR: chararray, CRASH_DAY_OF_WEEK: chararray, CRASH_MONTH: chararray, LATITUDE: chararray, LONGITUDE: chararray, LOCATION: chararray);

B = FOREACH A GENERATE RD_NO as RD_NO, CRASH_DATE as CRASH_DATE, PRIM_CONTRIBUTORY_CAUSE as PRIM_CONTRIBUTORY_CAUSE, MOST_SEVERE_INJURY AS MOST_SEVERE_INJURY;

C = DISTINCT D;

STORE C INTO 'crashFilter_Output/' USING PigStorage (',');

Thank you anyone who reads this, I am super new at this and it is for a school project.


Solution

  • Please try using CSVLoader.

    DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
    A = LOAD 'Traffic_Crashes.csv' USING CSVLoader() AS (RD_NO: char ...
    

    Issue here is PigStorage does not recognize quoted strings. Field like

    "OVER $1,500"
    

    is treated as two fields in PigStorage and thus causing the unexpected shifts in your outputs.