Search code examples
excelpandascsvgoogle-cloud-dataprep

How to manually control data schema interpretation


When I export public weather data from https://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2017/CRNS0101-05-2017-TX_Austin_33_NW.txt, as soon as solar radiation > 9, all of my data for the remaining columns gets lumped into a single column, as shown below. I have tried uploading as txt and csv and the problem still exists in excel, sheets, and dataprep.

Why is this happening?

Is there a programmatic way to fix this so that the data populates as intended, with 1 value per column?

CSV columns


Solution

  • It is likely because the initial data structure is not detected correctly. This can happen if the first rows of your dataset have a different structure than the remaining rows.

    To solve this problem in Dataprep, you can indicate how the dataset should be structured by following these steps:

    1. Go to the flow view
    2. Right click on the dataset and choose "remove structure..." remove structure...
    3. Open the recipe
    4. Insert a split row step:
      • splitrows col: column1 on: '\n'
    5. Split the column using a whitespace regex (for e.g., /\s+/)
      • splitpatterns col: column1 type: on on: /\s+/ limit: 22

    (you can copy and paste the following command inside the search input when you create a new step)

    Here is what you should get: recipe

    Note: it is also possible to prevent the initial structure detection when importing a dataset. See https://cloud.google.com/dataprep/docs/html/Remove-Initial-Structure_136154971

    detect initial structure - unchecked