How to manually control data schema interpretation

When I export public weather data from https://www1.ncdc.noaa.gov/pub/data/uscrn/products/subhourly01/2017/CRNS0101-05-2017-TX_Austin_33_NW.txt, as soon as solar radiation > 9, all of my data for the remaining columns gets lumped into a single column, as shown below. I have tried uploading as txt and csv and the problem still exists in excel, sheets, and dataprep.

Why is this happening?

Is there a programmatic way to fix this so that the data populates as intended, with 1 value per column?

Solution

It is likely because the initial data structure is not detected correctly. This can happen if the first rows of your dataset have a different structure than the remaining rows.

To solve this problem in Dataprep, you can indicate how the dataset should be structured by following these steps:

Go to the flow view
Right click on the dataset and choose "remove structure..."
Open the recipe
Insert a split row step:
- splitrows col: column1 on: '\n'
Split the column using a whitespace regex (for e.g., /\s+/)
- splitpatterns col: column1 type: on on: /\s+/ limit: 22

(you can copy and paste the following command inside the search input when you create a new step)

Here is what you should get:

Note: it is also possible to prevent the initial structure detection when importing a dataset. See https://cloud.google.com/dataprep/docs/html/Remove-Initial-Structure_136154971