Search code examples
rdataframenareadxl

Importing data in R with read_excel and missing values


I am currently learning how to import excel into a data frame. I have researched and I am unable to find the answer so far.

There are rows of data representing each month dating back to 2013, where there are about 150 rows per month (by locations). Each location, each month has sets of data such as the number of people, plus some other relevant numerical and character data. As the file was developed and methodologies changed, some columns have blanks, but the remainder of the column is numeric. For example, Number_of_PPL was only recorded from 2019 onwards, so the previous values are blank.

I know I can change the blanks to zeros, however I don't want this to be confused with 0 People vs no data.

When importing using the following:

df <- data.frame(read_excel("UPLOAD_DATA.xlsx", sheet="ALL DATA",na=""))

str(df) gives:

$ Number_of_PPL : logi NA NA NA NA NA NA ...

The columns with blanks are being treated as logicals rather than numerics. I tried converting to numeric however that gave me the 0,1 or NA values, not the true values for the data. I do not want to omit the rows with NA because they contain other relevant data for that location.

Is there a way to have these columns import as numeric so I can still represent them in charts and make calculations later on without inserting zeros?

EDIT / UPDATE Here is an example of the data uploaded from excel: Initial Data rows with blanks

transition from blanks to recorded data

The transition to numerical data vs blanks occurs at row 12,520 in the Number_of_PPL column. So far when using Read_xl I have found that it will guess this is logical data. If I enter numbers in a earlier row, ie. Row1, read_xl will treat the column as numerical on its own.

The suggestion of col_names works perfectly to force it to be numerical, however in terms of efficiency and for future data sets, I would prefer to find out if there is a way to have read_xl look farther down in the data before assuming the data is logical?

Clearly it is capable of taking blank rows, assigning NA and still calling it numerical but it seems to have some relation to where the numbers occur with in the data column, and in this case they seem to be to far "down"


Solution

  • Increasing the value of guess_max has solved this issue.

    df = data.frame(read_excel("UPLOAD_DATA.xlsx", sheet="ALL DATA",na="", guess_max=Inf))

    credit to @benbolker