I'm looking at downloading some data from the statistics.gov.scot website. For example, I would like to source some data on the rates of hospital admissions. The query to source the data table I'm interested in is of format:
http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall
and be accessed via this link, for those who want to try. The query generates a *.CSV
file with the relevant information, however, the format of the file poses some challenges.
The file content looks like that:
Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00
http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions
measure type,""
Admission Type,""
Age,""
Gender,""
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"
,,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003,http://reference.data.gov.uk/id/year/2004,http://reference.data.gov.uk/id/year/2005,http://reference.data.gov.uk/id/year/2006,http://reference.data.gov.uk/id/year/2007,http://reference.data.gov.uk/id/year/2008,http://reference.data.gov.uk/id/year/2009,http://reference.data.gov.uk/id/year/2010,http://reference.data.gov.uk/id/year/2011,http://reference.data.gov.uk/id/year/2012
http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262","9,261","9,347","9,723","10,517","10,293","10,150","10,024","10,232","10,194"
when imported to Excel:
However, when imported to R via the read.csv
it looks like that:
> head(problematicFile)
V1 V2
1 Generated by http://statistics.gov.scot 2016-03-15T10:36:29+00:00
2 http://statistics.gov.scot/data/hospital-admissions Hospital Admissions
3 measure type
4 Admission Type
5 Age
6 Gender
The read.csv
import returns only two columns. I'm guessing that the problem relates to some of the initial columns being empty. I want to read this file in a manner similar to the illustrated import achieved in Excel. The point is that, I intend to use vales from the row 7 in columns A and B and, naturally, the data table below. In terms of generating the data.frame
I would be happy to contain NA
values where there are empty cells but to be of the dimensions equivalent to those in Excel. I tried:
read.csv(file = link, header = FALSE, na.strings = "",
fill = TRUE)
but I keep on arriving at the same problem.
The desired results should look like that (extract generated by hand):
Generated by http://statistics.gov.scot 2016-03-15T10:41:28+00:00 NA NA NA NA NA NA NA
http://statistics.gov.scot/data/hospital-admissions Hospital Admissions NA NA NA NA NA NA NA
measure type NA NA NA NA NA NA NA NA
Admission Type NA NA NA NA NA NA NA NA
Age NA NA NA NA NA NA NA NA
Gender NA NA NA NA NA NA NA NA
Measure (cell values): Ratio (Rate Per 100,000 Population) NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA http://reference.data.gov.uk/id/year/2002 http://reference.data.gov.uk/id/year/2003 http://reference.data.gov.uk/id/year/2004 http://reference.data.gov.uk/id/year/2005 http://reference.data.gov.uk/id/year/2006 http://reference.data.gov.uk/id/year/2007 http://reference.data.gov.uk/id/year/2008
http://purl.org/linked-data/sdmx/2009/dimension#refArea Reference Area 2002 2003 2004 2005 2006 2007 2008
http://statistics.gov.scot/id/statistical-geography/S92000003 Scotland 9,351 9,262 9,261 9,347 9,723 10,517 10,293
http://statistics.gov.scot/id/statistical-geography/S16000082 Angus South 8,236 8,500 8,523 8,371 8,616 8,978 9,325
http://statistics.gov.scot/id/statistical-geography/S16000106 Edinburgh Northern and Leith 9,040 8,040 7,925 9,042 10,355 11,833 8,916
http://statistics.gov.scot/id/statistical-geography/S16000140 Renfrewshire South 9,391 9,122 9,491 9,586 10,425 10,900 11,065
http://statistics.gov.scot/id/statistical-geography/S16000108 Edinburgh Southern 5,878 5,910 6,101 6,035 7,426 9,343 6,766
http://statistics.gov.scot/id/statistical-geography/S16000075 Aberdeen Donside 10,047 10,963 10,629 10,512 10,383 10,787 10,685
http://statistics.gov.scot/id/statistical-geography/S16000137 Perthshire North 9,388 9,524 7,799 9,350 9,543 9,791 9,991
http://statistics.gov.scot/id/statistical-geography/S16000077 Aberdeenshire East 7,211 7,300 7,153 7,411 7,435 7,268 7,547
http://statistics.gov.scot/id/statistical-geography/S16000114 Galloway and West Dumfries 9,861 9,165 8,143 9,258 7,508 10,213 10,399
http://statistics.gov.scot/id/statistical-geography/S16000096 Dumbarton 8,703 8,570 8,727 9,310 9,389 9,885 10,237
Just to illustrate further, I want to maintain the dimensions and populate missing values with NA
s:
Parsing the metadata from the headers is a bit tricky. You might prefer to download the whole normalised dataset instead of that cross-tabulated slice.
> reconv <- read.csv("http://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions")
> head(reconv)
GeographyCode DateCode Measurement Units Value Gender Age
1 S92000003 2003 Mean Average reconvictions per offender 0.62 All All
2 S92000003 2004 Mean Average reconvictions per offender 0.33 All All
3 S92000003 2004 Mean Average reconvictions per offender 0.61 All All
4 S92000003 2005 Mean Average reconvictions per offender 0.60 All All
5 S92000003 2006 Mean Average reconvictions per offender 0.60 All All
6 S92000003 2007 Mean Average reconvictions per offender 0.11 All All
This will put all of the metadata in factor levels (so you don't have to parse it):
> str(reconv)
'data.frame': 10119 obs. of 7 variables:
$ GeographyCode: Factor w/ 26 levels "S12000005","S12000006",..: 26 26 26 26 26 26 26 26 26 26 ...
$ DateCode : int 2003 2004 2004 2005 2006 2007 2007 2008 2008 2009 ...
$ Measurement : Factor w/ 2 levels "Mean","Ratio": 1 1 1 1 1 1 1 1 1 1 ...
$ Units : Factor w/ 2 levels "Average reconvictions per offender",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Value : num 0.62 0.33 0.61 0.6 0.6 0.11 0.57 0.6 0.33 0.33 ...
$ Gender : Factor w/ 3 levels "All","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Age : Factor w/ 6 levels "21-25","26-30",..: 4 4 4 4 4 4 4 4 4 4 ...
You can select the slice you're interested in:
> slice <- subset(reconv, Measurement=="Ratio" & Gender=="All" & Age=="All")
And get back to the original cross-tabulated slice if you want:
> library(reshape2)
> dcast(slice, GeographyCode ~ DateCode, value.var="Value", fun.aggregate = first)
GeographyCode 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
1 S12000005 41.4 34.3 41.0 40.7 37.4 37.2 33.3 34.6 35.8 33.0 32.8
2 S12000006 34.9 36.0 31.9 34.2 31.1 28.7 27.9 29.6 27.5 26.8 27.0
3 S12000008 33.7 33.2 33.7 33.2 31.7 32.8 30.4 31.5 29.1 28.1 28.7
4 S12000010 26.7 24.5 25.7 26.9 26.7 27.8 29.3 25.1 22.4 29.0 28.2
5 S12000013 31.7 26.1 30.6 35.4 31.6 25.9 24.0 18.9 30.5 22.8 18.6
...