Search code examples
rdata-analysis

Linear Regression Analysis of population data with R


I have a homework assignment where I need to take a CSV file based around population data around the United States and do some data analysis on the data inside. I need to find the data that exists for my state and for starters run a Linear Regression Analysis to predict the size of the population.

I've been studying R for a few weeks now, went through a LinkedIn Learning training, as well as 2 different trainings on pluralsight about R. I have also tried searching for how to do a Linear Regression Analysis in R and I find plenty of examples for how to do it when the data is perfectly laid out in a table in just the right way to Analyze.

The CSV file is laid out so that each state is defined on a single line/row so I used the filter function to grab just the data for my State and put it into a variable.

Within that dataset the population data is defined across several columns with the most important data being the Population Estimates for each year from 2010 to 2018.

library(tidyverse)
population.data <- read_csv("nst-est2018-alldata.csv")
mn.state.data <- filter(population.data, NAME == "Minnesota")

I'm looking for some help to get headed in the right direction my thought is that I will need to create to containers of data 1 having each year from 2010 to 2018 and one that contains the population data for each of those years. And then use the xyplot function with those two containers? If you have some experience in this area please help me think this through I'm not looking for anybody to do the assignment for me just want some help trying to think it through.

Edit: Here is the results of the

dput(head(population.data))

command:

structure(list(SUMLEV = c("010", "020", "020", "020", "020", 
"040"), REGION = c("0", "1", "2", "3", "4", "3"), DIVISION = c("0", 
"0", "0", "0", "0", "6"), STATE = c("00", "00", "00", "00", "00", 
"01"), NAME = c("United States", "Northeast Region", "Midwest Region", 
"South Region", "West Region", "Alabama"), CENSUS2010POP = c(308745538L, 
55317240L, 66927001L, 114555744L, 71945553L, 4779736L), ESTIMATESBASE2010 
= c(308758105L, 
55318430L, 66929743L, 114563045L, 71946887L, 4780138L), POPESTIMATE2010 =
c(309326085L, 
55380645L, 66974749L, 114867066L, 72103625L, 4785448L), POPESTIMATE2011 = 
c(311580009L, 
55600532L, 67152631L, 116039399L, 72787447L, 4798834L), POPESTIMATE2012 =
c(313874218L, 
55776729L, 67336937L, 117271075L, 73489477L, 4815564L), POPESTIMATE2013 = 
c(316057727L, 
55907823L, 67564135L, 118393244L, 74192525L, 4830460L), POPESTIMATE2014 = 
c(318386421L, 
56015864L, 67752238L, 119657737L, 74960582L, 4842481L), POPESTIMATE2015 = 
c(320742673L, 
56047587L, 67869139L, 121037542L, 75788405L, 4853160L), POPESTIMATE2016 = 
c(323071342L, 
56058789L, 67996917L, 122401186L, 76614450L, 4864745L), POPESTIMATE2017 = 
c(325147121L, 
56072676L, 68156035L, 123598424L, 77319986L, 4875120L), POPESTIMATE2018 = 
c(327167434L, 
56111079L, 68308744L, 124753948L, 77993663L, 4887871L), NPOPCHG_2010 = 
c(567980L, 
62215L, 45006L, 304021L, 156738L, 5310L), NPOPCHG_2011 = c(2253924L, 
219887L, 177882L, 1172333L, 683822L, 13386L), NPOPCHG_2012 = c(2294209L, 
176197L, 184306L, 1231676L, 702030L, 16730L), NPOPCHG_2013 = c(2183509L, 
131094L, 227198L, 1122169L, 703048L, 14896L), NPOPCHG_2014 = c(2328694L, 
108041L, 188103L, 1264493L, 768057L, 12021L), NPOPCHG_2015 = c(2356252L, 
31723L, 116901L, 1379805L, 827823L, 10679L), NPOPCHG_2016 = c(2328669L, 
11202L, 127778L, 1363644L, 826045L, 11585L), NPOPCHG_2017 = c(2075779L, 
13887L, 159118L, 1197238L, 705536L, 10375L), NPOPCHG_2018 = c(2020313L, 
38403L, 152709L, 1155524L, 673677L, 12751L), BIRTHS2010 = c(987836L, 
163454L, 212614L, 368752L, 243016L, 14227L), BIRTHS2011 = c(3973485L, 
646265L, 834909L, 1509597L, 982714L, 59689L), BIRTHS2012 = c(3936976L, 
637904L, 830701L, 1504936L, 963435L, 59070L), BIRTHS2013 = c(3940576L, 
635741L, 830869L, 1504799L, 969167L, 57936L), BIRTHS2014 = c(3963195L, 
632433L, 836505L, 1525280L, 968977L, 58907L), BIRTHS2015 = c(3992376L, 
634515L, 837968L, 1545722L, 974171L, 59637L), BIRTHS2016 = c(3962654L, 
628039L, 831667L, 1541342L, 961606L, 59388L), BIRTHS2017 = c(3901982L, 
616552L, 816177L, 1519944L, 949309L, 58259L), BIRTHS2018 = c(3855500L, 
609336L, 804431L, 1499838L, 941895L, 57216L), DEATHS2010 = c(598691L, 
110848L, 140785L, 228706L, 118352L, 11073L), DEATHS2011 = c(2512442L, 
470816L, 586840L, 962751L, 492035L, 48818L), DEATHS2012 = c(2501531L, 
460985L, 584817L, 960575L, 495154L, 48364L), DEATHS2013 = c(2608019L, 
480032L, 605188L, 1011093L, 511706L, 50847L), DEATHS2014 = c(2582448L, 
470196L, 597078L, 1006057L, 509117L, 49692L), DEATHS2015 = c(2699826L, 
488881L, 626494L, 1052360L, 532091L, 51820L), DEATHS2016 = c(2703215L, 
480331L, 619471L, 1058173L, 545240L, 51662L), DEATHS2017 = c(2779436L, 
501022L, 620556L, 1092949L, 564909L, 53033L), DEATHS2018 = c(2814013L, 
506909L, 621030L, 1109152L, 576922L, 53425L), NATURALINC2010 = c(389145L, 
52606L, 71829L, 140046L, 124664L, 3154L), NATURALINC2011 = c(1461043L, 
175449L, 248069L, 546846L, 490679L, 10871L), NATURALINC2012 = c(1435445L, 
176919L, 245884L, 544361L, 468281L, 10706L), NATURALINC2013 = c(1332557L, 
155709L, 225681L, 493706L, 457461L, 7089L), NATURALINC2014 = c(1380747L, 
162237L, 239427L, 519223L, 459860L, 9215L), NATURALINC2015 = c(1292550L, 
145634L, 211474L, 493362L, 442080L, 7817L), NATURALINC2016 = c(1259439L, 
147708L, 212196L, 483169L, 416366L, 7726L), NATURALINC2017 = c(1122546L, 
115530L, 195621L, 426995L, 384400L, 5226L), NATURALINC2018 = c(1041487L, 
102427L, 183401L, 390686L, 364973L, 3791L), INTERNATIONALMIG2010 = 
c(178835L, 
45723L, 25158L, 68742L, 39212L, 928L), INTERNATIONALMIG2011 = c(792881L, 
206686L, 116948L, 285343L, 183904L, 4716L), INTERNATIONALMIG2012 = 
c(858764L, 
207584L, 120995L, 344198L, 185987L, 5874L), INTERNATIONALMIG2013 = 
c(850952L, 
194103L, 126681L, 329897L, 200271L, 5111L), INTERNATIONALMIG2014 = 
c(947947L, 
222685L, 134310L, 365281L, 225671L, 3753L), INTERNATIONALMIG2015 = 
c(1063702L, 
227275L, 142759L, 429088L, 264580L, 4685L), INTERNATIONALMIG2016 = 
c(1069230L, 
236718L, 144859L, 436795L, 250858L, 5950L), INTERNATIONALMIG2017 = 
c(953233L, 
215872L, 126013L, 404582L, 206766L, 3190L), INTERNATIONALMIG2018 = 
c(978826L, 
229700L, 127583L, 418418L, 203125L, 3344L), DOMESTICMIG2010 = c(0L, 
-32918L, -50873L, 90679L, -6888L, 1238L), DOMESTICMIG2011 = c(0L, 
-159789L, -186896L, 335757L, 10928L, -2239L), DOMESTICMIG2012 = c(0L, 
-205314L, -181285L, 336615L, 49984L, 59L), DOMESTICMIG2013 = c(0L, 
-216273L, -123814L, 293443L, 46644L, 2641L), DOMESTICMIG2014 = c(0L, 
-274391L, -182730L, 373439L, 83682L, -755L), DOMESTICMIG2015 = c(0L, 
-339996L, -234823L, 452879L, 121940L, -1553L), DOMESTICMIG2016 = c(0L, 
-372953L, -228200L, 442633L, 158520L, -1977L), DOMESTICMIG2017 = c(0L, 
-316879L, -161387L, 364465L, 113801L, 2065L), DOMESTICMIG2018 = c(0L, 
-292928L, -157048L, 345132L, 104844L, 5718L), NETMIG2010 = c(178835L, 
12805L, -25715L, 159421L, 32324L, 2166L), NETMIG2011 = c(792881L, 
46897L, -69948L, 621100L, 194832L, 2477L), NETMIG2012 = c(858764L, 
2270L, -60290L, 680813L, 235971L, 5933L), NETMIG2013 = c(850952L, 
-22170L, 2867L, 623340L, 246915L, 7752L), NETMIG2014 = c(947947L, 
-51706L, -48420L, 738720L, 309353L, 2998L), NETMIG2015 = c(1063702L, 
-112721L, -92064L, 881967L, 386520L, 3132L), NETMIG2016 = c(1069230L, 
-136235L, -83341L, 879428L, 409378L, 3973L), NETMIG2017 = c(953233L, 
-101007L, -35374L, 769047L, 320567L, 5255L), NETMIG2018 = c(978826L, 
-63228L, -29465L, 763550L, 307969L, 9062L), RESIDUAL2010 = c(0L, 
-3196L, -1108L, 4554L, -250L, -10L), RESIDUAL2011 = c(0L, -2459L, 
-239L, 4387L, -1689L, 38L), RESIDUAL2012 = c(0L, -2992L, -1288L, 
6502L, -2222L, 91L), RESIDUAL2013 = c(0L, -2445L, -1350L, 5123L, 
-1328L, 55L), RESIDUAL2014 = c(0L, -2490L, -2904L, 6550L, -1156L, 
-192L), RESIDUAL2015 = c(0L, -1190L, -2509L, 4476L, -777L, -270L
), RESIDUAL2016 = c(0L, -271L, -1077L, 1047L, 301L, -114L), RESIDUAL2017 = 
c(0L, 
-636L, -1129L, 1196L, 569L, -106L), RESIDUAL2018 = c(0L, -796L, 
-1227L, 1288L, 735L, -102L), RBIRTH2011 = c(12.79898857, 11.646389369, 
12.449493906, 13.0753983, 13.564866164, 12.455601786), RBIRTH2012 = 
c(12.589173852, 
11.454833676, 12.353389372, 12.900715293, 13.172754439, 12.287820829
), RBIRTH2013 = c(12.511116578, 11.384582534, 12.318197145, 12.770698648, 
13.1250523, 12.012410502), RBIRTH2014 = c(12.493440163, 11.301146646, 
12.363692308, 12.814734, 12.993051496, 12.179749675), RBIRTH2015 = 
c(12.493175596, 
11.324209532, 12.357461907, 12.843808208, 12.92441189, 12.301816868
), RBIRTH2016 = c(12.309933949, 11.20434042, 12.242454436, 12.663079639, 
12.619264908, 12.222387438), RBIRTH2017 = c(12.039095529, 10.996948983, 
11.989119413, 12.357287884, 12.333939366, 11.962999487), RBIRTH2018 = 
c(11.820984126, 
10.863177115, 11.789576855, 12.078306222, 12.128940451, 11.720998206
), RDEATH2011 = c(8.0928244199, 8.4846099623, 8.7504877826, 8.3388830191, 
6.7917918366, 10.187095914), RDEATH2012 = c(7.9990857588, 8.2779015368, 
8.6968381072, 8.2343067033, 6.7700904074, 10.060744313), RDEATH2013 = 
c(8.2803198685, 
8.5962112289, 8.9723230665, 8.5807898649, 6.9298356343, 10.542582104
), RDEATH2014 = c(8.1408206164, 8.4020820365, 8.8249187702, 8.4524499397, 
6.8267702932, 10.274434632), RDEATH2015 = c(8.4484528254, 8.7250748685, 
9.2388679994, 8.7443343664, 7.0592978512, 10.689339673), RDEATH2016 = 
c(8.3975028099, 
8.5692003816, 9.1188486402, 8.6935469035, 7.1552465339, 10.632332792
), RDEATH2017 = c(8.5756150392, 8.9363320099, 9.1155717285, 8.8857783149, 
7.3396052849, 10.889883997), RDEATH2018 = c(8.6277792774, 9.0371195009, 
9.1016891619, 8.9320830002, 7.4291216994, 10.944391939), RNATURALINC2011 = 
c(4.7061641498, 
3.161779407, 3.6990061239, 4.7365152812, 6.7730743272, 2.2685058724
), RNATURALINC2012 = c(4.5900880929, 3.1769321388, 3.656551265, 
4.66640859, 6.402664032, 2.2270765159), RNATURALINC2013 = c(4.2307967093, 
2.7883713049, 3.3458740787, 4.1899087829, 6.1952166656, 1.4698283977
), RNATURALINC2014 = c(4.3526195469, 2.89906461, 3.5387735378, 
4.3622840605, 6.1662812026, 1.9053150433), RNATURALINC2015 = 
c(4.0447227708, 
2.5991346635, 3.1185939072, 4.0994738414, 5.8651140389, 1.6124771946
), RNATURALINC2016 = c(3.912431139, 2.6351400388, 3.123605796, 
3.969532736, 5.4640183742, 1.5900546466), RNATURALINC2017 = 
c(3.4634804902, 
2.0606169731, 2.8735476848, 3.4715095687, 4.9943340813, 1.0731154898
), RNATURALINC2018 = c(3.1932048488, 1.8260576141, 2.687887693, 
3.1462232219, 4.6998187519, 0.7766062675), RINTERNATIONALMIG2011 = 
c(2.5539481982, 
3.7247036946, 1.7438348531, 2.4715029092, 2.5385138982, 0.9841112772
), RINTERNATIONALMIG2012 = c(2.7460490726, 3.7275831375, 1.7993217139, 
2.9505576333, 2.5429438207, 1.2219173785), RINTERNATIONALMIG2013 = 
c(2.7017267715, 
3.4759149144, 1.8781318506, 2.7997195452, 2.7121923767, 1.0597112344
), RINTERNATIONALMIG2014 = c(2.988275652, 3.9792291689, 1.9851256285, 
3.0689308523, 3.0260314993, 0.7759790947), RINTERNATIONALMIG2015 = 
c(3.3285982753, 
4.0561842059, 2.1052580818, 3.5654043717, 3.5102060089, 0.9664136698
), RINTERNATIONALMIG2016 = c(3.3215493142, 4.2230961065, 2.1323795548, 
3.5885415898, 3.2920380658, 1.2245437674), RINTERNATIONALMIG2017 = 
c(2.9410856198, 
3.8503376372, 1.8510505744, 3.2892897676, 2.6864164429, 0.6550398799
), RINTERNATIONALMIG2018 = c(3.0010858795, 4.0950670621, 1.8698304564, 
3.3695510667, 2.6156748143, 0.685035969), RDOMESTICMIG2011 = c(0, 
-2.879569389, -2.786843372, 2.9081645678, 0.1508443529, -0.467223314
), RDOMESTICMIG2012 = c(0, -3.686820778, -2.69589683, 2.8855541222, 
0.6834160664, 0.0122732593), RDOMESTICMIG2013 = c(0, -3.872925953, 
-1.835626629, 2.4903472978, 0.6316815776, 0.5475831286), RDOMESTICMIG2014 
= c(0, 
-4.903180146, -2.700781819, 3.1374707924, 1.1220952977, -0.156105573
), RDOMESTICMIG2015 = c(0, -6.067919504, -3.462920156, 3.7630900106, 
1.6177886489, -0.320350145), RDOMESTICMIG2016 = c(0, -6.653555548, 
-3.359190761, 3.6365043774, 2.0802759896, -0.40687782), RDOMESTICMIG2017 = 
c(0, 
-5.651919379, -2.370672066, 2.963134779, 1.4785645494, 0.4240305179
), RDOMESTICMIG2018 = c(0, -5.222289092, -2.301663494, 2.7793734944, 
1.350093835, 1.1713623417), RNETMIG2011 = c(2.5539481982, 0.845134306, 
-1.043008519, 5.379667477, 2.6893582511, 0.516887963), RNETMIG2012 = 
c(2.7460490726, 
0.0407623599, -0.896575116, 5.8361117555, 3.2263598871, 1.2341906378
), RNETMIG2013 = c(2.7017267715, -0.397011039, 0.0425052219, 
5.2900668429, 3.3438739543, 1.6072943629), RNETMIG2014 = c(2.988275652, 
-0.923950977, -0.71565619, 6.2064016447, 4.148126797, 0.6198735214
), RNETMIG2015 = c(3.3285982753, -2.011735298, -1.357662074, 
7.3284943823, 5.1279946578, 0.6460635248), RNETMIG2016 = c(3.3215493142, 
-2.430459441, -1.226811206, 7.2250459672, 5.3723140554, 0.8176659475
), RNETMIG2017 = c(2.9410856198, -1.801581742, -0.519621492, 
6.2524245465, 4.1649809923, 1.0790703978), RNETMIG2018 = c(3.0010858795, 
-1.12722203, -0.431833037, 6.1489245611, 3.9657686492, 1.8563983107
)), .Names = c("SUMLEV", "REGION", "DIVISION", "STATE", "NAME", 
"CENSUS2010POP", "ESTIMATESBASE2010", "POPESTIMATE2010", 
"POPESTIMATE2011", 
"POPESTIMATE2012", "POPESTIMATE2013", "POPESTIMATE2014", 
"POPESTIMATE2015", 
"POPESTIMATE2016", "POPESTIMATE2017", "POPESTIMATE2018", "NPOPCHG_2010", 
"NPOPCHG_2011", "NPOPCHG_2012", "NPOPCHG_2013", "NPOPCHG_2014", 
"NPOPCHG_2015", "NPOPCHG_2016", "NPOPCHG_2017", "NPOPCHG_2018", 
"BIRTHS2010", "BIRTHS2011", "BIRTHS2012", "BIRTHS2013", "BIRTHS2014", 
"BIRTHS2015", "BIRTHS2016", "BIRTHS2017", "BIRTHS2018", "DEATHS2010", 
"DEATHS2011", "DEATHS2012", "DEATHS2013", "DEATHS2014", "DEATHS2015", 
"DEATHS2016", "DEATHS2017", "DEATHS2018", "NATURALINC2010", 
"NATURALINC2011", 
"NATURALINC2012", "NATURALINC2013", "NATURALINC2014", "NATURALINC2015", 
"NATURALINC2016", "NATURALINC2017", "NATURALINC2018", 
"INTERNATIONALMIG2010", 
"INTERNATIONALMIG2011", "INTERNATIONALMIG2012", "INTERNATIONALMIG2013", 
"INTERNATIONALMIG2014", "INTERNATIONALMIG2015", "INTERNATIONALMIG2016", 
"INTERNATIONALMIG2017", "INTERNATIONALMIG2018", "DOMESTICMIG2010", 
"DOMESTICMIG2011", "DOMESTICMIG2012", "DOMESTICMIG2013", 
"DOMESTICMIG2014", 
"DOMESTICMIG2015", "DOMESTICMIG2016", "DOMESTICMIG2017", 
"DOMESTICMIG2018", 
"NETMIG2010", "NETMIG2011", "NETMIG2012", "NETMIG2013", "NETMIG2014", 
"NETMIG2015", "NETMIG2016", "NETMIG2017", "NETMIG2018", "RESIDUAL2010", 
"RESIDUAL2011", "RESIDUAL2012", "RESIDUAL2013", "RESIDUAL2014", 
"RESIDUAL2015", "RESIDUAL2016", "RESIDUAL2017", "RESIDUAL2018", 
"RBIRTH2011", "RBIRTH2012", "RBIRTH2013", "RBIRTH2014", "RBIRTH2015", 
"RBIRTH2016", "RBIRTH2017", "RBIRTH2018", "RDEATH2011", "RDEATH2012", 
"RDEATH2013", "RDEATH2014", "RDEATH2015", "RDEATH2016", "RDEATH2017", 
"RDEATH2018", "RNATURALINC2011", "RNATURALINC2012", "RNATURALINC2013", 
"RNATURALINC2014", "RNATURALINC2015", "RNATURALINC2016", 
"RNATURALINC2017", 
"RNATURALINC2018", "RINTERNATIONALMIG2011", "RINTERNATIONALMIG2012", 
"RINTERNATIONALMIG2013", "RINTERNATIONALMIG2014", "RINTERNATIONALMIG2015", 
"RINTERNATIONALMIG2016", "RINTERNATIONALMIG2017", "RINTERNATIONALMIG2018", 
"RDOMESTICMIG2011", "RDOMESTICMIG2012", "RDOMESTICMIG2013", 
"RDOMESTICMIG2014", 
"RDOMESTICMIG2015", "RDOMESTICMIG2016", "RDOMESTICMIG2017", 
"RDOMESTICMIG2018", 
"RNETMIG2011", "RNETMIG2012", "RNETMIG2013", "RNETMIG2014", "RNETMIG2015", 
"RNETMIG2016", "RNETMIG2017", "RNETMIG2018"), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Solution

  • In order to help you out, an example data using dput(head(population.data)) would be helpful. Based on your comments, your data is in what is called 'wide' format, meaning each observation is contained in a column, rather than a row (pupulation 2010, population 2011 etc.).

    As i hinted in my comment, a sub-goal within statistical modelling is always to clean and reshape data to a proper format, that will work for running models. In this case the problem is that your format is in an incorrect shape. The most common is likely melting to long format via the reshape2 or data.table package as explained in this link. I personally prefer the data.table package, as it seems to have better large scale performance. Their usage however is identical.

    Lets say you have a column 'NAME' for states and 9 columns for population estimates (2010 population estimates, 2011 population estimates and so on), we could then convert these columns into a long format, using melt from either of the two suggested packages (They are identical in use)

    require(data.table)
    value_columns <- paste(2010:2018, "Population Estimates")
    population.data_long <- melt(population.data, id.vars = "NAME", 
                                 measure.vars = value_columns, #Columns containing values we (that are grouped by their column names) 
                                 variable.name = 'Year (Population Estimate)', #Name of the column which tells us [(Year) Population Estimate]
                                 value.name = 'Population Estimate') #Name of the column with values
    population.data_long$year <- as.integer(substr(population.data_long$`Year (Population Estimate)`, 1, 4)) #Create a year column in a bit of a hacky way
    

    Note i have ignored any additional columns, and these should be included in your melt statement. From here on a linear regression should follow any standard example that you have found.