
Iterate through a whole dataset at once in Spark?


I have a big data set with population demographics per year, per country. I'm using Apache Spark with Scala and Parquet. The structure is one column per year (e.g. '1965'). I would like to be able to filter rows based on the values in any of those year columns.

Here is the schema:

columns: Array[String] = Array(country, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010)

I'd like to be able to filter my data set based on population level, regardless of what year it is. For example, get the country name and year whenever the population is over 5 million:

SELECT * FROM table WHERE population > 5000000

Result: Cuba, 1962

How can I structure my data frame to allow for this type of query?


Solution

  • You only need to unpivot the table, i.e. the reverse of a pivot: turn the year columns into rows so you get one (country, year, population) row per value, which you can then filter on population.

    Here is a good article: https://databricks.com/blog/2018/11/01/sql-pivot-converting-rows-to-columns.html

    How to pivot a dataframe: How to pivot DataFrame?
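
    As a sketch of the unpivot step, Spark SQL's `stack` generator can turn the wide year columns into (year, population) rows. The sample data and column subset below are hypothetical, just mirroring the schema in the question; in practice you would generate the `stack(...)` expression from `df.columns`:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, expr}

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unpivot-example")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical wide-format data: one column per year, as in the question
    val wide = Seq(
      ("Cuba", 7000000L, 7200000L),
      ("Iceland", 175000L, 180000L)
    ).toDF("country", "1960", "1961")

    // stack(n, label1, col1, ..., labelN, colN) emits one row per (label, value)
    // pair; numeric column names need backticks
    val long = wide.select(
      col("country"),
      expr("stack(2, '1960', `1960`, '1961', `1961`) as (year, population)")
    )

    // Now the question's query is a plain filter on the long format
    long.filter($"population" > 5000000L).show()
    ```

    With all 51 year columns you would build the `stack` argument string programmatically from `df.columns.filterNot(_ == "country")` rather than writing it out by hand.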