sql-server, apache-spark, large-data

Are Apache Spark or an SQL server viable solutions for memory-limited local data manipulation?


I have been assigned an 8GB RAM desktop at work which I can't modify. My job involves data manipulation on a group of ~1GB, ~8M row tables.

Certain analyses I need to do would be considerably simpler to implement if I could merge all the files, but that would mean R, the tool I'm currently using, wouldn't be able to load the merged file at all.

I've asked around and was told that using Apache Spark or setting up a local SQL server would solve the issue and let me ignore memory limitations for data processing steps (the expected output always consists of only a handful of total counts). I'd just like to be sure these will actually work like that before installing anything.

(As a bonus question, I wonder how software like SPSS manages to load and work on huge datasets without a hitch, and why R can't implement a similar method.)


Solution

  • Both Spark and SQL Server can absolutely handle and process data larger than what fits into RAM.

    Installing these tools shouldn't be a big deal, and uninstalling a local Spark installation is as simple as deleting a directory.

    Spark is intended for use on clusters of computers, but you can also run it locally on a single workstation, as in the first sketch below.

    Spark will also read and write data directly in most flat file formats. With SQL Server, you first have to load the data into SQL Server tables; the second sketch below shows one way to do that from R.
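
As a rough illustration of the local Spark route, here is a minimal sparklyr sketch. The file pattern, the grouping column, and the driver-memory setting are assumptions to adapt to your actual tables; the point is that the merge and the aggregation happen inside Spark, and only the handful of resulting counts is pulled back into R.

```r
# Minimal local-Spark sketch via sparklyr (file names, column names and
# memory settings are placeholders -- adjust to your data).
library(sparklyr)
library(dplyr)

# Start a single-machine Spark session with a modest driver-memory cap
# so it fits comfortably on an 8GB desktop.
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "4g"
sc <- spark_connect(master = "local", config = conf)

# Read all the flat files at once (Spark accepts glob patterns) as one
# logical table; memory = FALSE keeps Spark from caching it in RAM.
orders <- spark_read_csv(
  sc, name = "orders", path = "data/table_*.csv",
  header = TRUE, infer_schema = TRUE, memory = FALSE
)

# The aggregation is executed by Spark; collect() only transfers the
# small table of counts into an R data frame.
counts <- orders %>%
  group_by(status) %>%
  summarise(n = n()) %>%
  collect()

print(counts)
spark_disconnect(sc)
```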
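
If you go the SQL Server route instead, the data first has to be loaded into a table; the aggregation then runs on the server and R only receives the result. The sketch below assumes a local SQL Server instance (2017 or later for the CSV bulk-load syntax) reachable through ODBC; the driver, server, database, table, and file names are all hypothetical.

```r
# Minimal SQL Server sketch from R via DBI/odbc (connection details,
# table and file names are placeholders).
library(DBI)
library(odbc)

con <- dbConnect(
  odbc::odbc(),
  Driver   = "ODBC Driver 17 for SQL Server",
  Server   = "localhost",
  Database = "analysis",
  Trusted_Connection = "Yes"
)

# Load a flat file server-side with BULK INSERT so it never passes
# through R's memory; the dbo.orders table must already exist with
# matching columns (repeat per file, or script the loop over all files).
dbExecute(con, "
  BULK INSERT dbo.orders
  FROM 'C:\\data\\table_01.csv'
  WITH (FORMAT = 'CSV', FIRSTROW = 2);
")

# The counting happens inside SQL Server; only the small result set of
# totals comes back to R.
counts <- dbGetQuery(con, "
  SELECT status, COUNT(*) AS n
  FROM dbo.orders
  GROUP BY status;
")

print(counts)
dbDisconnect(con)
```

Either way, the heavy lifting happens outside of R, which only ever sees the small table of counts.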