scala, apache-spark, dataframe, dataset, spark-shell

Difference between SparkSession text and textFile methods?


I am working with the Spark Scala shell and trying to create a DataFrame and a Dataset from a text file.

For reading a text file, there are two options, the text and textFile methods, as shown by tab completion:

scala> spark.read.
csv   format   jdbc   json   load   option   options   orc   parquet   schema   table   text   textFile

Here is how I am getting a DataFrame and a Dataset from these two methods:

scala> val df = spark.read.text("/Users/karanverma/Documents/logs1.txt")
df: org.apache.spark.sql.DataFrame = [value: string]

scala> val df = spark.read.textFile("/Users/karanverma/Documents/logs1.txt")
df: org.apache.spark.sql.Dataset[String] = [value: string]

So my question is: what is the difference between these two methods for reading a text file?

When should I use which method?


Solution

  • As I've noticed, they have almost the same functionality.

    The difference is in the return type: spark.read.text returns a DataFrame, i.e. a Dataset[Row] whose lines are organized into a single named column called value, while spark.read.textFile returns a Dataset[String], a typed, distributed collection where each element is a line of the file.

    Hope it helps.
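    For illustration, here is a minimal sketch (the file path and the "ERROR" filter are hypothetical) of how the two return types differ in use: the DataFrame from text is queried through its value column, while the Dataset[String] from textFile supports typed operations directly on each line.

    import org.apache.spark.sql.SparkSession

    object TextVsTextFile {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("text-vs-textFile")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // spark.read.text: DataFrame (Dataset[Row]) with a single column named "value"
        val df = spark.read.text("logs1.txt") // hypothetical path
        val errorRows = df.filter($"value".contains("ERROR")) // column-based API
        errorRows.show(5, truncate = false)

        // spark.read.textFile: Dataset[String], so each element is a plain String
        val ds = spark.read.textFile("logs1.txt") // hypothetical path
        val errorLines = ds.filter(line => line.contains("ERROR")) // typed lambda
        errorLines.show(5, truncate = false)

        spark.stop()
      }
    }

    In practice you can convert between the two with df.as[String] or ds.toDF(), so the choice mostly comes down to whether you prefer the column-based API or the typed, functional one.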