go amazon-s3 buffer gzip newline

Go bufio.Writer, gzip.Writer and upload to AWS S3 in memory


I am attempting to write a compressed file from memory and upload to S3.

I am serializing a large slice of Data structs, line by line, into a bufio.Writer that wraps a gzip.Writer:

### DATA AND SERIALIZATION

type Data struct {
  field_1 int
  field_2 string
}

func (d *Data) Serialize() []byte {
  return []byte( fmt.Sprintf(`%d;%s\n`, d.field_1, d.field_2) )
}

### CREATE FILE AS COMPRESSED BYTES

var datas []*Data   // assume this is filled

buffer := &bytes.Buffer{}
compressor := gzip.NewWriter(buffer)
writer := bufio.NewWriter(compressor)

for _, data := range datas {
  writer.Write(data.Serialize())
}

writer.Flush()
compressor.Close()

### UPLOAD COMPRESSED FILE TO S3

key := "file.gz"
payload := bytes.NewReader(buffer.Bytes())

upload := &s3.PutObjectInput{
  Body:   payload,
  Bucket: aws.String(bucket),
  Key:    aws.String(key),
}

This works, seems fast and somewhat efficient.

However, the resulting file, although recognized as a text file under Linux, does not honor the line breaks added via \n. I am not sure whether this is an OS-specific issue, a matter of declaring the file type by some means (e.g. naming the file file.txt.gz or file.csv.gz, or adding specific header bytes), or a problem with the way I am creating these files in the first place.

What would be the proper way to create a fully qualified in-memory file type as []byte (or inside an io.ReadSeeker interface in general) to upload to S3, preferably in a line-by-line fashion?


Update:

I was able to solve this by wrapping the string in a call to fmt.Sprintln:

func (d *Data) Serialize() []byte {
  return []byte(fmt.Sprintln(fmt.Sprintf(`%d;%s`, d.field_1, d.field_2)))
}

Looking at the implementation of fmt.Sprintln, it simply appends the \n rune, so there must be a subtle difference from my original version that I am not aware of.
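One way to see what differs between the two versions is to print the serialized bytes with the %q verb. The sketch below is only an illustration: serializeRaw and serializeLn are hypothetical stand-ins for the two Serialize variants above, taking plain arguments instead of a Data receiver.

```go
package main

import "fmt"

// serializeRaw mirrors the original Serialize: the format string is written
// with back quotes, exactly as in the question.
func serializeRaw(id int, name string) []byte {
	return []byte(fmt.Sprintf(`%d;%s\n`, id, name))
}

// serializeLn mirrors the updated Serialize that wraps the result in fmt.Sprintln.
func serializeLn(id int, name string) []byte {
	return []byte(fmt.Sprintln(fmt.Sprintf(`%d;%s`, id, name)))
}

func main() {
	raw := serializeRaw(1, "a")
	ln := serializeLn(1, "a")

	// %q escapes the bytes, so a literal backslash shows up as \\ and a
	// real line feed shows up as \n.
	fmt.Printf("%q (%d bytes)\n", raw, len(raw)) // "1;a\\n" (5 bytes)
	fmt.Printf("%q (%d bytes)\n", ln, len(ln))   // "1;a\n" (4 bytes)
}
```

The first version ends in two separate bytes, a backslash and the letter n, while the Sprintln version ends in a single line feed byte.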


Solution

  • Replace

    `%d;%s\n`
    

    with

    "%d;%s\n"
    

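    `%d;%s\n` is a raw string literal, and in a raw string literal backslashes have no special meaning. The \n is therefore written to the file as the two characters \ and n rather than as a line feed, whereas the interpreted (double-quoted) form "%d;%s\n" produces an actual newline. See String literals in the language spec: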

    Raw string literals are character sequences between back quotes, as in `foo`. Within the quotes, any character may appear except back quote. The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes; in particular, backslashes have no special meaning and the string may contain newlines.
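With the interpreted literal in place, the in-memory pipeline from the question can be sketched roughly as follows. This is a minimal sketch under a few assumptions: compressLines and countLines are hypothetical helper names, the bufio.Writer layer is left out for brevity (fmt.Fprintf writes straight into the gzip.Writer), and the S3 call itself is omitted. The returned *bytes.Reader satisfies io.ReadSeeker, so it could be passed as the Body of the s3.PutObjectInput shown in the question.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// Data mirrors the struct from the question.
type Data struct {
	field_1 int
	field_2 string
}

// compressLines writes one "field_1;field_2" line per record into a gzip
// stream held entirely in memory and returns it as a seekable reader.
func compressLines(datas []*Data) (*bytes.Reader, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	for _, d := range datas {
		// Interpreted (double-quoted) literal: \n becomes a real line feed.
		if _, err := fmt.Fprintf(zw, "%d;%s\n", d.field_1, d.field_2); err != nil {
			return nil, err
		}
	}
	// Close flushes pending data and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return bytes.NewReader(buf.Bytes()), nil
}

// countLines decompresses the payload, counts its lines, and rewinds the
// reader so the same payload can still be uploaded afterwards.
func countLines(rs io.ReadSeeker) (int, error) {
	zr, err := gzip.NewReader(rs)
	if err != nil {
		return 0, err
	}
	defer zr.Close()

	n := 0
	sc := bufio.NewScanner(zr)
	for sc.Scan() {
		n++
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	_, err = rs.Seek(0, io.SeekStart)
	return n, err
}

func main() {
	datas := []*Data{
		{field_1: 1, field_2: "a"},
		{field_1: 2, field_2: "b"},
		{field_1: 3, field_2: "c"},
	}
	payload, err := compressLines(datas)
	if err != nil {
		panic(err)
	}
	n, err := countLines(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // 3
}
```

Decompressing the payload before uploading is just a sanity check that each record really ends in a line break; it can be dropped once the format string is fixed.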