I am attempting to write a compressed file from memory and upload it to S3. I am serializing a large array of Data structs into a bufio.Writer that writes to a gzip.Writer, line by line:
### DATA AND SERIALIZATION
type Data struct {
	field_1 int
	field_2 string
}

func (d *Data) Serialize() []byte {
	return []byte(fmt.Sprintf(`%d;%s\n`, d.field_1, d.field_2))
}
### CREATE FILE AS COMPRESSED BYTES
var datas []*Data // assume this is filled
buffer := &bytes.Buffer{}
compressor := gzip.NewWriter(buffer)
writer := bufio.NewWriter(compressor)
for _, data := range datas {
	writer.Write(data.Serialize())
}
writer.Flush()
compressor.Close()
### UPLOAD COMPRESSED FILE TO S3
key := "file.gz"
payload := bytes.NewReader(buffer.Bytes())
upload := &s3.PutObjectInput{
	Body:   payload,
	Bucket: aws.String(bucket),
	Key:    aws.String(key),
}
This works, seems fast and somewhat efficient.
However, the resulting file, although recognized as a text file under Linux, does not honor the line breaks added via \n. I am not sure whether this is an OS-specific issue, an issue with how the file type is declared (e.g. a file name ending such as file.txt.gz or file.csv.gz, or specific header bytes), or an issue with the way I am creating these files in the first place.
What would be the proper way to create a fully qualified in-memory file type as []byte (or inside an io.ReadSeeker interface in general) to upload to S3, preferably in a line-by-line fashion?
Update:
I was able to solve this by wrapping the string in a call to fmt.Sprintln:
func (d *Data) Serialize() []byte {
	return []byte(fmt.Sprintln(fmt.Sprintf(`%d;%s`, d.field_1, d.field_2)))
}
Looking at the implementation of fmt.Sprintln, it appends the \n rune, so there must be a subtle difference I am not aware of.
Replace `%d;%s\n` with "%d;%s\n"
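`%d;%s\n` is a raw string literal, and in a raw string literal backslashes have no special meaning, so \n stays as the two characters \ and n instead of becoming a line feed. That is also why the fmt.Sprintln workaround helped: it appends an actual newline character. See String literals in the language spec: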
Raw string literals are character sequences between back quotes, as in `foo`. Within the quotes, any character may appear except back quote. The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes; in particular, backslashes have no special meaning and the string may contain newlines.