go, web-scraping, html-parsing

Get unicode characters as string when reading response body (Golang)


I'm scraping a website that was written in Polish, meaning it contains characters such as ź and ę.

When I attempt to parse the HTML, whether with the html package or by simply splitting the response body string, I get output like this:

���~♦�♀�����r�▬֭��↔��q���y���<p��19��lFۯ☻→Z�7��

I'm currently using the following to get the string:

bodyBytes, err := ioutil.ReadAll(resp.Body)
if err != nil {
  // handle the error
}
bodyString := string(bodyBytes)

How can I get the text in readable format?


Solution

  • Update:

    Since the Content-Encoding of the response was gzip, the body had to be decompressed before converting it to a string. The code below worked for getting the response as a printable string:

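    // Needs the compress/gzip and io/ioutil packages.
    // Wrap the body in a gzip reader, since the raw bytes are compressed.
    gReader, err := gzip.NewReader(resp.Body)
    if err != nil {
        return err
    }
    defer gReader.Close()

    // Read the decompressed bytes and convert them to a string.
    gBytes, err := ioutil.ReadAll(gReader)
    if err != nil {
        return err
    }
    bodyStr := string(gBytes)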