Tags: python, python-3.x, matplotlib, pyplot

Python: ValueError: could not convert string to float: '0'


For some school assignments I've been trying to get pyplot to plot some scientific graphs for me, based on data from Logger Pro, but I'm met with this error:

ValueError: could not convert string to float: '0'

This is the program:

plot.py
-------------------------------
import matplotlib.pyplot as plt 
import numpy as np

infile = open('text', 'r')

xs = []
ys = []

for line in infile:
    print (type(line))
    x, y = line.split()
    # print (x, y)
    # print (type(line), type(x), type(y))

    xs.append(float(x))
    ys.append(float(y))

xs.sort()
ys.sort()

plt.plot(xs, ys, 'bo')
plt.grid(True)

# print (xs, ys)

plt.show()

infile.close()

And the input file contains this:

text
-------------------------------
0 1.33
1 1.37
2 1.43
3 1.51
4 1.59
5 1.67
6 1.77
7 1.86
8 1.98
9 2.1

This is the error message I receive when I run the program:

Traceback (most recent call last):
  File "\route\to\the\file\plot01.py", line 36, in <module>
    xs.append(float(x))
ValueError: could not convert string to float: '0'

Solution

  • You have a UTF-8 BOM in your data file; pasting the '0' from your traceback into my Python 2 interactive session shows what is actually being converted to a float:

    >>> '0'
    '\xef\xbb\xbf0'
    

    The \xef\xbb\xbf bytes are the UTF-8 encoding of U+FEFF ZERO WIDTH NO-BREAK SPACE, commonly used as a byte-order mark (BOM), especially by Microsoft products. UTF-8 has no byte-order issues, so the mark isn't needed to record the byte ordering the way it is for UTF-16 or UTF-32; instead, Microsoft uses it as an aid to detect the encoding.
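
    A quick way to confirm this is to read the first few bytes of the file in binary mode; a minimal check, assuming the data file is named text as in the question:

    with open('text', 'rb') as check:
        # prints b'\xef\xbb\xbf' on Python 3 when a UTF-8 BOM is present
        print(check.read(3))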

    On Python 3, you could open the file using the utf-8-sig codec; this codec strips the BOM at the start of the file if one is present:

    infile = open('text', 'r', encoding='utf-8-sig')
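
    Applied to the script in the question, the whole program might then look like this; a sketch that assumes the same file name (the two sort() calls are left out because the sample data is already in ascending order):

    import matplotlib.pyplot as plt

    xs = []
    ys = []

    # utf-8-sig silently drops a leading BOM, so float() sees '0' rather than '\ufeff0'
    with open('text', 'r', encoding='utf-8-sig') as infile:
        for line in infile:
            x, y = line.split()
            xs.append(float(x))
            ys.append(float(y))

    plt.plot(xs, ys, 'bo')
    plt.grid(True)
    plt.show()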
    

    On Python 2, you could use the codecs.BOM_UTF8 constant to detect and strip it:

    import codecs

    for line in infile:
        # the BOM can only appear at the very start of the file, i.e. on the first line
        if line.startswith(codecs.BOM_UTF8):
            line = line[len(codecs.BOM_UTF8):]
        x, y = line.split()
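
    Alternatively, Python 2's io.open mirrors the Python 3 open and accepts the same codec; a sketch, again assuming the file is named text:

    import io

    # returns unicode strings with the BOM already stripped
    infile = io.open('text', 'r', encoding='utf-8-sig')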
    

    As the codecs documentation explains:

    As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it’s the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.

    Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to

    LATIN SMALL LETTER I WITH DIAERESIS
    RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    INVERTED QUESTION MARK
    

    in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
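
    To illustrate, decoding the same bytes with the plain utf-8 codec and with utf-8-sig shows the difference; a small Python 3 session using the first line of the data file as an example:

    >>> b'\xef\xbb\xbf0 1.33'.decode('utf-8')
    '\ufeff0 1.33'
    >>> b'\xef\xbb\xbf0 1.33'.decode('utf-8-sig')
    '0 1.33'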