pandas.DataFrame.to_markdown transforms large ints to floats


pandas.DataFrame.to_markdown renders large integers as floats in scientific notation. Is this a bug or a feature? Is there a workaround?

>>> df = pd.DataFrame({"A": [123456, 123456]})
>>> print(df.to_markdown())
|    |      A |
|---:|-------:|
|  0 | 123456 |
|  1 | 123456 |

>>> df = pd.DataFrame({"A": [1234567, 1234567]})
>>> print(df.to_markdown())
|    |           A |
|---:|------------:|
|  0 | 1.23457e+06 |
|  1 | 1.23457e+06 |

>>> print(df)
         A
0  1234567
1  1234567

>>> print(df.A.dtype)
int64

Solution

  • At first I found only a workaround, not an explanation: convert the column to strings.

    >>> df = pd.DataFrame({"A": [1234567, 1234567]})
    >>> df["A"] = df.A.astype(str)
    >>> print(df.to_markdown())
    |    |       A |
    |---:|--------:|
    |  0 | 1234567 |
    |  1 | 1234567 |
    

    Update:

    I think there are two causes:

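    • The _column_type function in tabulate, which infers a type for each column:
    def _column_type(strings, has_invisible=True, numparse=True):
        """The least generic type all column values are convertible to.
        ...
        """
    

    The integer column ends up being treated as float, and tabulate's default floatfmt="g" keeps only six significant digits, which is why 123456 prints unchanged while 1234567 becomes 1.23457e+06. Using tablefmt="pretty" disables the numeric parsing: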

    >>> print(df.to_markdown(tablefmt="pretty"))
    +---+---------+
    |   |    A    |
    +---+---------+
    | 0 | 1234567 |
    | 1 | 1234567 |
    +---+---------+
    
    • When there is more than one column and one of them contains floats. tabulate uses df.values to extract the data, which converts the DataFrame to a single numpy.array, so all values are cast to a common dtype (float). This is also discussed in this issue.
    >>> df = pd.DataFrame({"A": [1234567, 1234567], "B": [0.1, 0.2]})
    >>> print(df)
             A    B
    0  1234567  0.1
    1  1234567  0.2
    
    >>> print(df.A.dtype)
    int64
    
    >>> print(df.to_markdown(tablefmt="pretty"))
    +---+-----------+-----+
    |   |     A     |  B  |
    +---+-----------+-----+
    | 0 | 1234567.0 | 0.1 |
    | 1 | 1234567.0 | 0.2 |
    +---+-----------+-----+
    
    >>> df.values
    array([[1.234567e+06, 1.000000e-01],
           [1.234567e+06, 2.000000e-01]])