Tags: python, pandas, list, numpy, truncation

Pandas truncates strings in numpy list


Consider the following minimal example:

from dataclasses import dataclass

import numpy
import pandas


@dataclass
class ExportEngine:

    def __post_init__(self):
        self.list = pandas.DataFrame(columns=list(MyObject.CSVHeaders()))

    def export(self):
        self.prepare()
        self.list.to_csv("~/Desktop/test.csv")

    def prepare(self):
        values = numpy.concatenate(
            (
                numpy.array(["Col1Value", "Col2Value", " Col3Value", "Col4Value"]),
                numpy.repeat("", 24),
            )
        )
        for x in range(8):  # not the best way, but done due to other constraints
            start = 3 + (x * 3) - 2
            end = start + 3
            values[start:end] = [
                "123",
                "some_random_value_that_gets_truncated",
                "456",
            ]
        self.list.loc[len(self.list)] = values

When export() is called, some_random_value_that_gets_truncated is truncated to some_rando:

['Col1Value', '123', 'some_rando', '456', '123', 'some_rando', '456', '123', 'some_rando', '456', '123', 'some_rando', '456', '123', ...] 

I've tried setting pandas.set_option("display.max_colwidth", 10000), but this doesn't change anything.

Why does this happen, and how can I prevent the truncation?


Solution

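  • By default, NumPy picks a fixed-width Unicode dtype sized to the longest string present when the array is created. Here that is 10 characters (" Col3Value", including its leading space), so any longer string assigned later is silently cut to 10 characters.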

    Notice the dtype:

    In [1]: import numpy
    
    In [2]: values = numpy.concatenate(
       ...:     (
       ...:         numpy.array(["Col1Value", "Col2Value", " Col3Value", "Col4Value"]),
       ...:         numpy.repeat("", 24),
       ...:     )
       ...: )
    
    In [3]: values
    Out[3]:
    array(['Col1Value', 'Col2Value', ' Col3Value', 'Col4Value', '', '', '',
           '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
           '', '', '', ''], dtype='<U10')
    

    You probably don't need NumPy here at all (a plain Python list would do), but one quick fix is to replace:

    values = numpy.concatenate(
        (
            numpy.array(["Col1Value", "Col2Value", " Col3Value", "Col4Value"]),
            numpy.repeat("", 24),
        )
    )
    

    with:

    values = numpy.array(
        ['Col1Value', 'Col2Value', ' Col3Value', 'Col4Value', *[""]*24],
        dtype=object
    )
    

    Notice the dtype=object: the array then stores references to Python str objects rather than fixed-width character buffers, so there is no limit on the length of the strings.
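
To make the difference concrete, here is a minimal sketch that puts the fixed-width array, the dtype=object array, and a plain Python list side by side. The three-element arrays and the column names "a", "b", "c" are made up for illustration and are not part of the original code.

```python
import numpy
import pandas

LONG = "some_random_value_that_gets_truncated"

# Fixed-width array: the dtype is sized to the longest initial string (10 chars).
fixed = numpy.array(["Col1Value", " Col3Value", ""])
fixed[2] = LONG  # silently cut down to the dtype's width

# Object array: each element is a reference to a Python str of any length.
flexible = numpy.array(["Col1Value", " Col3Value", ""], dtype=object)
flexible[2] = LONG

# Plain list: no NumPy dtype is involved at all.
row = ["Col1Value", " Col3Value", ""]
row[2] = LONG

# Hypothetical column names, used only for this illustration.
frame = pandas.DataFrame(columns=["a", "b", "c"])
frame.loc[len(frame)] = row

print(fixed.dtype, str(fixed[2]))        # truncated value
print(flexible.dtype, str(flexible[2]))  # full value
print(frame.loc[0, "c"])                 # full value
```

If the row is only ever handed to the DataFrame, the plain list is probably the simplest option, since it avoids the dtype question entirely.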