python pandas dataframe categories

Adding a column with one single categorical value to a pandas dataframe


I have a pandas.DataFrame df and would like to add a new column col that holds the single value "hello". The column should have dtype category, with "hello" as its only category. I can do the following:

df["col"] = "hello"
df["col"] = df["col"].astype("category")

  1. Do I really need to write df["col"] three times to achieve this?
  2. After the first line, I am worried that the intermediate column might take up a lot of memory before it is converted to categorical. (The dataframe is rather large, with millions of rows, and the value "hello" is actually a much longer string.)

Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues?

An alternative solution is

df["col"] = pd.Categorical(itertools.repeat("hello", len(df)))

but it requires itertools and len(df), and I am not sure how it behaves memory-wise under the hood.
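The memory concern can be checked empirically. The following is a minimal sketch, not a benchmark: the row count `n` and the placeholder string `long_value` are assumptions standing in for the real data, and `Series.memory_usage(deep=True)` is used to report the bytes held by the column before and after the conversion.

```python
import pandas as pd

n = 100_000                 # assumed row count, much smaller than the real frame
long_value = "hello" * 20   # assumed stand-in for the "much longer string"

df = pd.DataFrame({"a": range(n)})

# Step 1 of the two-step approach: a plain (non-categorical) column
df["col"] = long_value
object_bytes = df["col"].memory_usage(index=False, deep=True)

# Step 2: convert the column to categorical
df["col"] = df["col"].astype("category")
category_bytes = df["col"].memory_usage(index=False, deep=True)

print(f"before conversion: {object_bytes} bytes")
print(f"after conversion:  {category_bytes} bytes")
```

As far as I know, `deep=True` adds up the size of every element of an object column, so it may overstate the footprint when all rows reference the same string object. The two numbers are best read as a rough comparison rather than an exact measurement.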


Solution

  • We can explicitly build a Series of the correct size and dtype, instead of creating the column implicitly via __setitem__ and then converting it:

    df['col'] = pd.Series('hello', index=df.index, dtype='category')
    

    Sample Program:

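    import pandas as pd
    
    df = pd.DataFrame({'a': [1, 2, 3]})
    
    # Broadcast the scalar over df.index and create the column as category directly
    df['col'] = pd.Series('hello', index=df.index, dtype='category')
    
    print(df)
    print(df.dtypes)
    print(df['col'].cat.categories)  # the only category is 'hello'
    
    Output:
    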
       a    col
    0  1  hello
    1  2  hello
    2  3  hello
    
    a         int64
    col    category
    dtype: object
    
    Index(['hello'], dtype='object')
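
Another option, which is not part of the answer above, is to build the categorical from integer codes so that the string itself is stored only once, as the category. This is a minimal sketch assuming NumPy is available; it relies on `pd.Categorical.from_codes`, where every code `0` points at the first (and only) category.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# One small integer per row; code 0 refers to categories[0]
codes = np.zeros(len(df), dtype="int8")

# The string appears only once, in the categories list
df["col"] = pd.Categorical.from_codes(codes, categories=["hello"])

print(df.dtypes)
print(df["col"].cat.categories)
```

This still needs `len(df)` and an extra import, so it is less "short and snappy" than the `pd.Series(..., dtype='category')` one-liner. It may be worth considering mainly when explicit control over the codes is wanted.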