python pandas dataframe csv random

Dummy data: generating random text and numerical data into one csv/excel file?


I'm trying to generate dummy data with 3 columns: Sq. feet, Price and Borough. The first two, which are purely numerical, work fine: I have 50,000 rows of data for both in a spreadsheet. However, when I add Borough and try to fill it with random values from a list, I get the following output:

       Sq. feet    Price  Borough
0           112   345382        5
1           310   901500        5
2           215   661033        5
3           147  1038431        5
4           212   296497        5

I have not used a package associated with numerical generation like np.random.randint

Instead I used "Borough" : random.randrange(len(word))

Where have I gone wrong?

My code below

import random

import pandas as pd
import numpy as np

WORDS = ["Chelsea", "Kensington", "Westminster", "Pimlico", "Bank", "Holborn", "Camden", "Islington", "Angel", "Battersea", "Knightsbridge", "Bermondsey", "Newham"]
word = random.choice(WORDS)
np.random.seed(1)
data3 = pd.DataFrame({"Sq. feet" : np.random.randint(low=75, high=325, size=50000),
                     "Price" : np.random.randint(low=200000, high=1250000, size=50000),
                      "Borough" : random.randrange(len(word))
                     })

df = pd.DataFrame(data3)
df.to_csv("/Users/thomasmcnally/PycharmProjects/real_estate_dummy_date/realestate.csv", index=False)

print(df)

I'm expecting each row to contain a random borough name from the WORDS list; instead, every row contains the number 5. It seems pointless to write a separate module just for the text-based data and save it to a different file.


Solution

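  • In your code, random.randrange(len(word)) returns a single integer, and pandas repeats that one value on every row. I guess you want to generate a list of 50,000 random choices from WORDS, which itself could usefully be renamed BOROUGHS: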

    import random
    import pandas as pd
    import numpy as np
    
    SIZE = 50_000
    BOROUGHS = ["Chelsea", "Kensington", "Westminster", "Pimlico", "Bank", "Holborn", "Camden", "Islington", "Angel", "Battersea", "Knightsbridge", "Bermondsey", "Newham"]
    
    np.random.seed(1)
    data3 = pd.DataFrame({"Sq. feet" : np.random.randint(low=75, high=325, size=SIZE),
      "Price" : np.random.randint(low=200000, high=1250000, size=SIZE),
      "Borough" : [random.choice(BOROUGHS) for _ in range(SIZE)]
    })
    
    df = pd.DataFrame(data3)
    df.to_csv("realestate.csv", index=False)
    print(df)
    

    Output

           Sq. feet    Price      Borough
    0           112   345382      Pimlico
    1           310   901500    Battersea
    2           215   661033      Holborn
    3           147  1038431  Westminster
    4           212   296497      Holborn
    ...         ...      ...          ...
    49995       252  1065034      Holborn
    49996       117   752615      Holborn
    49997       238   803058       Camden
    49998       147  1163555         Bank
    49999       269   888623  Westminster
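
The repeated 5 in the question can be reproduced in isolation. The sketch below is a minimal illustration on a three-row frame, not the original script: the names `broadcast` and `per_row`, the shortened borough list, and the `random.seed(0)` call are all assumptions added for the example. It shows that a scalar in the dict is copied to every row, while a list with one entry per row is not.

```python
import random

import pandas as pd

# Shortened list, for illustration only.
BOROUGHS = ["Chelsea", "Kensington", "Westminster"]

random.seed(0)  # assumption: seeded only so repeated runs of this sketch match

# What the question's code did: pick ONE word, then ONE integer.
word = random.choice(BOROUGHS)        # a single string
value = random.randrange(len(word))   # a single int in [0, len(word))

# pandas broadcasts the scalar, so every row gets the same number.
broadcast = pd.DataFrame({"Sq. feet": [112, 310, 215], "Borough": value})

# One random choice per row gives each row its own borough name.
per_row = pd.DataFrame({
    "Sq. feet": [112, 310, 215],
    "Borough": [random.choice(BOROUGHS) for _ in range(3)],
})

print(broadcast)
print(per_row)
```

Which number ends up repeated depends on the word that happened to be chosen, so it will not necessarily be 5.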
    

    Aside: wherever you have a number repeated all over your code, like your 50,000, it's generally a good idea to make it a variable and declare it at the top. Then it can be changed in one place, without causing a maintenance nightmare for some poor future programmer hunting for every occurrence of 50,000.

    This construct is called a "list comprehension" if you want to learn about them:

    [random.choice(BOROUGHS) for _ in range(SIZE)]
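
One caveat with the list comprehension: `np.random.seed(1)` only seeds NumPy, not Python's `random` module, so the Borough column changes on every run while the two numeric columns do not. If a fully reproducible file is wanted, one possible alternative (not what the answer above uses) is to draw all three columns from a single seeded NumPy generator. This is a sketch under that assumption; the name `rng` is illustrative, and the values it produces will differ from those of the legacy `np.random.seed(1)` code.

```python
import numpy as np
import pandas as pd

SIZE = 50_000
BOROUGHS = ["Chelsea", "Kensington", "Westminster", "Pimlico", "Bank",
            "Holborn", "Camden", "Islington", "Angel", "Battersea",
            "Knightsbridge", "Bermondsey", "Newham"]

# One seeded generator drives every column, so reruns give the same data.
rng = np.random.default_rng(1)

df = pd.DataFrame({
    # integers() excludes `high` by default, like np.random.randint.
    "Sq. feet": rng.integers(low=75, high=325, size=SIZE),
    "Price": rng.integers(low=200_000, high=1_250_000, size=SIZE),
    # choice() with size=SIZE draws one borough per row in a single call.
    "Borough": rng.choice(BOROUGHS, size=SIZE),
})

print(df.head())
```

Writing the frame out is unchanged: `df.to_csv("realestate.csv", index=False)`.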