So I'm trying to generate dummy data that contains 3 columns: sq. feet, price and Borough. For the first two, which are purely numerical, this works fine: I have 50,000 rows of data for both in a spreadsheet. However, when I add Borough and specify random values from a list, I receive the following output:
Sq. feet Price Borough
0 112 345382 5
1 310 901500 5
2 215 661033 5
3 147 1038431 5
4 212 296497 5
For the Borough column I have not used a numerical generation function like np.random.randint.
Instead I used "Borough" : random.randrange(len(word))
Where have I gone wrong?
My code below
import random
import pandas as pd
import numpy as np
WORDS = ["Chelsea", "Kensington", "Westminster", "Pimlico", "Bank", "Holborn", "Camden", "Islington", "Angel", "Battersea", "Knightsbridge", "Bermondsey", "Newham"]
word = random.choice(WORDS)
np.random.seed(1)
data3 = pd.DataFrame({
    "Sq. feet" : np.random.randint(low=75, high=325, size=50000),
    "Price" : np.random.randint(low=200000, high=1250000, size=50000),
    "Borough" : random.randrange(len(word))
})
df = pd.DataFrame(data3)
df.to_csv("/Users/thomasmcnally/PycharmProjects/real_estate_dummy_date/realestate.csv", index=False)
print(df)
I'm expecting a column of random values drawn from the WORDS list; instead, every row just contains the number 5. It seems pointless to make another module just for the text-based data and write it to a different file.
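random.choice(WORDS) picks a single word, and random.randrange(len(word)) then returns one random integer smaller than that word's length. pandas repeats that single value on every row, which is why the whole column is 5. I guess you want to generate a list of 50,000 random choices from WORDS - which itself could usefully be renamed BOROUGHS: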
import random
import pandas as pd
import numpy as np
SIZE = 50_000
BOROUGHS = ["Chelsea", "Kensington", "Westminster", "Pimlico", "Bank", "Holborn", "Camden", "Islington", "Angel", "Battersea", "Knightsbridge", "Bermondsey", "Newham"]
np.random.seed(1)
data3 = pd.DataFrame({
    "Sq. feet" : np.random.randint(low=75, high=325, size=SIZE),
    "Price" : np.random.randint(low=200000, high=1250000, size=SIZE),
    "Borough" : [random.choice(BOROUGHS) for _ in range(SIZE)]
})
df = pd.DataFrame(data3)
df.to_csv("realestate.csv", index=False)
print(df)
Output
Sq. feet Price Borough
0 112 345382 Pimlico
1 310 901500 Battersea
2 215 661033 Holborn
3 147 1038431 Westminster
4 212 296497 Holborn
... ... ... ...
49995 252 1065034 Holborn
49996 117 752615 Holborn
49997 238 803058 Camden
49998 147 1163555 Bank
49999 269 888623 Westminster
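One thing to be aware of: np.random.seed(1) only seeds NumPy, not Python's random module, so the Borough column above will differ from run to run while the two numeric columns stay the same. If you want the whole frame to be reproducible, one option is to draw all three columns from a single seeded NumPy generator. The following is only a sketch of that idea; the function name make_dummy_data and its seed parameter are my own invention, and the values it produces will not match the output shown above because it uses a different generator.

```python
import numpy as np
import pandas as pd

SIZE = 50_000
BOROUGHS = ["Chelsea", "Kensington", "Westminster", "Pimlico", "Bank",
            "Holborn", "Camden", "Islington", "Angel", "Battersea",
            "Knightsbridge", "Bermondsey", "Newham"]


def make_dummy_data(seed=1):
    """Hypothetical helper: build the dummy frame from one seeded generator."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        # integers() excludes the upper bound, like np.random.randint
        "Sq. feet": rng.integers(low=75, high=325, size=SIZE),
        "Price": rng.integers(low=200_000, high=1_250_000, size=SIZE),
        # choice() draws SIZE boroughs (with replacement) in one call
        "Borough": rng.choice(BOROUGHS, size=SIZE),
    })


df = make_dummy_data(seed=1)
print(df.head())
```

Calling make_dummy_data(seed=1) twice should give identical frames, which is handy if you need to regenerate the same CSV later.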
Aside: wherever you have a number repeated all over your code, like your 50,000, it's generally a good idea to make it a variable and declare it at the top. Then it can be changed in one place, without causing a maintenance nightmare for some poor future programmer hunting for every occurrence of 50,000.
This construct is called a "list comprehension" if you want to learn about them:
[random.choice(BOROUGHS) for _ in range(SIZE)]
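If the syntax is unfamiliar, it may help to see it next to the plain for loop it replaces. This is just an illustration with a shortened list and a small SIZE so the result is easy to inspect; the names picks_loop and picks_comprehension are made up for the example.

```python
import random

BOROUGHS = ["Chelsea", "Kensington", "Westminster"]  # shortened for illustration
SIZE = 5  # small, so the lists are easy to read when printed

# Explicit loop: start with an empty list and append one random pick per pass.
picks_loop = []
for _ in range(SIZE):
    picks_loop.append(random.choice(BOROUGHS))

# List comprehension: builds the same kind of list in a single expression.
# The underscore is a conventional name for a loop variable that is not used.
picks_comprehension = [random.choice(BOROUGHS) for _ in range(SIZE)]

print(picks_loop)
print(picks_comprehension)
```

Both lists contain SIZE random boroughs. The actual values will differ between the two and between runs, since each call to random.choice is an independent draw.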