Search code examples
pythonexcelxlrd

Read data from excel file for analysis


I have the following two codes that I wrote and the excel file. I just don't understand how to combine them, so that I can read from the excel file and using those numbers make a cluster.

import matplotlib.pyplot as plt
import xlrd
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans

fileWorkspace = 'C://Users/bob/Desktop/'

pull=[]
wb1 = xlrd.open_workbook(fileWorkspace + 'try.xlsx')
sh1 = wb1.sheet_by_index(0)

for a in range(0,sh1.nrows):
    for b in range(0,sh1.ncols):
        pull.append(sh1.cell(a,b).value)
    print('Finished in row' + str(a))


x = [11,19,23,33,44,91,92,90,60,63]

y = [92,85,22,25,86,78,63,51,66,15]


X = [list(item) for item in zip(x,y)]

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

colors = ["g.","r.","y.","c.","m.","k."]

for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths=5, zorder=10)
plt.show()

The excel file image: Excel image

It is complicated because I have to read that single row of data and then make clusters. Also I have to skip rows and columns to read them.


Solution

  • If you're not opposed to using pandas you can use the read_excel function by doing:

    import pandas as pd
    # Read in data from first sheet
    df = pd.read_excel(filename, sheetname=0, parse_cols='B:D', index_col=0, header=[0,1])
    

    That way you can deal with the blank column as well as the header and data labels at the same time. From there you can access the data via df.values in the form of a numpy array, or get a list of lists (y, x) pairs by doing:

    pairs = df.values.tolist()
    

    You can also use xlrd by iterating through the rows and/or columns using the appropriate ranges. For example, if you wanted to read in just the data in your example file into a list of lists you could do something like:

    import xlrd
    
    workbook = xlrd.open_workbook(filename)
    sheet = workbook.sheet_by_index(0)
    
    pairs = []
    # Iterate through rows starting at the 3rd row)
    for i in range(2, 15):
        # Iterate through columns starting at the 3rd column
        pairs.append([sheet.cell(i, j).value for j in range(2, 4)])
    

    There's probably a better way to do this in xlrd, but I rarely ever use it.