Tags: python, csv, matplotlib, graph, diagram

How do I make a graph/diagram from a CSV file in Python?


This is my first time asking a question on this forum, so hopefully I won't make a fool of myself. I am a student in an IT program, and I was briefly introduced to the csv and Matplotlib libraries today. The assignment is to make a graph/diagram of the maximum and minimum temperatures and the corresponding dates in this CSV file. I need the row numbers, and I need the program to understand the right format/syntax of the cells, but I am really not sure how to do that.

Example of the CSV file:

"STATION","NAME","DATE","PRCP","TMAX","TMIN","TOBS"
"USC00042319","DEATH VALLEY, CA US","2018-01-01","0.00","65","34","42"
"USC00042319","DEATH VALLEY, CA US","2018-01-02","0.00","61","38","46"
"USC00042319","DEATH VALLEY, CA US","2018-01-03","0.00","69","34","54"
"USC00042319","DEATH VALLEY, CA US","2018-01-04","0.00","69","39","48"
"USC00042319","DEATH VALLEY, CA US","2018-01-05","0.00","74","40","57"
"USC00042319","DEATH VALLEY, CA US","2018-01-06","0.00","74","47","65"
"USC00042319","DEATH VALLEY, CA US","2018-01-07","0.00","77","54","60"
"USC00042319","DEATH VALLEY, CA US","2018-01-08","0.07","62","52","52"
"USC00042319","DEATH VALLEY, CA US","2018-01-09","0.40","60","51","51"
"USC00042319","DEATH VALLEY, CA US","2018-01-10","0.00","64","49","50"

This is what I have so far:

import csv
import matplotlib.pyplot as plt

filename = 'death_valley_2018_simple.csv'
with open(filename) as f:
    csv_reader = csv.reader(f, delimiter=',')
    line_count = 0

    for row in f:
        x=(row[4], row[5])
        y=(row[2])
        print(row[2])
        print(row[4])
        print(row[5])

plt.bar(x,y)
plt.xticks(y)
plt.ylabel('Dates')
plt.title('Plot')
plt.show()

The result is this "bar graph". I read other forum posts here, asked around on Discord, and read the documentation for csv. Maybe the answer is there, but if so, I don't understand it. I hope someone can explain this to me like I'm 5 years old.


Solution

  • Personal Advice

    Don't worry; I got you. But first, some advice. I remember when I posted my first question on this forum: I didn't know the proper way to ask a question (and my English wasn't that good at the time). The key to asking a good question is to search first (which you did), and then, if you don't find an answer, to ask your question as clearly and as briefly as possible. I'm not saying you should leave out information, but if you can ask your question in fewer words and it is still just as clear, you should. Why? Because the truth is that many people will skip a question if it is long. Just now, when I opened your question and saw all the lines, I was a little intimidated and wanted to skip it :D, but I solved it in a few minutes, and it wasn't scary at all. I am less concerned about writing long answers, because people who have the problem will read the answer if they have to. Please note that all of this is just my personal experience. You should also look for better beginner guides on asking questions on this forum and similar platforms. My suggestion: http://www.catb.org/~esr/faqs/smart-questions.html

    Now the Answer

    Instead of the csv library, which is part of the Python standard library (meaning it comes with Python and doesn't need to be installed separately), I prefer using pandas. pandas will make your life much easier, but you have to install it first:

    pip install pandas
    

    Now it's quite simple. Let's import everything and load the CSV file.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    filename = 'death_valley_2018_simple.csv'
    dataframe = pd.read_csv(filename)
    

    dataframe now contains your CSV file's rows and columns. We need to convert the DATE column from str to datetime.

    dataframe["DATE"] = pd.to_datetime(dataframe['DATE'], format="%Y-%m-%d")
    

    Here we are telling pandas to convert the DATE column to datetime, and the format argument tells it where the year, month, and day are. %Y represents the four-digit year, then there is a dash, %m represents the month, then another dash, and %d represents the day. We use a capital Y because lowercase %y represents a two-digit year (only the last two digits). In this case the format is straightforward enough that pandas would work out how to convert this column to datetime even if we didn't specify it.
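To make the format codes concrete, here is a minimal sketch that uses only the standard-library datetime module (not pandas) to parse the same date once with %Y and once with %y. The date strings come from the sample CSV; the variable names are made up for this example.

```python
from datetime import datetime

# %Y expects a four-digit year, %m a two-digit month, %d a two-digit day.
full = datetime.strptime("2018-01-09", "%Y-%m-%d")

# %y expects only the last two digits of the year.
short = datetime.strptime("18-01-09", "%y-%m-%d")

print(full)           # 2018-01-09 00:00:00
print(full == short)  # True: both describe the same day

# Using %y on a four-digit year does not match and raises ValueError.
mismatch = None
try:
    datetime.strptime("2018-01-09", "%y-%m-%d")
except ValueError as error:
    mismatch = str(error)
print(mismatch)
```

If the format string and the text disagree, you get an error instead of a silently wrong date, which is usually what you want.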

    Now we just have to plot our diagram/graph:

    fig, ax = plt.subplots()
    ax.plot(dataframe["DATE"], dataframe["TMAX"])
    ax.plot(dataframe["DATE"], dataframe["TMIN"])
    fig.autofmt_xdate()
    # plt.show() keeps the window open when run as a script; fig.show() may close it right away
    plt.show()
    

    So after doing everything, your code should look like this:

    import pandas as pd
    import matplotlib.pyplot as plt
    
    filename = 'death_valley_2018_simple.csv'
    dataframe = pd.read_csv(filename)
    
    dataframe["DATE"] = pd.to_datetime(dataframe['DATE'], format="%Y-%m-%d")
    
    fig, ax = plt.subplots()
    ax.plot(dataframe["DATE"], dataframe["TMAX"])
    ax.plot(dataframe["DATE"], dataframe["TMIN"])
    fig.autofmt_xdate()
    # plt.show() keeps the window open when run as a script; fig.show() may close it right away
    plt.show()
    

    Without pandas

    You can do the exact same thing without the pandas library; you just have to do some things manually.

    Importing the libraries (no pandas this time):

    import csv
    import datetime
    
    import matplotlib.pyplot as plt
    

    This will create a Python dictionary of lists (one list per column), similar to a pandas DataFrame:

    filename = "death_valley_2018_simple.csv"
    
    # The csv documentation recommends newline="" when opening a file for csv.reader
    with open(filename, "r", newline="") as file:
        csv_reader = csv.reader(file)
        headers = next(csv_reader)
        
        data = {}
        for title in headers:
            data[title] = []
    
        for row in csv_reader:
            for i, title in enumerate(headers):
                data[title].append(row[i])
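If you want to see what that dictionary looks like without the real file, here is a small sketch of the same idea. It assumes two rows copied from the sample in the question and uses io.StringIO as a stand-in for the opened file; everything else follows the code above.

```python
import csv
import io

# Two rows from the sample file; io.StringIO stands in for the real file
# so this runs without death_valley_2018_simple.csv on disk.
sample = io.StringIO(
    '"STATION","NAME","DATE","PRCP","TMAX","TMIN","TOBS"\n'
    '"USC00042319","DEATH VALLEY, CA US","2018-01-01","0.00","65","34","42"\n'
    '"USC00042319","DEATH VALLEY, CA US","2018-01-02","0.00","61","38","46"\n'
)

csv_reader = csv.reader(sample)
headers = next(csv_reader)  # the first row holds the column names

# One empty list per column, then fill the lists row by row.
data = {title: [] for title in headers}
for row in csv_reader:
    for i, title in enumerate(headers):
        data[title].append(row[i])

print(data["DATE"])     # ['2018-01-01', '2018-01-02']
print(data["TMAX"])     # ['65', '61'] (still strings at this point)
print(data["NAME"][0])  # DEATH VALLEY, CA US
```

Note that the station name contains a comma. csv.reader handles the quotes for you, which is one reason to use it instead of splitting each line on commas yourself.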
    

    As before, we need to convert the DATE column from str to datetime. We also have to convert the TMAX and TMIN columns to int; pandas did this for us automatically. The first loop takes care of the DATE column, and the second and third loops handle the TMAX and TMIN columns.

    for i in range(len(data["DATE"])):
        data["DATE"][i] = datetime.datetime.strptime(data["DATE"][i], "%Y-%m-%d")
    
    for i in range(len(data["TMAX"])):
        data["TMAX"][i] = int(data["TMAX"][i])
    
    for i in range(len(data["TMIN"])):
        data["TMIN"][i] = int(data["TMIN"][i])
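The same conversion can also be written with list comprehensions, if you find that easier to read. This is only an alternative sketch: the small data dictionary below is hand-written from the sample rows so the example runs on its own.

```python
from datetime import datetime

# Hand-written stand-in for the dictionary built from the CSV file.
data = {
    "DATE": ["2018-01-01", "2018-01-02", "2018-01-03"],
    "TMAX": ["65", "61", "69"],
    "TMIN": ["34", "38", "34"],
}

# Replace each column with a converted copy of itself.
data["DATE"] = [datetime.strptime(value, "%Y-%m-%d") for value in data["DATE"]]
for column in ("TMAX", "TMIN"):
    data[column] = [int(value) for value in data[column]]

print(data["TMAX"])       # [65, 61, 69]
print(max(data["TMAX"]))  # 69
```

The int conversion matters for plotting: if the temperatures stay as strings, Matplotlib treats them as category labels rather than numbers.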
    

    Now, we can plot our diagram/graph:

    fig, ax = plt.subplots()
    ax.plot(data["DATE"], data["TMAX"])
    ax.plot(data["DATE"], data["TMIN"])
    fig.autofmt_xdate()
    # plt.show() keeps the window open when run as a script; fig.show() may close it right away
    plt.show()
    

    So after doing everything, your code should look like this:

    import csv
    import datetime
    
    import matplotlib.pyplot as plt
    
    
    filename = "death_valley_2018_simple.csv"
    
    # The csv documentation recommends newline="" when opening a file for csv.reader
    with open(filename, "r", newline="") as file:
        csv_reader = csv.reader(file)
        headers = next(csv_reader)
        
        data = {}
        for title in headers:
            data[title] = []
    
        for row in csv_reader:
            for i, title in enumerate(headers):
                data[title].append(row[i])
    
    for i in range(len(data["DATE"])):
        data["DATE"][i] = datetime.datetime.strptime(data["DATE"][i], "%Y-%m-%d")
    
    for i in range(len(data["TMAX"])):
        data["TMAX"][i] = int(data["TMAX"][i])
    
    for i in range(len(data["TMIN"])):
        data["TMIN"][i] = int(data["TMIN"][i])
    
    fig, ax = plt.subplots()
    ax.plot(data["DATE"], data["TMAX"])
    ax.plot(data["DATE"], data["TMIN"])
    fig.autofmt_xdate()
    # plt.show() keeps the window open when run as a script; fig.show() may close it right away
    plt.show()
    

    Hard Coding, a Rookie Mistake

    You said:

    There is 365 lines in the file, so maybe it would be nice to limit the program to taking maybe the first 10 lines

    Search for "hard coding" and read about it. Hard coding is a common beginner mistake; I've done it a thousand times, but you have to be aware of it. The code above is written so that it doesn't matter whether the CSV file has 10 rows or 10,000 rows. Hard coding means embedding unnecessary data in your program so that it only works for certain inputs. You shouldn't write a program that only works if there are 10 rows or 100 rows; you should write it so that it works without knowing the number of rows in advance.
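If you do want to look at only the first few rows while testing, one way to avoid hard coding is to make the limit an optional parameter. The sketch below is just an illustration: read_rows and limit are names I made up, and the sample text stands in for the real file.

```python
import csv
import io
from itertools import islice

# Three rows from the sample file, used in place of the real CSV.
SAMPLE = (
    '"STATION","NAME","DATE","PRCP","TMAX","TMIN","TOBS"\n'
    '"USC00042319","DEATH VALLEY, CA US","2018-01-01","0.00","65","34","42"\n'
    '"USC00042319","DEATH VALLEY, CA US","2018-01-02","0.00","61","38","46"\n'
    '"USC00042319","DEATH VALLEY, CA US","2018-01-03","0.00","69","34","54"\n'
)


def read_rows(lines, limit=None):
    """Return the header and up to `limit` data rows (all rows if limit is None)."""
    reader = csv.reader(lines)
    headers = next(reader)
    rows = list(islice(reader, limit))  # islice stops early, or never if limit is None
    return headers, rows


headers, all_rows = read_rows(io.StringIO(SAMPLE))
_, first_two = read_rows(io.StringIO(SAMPLE), limit=2)
print(len(all_rows), len(first_two))  # 3 2
```

The function never needs to know how many rows the file has; the caller decides whether to cap it, and the default reads everything.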