Search code examples
pythonpandasdata-cleaning

Is there a function, or clever method, within Pandas read_csv that checks the first entry of each line & adds it to the Dataframe if condition is met


What I'm doing:

I am reading a csv file into python using pandas. A valid entry will always start with a county.

The issue

The issue I'm having is that there is some issue with the csv file that's invalidating 1.3k lines. Some lines end with too many commas , some lines start with nonsense, some lines have additional BS, some lines have been returned halfway through the entry which causes the next line to be nonsense etc

My question

As I've previously stated, each valid line should start with one of the entries contained in counties. So I thought there might be some way to validate the first entry of each line and check if that entry 'is in counties'. If it is, add it. If not, ignore it.

Is it possible to use one of the arguments within the .read_csv function to overcome the issue? Is there some magic I can do with a lambda function to pull this off, (like I tried above)? Maybe you could suggest a better way?

Please do not suggest:

Modifying the CSV file by hand / directly. Yes I know I can do that but it's not what I'm asking & is not reasonable or viable for large datasets or if calling requests. My goal here is to learn & tackle the issue pythonicly

What I've tried:

I thought trying this would work:

counties = ["Bedfordshire", "Berkshire", "Bristol", ... , "Worcestershire"]

df = pd.read_csv('data/libraries.csv', skiprows=(lambda x: x not in counties))

but alas not.

I have also tried using the built in arguments to solve this issue, for eg lineterminator='\n', skipblanklines=True and a few other things to try to 'constrain' the read and prevent the error but this doesn't work

Small sample of csv:

Library service,Library name,In use 2010,In use 2016,Type of library,Type of closed library,Closed,New building,Replace existing,Notes,Weekly hours open,Weekly hours staffed
Barking and Dagenham,Barking,Yes,Yes,LAL,,,,,,72.0,72.0
Barking and Dagenham,Castle Green,Yes,No,,XL,Mar-13,,,Closed as a public library 31.3.2013. Secondary School library Jo Richardson School,,
Barking and Dagenham,Dagenham,No,Yes,LAL,,,Oct-10,Yes,Replaced Fanshawe and Rectory libraries,56.0,56.0
...
Surrey,Stanwell  ,yes,yes,CRL,,,2012,Yes,Rebuilt on site of old library. Dec 2011 - Jan 2012 Closed 4 weeks to move to new library.,,
Opening hours reduced from 30 to 27 from 5 Sep 2016. Now closed at lunch (1pm-2pm),27.0,yes,,,,,,,,,
Surrey,Stoneleigh ,yes,yes,CRL,,,,,CRL from Feb 2013,0.0,0.0
Surrey,Sunbury ,yes,yes,LAL,,,,,Opening hours reduced from 42 to 39.5 from 5 Sep 2016,39.5,39.5
Surrey,Tattenhams ,yes,yes,CRL,,,,,April 2010 closed for 5 weks for building work.,,
CRL from Nov 2012,0.0,no,,,,,,,,,
...
Worcestershire,Wythall Library,yes,no,,XLR,2016,,,,,
Worcestershire,Wythall Library,yes,yes,LAL,,,2016,"yes, replaced Wythall library",,31.5,31.5
York,Acomb,yes,yes,CL,,,,,,58.5,58.5
York,Bishopthorpe,yes,yes,CL,,,,,,21.5,21.5

In the middle section you can see that one line starts with "Opening hours reduced from..." another starts with "CRL from Nov...". These are the artefacts that I assume are causing massive issues


Solution

  • You could use a wrapper to filter the lines before feeding to read_csv:

    from io import StringIO
    
    def filter_valid(fname, valid):
        def wrapper(fname, valid):
            first = True
            with open(fname) as f:
                for line in f:
                    if first or line.split(',', 1)[0] in valid:
                        yield line
                    first = False
        return StringIO(''.join(wrapper(fname, valid)))
    
    df = pd.read_csv(filter_valid('libraries.csv', set(counties)), on_bad_lines='skip')
    
    print(df)
    

    Output:

      Library service     Library name In use 2010 In use 2016 Type of library Type of closed library  Closed  New building               Replace existing                                                                                       Notes  Weekly hours open  Weekly hours staffed
    0          Surrey       Stanwell           yes         yes             CRL                    NaN     NaN        2012.0                            Yes  Rebuilt on site of old library. Dec 2011 - Jan 2012 Closed 4 weeks to move to new library.                NaN                   NaN
    1          Surrey      Stoneleigh          yes         yes             CRL                    NaN     NaN           NaN                            NaN                                                                           CRL from Feb 2013                0.0                   0.0
    2          Surrey         Sunbury          yes         yes             LAL                    NaN     NaN           NaN                            NaN                                       Opening hours reduced from 42 to 39.5 from 5 Sep 2016               39.5                  39.5
    3          Surrey      Tattenhams          yes         yes             CRL                    NaN     NaN           NaN                            NaN                                             April 2010 closed for 5 weks for building work.                NaN                   NaN
    4  Worcestershire  Wythall Library         yes          no             NaN                    XLR  2016.0           NaN                            NaN                                                                                         NaN                NaN                   NaN
    5  Worcestershire  Wythall Library         yes         yes             LAL                    NaN     NaN        2016.0  yes, replaced Wythall library                                                                                         NaN               31.5                  31.5