Search code examples
pythonexcelcsvcsvkit

Csvkit Library Usage


I'm looking to convert a given excel file to a csv using csvkit as a library, not from the command line. I'm having trouble finding any information on library usage syntax. Can anyone shed some light on how to use csvkit as a library for this purpose?

My test case is simple - take input.xlsx or input.xls, convert, and save as output.csv. Here's what I've tried so far, which is based on suggestions found elsewhere :

import csvkit

with open('input.xlsx') as csvfile:
    reader = in2csv(csvfile)
    # below is just to test whether the file could be accessed
    for row in reader:
        print(row)

gives

Traceback (most recent call last):
  File "excelconvert.py", line 6, in <module>
    reader = in2csv(csvfile)
NameError: name 'in2csv' is not defined

There's a similar question here, but the answers just seem to reference documentation that either isn't up or doesn't actually explain library usage syntax, it just lists classes. There is an answer suggesting the syntax might be similar to the csv module, which is what I used to make the attempt above, but I'm getting nowhere.


Solution

  • The docs strongly suggest this is meant to be a command line tool, not to be used from inside the Python interpreter. You can do something like this to convert a file to csv from the command line (or you could pop it in a shell script):

    in2csv your_file.xlsx > your_new_file.csv
    

    If you'd like to read the file, just do this (it's similar to what you have, but you don't need any external modules, just use built-in Python):

    with open('input.xlsx') as csvfile:
        reader = csvfile.readlines() # This was the only line of your code I changed
        # below is just to test whether the file could be accessed
        for row in reader:
            print(row)
    

    Or you could call your command line using the os module:

    # Careful, raw sys call. Use subprocess.Popen 
    # if you need to accept untrusted user input here
    os.popen("in2csv your_file.xlsx > your_new_file.csv").read()
    

    One of the snippets above is is probably what you need, but if you're really looking for punishment, you can attempt to use the in2csv file from inside the interpreter. Here's how you might go about doing it (there is no support for this in the docs that I could find, it's just me poking around in the interpreter):

    >>> from csvkit import in2csv
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: cannot import name in2csv
    >>> import csvkit
    >>> help(csvkit)
    Help on package csvkit:
    
    NAME
        csvkit
    
    FILE
        c:\python27\lib\site-packages\csvkit\__init__.py
    
    DESCRIPTION
        This module contains csvkit's superpowered alternative to the standard Python
        CSV reader and writer. It can be used as a drop-in replacement for the standard
        module.
    
        .. warn::
    
            Since version 1.0 csvkit relies on `agate <http://agate.rtfd.org>`_'s
        CSV reader and writer. This module is supported for legacy purposes only and you
        should migrate to using agate.
    
    PACKAGE CONTENTS
        cleanup
        cli
        convert (package)
        exceptions
        grep
        utilities (package)
    

    So you can't import in2csv directly from csvkit (since it isn't listed under PACKAGE CONTENTS). However, if you do a little hunting, you'll find that you can access the package from csvkit.utilities. But it only gets worse from here. If you do more "help hunting" (i.e. calling help from the interpreter) like above, you'll find that the class was designed to be used from the command line. So it's a real pain in the ass to use from inside the interpreter. Here's an example of trying to use the defaults (results in an explosion):

    >>> from csvkit.utilities import in2csv
    >>> i = in2csv.In2CSV()
    >>> i.main()
    usage:  [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
            [-p ESCAPECHAR] [-z FIELD_SIZE_LIMIT] [-e ENCODING] [-S] [-H] [-v]
            [-l] [--zero] [-f FILETYPE] [-s SCHEMA] [-k KEY] [--sheet SHEET]
            [-y SNIFF_LIMIT] [--no-inference]
            [FILE]
    : error: You must specify a format when providing data via STDIN (pipe).
    

    Taking a look at the in2csv.py module, you'll have to monkey patch the args to get it to do what you want from inside the interpreter. Again, this was not designed to be used from inside the interpreter, it was designed to be called from the cmd line (so args is defined if you call it from the cmd line). Something like this seemed to run, but I didn't thoroughly test it:

    >>> from csvkit.utilities import in2csv
    >>> i = in2csv.In2CSV()
    >>> from collections import namedtuple
    >>> i.args = namedtuple("patched_args", "input_path filetype no_inference")
    >>> i.args.input_path = "/path/to/your/file.xlsx"
    >>> i.args.no_inference = True
    >>> i.args.filetype = None
    >>> i.main()