
Use pandas.read_csv docstring in the function I'm writing


I'd like to write a function with the following header:

def split_csv(file, sep=";", output_path=".", nrows=None, chunksize=None, low_memory=True, usecols=None):

As you can see, I am using several of the same parameters as pd.read_csv. What I would like to know (or do) is how to forward the docstring entries for these parameters from read_csv to my own function, without having to copy/paste them.

EDIT: As I understand it, there is no out-of-the-box solution for this, so perhaps building one is in order. What I have in mind:

some_new_fancy_library.get_doc(for_function=pandas.read_csv, for_parameters=['sep', 'nrows']) would output:

{'sep': 'doc as found in the docstring', 'nrows' : 'doc as found in the docstring', ...}

and then it would just be a matter of inserting the dictionary's values into my own function's docstring.
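For illustration, here is a rough sketch of what such a helper could look like, using only the standard library and assuming the numpydoc layout that pandas docstrings follow (get_doc and its argument names are the hypothetical ones from above, not a real library):

    import re

    import pandas as pd

    def get_doc(for_function, for_parameters):
        # Split on lines that look like a numpydoc parameter header,
        # e.g. "sep : str, default ','".
        parts = re.split(r'\n(\w+ : .*)', for_function.__doc__)
        docs = {}
        # parts alternates [text, header, description, header, description, ...]
        for i in range(1, len(parts) - 1, 2):
            name = parts[i].split(' :')[0]
            if name in for_parameters:
                # Rough cut: the description chunk runs until the next header,
                # so the last matched parameter may drag in the next section.
                docs[name] = parts[i] + parts[i + 1].rstrip()
        return docs

    get_doc(pd.read_csv, ['sep', 'nrows'])
    # {'sep': "sep : str, default ','\n    Delimiter to use. ...",
    #  'nrows': 'nrows : int, default None\n    Number of rows of file to read. ...'}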

Cheers


Solution

  • You could parse the docstring with a regex and attach the matched parameter descriptions to your function:

    import re

    import pandas as pd

    pat = re.compile(r'(\w+ :)')    # capturing group for parameter headers such as 'sep :'

    parts = pat.split(pd.read_csv.__doc__)

    # Only the function's actual parameters should match; co_varnames also lists
    # locals, so slice it to the first co_argcount names.
    args = split_csv.__code__.co_varnames[:split_csv.__code__.co_argcount]

    # Compare the parsed docstring against those parameters and keep each matched
    # header together with the description chunk that follows it.
    docstrings = '\n'.join(''.join(parts[i: i + 2])
                           for i, s in enumerate(parts) if s.rstrip(' :') in args)

    split_csv.__doc__ = docstrings

    help(split_csv)
    
    # Help on function split_csv in module __main__:
    # 
    # split_csv(file, sep=';', output_path='.', nrows=None, chunksize=None, low_memory=True, usecols=None)
    #   sep : str, default ','
    #       Delimiter to use. If sep is None, the C engine cannot automatically detect
    #       the separator, but the Python parsing engine can, meaning the latter will
    #       be used and automatically detect the separator by Python's builtin sniffer
    #       tool, ``csv.Sniffer``. In addition, separators longer than 1 character and
    #       different from ``'\s+'`` will be interpreted as regular expressions and
    #       will also force the use of the Python parsing engine. Note that regex
    #       delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``
    #   
    #   usecols : list-like or callable, default None
    #       Return a subset of the columns. If list-like, all elements must either
    #       be positional (i.e. integer indices into the document columns) or strings
    #       that correspond to column names provided either by the user in `names` or
    #       inferred from the document header row(s). For example, a valid list-like
    #       `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element
    #       order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
    #       To instantiate a DataFrame from ``data`` with element order preserved use
    #       ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
    #       in ``['foo', 'bar']`` order or
    #       ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
    #       for ``['bar', 'foo']`` order.
    #   
    #       If callable, the callable function will be evaluated against the column
    #       names, returning names where the callable function evaluates to True. An
    #       example of a valid callable argument would be ``lambda x: x.upper() in
    #       ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
    #       parsing time and lower memory usage.
    #   
    #   nrows : int, default None
    #       Number of rows of file to read. Useful for reading pieces of large files
    #   
    #   chunksize : int, default None
    #       Return TextFileReader object for iteration.
    #       See the `IO Tools docs
    #       <http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_
    #       for more information on ``iterator`` and ``chunksize``.
    #   
    #   low_memory : boolean, default True
    #       Internally process the file in chunks, resulting in lower memory use
    #       while parsing, but possibly mixed type inference.  To ensure no mixed
    #       types either set False, or specify the type with the `dtype` parameter.
    #       Note that the entire file is read into a single DataFrame regardless,
    #       use the `chunksize` or `iterator` parameter to return the data in chunks.
    #       (Only valid with C parser)
    

    But of course this relies on your parameters having exactly the same names as those of the copied function. And as you can see, you will need to add the docstrings for unmatched parameters yourself (e.g. file, output_path).
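
    One way to fill that gap (a sketch building on the get_doc idea from the question's EDIT, not part of pandas): keep a small dict of hand-written entries for your own parameters and merge it with the extracted ones, for example via a decorator:

    def with_param_docs(source_func, own_docs):
        # own_docs holds hand-written entries for parameters that
        # source_func does not document (hypothetical helper).
        def decorator(func):
            params = func.__code__.co_varnames[:func.__code__.co_argcount]
            copied = get_doc(source_func, params)   # get_doc as sketched above
            entries = [own_docs.get(p, copied.get(p)) for p in params]
            func.__doc__ = '\n'.join(e for e in entries if e)
            return func
        return decorator

    @with_param_docs(pd.read_csv, own_docs={
        'file': 'file : str\n    Path of the csv file to split.',
        'output_path': "output_path : str, default '.'\n    Where to write the chunks.",
    })
    def split_csv(file, sep=";", output_path=".", nrows=None,
                  chunksize=None, low_memory=True, usecols=None):
        ...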