f# type-providers f#-interactive f#-data fsharp.data.typeproviders

F# CSV TypeProvider less robust in console application


I am trying to experiment with live data from the Coronavirus pandemic (unfortunately and good luck to all of us).

I have developed a small script and I am transitioning it into a console application. It uses CSV type providers.

I have the following issue. Suppose we want to filter the Italian data by province. We can use this code in a .fsx file:

open FSharp.Data

let provinceData = CsvProvider< @"https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-province/dpc-covid19-ita-province.csv" , IgnoreErrors = true>.GetSample()


let filterDataByProvince province = 
    provinceData.Rows
    |> Seq.filter (fun x -> x.Sigla_provincia = province)

Since sequences are lazy, suppose I force the data for the province of Rome to be loaded into memory by adding:

let romeProvince = filterDataByProvince "RM" |> Seq.toArray

This works fine when run locally in FSI.

Now I move this code into a console application, in a .fs file. I declare exactly the same functions and use exactly the same type provider declaration, but instead of using the last line to gather the data, I put it into a main function:

[<EntryPoint>]
let main _ =
    let romeProvince = filterDataByProvince "RM" |> Seq.toArray

    Console.Read() |> ignore
    0

This results in the following runtime exception:

System.Exception
  HResult=0x80131500
  Message=totale_casi is missing
  Source=FSharp.Data
  StackTrace:
   at <StartupCode$FSharp-Data>[email protected](String message)
   at [email protected](Object parent, String[] row) in C:\Users\glddm\source\repos\CoronaSchiatta\CoronaSchiatta\CoronaEvolution.fs:line 10
   at FSharp.Data.Runtime.CsvHelpers.parseIntoTypedRows@174.GenerateNext(IEnumerable`1& next)

Can you explain that?

Possibly some rows have an odd format, but the FSI session is robust to those, whilst the console version is fragile. Why? How can I fix that?

I am using VS2019 Community Edition, targeting .NET Framework 4.7.2, F# runtime: 4.7.0.0; as FSI, I am using the following: FSI Microsoft (R) F# Interactive version 10.7.0.0 for F# 4.7

PS: Please also be aware that if I use CsvFile, instead of type providers, as in:

let test =
    @"https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-province/dpc-covid19-ita-province.csv"
    |> CsvFile.Load
    |> (fun x -> x.Rows)
    |> Seq.filter (fun x -> x.[6] = "RM")
    |> Seq.iter (fun x -> x.[9] |> Console.WriteLine)

Then it works like a charm in the console application too. Of course I would like to use type providers; otherwise I have to add a type definition mapping the schema to the columns (which would be more fragile). The last line was just a quick test.


Solution

  • Fragility

    CSV Type Providers can be fragile if you don't have a good schema or sample.

    A runtime error like this almost certainly means that some row doesn't match the inferred schema. How do you find it? One way is to iterate over the data and print each row:

    provinceData.Rows |> Seq.iteri (fun i x -> printfn "Row %d: %A" (i + 1) x)
    

    This runs up to Row 2150. And sure enough, the next row:

    2020-03-11 17:00:00,ITA,19,Sicilia,994,In fase di definizione/aggiornamento,,0,0,
    

    You can see the last value (totale_casi) is missing.
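
Another way to locate rows with gaps, without going through the typed provider at all, is a quick scan with the untyped CsvFile API. The following is a minimal sketch, assuming a recent `dotnet fsi` that supports `#r "nuget: ..."`. The names `excerpt`, `emptyColumns` and `rowsWithGaps` are illustrative, and the two data rows are copied from this answer rather than downloaded.

```fsharp
#r "nuget: FSharp.Data"
open System
open FSharp.Data

// Header plus two rows quoted in this answer: one complete row and the failing one.
let excerpt = """data,stato,codice_regione,denominazione_regione,codice_provincia,denominazione_provincia,sigla_provincia,lat,long,totale_casi
2020-02-24 18:00:00,ITA,13,Abruzzo,069,Chieti,CH,42.35103167,14.16754574,0
2020-03-11 17:00:00,ITA,19,Sicilia,994,In fase di definizione/aggiornamento,,0,0,"""

// Untyped parse: every field is a plain string, so a gap cannot throw here.
let csv = CsvFile.Parse excerpt
let headers = csv.Headers |> Option.defaultValue [||]

// Names of the columns that are empty in a given row.
let emptyColumns (row: CsvRow) =
    row.Columns
    |> Array.indexed
    |> Array.filter (fun (_, value) -> String.IsNullOrWhiteSpace value)
    |> Array.map (fun (j, _) -> if j < headers.Length then headers.[j] else string j)

// 1-based data row number paired with its empty columns, for rows that have any.
let rowsWithGaps =
    csv.Rows
    |> Seq.indexed
    |> Seq.map (fun (i, row) -> i + 1, emptyColumns row)
    |> Seq.filter (fun (_, gaps) -> gaps.Length > 0)
    |> Seq.toArray

for (n, gaps) in rowsWithGaps do
    printfn "Row %d is missing: %s" n (String.concat ", " gaps)
```

To scan the real file, `CsvFile.Parse excerpt` could be swapped for `CsvFile.Load` with the CSV address.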

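    One of CsvProvider's options is InferRows: the number of rows the provider scans to infer the schema. Its default value is 1000, so a row with a missing value that appears well after row 1000 never influences the inferred type.

    Setting InferRows to 0 makes the provider use every row for inference (uri here must be a literal string holding the CSV address):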

    type COVID = CsvProvider<uri, InferRows = 0>
    

    A better way to prevent this from happening in the future is to manually define a sample from a sub-set of the data that includes a row with a missing value:

    type COVID = CsvProvider<"sample-dpc-covid19-ita-province.csv">
    

    and sample-dpc-covid19-ita-province.csv is:

        data,stato,codice_regione,denominazione_regione,codice_provincia,denominazione_provincia,sigla_provincia,lat,long,totale_casi
        2020-02-24 18:00:00,ITA,13,Abruzzo,069,Chieti,CH,42.35103167,14.16754574,0
        2020-02-24 18:00:00,ITA,13,Abruzzo,066,L'Aquila,AQ,42.35122196,13.39843823,
        2020-02-24 18:00:00,ITA,13,Abruzzo,068,Pescara,PE,42.46458398,14.21364822,0
        2020-02-24 18:00:00,ITA,13,Abruzzo,067,Teramo,TE,42.6589177,13.70439971,0
    

    With this sample, totale_casi is inferred as Nullable<int>, so a missing value no longer throws.
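
As a rough illustration of working with that Nullable<int> column, the sketch below inlines the sample as a literal string instead of a file and parses two rows quoted in this answer. It assumes a recent `dotnet fsi` with `#r "nuget: ..."` support. `Sample`, `Covid`, `live` and `cases` are illustrative names, and the property names follow the `Sigla_provincia` pattern used in the question.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Inline copy of the hand-made sample; the L'Aquila row has no totale_casi.
[<Literal>]
let Sample = """data,stato,codice_regione,denominazione_regione,codice_provincia,denominazione_provincia,sigla_provincia,lat,long,totale_casi
2020-02-24 18:00:00,ITA,13,Abruzzo,069,Chieti,CH,42.35103167,14.16754574,0
2020-02-24 18:00:00,ITA,13,Abruzzo,066,L'Aquila,AQ,42.35122196,13.39843823,
2020-02-24 18:00:00,ITA,13,Abruzzo,068,Pescara,PE,42.46458398,14.21364822,0
2020-02-24 18:00:00,ITA,13,Abruzzo,067,Teramo,TE,42.6589177,13.70439971,0"""

type Covid = CsvProvider<Sample>

// Stand-in for the live feed: one complete row and the row that used to fail.
let live = Covid.Parse """data,stato,codice_regione,denominazione_regione,codice_provincia,denominazione_provincia,sigla_provincia,lat,long,totale_casi
2020-02-24 18:00:00,ITA,13,Abruzzo,069,Chieti,CH,42.35103167,14.16754574,0
2020-03-11 17:00:00,ITA,19,Sicilia,994,In fase di definizione/aggiornamento,,0,0,"""

// Convert Nullable<int> to an option so missing totals must be handled explicitly.
let cases =
    live.Rows
    |> Seq.map (fun r -> r.Sigla_provincia, Option.ofNullable r.Totale_casi)
    |> Seq.toArray

for (province, total) in cases do
    match total with
    | Some n -> printfn "%s: %d" province n
    | None -> printfn "%s: no total reported" province
```

With a sample file on disk, the same type could load the real data through `Covid.Load` and the CSV address instead of `Covid.Parse`.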

    If you don't mind NaN values, you can also tell the provider to assume that any column may contain missing values, even when the sample has none:

    CsvProvider<..., AssumeMissingValues = true>
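
A small sketch of that option follows, again assuming a recent `dotnet fsi` with `#r "nuget: ..."` support. `CompleteSample`, `CovidLenient`, `rows` and `missingTotals` are illustrative names. The sample deliberately contains no gaps, so it is only the AssumeMissingValues flag that lets the second parsed row through.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// A sample with every value present: on its own it would infer a plain int total.
[<Literal>]
let CompleteSample = """data,stato,codice_regione,denominazione_regione,codice_provincia,denominazione_provincia,sigla_provincia,lat,long,totale_casi
2020-02-24 18:00:00,ITA,13,Abruzzo,069,Chieti,CH,42.35103167,14.16754574,0"""

// AssumeMissingValues = true: treat every column as possibly missing.
type CovidLenient = CsvProvider<CompleteSample, AssumeMissingValues = true>

let parsed = CovidLenient.Parse """data,stato,codice_regione,denominazione_regione,codice_provincia,denominazione_provincia,sigla_provincia,lat,long,totale_casi
2020-02-24 18:00:00,ITA,13,Abruzzo,069,Chieti,CH,42.35103167,14.16754574,0
2020-03-11 17:00:00,ITA,19,Sicilia,994,In fase di definizione/aggiornamento,,0,0,"""

// Forcing the rows no longer throws on the empty totale_casi field.
let rows = parsed.Rows |> Seq.toArray

// Count the rows whose total is absent.
let missingTotals =
    rows
    |> Array.filter (fun r -> not r.Totale_casi.HasValue)
    |> Array.length

printfn "%d of %d rows have no total" missingTotals rows.Length
```

The trade-off is that every non-string column becomes nullable or NaN-capable, so each use of such a column has to account for a missing value.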
    

    Why does FSI seem more robust?

    FSI isn't more robust. This is my best guess:

    Your schema source is regularly updated. Type providers cache the schema so that it isn't regenerated every time you compile your code, which would be impractical. When you restart an FSI session, the type provider is regenerated, but that does not happen with the console application. So FSI can sometimes appear less error-prone simply because it has worked from a newer version of the source.