Search code examples
pythonpython-polars

Python Polars: Lazy Frame Row Count not equal wc -l


Been experimenting with polars and of the key features that peak my interest is the larger than RAM operations.

I downloaded some files to play with from HERE. On the website: First line in each file is header; 1 line corresponds to 1 record.. WARNING total download is quite large (~1.3GB)! This experiment was done on AWS server (t2.medium, 2cpu, 4GB)

wget https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Shoes_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Office_Products_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Software_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv  .gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz 

gunzip *

Here are the results from wc -l

drwxrwxr-x 3 ubuntu ubuntu       4096 Jun  2 12:44 ../
-rw-rw-r-- 1 ubuntu ubuntu 1243069057 Nov 25  2017 amazon_reviews_us_Office_Products_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu   44891575 Nov 25  2017 amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu 1570176560 Nov 25  2017 amazon_reviews_us_Shoes_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu  249565371 Nov 25  2017 amazon_reviews_us_Software_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu  412542975 Nov 25  2017 amazon_reviews_us_Watches_v1_00.tsv

$ find . -type f -exec cat {} + | wc -l
8398139

$ find . -name '*.tsv' | xargs wc -l
   2642435 ./amazon_reviews_us_Office_Products_v1_00.tsv
    341932 ./amazon_reviews_us_Software_v1_00.tsv
     85982 ./amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv
   4366917 ./amazon_reviews_us_Shoes_v1_00.tsv
    960873 ./amazon_reviews_us_Watches_v1_00.tsv
   8398139 total

Now, if I count the rows using polars using our new fancy lazy function:

import polars as pl

csvfile = "~/data/amazon/*.tsv"
(
    pl.scan_csv(csvfile, separator = '\t')
    .select( 
        pl.len()
        )
    .collect()
)
shape: (1, 1)
┌─────────┐
│ len     │
│ ---     │
│ u32     │
╞═════════╡
│ 4186305 │
└─────────┘

Wow, thats a BIG difference between wc -l and polars. Thats weird... maybe its a data issue. Lets only focus on the column of interest:

csvfile = "~/data/amazon/*.tsv"
(
...     pl.scan_csv(csvfile, separator = '\t')
...     .select( 
...         pl.col("product_category").count()
...         )
...     .collect()
... )
shape: (1, 1)
┌──────────────────┐
│ product_category │
│ ---              │
│ u32              │
╞══════════════════╡
│ 7126095          │
└──────────────────┘

And with .collect(streaming = True):

shape: (1, 1)
┌──────────────────┐
│ product_category │
│ ---              │
│ u32              │
╞══════════════════╡
│ 7125569          │
└──────────────────┘

Ok, still a difference of about 1 million? Lets do it bottom up:

csvfile = "~/data/amazon/*.tsv"
(
    pl.scan_csv(csvfile, separator = '\t') 
    .group_by("product_category")
    .agg(pl.col("product_category").count().alias("counts"))
    .collect(streaming = True)
    .filter(pl.col('counts') > 100)
    .sort(pl.col("counts"), descending = True)
    .select(
        pl.col('counts').sum()
    )
)
shape: (1, 1)
┌─────────┐
│ counts  │
│ ---     │
│ u32     │
╞═════════╡
│ 7125553 │
└─────────┘

Close, albeit that its once again a different count...

Some more checks using R:

library(vroom)
library(purrr)
library(glue)
library(logger)
amazon <- list.files("~/data/amazon/", full.names = TRUE)
f <- function(file){
     df <- vroom(file, col_select = 'product_category', show_col_types=FALSE )
     log_info(glue("File [{basename(file)}] has [{nrow(df)}] rows"))
}

walk(amazon, f)
INFO [2023-06-02 14:23:40] File [amazon_reviews_us_Office_Products_v1_00.tsv] has [2633651] rows
INFO [2023-06-02 14:23:41] File [amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv] has [85898] rows
INFO [2023-06-02 14:24:06] File [amazon_reviews_us_Shoes_v1_00.tsv] has [4353998] rows
INFO [2023-06-02 14:24:30] File [amazon_reviews_us_Software_v1_00.tsv] has [331152] rows
INFO [2023-06-02 14:24:37] File [amazon_reviews_us_Watches_v1_00.tsv] has [943763] rows

Total: 8348462

Ok. Screw it. Basically a random number generating exercise and nothing is real.

Surely if its a data hygiene issue the error should be constant? Any idea why there might be such a large discrepancy?


Solution

  • It's usually helpful to declare the size of downloads in cases like this.

    I tried pandas to debug this, it cannot read any of these files:

    pd.read_csv('amazon-reviews/amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv', sep='\t')
    
    ParserError: Error tokenizing data. C error: 
     Expected 15 fields in line 1598, saw 22
    

    Line 1598:

    US  3878437 R3BH4UXFRP6F8L  B00J7G8EL0  381088677   GUM Expanding Floss - 30 m - 2 pk   Personal_Care_Appliances    
    4   0   0   N   Y   " like the REACH woven that's no longer available--THAT was the wish it was a bit &#34;fluffier,&#34; like the REACH woven that's no longer available--THAT was the best    2015-08-06
    

    The issue is the single " character, you need to disable the default quoting behaviour.

    With that change I get a total count of 8398134 each time.


    polars

    (pl.scan_csv('amazon-reviews/*.tsv', separator='\t', quote_char=None)
       .select(pl.len())
       .collect()
    )
    
    CPU times: user 3.65 s, sys: 2.02 s, total: 5.67 s
    Wall time: 2.48 s
    shape: (1, 1)
    ┌─────────┐
    │ len     │
    │ ---     │
    │ u32     │
    ╞═════════╡
    │ 8398134 │
    └─────────┘
    

    pandas

    sum(
       len(pd.read_csv(file, sep='\t', quoting=3).index)
       for file in files
    )
    
    CPU times: user 57.6 s, sys: 9.78 s, total: 1min 7s
    Wall time: 1min 7s
    8398134
    

    duckdb

    duckdb.sql("""
    from read_csv_auto('amazon-reviews/*.tsv', sep='\t', quote='')
    select count(*)
    """).pl()
    
    CPU times: user 12.4 s, sys: 2.32 s, total: 14.7 s
    Wall time: 5.05 s
    shape: (1, 1)
    ┌──────────────┐
    │ count_star() │
    │ ---          │
    │ i64          │
    ╞══════════════╡
    │ 8398134      │
    └──────────────┘
    

    pyarrow

    parse_options = pyarrow.csv.ParseOptions(delimiter='\t', quote_char=False)
    
    sum(
       pyarrow.csv.read_csv(file, parse_options=parse_options).num_rows
       for file in files
    )
    
    CPU times: user 12.9 s, sys: 6.46 s, total: 19.4 s
    Wall time: 6.65 s
    8398134