Been experimenting with polars, and one of the key features that piques my interest is its larger-than-RAM operations.
I downloaded some files to play with from HERE. On the website: "First line in each file is header; 1 line corresponds to 1 record." WARNING: the total download is quite large (~1.3GB)! This experiment was done on an AWS server (t2.medium, 2 CPUs, 4GB RAM).
wget https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Shoes_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Office_Products_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Software_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv.gz \
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz
gunzip *
Here are the file sizes from ls -l and the line counts from wc -l:
-rw-rw-r-- 1 ubuntu ubuntu 1243069057 Nov 25 2017 amazon_reviews_us_Office_Products_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu 44891575 Nov 25 2017 amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu 1570176560 Nov 25 2017 amazon_reviews_us_Shoes_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu 249565371 Nov 25 2017 amazon_reviews_us_Software_v1_00.tsv
-rw-rw-r-- 1 ubuntu ubuntu 412542975 Nov 25 2017 amazon_reviews_us_Watches_v1_00.tsv
$ find . -type f -exec cat {} + | wc -l
8398139
$ find . -name '*.tsv' | xargs wc -l
2642435 ./amazon_reviews_us_Office_Products_v1_00.tsv
341932 ./amazon_reviews_us_Software_v1_00.tsv
85982 ./amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv
4366917 ./amazon_reviews_us_Shoes_v1_00.tsv
960873 ./amazon_reviews_us_Watches_v1_00.tsv
8398139 total
Now, if I count the rows using polars and its fancy new lazy functionality:
import polars as pl

csvfile = "~/data/amazon/*.tsv"
(
    pl.scan_csv(csvfile, separator='\t')
    .select(
        pl.len()
    )
    .collect()
)
shape: (1, 1)
┌─────────┐
│ len │
│ --- │
│ u32 │
╞═════════╡
│ 4186305 │
└─────────┘
Wow, that's a BIG difference between wc -l and polars. That's weird... maybe it's a data issue. Let's only focus on the column of interest:
csvfile = "~/data/amazon/*.tsv"
(
    pl.scan_csv(csvfile, separator='\t')
    .select(
        pl.col("product_category").count()
    )
    .collect()
)
shape: (1, 1)
┌──────────────────┐
│ product_category │
│ --- │
│ u32 │
╞══════════════════╡
│ 7126095 │
└──────────────────┘
And with .collect(streaming=True):
shape: (1, 1)
┌──────────────────┐
│ product_category │
│ --- │
│ u32 │
╞══════════════════╡
│ 7125569 │
└──────────────────┘
Ok, still a difference of about 1 million, and streaming and in-memory don't even agree with each other? Let's do it bottom up:
csvfile = "~/data/amazon/*.tsv"
(
    pl.scan_csv(csvfile, separator='\t')
    .group_by("product_category")
    .agg(pl.col("product_category").count().alias("counts"))
    .collect(streaming=True)
    .filter(pl.col('counts') > 100)
    .sort(pl.col("counts"), descending=True)
    .select(
        pl.col('counts').sum()
    )
)
shape: (1, 1)
┌─────────┐
│ counts │
│ --- │
│ u32 │
╞═════════╡
│ 7125553 │
└─────────┘
Close, albeit once again a different count...
Some more checks using R:
library(vroom)
library(purrr)
library(glue)
library(logger)
amazon <- list.files("~/data/amazon/", full.names = TRUE)
f <- function(file){
    df <- vroom(file, col_select = 'product_category', show_col_types = FALSE)
    log_info(glue("File [{basename(file)}] has [{nrow(df)}] rows"))
}
walk(amazon, f)
INFO [2023-06-02 14:23:40] File [amazon_reviews_us_Office_Products_v1_00.tsv] has [2633651] rows
INFO [2023-06-02 14:23:41] File [amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv] has [85898] rows
INFO [2023-06-02 14:24:06] File [amazon_reviews_us_Shoes_v1_00.tsv] has [4353998] rows
INFO [2023-06-02 14:24:30] File [amazon_reviews_us_Software_v1_00.tsv] has [331152] rows
INFO [2023-06-02 14:24:37] File [amazon_reviews_us_Watches_v1_00.tsv] has [943763] rows
Total: 8348462
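For what it's worth, the same per-file check can be reproduced with polars itself. A minimal sketch, assuming the same directory layout as above:
import glob
import os

import polars as pl

# Count rows per file so any parsing loss can be pinned to a specific file
# (glob does not expand "~", hence expanduser).
for path in sorted(glob.glob(os.path.expanduser("~/data/amazon/*.tsv"))):
    n = pl.scan_csv(path, separator='\t').select(pl.len()).collect().item()
    print(f"{path}: {n}")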
Ok. Screw it. Basically a random number generating exercise and nothing is real.
Surely if it's a data hygiene issue the error should be constant? Any idea why there might be such a large discrepancy?
It's usually helpful to declare the size of downloads in cases like this.
For any readers, the total size is 1.3 GB.
The smallest file is https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv.gz at 17.6 MB.
I tried pandas to debug this; it cannot read any of these files:
pd.read_csv('amazon-reviews/amazon_reviews_us_Personal_Care_Appliances_v1_00.tsv', sep='\t')
ParserError: Error tokenizing data. C error: Expected 15 fields in line 1598, saw 22
Line 1598:
US 3878437 R3BH4UXFRP6F8L B00J7G8EL0 381088677 GUM Expanding Floss - 30 m - 2 pk Personal_Care_Appliances
4 0 0 N Y " like the REACH woven that's no longer available--THAT was the wish it was a bit "fluffier," like the REACH woven that's no longer available--THAT was the best 2015-08-06
The issue is the single " character; you need to disable the default quoting behaviour. With that change I get a total count of 8398134 each time. (wc -l reports 8398139 because it also counts the header line in each of the 5 files: 8398134 + 5 = 8398139.)
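To see why a single stray quote can swallow whole records, here is a minimal sketch using Python's built-in csv module on made-up data:
import csv
import io

# Three tab-separated records; the second contains a lone " character.
data = 'a\tb\nc\t"d\ne\tf\n'

# Default quoting: the " opens a quoted field and, with no closing quote,
# the parser consumes everything after it, newlines included: 2 rows.
print(len(list(csv.reader(io.StringIO(data), delimiter='\t'))))

# QUOTE_NONE: " is treated as an ordinary character: 3 rows, one per line.
print(len(list(csv.reader(io.StringIO(data), delimiter='\t',
                          quoting=csv.QUOTE_NONE))))
This also plausibly explains why the earlier counts differed between engines and between streaming and in-memory runs: where a bogus "quoted" region ends depends on where the next quote character happens to occur in the data.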
(
    pl.scan_csv('amazon-reviews/*.tsv', separator='\t', quote_char=None)
    .select(pl.len())
    .collect()
)
CPU times: user 3.65 s, sys: 2.02 s, total: 5.67 s
Wall time: 2.48 s
shape: (1, 1)
┌─────────┐
│ len │
│ --- │
│ u32 │
╞═════════╡
│ 8398134 │
└─────────┘
import glob
import pandas as pd

files = glob.glob('amazon-reviews/*.tsv')  # assumed; `files` was never defined above
sum(
    len(pd.read_csv(file, sep='\t', quoting=3).index)
    for file in files
)
CPU times: user 57.6 s, sys: 9.78 s, total: 1min 7s
Wall time: 1min 7s
8398134
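For reference, quoting=3 is csv.QUOTE_NONE (pandas accepts the csv module's quoting constants), which disables quote handling the same way quote_char=None does in polars.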
import duckdb

duckdb.sql("""
    from read_csv_auto('amazon-reviews/*.tsv', sep='\t', quote='')
    select count(*)
""").pl()
CPU times: user 12.4 s, sys: 2.32 s, total: 14.7 s
Wall time: 5.05 s
shape: (1, 1)
┌──────────────┐
│ count_star() │
│ --- │
│ i64 │
╞══════════════╡
│ 8398134 │
└──────────────┘
import pyarrow.csv

parse_options = pyarrow.csv.ParseOptions(delimiter='\t', quote_char=False)
sum(
    pyarrow.csv.read_csv(file, parse_options=parse_options).num_rows
    for file in files
)
CPU times: user 12.9 s, sys: 6.46 s, total: 19.4 s
Wall time: 6.65 s
8398134
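Each library spells "no quoting" differently (quote_char=None in polars, quoting=3 in pandas, quote='' in DuckDB, quote_char=False in pyarrow), but once quoting is disabled all four engines agree on 8398134 data rows.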