
ZFS Blocksize (recordsize) and primarycache issue


From this blog post: http://www.patpro.net/blog/index.php/2014/03/19/2628-zfs-primarycache-all-versus-metadata/

The author shows that switching primarycache between all and metadata gives wildly different read performance when running an antivirus scan.

He also shows that the read bandwidth differs enormously:

I create 2 brand new datasets, both with primarycache=none and compression=lz4, and I copy in each one a 4.8GB file (2.05x compressratio). Then I set primarycache=all on the first one, and primarycache=metadata on the second one. I cat the first file into /dev/null with zpool iostat running in another terminal. And finally, I cat the second file the same way.

The sum of read bandwidth column is (almost) exactly the physical size of the file on the disk (du output) for the dataset with primarycache=all: 2.44GB. For the other dataset, with primarycache=metadata, the sum of the read bandwidth column is ...wait for it... 77.95GB.

He then says that an anonymous user explained it like this:

clamscan reads a file, gets 4k (pagesize?) of data and processes it, then it reads the next 4k, etc.

ZFS, however, cannot read just 4k. It reads 128k (recordsize) by default. Since there is no cache (you've turned it off) the rest of the data is thrown away.

128k / 4k = 32

32 x 2.44GB = 78.08GB

I don't quite understand the anonymous user's explanation. I'm still confused as to why there is such a big difference in the read bandwidth.

So why does this ZFS experiment show that when primarycache is all, the read bandwidth is 2.44 GB, but when it is just metadata, it's 77.95GB? And what are the implications for tuning ZFS? If the person reduced his recordsize, would he get a different result?

What about the claim that ZFS's recordsize is variable?


Solution

  • The test that the blogger, Patrick, ran was to "cat" the 4.8 GB file (compressed to 2.44 GB) to /dev/null and watch, with "zpool iostat", how much data was read from disk.

    The key is that "primarycache=metadata" might as well mean "cache=off" for file contents, because none of the actual file data will be stored in the cache. When "primarycache=all," the system reads the whole file once and stores it in cache (the ARC, in RAM; an L2 cache on SSD can sit behind it, controlled separately by "secondarycache"). When "cat" or "clamscan" ask for the file's data, they find it there, and it doesn't need to be read again from disk.

    As cat writes the file to /dev/null, it doesn't write it as a single 2.44 GB block. It writes a little bit at a time, checks the cache for the next bit, writes a little more, and so on.

    With the data cache off, that file has to be re-read from disk a ridiculous number of times as it's written to /dev/null (or stdout, wherever). That's the logic of "128k/4k = 32".

    ZFS stores this file on disk in 128k records (the default recordsize), but the forum posters found that "clamscan" (and "cat", at least on this user's FreeBSD box) processes data in 4k blocks. So, without a cache, each 128k record has to be served up 32 times instead of just once. (clamscan pulls record #1, 128k large, and uses the first 4k; it then needs record #1 again, and since there's no cache the record is read from disk again; it takes the second 4k and throws the rest out; etc.)
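
    The arithmetic behind that "32 times" can be checked with a small shell sketch. The function names below are made up for illustration; the numbers (128k recordsize, 4k reads, 2.44 GB on disk) are the ones quoted from the blog post.

```shell
#!/bin/sh
# Read amplification when every small read forces the whole record to be
# fetched from disk again (file data is not cached with primarycache=metadata).

# amplification RECORD_KIB READ_KIB -> times each record is fetched
amplification() {
    echo $(( $1 / $2 ))
}

# total_read_gb ON_DISK_GB FACTOR -> expected total read from disk, in GB
total_read_gb() {
    awk -v size="$1" -v factor="$2" 'BEGIN { printf "%.2f\n", size * factor }'
}

factor=$(amplification 128 4)
echo "each 128k record is read $factor times"
echo "expected disk reads: $(total_read_gb 2.44 "$factor") GB"
```

    The result, 78.08 GB, is close to the 77.95 GB that the blogger summed from the zpool iostat read column.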

    The upshot is:

    [1] Maybe never do "primarycache=metadata", for any reason.

    [2] When block sizes are mismatched like this, performance issues can result. If clamscan read 128k blocks, there would be no (significant?) difference on the read of a single file. OTOH, if you need the file again shortly after, a cache would still have its data blocks and it wouldn't need to be pulled from disk again.
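
    As a rough sketch of how those two points translate into commands: the dataset name "tank/scan" is hypothetical, and 16K is only an illustrative recordsize, not a recommendation. The script defaults to a dry run that prints the commands rather than executing them.

```shell
#!/bin/sh
# Sketch only: "tank/scan" is a hypothetical dataset name.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DATASET="${DATASET:-tank/scan}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

tune_dataset() {
    # Show the current settings before changing anything.
    run zfs get recordsize,primarycache "$DATASET"
    # Cache file data as well as metadata in the ARC.
    run zfs set primarycache=all "$DATASET"
    # Optionally use a smaller record size for workloads dominated by
    # small reads. This only affects data written after the change.
    run zfs set recordsize=16K "$DATASET"
}

tune_dataset
```

    Existing files keep the record size they were written with, so a file has to be rewritten (copied again, for example) before a new recordsize shows up in a test like this one.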

    ...

    Here are some tests inspired by the forum post to illustrate. The examples take place on a zfs dataset with recordsize set to 128k (the default) and primarycache set to metadata. A 1G dummy file is copied with dd at different block sizes: 128k first, then 4k, then 8k. (Scroll to the right; I've lined up my copy commands with the iostat readout.)

    Notice how dramatically, when the block sizes are mismatched, the ratio of reads to writes balloons and the read bandwidth takes off.

        root@zone1:~# zpool iostat 3
                       capacity     operations    bandwidth
        pool        alloc   free   read  write   read  write
        ----------  -----  -----  -----  -----  -----  -----
        rpool        291G   265G      0     21  20.4K   130K
        rpool        291G   265G      0      0      0      0
        rpool        291G   265G      0    515      0  38.9M            ajordan@zone1:~/mnt/test$ mkfile 1G test1.tst   
        rpool        291G   265G      0  1.05K      0   121M
        rpool        292G   264G      0    974      0   100M
        rpool        292G   264G      0    217      0  26.7M
        rpool        292G   264G      0    516      0  58.0M
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G      0     96      0   619K
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G    474      0  59.3M      0            ajordan@zone1:~/mnt/test$ dd if=test1.tst of=test1.2 bs=128k
        rpool        292G   264G    254    593  31.8M  67.8M                
        rpool        292G   264G    396    230  49.6M  27.9M                
        rpool        293G   263G    306    453  38.3M  45.2M                8192+0 records in
        rpool        293G   263G    214    546  26.9M  62.0M                8192+0 records out
        rpool        293G   263G    486      0  60.8M      0
        rpool        293G   263G    211    635  26.5M  72.9M
        rpool        293G   263G    384    235  48.1M  29.2M
        rpool        293G   263G      0    346      0  37.2M
        rpool        293G   263G      0      0      0      0
        rpool        293G   263G      0      0      0      0
        rpool        293G   263G      0      0      0      0
        rpool        293G   263G      0      0      0      0
        rpool        293G   263G      0      0      0      0
        rpool        293G   263G      0      0      0      0
        rpool        293G   263G      0      0      0      0
        rpool        293G   263G      0      0      0      0
        rpool        293G   263G  1.05K     70   134M  3.52M            ajordan@zone1:~/mnt/test$ dd if=test1.tst of=test1.3 bs=4k
        rpool        293G   263G  1.45K      0   185M      0                
        rpool        293G   263G  1.35K    160   173M  10.0M
        rpool        293G   263G  1.44K      0   185M      0
        rpool        293G   263G  1.31K    180   168M  9.83M
        rpool        293G   263G  1.36K    117   174M  9.20M
        rpool        293G   263G  1.42K      0   181M      0
        rpool        293G   263G  1.26K    120   161M  9.48M
        rpool        293G   263G  1.49K      0   191M      0
        rpool        293G   263G  1.40K    117   179M  9.23M
        rpool        293G   263G  1.36K    159   175M  9.98M
        rpool        293G   263G  1.41K     12   180M   158K
        rpool        293G   263G  1.23K    167   157M  9.63M
        rpool        293G   263G  1.54K      0   197M      0
        rpool        293G   263G  1.36K    158   175M  9.70M
        rpool        293G   263G  1.42K    151   181M  9.99M
        rpool        293G   263G  1.41K     21   180M   268K
        rpool        293G   263G  1.32K    132   169M  9.39M
        rpool        293G   263G  1.48K      0   189M      0
        rpool        294G   262G  1.42K    118   181M  9.32M
        rpool        294G   262G  1.34K    121   172M  9.73M
        rpool        294G   262G    859      2   107M  10.7K
        rpool        294G   262G  1.34K    135   171M  6.83M
        rpool        294G   262G  1.43K      0   183M      0
        rpool        294G   262G  1.31K    120   168M  9.44M
        rpool        294G   262G  1.26K    116   161M  9.11M
        rpool        294G   262G  1.52K      0   194M      0
        rpool        294G   262G  1.32K    118   170M  9.44M
        rpool        294G   262G  1.48K      0   189M      0
        rpool        294G   262G  1.23K    170   157M  9.97M
        rpool        294G   262G  1.41K    116   181M  9.07M
        rpool        294G   262G  1.49K      0   191M      0
        rpool        294G   262G  1.38K    123   176M  9.90M
        rpool        294G   262G  1.35K      0   173M      0
        rpool        294G   262G  1.41K    114   181M  8.86M
        rpool        294G   262G  1.29K    155   165M  10.3M
        rpool        294G   262G  1.50K      7   192M  89.3K
        rpool        294G   262G  1.43K    116   183M  9.03M
        rpool        294G   262G  1.52K      0   194M      0
        rpool        294G   262G  1.39K    125   178M  10.0M
        rpool        294G   262G  1.28K    119   164M  9.52M
        rpool        294G   262G  1.54K      0   197M      0
        rpool        294G   262G  1.39K    120   178M  9.57M
        rpool        294G   262G  1.45K      0   186M      0
        rpool        294G   262G  1.37K    133   175M  9.60M                
        rpool        294G   262G  1.38K    173   176M  10.1M                
        rpool        294G   262G  1.61K      0   207M      0
        rpool        294G   262G  1.47K    125   189M  10.2M
        rpool        294G   262G  1.56K      0   200M      0
        rpool        294G   262G  1.38K    124   177M  10.2M
        rpool        294G   262G  1.37K    145   175M  9.95M
        rpool        294G   262G  1.51K     28   193M   359K
        rpool        294G   262G  1.32K    171   169M  10.1M
        rpool        294G   262G  1.55K      0   199M      0
        rpool        294G   262G  1.29K    119   165M  9.48M
        rpool        294G   262G  1.11K    110   142M  8.36M
        rpool        294G   262G  1.43K      0   183M      0
        rpool        294G   262G  1.36K    118   174M  9.32M
        rpool        294G   262G  1.49K      0   190M      0
        rpool        294G   262G  1.35K    118   173M  9.32M
        rpool        294G   262G  1.32K    146   169M  10.1M
        rpool        294G   262G  1.07K     29   137M   363K                262144+0 records in
        rpool        294G   262G      0     79      0  4.65M                262144+0 records out
        rpool        294G   262G      0      0      0      0
        rpool        294G   262G      0      0      0      0
        rpool        294G   262G      0      0      0      0
        rpool        294G   262G      0      0      0      0
        rpool        294G   262G      0      0      0      0
        rpool        294G   262G      0      0      0      0
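
    To total the read bandwidth of a listing like the one above, the way the blogger did, something like the following could be used. It is a sketch that assumes the column layout shown here (read bandwidth in the 6th field), a single "zpool iostat N" run per saved file, and that K/M/G are powers of 1024. The function name "sum_read_bw" is made up.

```shell
#!/bin/sh
# sum_read_bw INTERVAL_SECONDS < saved-iostat-output
# Adds up the read-bandwidth column (6th field) of saved `zpool iostat N`
# output and prints the total in MiB.
sum_read_bw() {
    awk -v interval="$1" '
        # Convert values such as "20.4K", "134M" or "0" to bytes.
        function to_bytes(v,    unit, num) {
            unit = substr(v, length(v))
            num = v + 0
            if (unit == "K") return num * 1024
            if (unit == "M") return num * 1024 * 1024
            if (unit == "G") return num * 1024 * 1024 * 1024
            return num
        }
        # Only data rows have a numeric 6th field; header and separator
        # lines are skipped by this pattern.
        $6 ~ /^[0-9.]+[KMG]?$/ {
            rows++
            if (rows == 1) next   # first report is an average since boot
            total += to_bytes($6) * interval
        }
        END { printf "%.1f\n", total / (1024 * 1024) }
    '
}
```

    Usage, with a hypothetical file name: "sum_read_bw 3 < iostat.log". Each row is a per-second average over the interval, which is why it is multiplied by the interval length. Text to the right of the columns, such as the dd commands lined up in the listing, is ignored.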
    
    
    
        root@zone1:~# zpool iostat 3
                       capacity     operations    bandwidth
        pool        alloc   free   read  write   read  write
        ----------  -----  -----  -----  -----  -----  -----
        rpool        292G   264G      0     21  22.6K   130K
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G      0      0      0      0
        rpool        292G   264G  1.03K      0   131M      0            ajordan@zone1:~/mnt/test$ dd if=test8k.tst of=test8k.2 bs=8k
        rpool        292G   264G  1.10K    202   141M  16.4M
        rpool        292G   264G  1.25K     25   161M   316K
        rpool        292G   264G    960    215   120M  15.5M
        rpool        292G   264G  1.25K      0   160M      0
        rpool        292G   264G     1K    210   128M  14.8M
        rpool        292G   264G   1010    159   126M  14.3M
        rpool        292G   264G  1.28K      0   164M      0
        rpool        292G   264G  1.08K    169   138M  15.6M
        rpool        292G   264G  1.25K      0   161M      0
        rpool        292G   264G  1.00K    166   128M  15.3M
        rpool        293G   263G    998    201   125M  15.1M
        rpool        293G   263G  1.19K      0   153M      0
        rpool        293G   263G    655    161  82.0M  14.2M
        rpool        293G   263G  1.27K      0   162M      0
        rpool        293G   263G  1.02K    230   130M  12.7M
        rpool        293G   263G  1.02K    204   130M  15.5M
        rpool        293G   263G  1.23K      0   157M      0
        rpool        293G   263G  1.11K    162   142M  14.8M
        rpool        293G   263G  1.26K      0   161M      0
        rpool        293G   263G  1.01K    168   130M  15.5M
        rpool        293G   263G  1.04K    215   133M  15.5M
        rpool        293G   263G  1.30K      0   167M      0
        rpool        293G   263G  1.01K    210   129M  16.1M
        rpool        293G   263G  1.24K      0   159M      0
        rpool        293G   263G  1.10K    214   141M  15.3M
        rpool        293G   263G  1.07K    169   137M  15.6M
        rpool        293G   263G  1.25K      0   160M      0
        rpool        293G   263G  1.01K    166   130M  15.0M
        rpool        293G   263G  1.25K      0   160M      0
        rpool        293G   263G    974    230   122M  15.8M
        rpool        293G   263G  1.11K    160   142M  14.4M
        rpool        293G   263G  1.26K      0   161M      0
        rpool        293G   263G  1.06K    172   136M  15.8M
        rpool        293G   263G  1.27K      0   162M      0
        rpool        293G   263G  1.07K    167   136M  15.4M
        rpool        293G   263G   1011    217   126M  15.8M
        rpool        293G   263G  1.22K      0   156M      0
        rpool        293G   263G    569    160  71.2M  14.6M                131072+0 records in
        rpool        293G   263G      0      0      0      0                131072+0 records out
        rpool        293G   263G      0     98      0  1.09M
        rpool        293G   263G      0      0      0      0
        rpool        293G   263G      0      0      0      0
        rpool        293G   263G      0      0      0      0
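
    The record counts that dd printed, and the re-read factor to expect for each block size, follow from simple division. A minimal sketch, assuming the 1G file is 1024^3 bytes and each dd read fetches one full 128k record from disk:

```shell
#!/bin/sh
# records FILE_BYTES BLOCK_BYTES -> number of dd records for a full copy
records() {
    echo $(( $1 / $2 ))
}

# reads_per_record RECORD_BYTES BLOCK_BYTES -> times each ZFS record is
# fetched when file data is not cached
reads_per_record() {
    echo $(( $1 / $2 ))
}

file=$(( 1024 * 1024 * 1024 ))   # 1G test file
recordsize=$(( 128 * 1024 ))     # dataset recordsize

for bs_k in 128 8 4; do
    bs=$(( bs_k * 1024 ))
    echo "bs=${bs_k}k records=$(records "$file" "$bs") reads_per_record=$(reads_per_record "$recordsize" "$bs")"
done
```

    The record counts match the "records in/out" lines in the listings (8192, 131072 and 262144). The factors of 16 and 32 are roughly in line with how much longer the 8k and 4k runs kept reading compared with the 128k run.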