<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/fs/btrfs/delayed-ref.c, branch linux-5.1.y</title>
<subtitle>Hosts the 0x221E linux distro kernel.</subtitle>
<id>https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-5.1.y</id>
<link rel='self' href='https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-5.1.y'/>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/'/>
<updated>2019-02-25T13:13:39Z</updated>
<entry>
<title>btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record</title>
<updated>2019-02-25T13:13:39Z</updated>
<author>
<name>Qu Wenruo</name>
<email>wqu@suse.com</email>
</author>
<published>2019-01-23T07:15:12Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=1418bae1c22951aad9883bc8f8f4dccb272cce1e'/>
<id>urn:sha1:1418bae1c22951aad9883bc8f8f4dccb272cce1e</id>
<content type='text'>
[BUG]
Btrfs/139 will fail with a high probability if the testing machine (VM)
has only 2G RAM.

Resulting the final write success while it should fail due to EDQUOT,
and the fs will have quota exceeding the limit by 16K.

The simplified reproducer will be: (needs a 2G ram VM)

  $ mkfs.btrfs -f $dev
  $ mount $dev $mnt

  $ btrfs subv create $mnt/subv
  $ btrfs quota enable $mnt
  $ btrfs quota rescan -w $mnt
  $ btrfs qgroup limit -e 1G $mnt/subv

  $ for i in $(seq -w  1 8); do
  	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i &gt; /dev/null
  	echo "file $i written" &gt; /dev/kmsg
    done
  $ sync
  $ btrfs qgroup show -pcre --raw $mnt

The last pwrite will not trigger EDQUOT and final 'qgroup show' will
show something like:

  qgroupid         rfer         excl     max_rfer     max_excl parent  child
  --------         ----         ----     --------     -------- ------  -----
  0/5             16384        16384         none         none ---     ---
  0/256      1073758208   1073758208         none   1073741824 ---     ---

And 1073758208 is larger than
  &gt; 1073741824.

[CAUSE]
It's a bug in btrfs qgroup data reserved space management.

For quota limit, we must ensure that:
  reserved (data + metadata) + rfer/excl &lt;= limit

Since rfer/excl is only updated at transaction commmit time, reserved
space needs to be taken special care.

One important part of reserved space is data, and for a new data extent
written to disk, we still need to take the reserved space until
rfer/excl numbers get updated.

Originally when an ordered extent finishes, we migrate the reserved
qgroup data space from extent_io tree to delayed ref head of the data
extent, expecting delayed ref will only be cleaned up at commit
transaction time.

However for small RAM machine, due to memory pressure dirty pages can be
flushed back to disk without committing a transaction.

The related events will be something like:

  file 1 written
  btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
  btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
  btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
  btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
  btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
  cleanup_ref_head: num_bytes=54947840
  cleanup_ref_head: num_bytes=5636096
  cleanup_ref_head: num_bytes=569344
  cleanup_ref_head: num_bytes=57344
  cleanup_ref_head: num_bytes=8192
  ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
  file 2 written
  ...
  file 8 written
  cleanup_ref_head: num_bytes=8192
  ...
  btrfs_commit_transaction  &lt;&lt;&lt; the only transaction committed during
				the test

When file 2 is written, we have already freed 128M reserved qgroup data
space for ino 258. Thus later write won't trigger EDQUOT.

This allows us to write more data beyond qgroup limit.

In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.

[FIX]
By moving reserved qgroup data space from btrfs_delayed_ref_head to
btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
space won't be freed half way before commit transaction, thus fix the
problem.

Fixes: f64d5ca86821 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
Signed-off-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: introduce delayed_refs_rsv</title>
<updated>2018-12-17T13:51:46Z</updated>
<author>
<name>Josef Bacik</name>
<email>jbacik@fb.com</email>
</author>
<published>2018-12-03T15:20:33Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=ba2c4d4e3bda7d6de2bc616ae6715e0a0725b294'/>
<id>urn:sha1:ba2c4d4e3bda7d6de2bc616ae6715e0a0725b294</id>
<content type='text'>
Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv.  This works most
of the time, except when it doesn't.  We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction.  Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.

So instead give them their own block_rsv.  This way we can always know
exactly how much outstanding space we need for delayed refs.  This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system.  Instead of doing math to see if its a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums.  We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.

Historical background:

The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
if we're running short on space and when it's time to force a commit.  A
failure rate of 20-40 file systems when we run hundreds of thousands of
them isn't super high, but cleaning up this code will make things less
ugly and more predictible.

Thus the delayed refs rsv.  We always know how many delayed refs we have
outstanding, and although running them generates more we can use the
global reserve for that spill over, which fits better into it's desired
use than a full blown reservation.  This first approach is to simply
take how many times we're reserving space for and multiply that by 2 in
order to save enough space for the delayed refs that could be generated.
This is a niave approach and will probably evolve, but for now it works.

Signed-off-by: Josef Bacik &lt;jbacik@fb.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt; # high-level review
[ added background notes from the cover letter ]
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: only track ref_heads in delayed_ref_updates</title>
<updated>2018-12-17T13:51:46Z</updated>
<author>
<name>Josef Bacik</name>
<email>jbacik@fb.com</email>
</author>
<published>2018-12-03T15:20:32Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=158ffa364bf723fa1ef128060646d23dc3942994'/>
<id>urn:sha1:158ffa364bf723fa1ef128060646d23dc3942994</id>
<content type='text'>
We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter what the number of items on it.  Fix the accounting to only be
adjusted when we add/remove a ref head.

In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.

Reviewed-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Josef Bacik &lt;jbacik@fb.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: add btrfs_delete_ref_head helper</title>
<updated>2018-12-17T13:51:46Z</updated>
<author>
<name>Josef Bacik</name>
<email>jbacik@fb.com</email>
</author>
<published>2018-12-03T15:20:29Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=d7baffdaf9f9df8c9715aa507e3be2f409347c74'/>
<id>urn:sha1:d7baffdaf9f9df8c9715aa507e3be2f409347c74</id>
<content type='text'>
We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.

Reviewed-by: Omar Sandoval &lt;osandov@fb.com&gt;
Reviewed-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Josef Bacik &lt;jbacik@fb.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: delayed-ref: extract find_first_ref_head from find_ref_head</title>
<updated>2018-10-17T17:21:00Z</updated>
<author>
<name>Lu Fengqi</name>
<email>lufq.fnst@cn.fujitsu.com</email>
</author>
<published>2018-10-15T06:25:38Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=0a9df0df17a011fa4d17277a8c8dd662668c6426'/>
<id>urn:sha1:0a9df0df17a011fa4d17277a8c8dd662668c6426</id>
<content type='text'>
The find_ref_head shouldn't return the first entry even if no exact match
is found. So move the hidden behavior to higher level.

Besides, remove the useless local variables in the btrfs_select_ref_head.

Reviewed-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Lu Fengqi &lt;lufq.fnst@cn.fujitsu.com&gt;
[ reformat comment ]
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: switch return_bigger to bool in find_ref_head</title>
<updated>2018-10-15T15:23:41Z</updated>
<author>
<name>Lu Fengqi</name>
<email>lufq.fnst@cn.fujitsu.com</email>
</author>
<published>2018-10-11T05:40:38Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=d9352794dad9f28535439d85a815978878c141ab'/>
<id>urn:sha1:d9352794dad9f28535439d85a815978878c141ab</id>
<content type='text'>
Using bool is more suitable than int here, and add the comment about the
return_bigger.

Signed-off-by: Lu Fengqi &lt;lufq.fnst@cn.fujitsu.com&gt;
Reviewed-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: delayed-ref: pass delayed_refs directly to btrfs_delayed_ref_lock</title>
<updated>2018-10-15T15:23:41Z</updated>
<author>
<name>Lu Fengqi</name>
<email>lufq.fnst@cn.fujitsu.com</email>
</author>
<published>2018-10-11T05:40:34Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=9e920a6f03e40b1eb712f38b29ad5880153754e2'/>
<id>urn:sha1:9e920a6f03e40b1eb712f38b29ad5880153754e2</id>
<content type='text'>
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_delayed_ref_lock().

No functional change.

Reviewed-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Lu Fengqi &lt;lufq.fnst@cn.fujitsu.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: delayed-ref: pass delayed_refs directly to btrfs_select_ref_head</title>
<updated>2018-10-15T15:23:40Z</updated>
<author>
<name>Lu Fengqi</name>
<email>lufq.fnst@cn.fujitsu.com</email>
</author>
<published>2018-10-11T05:40:33Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=5637c74b01458d4bc392c2bb721bd102f316ad2d'/>
<id>urn:sha1:5637c74b01458d4bc392c2bb721bd102f316ad2d</id>
<content type='text'>
Since trans is only used for referring to delayed_refs, there is no need
to pass it instead of delayed_refs to btrfs_select_ref_head().  No
functional change.

Signed-off-by: Lu Fengqi &lt;lufq.fnst@cn.fujitsu.com&gt;
Reviewed-by: Nikolay Borisov &lt;nborisov@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>Btrfs: delayed-refs: use rb_first_cached for ref_tree</title>
<updated>2018-10-15T15:23:33Z</updated>
<author>
<name>Liu Bo</name>
<email>bo.liu@linux.alibaba.com</email>
</author>
<published>2018-08-22T19:51:50Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=e3d039656384288bbe952413d8d404b3035fe7d7'/>
<id>urn:sha1:e3d039656384288bbe952413d8d404b3035fe7d7</id>
<content type='text'>
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

Functions manipulating href-&gt;ref_tree need to get the first entry, this
converts href-&gt;ref_tree to use rb_first_cached().

For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte &lt;holger@applied-asynchrony.com&gt;
Signed-off-by: Liu Bo &lt;bo.liu@linux.alibaba.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>Btrfs: delayed-refs: use rb_first_cached for href_root</title>
<updated>2018-10-15T15:23:33Z</updated>
<author>
<name>Liu Bo</name>
<email>bo.liu@linux.alibaba.com</email>
</author>
<published>2018-08-22T19:51:49Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=5c9d028b3b174e5cf3678a7b0c14e21e51665793'/>
<id>urn:sha1:5c9d028b3b174e5cf3678a7b0c14e21e51665793</id>
<content type='text'>
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

Functions manipulating href_root need to get the first entry, this
converts href_root to use rb_first_cached().

This patch is first in the sequenct of similar updates to other rbtrees
and this is analysis of the expected behaviour and improvements.

There's a common pattern:

while (node = rb_first) {
        entry = rb_entry(node)
        next = rb_next(node)
        rb_erase(node)
        cleanup(entry)
}

rb_first needs to traverse the tree up to logN depth, rb_erase can
completely reshuffle the tree. With the caching we'll skip the traversal
in rb_first.  That's a cached memory access vs looped pointer
dereference trade-off that IMHO has a clear winner.

Measurements show there's not much difference in a sample tree with
10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
caching and pointer chasing are unpredictable though.

Further optimzations can be done to avoid the expensive rb_erase step.
In some cases it's ok to process the nodes in any order, so the tree can
be traversed in post-order, not rebalancing the children nodes and just
calling free. Care must be taken regarding the next node.

Tested-by: Holger Hoffstätte &lt;holger@applied-asynchrony.com&gt;
Signed-off-by: Liu Bo &lt;bo.liu@linux.alibaba.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
[ update changelog from mail discussions ]
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
</feed>
