<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/fs/btrfs/extent-tree.c, branch linux-3.17.y</title>
<subtitle>Hosts the 0x221E linux distro kernel.</subtitle>
<id>https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-3.17.y</id>
<link rel='self' href='https://universe.0xinfinity.dev/distro/kernel/atom?h=linux-3.17.y'/>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/'/>
<updated>2015-01-08T18:27:49Z</updated>
<entry>
<title>Btrfs: fix fs corruption on transaction abort if device supports discard</title>
<updated>2015-01-08T18:27:49Z</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2014-12-07T21:31:47Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=eece4e2040607cec8524a8d06e3a24a6183156f3'/>
<id>urn:sha1:eece4e2040607cec8524a8d06e3a24a6183156f3</id>
<content type='text'>
commit 678886bdc6378c1cbd5072da2c5a3035000214e3 upstream.

When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info-&gt;freed_extents[0] and fs_info-&gt;freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.

This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info-&gt;pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.

I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:

  [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

The corruption always happened when a transaction aborted and then fsck complained
like this:

   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
   *** fsck.btrfs output ***
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   read block failed check_tree_block
   Couldn't open file system

In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:

   1) transaction N started, fs_info-&gt;pinned_extents pointed to
      fs_info-&gt;freed_extents[0];

   2) node/eb 94945280 is created;

   3) eb is persisted to disk;

   4) transaction N commit starts, fs_info-&gt;pinned_extents now points to
      fs_info-&gt;freed_extents[1], and transaction N completes;

   5) transaction N + 1 starts;

   6) eb is COWed, and btrfs_free_tree_block() called for this eb;

   7) eb range (94945280 to 94945280 + 16Kb) is added to
      fs_info-&gt;pinned_extents (fs_info-&gt;freed_extents[1]);

   8) Something goes wrong in transaction N + 1, like hitting ENOSPC
      for example, and the transaction is aborted, turning the fs into
      readonly mode. The stack trace I got for example:

      [112065.253935]  [&lt;ffffffff8140c7b6&gt;] dump_stack+0x4d/0x66
      [112065.254271]  [&lt;ffffffff81042984&gt;] warn_slowpath_common+0x7f/0x98
      [112065.254567]  [&lt;ffffffffa0325990&gt;] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.261674]  [&lt;ffffffff810429e5&gt;] warn_slowpath_fmt+0x48/0x50
      [112065.261922]  [&lt;ffffffffa032949e&gt;] ? btrfs_free_path+0x26/0x29 [btrfs]
      [112065.262211]  [&lt;ffffffffa0325990&gt;] __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.262545]  [&lt;ffffffffa036b1d6&gt;] btrfs_remove_chunk+0x537/0x58b [btrfs]
      [112065.262771]  [&lt;ffffffffa033840f&gt;] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
      [112065.263105]  [&lt;ffffffffa0343106&gt;] cleaner_kthread+0x100/0x12f [btrfs]
      (...)
      [112065.264493] ---[ end trace dd7903a975a31a08 ]---
      [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
      [112065.264997] BTRFS info (device sdc): forced readonly

   9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
      fs_info-&gt;fs_state and calls btrfs_cleanup_transaction(), which in
      turn calls btrfs_destroy_pinned_extent();

   10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
       marked as dirty in fs_info-&gt;freed_extents[], and for each one
       it calls discard, if the fs was mounted with "-o discard", and
       adds the range to the free space cache of the respective block
       group;

   11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
       sees the free space entries and performs a discard;

   12) After an umount and mount (or fsck), our eb's location on disk was full
       of zeroes, and it should have been untouched, because it was marked as
       dirty in the fs_info-&gt;pinned_extents tree, and therefore used by the
       trees that the last committed superblock points to.

Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.

This isn't a new problem, as it's been present since 2011 (git commit
acce952b0263825da32cf10489413dec78053347).

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>Btrfs: fix loop writing of async reclaim</title>
<updated>2015-01-08T18:27:49Z</updated>
<author>
<name>Liu Bo</name>
<email>bo.li.liu@oracle.com</email>
</author>
<published>2014-09-10T04:58:50Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=f9841b92fbb2154e9d61279cba98695a86fd886b'/>
<id>urn:sha1:f9841b92fbb2154e9d61279cba98695a86fd886b</id>
<content type='text'>
commit 25ce459c1af138f95a3fd318461193397ebb825b upstream.

One of my tests shows that when we really don't have space to reclaim via
flush_space and also run out of space, this async reclaim work loops on adding
itself into the workqueue and keeps writing something to disk according to
iostat's results, and these writes mainly comes from commit_transaction which
writes super_block.  This's unacceptable as it can be bad to disks, especially
memeory storages.

This adds a check to avoid the above situation.

Signed-off-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Cc: "Alexander E. Patrakov" &lt;patrakov@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>Btrfs: don't do async reclaim during log replay</title>
<updated>2014-10-30T16:43:05Z</updated>
<author>
<name>Josef Bacik</name>
<email>jbacik@fb.com</email>
</author>
<published>2014-09-18T15:27:17Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=7433c1d96f229e26b8224b6cfead1a7e4486ea53'/>
<id>urn:sha1:7433c1d96f229e26b8224b6cfead1a7e4486ea53</id>
<content type='text'>
commit f6acfd50110b335c7af636cf1fc8e55319cae5fc upstream.

Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
during log replay.  This is because we use fs_info-&gt;fs_root as our root for
shrinking and such.  Technically we can use whatever root we want, but let's
just not allow async reclaim while we're doing log replay.  Thanks,

Signed-off-by: Josef Bacik &lt;jbacik@fb.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>Btrfs: fix task hang under heavy compressed write</title>
<updated>2014-08-24T14:17:02Z</updated>
<author>
<name>Liu Bo</name>
<email>bo.li.liu@oracle.com</email>
</author>
<published>2014-08-15T15:36:53Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=9e0af23764344f7f1b68e4eefbe7dc865018b63d'/>
<id>urn:sha1:9e0af23764344f7f1b68e4eefbe7dc865018b63d</id>
<content type='text'>
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.

Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.

Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --

normal_work_helper(arg)
    work = container_of(arg, struct btrfs_work, normal_work);

    work-&gt;func() &lt;---- (we name it work X)
    for ordered_work in wq-&gt;ordered_list
            ordered_work-&gt;ordered_func()
            ordered_work-&gt;ordered_free()

The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will

file a readahead request
    btrfs_readpages()
         for page that is not in page cache
                __do_readpage()
                     submit_extent_page()
                           btrfs_submit_bio_hook()
                                 btrfs_bio_wq_end_io()
                                 submit_bio()
                                 end_workqueue_bio() &lt;--(ret by the 1st endio)
                                      queue a work(named work Y) for the 2nd
                                      also the real endio()

So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.

A bit more explanation,

A,B,C -- struct btrfs_work
arg   -- struct work_struct

kthread:
worker_thread()
    pick up a work_struct from @worklist
    process_one_work(arg)
	worker-&gt;current_work = arg;  &lt;-- arg is A-&gt;normal_work
	worker-&gt;current_func(arg)
		normal_work_helper(arg)
		     A = container_of(arg, struct btrfs_work, normal_work);

		     A-&gt;func()
		     A-&gt;ordered_func()
		     A-&gt;ordered_free()  &lt;-- A gets freed

		     B-&gt;ordered_func()
			  submit_compressed_extents()
			      find_free_extent()
				  load_free_space_inode()
				      ...   &lt;-- (the above readhead stack)
				      end_workqueue_bio()
					   btrfs_queue_work(work C)
		     B-&gt;ordered_free()

As if work A has a high priority in wq-&gt;ordered_list and there are more ordered
works queued after it, such as B-&gt;ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker-&gt;current_work pointer to be work
A-&gt;normal_work's, ie. arg's address.

Meanwhile, work C is allocated after work A is freed, work C-&gt;normal_work
and work A-&gt;normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).

When another kthread picks up work C-&gt;normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.

So the situation is that our kthread is waiting forever on work C.

Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work-&gt;func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.

With this patch, I no long hit the above hang.

Signed-off-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
</content>
</entry>
<entry>
<title>Btrfs: don't consider the missing device when allocating new chunks</title>
<updated>2014-08-19T15:52:19Z</updated>
<author>
<name>Miao Xie</name>
<email>miaox@cn.fujitsu.com</email>
</author>
<published>2014-07-24T03:37:14Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=95669976bd7d30ae265db938ecb46a6b7f8cb893'/>
<id>urn:sha1:95669976bd7d30ae265db938ecb46a6b7f8cb893</id>
<content type='text'>
The original code allocated new chunks by the number of the writable devices
and missing devices to make sure that any RAID levels on a degraded FS continue
to be honored, but it introduced a problem that it stopped us to allocating
new chunks, the steps to reproduce is following:

 # mkfs.btrfs -m raid1 -d raid1 -f &lt;dev0&gt; &lt;dev1&gt;
 # mkfs.btrfs -f &lt;dev1&gt;			//Removing &lt;dev1&gt; from the original fs
 # mount -o degraded &lt;dev0&gt; &lt;mnt&gt;
 # dd if=/dev/null of=&lt;mnt&gt;/tmpfile bs=1M

It is because we allocate new chunks only on the writable devices, if we take
the number of missing devices into account, and want to allocate new chunks
with higher RAID level, we will fail becaue we don't have enough writable
device. Fix it by ignoring the number of missing devices when allocating
new chunks.

Signed-off-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
</content>
</entry>
<entry>
<title>btrfs: qgroup: account shared subtrees during snapshot delete</title>
<updated>2014-08-15T14:43:14Z</updated>
<author>
<name>Mark Fasheh</name>
<email>mfasheh@suse.de</email>
</author>
<published>2014-07-17T19:39:01Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=1152651a081720ef6a8c76bb7da676e8c900ac30'/>
<id>urn:sha1:1152651a081720ef6a8c76bb7da676e8c900ac30</id>
<content type='text'>
During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.

In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.

This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent and in the case that we find
only one reference left, we go ahead and do the math on it's
exclusive counts.

Signed-off-by: Mark Fasheh &lt;mfasheh@suse.de&gt;
Reviewed-by: Josef Bacik &lt;jbacik@fb.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
</content>
</entry>
<entry>
<title>Btrfs: __btrfs_mod_ref should always use no_quota</title>
<updated>2014-08-15T14:43:11Z</updated>
<author>
<name>Josef Bacik</name>
<email>jbacik@fb.com</email>
</author>
<published>2014-07-02T17:54:25Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=e339a6b097c515a31ce230d498c44ff2e89f1cf4'/>
<id>urn:sha1:e339a6b097c515a31ce230d498c44ff2e89f1cf4</id>
<content type='text'>
Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
understand how snapshot delete was using it and assumed that we needed the
quota operations there.  With Mark's work this has turned out to be not the
case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
argument and make __btrfs_mod_ref call it's process function with no_quota set
always.  Thanks,

Signed-off-by: Josef Bacik &lt;jbacik@fb.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
</content>
</entry>
<entry>
<title>Btrfs: fix race of using total_bytes_pinned</title>
<updated>2014-07-03T14:04:15Z</updated>
<author>
<name>Liu Bo</name>
<email>bo.li.liu@oracle.com</email>
</author>
<published>2014-07-02T08:58:01Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=d288db5dc0110c8e0732d099aaf7a05e2ea0e0c8'/>
<id>urn:sha1:d288db5dc0110c8e0732d099aaf7a05e2ea0e0c8</id>
<content type='text'>
This percpu counter @total_bytes_pinned is introduced to skip unnecessary
operations of 'commit transaction', it accounts for those space we may free
but are stuck in delayed refs.

And we zero out @space_info-&gt;total_bytes_pinned every transaction period so
we have a better idea of how much space we'll actually free up by committing
this transaction.  However, we do the 'zero out' part a little earlier, before
we actually unpin space, so we end up returning ENOSPC when we actually have
free space that's just unpinned from committing transaction.

xfstests/generic/074 complained then.

This fixes it by actually accounting the percpu pinned number when 'unpin',
and since it's protected by space_info-&gt;lock, the race is gone now.

Signed-off-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
Reviewed-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
</content>
</entry>
<entry>
<title>Btrfs: fix broken free space cache after the system crashed</title>
<updated>2014-06-19T21:20:54Z</updated>
<author>
<name>Miao Xie</name>
<email>miaox@cn.fujitsu.com</email>
</author>
<published>2014-06-19T02:42:50Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=e570fd27f2c5d7eac3876bccf99e9838d7f911a3'/>
<id>urn:sha1:e570fd27f2c5d7eac3876bccf99e9838d7f911a3</id>
<content type='text'>
When we mounted the filesystem after the crash, we got the following
message:
  BTRFS error (device xxx): block group xxxx has wrong amount of free space
  BTRFS error (device xxx): failed to load free space cache for block group xxx

It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.

There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
  space cache
- account the size of the allocated space that is used to store the file
  data, if the size is not zero, don't write out the free space cache.

The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.

Signed-off-by: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
</content>
</entry>
<entry>
<title>btrfs: allocate raid type kobjects dynamically</title>
<updated>2014-06-10T00:21:01Z</updated>
<author>
<name>Jeff Mahoney</name>
<email>jeffm@suse.com</email>
</author>
<published>2014-05-27T16:59:57Z</published>
<link rel='alternate' type='text/html' href='https://universe.0xinfinity.dev/distro/kernel/commit/?id=c1895442be01c58449e3bf9272f22062a670e08f'/>
<id>urn:sha1:c1895442be01c58449e3bf9272f22062a670e08f</id>
<content type='text'>
We are currently allocating space_info objects in an array when we
allocate space_info. When a user does something like:

# btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
# btrfs balance start -mconvert=single -dconvert=single /mnt -f
# btrfs balance start -mconvert=raid1 -dconvert=raid1 /

We can end up with memory corruption since the kobject hasn't
been reinitialized properly and the name pointer was left set.

The rationale behind allocating them statically was to avoid
creating a separate kobject container that just contained the
raid type. It used the index in the array to determine the index.

Ultimately, though, this wastes more memory than it saves in all
but the most complex scenarios and introduces kobject lifetime
questions.

This patch allocates the kobjects dynamically instead. Note that
we also remove the kobject_get/put of the parent kobject since
kobject_add and kobject_del do that internally.

Signed-off-by: Jeff Mahoney &lt;jeffm@suse.com&gt;
Reported-by: David Sterba &lt;dsterba@suse.cz&gt;
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
</content>
</entry>
</feed>
