From: "Nicholas Piggin" <npiggin@gmail.com>
To: "Timothy Pearson" <tpearson@raptorengineering.com>,
	"Michael Ellerman" <mpe@ellerman.id.au>
Cc: "Jens Axboe" <axboe@kernel.dk>,
	"regressions" <regressions@lists.linux.dev>,
	"christophe leroy" <christophe.leroy@csgroup.eu>,
	"linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save
Date: Tue, 21 Nov 2023 10:18:52 +1000	[thread overview]
Message-ID: <CX424SO03Y1Q.2YCS9G1C3IAOW@wheely> (raw)
In-Reply-To: <439072392.48800901.1700498743840.JavaMail.zimbra@raptorengineeringinc.com>

On Tue Nov 21, 2023 at 2:45 AM AEST, Timothy Pearson wrote:
>
>
> ----- Original Message -----
> > From: "Michael Ellerman" <mpe@ellerman.id.au>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>, "Jens Axboe" <axboe@kernel.dk>, "regressions"
> > <regressions@lists.linux.dev>, "npiggin" <npiggin@gmail.com>, "christophe leroy" <christophe.leroy@csgroup.eu>,
> > "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Sent: Monday, November 20, 2023 1:10:06 AM
> > Subject: Re: [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register  save
>
> > Hi Timothy,
> > 
> > Great work debugging this. I think your fix is good, but I want to understand it
> > a bit more to make sure I can explain why we haven't seen it outside of
> > io-uring.
> > If this can be triggered outside of io-uring then I have even more backporting
> > in my future :}
> > 
> > Typically save_fpu() is called from __giveup_fpu() which saves the FP regs and
> > also *turns off FP* in the task's MSR, meaning the kernel will reload the FP regs
> > from the thread struct before letting the task use FP again. So in that case
> > save_fpu() is free to clobber fr0 because the FP regs no longer hold live values
> > for the task.
> > 
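[ A rough C sketch of that flow, paraphrased rather than quoted from
the kernel source (exact helpers and MSR bits vary by tree):

    /* Save the FP state, then turn FP off in the task's MSR.  The
     * clobbered fr0 can never leak back to the task: its registers
     * will be reloaded from thread.fp_state before it uses FP again.
     */
    static void __giveup_fpu(struct task_struct *tsk)
    {
            unsigned long msr;

            save_fpu(tsk);                  /* may clobber live fr0/vs0 */
            msr = tsk->thread.regs->msr;
            msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1);
            if (cpu_has_feature(CPU_FTR_VSX))
                    msr &= ~MSR_VSX;
            regs_set_return_msr(tsk->thread.regs, msr);
    }
]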
> > There is another case though, which is the path via:
> >  copy_process()
> >    dup_task_struct()
> >      arch_dup_task_struct()
> >        flush_all_to_thread()
> >          save_all()
> > 
> > That path saves the FP regs but leaves them live. That's meant as an
> > optimisation for a process that's using FP/VSX and then calls fork(); leaving
> > the regs live means the parent process doesn't have to take a fault after the
> > fork to get its FP regs back.
> > 
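[ From memory, save_all() looks roughly like the sketch below (not the
exact source); the key point is that it calls save_fpu() without
clearing MSR_FP afterwards:

    void save_all(struct task_struct *tsk)
    {
            unsigned long usermsr;

            if (!tsk->thread.regs)
                    return;

            usermsr = tsk->thread.regs->msr;
            if ((usermsr & msr_all_available) == 0)
                    return;

            msr_check_and_set(msr_all_available);

            if (usermsr & MSR_FP)
                    save_fpu(tsk);      /* clobbers fr0/vs0; MSR_FP stays set */
            if (usermsr & MSR_VEC)
                    save_altivec(tsk);  /* likewise uses vr0 for the VSCR save */

            msr_check_and_clear(msr_all_available);
    }
]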
> > That path does clobber fr0, but fr0 is volatile across a syscall, and the only
> > way to reach copy_process() from userspace is via a syscall. So in normal usage
> > fr0 being clobbered across a syscall shouldn't cause data corruption.
> > 
> > Even if we handle a signal on the return from the fork() syscall, the worst that
> > happens is that the task's thread struct holds the clobbered fr0, but the task
> > doesn't care (because fr0 is volatile across the syscall anyway).
> > 
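[ For background: in the ELFv2 ABI f0-f13 are volatile (caller-saved),
so a compiler never keeps a live value in fr0 across any call boundary,
syscalls included.  Illustrative userspace C -- f() is a made-up
example, not from the kernel:

    #include <unistd.h>

    double f(double x)
    {
            double y = x * x;   /* may pass through fr0 momentarily */
            fork();             /* call boundary: fr0 assumed dead here */
            return y;           /* y was kept in a non-volatile FPR or
                                 * spilled to the stack, never left in fr0 */
    }
]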
> > That path is something like:
> > 
> > system_call_vectored_common()
> >  system_call_exception()
> >    sys_fork()
> >      kernel_clone()
> >        copy_process()
> >          dup_task_struct()
> >            arch_dup_task_struct()
> >              flush_all_to_thread()
> >                save_all()
> >                  if (tsk->thread.regs->msr & MSR_FP)
> >                    save_fpu()
> >                    # does not clear MSR_FP from regs->msr
> >  syscall_exit_prepare()
> >    interrupt_exit_user_prepare_main()
> >      do_notify_resume()
> >        get_signal()
> >        handle_rt_signal64()
> >          prepare_setup_sigcontext()
> >            flush_fp_to_thread()
> >              if (tsk->thread.regs->msr & MSR_FP)
> >                giveup_fpu()
> >                  __giveup_fpu()
> >                    save_fpu()
> >                    # clobbered fr0 is saved, but task considers it volatile
> >                    # across syscall anyway
> > 
> > 
> > But we now have a new path, because io-uring can call copy_process() via
> > create_io_thread() from the signal handling path. That's OK if the signal is
> > handled as we return from a syscall, but it's not OK if the signal is handled
> > due to some other interrupt.
> > 
> > Which is:
> > 
> > interrupt_return_srr_user()
> >  interrupt_exit_user_prepare()
> >    interrupt_exit_user_prepare_main()
> >      do_notify_resume()
> >        get_signal()
> >          task_work_run()
> >            create_worker_cb()
> >              create_io_worker()
> >                copy_process()
> >                  dup_task_struct()
> >                    arch_dup_task_struct()
> >                      flush_all_to_thread()
> >                        save_all()
> >                          if (tsk->thread.regs->msr & MSR_FP)
> >                            save_fpu()
> >                            # fr0 is clobbered and potentially live in userspace
> > 
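[ The clobber itself comes from the FPSCR save.  save_fpu() is assembly
(arch/powerpc/kernel/fpu.S); in rough C terms it does something like
the following, where store_all_fprs() and read_fpscr_via_fr0() are
invented names standing in for the SAVE_32FPVSRS macro and the mffs
instruction:

    void save_fpu(struct task_struct *tsk)
    {
            struct thread_fp_state *fp = &tsk->thread.fp_state;

            store_all_fprs(fp);   /* fr0..fr31 stored while still intact */
            fp->fpscr = read_fpscr_via_fr0();   /* mffs: the live fr0/vs0
                                                 * now holds FPSCR bits */
            /* no restore of fr0 here; the patch avoids the clobber by
             * preserving/restoring fr0 (and vs0) around the FPSCR read */
    }
]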
> > 
> > So tldr I think the corruption is only an issue since io-uring started doing
> > the clone via signal, which I think matches the observed timeline of this bug
> > appearing.
>
> I agree the corruption really only started showing up in earnest on io_uring clone-via-signal, as this was confirmed several times in the course of debugging.  Bear in mind, however, that we have seen very, very rare crashes over several years in other tasks, and I am starting to think this bug might be the root cause (see below).
>
> Note as well that I may very well have a wrong call order in the commit message, since I was relying on a couple of WARN_ON() macros I inserted to check for a similar (but not identical) condition and didn't spend much time getting new traces after identifying the root cause.
>
> I went back and grabbed some real world system-wide stack traces, since I now know what to trigger on.  A typical example is:
>
> interrupt_return_srr_user()
>  interrupt_exit_user_prepare()
>   interrupt_exit_user_prepare_main()
>    schedule()
>     __schedule()
>      __switch_to()
>       giveup_all()
>        # tsk->thread.regs->msr MSR_FP is still set here
>        __giveup_fpu()
>         save_fpu()
>         # fr0 is clobbered and potentially live in userspace
>
> This indicates that the pending signal is not actually required; it simply makes triggering much more likely.  Both the pending signal and the _TIF_NEED_RESCHED paths give up control and end up in the same overall position of triggering down-call routines that assume the FPU state is valid.

That seems probably true, but that's on the other side of the equation,
I think? Because __giveup_fpu() does clear MSR[FP] before any context
switch or return to user is possible.
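[ That is, once MSR[FP] is clear the task's next FP instruction traps
to the FP-unavailable handler, which reloads everything -- including
the clobbered fr0 -- from the saved state.  A sketch of the net effect
only (the real handler is load_up_fpu in fpu.S):

    static void fp_unavailable_effect(struct pt_regs *regs)
    {
            load_fp_state(&current->thread.fp_state); /* fr0..fr31 + FPSCR back */
            regs_set_return_msr(regs, regs->msr | MSR_FP); /* FP live again */
    }
]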

*After* we have clobbered fr0 without clearing MSR_FP from the user msr,
then yes, any context switch or return to user will cause userspace to
next run with a clobbered fr0. But getting to that fr0/msr state
requires the flush_all_to_thread(), and that needs to be called in a
path where user fr0 must not be clobbered. I don't see one other than
io-uring yet.

[ KVM via kvmppc_save_user_regs() (which is basically the same as
flush_all_to_thread()) can do it. Not via the fr0 clobber in save_fpu(),
because this is called via a VCPU run ioctl, but KVM will later clobber
all FP/VEC registers via running guest code. So we clobber non-volatile
regs as well. This wouldn't be causing random corruption and crashes
though, only possibly bugs in code that runs KVM guests. ]

Thanks,
Nick

>
> perl and bash seem to be affected to some degree, though current builds don't use VSX instructions heavily enough to cause crashes with any significant probability.  That said, over many years of running POWER at datacenter scale I have seen enough random bash/perl crashes in the logs to recognize the pattern; I think this has been a low-grade issue for a long time, but with an infinitesimally small chance of happening it was seen as random noise / hardware issues / other rare bugs in various programs.



Thread overview:

2023-11-19 15:18 [PATCH v2] powerpc: Don't clobber fr0/vs0 during fp|altivec register save Timothy Pearson
2023-11-20  7:10 ` Michael Ellerman
2023-11-20 14:32   ` Nicholas Piggin
2023-11-20 16:45   ` Timothy Pearson
2023-11-20 23:39     ` Michael Ellerman
2023-11-21  0:27       ` Nicholas Piggin
2023-11-21  1:23       ` Timothy Pearson
2023-11-21  7:56         ` Nicholas Piggin
2023-11-21  4:10       ` Timothy Pearson
2023-11-21  4:26         ` Timothy Pearson
2023-11-21  7:54         ` Nicholas Piggin
2023-11-22  5:01         ` Michael Ellerman
2023-11-24  0:01           ` Timothy Pearson
2023-11-27 18:39             ` Timothy Pearson
2023-11-27 19:58               ` Christophe Leroy
2023-11-28  0:59                 ` Michael Ellerman
2023-11-28  1:40                   ` Nicholas Piggin
2023-11-27 22:53               ` Michael Ellerman
2023-11-28 12:57                 ` Michael Ellerman
2023-11-30 16:29                   ` Timothy Pearson
2023-11-21  0:18     ` Nicholas Piggin [this message]
2023-12-02 23:00 ` Michael Ellerman
