This blog came about because an old friend is trying to figure out how Java can do aggressive loop and inlining optimizations, while allowing the loading of new code and setting breakpoints… in that optimized code.
On 2/21/2015 11:04 AM, IG wrote:
Safepoint. I'm still confused because I don't understand what is the required state at a safepoint that can't be ensured at any random point of execution. What is in-flight that must be permitted to complete? Given a cycle-by-cycle map you can find all the pointers at any cycle, so is safepointing just a way to reduce the volume of pointer-locating info?
Reducing OOP-map volume happens, and it is useful, but it is not the main thing being achieved.
It's flipping backwards from the machine state to the language-architected state, unwinding from super-aggressive optimization to reach the theoretical Java (or C/C++) language state at some points. Toooo expensive to preserve that mapping line-by-line, call-by-call, execution-point-by-execution-point. So you keep the mapping at a few places, and optimize the bejeebers out of things in-between. If somebody wants to set a breakpoint, inspect or change a variable, or load new code,
then you need to stop the physical machine (actual CPU/Mill) at some place where you can find all the language-architected pieces. This mapping is far more complex than the oop/no-oop mapping for GC… but it isn't why you make it rare. You make it rare because you have to hang onto all the language state at these safepoints, but don't need it between safepoints.
An example might help:
for( int i=0; i<N; i++ )
  sum += A[i];  // set a breakpoint here, for 'i>1e6' or maybe 'i>1e7'
So intent is to run awhile… then somebody sets a breakpoint… and wants to inspect 'sum', or change 'i' in the debugger, then continue execution. What's the machine code look like, with no safepoints? Something like this (bogus X86-like assembly code, pardon non-perfection):
⦠some setup, zero-trip test, range-check on āNā vs āA.lengthā, etc
ld RA=&A // Classic strength-reduced pointer into array A
add RC,RA,#N*8 // Stopping value of A
ld RB=0 // sum
loop:
ld RD=[RA+off] // Load from A
add RB=RB,RD // Sum into RB
add RA=RA,8 // Advance our pointer into A
cmp RA,RC // Stop when at the end
blt loop
// RB contains sum
But actually, with some loop unrolling:
⦠some setup, zero-trip test, range-check on āNā vs āA.lengthā, etc
ld RA=&A // Classic strength-reduced pointer into array A
add RC,RA,#(N & (-1-3))*8 // mask off low 2 bits of trip count - unroll by 4
ld RB=0 // sum
loop:
prefetch [RA+off+32] // Fetch next cache line… overlap memory & compute
ld R0=[RA+off+ 0] // Get 4 values from array A
ld R1=[RA+off+ 8]
ld R2=[RA+off+16]
ld R3=[RA+off+24]
add R4=R0,R1 // add-tree to allow more parallel adders
add R5=R2,R3 // add-tree to allow more parallel adders
add RB=RB,R4 // partial sum into RB
add RB=RB,R5 // full sum into RB
add RA=RA,8*4 // Skip over 4 elements of array A
cmp RA,RC // Stop when at the end
blt loop
⦠cleanup code for 0-3 more iterations
// RB contains sum
Now I want to break on the 'sum += A[i]' line… where is that, exactly, in that unrolled loop?
And which register contains 'i'? (Hint: none of them.)
Do I even want to keep a shadow copy of 'i' in parallel with my 'A' pointer I'm rolling through the loop? It's pure overhead, unless I breakpoint and want to have some place to read 'i'… but even then I cannot change this shadow 'i', because changing it won't change my 'A' pointer.
What about changing N? Do I have enough info in the register mapping to describe all possible kinds of updates I might want to do to register RC when I tweak N? (RC here is '&A + (N & ~3)*8', so a new 'N' means recomputing RC from values that may no longer be live.) This example is simple, but optimizing compilers can do hella complex code changes.
How does a Safepoint help here? Well, for starters it points out that I do indeed need some way to find 'i', and if I change it, to propagate those changes back into the loop. Let's look at this code with a Safepoint in it.
⦠some setup, zero-trip test, range-check on āNā vs āA.lengthā, etc
ld RA=&A // Classic strength-reduced pointer into array A
add RC,RA,#(N & (-1-3))*8 // mask off low 2 bits of trip count - unroll by 4
ld RB=0 // sum
ld RI=0 // Concession to Safepoint: keep 'i' around
loop:
prefetch [RA+off+32] // Fetch next cache line… overlap memory & compute
ld R0=[RA+off+ 0] // Get 4 values from array A
ld R1=[RA+off+ 8]
ld R2=[RA+off+16]
ld R3=[RA+off+24]
add R4=R0,R1 // add-tree to allow more parallel adders
add R5=R2,R3 // add-tree to allow more parallel adders
add RB=RB,R4 // partial sum into RB
add RB=RB,R5 // full sum into RB
add RA=RA,8*4 // Skip over 4 elements of array A
add RI=RI,4 // Concession to Safepoint: keep 'i' around
SAFEPOINT: RI contains 'i', RB contains 'sum', 'some place in the stack' contains 'N'
cmp RA,RC // Stop when at the end
blt loop
⦠cleanup code for 0-3 more iterations
// RB contains sum
So I added 1 op in my hot loop to compute 'i' on the fly. Not too expensive; 1 clock per cache-line of data. In general we end up keeping alive what is otherwise programmatically dead 'just in case' the user stops the running code and wants to inspect (dead) variables, but this is cheap. Just spill 'em off to stack somewhere and flag 'em in the Safepoint's variable map.
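To make that map concrete, here is a minimal sketch in Java of what a safepoint's variable map might record; the class names, the stack offset, and the PC value are all invented for illustration (a real JVM's internal structures differ):

import java.util.Map;

// Hypothetical sketch of a safepoint's variable map - not a real JVM's classes.
// At each safepoint PC, the JIT records where each language-level variable lives.
enum LocKind { REGISTER, STACK_SLOT }

record Location(LocKind kind, String register, int stackOffset) {
  static Location inRegister(String r) { return new Location(LocKind.REGISTER, r, 0); }
  static Location onStack(int offset)  { return new Location(LocKind.STACK_SLOT, null, offset); }
}

record SafepointMap(long pc, Map<String, Location> vars) {}

class VarMapDemo {
  public static void main(String[] args) {
    // The map for the SAFEPOINT in the unrolled loop above: 'i' lives in RI,
    // 'sum' in RB, 'N' in some stack slot. RA does NOT appear - it is a
    // derived value ('&A + i*8'), invisible at the language level.
    SafepointMap sp = new SafepointMap(0x4018c0L, Map.of(   // PC is made up
        "i",   Location.inRegister("RI"),
        "sum", Location.inRegister("RB"),
        "N",   Location.onStack(16)));
    System.out.println(sp);
  }
}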
How do we trigger the Safepoint? Easiest way is to have the safepoint code test a thread-local 'stop' bit, and if it is set… stop. If we have to stop we might have a lot of cleanup to do, but stopping is rare, so the cost (of mostly not stopping) is low. There are lots of ways to make the safepoint check cheap. Here's one that depends on keeping a 2Meg (TLB-sized) window above the stack pointer mapped in or out, perfect for an X86 TLB:
ld RX,[RSP+2M] // will trap if we map this page out, then catch the SEGV & handle
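In Java terms, the simplest flag-polling variant looks roughly like the sketch below; the flag name and the slow-path method are invented for illustration (a real JVM emits this check as machine code at loop back-edges, it does not write it in Java):

class SafepointPollDemo {
  // The 'stop' bit: the runtime sets this when it wants all threads parked.
  static volatile boolean safepointRequested = false;

  static long sum(long[] A) {
    long sum = 0;
    for (int i = 0; i < A.length; i++) {
      sum += A[i];
      if (safepointRequested)   // the cheap check on the loop back-edge
        reachSafepoint(i, sum); // rare slow path: publish state and block
    }
    return sum;
  }

  // A real runtime would publish the variable map ('i', 'sum', ...) here
  // and block until the VM operation (GC, breakpoint, ...) completes.
  static void reachSafepoint(int i, long sum) { }

  public static void main(String[] args) {
    System.out.println(sum(new long[]{ 1, 2, 3, 4 })); // prints 10
  }
}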
How do we use the Safepoint? Once we've stopped the running thread mid-loop, we expect the trap handler to have saved off all the registers. Later we can inspect the register map to find 'i' or 'sum'. Note that a Java-level query cannot ask for our 'RA' register (it's the sum of '&A' and 'i*8', and isn't visible at the language level), so 'RA' does not appear in the Safepoint's map.
How do we change 'i' or 'N' and have it All Just Work? Basically, we re-enter a clone of the loop from above… a clone specially made to be re-entered at any iteration. The loop setup code in the clone will rebuild 'RA' and 'RB' and any other intermediate state as needed, before falling into essentially the same code as the normal loop (and indeed they might be shared here!). Or in other words: since the optimizer did something complex to build 'RA' from '&A' and 'i', have the optimizer do it again for any arbitrary new 'i'.
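A minimal Java sketch of that re-entry idea, with invented names (a JIT does this with machine code and a deopt/reopt cycle, not a visible method): the clone takes the current 'i' and 'sum' as parameters and re-derives everything else from them.

class ReentryDemo {
  // Normal entry: start from iteration 0.
  static long sum(long[] A, int N) { return sumFrom(A, N, 0, 0); }

  // Re-entry point: the debugger changed 'i' (or 'sum'); resume from there.
  // The setup re-derives whatever the optimizer had strength-reduced: the
  // rolling pointer RA would be rebuilt here as '&A + i*8'.
  static long sumFrom(long[] A, int N, int i, long sum) {
    for (; i < N; i++)
      sum += A[i];
    return sum;
  }

  public static void main(String[] args) {
    long[] A = new long[2_000_000];
    java.util.Arrays.fill(A, 1);
    // Breakpoint hit at i=1e6; the user set 'i' back to 0 and continued:
    System.out.println(sumFrom(A, A.length, 0, 0)); // prints 2000000
  }
}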
The general plan is simple (from the optimizing compiler's point-of-view): Safepoints are treated like a function call taken only on a very, very rare execution path. Safepoints take as inputs all visible language-level state, including the state of all inlined function calls, and generally are assumed to update all memory state but not any local vars. (If you are also hacking local-variable state via the debugger interface, then we need to go a step further and deopt/reopt, leading to the loop clone mentioned above.)
With this magical 'low-frequency function call' in hand, the optimizer attempts the best optimizations it can, subject to the normal constraint that the call's arguments are available somehow.
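Here is a hedged sketch of that compiler's-eye view in Java, with 'safepoint' as an invented stand-in for the rare call: its arguments pin down all the language-level state, and its assumed memory side effects keep heap values from being cached across it, while locals stay freely optimizable.

class OptimizerView {
  static long sum(long[] A, int N) {
    long sum = 0;
    for (int i = 0; i < N; i++) {
      sum += A[i];             // locals: free to live in registers, be
                               // strength-reduced, unrolled, etc.
      safepoint(i, sum, A, N); // the 'rare call': takes all visible
                               // language-level state as arguments, and is
                               // assumed to update all memory state - so
                               // heap values can't be cached in registers
                               // across it, but locals like 'i' and 'sum'
                               // are untouched.
    }
    return sum;
  }

  // Stand-in for the runtime's stop path (invented); almost never taken.
  static void safepoint(int i, long sum, long[] A, int N) { }

  public static void main(String[] args) {
    System.out.println(sum(new long[]{ 1, 2, 3 }, 3)); // prints 6
  }
}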
Hope this helps, Cliff