西窗月

Notes on the "PRES" paper

  • What problem does the paper address?

    • Reproducing concurrency bugs from production runs on multiprocessors (multicore), with low overhead, within a few coordinated replay attempts.
  • How is it different from previous work, if any?

    1. Traditional approach: deterministic replay. The sources of non-determinism are (1) input, e.g. keyboard; (2) system calls, e.g. gettime(), whose results can be recorded for re-execution; and (3) timing variation from hardware, which can be overcome with logical time, e.g. the number of instructions executed. But (3) causes problems on multicore.

The most problematic source of non-determinism comes from timing variations caused by hardware, such as cache state, memory refresh-scrubbing, and timing variations on buses and devices.

    2. Timing variations, in addition to thread scheduling, strongly influence how concurrent threads interleave with each other. Threads on a multicore interact through synchronization or shared memory, which is different from a uniprocessor.
    3. This can be solved by recording all inter-thread interactions: (1) with hardware assistance to reduce overhead, but such hardware does not exist; or (2) in software, at 10x - 100x overhead on the production run. It is something like "I can save you, but you have to stay in the hospital all your life."
      Fig 2. Very clear, and really helpful for understanding the difference.
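
The traditional record-and-replay idea from item 1 (log the results of non-deterministic calls, then feed them back on re-execution) can be sketched roughly as below. This is only a single-threaded illustration; the `Recorder` class and the `program` function are hypothetical names made up for this note, not anything from the paper, and it deliberately ignores the multicore interleaving problem discussed above.

```python
import random
import time


class Recorder:
    """Records non-deterministic values, or replays them from a log."""

    def __init__(self, log=None):
        # No log given: record mode. Log given: replay mode.
        self.replaying = log is not None
        self.log = list(log) if log is not None else []
        self.cursor = 0

    def nondet(self, source):
        """Wrap one non-deterministic call (input, system call, ...)."""
        if self.replaying:
            value = self.log[self.cursor]  # return the recorded value
            self.cursor += 1
            return value
        value = source()                   # perform the real call
        self.log.append(value)             # and remember its result
        return value


def program(rec):
    # Hypothetical program whose result depends on the clock and on input.
    now = rec.nondet(time.time)
    roll = rec.nondet(lambda: random.randint(1, 6))
    return (now, roll)


production = Recorder()
original = program(production)                    # recorded run
replayed = program(Recorder(log=production.log))  # replayed run
```

With one thread, replaying the log is enough to get the same result twice. With several threads on several cores the log would also need the order of their interactions, which is where the 10x - 100x overhead comes from.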
  • What is the approach used to solve the problem?

    Fig 3. PRES approach overview

    • It does not try to reproduce the bug in a single replay, since that would cause high overhead in production.
    • It only reproduces the bug, not the execution path. How can we be sure the bug is caused by the same execution path?
    • Sketches are just recorded event pointers; different sketching mechanisms just add different kinds of event pointers.
    • It is confusing how a production-run failure is reproduced identically. How can we be sure the two runs crash for the very same reason? Also, for an incorrect result, how can we be sure it was captured in the production run, since only sketches are recorded and not all the information?
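
To make the points above concrete for myself, here is a toy model of "record a sketch, then retry until the bug shows up". Everything in it is an assumption for illustration: the two-thread lost-update program, the sketch (only which thread ran first), and the brute-force enumeration of schedules are stand-ins, not the paper's actual sketching mechanisms or replayer.

```python
from itertools import permutations


def run(schedule):
    """Run two threads that each do a non-atomic `counter += 1`.

    `schedule` is a tuple of thread ids; the n-th occurrence of an id runs
    that thread's n-th step (step 0 = read, step 1 = write).
    """
    counter = 0
    local = {}
    step = {0: 0, 1: 0}
    for tid in schedule:
        if step[tid] == 0:
            local[tid] = counter      # read the shared counter
        else:
            counter = local[tid] + 1  # write back, possibly from a stale read
        step[tid] += 1
    return counter


def sketch_of(schedule):
    # Assumed sketch: production records only which thread ran first.
    return schedule[0]


def reproduce(sketch, failed):
    """Try schedules consistent with the sketch until the failure reappears."""
    attempts = 0
    for schedule in sorted(set(permutations((0, 0, 1, 1)))):
        if sketch_of(schedule) != sketch:
            continue                  # contradicts what was recorded
        attempts += 1
        if failed(run(schedule)):
            return schedule, attempts
    return None, attempts


production_schedule = (0, 1, 1, 0)       # lost update: counter ends at 1
sketch = sketch_of(production_schedule)  # all that the production run keeps
schedule, attempts = reproduce(sketch, lambda c: c != 2)
```

In this toy the second attempt reproduces the lost update with the schedule (0, 1, 0, 1), which is not the production interleaving (0, 1, 1, 0). That is exactly the situation my questions above are about: the failure is the same, but the path that led to it is not guaranteed to be.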
  • How does the paper support or otherwise justify its arguments and conclusions?

    • 11 open-source applications in 3 categories, but only 13 bugs? Are these bugs representative? See the characteristic study.
    • very thorough experiment
  • What do you like / dislike this paper?

    • likes: writing very clear, problem important, solution interesting
    • dislike: I still doubt whether it will miss some bugs. The experiment has only 13 bugs. I understand it is hard to collect complicated concurrency bugs and even to reproduce them, but …
  • Was the paper, in your judgement, successful in addressing the problem?

    • yes