note on "PRES" paper

2018-08-10

What problem does the paper address?
- Reproducing concurrency bugs from production runs on multiprocessors (multicore), with low recording overhead, within a few coordinated replay attempts.
How is it different from previous work, if any?
1. Traditional: deterministic replay. The sources of non-determinism are (1) input, e.g. keyboard, and (2) system calls, e.g. gettime(), both of which can be recorded for re-execution, and (3) timing variation from hardware, which can be overcome with logical time, e.g. the number of instructions executed. But (3) causes problems on multicore.
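
To make the record-and-re-execute idea concrete, here is a minimal sketch of logging non-deterministic values and feeding them back on replay. The `Recorder`, `Replayer`, and `program` names are made up for illustration and are not from the paper; a real replay system intercepts inputs and system calls at a much lower level.

```python
import random
import time


class Recorder:
    """Original run: call the real source and log what it returned."""

    def __init__(self):
        self.log = []

    def call(self, fn):
        value = fn()
        self.log.append(value)
        return value


class Replayer:
    """Replay run: return logged values instead of calling the real source."""

    def __init__(self, log):
        self._values = iter(log)

    def call(self, fn):
        # fn is deliberately ignored; the logged value replaces it.
        return next(self._values)


def program(env):
    # Two sources of non-determinism: a clock read and a simulated key press.
    t = env.call(time.time)
    key = env.call(lambda: random.choice("abc"))
    return f"{key}@{t}"


recorder = Recorder()
original = program(recorder)
replayed = program(Replayer(recorder.log))
print(original == replayed)
```

This only works because the program is single-threaded: once the logged values are replayed in order, nothing else can differ. It does not capture how threads on different cores interleave.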

The most problematic source of non-determinism comes from timing variations caused by hardware, such as cache state, memory refresh-scrubbing, and timing variations on buses and devices.

1. Besides thread scheduling, timing variations strongly influence how concurrent threads interleave with each other. On multicore, threads run in parallel and interact through synchronization or shared memory. This is different from a uniprocessor.
2. This can be solved by recording all inter-thread interactions. (1) Hardware assistance would reduce the overhead, but such hardware does not exist. (2) A software approach adds 10x - 100x overhead to the production run. It is like saying: I can save you, but you have to stay in the hospital all your life.
  
  Fig 2. Very clear objective, really helpful for understanding the difference from previous work.
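
To see why the interleaving matters, here is a small sketch that enumerates every possible interleaving of two threads that each increment a shared counter as a separate load and store. It is a toy model written for these notes, not code from the paper; the thread names `A` and `B` and the two-step increment are assumptions.

```python
from itertools import combinations


def interleavings(n_a, n_b):
    """Yield every schedule of n_a steps of A and n_b steps of B,
    keeping each thread's own steps in program order."""
    total = n_a + n_b
    for a_slots in combinations(range(total), n_a):
        slots = set(a_slots)
        yield ["A" if i in slots else "B" for i in range(total)]


def run(schedule):
    """Each thread does `counter += 1` as two steps: load, then store."""
    shared = 0
    local = {}
    pc = {"A": 0, "B": 0}
    for tid in schedule:
        if pc[tid] == 0:
            local[tid] = shared        # step 1: load the shared value
        else:
            shared = local[tid] + 1    # step 2: store the incremented value
        pc[tid] += 1
    return shared


# Group schedules by the final counter value they produce.
outcomes = {}
for schedule in interleavings(2, 2):
    outcomes.setdefault(run(schedule), []).append("".join(schedule))

for result, schedules in sorted(outcomes.items()):
    print(result, schedules)
# prints:
# 1 ['ABAB', 'ABBA', 'BAAB', 'BABA']
# 2 ['AABB', 'BBAA']
```

Four of the six schedules lose an update. Recording inputs and system calls alone cannot tell these schedules apart, which is why a full replay has to record the inter-thread interactions as well.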
What is the approach used to solve the problem?

Fig 3. PRES approach overview
- Does not try to reproduce the bug in a single replay, since that requires high recording overhead in production.
- Only reproduces the bug, not the execution path. How can we be sure the bug is caused by the same execution path?
- Sketches just record certain points of the execution; different sketching mechanisms just add different kinds of event points.
- It is unclear how a production-run failure is reproduced identically. How can we be sure the replay crashes for the very same reason? Also, for an incorrect result, how can we be sure it is captured in the production run, since only sketches are recorded, not all the information?
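
The points above can be illustrated with a toy model, which is my own simplification and not PRES's actual algorithm. Assume the sketch records only which thread ran first (a hypothetical, very coarse sketch), and the replayer tries schedules consistent with that sketch until the failure shows up. The names `sketch_of`, `reproduce`, and `is_failure` are invented for this example, and the failure check (a lost update) stands in for a crash or wrong output.

```python
from itertools import combinations


def interleavings(n_a, n_b):
    """All schedules of two threads, preserving per-thread order."""
    total = n_a + n_b
    for a_slots in combinations(range(total), n_a):
        slots = set(a_slots)
        yield ["A" if i in slots else "B" for i in range(total)]


def run(schedule):
    """Each thread increments a shared counter as load, then store."""
    shared, local, pc = 0, {}, {"A": 0, "B": 0}
    for tid in schedule:
        if pc[tid] == 0:
            local[tid] = shared
        else:
            shared = local[tid] + 1
        pc[tid] += 1
    return shared


def is_failure(result):
    # Assumed failure symptom: one of the two increments was lost.
    return result != 2


def sketch_of(schedule):
    # Hypothetical sketch: only the first thread to run is recorded.
    return schedule[0]


def reproduce(sketch, max_attempts=10):
    """Try schedules that match the sketch until the failure appears."""
    attempts = 0
    for candidate in interleavings(2, 2):
        if sketch_of(candidate) != sketch:
            continue  # inconsistent with what production recorded
        if attempts >= max_attempts:
            break
        attempts += 1
        if is_failure(run(candidate)):
            return candidate, attempts
    return None, attempts


production = list("ABBA")  # the failing production schedule, not fully recorded
found, attempts = reproduce(sketch_of(production))
print("".join(found), attempts)
# prints: ABAB 2
```

The replay needs two attempts and ends up on `ABAB`, not the production schedule `ABBA`, yet it shows the same symptom. That is exactly the question raised above: the symptom matches, but nothing in this toy model guarantees the root cause is the same.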
How does the paper support or otherwise justify its arguments and conclusions?
- 11 open-source applications in 3 categories, but only 13 bugs? Are these bugs representative? See the characteristic study.
- very thorough experiment
What do you like / dislike about this paper?
- likes: writing very clear, problem important, solution interesting
- dislike: I still doubt whether it will miss some bugs. The experiments cover only 13 bugs. I understand it is hard to collect complicated concurrency bugs, and harder still to reproduce them, but …
Was the paper, in your judgement, successful in addressing the problem?
- yes

西窗月
