Checkpoints

From SciNet Users Documentation
Revision as of 02:51, 28 March 2022 by Rzon (talk | contribs)

Having your program output periodic checkpoints is a very good idea; if it doesn't already, you should think about adding this feature.

What are Checkpoints?

Checkpoint files are files output by your program which contain the entire state of the program run so far, so that if the program ends for whatever reason it can be restarted from that point as if it had not stopped.

SciNet's systems do not provide system-level checkpointing (for various reasons, such as non-portability and difficulties with parallel programs), so you are responsible for your own checkpointing, from within your own code. It is the only way to be reliably able to restart your calculations.

Why Checkpoints?

Unlike a dedicated lab machine where you can run a job forever, on our large shared computer facilities the wallclock time limit for a job is 24 hours. As such, your job must end within 24 hours or it will be helpfully ended for you by the scheduler. Although exact limits vary, 24 hours is generally found to be a good balance between turnaround (ensuring people aren't waiting in the queue for weeks without being able to run) and being able to get a significant amount of work done in a single run. If, as will inevitably be the case, your runs grow to a size where they won't finish before the queue window closes, you will have to run jobs in several steps by outputting checkpoints and restarting from them.

Checkpoints also provide a certain amount of safety in case of hardware or software failure: one can restart from an earlier checkpoint without losing much work. In addition, checkpoints can be useful if you want to run to some intermediate state and then use that as the starting point for several different runs; this saves you from having to run to the intermediate state many times.

What should be in Checkpoint Files?

Checkpoint files should contain the entire state of your run, at full precision, so that your run can continue exactly where it left off. Note that this may differ from typical outputs, which may not need to be in full precision or may not need absolutely everything in the simulation's memory.
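To illustrate why full precision matters, the short sketch below compares a value written as rounded text (as is common for ordinary output) with the same value written in binary. The variable names are purely illustrative, and the six-decimal text format is only an assumed example of a typical output format.

```python
import struct

# A value from the middle of a calculation.
x = 0.1 + 0.2

# Typical human-readable output: rounded to six decimals, then read back.
text_roundtrip = float(f"{x:.6f}")

# Binary output: the 8 bytes of the double, written and read back unchanged.
binary_roundtrip = struct.unpack("<d", struct.pack("<d", x))[0]

# The rounded text loses the trailing digits; the binary copy does not.
lost_precision = (text_roundtrip != x)
kept_precision = (binary_roundtrip == x)
```

A run restarted from the rounded value would drift away from the uninterrupted run, whereas one restarted from the binary value would continue from exactly the same number.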

Typically, checkpoints are written out between iterations or steps in your job. To decide what needs to be output, ask yourself: "What data structures need to be filled for the program to take the next steps?" The checkpoint writing should then dump out all of that information. To restart from a checkpoint, you read that information back in and start from that step as if you had been running uninterrupted the whole time.
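The write-then-restart pattern described above can be sketched as follows. This is a minimal illustration rather than a recommended implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the contents of the `state` dictionary, the toy update rule, and the choice of Python's `pickle` module as the file format are all assumptions made for the example. The checkpoint is written to a temporary file and then renamed, so that a job killed in the middle of a write does not leave a truncated checkpoint behind.

```python
import os
import pickle

def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename replaces any previous checkpoint only once the new one is complete.
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Read back the state written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)

def run(path, n_steps, checkpoint_every=10):
    """Iterate up to n_steps, resuming from the checkpoint at path if it exists."""
    if os.path.exists(path):
        state = load_checkpoint(path)
    else:
        # Everything needed to take the next step (hypothetical toy state).
        state = {"step": 0, "x": 1.0}

    while state["step"] < n_steps:
        state["x"] = 0.5 * (state["x"] + 2.0 / state["x"])  # toy update rule
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)

    save_checkpoint(path, state)  # final checkpoint at the end of this run
    return state
```

Calling `run("state.pkl", 5)` and later `run("state.pkl", 12)` should leave the program in the same state as a single call of `run` with 12 steps on a fresh file, which is the property a checkpoint is meant to guarantee.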

How often should you checkpoint?

There's a tension here between not spending a lot of time (or disk space) checkpointing and not losing much work if you need to restart from the last checkpoint. Checkpoint "several" times during the queue window, for whatever value of "several" is most suitable for your work.
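One way to checkpoint "several" times per queue window is to trigger on elapsed wallclock time rather than on a fixed step count. The sketch below is only an illustration under stated assumptions: the class name `CheckpointTimer`, its methods, and the idea of stopping a safety margin before the time limit are all hypothetical choices, not part of any scheduler's interface.

```python
import time

class CheckpointTimer:
    """Decide when to checkpoint, based on elapsed wallclock time."""

    def __init__(self, interval_s, walltime_s, margin_s, clock=time.monotonic):
        self.interval_s = interval_s          # desired time between checkpoints
        self.clock = clock                    # injectable so it can be tested
        self.start = clock()
        # Stop a little before the limit so the last checkpoint can be written.
        self.deadline = self.start + walltime_s - margin_s
        self.last = self.start

    def due(self):
        """True once at least interval_s has passed since the last checkpoint."""
        return self.clock() - self.last >= self.interval_s

    def mark(self):
        """Record that a checkpoint has just been written."""
        self.last = self.clock()

    def must_stop(self):
        """True once the run is within margin_s of the wallclock limit."""
        return self.clock() >= self.deadline
```

With a 24-hour limit, for example, `CheckpointTimer(interval_s=6 * 3600, walltime_s=24 * 3600, margin_s=600)` would ask for a checkpoint roughly every six hours; inside the main loop one would call `due()` after each step, write a checkpoint and call `mark()` when it returns true, and write a final checkpoint and exit when `must_stop()` returns true. The interval and margin here are example values to be tuned to your own work.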

Rolling checkpoints

Because checkpoints may require substantial disk space, it is often the case that users do not keep all checkpoints; only the last (say) two may be needed, and after each following one is successfully written, the earliest one is deleted.
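The rolling scheme described above can be sketched as follows. The function names, the `checkpoint_NNNNNNNN.pkl` file naming, and the use of `pickle` are assumptions made for the illustration. The important point is the ordering: older checkpoints are deleted only after the new one has been completely written and renamed into place.

```python
import glob
import os
import pickle

def write_rolling_checkpoint(directory, step, state, keep=2):
    """Write a checkpoint for this step, then delete all but the newest `keep`."""
    if keep < 1:
        raise ValueError("keep must be at least 1")

    # Zero-padded step numbers make alphabetical order match step order.
    final_path = os.path.join(directory, f"checkpoint_{step:08d}.pkl")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)

    # Only now that the new checkpoint is safely in place, remove the oldest ones.
    existing = sorted(glob.glob(os.path.join(directory, "checkpoint_*.pkl")))
    for old_path in existing[:-keep]:
        os.remove(old_path)
    return final_path

def latest_checkpoint(directory):
    """Return the path of the newest checkpoint, or None if there is none."""
    existing = sorted(glob.glob(os.path.join(directory, "checkpoint_*.pkl")))
    return existing[-1] if existing else None
```

Keeping two checkpoints rather than one means that if the most recent file turns out to be unusable, the previous one is still available to restart from.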