A Buffet Tasting
The University of Queensland
3/13/23
I did (PhD) weird DNA simulations, (2019-2020) weird protein simulations, and (2021-now) weird electrode-electrolyte simulations. Spot the pattern! Find out more at my (very poorly maintained) website.
This illustration is created by Scriberia with The Turing Way community. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807
Does your hard drive look like this? (Mine does.)
When he investigated, Chang was horrified to discover that a homemade data analysis program had flipped two columns of data, inverting the electron-density map from which his team had derived the final protein structure. Unfortunately, his group had used the program to analyze data for other proteins.
Benefits of reproducibility by The Turing Way (link), DOI: 10.5281/zenodo.7684733.
We have a habit in writing articles published in scientific journals to make the work as finished as possible, to cover all the tracks, to not worry about the blind alleys or to describe how you had the wrong idea first, and so on.
So there isn’t any place to publish, in a dignified manner, what you actually did in order to get to do the work, although, there has been in these days, some interest in this kind of thing.
What barriers to good data management and reproducibility have you encountered?
Navigate to https://hackmd.io/@srtee/ctcms-2023-repro/edit
…
There are many paths toward reproducible research, and you shouldn’t try to change all aspects of your current practices all at once. Identify one weakness, adopt an improved approach, refine that a bit, and then move on to the next thing.
Buffet image from Unsplash (link)
This talk was inspired by a TTW Workshop on “Reproducible, Open and FAIR Research” (Karoune, Zormpa, and Lee Steele 2023) and Dan Katz’s “Research Reproducibility” talk (Katz 2023).
Scheffler et al. (2022)
Capture what you do so you can repeat it in one step – then capture multiple steps in one!
Keep secure data copies in multiple locations
Track changes in what you do and automate with version control
Organize and name things well
Capture your environment, license your data, and make your work citeable
Where you are will influence where you can start!
Webcomic by The Upturned Microscope (link)
Example
Work desktop + external HDD + RDM or GitHub (small files only!)
Make a regular plan! Helpful software: Rsync, Nextcloud, Syncthing
Synergizes with version control.
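A backup is only useful if the copy is intact, so it can help to verify it. Below is a minimal sketch, using only the Python standard library, that copies a folder and compares checksums. The function names (`sha256`, `backup`) and the folder names (`work`, `external_hdd`) are hypothetical, and dedicated tools such as Rsync handle this far more efficiently.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path):
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup(source_dir, backup_dir):
    """Copy every file under source_dir into backup_dir.

    Returns the list of source files whose copy failed checksum verification.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    mismatched = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = backup_dir / src.relative_to(source_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        if sha256(src) != sha256(dst):
            mismatched.append(src)
    return mismatched


# Demonstration in a throwaway directory (hypothetical folder names).
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    (work / "run01").mkdir(parents=True)
    (work / "run01" / "traj.dat").write_bytes(b"example data\n")
    failed = backup(work, Path(tmp) / "external_hdd")
    copied = (Path(tmp) / "external_hdd" / "run01" / "traj.dat").read_bytes()

print(failed)  # an empty list means every copy matched its source
```

Running something like this on a schedule is one way to turn "make a regular plan" into a one-step habit.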
Keep a NAMES.md in your project folder (Briney 2020), CC-BY 4.0
NAMES.md
_Files_: solvent trajectory DCD files
_Metadata_: potential difference, non-equilibrium force, trajectory date
_Encoding_: "V00" = 0.0 volts (2-pad),
"F040" = 40 LAMMPS force units (3-pad), ISO8601 dates (YYYYMMDD)
_Order_ and _separator_: volts, force, date, separated with _
_Versions_: maybe during postprocessing I will combine
dated partial trajectories into a "full" trajectory?
_Final convention_:
"V"vv_"F"fff_YYYYMMDD.dcd *or* "V"vv_"F"fff_full.dcd
_Examples_:
V00_F040_20230415.dcd
(0.0 V, force 40 LAMMPS units, date 2023-04-15)
V25_F100_full.dcd
(2.5 V, force 100 LAMMPS units, combined across multiple dates)
(Briney 2020), CC-BY 4.0
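Once a convention is written down, a short script can generate and check names so that nobody has to type them by hand. The following is a minimal sketch of the NAMES.md convention above. The helper names `make_name` and `parse_name` are hypothetical, and it assumes the voltage is encoded in tenths of a volt, as the "V00" and "V25" examples suggest.

```python
import re
from datetime import date, datetime

# "V"vv_"F"fff_YYYYMMDD.dcd  or  "V"vv_"F"fff_full.dcd
PATTERN = re.compile(r"^V(?P<vv>\d{2})_F(?P<fff>\d{3})_(?P<stamp>\d{8}|full)\.dcd$")


def make_name(volts, force, when=None):
    """Build a trajectory file name; when=None marks a combined 'full' trajectory."""
    vv = round(volts * 10)  # assumption: two digits encode tenths of a volt
    stamp = when.strftime("%Y%m%d") if when is not None else "full"
    return f"V{vv:02d}_F{force:03d}_{stamp}.dcd"


def parse_name(name):
    """Recover the metadata from a file name, or raise ValueError if it does not fit."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the naming convention")
    stamp = match["stamp"]
    when = None if stamp == "full" else datetime.strptime(stamp, "%Y%m%d").date()
    return {"volts": int(match["vv"]) / 10, "force": int(match["fff"]), "date": when}


print(make_name(0.0, 40, date(2023, 4, 15)))  # V00_F040_20230415.dcd
print(make_name(2.5, 100))                    # V25_F100_full.dcd
print(parse_name("V25_F100_full.dcd"))
```

Parsing names back into metadata is also a cheap way to catch files that drifted from the convention.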
inspired by (Bryan 2015)
NO:
myabstract.docx
Pictures of Space.jpg
figure 1.png
MwpktrimportantfinalFINAL.pdf
YES:
2021-02-16_ctcms-abstract.docx
better-pictures-of-space.jpg
fig01_scatterplot-coffee-vs-paper-length.png
1986-01-28_challenger-o-rings_raw-data.txt
fig01_intro-pic.png, fig02_charge-vs-v.png
ISO dates, done right:
Sorted order:
2023-03-11_traj.dcd
2023-03-14_traj.dcd
2023-04-01_traj.dcd
2023-04-11_traj.dcd
Whoops!
Sorted order:
Apr11_traj.dcd
Apr1_traj.dcd
Mar11_traj.dcd
Mar14_traj.dcd
...
Sorted order:
traj_1-4.dcd
traj_11-3.dcd
traj_11-4.dcd
traj_14-3.dcd
...
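The sorted orders above can be reproduced in a few lines. This sketch builds the three naming styles from the same four dates and sorts each as plain text, which is roughly what a file browser or `ls` does. The variable names are illustrative only.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

days = [date(2023, 4, 11), date(2023, 3, 14), date(2023, 4, 1), date(2023, 3, 11)]

# Three ways of writing the same dates into file names.
iso_names = [f"{d.isoformat()}_traj.dcd" for d in days]
month_names = [f"{MONTHS[d.month - 1]}{d.day}_traj.dcd" for d in days]
day_month_names = [f"traj_{d.day}-{d.month}.dcd" for d in days]

iso_sorted = sorted(iso_names)
month_sorted = sorted(month_names)
day_month_sorted = sorted(day_month_names)

print(iso_sorted)        # chronological
print(month_sorted)      # April lands before March
print(day_month_sorted)  # sorted by day of month, not by date
```

Only the ISO 8601 names sort chronologically, because their most significant part (the year) comes first and every field has a fixed width.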
Padding numbers, done right:
Sorted order:
01_read-trajectories.py
02_calc-msd.py
...
10_final-figs.py
Whoops!
Sorted order:
10_final-figs.py
...
1_read-trajectories.py
2_calc-msd.py
...
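If a folder already has unpadded prefixes, the fix can be scripted. This is a small sketch with a hypothetical helper, `pad_prefix`, that zero-pads the leading number of a name. It only computes the new names and does not rename anything on disk.

```python
names = ["1_read-trajectories.py", "2_calc-msd.py", "10_final-figs.py"]


def pad_prefix(name, width=2):
    """Zero-pad the number before the first underscore to the given width."""
    number, rest = name.split("_", 1)
    return f"{int(number):0{width}d}_{rest}"


unpadded_order = sorted(names)
padded_order = sorted(pad_prefix(name) for name in names)

print(unpadded_order)  # "10_..." sorts before "1_..."
print(padded_order)    # 01, 02, 10: the intended order
```

Choose the width for the largest number you expect: two digits cover up to 99 scripts, three digits up to 999.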
The “GIN-Tonic” research folder structure standard (Colomb et al. 2021). See YODA or Cookiecutter as alternatives.
Goal: Capture a procedure and make it repeatable in a simple, short step
In future, you might not have to choose …
(P): Python package
Activity | Graphical | Text-based |
---|---|---|
Documents | MS Word | LaTeX, Markdown, Quarto |
Tables | MS Excel | Awk, (P) NumPy, (P) Pandas |
Graphs | MS Excel | (P) Matplotlib, Gnuplot, RStudio |
Presentations | MS PowerPoint | LaTeX Beamer, Quarto |
Integrated Development Environments (IDEs), like Spyder and VSCode, give you the best of all worlds: text-based inputs and graphical outputs (and good practice for your decaying mouse-pointer skills).
Plotting with Python in the Spyder IDE. Python lets me generate multiple plots in a single for loop; you can (just about) see their consistent styling (far right vertical region).
Editing this presentation with Quarto in the VSCode IDE. No more worrying about getting that picture exactly centered, or clicking through multiple directories to find it!
After automating individual steps, you can fit them together into an automated workflow!
Tools to try: (Python) Snakemake, FireWorks, Signac, make
(TTW tutorial)
Signac example:
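project.py
# Assumed setup, not shown on the slide: signac-flow's FlowProject base class
from flow import FlowProject

class Project(FlowProject):
    pass

lmp = "$HOME/.local/bin/lmp_mpi"

# Eligible once start.data exists; complete once restart.file.1 exists
@Project.pre.isfile('start.data')
@Project.post.isfile('restart.file.1')
@Project.operation(cmd=True)
def first_run(job):
    # Parentheses join the two string pieces into one shell command
    return (f'{lmp} -in lammps.input'
            ' -var if_restart 0 ... ')

# Eligible once restart.file.1 exists; complete once final.data exists
@Project.pre.isfile('restart.file.1')
@Project.post.isfile('final.data')
@Project.operation(cmd=True)
def restart_run(job):
    return (f'{lmp} -in lammps.input'
            ' -var if_restart 1 ... ')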
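The idea behind the pre- and post-conditions can be sketched without any workflow library: a step is eligible when its input file exists and its output file does not. The toy runner below is not how Signac is implemented. The `STEPS` table and the `eligible` and `run_all` functions are hypothetical, and each "simulation" is replaced by writing a placeholder file.

```python
import tempfile
from pathlib import Path

# (step name, file that must exist first, file the step produces)
STEPS = [
    ("first_run", "start.data", "restart.file.1"),
    ("restart_run", "restart.file.1", "final.data"),
]


def eligible(workdir):
    """Steps whose precondition file exists and whose postcondition file does not."""
    return [(name, post) for name, pre, post in STEPS
            if (workdir / pre).is_file() and not (workdir / post).is_file()]


def run_all(workdir):
    """Run eligible steps until none remain; return the names in execution order."""
    executed = []
    while True:
        ready = eligible(workdir)
        if not ready:
            return executed
        for name, post in ready:
            # Stand-in for launching the real simulation command.
            (workdir / post).write_text(f"written by {name}\n")
            executed.append(name)


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "start.data").write_text("initial configuration\n")
    first_pass = run_all(workdir)   # both steps run, in dependency order
    second_pass = run_all(workdir)  # nothing left to do

print(first_pass)   # ['first_run', 'restart_run']
print(second_pass)  # []
```

Because completed steps are detected from their output files, re-running the workflow after an interruption picks up where it left off.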
Learn to write readable code, and even better, code that is easily reusable in future!
Change things quickly, knowing you can always retrieve past versions! This is also an important prerequisite for continuous integration.
For files in your working directory that you care about, you add them to Git’s staging area. To save a version of your staging area, you commit it to your local repository, which you can push to a remote repository (like GitHub) for backup and sharing. You can clone a remote repository to your local repository and pull updated changes from the remote. You can then merge changes into your working directory, check out other commits to try out other versions, and revert or reset to try fixing trickier issues.
Image from Cosima Meyer’s blog post, which is also a great short guide.
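The add, commit, and retrieve cycle can be tried safely in a throwaway folder. The sketch below drives the `git` command line from Python, assuming `git` is installed (it returns `None` otherwise). The helper names `git` and `demo`, the file `notes.txt`, and the placeholder identity passed with `-c` are all illustrative choices.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command in cwd and return its standard output."""
    result = subprocess.run(
        ["git",
         "-c", "user.name=Example User",          # placeholder identity for the demo
         "-c", "user.email=example@example.org",
         "-c", "commit.gpgsign=false",
         *args],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def demo():
    """Commit two versions of a file, then retrieve the first one."""
    if shutil.which("git") is None:
        return None  # git is not installed; nothing to demonstrate
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git("init", cwd=repo)
        notes = repo / "notes.txt"

        notes.write_text("first version\n")
        git("add", "notes.txt", cwd=repo)            # working directory -> staging area
        git("commit", "-m", "Add notes", cwd=repo)   # staging area -> local repository

        notes.write_text("second version\n")
        git("add", "notes.txt", cwd=repo)
        git("commit", "-m", "Update notes", cwd=repo)

        old_text = git("show", "HEAD~1:notes.txt", cwd=repo)  # file as of the previous commit
        n_commits = int(git("rev-list", "--count", "HEAD", cwd=repo))
        return old_text, n_commits


outcome = demo()
print(outcome)
```

Pushing to a remote is left out here because it needs an account and a network connection, but it follows the same pattern with `git push`.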
Sharing your computational environment:
python -m pip freeze > requirements.txt
or conda env export > environment.yml
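The same information can also be captured from inside a running script, which is handy for stamping results with the environment that produced them. This is a minimal sketch using the standard library's `importlib.metadata`. The function name `environment_snapshot` is hypothetical, and the output resembles, but is not guaranteed to be identical to, what `pip freeze` writes.

```python
import platform
from importlib import metadata


def environment_snapshot():
    """Return 'name==version' lines for every installed distribution, sorted by name."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages.add(f"{name}=={dist.version}")
    return sorted(packages, key=str.lower)


header = f"# Python {platform.python_version()} on {platform.system()}"
lines = environment_snapshot()

print(header)
print("\n".join(lines[:5]))  # first few entries only
```

Writing `header` and `lines` next to your results gives a record of which package versions were actually in use when the results were generated.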
nbdev.
Hosting your data online:
Licensing your work: see last section of Dan Katz’s slides (Katz 2023).
What one thing would you like to practice in the coming month?
Navigate to https://hackmd.io/@srtee/ctcms-2023-repro/edit
…
DOI: 10.5281/zenodo.7725483