Today marks the release of Bioconductor 3.7 (official announcement). Congratulations to the hundreds of developers who have collectively contributed more than 1500 software packages, not to mention the annotation, experiment data, and workflow packages.
A huge thank you to the core team for the incredible work they do, especially around release time!
Much of my recent development work on Bioconductor has been leveraging and extending the DelayedArray framework developed by Hervé Pagès (Bioconductor Core Team). This framework allows array-like data to be stored on-disk rather than in-memory, a huge saving when analysing genome-wide data from tens to hundreds and even thousands of samples.
In my work on bsseq and minfi, I have been responsible for re-engineering the internals of these packages to support the analysis of the ever-larger datasets generated in studies of DNA methylation. This work spurred the development of DelayedMatrixStats, a port of the matrixStats API to work within the DelayedArray framework.
The DelayedArray framework isn’t the first to provide on-disk access to array-like data. From my perspective, the killer feature is that it provides a common array-like API for handling array data no matter what backend is used to store the data; the data may be stored in-memory as a dense or sparse matrix or on-disk in an HDF5 file. As an analogy, you might think, ‘a DelayedArray is to an array as a tibble is to a data.frame’.
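To make the 'common API' point concrete, here is a minimal sketch, assuming the DelayedArray, HDF5Array, and Matrix packages are installed. The toy count matrix and the helper `col_totals()` are made up for illustration and are not part of any package. The same function is applied, unchanged, to a dense in-memory backend, a sparse in-memory backend, and an on-disk HDF5 backend.

```r
library(DelayedArray)
library(HDF5Array)
library(Matrix)

set.seed(1)
# Toy data (made up for this example): a 5 x 4 matrix of counts
dense <- matrix(rpois(20, lambda = 5), nrow = 5)

# Three backends holding the same values
x_dense  <- DelayedArray(dense)                        # in-memory, dense
x_sparse <- DelayedArray(Matrix(dense, sparse = TRUE)) # in-memory, sparse
x_hdf5   <- writeHDF5Array(dense)                      # on-disk, temporary HDF5 file

# Hypothetical helper: written once against the array-like API.
# log2(x + 1) is recorded as a delayed operation; colSums() realises the result.
col_totals <- function(x) colSums(log2(x + 1))

res_dense  <- col_totals(x_dense)
res_sparse <- col_totals(x_sparse)
res_hdf5   <- col_totals(x_hdf5)

# Each backend should agree with the ordinary matrix computation
expected <- colSums(log2(dense + 1))
all.equal(res_dense, expected)
all.equal(res_sparse, expected)
all.equal(res_hdf5, expected)
```

Nothing in `col_totals()` needs to know where the data live, which is the property I find most useful when adapting existing code.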
Over the next few weeks I’m going to write several posts about the DelayedArray framework and how Bioconductor package developers [1] can leverage it to support on-disk datasets. I’m very interested to hear from developers and users to learn what questions they would like to see covered in these posts, so feel free to ping me on Twitter (@PeteHaitch).
Some of the topics I plan to cover include:
- An overview of the DelayedArray framework
- How do I adapt an existing function to support DelayedArray input?
I’ll describe the process of transitioning an existing package to support
the DelayedArray framework, illustrated using the bsseq and minfi
packages. I’ll highlight how the DelayedMatrixStats package can ease this
transition, as well as point out some ‘gotchas’ and tips-n-tricks I learnt
along the way. I’ll also touch on how developers can work with DelayedArray
objects within C++ using the beachmat framework.

[1] Non-Bioconductor packages can of course use the DelayedArray framework. However, its development as a Bioconductor package can be a source of friction for non-Bioconductor packages.