Fixing a reading light

One of our cats (Haru) chews small wires, and recently chewed through the USB Type A to barrel connector cord for my LED reading light. No biggie, I thought, I'll just buy a replacement for it, and move on. But wait, an entirely new reading light is just as cheap! I'll buy that so I'll have two, and I won't have to worry about buying the wrong connector.

When the new reading light arrived, I found out it now used a micro USB connector instead of the barrel, and more importantly, wouldn't run off of the battery. Some disassembly, and the reason became pretty obvious. The battery was slightly bulgy, had almost no resistance, and had zero voltage. All signs of a battery which shorted out at some point very early in its lifetime.

Luckily, I had a working battery from the older light... so why not swap? Some dodgy soldering work later, and voila! One working light with more universal connectors and some extra parts. My rudimentary soldering and electronic troubleshooting skills keep coming in handy.

Core Transcriptome of Mammalian Placentas

Our paper describing the components of the placenta transcriptome which are conserved among all placental mammals just came out today in Placenta. More important than the results and the text of the paper, though, is the fact that all of the code and results of this paper, from the very first work I did two years ago to its publication today, are present in git and (in theory) reproducible.

You can see where our paper was rejected from Genome Biology and Genes and Development and radically refocused before submission to Placenta. But more importantly, you can know where every single result which is mentioned in the paper came from, the precise code to generate it, and how we came to the final paper which was published. (And you've also got all of the hooks to branch off from our analysis to do your own analysis based on our data!)

This is what open, reproducible science should look like.

Shrinking lists of gene names in R

I've been trying to finish a paper where I compare gene expression in 14 different placentas. One of the supplemental figures compares median expression in gene trees across all 14 species, but because tree ids like ENSGT00840000129673 aren't very expressive, and names like "COL11A2, COL5A3, COL4A1, COL1A1, COL2A1, COL1A2, COL4A6, COL4A5, COL7A1, COL27A1, COL11A1, COL4A4, COL4A3, COL3A1, COL4A2, COL5A2, COL5A1, COL24A1" take up too much space, I wanted a function which could collapse the gene names into something which uses bash glob syntax to more succinctly list the gene names, like: COL{11A{1,2},1A{1,2},24A1,27A1,2A1,3A1,4A{1,2,3,4,5,6},5A{1,2,3},7A1}.
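
Since the collapsed form is ordinary brace syntax, bash can expand it back into the full list, which makes for a handy sanity check. The sketch below does that round trip for the collagen example; the variable names are just for illustration, and bash is invoked explicitly because brace expansion is not part of plain POSIX sh.

```shell
#!/bin/sh
# The collapsed name and the original list, both copied from the text above.
collapsed='COL{11A{1,2},1A{1,2},24A1,27A1,2A1,3A1,4A{1,2,3,4,5,6},5A{1,2,3},7A1}'
original='COL11A2 COL5A3 COL4A1 COL1A1 COL2A1 COL1A2 COL4A6 COL4A5 COL7A1
COL27A1 COL11A1 COL4A4 COL4A3 COL3A1 COL4A2 COL5A2 COL5A1 COL24A1'

# Let bash expand the braces, one gene name per line.
expanded=$(bash -c "printf '%s\n' $collapsed")

# Compare the two lists after sorting, since the order differs.
a=$(printf '%s\n' $original | sort)
b=$(printf '%s\n' "$expanded" | sort)
if [ "$a" = "$b" ]; then
    echo "round trip OK"
else
    echo "lists differ"
fi
```

If the two lists match, the collapsed name loses no information; it is only a more compact spelling of the same 18 genes.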

Thus, a crazy function which uses lcprefix from Biostrings and some looping was born:

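collapse.gene.names <- function(x,min.collapse=2) {
    ## longest common substring
    ## Lines marked "assumed" were missing from this listing and are a
    ## best-guess completion, not necessarily the original code.
    if (is.null(x) || length(x)==0) {
        return(as.character(NA))   # assumed
    }
    x <- sort(unique(x))
    str_collapse <- function(y,len) {
        if (len == 1 || length(y) < 2) {
            return(y)   # assumed
        }
        y.tree <-
            substr(y[1],1,len)   # assumed: the shared prefix
        y.rem <-
            substring(y,len+1)   # assumed: what follows the prefix
        y.rem.prefix <-
            sum(combn(y.rem,2,function(x){Biostrings::lcprefix(x[1],x[2])}) >= 2)
        if (length(y.rem) > 3 &&
            y.rem.prefix >= 2
            ) {
            y.rem <-
                collapse.gene.names(y.rem,min.collapse=1)   # assumed: recurse on the remainders
        }
        paste0(y.tree,"{",paste(y.rem,collapse=","),"}")   # assumed
    }
    i <- 1
    ret <- NULL
    while (i <= length(x)) {
        col.pmin <-
            sapply(x,Biostrings::lcprefix,x[i])   # assumed: prefix length shared with x[i]
        collapseable <-
            which(col.pmin > min.collapse)
        if (length(collapseable) == 0) {
            ret <- c(ret,x[i])
            i <- i+1
        } else {
            ret <- c(ret,
                     str_collapse(x[collapseable],
                                  min(col.pmin[collapseable])))   # assumed
            i <- max(collapseable)+1
        }
    }
    ret   # assumed
}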

H3ABioNet Hackathon (Workflows)

I'm in Pretoria, South Africa at the H3ABioNet hackathon, which is developing workflows for Illumina chip genotyping, imputation, 16S rRNA sequencing, and population structure/association testing. Currently, I'm working with the imputation stream, and we're using Nextflow to deploy an IMPUTE-based imputation workflow with Docker and NCSA's OpenStack-based cloud (Nebula) underneath.

The OpenStack command line clients (nova and cinder) seem pretty usable for automating the bring-up of a fleet of VMs, and the cloud-init package which is present in the images makes configuring the images pretty simple.

Now if I just knew of a better shared object store which was supported by Nextflow in OpenStack besides mounting an NFS share, things would be better.

You can follow our progress in our git repo.

Bioinformatic Supercomputer Wishlist

Many bioinformatic problems require large amounts of memory and processor time to complete. For example, running WGCNA across 10⁶ CpG sites requires 10⁶ choose 2, or roughly 5×10¹¹, comparisons, and the resulting matrix needs on the order of 10 TB to store. While the problem is embarrassingly parallel, the dataset upon which the regressions are calculated is very large and cannot fit into the main memory of most existing supercomputers, which are often tuned for small-data, fast-interconnect problems.

Another problem which I am interested in is computing ancestral trees from whole human genomes. This involves running maximum likelihood calculations across 10⁹ bases and thousands of samples. The matrix itself could potentially take 1 TB, and calculating the likelihood across that many positions is computationally expensive. Furthermore, an exhaustive search of trees for 2000 individuals requires 2000!! comparisons, or roughly 10²⁸⁶⁸; even searching a small fraction of that subspace requires lots of computational time.
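
For anyone who wants to check these magnitudes, here is a quick back-of-the-envelope sketch. It assumes 8-byte (double precision) matrix entries and 1 TB = 10¹² bytes, neither of which is stated above, and it treats 2000!! literally as the double factorial 2 × 4 × ... × 2000.

```shell
#!/bin/sh
# Pairwise comparisons among n = 10^6 CpG sites: n choose 2 = n(n-1)/2.
n=1000000
pairs=$(( n * (n - 1) / 2 ))
echo "pairs: $pairs"

# Storage for the full n x n matrix, assuming 8 bytes per entry,
# expressed in TB (10^12 bytes).
full_tb=$(( n * n * 8 / 1000000000000 ))
echo "full matrix: $full_tb TB"

# log10 of the double factorial 2000!! = 2 * 4 * ... * 2000,
# summed as logarithms because the number itself is far too large to hold.
log10_df=$(awk 'BEGIN {
    s = 0
    for (k = 2; k <= 2000; k += 2) s += log(k) / log(10)
    printf "%.1f", s
}')
echo "log10(2000!!): $log10_df"
```

The pair count comes out at 499999500000 and the full matrix at 8 TB, so "on the order of 10 TB" is the right ballpark; storing only one triangle of the matrix would halve that.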

Some things that a future supercomputer could have that would enable better solutions to bioinformatic problems include:

  1. Fast local storage.
  2. Better hierarchical storage with smarter caching. Data should ideally move easily between local memory, shared memory, local storage, and remote storage.
  3. Fault-tolerant, storage affinity aware schedulers.
  4. GPUs and/or other coprocessors with larger memory and faster memory interconnects.
  5. Larger memory (at least on some nodes).
  6. Support for docker (or similar) images.
  7. Better bioinformatics software which can actually take advantage of advances in computer architecture.

Essential Data Science: Git

Having a new student join me to work in the lab reminded me that I should collect some of the many resources around for getting started in bioinformatics and any data-based science in general. So towards this end, one of the first essential tools for any data scientist is a knowledge of git.

Start first with Code School's simple introduction to git which gives you the basics of using git from the command line.

Then, check out the set of lectures on Git and GitHub which goes into setting up git and using it with GitHub. This is a set of lectures which was used in a Data Science course.

Finally, I'd check out the set of resources on GitHub for even more information, and then learn to love the git manpages.

Introducing dqsub

I've been using qsub for a while now on the cluster here at the IGB at UofI. qsub is a command line program which is used to submit jobs to a scheduler to eventually be run on one (or more) nodes of a cluster.

Unfortunately, qsub's interface is horrible. It requires that you write a shell script for every single little thing you run, and doesn't do simple things like providing defaults or running multiple jobs at once with slightly different arguments. I've dealt with this for a while using some rudimentary shell scripting, but I finally had enough.

So instead, I wrote a wrapper around qsub called dqsub.

What used to require a complicated invocation like:

echo -e '#!/bin/bash\nmake foo'| \
 qsub -q default -S /bin/bash -d $(pwd) \
  -l mem=8G,nodes=1:ppn=4 -;

can now be run with

dqsub --mem 8G --ppn 4 make foo;

Want to run some command in every single directory which starts with SRX? That's easy:

ls -1d SRX*|dqsub --mem 8G --ppn 4 --array chdir make bar;

Want instead to behave like xargs but do the same thing?

ls -1d SRX*|dqsub --mem 8G --ppn 4 --array xargs make bar -C;

Now, this wrapper isn't complete yet, but it's already more than enough to do what I require, and has saved me quite a bit of time already.
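
To give a feel for what a wrapper like this has to do, here is a minimal sketch. It is not dqsub's actual implementation: `dqsub_sketch` is a hypothetical function, the defaults are made up, and it only prints the qsub command line and the generated job script (a dry run) rather than submitting anything. The qsub options are the ones from the long invocation above.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: turn "--mem X --ppn N command..." into the
# qsub command line and the one-off job script that would be piped to it.
dqsub_sketch() {
    mem=8G   # assumed default, not necessarily dqsub's
    ppn=1    # assumed default, not necessarily dqsub's
    while [ $# -gt 0 ]; do
        case "$1" in
            --mem) mem=${2:?--mem needs a value}; shift 2 ;;
            --ppn) ppn=${2:?--ppn needs a value}; shift 2 ;;
            *) break ;;
        esac
    done
    # The command that would receive the script on standard input.
    printf 'qsub -q default -S /bin/bash -d %s -l mem=%s,nodes=1:ppn=%s -\n' \
        "$(pwd)" "$mem" "$ppn"
    # The generated job script: a shebang plus the remaining arguments.
    printf '#!/bin/bash\n%s\n' "$*"
}

dqsub_sketch --mem 8G --ppn 4 make foo
```

The point is that everything which used to be typed by hand (queue, shell, working directory, resource list, throwaway script) can be filled in from a couple of options and sensible defaults.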

You can steal dqsub for yourself

Feel free to request specific features, too.

Adding a Table of Contents to PDFs from R

I routinely generate very large PDFs from R which have hundreds (or thousands) of pages, and navigating these pages can be very difficult. Unfortunately, neither R's pdf() nor its cairo_pdf() drivers support creating a Table of Contents (or Index) while plots are being written out. In the case of cairo, the underlying library doesn't support it either, so this isn't something that can easily be added to R directly. For months I had been thinking about sitting down and writing the support into cairo and R's cairo package... but real life kept getting in the way.

Fast forward to a week ago, when I realized that pdftk does support dumping the table of contents and updating the table of contents using dump_data_utf8 and update_info_utf8! Armed with that knowledge, and a bit of hackery, we can save an index, and then update the pdf once it's been closed.

The R code then looks like the following:

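 ## Lines marked "assumed" were missing from this listing and are a
 ## best-guess completion, not necessarily the original code.
 ..device.set.up <- FALSE
 ..current.page <- 0   # assumed name for the page counter

 save.bookmark <- function(text,bookmarks=list(),level=1,page=NULL) {
     if (!..device.set.up) {
         Cairo.onSave(device = dev.cur(),
                      onSave=function(device,page) {   # assumed callback
                          ..current.page <<- page
                      })
         ..device.set.up <<- TRUE
     }
     if (missing(page)|| is.null(page)) {
         page <- ..current.page+1   # assumed: the page being drawn follows the last saved one
     }
     bookmarks[[length(bookmarks)+1]] <-
         list(text=text,level=level,page=page)   # assumed fields, matching write.bookmarks
     bookmarks
 }

 write.bookmarks <- function(pdf.file,bookmarks=list()) {
     pdf.bookmarks <- ""
     for (bookmark in seq_along(bookmarks)) {
         pdf.bookmarks <-
             paste0(pdf.bookmarks,
                    "BookmarkBegin\n",
                    "BookmarkTitle: ",bookmarks[[bookmark]]$text,"\n",
                    "BookmarkLevel: ",bookmarks[[bookmark]]$level,"\n",
                    "BookmarkPageNumber: ",bookmarks[[bookmark]]$page,"\n")
     }
     temp.pdf <- tempfile(pattern=basename(pdf.file))
     temp.pdf.info <- tempfile(pattern=paste0(basename(pdf.file),"info_utf8"))   # assumed name
     ## assumed: write the records out and let pdftk merge them into a copy
     cat(file=temp.pdf.info,pdf.bookmarks)
     system2("pdftk",c(pdf.file,"update_info_utf8",temp.pdf.info,"output",temp.pdf))
     if (file.exists(temp.pdf)) {
         file.copy(temp.pdf,pdf.file,overwrite=TRUE)   # assumed: replace the original
     } else {
         warning("unable to properly create bookmarks")
     }
 }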

and can be used like so:

 bookmarks <- list()
 bookmarks <- save.bookmark("First plot",bookmarks)
 bookmarks <- save.bookmark("Second plot",bookmarks)

et voila. Bookmarks and a table of contents for PDFs.

This basic methodology can be extended to any language which writes PDFs and does not have a built-in method for generating a Table of Contents. Currently, the usage of Cairo.onSave is a horrible hack, and may conflict with anything else which uses the onSave hook, but hopefully R will report the current page number from Cairo in the future.
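
As a rough illustration of how little is needed outside of R, here is the same idea as a shell sketch. `bookmark_record` is a hypothetical helper, and plots.pdf and plots_toc.pdf are placeholder file names; the pdftk step is skipped unless pdftk and the input file are actually present.

```shell
#!/bin/sh
# Emit one bookmark record in the text format that pdftk's
# dump_data_utf8 writes and update_info_utf8 reads.
bookmark_record() {
    printf 'BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: %s\nBookmarkPageNumber: %s\n' \
        "$1" "$2" "$3"
}

# Collect the records for two plots on pages 1 and 2.
info=$(mktemp)
{
    bookmark_record "First plot" 1 1
    bookmark_record "Second plot" 1 2
} > "$info"
cat "$info"

# Apply the bookmarks to a copy of the PDF (placeholder file names).
if command -v pdftk >/dev/null 2>&1 && [ -f plots.pdf ]; then
    pdftk plots.pdf update_info_utf8 "$info" output plots_toc.pdf
fi
rm -f "$info"
```

Whatever produces the PDF only has to remember a title, a nesting level, and a page number for each bookmark; pdftk does the rest once the file has been closed.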

Adding a newcomer (⎈) tag to the BTS

Some of you may already be aware of the gift tag which has been used for a while to indicate bugs which are suitable for new contributors to use as an entry point to working on specific packages. Unfortunately, some of us (including me!) were unaware that this tag even existed.

Luckily, Lucas Nussbaum clued me in to the existence of this tag. After a brief bike-shed-naming thread and some voting using pocket_devotee, we decided to name the new tag newcomer. I have now added this tag to the BTS documentation, and tagged all of the bugs which were user tagged "gift" with this tag.

If you have bugs in your package which you think are ideal for new contributors to Debian (or your package) to fix, please tag them newcomer. If you're getting started in Debian, and working on bugs to fix, please search for the newcomer tag, grab the helm, and contribute to Debian.
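
If you haven't tagged a bug before, the usual route is a short message to the BTS control server. The sketch below only composes that message; `newcomer_control` is a hypothetical helper and 123456 is a placeholder bug number, not a real report.

```shell
#!/bin/sh
# Compose a BTS control message that adds the newcomer tag to one bug.
newcomer_control() {
    printf 'tags %s + newcomer\nthanks\n' "$1"
}

# Placeholder bug number; the output is the body of a mail
# addressed to control@bugs.debian.org.
newcomer_control 123456
```

The "thanks" line tells the control server to stop processing, so a mail signature after it is ignored.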

Virginia King selected for Debbugs FOSS Outreach Program for Women

I'm glad to announce that Virginia King has been selected as one of the three interns for this round of the FOSS Outreach Program for Women. Starting December 9th, and continuing until March 9th, she'll be working on improving the documentation of Debian's bug tracking system.

The initial goal is to develop a Bug Triager Howto to help new contributors to Debian jump in and help existing teams triage bugs. We'll be getting in touch with some of the larger teams in Debian to help make this document as useful as possible. If you're a member of a team in Debian who would like this howto to address your specific workflow, please drop me an e-mail, and we'll keep you in the loop.

The secondary goals for this project are to:

  • Improve documentation under
  • Document bug tags and categories
  • Improve upstream debbugs documentation
