Shrinking lists of gene names in R

I've been trying to finish a paper where I compare gene expression in 14 different placentas. One of the supplemental figures compares median expression in gene trees across all 14 species, but because tree ids like ENSGT00840000129673 aren't very expressive, and names like "COL11A2, COL5A3, COL4A1, COL1A1, COL2A1, COL1A2, COL4A6, COL4A5, COL7A1, COL27A1, COL11A1, COL4A4, COL4A3, COL3A1, COL4A2, COL5A2, COL5A1, COL24A1" take up too much space, I wanted a function which could collapse the gene names into something which uses bash glob syntax to more succinctly list the gene names, like: COL{11A{1,2},1A{1,2},24A1,27A1,2A1,3A1,4A{1,2,3,4,5,6},5A{1,2,3},7A1}.

Thus, a crazy function which uses lcprefix from Biostrings and some looping was born:

collapse.gene.names <- function(x,min.collapse=2) {
    ## longest common substring
    if (is.null(x) || length(x)==0) {
    x <- sort(unique(x))
    str_collapse <- function(y,len) {
        if (len == 1 || length(y) < 2) {
        y.tree <-
        y.rem <-
        y.rem.prefix <-
            sum(combn(y.rem,2,function(x){Biostrings::lcprefix(x[1],x[2])}) >= 2)
        if (length(y.rem) > 3 &&
            y.rem.prefix >= 2
            ) {
            y.rem <- 
    i <- 1
    ret <- NULL
    while (i <= length(x)) {
        col.pmin <-
        collapseable <-
            which(col.pmin > min.collapse)
        if (length(collapseable) == 0) {
            ret <- c(ret,x[i])
            i <- i+1
        } else {
            ret <- c(ret,
            i <- max(collapseable)+1
H3ABioNet Hackathon (Workflows)

I'm in Pretoria, South Africa at the H3ABioNet hackathon which is developing workflows for Illumina chip genotyping, imputation, 16S rRNA sequencing, and population structure/association testing. Currently, I'm working with the imputation stream and we're using Nextflow to deploy an IMPUTE-based imputation workflow with Docker and NCSA's openstack-based cloud (Nebula) underneath.

The OpenStack command line clients (nova and cinder) seem to be pretty usable to automate bringing up a fleet of VMs and the cloud-init package which is present in the images makes configuring the images pretty simple.

Now if I just knew of a better shared object store which was supported by Nextflow in OpenStack besides mounting an NFS share, things would be better.

You can follow our progress in our git repo: []

Bioinformatic Supercomputer Wishlist

Many bioinformatic problems require large amounts of memory and processor time to complete. For example, running WGCNA across 10⁶ CpG sites requires 10⁶ choose 2 or 10¹³ comparisons, which needs 10 TB to store the resulting matrix. While embarrassingly parallel, the dataset upon which the regressions are calculated is very large, and cannot fit into main memory of most existing supercomputers, which are often tuned for small-data fast-interconnect problems.

Another problem which I am interested in is computing ancestral trees from whole human genomes. This involves running maximum likelihood calculations across 10⁹ bases and thousands of samples. The matrix itself could potentially take 1 TB, and calculating the likelihood across that many positions is computationally expensive. Furthermore, an exhaustive search of trees for 2000 individuals requires 2000!! comparisons, or 10²⁸⁶⁸; even searching a small fraction of that subspace requires lots of computational time.

Some things that a future supercomputer could have that would enable better solutions to bioinformatic problems include:

  1. Fast local storage
  2. Better hierarchical storage with smarter caching. Data should ideally move easily between local memory, shared memory, local storage, and remote storage.
  3. Fault-tolerant, storage affinity aware schedulers.
  4. GPUs and/or other coprocessors with larger memory and faster memory interconnects.
  5. Larger memory (at least on some nodes)
  6. Support for docker (or similar) images.
  7. Better bioinformatics software which can actually take advantage of advances in computer architecture.
Essential Data Science: Git

Having a new student join me to work in the lab reminded me that I should collect some of the many resources around for getting started in bioinformatics and any data-based science in general. So towards this end, one of the first essential tools for any data scientist is a knowledge of git.

Start first with Code School's simple introduction to git which gives you the basics of using git from the command line.

Then, check out set of lectures on Git and GitHub which goes into setting up git and using it with github. This is a set of lectures which was used in a Data Science course.

Finally, I'd check out the set of resources on github for even more information, and then learn to love the git manpages.

Introducing dqsub

I've been using qsub for a while now on the cluster here at the IGB at UofI. qsub is a command line program which is used to submit jobs to a scheduler to eventually be run on one (or more) nodes of a cluster.

Unfortunately, qsub's interface is horrible. It requires that you write a shell script for every single little thing you run, and doesn't do simple things like providing defaults or running multiple jobs at once with slightly different arguments. I've dealt with this for a while using some rudimentary shell scripting, but I finally had enough.

So instead, I wrote a wrapper around qsub called dqsub.

What used to require a complicated invocation like:

echo -e '#!/bin/bash\nmake foo'| \
 qsub -q default -S /bin/bash -d $(pwd) \
  -l mem=8G,nodes=1:ppn=4 -;

can now be run with

dqsub --mem 8G --ppn 4 make foo;

Want to run some command in every single directory which starts with SRX? That's easy:

ls -1 SRX*|dqsub --mem 8G --ppn 4 --array chdir make bar;

Want instead to behave like xargs but do the same thing?

ls -1 SRX*|dqsub --mem 8G --ppn 4 --array xargs make bar -C;

Now, this wrapper isn't complete yet, but it's already more than enough to do what I require, and has saved me quite a bit of time already.

You can steal dqsub for yourself

Feel free to request specific features, too.

Adding a Table of Contents to PDFs from R

I routinely generate very large PDFs from R which have hundreds (or thousands) of pages, and navigating these pages can be very difficult. Unfortunately, neither R's pdf() nor its cairopdf() drivers support creating Table of Contents (or Index) while plots are being written out. In the case of cairo, the underlying library doesn't support it either, so this isn't something that can easily be added to R directly. I had been thinking about sitting down for months and writing the support into cairo and R's cairo package... but real life kept getting in the way.

Fast forward to a week ago, when I realized that pdftk does support dumping the table of contents and updating the table of contents using dump_data_utf8 and update_info_utf8! Armed with that knowledge, and a bit of hackery, we can save an index, and then update the pdf once it's been closed.

The R code then looks like the following:

 ..device.set.up <- FALSE <<- 0

 save.bookmark <- function(text,bookmarks=list(),level=1,page=NULL) {
     if (!..device.set.up) {
         Cairo.onSave(device = dev.cur(),
                 <<- page
         ..device.set.up <<- TRUE
     if (missing(page)|| is.null(page)) {
         page <-
     bookmarks[[length(bookmarks)+1]] <-

 write.bookmarks <- function(pdf.file,bookmarks=list()) {
     pdf.bookmarks <- ""
     for (bookmark in 1:length(bookmarks)) {
         pdf.bookmarks <-
                    "BookmarkTitle: ",bookmarks[[bookmark]]$text,"\n",
                    "BookmarkLevel: ",bookmarks[[bookmark]]$level,"\n",
                    "BookmarkPageNumber: ",bookmarks[[bookmark]]$page,"\n")
     temp.pdf <- tempfile(pattern=basename(pdf.file)) <- tempfile(pattern=paste0(basename(pdf.file),"info_utf8"))
     if (file.exists(temp.pdf)) {
     } else {
         warning("unable to properly create bookmarks")

and can be used like so:

 bookmarks <- list()
 bookmarks <- save.bookmark("First plot",bookmarks)
 bookmarks <- save.bookmark("Second plot",bookmarks)

et voila. Bookmarks and a table of contents for PDFs.

This basic methodology can be extended to any language which writes PDFs and does not have a built-in method for generating a Table of Contents. Currently, the usage of Cairo.onSave is a horrible hack, and may conflict with anything else which uses the onSave hook, but hopefully R will report the current page number from Cairo in the future.

Adding a newcomer (⎈) tag to the BTS

Some of you may already be aware of the gift tag which has been used for a while to indicate bugs which are suitable for new contributors to use as an entry point to working on specific packages. Unfortunately, some of us (including me!) were unaware that this tag even existed.

Luckily, Lucas Nussbaum clued me in to the existence of this tag, and after a brief bike-shed-naming thread, and some voting using pocket_devotee we decided to name the new tag newcomer, and I have now added this tag to the BTS documentation, and tagged all of the bugs which were user tagged "gift" with this tag.

If you have bugs in your package which you think are ideal for new contributors to Debian (or your package) to fix, please tag them newcomer. If you're getting started in Debian, and working on bugs to fix, please search for the newcomer tag, grab the helm, and contribute to Debian.

Virginia King selected for Debbugs FOSS Outreach Program for Women

I'm glad to announce that Virginia King has been selected as one of the three interns for this round of the FOSS Outreach Program for women. Starting December 9th, and continuing until March 9th, she'll be working on improving the documentation of Debian's bug tracking system.

The initial goal is to develop a Bug Triager Howto to help new contributors to Debian jump in and help existing teams triage bugs. We'll be getting in touch with some of the larger teams in Debian to help make this document as useful as possible. If you're a member of a team in Debian who would like this howto to address your specific workflow, please drop me an e-mail, and we'll keep you in the loop.

The secondary goals for this project are to:

  • Improve documentation under
  • Document of bug-tags and categories
  • Improve upstream debbugs documentation
ErgoDox keyboard assembly

I routinely use a Kinesis Advantage Pro keyboard, which is a split, ergonomic keyboard with thumb clusters that uses brown cherryMX switches. Over the thirteen years that I've been using it, I've become a huge fan of this style of keyboard. However, I have two major annoyances with the Kinesis. First, while the firmware is good, remapping the keys is complicated and producing more complicated keyboard layouts with layers and keycodes that are not present in the original layout is not possible. Secondly, the interconnect between the main key wells and the controller board in the middle occasionally fails, and requires disassembly and occasional re-tinning of the circuit board interconnect connector.

1 About a year ago, I became aware of the ErgoDox keyboard, which is a keyboard design which mimics the kinesis to some degree, but with completely separated key halves (useful, because I'm substantially bigger than the average human), programmable firmware (so I can finally have the layers and missing keys) and with slightly more elegant interconnects (TRRS cables). Unfortunately, at the time I first heard about it (and other custom keyboards), making it required sourcing circuit boards, parts, and finding someone to cut a case for the keyboard. Then, a few months ago, I learned about MassDrop, a company who puts together groups of people to do buys of products at near-wholesale level prices, and their offer of all of the parts to build an ErgoDox. After waiting for a group buy of the keyboard to become available, I put in an order, and received the parts two months later.

Over a few hours yesterday, I learned how to do surface mount soldering of the 78 diodes (one for each key), and finished assembling and flashing the firmware. This morning, I fixed up the few key bindings that I needed to be productive, and viola, my laptop at home now has a brand new ergonomic keyboard.

Dropbox Recursive Downloader

I'm working on some analyses for the Genetic Analysis Workshop #19, which has placed it's data on Dropbox. Unfortunately, Dropbox doesn't allow for people to download zip archives larger than 1GB, and the data was made available in an unpacked structure with more than a hundred files. Some searching indicated that no one had written a recursive downloader for Dropbox, so 30 minutes of hacking with WWW::Mechanize later, I wrote a simple recursive downloader for Dropbox.

Two hours later, all of the files had downloaded.

This blog is powered by ikiwiki.