I've been working for a while on a reasonably large Genome-Wide Association Study dataset which has led me through various interesting parts of handling large datasets in R. This dataset is approximately 320,000 rows by 5000 columns. After getting Rmpi working, and handling the dataset by row so I don't run out of memory, I've managed to get pretty decent performance. However, one small section of the code seemed to be taking forever to run.
It turns out that assigning data to a data.frame by row is incredibly slow in R. Thus, a section of my code which should have taken microseconds was taking tenths of seconds, and threatening to run all week. Using a matrix instead (which is basically what I want anyway) and converting to a data.frame at the very end makes the code multiple orders of magnitude faster.
Moral of the story? Don't use data.frame unnecessarily.
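Here's a minimal sketch of the difference, not my actual analysis code. The sizes, the function names (`fill_data_frame`, `fill_matrix`), and the values being assigned are all made up for illustration; the point is only that both approaches produce the same result while the matrix version avoids row-wise assignment into a data.frame.

```r
# Hypothetical sizes for illustration; the real dataset is far larger.
n_rows <- 200
n_cols <- 5

# Slow: assign into a data.frame one row at a time.
fill_data_frame <- function(n_rows, n_cols) {
  df <- as.data.frame(matrix(NA_real_, nrow = n_rows, ncol = n_cols))
  for (i in seq_len(n_rows)) {
    df[i, ] <- i * seq_len(n_cols)
  }
  df
}

# Faster: fill a preallocated matrix by row, convert once at the very end.
fill_matrix <- function(n_rows, n_cols) {
  m <- matrix(NA_real_, nrow = n_rows, ncol = n_cols)
  for (i in seq_len(n_rows)) {
    m[i, ] <- i * seq_len(n_cols)
  }
  as.data.frame(m)
}

slow <- fill_data_frame(n_rows, n_cols)
fast <- fill_matrix(n_rows, n_cols)

# Both approaches hold the same values.
stopifnot(isTRUE(all.equal(slow, fast)))

# Rough timing comparison; the actual numbers depend on the machine.
print(system.time(fill_data_frame(n_rows, n_cols)))
print(system.time(fill_matrix(n_rows, n_cols)))
```

The gap between the two timings grows with the number of rows, so bump `n_rows` up if you want to see it more clearly.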
I use R a lot. It's one of the primary tools I use in my day job as a scientist analyzing large datasets. If you use LaTeX with R (as I often do), you probably use Sweave to interleave R output and figures with your text describing those figures using the noweb method of literate programming.
Sweavealike is a plugin for IkiWiki that tries to do some of the useful things for IkiWiki that Sweave does for R and LaTeX.
You use it like the following:
[[!sweavealike echo=1 code="""
a <- 1
a <- a + 10
print(a)
"""]]
which produces this result when run:
> a <- 1
> a <- a + 10
> print(a)
[1] 11
You can also generate figures with it:
[[!sweavealike fig=1 echo=1 results="hide" code="""
plot(1:10,(1:10)^2,xlab="x",ylab=expression(x^2),main="Example Figure")
"""]]
> plot(1:10,(1:10)^2,xlab="x",ylab=expression(x^2),main="Example Figure")

The plugin itself uses the neat Statistics::R Perl module to handle all of the heavy lifting. I personally plan on using this plugin to help write some more entries in my learning R series of posts that I'm beginning to work on. Hopefully I'll find and fix most of the bugs as I embark on that process so anyone else who uses the plugin won't have to, but feel free to e-mail me if something isn't working as it should.
Finally, you shouldn't run this plugin on a publicly editable IkiWiki instance, because that would be a trivial local user exploit as R can run arbitrary code, read and write to arbitrary files, exhaust all memory, etc.
I've been working for a while on analyzing a fairly large dataset for my Lupus genetics project. One of the major annoyances with analyzing large datasets is not knowing when a particular part of the analysis is going to finish, and whether I should go back and rewrite part of the code to be faster, or just wait for it to finish. In R, I've been using txtProgressBar to handle this, but I hadn't bothered to find a similar module for Perl until now.
Luckily, Term::ProgressBar exists, and is pretty easy to use:
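use Fcntl qw(SEEK_SET SEEK_END);
use Term::ProgressBar;

# $sfh is assumed to be an already-open, seekable IO::File handle
my $pos = $sfh->tell();
# seek to the end to learn the file size, which becomes the bar's total
$sfh->seek(0,SEEK_END);
my $p = Term::ProgressBar->new({count  => $sfh->tell,
                                remove => 1,
                                ETA    => 'linear'});
# return to where we started before reading
$sfh->seek($pos,SEEK_SET);
while (<$sfh>) {
    ...; # yada yada yada
    # report how many bytes have been read so far
    $p->update($sfh->tell());
}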
producing useful output, which told me that my SQLite database creation routine would take about 2 days to finish instead of the 7 years that the slightly less optimal version wanted.
One of the earliest features I wrote for the Debian bug tracking
system (Debbugs) after joining the team was support for forcibly
merging bugs. Originally, merging two bugs required that the bugs be
in exactly the same state before merging them; forcemerge removed
this requirement.
Unfortunately, the way I originally implemented this was shortsighted, and merely forced the merge partners to have the same values as the merge master. This meant that owners, blocking bugs, and many other things were silently changed, which meant that people weren't notified of changes, and bugs could end up in an inconsistent state.
A while ago, I decided to fix this by calculating the changes required
to actually merge the bugs, making those changes, and then merging the
bugs normally; thus, doing everything that a maintainer would normally
have done for them. This necessitated abstracting out the entire
control apparatus into the Debbugs::Control module.
Now that it's complete, you can do the following:
> forcemerge 1 2
Bug #1 [foo] new title
Bug #2 {Done: foo@bugs.something} [foo] foo
Unset bug forwarded-to-address
Severity set to 'wishlist' from 'grave'
3 was blocked by: 2
3 was not blocking any bugs.
Removed blocking bug(s) of 3: 2
2 was blocked by: 4
2 was not blocking any bugs.
Removed blocking bug(s) of 2: 4
Bug reopened
Removed annotation that bug was owned by bar@baz.com.
Removed indication that 2 affects bleargh
Removed tag(s) unreproducible and moreinfo.
Merged 1 2
> thanks
Stopping processing here.
and bug 2 now is merged with 1 and matches the state of 1.
[The above is the control output from the appropriate bit of the 06_mail_handling.t test.]
This change also means that I'll finally be able to write support for control@ operations at submit@ time. Also, all of the bug modifications that happen at submit@ or nnn@ time (setting title, found, etc.) will be implemented as calls to Debbugs::Control so we can eventually keep a PostgreSQL database updated in addition to the flatfile database.
Those of you who were in the various Debian infrastructure channels might have noticed that I was playing around with wanna-build, dak, sbuild, and buildd and friends. [Thanks to everyone who answered questions, btw.] Over the past week, I've been building most of CRAN, Bioconductor, and omegahat for unstable, amd64. I plan to build the same set of packages for i386, and will start a build shortly for stable as well. This effort builds on top of Charles Blundell and Dirk Eddelbuettel's cran2deb, which does most of the heavy lifting.
If you're like me, and use lots of different R packages, or already use some of the R packages available on the previous build, you can simply point your sources.list to the [http://debian-r.debian.net] archive, load the appropriate GPG key, and away you go. I have a bit more information available here and I will try to keep that page updated as I build other architectures and build out for stable.
Laurel sent me this gem[^1], which led me to find the Wikipedia page on Mathematical Jokes:

A Dozen, a Gross, and a Score,
plus three times the square root of four,
divided by seven,
plus five times eleven,
equals nine squared and not a bit more.
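For anyone who wants to check the arithmetic, here's the limerick worked out step by step (a dozen is 12, a gross is 144, and a score is 20):

```latex
\frac{12 + 144 + 20 + 3\sqrt{4}}{7} + 5 \cdot 11
  = \frac{176 + 6}{7} + 55
  = 26 + 55
  = 81
  = 9^2
```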
And from there, I found another:
![\int^{\sqrt[3]{3}}_1 \! z^2\, \mathrm{d}z \cdot \cos \frac{3\pi}{9}=\log \sqrt[3]{e}](./posts/numerology/80a289ec9b17284937af11448ec03379.png)
Integral z-squared dz
From 1 to the cube root of 3
Times the cosine
Of three pi over 9
Equals log of the cube root of e.
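This one checks out too, assuming "log" means the natural logarithm:

```latex
\int_1^{\sqrt[3]{3}} z^2 \, \mathrm{d}z \cdot \cos\frac{3\pi}{9}
  = \left[ \frac{z^3}{3} \right]_1^{\sqrt[3]{3}} \cdot \cos\frac{\pi}{3}
  = \left( \frac{3}{3} - \frac{1}{3} \right) \cdot \frac{1}{2}
  = \frac{1}{3}
  = \log \sqrt[3]{e}
```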
[^1]: Apparently by Leigh Mercer
I have a mythtv box which (when working) records television shows for
me. As I'm not interested in the vast majority of shows shown on US
television, it spends most of its time off, waiting for a show that I
want to record. This requires using nvram-wakeup, and one of the
oddities of my machine's bios is that it wants to be rebooted after
setting the nvram.
[This is likely due to Debian writing to the RTC after the nvram has been updated, but not setting the RTC seems stupid.]
After the reboot, the machine should halt, and grub should be
configured to start the machine normally once the bios starts.
As grub2 now supports named default entries, this is fairly
straightforward. We create a menu entry like the following in
/etc/grub.d/40_custom:
menuentry 'halt' {
    set saved_entry=0;
    save_env saved_entry;
    load_env;
    halt;
}
make sure that GRUB_DEFAULT="saved" in /etc/default/grub; and set MythShutdownNvramRestartCmd to /usr/sbin/grub-set-default halt:
mysql mythdb -e "UPDATE settings SET data='/usr/sbin/grub-set-default halt' WHERE value='MythShutdownNvramRestartCmd'";
and voilà, the machine now behaves properly with grub2.
UC Riverside's wireless network uses WPA-EAP for the encrypted network. [The unencrypted network does an HTTPS-based browser capture.] Unfortunately, none of the default wicd encryption templates support the precise brand of WPA that the network does, so you have to make your own template. Luckily, wicd makes this fairly simple:
Create a new template, say, /etc/wicd/encryption/templates/eap-only,
with appropriate contents.
name = EAP
author = Don Armstrong
version = 1
require identity *Identity passwd *Password
-----
ctrl_interface=/var/run/wpa_supplicant
network={
ssid="$_ESSID"
key_mgmt=WPA-EAP
identity="$_IDENTITY"
password="$_PASSWD"
}
Then tell wicd about this new template by editing
/etc/wicd/encryption/templates/active and adding eap-only to the
existing list of templates, and restart wicd with /etc/init.d/wicd
restart.
[I'm not sure if restarting wicd is necessary, but it shouldn't hurt.]
Finally, configure the network using the appropriate wicd interface as usual.
Trying to do some work for the Mystery Hunt which starts on Friday at noon, and the wireless at MIT keeps deauthenticating me for reason 1. (Which apparently is the dreaded "unknown reason for deauthentication".) Reassociating makes everything work again. Fast, hack solution:
while sleep 1s; do
    if iwconfig wireless | grep -q "ESSID:off"; then
        iwconfig wireless essid "MIT GUEST";
        echo "reset wireless";
    fi;
done;
and I'm back at work with relatively continuous network connectivity.
This year has been fairly good for our lab; a few years ago we found an association of rs17849502 with SLE. Finally, after significant work by our collaborators, we managed to demonstrate a functional significance of this SNP with possible importance for SLE. Changing H389 to Q reduces the function of the NADPH oxidase complex by 50% in the Vav-dependent Fcγ response.
You can read the paper "Lupus-associated causal mutation in neutrophil cytosolic factor 2 (NCF2) brings unique insights to the structure and function of NADPH oxidase." (pmid: 22203994)