libravatar for the BTS (and boring encoding fixes)

While fixing a few encoding problems that I managed to introduce to the BTS almost half a year ago, I took a coding detour and added libravatar support to the BTS. Every e-mail now has an avatar to the right which should correspond to the sender. Libravatar is a federated service, which means that if you control your domain, you can serve your own icons. It also automatically falls back to Gravatar, so if you're using that service, things should "just work". Hopefully this will be primarily amusing, and people won't abuse it.
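How the BTS builds its avatar links isn't shown here, but the usual Libravatar lookup is simple enough to sketch. The following is a minimal example, assuming the MD5 form of the hash and the public HTTPS mirror at seccdn.libravatar.org; the function name libravatar_url is made up for illustration, and a federated domain would serve the image from its own host instead.

```sh
# Build a Libravatar URL for an e-mail address.
# Assumption: the public HTTPS mirror is used; federated domains (found via
# DNS) are not handled in this sketch.
libravatar_url() {
    # Strip whitespace, lower-case the address, then hash it with MD5.
    hash="$(printf '%s' "$1" | tr -d '[:space:]' | tr '[:upper:]' '[:lower:]' \
        | md5sum | cut -d' ' -f1)"
    printf 'https://seccdn.libravatar.org/avatar/%s\n' "$hash"
}
```

Calling libravatar_url with a sender's address prints the URL of the image to fetch.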

More importantly, but much less fun, the double encoding problems (where mails would get double-encoded if any of the headers contained non-us-ascii text) and the mojibake wontfix icon (☹) should be fixed now. If you see any additional cases of this, please report them to owner@bugs.debian.org.

Bug Reporting Rate in Debian

Christian's most recent blog post got me wondering if the decline in the bug reporting rate in Debian was something new, or something which often happened during releases. So, let's try to figure that out. In the BTS, when a bug report is filed, the report is written to a file called bugnum.report, and then not touched from then on. Let's look at the modification date on that file to see when each bug was filed; and since we're going to plot this, let's only look at bugs ending in 00:

stat -c '%n %Y' /srv/bugs.debian.org/spool/{archive,db-h}/00/*.report > ~/reporting_rate.txt

Now, let's get the data into R and plot it. [For clarity, I'm not showing the R code, but it's available in the source code for this post.]
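The R code isn't reproduced here, but the binning step can be approximated in shell. This is a minimal sketch, assuming the two-column "filename epoch" output of the stat command above; the function name reports_per_window and the 30-day default window are made up for illustration. Because only bugs ending in 00 were sampled, each count is roughly a hundredth of the real number of reports in that window.

```sh
# Count reports per fixed-width time window.
# $1: window size in seconds (default 2592000, i.e. 30 days).
# stdin: lines ending in an epoch timestamp, as written by stat -c '%n %Y'.
# stdout: "<window-start-epoch> <count>", sorted by window start.
reports_per_window() {
    awk -v w="${1:-2592000}" '
        { bucket = int($NF / w) * w; count[bucket]++ }
        END { for (b in count) print b, count[b] }
    ' | sort -n
}
```

For example, reports_per_window < ~/reporting_rate.txt gives one line per 30-day window, which is enough to eyeball the trend without plotting.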

From the plot (Bugs reported per second over time with a red loess fit line), it looks like we do see a decline during certain periods. However, there's an even more alarming trend of a decrease in bug reporting in Debian which has been happening since 2006. (Note that I've truncated the y scale significantly; there are periods in Debian where the bug rate is astronomically high, usually corresponding to mass bug filings; I've also limited the plot to data from 2003 on, as I have to clean up that data significantly before I can plot it like this.)

Not sure exactly what that means, but it is troubling.

Plotting Ethnic Regions

I was asked over the weekend how to plot SNPs which are associated with specific ethnicities; the following code is a really quick stab at the problem using the grid graphics engine in R.

> require(grid)
> snp.position <- 1:10
> snp.integers <- sample.int(5,size=length(snp.position),replace=TRUE)
> ### the position is really the midpoint of the range
> snp.midpoint <- snp.position
> ### start of the SNP; we're assuming it should start at 0.
> snp.start <- c(0,snp.position[-1]-diff(snp.position)/2)
> ### stop of the snp; we're assuming it should stop at the last snp position
> snp.stop <- c(snp.position[-length(snp.position)]+diff(snp.position)/2,
>   snp.position[length(snp.position)])
> snp.width <- snp.stop - snp.start
> 
> ### these are the colors
> possible.colors <- c("red","blue","yellow","green","purple")
> snp.colors <- possible.colors[snp.integers]
> 
> ### this sets up the viewport that we'll plot into
> pushViewport(viewport(height=unit(1,"npc")-unit(7,"lines"),
> width=unit(1,"npc")-unit(7,"lines")
>   ))
> pushViewport(dataViewport(xscale=range(c(0,snp.stop)),yscale=c(0,1)))
> 
> ### this draws a rectangle around the graph
> grid.rect()
> ### this sets up the x axis
> grid.xaxis()
> ### this labels the X axis
> grid.text("Position on Chromosome",y=unit(-2.5,"lines"))
> 
> ### this draws all of the boxes corresponding to each SNP in the
> ### appropriate color
> grid.rect(x=unit(snp.start,"native"),
>         width=unit(snp.width,"native"),
>         y=unit(0.5,"native"),
>         height=unit(0.25,"native"),just=c("left","center"),
>         gp=gpar(col=snp.colors,fill=snp.colors))
> 
> ### this pops the viewport
> popViewport(2)
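To sanity-check the start/stop arithmetic outside of R, the same calculation can be written as a small shell function. This is only a sketch: the name snp_ranges is hypothetical, and it assumes one ascending position per line. Each box starts halfway between a SNP and its predecessor (or at 0 for the first) and stops halfway to its successor (or at the last position).

```sh
# Print "start stop width" for each SNP position read from stdin.
snp_ranges() {
    awk '
        { pos[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++) {
                # First box starts at 0; the others start at the midpoint
                # between this position and the previous one.
                start = (i == 1)  ? 0       : (pos[i-1] + pos[i]) / 2
                # Last box stops at the last position; the others stop at
                # the midpoint between this position and the next one.
                stop  = (i == NR) ? pos[NR] : (pos[i] + pos[i+1]) / 2
                print start, stop, stop - start
            }
        }'
}
```

Running seq 1 10 | snp_ranges reproduces the snp.start, snp.stop and snp.width vectors used in the R code above.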

Finding out Cytobands/Idiograms for assemblies

For many organisms it is common to use idiograms (cytobands) to describe approximately where something is located on a chromosome relative to the chromosome's larger structure, which is handy when exact locations are not required.

Until recently, I didn't know where NCBI kept their idiogram annotations, which made my mirror of dbsnp (which I use to annotate my whole genome analyses) slightly less useful than it could have been. But, after a bit of searching of NCBI's ftp site, I was able to locate the file in the new movie directory: ideogram_9606_GCF_000001305.13_850_V1.

Then, after a quick bit of work with SQL, I have the following schema:

CREATE TABLE idiogram (
       chr TEXT NOT NULL,
       pq  TEXT NOT NULL,
       idiogram TEXT NOT NULL,
       -- I think these are related to recombination rates, but I'm not sure
       rstart INT NOT NULL,
       rstop INT NOT NULL,
       start INT NOT NULL,
       stop INT NOT NULL,
       -- I believe this indicates whether the band is black or white
       posneg TEXT NOT NULL
       );

CREATE UNIQUE INDEX ON idiogram(chr,pq,idiogram);
CREATE UNIQUE INDEX ON idiogram(chr,start);
CREATE UNIQUE INDEX ON idiogram(chr,stop);

and an additional bit of SQL in my SNP annotation perl script:

SELECT CONCAT(chr,pq,idiogram) AS idiogram
  FROM idiogram
  WHERE idiogram.chr = ? AND idiogram.start <= ? AND idiogram.stop >= ? LIMIT 1;

and some code:

sub find_idiogram {
    my %param = @_;

    my %info;
    my $rv = $param{sth}->execute($param{chr},$param{pos},$param{pos}) //
        die "Unable to execute statement properly: ".$param{dbh}->errstr;
    my ($idiogram) = map {ref $_ ?@{$_}:()} map {ref $_ ?@{$_}:()} $param{sth}->fetchall_arrayref([0]);
    if ($param{sth}->err) {
        print STDERR $param{sth}->errstr;
        $param{sth}->finish;
        return 'NA';
    }
    $param{sth}->finish;
    return $idiogram // 'NA';
}

and voilà:

id chr pos idiogram ref alt orig_id gene [...]
rs10000010 4 21618674 4p16.3 T C rs10000010 KCNIP4 [...]

idiograms for every SNP.
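If you don't have the table in a database, the same containment lookup can be sketched over a flat file. This is a minimal example, assuming a whitespace-separated export with the columns chr, arm, band, start and stop (a hypothetical layout, not the format of the NCBI file); find_band is likewise a made-up name. A band matches when start <= position <= stop, and NA is printed when nothing matches, mirroring the Perl code.

```sh
# Print the band (e.g. "4p16.3") that contains a position, or "NA".
# $1: chromosome, $2: position, $3: band table (chr arm band start stop).
find_band() {
    awk -v chr="$1" -v pos="$2" '
        # The "+ 0" forces numeric comparison of the coordinates.
        $1 == chr && $4 + 0 <= pos + 0 && pos + 0 <= $5 + 0 {
            print $1 $2 $3; found = 1; exit
        }
        END { if (!found) print "NA" }
    ' "$3"
}
```

For example, find_band 4 21618674 bands.txt would look up the position of rs10000010 in a file named bands.txt.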

Watching the Presidential Debates

Tonight I will be watching the presidential debates. Since I can't stand the commentators on any of the major news networks, I will once again be watching the debates on C-SPAN. If you don't have cable (or your cable plan doesn't include C-SPAN), you can watch the video feed online. If you're running Debian, you can also use rtmpdump and mplayer to play the stream on your computer fairly easily:

  rtmpdump -v -r rtmpt://cp82346.live.edgefcs.net:1935/live?ovpfv=2.1.4 \
     --tcUrl rtmp://cp82346.live.edgefcs.net:1935/live?ovpfv=2.1.4 \
     --app live?ovpfv=2.1.4 --flashVer LNX.11,2,202,238 \
     --playpath CSPAN1@14845 \
     --swfVfy http://www.c-span.org/cspanVideoHD.swf \
     --pageUrl http://www.c-span.org/ | \
     mplayer -xy 3 -;

Then you can find a bingo card of your own, and play debate bingo! Or some other horrible drinking game.

Switching to KGB from CIA

CIA.vc has unfortunately disappeared, and is unlikely to return any time soon. I personally have decided to switch to KGB, but other alternatives such as FBI and irker exist.

To switch, you first need to find or set up a kgb bot. If this is a Debian associated FOSS project, feel free to contact me or join #kgb-devel on irc.oftc.net and ask for someone to allow your project to talk to their bot. Once you've found a bot, we need to set up the client. [I'll talk about bot set up at the end.]

kgb-client configuration

Install the kgb-client and kgb-client-git packages. Currently, kgb only supports subversion, git, and cvs, but support for additional VCSes continues to be added as kgb gains popularity.

For git repositories, add a post-receive hook like the following:

#!/bin/sh
tee hooks/reflog | kgb-client --conf /path/to/kgbclient.conf --repository git --git-reflog -
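It may help to know what that hook is piping around: git feeds a post-receive hook one line per updated ref on stdin, in the form "old-sha new-sha refname", and an all-zero SHA means the ref was just created (old) or deleted (new). The following sketch only illustrates that input format; describe_updates is a hypothetical helper and is not part of kgb-client.

```sh
# Summarise the ref updates a post-receive hook receives on stdin.
describe_updates() {
    # Build the 40-character all-zero SHA that marks a missing ref.
    z10=0000000000
    zero="$z10$z10$z10$z10"
    while read -r old new ref; do
        if [ "$old" = "$zero" ]; then
            echo "created $ref"
        elif [ "$new" = "$zero" ]; then
            echo "deleted $ref"
        else
            echo "updated $ref"
        fi
    done
}
```

Since the hook above saves a copy of its input with tee, running describe_updates < hooks/reflog shows what was last handed to kgb-client.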

For subversion repositories, add a post-commit hook like the following:

#!/bin/sh
kgb-client --conf /path/to/kgbclient.conf --repository svn "$1" "$2"

Then update the configuration file /path/to/kgbclient.conf:

---
repo-id: my-repository
servers:
 - uri: http://servername:9999/
   password: verysecret
# optional link to a website where the commits are;
# needs newish kgb-client and server
web-link: http://example.com/?p=my-repository;a=commitdiff;h=${commit}

Then, send the bot owner the password, repo-id, channel, and network you'd like the changes to be reported to.

Configuring kgb-bot

The bot just listens for SOAP requests and, if the password matches, sends the commit to the appropriate IRC channel. To set one up, install kgb-bot.

Then, enable the bot (set BOT_ENABLED=1 in /etc/default/kgb-bot), and edit the bot's configuration file /etc/kgb-bot/kgb.conf:

---
soap:
  server_addr: 0.0.0.0
  server_port: 9999
  service_name: KGB
queue_limit: 150
log_file: "/var/log/kgb-bot.log"
repositories:
  # just a name to identify it
  my-repository:
    # needs to be the same on the client
    password: verysecret
networks:
  oftc:
    nick: KGB-you
    ircname: KGB bot
    username: kgb
    password: ~
    nickserv_password: yournickservpassword
    server: irc.oftc.net
    port: 6667
  freenode:
    nick: KGB-you
    ircname: KGB bot
    username: kgb
    password: ~
    nickserv_password: yournickservpassword
    server: irc.freenode.net
    port: 6667
channels:
  - name: '#your-channel'
    network: oftc
    repos:
     # must match a name under repositories: above
     - my-repository
  - name: '#commits'
    network: freenode
    repos:
     - my-repository

Then start the bot (/etc/init.d/kgb-bot start), and watch as it joins channels and reports your changes!

You'll probably actually want to register whatever nick you are using on the networks, etc... but you can figure that out yourself!

Migrating from Subversion to git with git-annex

Recently, I've started converting many of my subversion repositories to git, some of which contain fairly large files (2-3G). However, git can be slow to deal with repositories with large files, and it also isn't able to selectively discard unneeded files when disk space is pressing. Thankfully, git-annex resolves most of these problems with git, but the process required to use git-annex on a converted subversion repository is slightly complicated.

Basic conversion of svn to git

The basic conversion of svn to git is done using git-svn:

 git svn clone file:///srv/svn/foo --no-metadata -A authors.txt -T trunk foo

where /srv/svn/foo is the subversion repository, authors.txt is a list of login = Full Name <email@example.com> pairs matching each of the subversion commit authors, and foo is the git repository to create.
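Writing authors.txt by hand gets tedious for repositories with many committers. A minimal sketch for generating a skeleton from svn log -q output follows; authors_skeleton is a made-up name, and the full name and example.com address on the right-hand side are placeholders that still need to be edited by hand.

```sh
# Turn `svn log -q` output (read from stdin) into an authors.txt skeleton.
authors_skeleton() {
    # Revision lines look like "r12 | login | date"; keep the unique logins.
    awk -F ' [|] ' '/^r[0-9]+ [|] / { print $2 }' | sort -u |
    while read -r login; do
        # Placeholder name and address; replace them before converting.
        printf '%s = %s <%s@example.com>\n' "$login" "$login" "$login"
    done
}
```

Something like svn log -q file:///srv/svn/foo | authors_skeleton > authors.txt gives a file to start editing.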

git-svn has a ton of useful options, but the basic invocation above is all I'm concerned with.

Migrating large files from git into git-annex

In order to migrate from git to a git+git-annex setup, we'll have to walk the entire commit history, and edit each commit to instead store large files in git-annex, replacing the large file with a symlink, and finally eliminate all of the references to the old large objects, and do garbage collection.

Because the same file may move around, we're going to use the git-annex SHA1 backend instead of the default WORM backend (which is based on filename and size), and then init git-annex:

  cd foo; echo '* annex.backend=SHA1' > .git/info/attributes
  git annex init

Then, we're going to filter out the large files using git filter-branch. To do that, we'll first create a little helper script git_annex_add.sh, which will remove the file from the git repository, add it to git-annex, and fix up the symlinks:

 #!/bin/bash
 f="$1";
 git rm --cached "${f}";
 git annex add "${f}";
 annexdest="$(/bin/readlink -v "${f}")";
 ln -sf "${annexdest#../../}" "${f}";
 echo -n "Added: "
 ls -l "${f}";

Then we will run filter-branch, and annex all files larger than 5 megabytes. [Tweak the find command if you want to do something different.]

 git filter-branch  --tag-name-filter cat --tree-filter \
'find . -ipath \*.git\* -prune -o -path \*.temp\* -prune -o -size +5M -type f -print0|xargs -0 -r -n1 ~/git_annex_add.sh;
 git reset HEAD .git-rewrite; :' -- master

This operation will take a while. [It would be better to do this during the initial svn→git conversion, but since that requires more knowledge of git-svn, svn, git, and git-annex internals than I have, and I only have to do this once for each repository, it's not worth my time.]
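Since the rewrite is slow, it can be worth previewing which files in the current checkout would cross the threshold before starting. A minimal sketch, assuming GNU find (for the M size suffix); list_annex_candidates is a hypothetical helper, and it only looks at the working tree, not at older revisions.

```sh
# List regular files above a size threshold, skipping any .git directory.
# $1: directory to scan (default .); $2: find size expression (default +5M).
list_annex_candidates() {
    find "${1:-.}" -name .git -prune -o -type f -size "${2:-+5M}" -print
}
```

Running list_annex_candidates in the top of the repository prints the files that the tree filter would hand to git_annex_add.sh for the latest commit.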

Now we have successfully switched everything to using git-annex, and we need to clean out the old references to the files:

 rm .git/svn -rf;
 rm -rf .git/refs/original .git/refs/remotes/trunk .git/refs/remotes/git-svn;
 git reflog expire --expire=now --all
 git gc --prune=now
 git gc --prune=now --aggressive

(I'm not sure if the last two commands need to be separate; I'm cargo culting a bit there.)

Storing all git-annex files in a remote repository

Because git-annex allows you to easily throw away files which are no longer referred to by the tip of any branch using git annex unused and git annex dropunused (and because I'd like all of the files on my central remote repository), I'm going to shove all of the git-annex files into the remote bare repository. Normally, you would use git annex copy --to=remote to do this, but because that only copies needed files, not everything, we'll have to do it manually.

First, create the remote repository:

 git init --bare /srv/git/foo.git
 cd /srv/git/foo.git; git annex init foo.example.com

Add the remote to the local repository, push to it, copy the annex objects over with rsync, and sync the annex:

 git remote add origin ssh://foo.example.com/srv/git/foo.git
 git push origin master
 rsync -avP .git/annex/objects foo.example.com:/srv/git/foo.git/annex/.;
 git annex sync

Finally, on the remote, run git annex fsck to clean up the links to the imported objects:

 cd /srv/git/foo.git; git annex fsck;

Unresolved issues

I don't know if the above works properly for branches. I suspect that it does not. I also have not exhaustively tested this methodology to verify that all of the history is present in every case. But hopefully this post (or some modification of it) will be helpful to you.

Credit

Many of the methodologies described here I originally found in tyger's git-annex forum post, the git gc stuff came from random google searches about shrinking git repositories, and the rsync suggestion came from joeyh (author of git-annex) and the other helpful denizens of #vcs-home on irc.oftc.net.

Debbugs: Control at Submit time

One of the features that I have been asked for multiple times is the ability to use control@bugs.debian.org commands at submit@bugs.debian.org time. I have now implemented this with the following syntax:

Package: foo
Version: 1.0-3
Control: retitle -1 this is the title
Control: severity -1 bleargh
Control: summary -1 0
Control: forward -1 http://bugs.debian.org/nnn

In short, you preface any control commands with Control:, -1 is the current bug, and the rest of each line is the control@ grammar you already know. This also now works for every kind of message to nnn@bugs.debian.org with the exception of messages received at nnn-done and nnn-forwarded. I don't know why you'd use it for anything else but submit@ messages, but hey, whatever works.
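To make the -1 convention concrete, here is a rough illustration of how the pseudo-headers above map onto ordinary control@ commands once the bug number is known. This is not how debbugs implements it; expand_control is a made-up helper that reads a message up to the first blank line and assumes 12345 (or whatever you pass in) is the number assigned to the new bug.

```sh
# Print the control@ commands implied by the Control: pseudo-headers on stdin.
# $1: the bug number that -1 should stand for.
expand_control() {
    awk -v bug="$1" '
        /^$/ { exit }                      # stop at the first blank line
        /^Control:/ {
            sub(/^Control:[ \t]*/, "")     # drop the pseudo-header name
            for (i = 1; i <= NF; i++)      # swap each bare -1 for the bug
                if ($i == "-1") $i = bug
            print
        }'
}
```

Fed the example above with 12345 as the argument, this prints retitle 12345 this is the title, severity 12345 bleargh, and so on.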

Debbugs: outlook command

Neil McGovern asked me to add an additional feature to the BTS to support tracking the current status of attempts at fixing a bug. In past releases, we've used the nice commenting feature of bts.turmzimmer.net to keep track of what is going on in a particular nasty RC bug, who is working on it, and what needs to be done next (or if everyone can just ignore the bug).

This feature should probably have already been in the BTS to start with, but now it is. In addition to the existing summary feature, where you can nominate a message or text to be the summary for a bug, there is an outlook command, which tracks the current status of the bug, and behaves in exactly the same way:

outlook 12345 not good
outlook 54321 0
thanks

I'm totally stymied by #54321.

for example.

I plan to include the outlook in the bugscan output in the future too, so it'll be easily accessible. (And possibly up-to-the-minute with some javascript-fu.)

Wikipedia Deletion Reviews

Many community driven projects have a problem with overgrown bureaucratic processes reducing the desire and ability of casual contributors to contribute. Debian has struggled with this problem, with efforts like Debian Maintainers and sponsorship to address it, but it's insidious and difficult to completely overcome. I recently ran into this problem again with Wikipedia, where I'm a casual contributor (I probably average an edit a month).

Sometime in 2006, I uploaded an image of a Boojum tree I took in 2004 to Wikipedia to provide an image for the Boojum tree article. It wasn't a particularly awe-inspiring image, as I took it while I was teaching second quarter freshman biology on campus, and showing the students the awesome Botanical Gardens at UCR. In 2011, a Wikipedia user asked for deletion of the image because of some confusion about the copyright on the image, as Apache::Gallery's default template footer contains Apache::Gallery's copyright. I didn't notice that during the 7 day deletion review period because I rarely log into Wikipedia.

A few days ago, I noticed the deletion and asked for a Deletion Review. I assumed that my explanation that the copyright notice was for Apache::Gallery would be understood (or at least believed), and that at least the original reason for the deletion would be seen to be invalid. Instead, during the process I was questioned as to whether I actually took the picture, why I used GPLv2+ for the pictures, whether I was claiming other people's images, and whether the image was actually good enough to be in Wikipedia in the first place. Hundreds of lines of text, an edit to the template in A::G, hours wasted, people still unsatisfied, and the potential contributor (myself) feeling so annoyed with the entire process that I bothered to write this blog entry.

While I'm not sure what to do about Wikipedia, I've been forcibly reminded of how important it is to enable easy contributions, and how alienated you can feel when you are stymied by process to the point that your (admittedly insignificant) contribution to a project no longer seems worth the effort.

This blog is powered by ikiwiki.