Postgresql dbsnp Mirror

Getting Files

Assuming you are interested in humans (like I am) the following will download the databases and schemas for the current release of dbsnp. For humans, this is currently ≈60G, and takes a while to retrieve (about 24 hours or so).

rsync -rvP --include 'organism_**' --exclude '**' rsync://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/ .
rsync -rvP rsync://ftp.ncbi.nlm.nih.gov/snp/database/shared_data .
rsync -rvP rsync://ftp.ncbi.nlm.nih.gov/snp/database/shared_schema .

Preparing SQL Schemas

If you're loading the human databases, I've already done the work for you. Simply use my git repository (git init db_snp_utils; git pull http://git.donarmstrong.com/dbsnp.git).

Otherwise, you'll want to use mssql_psql_conversion.pl in my dbsnp git repository to convert your organism's schema into appropriate bits. Something like the following will get you close (you may still need to tweak the sql manually):

for a in {organism,data}_schema/*.sql; do 
    ./mssql_psql_conversion.pl ${a} > ${a%%.sql}_postgrseql.sql;
done;

Loading data

Once the schema are correct, you want to load the schema and the data, then apply the constraints and indexes. This will take some time even on a fairly fast machine. [I would expect at least 2-3 days, unless you have exceptionally fast disks.]

I have included a script in the utils directory of the git repository above called load_snp_data.sh, which applies the schema, loads the data, and then applies the indexes and constraints. It looks like the following:

psql -c 'DROP DATABASE snp';
psql -c 'CREATE DATABASE snp';

DATA_DIR=/srv/ncbi/db_snp/
SCHEMA_DIR=/srv/ncbi/db_snp_utils/schema
UTIL_DIR=${SCHEMA_DIR}/../utils/

(cd ${SCHEMA_DIR}/shared_schema;
    cat dbSNP_main_table_postgresql.sql |psql snp;
)
(cd ${SCHEMA_DIR}/human_9606_schema;
    cat *_table_postgresql.sql|psql snp;
    ${UTIL_DIR}/human_gty1_indexes_creation.pl create trigger |psql snp;
)
(cd ${DATA_DIR}/shared_data;
    for a in $(find -type f -iname '*.bcp.gz' -printf '%f\n'|sort); do
    echo $a;
    zcat $a | perl -pe 's/\r/\\r/g' |psql snp -c "COPY ${a%%.bcp.gz} FROM STDIN WITH NULL ''";
    done;
)
(cd ${DATA_DIR}/organism_data;
    for a in $(find -type f -iname '*.bcp.gz' -printf '%f\n'|sort); do
    echo $a;
    zcat $a | perl -pe 's/\r/\\r/g' |psql snp -c "COPY ${a%%.bcp.gz} FROM STDIN WITH NULL ''";
    done;
)
(cd ${SCHEMA_DIR}/shared_schema;
    cat dbSNP_main_index_postgresql.sql dbSNP_main_constraint_postgresql.sql|psql snp;
)
(cd ${SCHEMA_DIR}/human_9606_schema;
    cat *_{index,constraint}_postgresql.sql|psql snp;
    ${UTIL_DIR}/human_gty1_indexes_creation.pl index |psql snp;
)

Permissions on the database

Since I have my database on a server separate from the workstations (and other machines) that I often do work on, I need remote access to the database. To make this easy (and avoid having to hard code database details into the few dozen scripts I use), I created a postgresql service called snp.

An entry like this:

[snp]
 dbname=snp
 user=snpuser
 password=somepassword
 port=9212
 host=snpdb.donarmstrong.com

in the pg_service.conf file (in /etc/postgresql-common or PGSYSCONFDIR) will configure the service.

You then need to make sure that the database server is listening on the appropriate ip address (edit the database's postgresql.conf file), and that snpuser has select privileges and can connect. A line like the following in pg_hba.conf

host    snp             snpuser         192.168.0.0/24           md5

and the following sql will set that up for you.

CREATE USER snpuser WITH PASSWORD ('somepassword');
GRANT SELECT ON ALL TABLES IN SCHEMA public TO snpuser;

Then, to test, you should be able to run:

psql "service=snp" -c 'SELECT * FROM snp LIMIT 5';

On another machine.

Querying the database

Once the process above has finished, you can actually query the database. For example, to select information about rs17849502 with its chromosome and position, you do something like the following:

SELECT scpr.snp_id AS snp_id,
       scpr.chr AS chr,
       scpr.pos AS pos,
       scpr.orien AS orien,
       scl.allele AS ref,
       uv.var_str AS var_str,
       ruv.var_str AS rev_var_str,
       ml.locus_id AS uid,
       ml.locus_symbol AS symbol,
       gitn.gene_name AS description
       FROM snp s
         JOIN b135_snpchrposonref_37_3 scpr ON s.snp_id=scpr.snp_id
         JOIN b135_snpcontigloc_37_3 scl ON scpr.snp_id=scl.snp_id
         JOIN b135_contiginfo_37_3 ci ON scl.ctg_id = ci.ctg_id
         LEFT OUTER JOIN b132_snpcontiglocusid_37_1 ml ON s.snp_id=ml.snp_id
         LEFT OUTER JOIN geneidtoname gitn ON ml.locus_id=gitn.gene_id
         JOIN univariation uv ON s.univar_id=uv.univar_id
         JOIN univariation ruv ON uv.rev_univar_id=ruv.univar_id
WHERE ci.group_term LIKE 'GRCh%' AND s.snp_id='17849502';

[I personally use this very query in a program called snp_info, which I'll probably share later.]