Friday, October 11, 2013

Merging plink binary files

If you have a bunch of plink binary files for the same set of people split over different chromosomes, how do you merge them? A simple answer would be to use the --merge-list option in plink. However, this requires some setup, and it turns out there is a much easier way to do it.
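For comparison, here is a rough sketch of the setup that --merge-list needs: a text file with one line per fileset to merge in, naming its bed, bim and fam files. The helper name write_merge_list and the output name merge_list.txt are made up for illustration, and the file naming follows the plink.chrN convention used in this post.

```shell
# write_merge_list is a hypothetical helper: it prints one line per fileset,
# listing the .bed, .bim and .fam files for each chromosome given.
write_merge_list() {
    local prefix=$1; shift
    local chr
    for chr in "$@"; do
        echo "$prefix.chr$chr.bed $prefix.chr$chr.bim $prefix.chr$chr.fam"
    done
}

# Usage (chromosome 1 is the base fileset, so it is left out of the list):
#   write_merge_list plink {2..22} X Y > merge_list.txt
#   plink --bfile plink.chr1 --merge-list merge_list.txt --make-bed --out plink
```

This works, but plink then has to load and re-encode every fileset, which is the overhead the trick below avoids.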

Assuming that the order of the samples within the plink files is the same (i.e. the fam files are all identical) and that each plink binary file is in SNP-major format (this is important! see here: http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml) then the following code will just work:

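# The fam files are identical, so any one of them will do
cat plink.chr1.fam > plink.fam
# Concatenate the variant lists in chromosome order
for chr in {1..22} X Y; do cat "plink.chr$chr.bim"; done > plink.bim
# Write the three-byte SNP-major header once, then every bed file minus its own header
(printf '\x6C\x1B\x01'; for chr in {1..22} X Y; do tail -c +4 "plink.chr$chr.bed"; done) > plink.bed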

This assumes the plink files you want to merge are named plink.chr1.bed, plink.chr2.bed, ..., plink.chrY.bed, and that each file corresponds to a different chromosome of your dataset. This is a typical scenario, especially if you are converting a big sequencing project delivered in multiple VCF files into plink. Note that the {1..22} brace expansion needs a shell that supports it, such as bash.
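Since the trick silently produces garbage if either assumption is violated, it may be worth checking them first. The following is a minimal sketch, not a complete validator: check_inputs is a hypothetical helper name, and it only checks that every fam file is byte-identical to the first one and that every bed file starts with the SNP-major header bytes 0x6C 0x1B 0x01.

```shell
# check_inputs PREFIX CHR...  (hypothetical helper)
# Returns non-zero if any fam file differs from the first chromosome's fam
# file, or if any bed file lacks the three-byte SNP-major header.
check_inputs() {
    local prefix=$1; shift
    local first=$1 chr magic
    for chr in "$@"; do
        # All .fam files must be byte-identical to the first one.
        if ! cmp -s "$prefix.chr$first.fam" "$prefix.chr$chr.fam"; then
            echo "fam file mismatch (or missing file): chr$chr" >&2
            return 1
        fi
        # Read the first three bytes of the .bed file as hex.
        magic=$(head -c 3 "$prefix.chr$chr.bed" | od -An -tx1 | tr -d ' \t\n')
        if [ "$magic" != "6c1b01" ]; then
            echo "not a SNP-major bed file: chr$chr" >&2
            return 1
        fi
    done
}
```

In bash, it could be run as: check_inputs plink {1..22} X Y && echo "safe to concatenate"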

A plink binary file in SNP-major format is just a big table preceded by three bytes: the first two (0x6C 0x1B) identify the file type and the third gives the order (0x01 for SNP-major, 0x00 for individual-major). Each SNP occupies a whole number of bytes, with four genotypes packed per byte. By default plink generates SNP-major binary files, which are ordered by SNP, like VCF files, so you can just concatenate them after removing the three-byte headers.
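This layout also gives a cheap sanity check on the merged result. With four genotypes per byte, each SNP takes ceil(N/4) bytes for N samples, so the bed file should be exactly 3 + M * ceil(N/4) bytes for M SNPs. A minimal sketch, with hypothetical helper names, assuming the fam and bim files have one record per line and end with a newline:

```shell
# expected_bed_size FAM BIM  (hypothetical helper)
# Prints the size in bytes that a SNP-major .bed file should have.
expected_bed_size() {
    local fam=$1 bim=$2 n_samples n_snps
    # Line counts give the number of samples and SNPs; the arithmetic
    # expansion also strips any padding that wc adds on some systems.
    n_samples=$(( $(wc -l < "$fam") ))
    n_snps=$(( $(wc -l < "$bim") ))
    # 3 header bytes + one block of ceil(n_samples / 4) bytes per SNP
    echo $(( 3 + n_snps * ( (n_samples + 3) / 4 ) ))
}

# check_bed_size PREFIX  (hypothetical helper)
# Succeeds if PREFIX.bed has the size implied by PREFIX.fam and PREFIX.bim.
check_bed_size() {
    local prefix=$1 actual
    actual=$(( $(wc -c < "$prefix.bed") ))
    [ "$actual" -eq "$(expected_bed_size "$prefix.fam" "$prefix.bim")" ]
}
```

After merging, running check_bed_size plink && echo "size OK" should succeed; a mismatch suggests that one of the inputs was not in SNP-major format or had a different set of samples.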
