Assuming that the order of the samples within the plink files is the same (i.e. the fam files are all identical) and that each plink binary file is in SNP-major format (this is important! see here: http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml) then the following code will just work:
cat plink.chr1.fam > plink.fam
for chr in {1..22} X Y; do cat plink.chr$chr.bim; done > plink.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X Y; do tail -c +4 plink.chr$chr.bed; done) > plink.bed
This assumes the plink files you want to merge have names like plink.chr1.bed, plink.chr2.bed, ..., plink.chrY.bed and that each file corresponds to a different chromosome of your dataset. This is a typical scenario, especially if you are converting a big sequencing project delivered in multiple VCF files into plink.
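Since the merge depends on every per-chromosome fam file being identical, it may be worth checking that before concatenating anything. A minimal sketch, assuming the file naming above; same_fam is a hypothetical helper name (not a plink command), and cmp is the standard Unix byte-comparison tool:

```shell
# Hypothetical helper: returns 0 only if every listed file is
# byte-identical to the first one; reports the first mismatch on stderr.
same_fam() {
  first=$1
  shift
  for f in "$@"; do
    cmp -s "$first" "$f" || { echo "$f differs from $first" >&2; return 1; }
  done
  return 0
}
```

For example, same_fam plink.chr*.fam && echo "fam files match" would run the check over all per-chromosome files before the merge.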
A plink binary file in SNP-major format is just a big table preceded by three bytes that identify the file type and the order (SNP-major or individual-major). By default plink generates SNP-major binary files, which are ordered by SNP, like VCF files, so you can just concatenate them after removing the three-byte header from each one.
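To make that layout concrete: the SNP-major header is the three bytes 0x6C 0x1B 0x01, and each SNP then takes ceil(N/4) bytes for N samples (two bits per genotype). The sketch below is one way to check both the inputs and the merged result; is_snp_major and expected_bed_size are hypothetical helper names introduced here for illustration, not part of plink:

```shell
# Hypothetical helper: succeeds only if the file starts with the
# SNP-major magic bytes 6c 1b 01.
is_snp_major() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "6c1b01" ]
}

# Hypothetical helper: prints the size in bytes a SNP-major bed file
# should have, given its fam file ($1) and bim file ($2):
# 3 header bytes + (number of SNPs) * ceil(number of samples / 4).
expected_bed_size() {
  n_samples=$(wc -l < "$1")
  n_snps=$(wc -l < "$2")
  echo $(( 3 + $n_snps * ( ($n_samples + 3) / 4 ) ))
}
```

After merging, the output of expected_bed_size plink.fam plink.bim should equal the output of wc -c < plink.bed; a mismatch suggests that one of the inputs was not in SNP-major format or that a header was not stripped.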
I was looking for this for a while. Thank you so much for this useful post!
I have received the imputed data divided into chromosomes 1 to 22 (.bed, .bim, and .fam files).
In the next step, I would like to perform the post-imputation QC and association analysis.
Please let me know if I can use this code to merge the multiple files into one plink file.
Thanks in advance
M
I don't see why not. However, imputed data can generate really large bed/bim files so merging is not always the best course of action.