Tuesday, May 19, 2015

Annotate CpG mutations in a VCF file

Mutations in the human genome are not all made the same. Even when restricting our attention to the most common form of human variation, that is, single nucleotide polymorphisms, there are different categories. To a first approximation we have transversions and transitions. And among transitions we have transitions at CpG sites, which are much more common than transitions at non-CpG sites. Hence the need to distinguish the two. If you have mutations encoded in the standard VCF format, it is easy to distinguish transitions from transversions. The first kind shows up as A<->G or C<->T, while the second kind shows up as A<->C, A<->T, C<->G, or G<->T. But distinguishing CpG transitions from non-CpG transitions within a VCF is not possible without additional information, as we need to know the sequence context.
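
This classification can be checked mechanically; the following awk one-liner (a toy sketch, not part of any pipeline below) labels REF/ALT pairs accordingly:

```shell
# label each REF/ALT pair: purine<->purine (A<->G) and pyrimidine<->pyrimidine
# (C<->T) changes are transitions, everything else is a transversion
printf 'A G\nC T\nA C\nG T\n' |
  awk '{print $1"<->"$2, ($1$2 ~ /^(AG|GA|CT|TC)$/ ? "transition" : "transversion")}'
```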

We definitely need the reference genome sequence of the organism the VCF file refers to. However, there is no tool to my knowledge that will take a VCF file and a fasta sequence and yield as output an annotated VCF file indicating which mutations are CpG mutations (it would be nice if someone wrote a bcftools plugin for this purpose). But there are tools that will annotate a VCF file given the presence of a mutation in another VCF file. Therefore one solution is to build a VCF file with all possible CpG mutations. Here is how to do that:

# create your personal binary directory
mkdir -p ~/bin/

# download human genome reference (GRCh37)
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta &&
samtools faidx human_g1k_v37.fasta

# install latest version of bedtools
git clone https://github.com/arq5x/bedtools2.git &&
cd bedtools2 && make && cd .. &&
cp bedtools2/bin/bedtools ~/bin/

# create a VCF file with all CpG mutations (it takes several hours)
awk '{for (i=1; i<$2; i++)
  print $1"\t"i-1"\t"i+1"\t"$1" "i}' human_g1k_v37.fasta.fai |
  bedtools getfasta -name -fi human_g1k_v37.fasta \
  -bed /dev/stdin -fo /dev/stdout |
  tr '\n' ' ' | tr '>' '\n' | grep "CG $" |
  awk -v OFS="\t" 'BEGIN {print "##fileformat=VCFv4.1";
  print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"}
  {print $1"\t"$2"\t.\tC\tT\t.\t.\t.";
  print $1"\t"$2+1"\t.\tG\tA\t.\t.\t."}' |
  bgzip > cpg.vcf.gz &&
  tabix -f cpg.vcf.gz

To be completely thorough, these will be CpG mutations under the assumption that the reference sequence is the ancestral sequence, which is not always the case. But for the majority of rare mutations it is a fair assumption. Furthermore, this code does not take into account whether a CpG mutation is located within a CpG island, and this information might also be important, as not all CpG sites are equally mutable.
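
To see what the tr/grep trick in the middle of the pipeline does, here is the same transformation applied to a mock three-record getfasta output (the coordinates are made up):

```shell
# collapsing newlines and then splitting on '>' turns the fasta stream into
# one "name sequence " record per line, so CpG dinucleotides can be selected
# by anchoring grep at the end of the line
printf '>1 100\nCG\n>1 101\nGA\n>1 200\nCG\n' |
  tr '\n' ' ' | tr '>' '\n' | grep "CG $"
```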

Now that we have our VCF file with all CpG mutations, we can use SnpSift to annotate our input VCF file. I selected SnpSift because it is quite fast and flexible compared to other tools like bcftools (see here). This can be achieved as follows:

# install latest version of snpEff/SnpSift
wget http://downloads.sourceforge.net/project/snpeff/snpEff_latest_core.zip &&
unzip snpEff_latest_core.zip

# annotate your VCF file with SnpSift
java -jar snpEff/SnpSift.jar annotate -exists CPG cpg.vcf.gz input.vcf.gz | bgzip > output.vcf.gz &&
tabix -f output.vcf.gz

If you want to extract from the VCF file only the variants that are CpG mutations, the following code will work:

# install latest version of bcftools
git clone --branch=develop https://github.com/samtools/bcftools.git
git clone --branch=develop https://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/

# extract from VCF file all CpG mutations
bcftools view -Oz -i "CPG==1" output.vcf.gz -o output.cpg.vcf.gz &&
tabix -f output.cpg.vcf.gz

Tuesday, November 25, 2014

Best practice for converting VCF files to plink format

Converting VCF files to plink format has never been easier. However, there are a few issues related to some intrinsic limitations of the plink format. The first is related to the fact that variants in a plink file are bi-allelic only, while variants in a VCF file can be multi-allelic. The second is related to an intrinsic limitation of plink which makes indel definitions ambiguous. Here is an example: is the following variant an insertion or a deletion compared to the GRCh37 reference?

20 31022441 A AG

There is no way to tell, as the plink format does not record this information.
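
The ambiguity can be made concrete: once the reference/alternate roles are dropped, an insertion and a deletion at the same site reduce to the same unordered allele pair, as this toy awk check shows:

```shell
# sort the two alleles of each variant so that the REF/ALT orientation is
# forgotten, as it effectively is in a plink .bim file; the insertion and
# the deletion then collapse into a single record
printf '20 31022441 A AG\n20 31022441 AG A\n' |
  awk '{print $1, $2, ($3<$4 ? $3" "$4 : $4" "$3)}' | sort -u
```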

Keeping this in mind, we are going to need two pieces of software for the conversion, bcftools and plink2. Here is how to install them:

# create your personal binary directory
mkdir -p ~/bin/

# install latest version of bcftools
git clone --branch=develop https://github.com/samtools/bcftools.git
git clone --branch=develop https://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/

# install latest version of plink2
wget https://www.cog-genomics.org/static/bin/plink/plink_linux_x86_64.zip
unzip -od ~/bin/ plink_linux_x86_64.zip plink

We are also going to need a copy of the GRCh37 reference. If you don't have this, it will take a while to download, but it can be done with the following commands:

wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta

The following command will take the VCF file, strip the variant IDs, split multi-allelic sites into bi-allelic sites, assign names to make sure indels will not become ambiguous, and finally convert to plink format:

bcftools norm -Ou -m -any input.vcf.gz |
  bcftools norm -Ou -f human_g1k_v37.fasta |
  bcftools annotate -Ob -x ID \
    -I +'%CHROM:%POS:%REF:%ALT' |
  plink --bcf /dev/stdin \
    --keep-allele-order \
    --vcf-idspace-to _ \
    --const-fid \
    --allow-extra-chr 0 \
    --split-x b37 no-fail \
    --make-bed \
    --out output

Usually dbSNP IDs are used for variants in VCF files, but in my opinion this is really a bad practice, especially after splitting multi-allelic sites into bi-allelic sites (see here for a discussion).

The first command will split multi-allelic sites so that a variant like the following:

20 31022441 AG AGG,A

Will become two variants as follows:

20 31022441 AG AGG
20 31022441 AG A

The second command will make sure that, after having been split, indels are normalized so as to have a unique representation. The above multi-allelic variant would then become:

20 31022441 A AG
20 31022441 AG A
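
The core of this normalization is trimming bases that REF and ALT share at their right end; a minimal awk sketch of that suffix-trimming step follows (full normalization, as done by bcftools norm, also left-aligns against the reference, which is omitted here):

```shell
# trim the rightmost base while REF and ALT end with the same character and
# both are longer than one base; this is the suffix-trimming half of
# indel normalization
printf '20 31022441 AG AGG\n20 31022441 AG A\n' |
  awk '{r=$3; a=$4;
    while (length(r)>1 && length(a)>1 &&
           substr(r,length(r),1)==substr(a,length(a),1)) {
      r=substr(r,1,length(r)-1); a=substr(a,1,length(a)-1)
    };
    print $1, $2, r, a}'
```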

Notice that plink will have a very hard time distinguishing between the two variants above, as they look alike once you forget which allele is the reference allele. The third command will assign a unique name to each bi-allelic variant:

20 31022441 20:31022441:A:AG A AG
20 31022441 20:31022441:AG:A AG A

The fourth command will convert the final output to plink binary format. At this point, you will have a plink binary file with unique names for each variant. You can test that this is the case with the following command:

cut -f2 output.bim | sort | uniq -c | awk '$1>1'

If the above command returns any output, then you still have duplicate IDs, which means your VCF file was flawed to begin with. Plink will not like having multiple variants with the same ID.

You should now be able to use the file you have generated, and to merge it with other plink files generated from VCF files in the same way.

Your last worry is the individuals' sex, as the VCF format, unlike the plink format, does not encode this information. If your VCF file contains enough information about the X chromosome, you should be able to assign sex straight from the genotypes. This is a command that might do the trick:

plink --bfile output --impute-sex .3 .3 --make-bed --out output && rm output.{bed,bim,fam}~

However, the two thresholds you should use (.3 .3 in the above example) to separate males from females are going to be dependent on the particular type of data you are using (e.g. exome or whole genome) and might have to be selected ad hoc. See here for additional information.
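
The two numbers are thresholds on the X-chromosome inbreeding coefficient F computed by plink: individuals with F below the first threshold are called female, and those with F above the second are called male. A toy awk version of that decision rule (with illustrative thresholds of 0.2 and 0.8 and made-up sample IDs and F values) looks like this:

```shell
# classify individuals by the F coefficient in the second column: below the
# female ceiling -> female, above the male floor -> male, in between ->
# undetermined (IDs and F values here are made up)
printf 'id1 0.02\nid2 0.99\nid3 0.45\n' |
  awk -v fmax=0.2 -v mmin=0.8 \
    '{print $1, ($2 < fmax ? "female" : ($2 > mmin ? "male" : "unknown"))}'
```
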
A python wrapper to perform the above operations is available online (here).

Tuesday, May 13, 2014

How do I identify all the homopolymer run indels in a VCF file?

With sequencing projects, we analysts are usually left to deal with genotype data in VCF format. These files can contain genotypes for a variety of markers, from SNPs to structural variants. Small structural variants, commonly known as indels, are usually harder to call from sequencing data, and we tend to be wary of their genotypes. However, not all indels are born the same: indels come in a variety of kinds, and depending on your project, you might be more or less interested in some of them. A kind of indel that is particularly difficult to call is the homopolymer run indel, that is, a deletion or insertion within a repeat of the same nucleotide. A very famous homopolymer run indel is the one causing medullary cystic kidney disease type 1 by disrupting the coding sequence within a VNTR of MUC1.

It might be important to identify which indels in your VCF file are homopolymer run indels, but this is not so straightforward. Also, this information cannot be extracted from the VCF file alone without the reference, so information from two files needs to be integrated. I thought of a little hack to do this hastily, without writing any serious amount of code, using bedtools only. The following bash script should do it:

input="$1" # input VCF file in any format accepted by bcftools
ref="$2" # reference genome in fasta format 
bcftools view -HG $input | cut -f1-5 | awk '$4!~"^[ACGT]($|,)" || $5!~"^[ACGT]($|,)"' |
  awk '{print $1"\t"$2+length($4)-1"\t"$2+length($4)+5"\t"$3}' |
  bedtools getfasta -name -fi $ref -bed /dev/stdin -fo /dev/stdout |
  awk 'NR%2==1 {printf substr($0,2)"\t"} NR%2==0' |
  grep "AAAAAA$\|CCCCCC$\|GGGGGG$\|TTTTTT$" | cut -f1 | sort

The script will take the VCF file and the reference as input and will yield the names of all markers identified as homopolymer run indels. Notice that it requires each marker in your VCF file to have its own unique ID. Also, the definition of homopolymer run indel used here is arbitrary, consisting of a repeat of six base pairs. It should be easy to change the code if you want to tweak that.
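
The final grep is where the six-base-pair definition lives; applied to mock name/sequence pairs (the IDs and sequences below are made up) it behaves like this:

```shell
# keep only markers whose downstream 6bp context is a single-nucleotide run;
# changing the length of the repeats in the pattern changes the definition
printf 'ins1\tAAAAAA\nins2\tACGTAC\nins3\tTTTTTT\n' |
  grep "AAAAAA$\|CCCCCC$\|GGGGGG$\|TTTTTT$" | cut -f1 | sort
```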

Monday, December 16, 2013

Estimating global ancestral components with 1000 Genomes

Following the last two posts, here I will show you how to use the PCA results after merging your dataset with the 1000 Genomes project dataset to estimate global ancestral components for the four main ancestries around the world, that is, European, African, Native American, and Asian.

Since there are no Native Americans in the 1000 Genomes project, we need the global ancestry estimates for the admixed Latino populations in the project in order to estimate this component. These estimates are available as part of the project, so we just need to download them:


mkdir -p $anc

# download global proportions table
(for pop in ASW CLM MXL PUR; do
  wget -O- ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/ancestry_deconvolution/${pop}_phase1_globalproportions_and_unknown.txt
done) > $anc/gap

The 1000 Genomes project individuals can then be used to train a linear model that estimates global ancestral components. If you followed the directory structure for the PCA as discussed in the previous post, then the following R script will take the three tables kgp/$set.kgp.pop, kgp/$set.kgp.pca, and $anc/gap as input and will output a fourth table, say kgp/$set.anc, with the global ancestry estimates.

args <- commandArgs(trailingOnly = TRUE)
# load and merge input tables
df1 <- read.table(args[1],header=TRUE)
df2 <- read.table(args[2],header=TRUE)
df3 <- read.table(args[3],header=TRUE)
data <- merge(df1,df2,all=TRUE,incomparables=NaN)
data <- merge(data,df3,all=TRUE,incomparables=NaN)

data$ASN <- NaN

# European populations
eur <- data$POP=="CEU" | data$POP=="FIN" | data$POP=="GBR" | data$POP=="IBS" | data$POP=="TSI"
data[which(eur),"EUR"] <- 1; data[which(eur),"AFR"] <- 0; data[which(eur),"NAT"] <- 0; data[which(eur),"ASN"] <- 0

# African populations
afr <- data$POP=="YRI"
data[which(afr),"EUR"] <- 0; data[which(afr),"AFR"] <- 1; data[which(afr),"NAT"] <- 0; data[which(afr),"ASN"] <- 0

# Asian populations
asn <- data$POP=="CDX" | data$POP=="CHB" | data$POP=="CHD" | data$POP=="CHS" | data$POP=="JPT"
data[which(asn),"EUR"] <- 0; data[which(asn),"AFR"] <- 0; data[which(asn),"NAT"] <- 0; data[which(asn),"ASN"] <- 1

# Admixed populations
mix <- data$POP=="ASW" | data$POP=="CLM" | data$POP=="MXL" | data$POP=="PUR"
for (anc in c("EUR","AFR","NAT")) {
  data[which(mix),anc] <- data[which(mix),anc] * 1/(1-data[which(mix),"UNK"])
}
data[which(mix),"ASN"] <- 0

# African American samples to be excluded
exclude = c("NA20299","NA20314","NA20414","NA19625","NA19921")
for (anc in c("EUR","AFR","NAT","ASN")) {
  data[is.element(data$IID,exclude),anc] <- NaN
}

# predict global ancestral proportions
for (anc in c("EUR","AFR","NAT","ASN")) {
  tmp <- lm(as.formula(paste0(anc," ~ PCA1 + PCA2 + PCA3")), data)
  data[is.na(data[,anc]),anc] = predict.lm(tmp,data[is.na(data[,anc]),])
}

# output data to disk
write.table(data, args[4], row.names=FALSE, quote=FALSE)

If you look into the code, you will see that some African American samples are excluded from the computation. This is due to the fact that a few African American individuals in the 1000 Genomes project have a significant amount of Native American ancestry, but this does not show up in the global ancestry proportion tables as it was not modeled. It is therefore better to exclude those individuals from these computations.

Also, you might notice that, due to the inherent noise in the way the global ancestral components are modeled, some of the estimated components can turn out negative. This is to be expected, as these are noisy estimates. Make sure any further downstream analysis takes care of this eventuality. Overall, estimating ancestral components from principal components is a noisier business than estimating them from local ancestry deconvolutions. However, for many purposes these estimates should work just fine.
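
One simple way to handle negative estimates downstream is to clamp them at zero and renormalize each individual's components so they sum to one; here is an awk sketch of that (the input row, in IID EUR AFR NAT ASN order, is made up):

```shell
# clamp negative ancestry components to zero, then rescale the remaining
# components so that they sum to one
printf 'id1 0.98 -0.02 0.03 0.01\n' |
  awk '{s=0
    for (i=2; i<=NF; i++) {if ($i<0) $i=0; s+=$i}
    for (i=2; i<=NF; i++) $i/=s
    print}'
```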

Friday, December 13, 2013

PCA of your plink dataset with 1000 Genomes

Estimating global ancestral components of your samples should really not be harder than running a principal component analysis (PCA). In this post, I will show you how you can do exactly that, using the 1000 Genomes project samples. You will first need your dataset and the 1000 Genomes dataset in plink format, as explained in the previous post.

As usual, you will need a few variables defined:

It is a good idea to make sure that your dataset is properly cleaned and pruned before merging it with the 1000 Genomes project data. It is also important that SNPs be coded on the forward strand, as this is how the 1000 Genomes project data is coded. The following code will identify SNPs that are shared and compatible between the datasets, merge the individuals over these SNPs, and compute the genetic relationship matrix necessary for the PCA analysis:

mkdir -p kgp

# identify SNPs compatible with the merge
awk 'NR==FNR {chr[$2]=$1; pos[$2]=$4; a1[$2]=$5; a2[$2]=$6} NR>FNR && ($2 in chr) && chr[$2]==$1 && pos[$2]==$4 && (a1[$2]==$5 && a2[$2]==$6 || a1[$2]==$6 && a2[$2]==$5) {print $2}' $set.bim <(cat $(seq -f $pth/kgp.chr%g.bim 1 22) $pth/kgp.chrX.bim) > kgp/extract.txt

# extract SNPs from the 1000 Genomes dataset
for chr in {1..22} X; do plink --noweb --bfile $pth/kgp.chr$chr --extract kgp/extract.txt --make-bed --out kgp/kgp.chr$chr; done
/bin/cp kgp/kgp.chr1.fam kgp/kgp.fam
for chr in {1..22} X; do cat kgp/kgp.chr$chr.bim; done > kgp/kgp.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp/kgp.chr$chr.bed; done) > kgp/kgp.bed
for chr in {1..22} X; do /bin/rm kgp/kgp.chr$chr.bed kgp/kgp.chr$chr.bim kgp/kgp.chr$chr.fam kgp/kgp.chr$chr.log; done

# merge two datasets
plink --noweb --bfile $set --bmerge kgp/kgp.bed kgp/kgp.bim kgp/kgp.fam --make-bed --out kgp/$set.kgp

# create population categorical data
awk 'BEGIN {print "FID","IID","POP"} NR==FNR {pop[$2]=$7} !($2 in pop) {pop[$2]="SET"} NR>FNR {print $1,$2,pop[$2]}' $pth/20130606_g1k.ped kgp/$set.kgp.fam > kgp/$set.kgp.pop

# use GCTA to compute the genetic relationship matrix
gcta64 --bfile kgp/$set.kgp --autosome --make-grm-bin --out kgp/$set.kgp --thread-num 10

At this point we are ready to compute principal components.

# compute principal components using GCTA
gcta64 --grm-bin kgp/$set.kgp --pca 20 --out kgp/$set.kgp --thread-num 10

# create PCA matrix for use with R data.frame
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp/$set.kgp.eigenval | tr '\n' ' '; echo; cat kgp/$set.kgp.eigenvec) |
    awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
    NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp/$set.kgp.pca

The previous code should have created two files, kgp/$set.kgp.pop and kgp/$set.kgp.pca, which can be easily loaded into R with the following script, which takes the two files as parameters:

library(ggplot2)
args <- commandArgs(trailingOnly = TRUE)
df1 <- read.table(args[1], header = TRUE)
df2 <- read.table(args[2], header = TRUE)
data <- merge(df1,df2)
ggplot(data,aes(x=PCA1,y=PCA2,colour=POP,shape=POP)) + geom_point(alpha=1/3) + scale_colour_discrete() + scale_shape_manual(values=1:length(levels(data[,"POP"])))

Notice that you will need R and ggplot2 already installed to run this script.

Thursday, December 12, 2013

1000 Genomes PCA analysis

The easiest way to run a PCA analysis with the 1000 Genomes samples is to download the data, convert it to plink format, and use GCTA to perform the bulk of the computation. The dataset provided on the beagle website is likely the easiest to start with. Here is some code that can get the work done.

First we need to setup a few variables:

And then we can download all necessary files:

# install plink and utilities to convert to plink format
sudo apt-get install plink
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tfam.sh
wget http://broadinstitute.org/~giulio/vcf2plink/vcf2tped.sh
chmod a+x vcf2tfam.sh vcf2tped.sh
# download the 1000 Genomes files
for chr in {1..22} X; do wget $pfx$chr$sfx $pfx$chr$sfx.tbi -P $vcf/; done
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

# install GCTA
wget http://www.complextraitgenomics.com/software/gcta/gcta_1.22.zip
mkdir -p $gct
unzip gcta_1.22.zip gcta64 -d $gct/

For each chromosome we now need to create a plink binary file:

# convert each VCF file to a plink binary file with sex information included
for chr in {1..22} X; do
  tabix -H $pfx$chr$sfx | ./vcf2tfam.sh /dev/stdin > kgp.chr$chr.tfam
  tabix $pfx$chr$sfx $chr | ./vcf2tped.sh /dev/stdin > kgp.chr$chr.tped
  p-link --noweb --tfile kgp.chr$chr --make-bed --out kgp.chr$chr
  /bin/rm kgp.chr$chr.tfam kgp.chr$chr.tped kgp.chr$chr.nosex kgp.chr$chr.log
  awk 'NR==FNR {sex[$2]=$5} NR>FNR {$5=sex[$2]; print}' 20130606_g1k.ped kgp.chr$chr.fam > kgp.chr$chr.fam.tmp && /bin/mv kgp.chr$chr.fam.tmp kgp.chr$chr.fam
done

At this point we can use plink to prune each chromosome and then join the individual datasets into a single large dataset as discussed in a previous post:

# prune each chromosome
for chr in {1..22} X; do
  p-link --noweb --bfile kgp.chr$chr --maf .01 --indep 50 5 2 --out kgp.chr$chr
  p-link --noweb --bfile kgp.chr$chr --extract kgp.chr$chr.prune.in --make-bed --out kgp.chr$chr.prune
done

# join pruned datasets into single dataset
cat kgp.chr1.prune.fam > kgp.prune.fam
for chr in {1..22} X; do cat kgp.chr$chr.prune.bim; done > kgp.prune.bim
(echo -en "\x6C\x1B\x01"; for chr in {1..22} X; do tail -c +4 kgp.chr$chr.prune.bed; done) > kgp.prune.bed
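
The echo/tail trick above relies on the fact that a plink .bed file is just two magic bytes (0x6c 0x1b) and a mode byte (0x01 for SNP-major order) followed by the packed genotypes, so per-chromosome files can be concatenated by stripping the first three bytes of each and prepending a single header. A self-contained demonstration with two fake .bed files holding one payload byte each:

```shell
# build two fake .bed files: the three header bytes (octal for 0x6c 0x1b
# 0x01) plus one payload byte each (0xaa and 0xbb)
printf '\154\033\001\252' > chrA.bed
printf '\154\033\001\273' > chrB.bed

# prepend one header, then append each file minus its first 3 bytes
(printf '\154\033\001'; for f in chrA.bed chrB.bed; do tail -c +4 $f; done) > joined.bed

# inspect the result: one header followed by both payload bytes
od -An -tx1 joined.bed
rm chrA.bed chrB.bed joined.bed
```

This only yields a valid dataset when every per-chromosome file lists the same samples in the same order and is in SNP-major mode, which is the case here since all files derive from the same individuals.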

We can now run GCTA to compute the genetic relationship matrix and then the main principal components:

# perform PCA analysis of the whole dataset
$gct/gcta64 --bfile kgp.prune --autosome --make-grm-bin --out kgp --thread-num 10
$gct/gcta64 --grm-bin kgp --pca 20 --out kgp --thread-num 10

# create PCA matrix to be loaded into R
(echo "FID IID "`seq -f PCA%.0f -s" " 1 20`;
(echo -n "#eigvals: "; head -n20 kgp.eigenval | tr '\n' ' '; echo; cat kgp.eigenvec) |
  awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
  NR>1 {NF=2+n; for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}') > kgp.pca
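
The awk step above rescales each eigenvector column by the square root of its eigenvalue, turning raw eigenvectors into principal component scores. On a tiny made-up input with two eigenvalues it does the following:

```shell
# the first line carries the eigenvalues; every following line has FID, IID
# and one coordinate per eigenvector, which gets multiplied by
# sqrt(eigenvalue)
(echo "#eigvals: 4 9"; echo "fam1 id1 0.5 0.1") |
  awk 'NR==1 {n=NF-1; for (i=1; i<=n; i++) eval[i]=$(i+1)}
       NR>1 {for (i=1; i<=n; i++) $(i+2)*=sqrt(eval[i]); print}'
```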

This will generate a table that can be easily loaded into R for further analyses.

Monday, November 18, 2013

IBD pipeline

There are a lot of interesting analyses for which segments inherited identical-by-descent (IBD) can be informative. However, it takes some steps to set up a fully functional IBD pipeline. Here is a basic one that will work with any dataset in binary plink format. It will first phase the data with hapi-ur, then detect IBD segments using beagle, and finally detect clusters of IBD individuals using dash.

To begin with, you need to setup a few variables:

And have a few programs pre-installed on your Linux machine:

# install plink
sudo apt-get install plink

# install hapi-ur
wget https://hapi-ur.googlecode.com/files/hapi-ur-1.01.tgz
tar xzvf hapi-ur-1.01.tgz
mkdir -p $hpr
/bin/cp hapi-ur-1.01/hapi-ur hapi-ur-1.01/insert-map.pl $hpr/

# install genetic maps
wget http://mathgen.stats.ox.ac.uk/impute/genetic_maps_b37.tgz
tar xzvf genetic_maps_b37.tgz
mkdir -p $gmp
for i in {1..22} X_PAR1 X_nonPAR X_PAR2; do gzip -c genetic_maps_b37/genetic_map_chr${i}_combined_b37.txt > $gmp/chr$i.gmap.gz; done

# install beagle
mkdir -p $bgl
wget http://faculty.washington.edu/browning/beagle/b4.r1196.jar -P $bgl/

# install dash
wget http://www1.cs.columbia.edu/~gusev/dash/dash-1-1.tar.gz
tar xzvf dash-1-1.tar.gz
mkdir -p $dsh
/bin/cp dash-1-1/bin/dash_cc $dsh/

The following pipeline will run the analysis independently on each autosome:

# perform IBD and DASH analysis separately for each autosome
for i in {1..22}; do
  # create a directory
  mkdir -p chr$i

  # extract autosome of interest
  p-link --noweb --bfile $set --chr $i --make-bed --out chr$i/$set

  # generate marker file with genetic distance
  zcat $gmp/chr$i.gmap.gz | $hpr/insert-map.pl chr$i/$set.bim - | awk -v OFS="\t" \
    'length($5)>1 || $6=="-" {$5="I"; $6="D"} $5=="-" || length($6)>1 {$5="D"; $6="I"} {print $1,$2,$3,$4,$5,$6}' > chr$i/$set.gmap.bim

  # clean up just in case
  /bin/rm -f chr$i/$set.sample chr$i/$set.haps

  # phase haplotypes using hapi-ur
  $hpr/hapi-ur -g chr$i/$set.bed -s chr$i/$set.gmap.bim -i chr$i/$set.fam -w 73 --impute -o chr$i/$set

  # convert hapi-ur output for use with beagle
  (echo "##fileformat=VCFv4.1";
  echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">";
  echo -ne "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t";
  awk 'NR>2 {printf $1"."$2"\t"}' chr$i/$set.sample | sed 's/\t$/\n/g';
  awk -v c=$i '{printf c"\t"$3"\t"$2"\t"$4"\t"$5"\t.\tPASS\t.\tGT";
    for (i=6; i<NF; i+=2) printf "\t"$i"|"$(i+1); printf "\n"}' chr$i/$set.haps) |
    java -Xmx3g -jar $bgl/b4.r1196.jar ibd=true gt=/dev/stdin usephase=true burnin-its=0 phase-its=0 out=chr$i/$set

  # convert beagle output for use with DASH
  awk '{sub("\\\."," ",$1); sub("\\\."," ",$3); print $1"."$2-1,$3"."$4-1,$6,$7}' chr$i/$set.hbd chr$i/$set.ibd |
    $dsh/dash_cc chr$i/$set.fam chr$i/$set

  # create map file with IBD clusters
  cut -f1-5 chr$i/$set.clst | awk -v c=$i '{ print c,c"_"$1,$2,$3,$4,$5 }' > chr$i/$set.cmap

  # create map file with IBD clusters for use with plink
  awk -v c=$i '{ print c,$2,0,int(($3+$4)/2) }' chr$i/$set.cmap > chr$i/$set.map

  # convert output to plink binary format
  p-link --maf 0.001 --file chr$i/$set --make-bed --recode --out chr$i/$set
done

You can then join the results into a single plink binary file using the code discussed in a previous post:

# join separate autosome outputs into a single plink binary file
for i in {1..22}; do cat chr$i/$set.cmap; done > $set.dash.cmap
cp $set.fam $set.dash.fam
for i in {1..22}; do cat chr$i/$set.bim; done > $set.dash.bim
(echo -en "\x6C\x1B\x01"; for i in {1..22}; do tail -c +4 chr$i/$set.bed; done) > $set.dash.bed