20 31022441 A AG
There is no way to tell, as the plink format does not record this information.
Keeping this in mind, we are going to need two pieces of software for the conversion, bcftools and plink2. Here is how to install them:
# install latest version of bcftools
git clone --branch=develop https://github.com/samtools/bcftools.git
git clone --branch=develop https://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
mv bcftools/bcftools ~/bin/

# install latest version of plink2
ver=$(wget -O- https://www.cog-genomics.org/plink2/ | grep plink_linux_x86_64.zip | cut -d/ -f4)
wget -qO plink_linux_x86_64.zip https://www.cog-genomics.org/static/bin/$ver/plink_linux_x86_64.zip
unzip -o plink_linux_x86_64.zip plink
mv plink ~/bin/
We are also going to need a copy of the GRCh37 reference. If you don't have this, it will take a while to download, but it can be done with the following commands:
wget -O- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz |
  gunzip > human_g1k_v37.fasta
samtools faidx human_g1k_v37.fasta
The following command will take the VCF file, split multi-allelic sites into bi-allelic sites, normalize indels against the reference, replace the variant IDs with unique names so that indels will not become ambiguous, and finally convert to plink format:
bcftools norm -Ou -m -any input.vcf.gz |
  bcftools norm -Ou -f human_g1k_v37.fasta |
  bcftools annotate -Ob -x ID \
    -I +'chr%CHROM\_%POS\_b37_%REF\_%ALT' |
  plink --bcf /dev/stdin \
    --keep-allele-order \
    --double-id \
    --allow-extra-chr \
    --split-x b37 no-fail \
    --make-bed \
    --out output
The first command will split multi-allelic sites so that a variant like the following:
20 31022441 AG AGG,A
Will become two variants as follows:
20 31022441 AG AGG
20 31022441 AG A
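To make the splitting step concrete, here is a minimal awk sketch that mimics it on a bare four-column site listing (chromosome, position, REF, comma-separated ALT). The function name split_multiallelic and the four-column input layout are assumptions made for illustration. The real bcftools norm -m -any also rewrites genotypes and INFO/FORMAT fields, which this toy does not attempt.

```shell
# Toy illustration only: emit one line per ALT allele.
# Assumed input columns: CHROM POS REF ALT[,ALT...]
split_multiallelic() {
  awk '{
    n = split($4, alts, ",")
    for (i = 1; i <= n; i++) print $1, $2, $3, alts[i]
  }'
}

echo "20 31022441 AG AGG,A" | split_multiallelic
```

Sites that are already bi-allelic pass through unchanged.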
The second command will make sure that, after having been split, indels are normalized so as to have a unique representation. The above multi-allelic variant would then become:
20 31022441 A AG
20 31022441 AG A
Notice that plink will have a very hard time distinguishing between the two variants above, as they look alike once you forget which allele is the reference allele. The third command will assign a unique name to each bi-allelic variant:
20 31022441 chr20_31022441_b37_A_AG A AG
20 31022441 chr20_31022441_b37_AG_A AG A
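The ambiguity can be seen with a small sketch. Assuming a four-column listing (chromosome, position, REF, ALT), the hypothetical helper unordered_key below builds the key a tool would see if it forgot which allele is the reference, while ordered_id rebuilds the name used above. Both are illustrations in awk, not what plink or bcftools do internally.

```shell
# Key that ignores allele order: the two alleles are sorted alphabetically.
unordered_key() {
  awk '{
    a = $3; b = $4
    if (a > b) { t = a; a = b; b = t }
    print $1 ":" $2 ":" a ":" b
  }'
}

# ID that keeps REF before ALT, mirroring the naming scheme above.
ordered_id() {
  awk '{ print "chr" $1 "_" $2 "_b37_" $3 "_" $4 }'
}

# The two normalized variants collapse to a single unordered key...
printf '20 31022441 A AG\n20 31022441 AG A\n' | unordered_key | sort -u
# ...but keep two distinct IDs once the REF/ALT order is part of the name.
printf '20 31022441 A AG\n20 31022441 AG A\n' | ordered_id | sort -u
```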
The fourth command will convert the final output to plink binary format. At this point, you will have a plink binary file with unique names for each variant. You can test that this is the case with the following command:
cut -f2 output.bim | sort | uniq -c | awk '$1>1'
If the above command returns any output, you still have duplicate IDs, which means your VCF file must have been flawed to begin with. Plink will not like having multiple variants with the same ID.
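To see what the check reports when something is wrong, here is a small sketch that wraps the same pipeline in a function and runs it on a made-up .bim file. The function name dup_ids and the toy file contents are assumptions for illustration. The sketch relies on the .bim file being tab-delimited with the variant ID in the second column.

```shell
# Print "count ID" for every variant ID that occurs more than once in a .bim file.
dup_ids() {
  cut -f2 "$1" | sort | uniq -c | awk '$1 > 1 { print $1, $2 }'
}

# Toy tab-delimited .bim (chrom, ID, cM, pos, allele 1, allele 2) with one ID repeated.
toy=$(mktemp)
printf '20\tchr20_100_b37_A_G\t0\t100\tG\tA\n' >  "$toy"
printf '20\tchr20_200_b37_C_T\t0\t200\tT\tC\n' >> "$toy"
printf '20\tchr20_200_b37_C_T\t0\t200\tT\tC\n' >> "$toy"

dup_ids "$toy"    # reports chr20_200_b37_C_T with a count of 2
rm -f "$toy"
```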
You should now be able to use the file you have generated and to merge it with other plink files generated from VCF files in the same way.
Your last worry is the individuals' sex, as the VCF format, unlike the plink format, does not encode this information. If your VCF file contains enough information about the X chromosome, you should be able to assign the sex directly from the genotypes. This is a command that might do the trick:
plink --bfile output --impute-sex .3 .4 --make-bed --out output.with.sex
However, the two thresholds you should use (.3 .4 in the above example) to separate males from females depend on the particular type of data you are using (e.g. exome or whole genome) and might have to be selected ad hoc. See here for additional information.
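One way to explore candidate thresholds is to look at the X-chromosome F estimates that plink reports and see how a given pair of cutoffs would split the samples. The sketch below is a hedged illustration: the function name classify_by_f is hypothetical, and it assumes a whitespace-delimited table with a header line, the sample ID in the second column and F in the last column, which is how I recall the plink 1.9 .sexcheck report being laid out. Check this against your own file before relying on it.

```shell
# Classify samples by F: below the first cutoff is called female,
# above the second is called male, anything in between is left ambiguous.
# Usage: classify_by_f FEMALE_MAX_F MALE_MIN_F < table
classify_by_f() {
  awk -v fmax="$1" -v mmin="$2" 'NR > 1 {
    f = $NF + 0
    if (f < fmax)      call = "female"
    else if (f > mmin) call = "male"
    else               call = "ambiguous"
    print $2, $NF, call
  }'
}

# Made-up values for three samples, using the .3 .4 cutoffs from the command above.
printf '%s\n' \
  'FID IID PEDSEX SNPSEX STATUS F' \
  's1 s1 0 2 PROBLEM 0.05' \
  's2 s2 0 1 PROBLEM 0.98' \
  's3 s3 0 0 PROBLEM 0.35' | classify_by_f .3 .4
```

Running the function with a few different cutoff pairs and counting the ambiguous calls gives a quick feel for where the two clusters separate in a particular dataset.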