Sunday, April 8, 2018

1000 Genomes Phase 1 data has some SNPs that Phase 3 doesn't

Yesterday I noticed that the 1000 Genomes Phase 3 data set is missing some common SNPs. Today I checked whether the SNP I was looking for is in the Phase 1 data set, and it is. The Phase 1 VCF lists it with the same high quality as the other SNPs, the ones I could find in the Phase 3 VCF, so I'm not sure why it was dropped. It's almost certainly real: I've been working with a different set of sequences in which many reads contain the mutation.

The proportion of non-co-occurrence calculated from the Phase 1 VCF is an order of magnitude higher than that from the Phase 3 VCF, whether or not the lost-and-found SNP is included in the Phase 1 calculation. This suggests that there is a bit more error in the Phase 1 reads than in the Phase 3 reads.
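For anyone who wants to repeat the presence check, here is a minimal sketch of how it could be done. It is not the exact script I used. It assumes a standard tab-delimited VCF with the usual fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), and the names `open_vcf` and `find_site` are made up for this example. The file paths, chromosome, and position are yours to fill in.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def find_site(lines, chrom, pos):
    """Return the fixed fields of the first record at chrom:pos, or None.

    `lines` is any iterable of VCF text lines, such as an open file handle.
    This is a linear scan, so it is slow on a whole-chromosome file.
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        # Split off only the eight fixed columns; genotypes stay unsplit.
        fields = line.rstrip("\n").split("\t", 8)
        if len(fields) < 7:
            continue  # not a well-formed record
        if fields[0] == chrom and int(fields[1]) == pos:
            return {
                "id": fields[2],
                "ref": fields[3],
                "alt": fields[4],
                "qual": fields[5],
                "filter": fields[6],
            }
    return None


def compare_phases(phase1_path, phase3_path, chrom, pos):
    """Look the same site up in two VCFs and return both results."""
    results = {}
    for label, path in (("phase1", phase1_path), ("phase3", phase3_path)):
        with open_vcf(path) as handle:
            results[label] = find_site(handle, chrom, pos)
    return results
```

Calling `compare_phases` with the two VCF paths and the site of interest returns a dictionary in which a missing SNP shows up as `None`, and the `qual` and `filter` values of a site that is present can be compared by eye against its neighbors. For the full-size files an indexed lookup would be much faster than this scan.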
