Friday, March 30, 2018

Use VCF files to get SNP information from large data sets

Recently I needed to find information on the co-occurrence of some single-nucleotide polymorphisms (SNPs) in a human gene. For a small data set covering only the gene of interest, manipulating the information in SAM format worked well. But then I wanted to look at the data generated by the 1000 Genomes project, which comprises thousands of full genomes: a ton of data. My scripts probably wouldn't be able to handle files that large, and even extracting the relevant SNPs from a single genome with my current tools could take ages. Fortunately, the 1000 Genomes project also offers VCF files, which catalog variants one site per line, with each individual's genotype at that site right there on the same line. That's extremely convenient. The rsID is the third tab-delimited field, so filtering for the rsIDs I needed was pretty easy with a bit of PowerShell.
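
To make the filtering step concrete, here is a minimal sketch of the same idea in Python. It is not the PowerShell I actually ran. The function name, the sample lines, and the rsIDs are all made up for illustration. It keeps the header lines and any data line whose third tab-delimited field contains one of the wanted rsIDs, treating that field as a semicolon-separated list since a site can carry more than one identifier.

```python
import io


def filter_vcf_by_rsid(lines, wanted_ids):
    """Yield header lines plus data lines whose ID field (column 3) matches.

    `lines` is any iterable of text lines, such as an open file handle.
    """
    wanted = set(wanted_ids)
    for line in lines:
        # Meta-information and column-header lines start with '#'; keep them
        # so the output is still readable as a VCF file.
        if line.startswith("#"):
            yield line
            continue
        # Split off only the first three fields; the rest of the line
        # (genotypes for every individual) can be very long.
        fields = line.split("\t", 3)
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # The ID field may hold several identifiers separated by ';'.
        if wanted.intersection(fields[2].split(";")):
            yield line


# Tiny made-up example: positions, rsIDs, and genotypes are not real data.
SAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B\n"
    "1\t1000\trs111\tA\tG\t100\tPASS\t.\tGT\t0|1\t1|1\n"
    "1\t2000\trs222\tC\tT\t100\tPASS\t.\tGT\t0|0\t0|1\n"
    "1\t3000\trs333;rs444\tG\tA\t100\tPASS\t.\tGT\t1|0\t0|0\n"
)

hits = list(filter_vcf_by_rsid(io.StringIO(SAMPLE_VCF), {"rs111", "rs444"}))
data = [line for line in hits if not line.startswith("#")]

for line in data:
    print(line.rstrip("\n"))
```

Because the function reads one line at a time, it never has to hold the whole file in memory. For a real, gzip-compressed VCF file, passing a handle from `gzip.open(path, "rt")` instead of the `io.StringIO` sample should work the same way.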
