User manual

The first example lists all the commands required for genotyping with FastGT package, step-by-step. A single male individual from Illumina Platinum dataset (NA12877) and our standard k-mer list with 30 million SNVs is used in this example.

Preparations

Before you can start genotyping, you need to a) download FastGT binaries, b) download k-mer database and c) have some sequencing data in FASTQ-format file. Make sure you have enough space for storing these files. The FASTQ file that is used in this example has ca 50x coverage and is rather large - ca 411 GB unpacked. K-mer database is ca 5 GB. Text files with results are both 1.3 GB.

1. Download FastGT binaries, unpack.
wget http://bioinfo.ut.ee/FastGT/downloads/fastgt_binaries_1.0.tar.gz
tar zxvf fastgt_binaries_1.0.tar.gz
cd fastgt_1.0/


2. Download database of k-mers
wget http://bioinfo.ut.ee/FastGT/downloads/kmer_list_WG30238282.db

3. Download FASTQ files with sequencing reads of the individual NA12877 (coded as ERR194146) and unpack. You can use other FASTQ files if you already have them on your server.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_2.fastq.gz
gunzip ERR194146*.fastq


Example #1: Standard genotyping procedure

Determine genotypes of ERR194146:
./gmer_counter -db kmer_list_WG30238282.db ERR194146*.fastq > ERR194146.counts
./gmer_caller ERR194146.counts > ERR194146.calls

This should give you a text file with genotype calls.

First 10 lines of output should look like this:
head ERR194146.calls
1:657698:rs565995692:C/T	AA	1.00	40	0
1:658426:rs188842781:G/T	AA	1.00	32	0
1:668346:rs115048193:G/A	AA	1.00	44	0
1:668374:rs138476838:G/A	AA	1.00	50	0
1:676127:rs150820983:C/T	AA	1.00	43	0
1:693588:rs574459339:G/A	AA	1.00	45	0
1:693747:rs532427839:A/G	AA	1.00	50	0
1:697919:rs539728205:T/C	AA	1.00	46	0
1:705475:rs564206378:G/A	AA	1.00	39	1
1:706645:rs148085246:A/C	AA	1.00	43	0
In the first column is marker name, which in our k-mer database consists of the following fields chr:position:rs_number:reference_allele/alternative_allele
Second column shows predicted genotype. 'A' means reference allele and 'B' means alternative allele
Third column shows likelihood of the genotype
Fourth and fifth columns show frequencies of k-mers corresponding to reference allele (A) and to alternative allele (B) in the FASTQ file

Let's see how the lines with alternative (non-reference) allele look like. These are denoted as AB.
grep AB ERR194146.calls | head
1:752721:rs3131972:A/G	AB	1.00	21	19
1:757936:rs4951862:C/A	AB	1.00	17	9
1:772755:rs2905039:A/C	AB	1.00	19	20
1:776546:rs12124819:A/G	AB	1.00	18	18
1:777122:rs2980319:A/T	AB	1.00	21	18
1:778745:rs1055606:A/G	AB	1.00	26	12
1:783318:rs6686696:A/G	AB	1.00	25	22
1:785050:rs2905062:G/A	AB	1.00	24	15
1:787399:rs2905055:G/T	AB	1.00	13	15
1:790465:rs61768207:G/A	AB	1.00	22	15
Grep helps to reveal lines with homozygous non-reference allele also. These genotypes are denoted as BB.
grep BB ERR194146.calls | head
1:792480:rs2905036:C/T	BB	1.00	0	42
1:817234:rs4970387:A/T	BB	1.00	0	49
1:819917:rs6605057:A/G	BB	1.00	0	48
1:829637:rs4437820:C/T	BB	1.00	0	38
1:847250:rs7416129:G/A	BB	1.00	0	43
1:857728:rs6689107:T/G	BB	0.64	0	27
1:862866:rs3892970:C/T	BB	1.00	0	41
1:863124:rs4040604:G/T	BB	1.00	0	39
1:864757:rs2340588:G/A	BB	0.77	0	28
1:866319:rs9988021:G/A	BB	1.00	0	38
There are always some genotypes that cannot be reliably called as AA, AB or BB. These lines are denoted as NC (no call).
grep NC ERR194146.calls | head
1:828090:rs572227970:G/A	NC		68	0
1:839778:rs558497420:C/T	NC		21	0
1:839797:rs544279786:G/A	NC		19	0
1:839818:rs554395906:T/G	NC		19	0
1:840098:rs548643931:C/G	NC		21	0
1:840696:rs377564551:C/T	NC		20	0
1:843912:rs551374321:A/G	NC		0	8
1:845311:rs145001858:C/T	NC		21	0
1:859206:rs188534988:G/A	NC		19	0
1:859240:rs576931724:G/A	NC		21	0
Usually the no-call situation is caused by too high (first line) or too low frequency of k-mers.

Example #2: Alternative outputs

If you want to see the genotype prediction of ML model for these lines as well, re-run gmer_caller with the argument '--non_canonical'.
./gmer_caller --non_canonical ERR194146.counts > ERR194146.all_calls
1:828090:rs572227970:G/A	AAA	0.53	68	0
1:839778:rs558497420:C/T	A	0.54	21	0
1:839797:rs544279786:G/A	A	0.81	19	0
1:839818:rs554395906:T/G	A	0.81	19	0
1:840098:rs548643931:C/G	A	0.54	21	0
1:840696:rs377564551:C/T	A	0.69	20	0
1:843912:rs551374321:A/G	B	1.00	0	8
1:845311:rs145001858:C/T	A	0.54	21	0
1:859206:rs188534988:G/A	A	0.81	19	0
1:859240:rs576931724:G/A	A	0.54	21	0
The likelihoods of ALL considered genotypes can also be shown. For this, re-run gmer_caller with the argument '--alternatives'.
./gmer_caller --alternatives --header ERR194146.counts > ERR194146.alt_calls
The last 15 columns show likelihoods of genotypes in the following order:
                                  --   A-   B-   AA   AB   BB   AAA  AAB  BBA  BBB  AAAA AAAB BBBA AABB BBBB
1:828090:rs572227970:G/A NC 68	0 0.00 0.00 0.00 0.47 0.00 0.00 0.53 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1:839778:rs558497420:C/T NC 21	0 0.00 0.54 0.00 0.46 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1:839797:rs544279786:G/A NC 19	0 0.00 0.81 0.00 0.19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1:839818:rs554395906:T/G NC 19	0 0.00 0.81 0.00 0.19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1:840098:rs548643931:C/G NC 21	0 0.00 0.54 0.00 0.46 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1:840696:rs377564551:C/T NC 20	0 0.00 0.69 0.00 0.31 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1:843912:rs551374321:A/G NC  0	8 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1:845311:rs145001858:C/T NC 21	0 0.00 0.54 0.00 0.46 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1:859206:rs188534988:G/A NC 19	0 0.00 0.81 0.00 0.19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1:859240:rs576931724:G/A NC 21	0 0.00 0.54 0.00 0.46 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Example #3: Generating FASTQ from BAM file

In many projects the raw reads are deleted as soon as possible and only the BAM file is archived after mapping the reads. If you want to use the sequence data hidden in these BAM files you would first need to convert the BAM file to a FASTQ file. There are several software packages for this. For example:
TopHat (ver 2.1)
bam2fastx -A -o NA12877.fastq NA12877_S1.bam

Samtools (ver 1.3)
samtools fastq NA12877_S1.bam > ERR194146.fastq

older Samtools (ver 0.1.XX)
samtools bam2fq NA12877_S1.bam > ERR194146.fastq

Picard

Example #4: Re-using the ML model

Small SNV databases (SNVs from single chromosome or from few candidate genes, SNVs from protein-coding regions) can be created and used as described in the Example #1. However, the parameters for ML model are always estimated more precicely from larger set of markers. Thus, it is reasonable to calculate model parameters from the largest possible marker set (currently 30M SNVs) and then re-use the model for calling any subsequent marker sets, if necessary. Model parameters can be shown by starting gmer_caller using the argument '--info'.
This allows saving the parameters by copy-paste and re-running gmer_caller with the same model.
./gmer_caller --info --no_genotypes ERR194146.counts
#Sex	M
#EstimatedCoverage	40.5653
#AverageMAF	0.0432994
#AutosomeModel	0.0169558 0.000216533 0.00525876 0.993028 40.5653 503.224 -0.155953
#XModel	0.00855989 0.000122085 0.999546 0.000289629 40.6607 1366.89 -0.134021
For re-using parameters just copy the 7 numbers that define the model behind argument '--params'
NB! The autosomal and sex-cromosome models have different parameters.
Therefore, you need to process them separately if you want to calculate the genotypes using previously saved model parameters.
./gmer_caller --runs 0 --model diploid --params 0.0169558 0.000216533 0.00525876 0.993028 40.5653 503.224 -0.155953 ERR194146.autosomal_SNV.counts > ERR194146.autosomal_SNV.calls
./gmer_caller --runs 0 --model haploid --params 0.00855989 0.000122085 0.999546 0.000289629 40.6607 1366.89 -0.134021 ERR194146.chrXY_SNV.counts > ERR194146.chrXY_SNV.calls

Instructions for creating custom k-mer databases

You can download couple of databases with common human SNV k-mers from the download page.
However, if you need to search for custom variants with gmer_counter, then you need to create your own database.
The minimal required format for k-mer databases is rather straightforward tabulated text format:
Name_of_the_variant	Number_of_kmers_for_given_variant	k-mer_1_for_variant_A	k-mer_1_for_variant_B
Please note:
1. You can use 1-3 pairs of k-mers (either 2, 4 or 6 k-mers) for each variant.
2. Each k-mer and it's reverse complement should appear only ONCE in the entire database. Using the same k-mer for different variants will cause errors.
3. All k-mers in one database must have the same length k.
4. If database contains both autosomal and sex chromosomes then the names of autosomal variants should start with a number and sex chromosome should start with letter X or letter Y.