User manual
The first example lists all the commands required for genotyping with the KATK package, step-by-step. A single male individual from Illumina Platinum dataset (NA12877) and our standard k-mer list with all protein coding exons is used in this example.Preparations
Before you can start genotyping, you need to a) download KATK binaries, b) download region-specific database and c) have some sequencing data in FASTQ-format file. Make sure you have enough space for storing these files. The FASTQ file that is used in this example has ca 50x coverage and is rather large - ca 411 GB unpacked. Region-specific database contains all protein-coding exons from human genome and is ca 5 GB in size. The output files are rather small, ca 50 MB.1. Download KATK binaries, unpack.
wget http://bioinfo.ut.ee/KATK/downloads/KATK_4_1_26.tar.gz
tar zxvf KATK_4_1_26.tar.gz
2. Download region-specific files
wget http://bioinfo.ut.ee/KATK/downloads/KATK_db_20200401.tar.gz
tar zxvf KATK_db_20200401.tar.gz
These files contain binary database of region-specific k-mers (cmd_20190410.dbb) and text file with coordinates of regions (cmd_20191031.txt)
3. Download FASTQ files with sequencing reads of the individual NA12877 (coded as ERR194146) and unpack. You can use other FASTQ files if you already have them on your server.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_2.fastq.gz
gunzip ERR194146*.fastq.gz
Example #1: Standard genotyping procedure
A. Create an index from your FASTQ file ERR194146:
./gmer_counter -dbb cmd_20190410.dbb --compile_index ERR194146.index ERR194146*.fastq
./gassembler -dbi ERR194146.index --file cmd_20191031.txt > ERR194146.calls
This should give you a text file with genotype calls.
head ERR194146.calls
First page of results shows mainly NC type of genotypes. These are NoCall positions, where the depth of coverage is too low to make a certain call.
We intentionally show all these positions, because we cannot reliably detect whether these are mutations or reference alleles.
We believe it is better to explicitly list all these uncertainly called positions instead of showing reference nucleotide by default.
#KATK version: 4.1.26 #KMer Database: ERR194146.index #Coverage: 38.00 CHR POS SUB REF COV CALL CLASS P PMUT 1 923837 0 G 3 NC 0 0.511 0.211 1 923838 0 T 3 NC 0 0.511 0.211 1 923839 0 C 3 NC 0 0.511 0.206 1 923840 0 G 3 NC 0 0.511 0.199 1 923841 0 G 3 NC 0 0.511 0.190 1 923842 0 C 3 NC 0 0.511 0.180
It is possible to exclude uncertain positions (NC) with grep -v
First 30 lines of reliably detected mutations should look like this:
grep -v NC ERR194146.calls | head -n 34
#KATK version: 4.1.26 #KMer Database: ERR194146.index #Coverage: 38.00 CHR POS SUB REF COV CALLCLASS P PMUT 1 924024 0 C 20 GG S 0.999 0.650 1 924310 0 C 28 GG S 0.999 0.853 1 924321 0 C 28 GG S 0.999 0.831 1 924533 0 A 18 GG S 0.999 0.625 1 925081 0 G 21 AA S 0.999 0.719 1 925141 0 C 28 AA S 1.000 0.831 1 935954 0 G 26 GT S 0.997 0.768 1 940390 0 A 36 AG S 1.000 0.999 1 941119 0 A 27 GG S 1.000 0.837 1 942335 0 C 22 GG S 1.000 0.826 1 942451 0 T 45 CC S 1.000 0.936 1 944296 0 G 60 AA S 1.000 0.903 1 944307 0 T 56 CC S 1.000 0.914 1 944858 0 A 40 GG S 1.000 0.923 1 945261 0 C 30 C- D 1.000 0.852 1 946247 0 G 39 AG S 0.999 0.937 1 946653 0 G 50 AG S 0.999 0.919 1 948245 0 A 39 GG S 1.000 0.935 1 952180 0 A 34 CC S 1.000 0.927 1 952421 0 A 25 GG S 0.997 0.809 1 953259 0 T 14 CC S 0.997 0.555 1 953279 0 T 31 CC S 1.000 0.876 1 953778 0 G 30 CC S 1.000 0.855 1 953779 0 A 31 CC S 1.000 0.876 1 954258 0 G 36 CC S 1.000 0.916 1 954333 0 C 45 AA S 1.000 0.965 1 957365 0 G 54 AA S 1.000 0.904 1 957900 0 G 43 AA S 1.000 0.999 1 959193 0 G 29 AA S 1.000 0.864 1 961945 0 G 32 CC S 1.000 0.997First two columns indicate position in reference genome.
Fourth column (REF) indicates the reference nucleotide.
Fifth column (COV) shows depth of coverage of usable reads at this position.
Sixth column (CALL) shows predicted genotype in studied individual.
Seventh column (P) indicates how certain we are that given position is different from reference nucleotide.
Eighth column (PMUT) indicates how certain we are that we would have detected alternative allele if there was one.
By default we classify a position as no-call genotype (NC) if P<0.95 or PMUT<0.5
Example #2: Alternative outputs
If you want to see the frequencies of each nucleotide at each position, re-run gassembler with the argument '--counts'. The column GAP means frequency of reads that have deletion at given position./gassembler -dbi ERR194146.index --file cmd_20191031.txt --counts > ERR194146.calls
grep -v NC ERR194146.calls | head -n 34
#KATK version: 4.1.26 #KMer Database: ERR194146.index #Coverage: 38.00 CHR POS SUB REF COV CALL CLASS P PMUT A C G T GAP 1 924024 0 C 20 GG S 0.999 0.650 0 0 20 0 0 1 924310 0 C 28 GG S 0.999 0.853 0 0 28 0 0 1 924321 0 C 28 GG S 0.999 0.831 0 0 28 0 0 1 924533 0 A 18 GG S 0.999 0.625 0 0 18 0 0 1 925081 0 G 21 AA S 0.999 0.719 21 0 0 0 0 1 925141 0 C 28 AA S 1.000 0.831 28 0 0 0 0 1 935954 0 G 26 GT S 0.997 0.768 0 0 13 13 0 1 940390 0 A 36 AG S 1.000 0.999 19 0 17 0 0 1 941119 0 A 27 GG S 1.000 0.837 0 0 27 0 0 1 942335 0 C 22 GG S 1.000 0.826 0 0 22 0 0 1 942451 0 T 45 CC S 1.000 0.936 0 45 0 0 0 1 944296 0 G 60 AA S 1.000 0.903 60 0 0 0 0 1 944307 0 T 56 CC S 1.000 0.914 0 56 0 0 0 1 944858 0 A 40 GG S 1.000 0.923 0 0 40 0 0 1 945261 0 C 30 C- D 1.000 0.852 0 19 0 0 11 1 946247 0 G 39 AG S 0.999 0.937 23 0 16 0 0 1 946653 0 G 50 AG S 0.999 0.919 26 0 24 0 0 1 948245 0 A 39 GG S 1.000 0.935 0 0 39 0 0 1 952180 0 A 34 CC S 1.000 0.927 0 34 0 0 0 1 952421 0 A 25 GG S 0.997 0.809 0 0 25 0 0 1 953259 0 T 14 CC S 0.997 0.555 0 14 0 0 0 1 953279 0 T 31 CC S 1.000 0.876 0 31 0 0 0 1 953778 0 G 30 CC S 1.000 0.855 0 30 0 0 0 1 953779 0 A 31 CC S 1.000 0.876 0 31 0 0 0 1 954258 0 G 36 CC S 1.000 0.916 0 36 0 0 0 1 954333 0 C 45 AA S 1.000 0.965 45 0 0 0 0 1 957365 0 G 54 AA S 1.000 0.904 54 0 0 0 0 1 957900 0 G 43 AA S 1.000 0.999 43 0 0 0 0 1 959193 0 G 29 AA S 1.000 0.864 29 0 0 0 0 1 961945 0 G 32 CC S 1.000 0.997 0 32 0 0 0
Adjusting performance
The performance of gmer_counter can be adjusted by arguments '--num_threads' and '--prefetch'.The performance of gassembler can be adjusted by arguments '--num_threads' and '--prefetch_seq'.
In both cases, '--num_threads' should not exceed the number of CPU cores that you have access to (default is 24).
Prefetching files should be switched on if you have more than 64GB RAM available. This gives significant boost to performance, particularly if you plan to call more than one individual.
For example:
./gmer_counter -dbb cmd_20190410.dbb --compile_index ERR194146.index --prefetch ERR194146*.fastq
./gassembler -dbi ERR194146.index --file cmd_20191031.txt --prefetch_seq > ERR194146.calls
Another way top speed up the calling by gassembler is to define sex of each individual on command line.
This saves time by omitting automated sex estimation from data.
For example:
./gassembler -dbi ERR194146.index --file cmd_20191031.txt --prefetch_seq --sex male > ERR194146.calls