User manual

The first example lists all the commands required for genotyping with the KATK package, step-by-step. A single male individual from Illumina Platinum dataset (NA12877) and our standard k-mer list with all protein coding exons is used in this example.

Preparations

Before you can start genotyping, you need to a) download KATK binaries, b) download region-specific database and c) have some sequencing data in FASTQ-format file. Make sure you have enough space for storing these files. The FASTQ file that is used in this example has ca 50x coverage and is rather large - ca 411 GB unpacked. Region-specific database contains all protein-coding exons from human genome and is ca 5 GB in size. The output files are rather small, ca 50 MB.

1. Download KATK binaries, unpack.
wget http://bioinfo.ut.ee/KATK/downloads/KATK_4_1_26.tar.gz
tar zxvf KATK_4_1_26.tar.gz


2. Download region-specific files
wget http://bioinfo.ut.ee/KATK/downloads/KATK_db_20200401.tar.gz
tar zxvf KATK_db_20200401.tar.gz

These files contain binary database of region-specific k-mers (cmd_20190410.dbb) and text file with coordinates of regions (cmd_20191031.txt)

3. Download FASTQ files with sequencing reads of the individual NA12877 (coded as ERR194146) and unpack. You can use other FASTQ files if you already have them on your server.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_2.fastq.gz
gunzip ERR194146*.fastq.gz

Example #1: Standard genotyping procedure

A. Create an index from your FASTQ file ERR194146:
./gmer_counter -dbb cmd_20190410.dbb --compile_index ERR194146.index ERR194146*.fastq
./gassembler -dbi ERR194146.index --file cmd_20191031.txt > ERR194146.calls

This should give you a text file with genotype calls.


head ERR194146.calls
First page of results shows mainly NC type of genotypes. These are NoCall positions, where the depth of coverage is too low to make a certain call.
We intentionally show all these positions, because we cannot reliably detect whether these are mutations or reference alleles.
We believe it is better to explicitly list all these uncertainly called positions instead of showing reference nucleotide by default.
#KATK version: 4.1.26
#KMer Database: ERR194146.index
#Coverage: 38.00
CHR	POS	SUB	REF	COV	CALL	CLASS	P	PMUT
1	923837	0	G	3	NC	0	0.511	0.211
1	923838	0	T	3	NC	0	0.511	0.211
1	923839	0	C	3	NC	0	0.511	0.206
1	923840	0	G	3	NC	0	0.511	0.199
1	923841	0	G	3	NC	0	0.511	0.190
1	923842	0	C	3	NC	0	0.511	0.180

It is possible to exclude uncertain positions (NC) with grep -v
First 30 lines of reliably detected mutations should look like this:
grep -v NC ERR194146.calls | head -n 34
#KATK version: 4.1.26
#KMer Database: ERR194146.index
#Coverage: 38.00
CHR	POS	    SUB	REF	COV	CALLCLASS	P	PMUT
1	924024	0	C	20	GG	S	0.999	0.650
1	924310	0	C	28	GG	S	0.999	0.853
1	924321	0	C	28	GG	S	0.999	0.831
1	924533	0	A	18	GG	S	0.999	0.625
1	925081	0	G	21	AA	S	0.999	0.719
1	925141	0	C	28	AA	S	1.000	0.831
1	935954	0	G	26	GT	S	0.997	0.768
1	940390	0	A	36	AG	S	1.000	0.999
1	941119	0	A	27	GG	S	1.000	0.837
1	942335	0	C	22	GG	S	1.000	0.826
1	942451	0	T	45	CC	S	1.000	0.936
1	944296	0	G	60	AA	S	1.000	0.903
1	944307	0	T	56	CC	S	1.000	0.914
1	944858	0	A	40	GG	S	1.000	0.923
1	945261	0	C	30	C-	D	1.000	0.852
1	946247	0	G	39	AG	S	0.999	0.937
1	946653	0	G	50	AG	S	0.999	0.919
1	948245	0	A	39	GG	S	1.000	0.935
1	952180	0	A	34	CC	S	1.000	0.927
1	952421	0	A	25	GG	S	0.997	0.809
1	953259	0	T	14	CC	S	0.997	0.555
1	953279	0	T	31	CC	S	1.000	0.876
1	953778	0	G	30	CC	S	1.000	0.855
1	953779	0	A	31	CC	S	1.000	0.876
1	954258	0	G	36	CC	S	1.000	0.916
1	954333	0	C	45	AA	S	1.000	0.965
1	957365	0	G	54	AA	S	1.000	0.904
1	957900	0	G	43	AA	S	1.000	0.999
1	959193	0	G	29	AA	S	1.000	0.864
1	961945	0	G	32	CC	S	1.000	0.997

First two columns indicate position in reference genome.
Fourth column (REF) indicates the reference nucleotide.
Fifth column (COV) shows depth of coverage of usable reads at this position.
Sixth column (CALL) shows predicted genotype in studied individual.
Seventh column (P) indicates how certain we are that given position is different from reference nucleotide.
Eighth column (PMUT) indicates how certain we are that we would have detected alternative allele if there was one.

By default we classify a position as no-call genotype (NC) if P<0.95 or PMUT<0.5

Example #2: Alternative outputs

If you want to see the frequencies of each nucleotide at each position, re-run gassembler with the argument '--counts'. The column GAP means frequency of reads that have deletion at given position
./gassembler -dbi ERR194146.index --file cmd_20191031.txt --counts > ERR194146.calls
grep -v NC ERR194146.calls | head -n 34
#KATK version: 4.1.26
#KMer Database: ERR194146.index
#Coverage: 38.00
CHR	POS	SUB	REF	COV	CALL	CLASS	P	PMUT	A	C	G	T	GAP
1	924024	0	C	20	GG	S	0.999	0.650	0	0	20	0	0
1	924310	0	C	28	GG	S	0.999	0.853	0	0	28	0	0
1	924321	0	C	28	GG	S	0.999	0.831	0	0	28	0	0
1	924533	0	A	18	GG	S	0.999	0.625	0	0	18	0	0
1	925081	0	G	21	AA	S	0.999	0.719	21	0	0	0	0
1	925141	0	C	28	AA	S	1.000	0.831	28	0	0	0	0
1	935954	0	G	26	GT	S	0.997	0.768	0	0	13	13	0
1	940390	0	A	36	AG	S	1.000	0.999	19	0	17	0	0
1	941119	0	A	27	GG	S	1.000	0.837	0	0	27	0	0
1	942335	0	C	22	GG	S	1.000	0.826	0	0	22	0	0
1	942451	0	T	45	CC	S	1.000	0.936	0	45	0	0	0
1	944296	0	G	60	AA	S	1.000	0.903	60	0	0	0	0
1	944307	0	T	56	CC	S	1.000	0.914	0	56	0	0	0
1	944858	0	A	40	GG	S	1.000	0.923	0	0	40	0	0
1	945261	0	C	30	C-	D	1.000	0.852	0	19	0	0	11
1	946247	0	G	39	AG	S	0.999	0.937	23	0	16	0	0
1	946653	0	G	50	AG	S	0.999	0.919	26	0	24	0	0
1	948245	0	A	39	GG	S	1.000	0.935	0	0	39	0	0
1	952180	0	A	34	CC	S	1.000	0.927	0	34	0	0	0
1	952421	0	A	25	GG	S	0.997	0.809	0	0	25	0	0
1	953259	0	T	14	CC	S	0.997	0.555	0	14	0	0	0
1	953279	0	T	31	CC	S	1.000	0.876	0	31	0	0	0
1	953778	0	G	30	CC	S	1.000	0.855	0	30	0	0	0
1	953779	0	A	31	CC	S	1.000	0.876	0	31	0	0	0
1	954258	0	G	36	CC	S	1.000	0.916	0	36	0	0	0
1	954333	0	C	45	AA	S	1.000	0.965	45	0	0	0	0
1	957365	0	G	54	AA	S	1.000	0.904	54	0	0	0	0
1	957900	0	G	43	AA	S	1.000	0.999	43	0	0	0	0
1	959193	0	G	29	AA	S	1.000	0.864	29	0	0	0	0
1	961945	0	G	32	CC	S	1.000	0.997	0	32	0	0	0

Adjusting performance

The performance of gmer_counter can be adjusted by arguments '--num_threads' and '--prefetch'.
The performance of gassembler can be adjusted by arguments '--num_threads' and '--prefetch_seq'.
In both cases, '--num_threads' should not exceed the number of CPU cores that you have access to (default is 24).

Prefetching files should be switched on if you have more than 64GB RAM available. This gives significant boost to performance, particularly if you plan to call more than one individual.
For example:
./gmer_counter -dbb cmd_20190410.dbb --compile_index ERR194146.index --prefetch ERR194146*.fastq
./gassembler -dbi ERR194146.index --file cmd_20191031.txt --prefetch_seq > ERR194146.calls

Another way top speed up the calling by gassembler is to define sex of each individual on command line.
This saves time by omitting automated sex estimation from data.
For example:
./gassembler -dbi ERR194146.index --file cmd_20191031.txt --prefetch_seq --sex male > ERR194146.calls