Manual for downloadable version of StrainSeeker

Installation instructions
Database
- Size
- Creating database
- Contents
- Subtrees
- Parameters and their effect
Search
- Search process
- Custom parameters
Creating blacklist
File descriptions

Installation instructions

To run StrainSeeker, first download the database and the programs.
Extract all the files into a separate directory for StrainSeeker. The database resides in its own subdirectory (the directory is already created for you; just unpack the archive with tar -xzvf).
All programs and scripts should be in the main directory, not inside the database directory.
Finally, run "seeker.pl" to detect strains, or "builder.pl" if you wish to build your own database.
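
The layout described above can be checked with a short script before the first run. This is only a sketch: the function name layout_problems is hypothetical, and the database directory name ss_db_w32 is borrowed from the search example in this manual; substitute the name of the database you actually unpacked.

```python
import tempfile
from pathlib import Path


def layout_problems(main_dir, db_name, scripts=("seeker.pl", "builder.pl")):
    """Return a list of problems with the directory layout (empty if none)."""
    main_dir = Path(main_dir)
    db_dir = main_dir / db_name
    problems = []
    if not db_dir.is_dir():
        problems.append(f"database directory missing: {db_dir}")
    for script in scripts:
        # Scripts belong in the main directory ...
        if not (main_dir / script).is_file():
            problems.append(f"{script} not found in {main_dir}")
        # ... and must not sit inside the database directory.
        if (db_dir / script).exists():
            problems.append(f"{script} should not be inside {db_dir}")
    return problems


# Demonstration on a throwaway directory (no real database involved).
with tempfile.TemporaryDirectory() as tmp:
    main = Path(tmp)
    (main / "ss_db_w32").mkdir()
    (main / "seeker.pl").touch()
    (main / "builder.pl").touch()
    good = layout_problems(main, "ss_db_w32")
    (main / "ss_db_w32" / "seeker.pl").touch()  # misplaced copy
    bad = layout_problems(main, "ss_db_w32")

print(good)
print(bad)
```

On a real installation, call layout_problems with the path of the StrainSeeker main directory and the name of the database subdirectory.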

Dependencies

System requirements

Perl
R

Programs required to run StrainSeeker

Builder

GenomeTester4 programs:

GlistMaker
GlistCompare
GlistQuery

Seeker

GenomeTester4 programs:

GlistMaker
GlistCompare
GlistQuery
gDistribution

R scripts for statistical tests

Database

Structure and size

Building the database required about 200 GB of disk space in total, but 300 GB is recommended because some large temporary files are created. Structural information (parent/child relations between nodes and k-mer counts) is stored in a small text file, info.txt, which resides in the database directory.

Creating database

EXAMPLE COMMAND LINE: perl builder.pl -n refseq_guide_tree.nwk -d strain_fasta_directory -w 32 -b ss_blacklist_w32.list -o my_database

-n is the guide tree in Newick format, describing the relationships between the given strains.
-d is the directory containing the .fna files for all strains used in the Newick file.
-w is the k-mer length.
-b is the path to the blacklist (it must have been built with the same k-mer length as given with -w).
-o is the user-defined database name.

Additional parameters are described below and can also be listed with the help flag: perl builder.pl -h
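
When builds are scripted, the example command line can be assembled programmatically. The sketch below only reproduces the flags documented above; the helper names builder_command and missing_inputs are hypothetical and are not part of StrainSeeker.

```python
import subprocess
from pathlib import Path


def builder_command(tree, fasta_dir, word, blacklist, out_name):
    """Assemble the argument list for the example builder.pl call."""
    return [
        "perl", "builder.pl",
        "-n", str(tree),        # guide tree in Newick format
        "-d", str(fasta_dir),   # directory with the .fna files
        "-w", str(word),        # k-mer length
        "-b", str(blacklist),   # blacklist with the same k-mer length
        "-o", str(out_name),    # user-defined database name
    ]


def missing_inputs(tree, fasta_dir):
    """Return human-readable problems with the required input files."""
    problems = []
    if not Path(tree).is_file():
        problems.append(f"guide tree not found: {tree}")
    if not any(Path(fasta_dir).glob("*.fna")):
        problems.append(f"no .fna files in: {fasta_dir}")
    return problems


cmd = builder_command("refseq_guide_tree.nwk", "strain_fasta_directory",
                      32, "ss_blacklist_w32.list", "my_database")
print(" ".join(cmd))

# To actually start the build (requires builder.pl and its dependencies):
# subprocess.run(cmd, check=True)
```

Checking the inputs with missing_inputs before starting is worthwhile because a build needs a lot of time and disk space.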

Required files

Assembled genomes with suffix .fna
Guide tree in Newick (nwk) format

Database contents

K-mer lists for each leaf and node (*.list)
Info file describing the relations of each node as well as the total number of unique k-mers in it (info.txt)
Whitelist - a union list of all node and leaf k-mer lists. Before each search, the sample k-mer list is intersected with the whitelist so that k-mers absent from the database are not processed repeatedly.
Subwhite - a union of all subtree root (subroot) k-mer lists. It is used after the intersection of the whitelist and the sample has been made: intersecting the subwhite with that result gives a small list of subroot k-mers present in the sample. Each subroot list is then compared with this small list, and the subroots that exceed the observed fraction limit (default O > 5%) are used in the main search.
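
The whitelist and subwhite filtering can be illustrated with plain Python sets. This is a toy sketch, not StrainSeeker's implementation (which works on GenomeTester4 list files). It assumes that the observed fraction O of a subroot is the share of that subroot's k-mers found in the sample; the function name select_subroots and the 3-mers are made up for the example.

```python
def select_subroots(sample, whitelist, subroot_lists, min_observed=0.05):
    """Return names of subroots whose observed k-mer fraction exceeds the limit.

    sample, whitelist: sets of k-mers
    subroot_lists: dict mapping subroot name -> set of its k-mers
    """
    # Step 1: drop sample k-mers that occur nowhere in the database.
    in_database = sample & whitelist

    # Step 2: the subwhite is the union of all subroot k-mer lists;
    # intersecting it with step 1 leaves the subroot k-mers seen in the sample.
    subwhite = set().union(*subroot_lists.values())
    subroot_hits = in_database & subwhite

    # Step 3: compare every subroot list with the small hit list.
    selected = []
    for name, kmers in subroot_lists.items():
        if not kmers:
            continue
        observed = len(kmers & subroot_hits) / len(kmers)  # assumed meaning of O
        if observed > min_observed:
            selected.append(name)
    return selected


subroots = {
    "A": {"AAA", "AAC", "AAG", "AAT"},
    "B": {"CCC", "CCA", "CCG", "CCT"},
}
whitelist = subroots["A"] | subroots["B"] | {"GGG"}
sample = {"AAA", "AAC", "TTT", "GGG"}

print(select_subroots(sample, whitelist, subroots))  # ['A']
```

Half of subroot A's k-mers occur in the sample, so it passes the 5% limit; subroot B has no hits and is skipped.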

Subtrees

Depending on the tree size and the diversity of the strains used, some nodes (including the root) might contain no k-mers. Therefore, multiple subtrees are produced automatically, rooted at nodes where the total number of unique k-mers exceeds a given cutoff (Builder's -m or --min parameter). A tree is also split into subtrees if the number of k-mers in a node exceeds this cutoff but the node still has considerably fewer k-mers than one of its subnodes (the difference can be set with Builder's -g or --greater parameter). Builder and Seeker take subtrees into account automatically.
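
One possible reading of the splitting rule is sketched below. It is an interpretation, not Builder's actual code: "considerably fewer" is modelled as a ratio (a subnode holding more than `greater` times the k-mers of its parent), which is an assumption, and the function name find_subroots, the tree and the counts are invented for the example.

```python
def find_subroots(tree, counts, root, min_kmers, greater):
    """Return the nodes chosen as subtree roots.

    tree:   dict node -> list of child nodes (leaves map to [])
    counts: dict node -> number of unique k-mers in that node
    """
    subroots = []
    stack = [root]
    while stack:
        node = stack.pop()
        children = tree.get(node, [])
        # The node does not exceed the -m / --min cutoff.
        too_few = counts[node] <= min_kmers
        # A subnode has considerably more k-mers (ratio is an assumption).
        dwarfed = any(counts[c] > greater * counts[node] for c in children)
        if children and (too_few or dwarfed):
            stack.extend(children)  # split here and look further down
        else:
            subroots.append(node)   # leaves always end the descent
    return sorted(subroots)


tree = {
    "root": ["n1", "n2"],
    "n1": ["s1", "s2"],
    "n2": ["s3", "s4"],
    "s1": [], "s2": [], "s3": [], "s4": [],
}
counts = {"root": 0, "n1": 500, "n2": 50,
          "s1": 100, "s2": 120, "s3": 900, "s4": 800}

print(find_subroots(tree, counts, "root", min_kmers=10, greater=10))
# ['n1', 's3', 's4']
```

The empty root is skipped, n1 becomes a subtree root, and n2 is split because its subnode s3 holds far more k-mers than n2 itself.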

Builder parameters and their effect

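-b, --blacklist: path to the blacklist (must have the same k-mer length as given with -w)

-w, --word: k-mer length

-m, --min: cutoff for the total number of unique k-mers in a node; subtrees are produced where this cutoff is exceeded

-g, --greater: difference in k-mer count between a node and one of its subnodes at which the tree is split into subtrees

-t, --threads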

Search

EXAMPLE COMMAND LINE: perl seeker.pl -i sample_file.fastq -d ss_db_w32 -o sample_result.txt

Search process

The info file is read from db_name/info.txt.
The sample is converted into a k-mer list.
The subroots from which to start the search are identified.
Each of these subtrees is searched.
The search process and the paths taken are displayed if the "-verbose" flag is used.
The results are printed to the output file (default: StrainSeeker_output).
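
The second step of the search process, converting the sample into a k-mer list, is handled in StrainSeeker by GlistMaker. The toy sketch below only illustrates the idea of a k-mer list with frequencies; it ignores strand handling and the binary list format, and the function name count_kmers is hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer of length k that consists only of A, C, G and T."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


kmer_list = count_kmers(["ACGTAC", "CGTNA"], k=3)
for kmer, freq in sorted(kmer_list.items()):
    print(kmer, freq)
# ACG 1
# CGT 2
# GTA 1
# TAC 1
```

With the database's k-mer length (for example 32 in a database built with -w 32), the same sliding window applies to every read in the sample file.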

Seeker parameters and their effect

-verbose: displays the search process and the paths taken during the search

Creating blacklist

Option 1: Use GlistMaker to create a blacklist (it can take multiple FASTA files as input).
Option 2: Make k-mer lists of the sequences to be added to the blacklist and use MakeUnionMT.pl to join these lists into one large union list.
Option 3: Download a pre-made blacklist.

File descriptions

builder.pl - Perl script used to build a StrainSeeker database (requires: GenomeTester4 programs)
seeker.pl - Perl script used to search a StrainSeeker database (requires: GenomeTester4 programs, gDistribution, oe.R, cov.R)
GenomeTester4 programs

Link to GitHub

GlistMaker
GlistCompare
GlistQuery

oe.R - R script used to calculate the O/E ratio (required by: seeker.pl)
cov.R - R script used to calculate coverage (requires: gDistribution; required by: seeker.pl)
gDistribution - gives the distribution of frequencies for a given k-mer list file (required by: cov.R, which is required by seeker.pl)
MakeUnionMT.pl - Perl script for making large unions of k-mer list files (can be used to make a blacklist)

Welcome to StrainSeeker

sequencing read analyzer for detecting bacterial strains