Manual for the downloadable version of StrainSeeker
Contents
- Installing instructions
- Dependencies
  - System
  - Programs
- Database
  - Size
  - Creating database
  - Should I use blacklist?
  - Contents
  - Subtrees
  - Parameters and their effect
- Search
  - Search process
  - Custom parameters
  - Creating blacklist
  - File descriptions
Installing instructions
- To run StrainSeeker, first download the database and the programs.
- Extract all the files into a separate directory for StrainSeeker. The database resides in its own subdirectory (the directory is already created; just use tar -xzvf to unpack).
- All the programs and scripts should be in the main directory, not in the database directory.
- Finally, run "seeker.pl" to detect strains, or "builder.pl" if you wish to build your own database.
Dependencies
System requirements
Programs required to run StrainSeeker
- Builder
  - GenomeTester4 programs:
    - GlistMaker
    - GlistCompare
    - GlistQuery
- Seeker
  - GenomeTester4 programs:
    - GlistMaker
    - GlistCompare
    - GlistQuery
  - gDistribution
  - R scripts for statistical tests
Database
Structure and size
Building the database required about 200 GB of disk space in total, but 300 GB is recommended because some large temporary files are created. Structural information (parent/child relations between nodes, k-mer counts) is stored in a small text file, info.txt, which resides in the database directory.
Creating database
EXAMPLE COMMAND LINE: perl builder.pl -n refseq_guide_tree.nwk -d strain_fasta_directory -w 32 -b ss_blacklist_w32.list -o my_database
-n is the guide tree in Newick format, describing the relationships between the given strains.
-d is the directory containing the .fna files of all strains used in the Newick file.
-b is the path to the blacklist (it must have the same k-mer length as given with -w).
-w is the k-mer length.
-o is the user-defined database name.
Additional parameters are described below and are also listed with the help flag: perl builder.pl -h
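When databases are built from a script, the example command line can be assembled programmatically. The following is a minimal Python sketch, not part of StrainSeeker: the helper name builder_command is hypothetical, and it uses only the flags shown in the example above.

```python
import shlex

def builder_command(tree, fasta_dir, word, blacklist, db_name):
    """Return the builder.pl argument list for the flags shown in the example."""
    return [
        "perl", "builder.pl",
        "-n", tree,          # guide tree in Newick format
        "-d", fasta_dir,     # directory with the .fna files
        "-w", str(word),     # k-mer length
        "-b", blacklist,     # blacklist with the same k-mer length as -w
        "-o", db_name,       # user-defined database name
    ]

cmd = builder_command("refseq_guide_tree.nwk", "strain_fasta_directory",
                      32, "ss_blacklist_w32.list", "my_database")
print(shlex.join(cmd))
# To start the build for real (assumes perl is installed and builder.pl is in
# the current directory): import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.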
Required files
- Assembled genomes with suffix .fna
- Guide tree in Newick (.nwk) format
NOTE: Assembled genome file names must match the names used in the .nwk file (without the .fna suffix). Beware of underscores (_) turning into spaces if using MEGA.
EXAMPLE: if the Newick file contains a genome named "E_coli_MG1655", then the FASTA file must be named "E_coli_MG1655.fna".
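The name-matching requirement can be checked before starting a long build. The sketch below is an illustration in Python and not a StrainSeeker tool. The function names are hypothetical, the strain names other than E_coli_MG1655 are made up, and the Newick parsing is a simple regular expression that assumes leaf labels contain no parentheses, commas, colons or semicolons.

```python
import re
from pathlib import Path

def newick_leaf_names(newick_text):
    """Extract leaf labels: names that directly follow '(' or ','."""
    names = re.findall(r"[(,]\s*([^(),:;]+)", newick_text)
    return {name.strip().strip("'\"") for name in names}

def name_mismatches(newick_text, fasta_filenames):
    """Compare leaf names in the tree with the .fna file names (suffix removed)."""
    leaves = newick_leaf_names(newick_text)
    stems = {Path(f).stem for f in fasta_filenames if f.endswith(".fna")}
    missing_fasta = sorted(leaves - stems)   # in the tree, but no .fna file
    unused_fasta = sorted(stems - leaves)    # .fna file, but not in the tree
    return missing_fasta, unused_fasta

# Made-up example: one underscore name has been turned into spaces in the tree.
tree = "((E_coli_MG1655:0.1,E coli O157:0.2):0.3,S_enterica_LT2:0.4);"
files = ["E_coli_MG1655.fna", "E_coli_O157.fna", "S_enterica_LT2.fna"]
missing, unused = name_mismatches(tree, files)
print("In tree but no .fna file:", missing)
print(".fna file not in tree:", unused)
```

In practice the file list could come from os.listdir on the directory given with -d, and the tree text from the file given with -n.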
Database contents
- K-mer lists for each leaf and node (*.list)
- Info file describing the relations of each node as well as the total number of unique k-mers in it (info.txt)
- Whitelist: the union of all node and leaf k-mer lists. Before each search, the sample k-mer list is intersected with the whitelist, so that k-mers that are not in the database are not processed repeatedly.
- Subwhite: the union of all subtree root (subroot) k-mer lists. After the intersection of the whitelist and the sample has been made, it is intersected with the subwhite, giving a small list of subroot k-mers present in the sample. Each subroot list is then compared with this small list, and subroots whose observed fraction exceeds the limit (default O > 5%) are used in the main search.
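The whitelist and subwhite filtering can be pictured with plain sets. This is a toy Python sketch, not the actual implementation (StrainSeeker works on GenomeTester4 .list files). The function name, the tiny 4-mer lists and the definition of the observed fraction as the share of a subroot's k-mers found in the sample are assumptions made for illustration.

```python
def select_subroots(sample, whitelist, subwhite, subroots, min_fraction=0.05):
    """Return names of subroots whose observed fraction exceeds min_fraction."""
    in_database = sample & whitelist        # drop k-mers unknown to the database
    subroot_hits = in_database & subwhite   # subroot k-mers present in the sample
    selected = []
    for name, kmers in subroots.items():
        if not kmers:
            continue                        # an empty list cannot be observed
        observed = len(kmers & subroot_hits) / len(kmers)
        if observed > min_fraction:
            selected.append(name)
    return selected

# Hypothetical toy data: two subroots with four 4-mers each.
subroots = {
    "subroot_A": {"AAAC", "AACG", "ACGT", "CGTA"},
    "subroot_B": {"GGGT", "GGTC", "GTCA", "TCAG"},
}
subwhite = set().union(*subroots.values())
whitelist = subwhite | {"TTTT", "CCCC"}     # plus other node and leaf k-mers
sample = {"AAAC", "AACG", "ACGT", "TTTT", "GATC"}
print(select_subroots(sample, whitelist, subwhite, subroots))
```

Here only subroot_A is selected: three of its four k-mers are in the sample, while none of subroot_B's are.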
Subtrees
Depending on the tree size and the diversity of the strains used, some nodes (including the root) might contain no k-mers. Therefore the tree is automatically divided into multiple subtrees, each rooted at a node whose total number of unique k-mers exceeds a given cutoff (Builder's -m or --min parameter). A tree is also split into subtrees if the number of k-mers in a node exceeds this cutoff but the node still has considerably fewer k-mers than one of its subnodes (the allowed difference can be set with Builder's -g or --greater parameter). Builder and Seeker take subtrees into account automatically.
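One possible reading of the splitting rule is sketched below in Python. It is an interpretation of the description above and not Builder's actual code. The function name, the tree layout, the k-mer counts, the parameter values and the choice to treat every leaf that is reached as its own subroot are all assumptions.

```python
def find_subroots(tree, node, min_kmers, greater):
    """Collect subroots below `node`; tree maps name -> (kmer_count, children)."""
    count, children = tree[node]
    too_small = count <= min_kmers                       # does not exceed -m
    child_dominates = any(tree[c][0] > greater * count   # exceeds -g times parent
                          for c in children)
    if children and (too_small or child_dominates):
        subroots = []
        for child in children:                           # split: look further down
            subroots += find_subroots(tree, child, min_kmers, greater)
        return subroots
    return [node]                                        # node starts a subtree

# Hypothetical guide tree with made-up unique k-mer counts.
tree = {
    "root": (0, ["n1", "n2"]),
    "n1": (5000, ["a", "b"]),
    "n2": (1200, ["c", "d"]),
    "a": (40000, []), "b": (38000, []),
    "c": (90000, []), "d": (2000, []),
}
print(find_subroots(tree, "root", min_kmers=1000, greater=20))
```

With these numbers the empty root is skipped, n1 becomes a subroot, and n2 is split further because its child c has more than 20 times as many k-mers as n2.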
Builder parameters and their effect
-b, --blacklist
.list file of k-mers unwanted in the database (human, plasmids, etc.).
Using a blacklist while creating a database gives more accurate results. For example, by random chance some k-mers from the strains added to the database might also occur in the human genome. As many clinical samples contain human DNA, the results might be skewed towards those strains. The problem is more pronounced for plasmids, which may be integrated into the genome of some bacteria and absent from others.
-w, --word
K-mer length (word size) used in database building and later in searching.
If the k-mer length is very short, there are very few node-specific k-mers in the database. For example, a 3-mer with the sequence ATG is probably present in every organism's DNA. The longer the k-mer, the more specific it is, and the number of distinct k-mers obtained from a sequence increases. The k-mer cannot be too long, however, because of read lengths and because every SNP removes k k-mers from the sample.
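The statement that every SNP costs k k-mers can be checked with a few lines of Python. This is an illustrative sketch and not part of StrainSeeker. The function name and the toy sequence are made up, and the count is per window position, so it holds for a SNP that lies at least k-1 bases away from both ends of the sequence.

```python
def kmers_changed_by_snp(seq, pos, k, new_base):
    """Count the k-mer windows that differ after substituting one base."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    windows = range(len(seq) - k + 1)
    # A window starting at i covers positions i .. i+k-1, so exactly the
    # windows that overlap `pos` change.
    return sum(seq[i:i + k] != mutated[i:i + k] for i in windows)

seq = "ACGT" * 10                      # 40-base toy sequence
for k in (4, 8, 16):
    print(k, kmers_changed_by_snp(seq, 20, k, "C"))
# prints:
# 4 4
# 8 8
# 16 16
```

Doubling k doubles the number of sample k-mers lost per SNP, which is why very long k-mers reduce sensitivity for strains that differ slightly from the database genomes.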
-m, --min
Minimum number of k-mers in a node for it to be considered a subroot.
-g, --greater
Maximum factor by which a child may have more k-mers than its parent.
-t, --threads
Number of cores used.
Search
EXAMPLE COMMAND LINE: perl seeker.pl -i sample_file.fastq -d ss_db_w32 -o sample_result.txt
Search process
- Info file is read from db_name/info.txt
- Sample is converted into k-mer list
- Finding subroots to start search from
- Searching from each subtree
- Displaying search process and paths taken if "-verbose" flag is used
- Printing results to output file (default StrainSeeker_output)
Seeker parameters and their effect
-verbose
Outputs the search process while running.
Creating blacklist
- Option 1: Use GlistMaker to create the blacklist (it can take multiple FASTA files as input)
- Option 2: Make k-mer lists of the sequences to be added to the blacklist and use MakeUnionMT.pl to join these lists into one big union list
- Option 3: Download a pre-made blacklist
File descriptions
- builder.pl - Perl script used to build a StrainSeeker database (requires: GenomeTester4 programs)
- seeker.pl - Perl script used to search a StrainSeeker database (requires: GenomeTester4 programs, gDistribution, oe.R, cov.R)
- GenomeTester4 programs
- GlistMaker
- GlistCompare
- GlistQuery
- oe.R - R script used to calculate the O/E ratio (required by: seeker.pl)
- cov.R - R script used to calculate coverage (requires: gDistribution; required by: seeker.pl)
- gDistribution - gives the distribution of frequencies for a given k-mer list file (required by cov.R, which is required by seeker.pl)
- MakeUnionMT.pl - Perl script for making large unions of k-mer list files (can be used to make blacklist)