4. The annotate command

snptoolkit annotate -h
usage: snptoolkit annotate [-h] -i IDENTIFIER -g GENBANK [-p PROCESSORS] [-f EXCLUDECLOSESNPS] [-q QUALITY] [-d DEPTH] [-r RATIO] [-e EXCLUDE]

optional arguments:
  -h, --help           show this help message and exit

snpToolkit annotate required options:
  -i IDENTIFIER        provide a specific identifier to recognize the file(s) to be analyzed
  -g GENBANK           Pleae provide a genbank file

snpToolkit annotate additional options:
  -p PROCESSORS        number of vcf files to be annotated in parallel default value [1]
  -f EXCLUDECLOSESNPS  exclude SNPs if the distance between them is lower then the specified window size in bp
  -q QUALITY           quality score to consider as a cutoff for variant calling. default value [20]
  -d DEPTH             minimum depth caverage. default value [3]
  -r RATIO             minimum ratio that correspond to the number of reads that has the mutated allele / total depth in that particular position. default
                       value [0]
  -e EXCLUDE           provide a tab file with genomic regions to exclude in this format: region start stop. region must correspond to the same name(s) of
                       chromsome and plasmids as in the genbank file

4.1. Options

This command allows to filter and annotate all SNPs from each selected VCF files. Only two options are required:

Option Function
-i The user need to specify a common identifier found on all VCF files he wants to analyze. If only one VCF file is to be analyzed, provide the file name. If all VCF files should be analyzed, the user needs to provide e.g vcf as all vcf files will have at the end .vcf.gz of .vcf.
-g genbank file. The genbank file must include the fasta sequence for the chromosome and plasmids, if any. genbank files can be downloaded from NCBI.

Several options are additional and are needed to filter SNPs:

Option Function
-f To be able to exclude all SNPs that can be located in hotspot zones or short repeats, it is possible to specify an integer that will correspond to the minimum of distance between SNPs to be kept. if the distance between two SNPs is lower than the specified cutoff, both SNPs will be ignored.
-q Quality score to consider as a cutoff for variant calling. The default value is 20.
-d Minimum depth coverage. The default value is 3.
-r r = M/M+R where M is the number of reads that carry the mutated allele and R is the is the number of reads that carry the reference allele. If not specified all SNPs will be taken into account.
-e This is to specify a tab delimited file with the coordinates of the regions to be ignored when annotating SNPs.

If we take the example of the genbank used for this tutorial:

$ grep 'LOCUS' /Users/amine/Documents/tutorials/snptoolkit/GCF_000009065.1_ASM906v1_genomic.gbff
LOCUS       NC_003143            4653728 bp    DNA     circular CON 20-MAR-2020
LOCUS       NC_003131              70305 bp    DNA     circular CON 20-MAR-2020
LOCUS       NC_003134              96210 bp    DNA     circular CON 20-MAR-2020
LOCUS       NC_003132               9612 bp    DNA     circular CON 20-MAR-2020

as you can see there is one chromosome NC_003143 and three plasmids: NC_003131, NC_003134 and NC_003132. The tab delimited file should look as follows:

NC_003143.1 4016    4079
NC_003143.1 7723    7758
NC_003143.1 11562   19149
NC_003143.1 25663   26698

If there are regions on the plasmids sequences you can also add them in the same file.

4.2. Run it

Now time to run the annotate command

$ snptoolkit annotate -i vcf -g GCF_000009065.1_ASM906v1_genomic.gbff -d 5 -q 30 -r 0.9 -p 4

[15:37:30] [INFO] [4 CPUs requested out of 8 detected on this machine]
[15:37:30] [INFO] [snpToolkit is filtering and annotating your SNPs]
100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:01<00:00,  2.67it/s]
[15:37:32] [INFO] [snpToolkit output files will be located in folders
                 snpToolkit_SNPs_output_...
                 and snpToolkit_INDELS_output_...]
100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:02<00:00,  3.95it/s]

4.3. Outputs

snpToolkit generates two folders with the date and time stamp, one for SNPs and one for indels:

├── snpToolkit_INDELS_output_...
│   ├── sample3_snpToolkit_indels.txt
│   ├── sample9_snpToolkit_indels.txt
│   ├── sample10_snpToolkit_indels.txt
│   ├── sample1_snpToolkit_indels.txt
│   ├── sample2_snpToolkit_indels.txt
│   ├── sample4_snpToolkit_indels.txt
│   ├── sample5_snpToolkit_indels.txt
│   ├── sample6_snpToolkit_indels.txt
│   ├── sample7_snpToolkit_indels.txt
│   └── sample8_snpToolkit_indels.txt
├── snpToolkit_SNPs_output_...
    ├── sample3_snpToolkit_SNPs.txt
    ├── sample9_snpToolkit_SNPs.txt
    ├── sample10_snpToolkit_SNPs.txt
    ├── sample1_snpToolkit_SNPs.txt
    ├── sample2_snpToolkit_SNPs.txt
    ├── sample4_snpToolkit_SNPs.txt
    ├── sample5_snpToolkit_SNPs.txt
    ├── sample6_snpToolkit_SNPs.txt
    ├── sample7_snpToolkit_SNPs.txt
    └── sample8_snpToolkit_SNPs.txt

All generated output files are tab delimited.

4.3.1. Example of SNP output file

##snpToolkit=__version__
##commandline= snptoolkit annotate -i vcf -g GCF_000009065.1_ASM906v1_genomic.gbff -d 5 -q 30 -r 0.9 -p 4
##VcfFile=sample5.vcf.gz
##Total number of SNPs before snpToolkit processing: 406
##The options -f and -e were not used
##Filtred SNPs. Among the 406 SNPs, the number of those with a quality score >= 30, a depth >= 5 and a ratio >= 0.9 is: 218
##After mapping, SNPs were located in:
##NC_003131.1: Yersinia pestis CO92 plasmid pCD1, complete sequence 70305 bp
##NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp
##The mapped and annotated SNPs are distributed as follow:
##Location      Genes   RBS     tRNA    rRNA    ncRNA   Pseudogenes     intergenic      Synonymous      NonSynonumous
##SNPs in NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp 155     0       0       1       0       0       57      54      101
##SNPs in NC_003131.1: Yersinia pestis CO92 plasmid pCD1, complete sequence 70305 bp    2       0       0       0       0       0       3       1       1
##Syn=Synonymous NS=Non-Synonymous
##Coordinates   REF     SNP     Depth   Nb of reads REF Nb reads SNPs   Ratio   Quality Annotation      Product Orientation     Coordinates in gene     Ref codon       SNP codon       Ref AA  SNP AA  Coordinates protein     Effect  Location
82      C       A       36      0       34      1.0     138.0   intergenic      .       +       .       -       -       -       -       -       -       NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp
130     G       C       28      0       27      1.0     144.0   intergenic      .       +       .       -       -       -       -       -       -       NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp
855     G       A       69      0       62      1.0     228.0   YPO_RS01010|asnC        transcriptional regulator AsnC  -       411     ACC     AC[T]   T       T       137     Syn     NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp

The first lines of the snptoolkit file for SNPs contain a summary and useful information. The SNPs annotation is organized in tab delimited table. The columns of this table are:

Column name Description
Coordinates SNP coordinate
REF Reference allele
SNP New allele in analyzed sample
Depth Total depth of coverage
Nb of reads REF Number of reads with the reference allele
Nb reads SNPs Number of reads with the new allele
Ratio Nb reads SNPs/(Nb of reads REF+Nb reads SNPs)
Quality Quality score
Annotation Distribution within genes or intergenic
Product Functional product of the gene
Orientation Gene orientation
Coordinates in gene Coordinate of the SNP within the gene
Ref codon Reference codon, ACC in the example above
SNP codon New codon, AC[T]
Ref AA Amino Acid corresponding to reference codon
SNP AA Amino Acid corresponding to new codon
Coordinates protein Coordinate of the Amino Acid
Effect Could be Synonymous (Syn) or Non-Synonymous (NS)
Location ID of the chromosome and plasmids.

Warning

In the example above, the total depth for the first SNP is 36, while the number of reads that carry the reference allele plus the number of reads that carry the new allele is equal to 34. The VCF file corresponding to that sample is generated using samtools mpileup. By default, samtools mpileup applies Phred-scaled probability of a read base being misaligned, known as BAQ. As indicated in samtools documentation, this greatly helps to reduce false SNPs caused by misalignments. The total depth shown by snpToolkit is the raw depth taking into account all reads (column 4). However, the columns 5 and 6 show the number of reads with Phred-scaled probability. The ratio in column 7 is based only on column 5 and 6. I have made this decision to store as much information as possible from the original VCF file. If the VCF files where produced using samtools-mpileup with the option -B to skip Phred-scaled probability, you will not see such difference.

4.3.2. Example of INDELS output file

The indels output is in tab delimited format as follows:

55732   CCGGGGCGGGGCGGGGCGG CCGGGGCGGGGCGG  62  0   20  1.0 228.0   intergenic  .   .   deletion    5   NC_003131.1: Yersinia pestis CO92 plasmid pCD1, complete sequence 70305 bp
35188   T   TTC 41  0   32  1.0 228.0   intergenic  .   .   insertion   2   NC_003134.1: Yersinia pestis CO92 plasmid pMT1, complete sequence 96210 bp
73418   GAA GA  72  0   68  1.0 228.0   intergenic  .   .   deletion    1   NC_003134.1: Yersinia pestis CO92 plasmid pMT1, complete sequence 96210 bp
16  AC  A   13  0   13  1.0 149.0   intergenic  .   .   deletion    1   NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp
183029  CCAATAACAAT CCAATAACAATAACAAT   95  0   24  1.0 228.0   intergenic  .   .   insertion   6   NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp
266466  AGGGGGGGG   AGGGGGGGGG  40  1   25  0.96    66.0    CDS YPO_RS02340|YPO_RS02340 EscV/YscV/HrcV family type III secretion system export apparatus protein    insertion   1   NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp
552919  TGGGGGGG    TGGGGGGGG   93  0   71  1.0 122.0   CDS YPO_RS03585|tssM    type VI secretion system membrane subunit TssM  insertion   1   NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp
581519  GTTCAATTCAATTCAAT   GTTCAATTCAATTCAATTCAAT  31  0   9   1.0 228.0   intergenic  .   .   insertion   5   NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp
747924  AGGGGGGGG   AGGGGGGGGG  41  1   26  0.96    71.0    CDS YPO_RS04395|YPO_RS04395 pseudopilin insertion   1   NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp
813977  GC  GCCTGGCCATC 54  0   10  1.0 228.0   CDS YPO_RS04755|YPO_RS04755 DASS family sodium-coupled anion symporter  insertion   9   NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp

for the case of the position 266466 for example

266466  AGGGGGGGG   AGGGGGGGGG  40  1   25  0.96    66.0    CDS YPO_RS02340|YPO_RS02340 EscV/YscV/HrcV family type III secretion system export apparatus protein    insertion   1   NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp

The different columns are:

Column number Description
1 Coordinates (266466)
2 Reference (AGGGGGGGG)
3 Sample (AGGGGGGGGG)
4 Number of total reads (40)
5 Number of reads with reference sequence (1)
6 Number of reads with new sequence (25)
7 Ratio (0.96)
8 Quality score (66.0)
9 Location (CDS)
10 Gene or intergenic (YPO_RS02340|YPO_RS02340)
11 Gene product (EscV/YscV/HrcV family type III secretion system export apparatus protein)
12 Type of indel (insertion)
13 Number of nucleotide (1)
14 Sequence name (NC_003143.1: Yersinia pestis CO92, complete genome 4653728 bp)

Note

While snpToolkit annotate indels, it is important to be careful and check any indels you are interested in before to elaborate any hypothesis and conclusions.