Bam Files

BAM is a binary format for reads and accompanying information about their alignments to a reference genome. They are for instance generated by bwa, a popular alignment program. SAM is the human readable version of BAM. A full description of both formats is available here: https://samtools.github.io/hts-specs/SAMv1.pdf

samtools: Converting, Filtering SAM and BAM Files

A commandline-tool called samtools can be used to work with BAM and SAM. You can run samtools without any parameters to get an overview of parameters and options. The command man samtools shows you a longer documentation.

The basic pattern of usage for samtools is

samtools [command] [-paramter] [-parameter] bam/sam-file

A useful command is view which converts a BAM file to SAM. The command tview allows you to view the alignments.

I prepared two files that you can try these commands on:

/data/bam/workshop1.bam
/data/bam/workshop2.bam

Exercise

Run samtools tview on workshop2.bam and use space to scroll until you see an alignment. Try ? to see further options.

Hint

If you want to find a position with lots of data, check out samtools tview -p 17:6944949 /data/bam/workshop1.bam. You can also add the reference genome, which will render the reads differently: samtools tview -p 17:6944949 /data/bam/workshop1.bam --reference /data/bam/whole_genome.fa

Exercise

Run samtools view workshop1.bam | less. What chromosome do the first couple of sequences align to? How does the output change if you use -h which adds the SAM header information?

The view command can also be instructed to print specific regions (as long as the bam file is sorted and indexed): samtools view workshop1.bam 17 will only print alignments on chromosome 17 and samtools view workshop1.bam 17:6944949-6947242 only alignments overlapping the specified coordinates.

Samtools view also allows for alignments to be filtered. You can specify which alignments should be printed using the option -f or exclude alignments with -F. These options are followed by a integer for the flags that should be set/not set (see format description pdf). samtools view -f 16 for instance, selects all sequences that align on the minus strand.

Bash programming

To get some basic statistics from BAM-files it is useful to know a couple of shell-commands and how to chain them together. You can get a full decription on these commands by running man [command]. Commands can be chained together in bash to form longer commands using the pipe-operator |. The output of each command in the chain will be used as input for the next command.

command1 | command2 | command3

Here are some basic shell-commands:

head and tail
The commands head and tail give you the first 10 lines of the beginning and 10 lines of the end of a file, respectively. Parameter -n can be used to specify how many lines.
wc
The command wc (for “word count”) counts the number of characters, words and lines in a document. Use parameter -c for just the character count, -w for just the words and -l for just the number of lines.

Exercise

Run samtools view /data/bam/workshop1.bam | head. Count the number of lines that are printed using wc. Specify the number of lines to be printed with head -n.

Exercise

How many sequences in total are aligned workshop1.bam? How many of these are single reads (not paired end)?

cut
Will only print specific columns of a text table. Parameter -f specifies which column you want printed. cut -f 1,2 will for instance print the first and second column from a file.

Exercise

Run samtools view /data/bam/workshop1.bam | cut -f 3. What does this column of the SAM format mean?

tr
This command (for “transpose”) will substitute a set of characters by diffent characters. For instance, tr 'ABC' 'abc' will substitute all occurrences of A, B or C by a, b and c, respectively. The command can also delete characters. tr -d "N" will for instance delete all occurrences of “N” from the input.
grep
grep will only print lines matching a specific pattern. Example: grep AAAACCCC will print all lines containing “AAAACCCC”. grep "^Word" will print lines starting with “Word”.

Exercise

Convert the sequences from the SAM output of workshop1.bam to lowercase using tr. Use grep to check for sequences that contain the character “N”.

sort
Sorts the input alphabetically. Use option -k to specify the column to sort by and -n if you want to sort numerically
uniq
Only print unique occurrences of lines on the input. Input must be sorted (see sort). Use option -c if you’d like to get counts of occurrences.

Exercise

How many sequences align to each chromosome in workshop1.bam and workshop2.bam? Seeing how many sequences align to chromosome X and chromosome 7 (which is similar in size to X) for workshop2.bam, would you say this individual is male or female?

Basic programming in AWK

awk is a simple programming language that is particularly useful when processing line-wise input. The basic format of any awk program looks like this:

BEGIN{ }
{ }
END{ }

Everything in curly brackets in the first line is going to be done before the first line is read. The middle line specifies everything that should be done for each line. The last line says what should be done after the last line. A simple awk program that counts the number of lines may be written like this:

BEGIN{ line=0 }
{ line=line+1 }
END{ print line }

Since awk keeps track of the number of a line in the variable /NR/, you can simplify this program to just one line:

END{ print NR }

The formatting of awk programs doesn’t matter. This makes it easy to specify programs on one line inbetween other shell-commands. For instance:

# long version:
samtools view workshop1.bam 17 | awk 'BEGIN{ line=0 }{ line=line+1 }END{ print line }'

# simplified version:
samtools view workshop1.bam 17 | awk 'END{ print NR }'

Awk can also select specific columns (like cut does). To refer to a specific column, you add a $ before the number of the column. This prints the first column from a file:

{ print $1 }

To count the characters in each line (like wc -c), you can use the function length():

{ print length($1) }

Exercise

Calculate the average size of sequences in workshop1.bam and workshop2.bam. Select only sequences that are not paired-end.

Exercise

Calculate the number of GC and AT bases in workshop2.bam. Extra question: is the GC content in workshop1.bam different and why?

Calculating coverage with samtools

samtools depth gives the number of sequences covering sites. With -a, all positions are given, also those not covered. Example:

samtools depth /data/bam/workshop1.bam | less

Exercise

Calculate the average coverage on chrX and chr7 for workshop2.bam.

Exercise

Calculate the average coverage for the region 17:6944949-6947242 on workshop1.bam.

EXTRA: Genotype calling from bam files

When several sequences overlap a position in the nuclear genome, then the genotype of the carrier can be inferred. How this is done best in every case goes beyond the scope of the workshop. However, when ancient DNA damage is low, you can use samtools together with a program called bcftools to produce genotype calls in VCF format.

samtools mpileup -v -f [reference_genome] -I [input-bam] | bcftools call -m > output.vcf

The VCF format is described here: http://www.internationalgenome.org/wiki/Analysis/vcf4.0/

Exercise

Run the above command on workshop1.bam using the reference genome /data/bam/whole_genome.fa. Have a look at the output and see how many differences you observe to the human reference.

The solution notebook for this session is here