sctools package¶
Submodules¶
sctools.bam module¶
Tools for Manipulating SAM/BAM format files¶
This module provides functions and classes to subsample reads from bam files that correspond to specific chromosomes, split bam files into chunks, assign tags to bam files from paired fastq records, and iterate over sorted bam files by one or more tags.
This module makes heavy use of the pysam wrapper for HTSlib, a high-performance C library designed to manipulate sam files.
- iter_tag_groups function to iterate over reads by an arbitrary tag
- iter_cell_barcodes wrapper for iter_tag_groups that iterates over cell barcode tags
- iter_genes wrapper for iter_tag_groups that iterates over gene tags
- iter_molecule_barcodes wrapper for iter_tag_groups that iterates over molecule tags
- sort_by_tags_and_queryname sort bam by given list of zero or more tags, followed by query name
- verify_sort verifies whether bam is correctly sorted by given list of tags, then query name
Classes¶
- SubsetAlignments class to extract reads specific to requested chromosome(s)
- Tagger class to add tags to sam/bam records from paired fastq records
- AlignmentSortOrder abstract class to represent alignment sort orders
- QueryNameSortOrder alignment sort order by query name
- TagSortableRecord class to facilitate sorting of pysam.AlignedSegments
- SortError error raised when sorting is incorrect
References
htslib : https://github.com/samtools/htslib
- class sctools.bam.AlignmentSortOrder[source]¶
Bases:
object
The base class of alignment sort orders.
- abstract property key_generator: Callable[pysam.libcalignedsegment.AlignedSegment, Any]¶
Returns a callable function that calculates a sort key from given pysam.AlignedSegment.
- class sctools.bam.QueryNameSortOrder[source]¶
Bases:
sctools.bam.AlignmentSortOrder
Alignment record sort order by query name.
- property key_generator¶
Returns a callable function that calculates a sort key from given pysam.AlignedSegment.
- exception sctools.bam.SortError[source]¶
Bases:
Exception
- args¶
- with_traceback()¶
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- class sctools.bam.SubsetAlignments(alignment_file: str, open_mode: Optional[str] = None)[source]¶
Bases:
object
Wrapper for pysam/htslib that extracts reads corresponding to requested chromosome(s)
- Parameters
alignment_file (str) – sam or bam file
open_mode ({'r', 'rb', None}, optional) – open mode for pysam.AlignmentFile. ‘r’ indicates a sam file, ‘rb’ indicates a bam file, and None attempts to autodetect based on the file suffix (Default = None)
- indices_by_chromosome()[source]¶
returns indices to line numbers containing the requested number of reads for a specified chromosome
Notes
samtools is a good general-purpose tool that is capable of most subsampling tasks. It is a good idea to check the samtools documentation when approaching these types of tasks.
References
samtools documentation : http://www.htslib.org/doc/samtools.html
- indices_by_chromosome(n_specific: int, chromosome: str, include_other: int = 0) Union[List[int], Tuple[List[int], List[int]]] [source]¶
Return the list of first n_specific indices of reads aligned to chromosome.
- Parameters
n_specific (int) – Number of aligned reads to return indices for
chromosome (str) – Only reads from this chromosome are considered valid
include_other (int, optional) – The number of reads to include that are NOT aligned to chromosome. These can be aligned or unaligned reads (default = 0).
- Returns
chromosome_indices (List[int]) – list of indices to reads aligning to chromosome
other_indices (List[int], optional) – list of indices to reads NOT aligning to chromosome, only returned if include_other is not 0.
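The selection logic described above can be sketched without pysam. This is only an illustration, not the sctools implementation: `first_indices_by_chromosome` is a hypothetical helper, and a plain list of reference names (with `None` for unaligned reads) stands in for the alignment file. Unlike the documented method, the sketch always returns both lists.

```python
from typing import Iterable, List, Optional, Tuple


def first_indices_by_chromosome(
    reference_names: Iterable[Optional[str]],
    n_specific: int,
    chromosome: str,
    include_other: int = 0,
) -> Tuple[List[int], List[int]]:
    """Collect indices of the first reads on ``chromosome`` and, optionally, elsewhere."""
    chromosome_indices: List[int] = []
    other_indices: List[int] = []
    for i, name in enumerate(reference_names):
        if name == chromosome:
            if len(chromosome_indices) < n_specific:
                chromosome_indices.append(i)
        elif len(other_indices) < include_other:
            other_indices.append(i)
        # stop reading as soon as both quotas are filled
        if len(chromosome_indices) == n_specific and len(other_indices) == include_other:
            break
    return chromosome_indices, other_indices


# one entry per read; None marks an unaligned read (made-up data)
names = ["chr1", "chr19", None, "chr19", "chr2", "chr19"]
specific, other = first_indices_by_chromosome(
    names, n_specific=2, chromosome="chr19", include_other=1
)
```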
- class sctools.bam.TagSortableRecord(tag_keys: Iterable[str], tag_values: Iterable[str], query_name: str, record: Optional[pysam.libcalignedsegment.AlignedSegment] = None)[source]¶
Bases:
object
Wrapper for pysam.AlignedSegment that facilitates sorting by tags and query name.
- classmethod from_aligned_segment(record: pysam.libcalignedsegment.AlignedSegment, tag_keys: Iterable[str]) sctools.bam.TagSortableRecord [source]¶
Create a TagSortableRecord from a pysam.AlignedSegment and list of tag keys
- class sctools.bam.Tagger(bam_file: str)[source]¶
Bases:
object
Add tags to a bam file from tag generators.
- Parameters
bam_file (str) – Bam file that tags are to be added to.
- tag()[source]¶
tag bam records given tag_generators (often generated from paired bam or fastq files)
- tag(output_bam_name: str, tag_generators) None [source]¶
Add tags to bam_file.
Given a bam file and tag generators derived from files sharing the same sort order, adds tags to the .bam file, and writes the resulting file to output_bam_name.
- Parameters
output_bam_name (str) – Name of output tagged bam.
tag_generators (List[fastq.TagGenerator]) – list of generators that yield fastq.Tag objects
- sctools.bam.get_barcode_for_alignment(alignment: pysam.libcalignedsegment.AlignedSegment, tags: List[str], raise_missing: bool) str [source]¶
Get the barcode for an Alignment
- Parameters
alignment (pysam.AlignedSegment) – An Alignment from pysam.
tags (List[str]) – Tags in the bam that might contain barcodes. If multiple Tags are passed, will return the contents of the first tag that contains a barcode.
raise_missing (bool) – Raise an error if no barcodes can be found.
- Returns
str A barcode for the alignment, or None if one is not found and raise_missing is False.
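The tag fallback order can be illustrated with a dict standing in for the tags on a read. This is a sketch under assumptions: `first_barcode` is a hypothetical name, and `RuntimeError` is borrowed from the documentation of `split`; the exception type raised by `get_barcode_for_alignment` itself is not stated here.

```python
from typing import Dict, List, Optional


def first_barcode(
    tags_on_read: Dict[str, str], tags: List[str], raise_missing: bool
) -> Optional[str]:
    """Return the value of the first tag in ``tags`` that is present on the read."""
    for tag in tags:
        if tag in tags_on_read:
            return tags_on_read[tag]
    if raise_missing:
        # assumed exception type, see the note above
        raise RuntimeError(f"read carries none of the tags {tags}")
    return None


# the corrected barcode (CB) wins when present, the raw barcode (CR) is the fallback
corrected = first_barcode({"CB": "ACGT", "CR": "ACGA"}, ["CB", "CR"], raise_missing=True)
raw_only = first_barcode({"CR": "TTGA"}, ["CB", "CR"], raise_missing=True)
missing = first_barcode({}, ["CB", "CR"], raise_missing=False)
```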
- sctools.bam.get_barcodes_from_bam(in_bam: str, tags: List[str], raise_missing: bool) Set[str] [source]¶
Get all the distinct barcodes from a bam
- Parameters
in_bam (str) – Input bam file.
tags (List[str]) – Tags in the bam that might contain barcodes.
raise_missing (bool) – Raise an error if no barcodes can be found.
- Returns
set A set of barcodes found in the bam. This set will not contain a None value.
- sctools.bam.get_tag_or_default(alignment: pysam.libcalignedsegment.AlignedSegment, tag_key: str, default: Optional[str] = None) Optional[str] [source]¶
Extracts the value associated to tag_key from alignment, and returns a default value if the tag is not present.
- sctools.bam.iter_cell_barcodes(bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment]) Generator [source]¶
Iterate over all the cells of a bam file sorted by cell.
- Parameters
bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over
- Yields
grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique cell barcode tag
current_tag (str) – the cell barcode that reads in the group all share
- sctools.bam.iter_genes(bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment]) Generator [source]¶
Iterate over all the genes of a bam file sorted by gene.
- Parameters
bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over
- Yields
grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique gene name tag
current_tag (str) – the gene id that reads in the group all share
- sctools.bam.iter_molecule_barcodes(bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment]) Generator [source]¶
Iterate over all the molecules of a bam file sorted by molecule.
- Parameters
bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over
- Yields
grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique molecule barcode tag
current_tag (str) – the molecule barcode that records in the group all share
- sctools.bam.iter_tag_groups(tag: str, bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment], filter_null: bool = False) Generator [source]¶
Iterates over reads and yields them grouped by the provided tag value
- Parameters
tag (str) – BAM tag to group over
bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over
filter_null (bool, optional) – If False, all reads that lack the requested tag are yielded together. Else, all reads that lack the tag will be discarded (default = False).
- Yields
grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique value of tag
current_tag (str) – the tag that reads in the group all share
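Grouping a tag-sorted stream of reads amounts to a run-length grouping on the tag value. The sketch below uses `itertools.groupby` over plain dicts; the `Read` alias and `group_by_tag` are hypothetical stand-ins for `pysam.AlignedSegment` and `iter_tag_groups`, and the actual sctools implementation may differ.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Read = Dict[str, str]  # hypothetical stand-in for pysam.AlignedSegment


def group_by_tag(
    tag: str, reads: Iterable[Read], filter_null: bool = False
) -> Iterator[Tuple[List[Read], Optional[str]]]:
    """Yield (reads, tag_value) for each run of consecutive reads sharing a tag value."""
    for value, group in groupby(reads, key=lambda read: read.get(tag)):
        if value is None and filter_null:
            continue  # drop reads that lack the tag
        yield list(group), value


# reads must already be sorted by the tag for the groups to be unique
reads = [
    {"name": "r1", "CB": "AAAC"},
    {"name": "r2", "CB": "AAAC"},
    {"name": "r3", "CB": "GGTT"},
    {"name": "r4"},
]
groups = [(value, [r["name"] for r in grp]) for grp, value in group_by_tag("CB", reads)]
kept = [value for _, value in group_by_tag("CB", reads, filter_null=True)]
```

The yield order (group first, tag value second) follows the Yields section above.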
- sctools.bam.merge_bams(bams: List[str]) str [source]¶
Merge input bams using samtools.
This cannot be a local function within split because then Python “cannot pickle a local object”.
- Parameters
bams (List[str]) – Name of the final bam followed by the bams to merge. Because of how it is called using multiprocessing, the bam basename is the first element of the list.
- Returns
The output bam name.
- sctools.bam.sort_by_tags_and_queryname(records: Iterable[pysam.libcalignedsegment.AlignedSegment], tag_keys: Iterable[str]) Iterable[pysam.libcalignedsegment.AlignedSegment] [source]¶
Sorts the given bam records by the given tags, followed by query name. If no tags are given, just sorts by query name.
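Sorting by tags and then query name can be expressed as a sort on a tuple key. In this sketch, dicts stand in for alignment records, `sort_key` and `sort_reads` are hypothetical helpers, and treating a missing tag as an empty string is an assumption.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Read = Dict[str, str]  # hypothetical stand-in for pysam.AlignedSegment


def sort_key(read: Read, tag_keys: Sequence[str]) -> Tuple[str, ...]:
    """Tag values in the requested order, followed by the query name."""
    # assumption: a missing tag sorts first, as an empty string
    return tuple(read.get(key, "") for key in tag_keys) + (read["name"],)


def sort_reads(reads: Iterable[Read], tag_keys: Sequence[str]) -> List[Read]:
    return sorted(reads, key=lambda read: sort_key(read, tag_keys))


reads = [
    {"name": "q2", "CB": "GG", "UB": "AA"},
    {"name": "q1", "CB": "GG", "UB": "AA"},
    {"name": "q3", "CB": "AA", "UB": "TT"},
    {"name": "q0"},
]
by_tags = [r["name"] for r in sort_reads(reads, ["CB", "UB"])]
by_name = [r["name"] for r in sort_reads(reads, [])]  # no tags: query name only
```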
- sctools.bam.split(in_bams: List[str], out_prefix: str, tags: List[str], approx_mb_per_split: float = 1000, raise_missing: bool = True, num_processes: Optional[int] = None) List[str] [source]¶
Split in_bams by tag into files of approximately approx_mb_per_split megabytes
- Parameters
in_bams (List[str]) – Input bam files.
out_prefix (str) – Prefix for all output files; output will be named as prefix_n where n is an integer equal to the chunk number.
tags (List[str]) – The bam tags to split on. The tags are checked in order, and sorting is done based on the first identified tag. Further tags are only checked if the first tag is missing. This is useful in cases where sorting is executed over a corrected barcode, but some records only have a raw barcode.
approx_mb_per_split (float) – The target file size for each chunk in mb
raise_missing (bool, optional) – if True, raise a RuntimeError if a record is encountered without a tag. Else silently discard the record (default = True)
num_processes (int, optional) – The number of processes to parallelize over. If not set, will use all available processes.
- Returns
output_filenames – list of filenames of bam chunks
- Return type
List[str]
- Raises
ValueError – when tags is empty
RuntimeError – when raise_missing is true and any passed read contains no tags
- sctools.bam.verify_sort(records: Iterable[sctools.bam.TagSortableRecord], tag_keys: Iterable[str]) None [source]¶
Raise AssertionError if the given records are not correctly sorted by the given tags and query name
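Verifying a sort order only requires comparing each record's key with the previous one. The sketch below works on precomputed key tuples and raises `ValueError`; both the helper name `check_sorted` and the exception type are placeholders chosen for this illustration, not the behavior of `verify_sort`.

```python
from typing import Iterable, Tuple


def check_sorted(keys: Iterable[Tuple[str, ...]]) -> None:
    """Raise ValueError at the first key that sorts before its predecessor."""
    previous = None
    for i, key in enumerate(keys):
        if previous is not None and key < previous:
            raise ValueError(f"record {i} sorts before record {i - 1}: {key} < {previous}")
        previous = key


# keys are (tag value, query name) tuples for a single tag
check_sorted([("AA", "q1"), ("AA", "q2"), ("CC", "q0")])  # passes silently

try:
    check_sorted([("CC", "q0"), ("AA", "q1")])
    out_of_order_detected = False
except ValueError:
    out_of_order_detected = True
```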
- sctools.bam.write_barcodes_to_bins(in_bam: str, tags: List[str], barcodes_to_bins: Dict[str, int], raise_missing: bool) List[str] [source]¶
Write barcodes to appropriate bins as defined by barcodes_to_bins
- Parameters
in_bam (str) – The bam file to read.
tags (List[str]) – Tags in the bam that might contain barcodes.
barcodes_to_bins (Dict[str, int]) – A Dict from barcode to bin. All barcodes of the same type need to be written to the same bin. These numbered bins are merged after parallelization so that all alignments with the same barcode are in the same bam.
raise_missing (bool) – Raise an error if no barcodes can be found.
- Returns
A list of paths to the written bins.
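The role of `barcodes_to_bins` can be sketched in memory: every read is routed to the bin of its barcode, so reads sharing a barcode always end up together. The round-robin assignment in `assign_bins` is purely an assumption for illustration (the documented `split` targets a file size instead), and both helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def assign_bins(barcodes: Iterable[str], n_bins: int) -> Dict[str, int]:
    """Assign each distinct barcode to a bin, round-robin (assumed strategy)."""
    return {barcode: i % n_bins for i, barcode in enumerate(sorted(set(barcodes)))}


def bin_reads(
    reads: Iterable[Tuple[str, str]], barcodes_to_bins: Dict[str, int]
) -> Dict[int, List[str]]:
    """Route (read name, barcode) pairs to the bin of their barcode."""
    bins: Dict[int, List[str]] = defaultdict(list)
    for name, barcode in reads:
        bins[barcodes_to_bins[barcode]].append(name)
    return dict(bins)


reads = [("r1", "AAAC"), ("r2", "GGTT"), ("r3", "AAAC"), ("r4", "CCAT")]
barcodes_to_bins = assign_bins((barcode for _, barcode in reads), n_bins=2)
bins = bin_reads(reads, barcodes_to_bins)
```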
sctools.barcode module¶
Nucleotide Barcode Manipulation Tools¶
This module contains tools to characterize oligonucleotide barcodes and a simple Hamming-based error-correction approach which corrects barcodes within a specified distance of a “whitelist” of expected barcodes.
Classes¶
- Barcodes class to characterize a set of barcodes
- ErrorsToCorrectBarcodesMap class to carry out error correction routines
- class sctools.barcode.Barcodes(barcodes: Mapping[str, int], barcode_length: int)[source]¶
Bases:
object
Container for a set of nucleotide barcodes.
Contained barcodes are encoded in 2bit representation for fast operations. Instances of this class can optionally be constructed from an iterable where barcodes can be present multiple times. In these cases, barcodes are analyzed based on their observed frequencies.
- Parameters
barcodes (Mapping[str, int]) – dictionary-like mapping barcodes to the number of times they were observed
barcode_length (int) – the length of all barcodes in the set. Different-length barcodes are not supported.
- base_frequency(weighted=False) numpy.ndarray [source]¶
return the frequency of each base at each position in the barcode set
Notes
Weighting is currently not supported; weighted must be set to False or base_frequency will raise NotImplementedError.
- Parameters
weighted (bool, optional) – if True, each barcode is counted once for each time it was observed (default = False)
- Returns
frequencies – barcode_length x 4 2d numpy array
- Return type
np.array
- Raises
NotImplementedError – if weighted is True
- effective_diversity(weighted=False) numpy.ndarray [source]¶
Returns the effective base diversity of the barcode set by position.
The maximum diversity for each position is 1, which represents a perfect split of 25% per base at that position.
- Parameters
weighted (bool, optional) – if True, each barcode is counted once for each time it was observed (default = False)
- Returns
effective_diversity – 1-d array of size barcode_length containing floats in [0, 1]
- Return type
np.array[float]
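One quantity with the stated properties (values in [0, 1], reaching 1 at a perfect 25% split) is the Shannon entropy of the base frequencies divided by log2(4) = 2. Whether sctools computes effective diversity exactly this way is an assumption; the sketch below is unweighted, uses plain lists instead of numpy arrays, and its function names merely mirror the documented ones.

```python
from collections import Counter
from math import log2
from typing import List, Sequence

BASES = "ACGT"


def base_frequency(barcodes: Sequence[str], barcode_length: int) -> List[List[float]]:
    """barcode_length x 4 table of base frequencies, columns ordered as BASES."""
    table = []
    for position in range(barcode_length):
        counts = Counter(barcode[position] for barcode in barcodes)
        table.append([counts[base] / len(barcodes) for base in BASES])
    return table


def effective_diversity(barcodes: Sequence[str], barcode_length: int) -> List[float]:
    """Per-position Shannon entropy in bits, divided by 2 so the maximum is 1."""
    result = []
    for row in base_frequency(barcodes, barcode_length):
        entropy = sum(p * log2(1 / p) for p in row if p > 0)
        result.append(entropy / 2)
    return result


barcodes = ["AA", "CA", "GA", "TA"]
frequencies = base_frequency(barcodes, 2)
diversity = effective_diversity(barcodes, 2)  # first position is balanced, second is all A
```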
- classmethod from_iterable_bytes(iterable: Iterable[bytes], barcode_length: int)[source]¶
Construct a Barcodes instance from an iterable of bytes barcodes.
- Parameters
iterable (Iterable[bytes]) – iterable of barcodes in bytes representation
barcode_length (int) – the length of the barcodes in iterable
- Returns
barcodes – class object containing barcodes from the iterable
- Return type
Barcodes
- classmethod from_iterable_encoded(iterable: Iterable[int], barcode_length: int)[source]¶
Construct a Barcodes instance from an iterable of encoded barcodes.
- Parameters
iterable (Iterable[int]) – iterable of barcodes encoded in TwoBit representation
barcode_length (int) – the length of the barcodes in iterable
- Returns
barcodes – class object containing barcodes from the iterable
- Return type
Barcodes
- classmethod from_iterable_strings(iterable: Iterable[str], barcode_length: int)[source]¶
Construct a Barcodes instance from an iterable of string barcodes.
- Parameters
iterable (Iterable[str]) – iterable of barcodes in string representation
barcode_length (int) – the length of the barcodes in iterable
- Returns
barcodes – class object containing barcodes from the iterable
- Return type
Barcodes
- classmethod from_whitelist(file_: str, barcode_length: int)[source]¶
Creates a barcode set from a whitelist file.
- Parameters
file_ (str) – location of the whitelist file. Should be formatted one barcode per line. Barcodes should be encoded in plain text (UTF-8, ASCII), not bit-encoded. Each barcode will be assigned a count of 1.
barcode_length (int) – Length of the barcodes in the file.
- Returns
barcodes – class object containing barcodes from a whitelist file
- Return type
Barcodes
- summarize_hamming_distances() Mapping[str, float] [source]¶
Returns descriptive statistics on hamming distances between pairs of barcodes.
- Returns
descriptive_statistics – minimum, 25th percentile, median, 75th percentile, maximum, and average hamming distance between all pairs of barcodes
- Return type
Mapping[str, float]
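Pairwise Hamming statistics can be sketched on plain strings with the standard library. The dictionary keys and the helper names are assumptions, and the two quartiles listed above are left out to keep the example short.

```python
from itertools import combinations
from statistics import mean, median
from typing import Dict, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


def summarize_hamming(barcodes: Sequence[str]) -> Dict[str, float]:
    """Descriptive statistics over the distances of all barcode pairs."""
    distances = [hamming(a, b) for a, b in combinations(barcodes, 2)]
    return {
        "minimum": min(distances),
        "median": median(distances),
        "maximum": max(distances),
        "average": mean(distances),
    }


summary = summarize_hamming(["AAAA", "AAAT", "AATT", "TTTT"])
```

The number of pairs grows quadratically with the number of barcodes, so this direct approach is only practical for small sets.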
- class sctools.barcode.ErrorsToCorrectBarcodesMap(errors_to_barcodes: Mapping[str, str])[source]¶
Bases:
object
Correct any barcode that is within one hamming distance of a whitelisted barcode
- Parameters
errors_to_barcodes (Mapping[str, str]) – dict-like mapping 1-base errors to the whitelist barcode that they could be generated from
- get_corrected_barcode(barcode: str)[source]¶
Return a barcode if it is in the whitelist, or the corrected version if within edit distance 1
- correct_bam(bam_file: str, output_bam_file: str)[source]¶
correct barcodes in a bam file, given a whitelist
References
https://en.wikipedia.org/wiki/Hamming_distance
- correct_bam(bam_file: str, output_bam_file: str) None [source]¶
Correct barcodes in a (potentially unaligned) bamfile, given a whitelist.
- Parameters
bam_file (str) – BAM format file in same order as the fastq files
output_bam_file (str) – BAM format file containing cell, umi, and sample tags.
- get_corrected_barcode(barcode: str) str [source]¶
Return a barcode if it is in the whitelist, or the corrected version if within edit distance 1
- Parameters
barcode (str) – the barcode to return the corrected version of. If the barcode is in the whitelist, the input barcode is returned unchanged.
- Returns
corrected_barcode – corrected version of the barcode
- Return type
str
- Raises
KeyError – if the passed barcode is not within 1 hamming distance of any whitelist barcode
- classmethod single_hamming_errors_from_whitelist(whitelist_file: str)[source]¶
Factory method to generate instance of class from a file containing “correct” barcodes.
- Parameters
whitelist_file (str) – Text file containing one barcode per line.
- Returns
errors_to_barcodes_map – instance of cls, built from whitelist
- Return type
ErrorsToCorrectBarcodesMap
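The idea behind the error map can be sketched as a dictionary built by enumerating every single-base substitution of each whitelist barcode. `single_error_map` is a hypothetical helper working on an in-memory list rather than a whitelist file; including N among the substituted bases and keeping the first claimant when two whitelist barcodes share a variant are both assumptions.

```python
from typing import Dict, Iterable


def single_error_map(whitelist: Iterable[str]) -> Dict[str, str]:
    """Map each whitelist barcode and its one-substitution variants to the barcode."""
    mapping: Dict[str, str] = {}
    for barcode in whitelist:
        mapping[barcode] = barcode  # exact matches always map to themselves
        for i, original in enumerate(barcode):
            for base in "ACGTN":
                if base != original:
                    variant = barcode[:i] + base + barcode[i + 1:]
                    # assumption: on collisions the first whitelist barcode wins
                    mapping.setdefault(variant, barcode)
    return mapping


corrector = single_error_map(["AAAA", "CCCC"])
fixed = corrector["AAAT"]      # one substitution away from AAAA
unchanged = corrector["CCCC"]  # already in the whitelist
uncorrectable = "AATT" not in corrector  # two substitutions away: lookup raises KeyError
```

A plain dict lookup raises `KeyError` for barcodes outside the map, which matches the Raises section of `get_corrected_barcode`.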
sctools.encodings module¶
Compressed Barcode Encoding Methods¶
This module defines several classes to encode DNA sequences in memory-efficient forms, using 2 bits to encode bases of a 4-letter DNA alphabet (ACGT) or 3 bits to encode a 5-letter DNA alphabet that includes the ambiguous call often included by Illumina base calling software (ACGTN). The classes also contain several methods useful for efficient querying and manipulation of the encoded sequence.
Classes¶
- Encoding encoder base class
- ThreeBit three bit DNA encoder / decoder
- TwoBit two bit DNA encoder / decoder
- class sctools.encodings.Encoding[source]¶
Bases:
object
- encoding_map¶
Class that mimics a Mapping[bytes, int] where bytes must be a single byte encoded character (encoder)
- Type
- decoding_map¶
Dictionary that maps integers to human-readable bytes representations (decoder)
- Type
Mapping[int, bytes]
- bits_per_base¶
number of bits used to encode each base
- Type
int
- decode(integer_encoded: int)[source]¶
decode a compressed DNA string into a human readable bytes format
- hamming_distance(a: int, b: int)[source]¶
calculate the hamming distance between two encoded DNA strings
- bits_per_base: int = NotImplemented¶
- decode(integer_encoded: int) bytes [source]¶
Decode an integer-encoded DNA string into bytes.
- Parameters
integer_encoded (int) – Integer encoded DNA string
- Returns
decoded – Bytes decoded DNA sequence
- Return type
bytes
- decoding_map: Mapping[int, AnyStr] = NotImplemented¶
- classmethod encode(bytes_encoded: bytes) int [source]¶
Encode a DNA bytes string.
- Parameters
bytes_encoded (bytes) – bytes DNA string
- Returns
encoded – Encoded DNA sequence
- Return type
int
- encoding_map: Mapping[AnyStr, int] = NotImplemented¶
- gc_content(integer_encoded: int) int [source]¶
Return the number of G or C nucleotides in integer_encoded
- Parameters
integer_encoded (int) – Integer encoded DNA string
- Returns
number of bases in integer_encoded input that are G or C.
- Return type
gc_content, int
- static hamming_distance(a, b) int [source]¶
Calculate the hamming distance between two DNA sequences
The hamming distance counts the number of bases that are not the same nucleotide
- Parameters
a (int) – integer encoded
b (int) – integer encoded
- Returns
d – hamming distance between a and b
- Return type
int
- class sctools.encodings.ThreeBit(*args, **kwargs)[source]¶
Bases:
sctools.encodings.Encoding
Encode a DNA sequence using a 3-bit encoding.
Since no bases are encoded as 0, an empty triplet is interpreted as the end of the encoded string. Three-bit encoding can therefore be used to encode and decode strings without knowledge of their length.
- encoding_map¶
Class that mimics a Mapping[bytes, int] where bytes must be a single byte encoded character (encoder)
- Type
- decoding_map¶
Dictionary that maps integers to human-readable bytes representations (decoder)
- Type
Mapping[int, bytes]
- bits_per_base¶
number of bits used to encode each base
- Type
int
- decode(integer_encoded: int)[source]¶
decode a compressed DNA string into a human readable bytes format
- hamming_distance(a: int, b: int)[source]¶
calculate the hamming distance between two encoded DNA strings
- class ThreeBitEncodingMap[source]¶
Bases:
object
Dict-like class that maps bytes to 3-bit integer representations
All IUPAC ambiguous codes are treated as “N”
- map_ = {65: 2, 67: 1, 71: 3, 78: 6, 84: 4, 97: 2, 99: 1, 103: 3, 110: 6, 116: 4}¶
- bits_per_base: int = 3¶
- classmethod decode(integer_encoded: int) bytes [source]¶
Decode an integer-encoded DNA string into bytes.
- Parameters
integer_encoded (int) – Integer encoded DNA string
- Returns
decoded – Bytes decoded DNA sequence
- Return type
bytes
- decoding_map: Mapping[int, bytes] = {1: b'C', 2: b'A', 3: b'G', 4: b'T', 6: b'N'}¶
- classmethod encode(bytes_encoded: bytes) int [source]¶
Encode a DNA bytes string.
- Parameters
bytes_encoded (bytes) – bytes DNA string
- Returns
encoded – Encoded DNA sequence
- Return type
int
- encoding_map: sctools.encodings.ThreeBit.ThreeBitEncodingMap = <sctools.encodings.ThreeBit.ThreeBitEncodingMap object>¶
- classmethod gc_content(integer_encoded: int) int [source]¶
Return the number of G or C nucleotides in integer_encoded
- Parameters
integer_encoded (int) – Integer encoded DNA string
- Returns
number of bases in integer_encoded input that are G or C.
- Return type
gc_content, int
- static hamming_distance(a: int, b: int) int [source]¶
Calculate the hamming distance between two DNA sequences
The hamming distance counts the number of bases that are not the same nucleotide
- Parameters
a (int) – integer encoded
b (int) – integer encoded
- Returns
d – hamming distance between a and b
- Return type
int
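A three-bit codec along these lines can be sketched with the values from `decoding_map` above. The bit order (first base in the lowest triplet), the use of `str` instead of `bytes`, and all function names are assumptions made for this illustration. With these code values only C (001) and G (011) have the lowest bit set, which makes GC counting a matter of reading one bit per triplet.

```python
ENCODE3 = {"C": 1, "A": 2, "G": 3, "T": 4, "N": 6}  # values from decoding_map above
DECODE3 = {value: base for base, value in ENCODE3.items()}


def encode3(sequence: str) -> int:
    encoded = 0
    for i, base in enumerate(sequence):
        encoded |= ENCODE3[base] << (3 * i)  # assumed order: first base in the low bits
    return encoded


def decode3(encoded: int) -> str:
    bases = []
    while encoded & 0b111:  # a zero triplet marks the end of the sequence
        bases.append(DECODE3[encoded & 0b111])
        encoded >>= 3
    return "".join(bases)


def gc_content3(encoded: int) -> int:
    count = 0
    while encoded:
        count += encoded & 1  # lowest bit of a triplet is set only for C and G
        encoded >>= 3
    return count


def hamming3(a: int, b: int) -> int:
    """Count differing triplets; meaningful for sequences of equal length."""
    difference, distance = a ^ b, 0
    while difference:
        if difference & 0b111:
            distance += 1
        difference >>= 3
    return distance


round_trip = decode3(encode3("ACGTN"))  # no length needed to decode
gc = gc_content3(encode3("ACGTN"))
distance = hamming3(encode3("ACGT"), encode3("ACGA"))
```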
- class sctools.encodings.TwoBit(sequence_length: int)[source]¶
Bases:
sctools.encodings.Encoding
Encode a DNA sequence using a 2-bit encoding.
Two-bit encoding uses 0 for an encoded nucleotide. As such, it cannot distinguish between the end of sequence and trailing A nucleotides, and thus decoding these strings requires knowledge of their length. Therefore, it is only appropriate for encoding fixed sequence lengths.
In addition, in order to encode in 2-bit, N-nucleotides must be randomized to one of A, C, G, and T.
- Parameters
sequence_length (int) – number of nucleotides that are being encoded
- encoding_map¶
Class that mimics a Mapping[bytes, int] where bytes must be a single byte encoded character (encoder)
- Type
- decoding_map¶
Dictionary that maps integers to human-readable bytes representations (decoder)
- Type
Mapping[int, bytes]
- bits_per_base¶
number of bits used to encode each base
- Type
int
- decode(integer_encoded: int)[source]¶
decode a compressed DNA string into a human readable bytes format
- hamming_distance(a: int, b: int)[source]¶
calculate the hamming distance between two encoded DNA strings
- class TwoBitEncodingMap[source]¶
Bases:
object
Dict-like class that maps bytes to 2-bit integer representations
Generates random nucleotides for ambiguous nucleotides e.g. N
- iupac_ambiguous: Set[int] = {66, 68, 72, 75, 77, 78, 82, 83, 86, 87, 89, 98, 100, 104, 107, 109, 110, 114, 115, 118, 119, 121}¶
- map_ = {65: 0, 67: 1, 71: 3, 84: 2, 97: 0, 99: 1, 103: 3, 116: 2}¶
- bits_per_base: int = 2¶
- decode(integer_encoded: int) bytes [source]¶
Decode an integer-encoded DNA string into bytes.
- Parameters
integer_encoded (int) – Integer encoded DNA string
- Returns
decoded – Bytes decoded DNA sequence
- Return type
bytes
- decoding_map: Mapping[int, bytes] = {0: b'A', 1: b'C', 2: b'T', 3: b'G'}¶
- classmethod encode(bytes_encoded: bytes) int [source]¶
Encode a DNA bytes string.
- Parameters
bytes_encoded (bytes) – bytes DNA string
- Returns
encoded – Encoded DNA sequence
- Return type
int
- encoding_map: sctools.encodings.TwoBit.TwoBitEncodingMap = <sctools.encodings.TwoBit.TwoBitEncodingMap object>¶
- gc_content(integer_encoded: int) int [source]¶
Return the number of G or C nucleotides in integer_encoded
- Parameters
integer_encoded (int) – Integer encoded DNA string
- Returns
number of bases in integer_encoded input that are G or C.
- Return type
gc_content, int
- static hamming_distance(a: int, b: int) int [source]¶
Calculate the hamming distance between two DNA sequences
The hamming distance counts the number of bases that are not the same nucleotide
- Parameters
a (int) – integer encoded
b (int) – integer encoded
- Returns
d – hamming distance between a and b
- Return type
int
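The need for a fixed sequence length can be seen in a small two-bit sketch built on the values from `decoding_map` above. The bit order (first base in the lowest pair) is an assumption chosen so that trailing A nucleotides disappear, as the class description states; the function names are hypothetical, `str` is used instead of `bytes`, and the randomization of N is left out.

```python
ENCODE2 = {"A": 0, "C": 1, "T": 2, "G": 3}  # values from decoding_map above
DECODE2 = {value: base for base, value in ENCODE2.items()}


def encode2(sequence: str) -> int:
    encoded = 0
    for i, base in enumerate(sequence):
        encoded |= ENCODE2[base] << (2 * i)  # assumed order: first base in the low bits
    return encoded


def decode2(encoded: int, sequence_length: int) -> str:
    bases = []
    for _ in range(sequence_length):  # the length must be supplied: A encodes as 0
        bases.append(DECODE2[encoded & 0b11])
        encoded >>= 2
    return "".join(bases)


def gc_content2(encoded: int) -> int:
    count = 0
    while encoded:
        count += encoded & 1  # lowest bit of a pair is set only for C and G
        encoded >>= 2
    return count


def hamming2(a: int, b: int) -> int:
    """Count differing pairs; meaningful for sequences of equal length."""
    difference, distance = a ^ b, 0
    while difference:
        if difference & 0b11:
            distance += 1
        difference >>= 2
    return distance


ambiguous = encode2("CAA") == encode2("C")  # trailing A's leave no trace
round_trip = decode2(encode2("GATTACA"), 7)
gc = gc_content2(encode2("GATTACA"))
distance = hamming2(encode2("GATTACA"), encode2("GATTAGA"))
```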
sctools.fastq module¶
Efficient Fastq Iterators and Representations¶
This module implements classes for representing fastq records, reading and writing them, and extracting parts of fastq sequence for transformation into bam format tags
- extract_barcode extract a barcode, defined by embedded_barcode, from record
Classes¶
- Record Represents fastq records (input as bytes)
- StrRecord Represents fastq records (input as str)
- Reader Opens and iterates over fastq files
- EmbeddedBarcodeGenerator Generates barcodes from a fastq file
- BarcodeGeneratorWithCorrectedCellBarcodes Generates (corrected) barcodes from a fastq file
References
https://en.wikipedia.org/wiki/FASTQ_format
- class sctools.fastq.BarcodeGeneratorWithCorrectedCellBarcodes(fastq_files: Union[str, Iterable[str]], embedded_cell_barcode: sctools.fastq.Tag, whitelist: str, other_embedded_barcodes: Iterable[sctools.fastq.Tag] = (), *args, **kwargs)[source]¶
Bases:
sctools.fastq.Reader
Generate barcodes from FASTQ file(s) from positions defined by EmbeddedBarcode(s)
Extracted barcode objects are produced in a form that is consumable by pysam’s bam and sam set_tag methods. In this class, one EmbeddedBarcode must be defined as an embedded_cell_barcode, which is checked against a whitelist and error corrected during generation
- Parameters
fastq_files (str | List, optional) – FASTQ file or files to be read. (default = sys.stdin)
mode ({'r', 'rb'}, optional) – open mode for fastq files. If ‘r’, return string. If ‘rb’, return bytes (default = ‘r’)
whitelist (str) – whitelist file containing “correct” cell barcodes for an experiment
embedded_cell_barcode (EmbeddedBarcode) – EmbeddedBarcode containing information about the position and names of cell barcode tags
other_embedded_barcodes (Iterable[EmbeddedBarcode], optional) – tag objects defining start and end of the sequence containing the tag, and the tag identifiers for sequence and quality tags (default = ())
- extract_cell_barcode(record: Tuple[str], cb: sctools.fastq.Tag)[source]¶
Extract a cell barcode from a fastq record
- Parameters
record (Tuple[str]) – fastq record comprised of four strings: name, sequence, name2, and quality
cb (EmbeddedBarcode) – defines the position and tag identifier for a cell barcode
- Returns
sequence_tag (Tuple[str, str, ‘Z’]) – raw sequence tag identifier, sequence, SAM tag type (‘Z’ implies a string tag)
quality_tag (Tuple[str, str, ‘Z’]) – quality tag identifier, quality, SAM tag type (‘Z’ implies a string tag)
corrected_tag (Optional[Tuple[str, str, ‘Z’]]) – Whitelist verified sequence tag. Only present if the raw sequence tag is in the whitelist or within 1 hamming distance of one of its barcodes
- property filenames: List[str]¶
- select_record_indices(indices: Set) Generator ¶
Iterate over provided indices only, skipping other records.
- Parameters
indices (Set[int]) – indices to include in the output
- Yields
record, str – records from file corresponding to indices
- property size: int¶
return the collective size of all files being read in bytes
- sctools.fastq.EmbeddedBarcode¶
alias of
sctools.fastq.Tag
- class sctools.fastq.EmbeddedBarcodeGenerator(fastq_files, embedded_barcodes, *args, **kwargs)[source]¶
Bases:
sctools.fastq.Reader
Generate barcodes from FASTQ file(s) from positions defined by EmbeddedBarcode(s)
Extracted barcode objects are produced in a form that is consumable by pysam’s bam and sam set_tag methods.
- Parameters
embedded_barcodes (Iterable[EmbeddedBarcode]) – tag objects defining start and end of the sequence containing the tag, and the tag identifiers for sequence and quality tags
fastq_files (str | List, optional) – FASTQ file or files to be read. (default = sys.stdin)
mode ({'r', 'rb'}, optional) – open mode for FASTQ files. If ‘r’, return string. If ‘rb’, return bytes (default = ‘r’)
- property filenames: List[str]¶
- select_record_indices(indices: Set) Generator ¶
Iterate over provided indices only, skipping other records.
- Parameters
indices (Set[int]) – indices to include in the output
- Yields
record, str – records from file corresponding to indices
- property size: int¶
return the collective size of all files being read in bytes
- class sctools.fastq.Reader(files='-', mode='r', header_comment_char=None)[source]¶
Bases:
sctools.reader.Reader
Fastq Reader that defines some special methods for reading and summarizing FASTQ data.
Simple reader class that exposes an __iter__ and __len__ method
References
https://en.wikipedia.org/wiki/FASTQ_format
- property filenames: List[str]¶
- select_record_indices(indices: Set) Generator [source]¶
Iterate over provided indices only, skipping other records.
- Parameters
indices (Set[int]) – indices to include in the output
- Yields
record, str – records from file corresponding to indices
- property size: int¶
return the collective size of all files being read in bytes
- class sctools.fastq.Record(record: Iterable[AnyStr])[source]¶
Bases:
object
Fastq Record.
- Parameters
record (Iterable[bytes]) – Iterable of 4 bytes strings that comprise a fastq record
- name¶
fastq record name
- Type
bytes
- sequence¶
fastq nucleotide sequence
- Type
bytes
- name2¶
second fastq record name field (rarely used)
- Type
bytes
- quality¶
base call quality for each nucleotide in sequence
- Type
bytes
- property name: AnyStr¶
- property name2: AnyStr¶
- property quality: AnyStr¶
- property sequence: AnyStr¶
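The four-field record layout can be illustrated by reading a FASTQ stream four lines at a time. This is a standalone sketch, not the sctools Reader: `FastqRecord` and `read_fastq` are hypothetical names, and an in-memory `io.StringIO` stands in for a file.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    name2: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return  # clean end of file
        if not lines[3]:
            raise ValueError("truncated FASTQ record")
        yield FastqRecord(*(line.rstrip("\n") for line in lines))


text = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n"
records = list(read_fastq(io.StringIO(text)))
```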
- class sctools.fastq.StrRecord(record: Iterable[AnyStr])[source]¶
Bases:
sctools.fastq.Record
Fastq Record.
- Parameters
record (Iterable[str]) – Iterable of 4 strings that comprise a FASTQ record
- name¶
FASTQ record name
- Type
str
- sequence¶
FASTQ nucleotide sequence
- Type
str
- name2¶
second FASTQ record name field (rarely used)
- Type
str
- quality¶
base call quality for each nucleotide in sequence
- Type
str
- property name: str¶
- property name2: AnyStr¶
- property quality: AnyStr¶
- property sequence: AnyStr¶
- sctools.fastq.extract_barcode(record, embedded_barcode) Tuple[Tuple[str, str, str], Tuple[str, str, str]] [source]¶
Extracts barcodes from a FASTQ record at positions defined by an EmbeddedBarcode object.
- Parameters
record (FastqRecord) – Record to extract from
embedded_barcode (EmbeddedBarcode) – Defines the barcode start and end positions and the tag name for the sequence and quality tags
- Returns
sequence_tag (Tuple[str, str, ‘Z’]) – sequence tag identifier, sequence, SAM tag type (‘Z’ implies a string tag)
quality_tag (Tuple[str, str, ‘Z’]) – quality tag identifier, quality, SAM tag type (‘Z’ implies a string tag)
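Extracting a barcode comes down to slicing the sequence and quality strings at the same positions and wrapping each slice as a SAM string tag. In this sketch `BarcodeSpec` is a hypothetical stand-in for EmbeddedBarcode with the four pieces of information described above, and treating `start`/`end` as a half-open Python slice is an assumption.

```python
from typing import NamedTuple, Tuple


class BarcodeSpec(NamedTuple):  # hypothetical stand-in for EmbeddedBarcode
    start: int
    end: int
    sequence_tag: str
    quality_tag: str


SamTag = Tuple[str, str, str]  # (tag identifier, value, SAM tag type)


def extract(sequence: str, quality: str, spec: BarcodeSpec) -> Tuple[SamTag, SamTag]:
    """Slice sequence and quality at the barcode position and build 'Z' (string) tags."""
    return (
        (spec.sequence_tag, sequence[spec.start:spec.end], "Z"),
        (spec.quality_tag, quality[spec.start:spec.end], "Z"),
    )


# a 4-base barcode at the start of the read; tag names are illustrative
cell = BarcodeSpec(start=0, end=4, sequence_tag="CR", quality_tag="CY")
seq_tag, qual_tag = extract("ACGTTTGACC", "IIIIFFFFHH", cell)
```

Each resulting tuple has the (tag, value, type) shape that the Returns section describes.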
sctools.gtf module¶
GTF Records and Iterators¶
This module defines a GTF record class and a Reader class to iterate over GTF-format files
Classes¶
- GTFRecord data class that exposes GTF record fields by name
- Reader GTF file reader that yields GTF records
References
https://useast.ensembl.org/info/website/upload/gff.html
- class sctools.gtf.GTFRecord(record: str)[source]¶
Bases:
object
Data class for storing and interacting with GTF records
Subclassed to produce exon, transcript, and gene-specific record types. A GTF record has 8 fixed fields, followed by optional fields separated by semicolons (;). The optional fields are stored by this class in the attributes field and are accessible by get_attribute. Fixed fields are accessible by name.
- Parameters
record (str) – an unparsed GTF record
- seqname¶
The name of the sequence (often chromosome) this record is found on.
- Type
str
- chromosome¶
Synonym for seqname.
- Type
str
- source¶
The group responsible for generating this annotation.
- Type
str
- feature¶
The type of record (e.g. gene, exon, …).
- Type
str
- start¶
The start position of this feature relative to the beginning of seqname.
- Type
int
- end¶
The end position of this feature relative to the beginning of seqname.
- Type
int
- score¶
The annotation score. Rarely used.
- Type
str
- strand¶
The strand of seqname that this annotation is found on
- Type
{‘+’, ‘-‘}
- frame¶
‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on
- Type
{‘0’, ‘1’, ‘2’}
- size¶
the number of nucleotides spanned by this feature
- Type
int
- set_attribute(key: str, value: str)[source]¶
set variable field key equal to value. Overwrites key if already present.
- property chromosome: str¶
- property end: int¶
- property feature: str¶
- property frame: str¶
- get_attribute(key) str [source]¶
access an item from the attribute field of a GTF file.
- Parameters
key (str) – Item to retrieve
- Returns
value – Contents of variable attribute key
- Return type
str
- Raises
KeyError – if there is no variable attribute key associated with this record
- property score: str¶
- property seqname: str¶
- set_attribute(key, value) None [source]¶
Set variable attribute key equal to value
If attribute key is already set for this record, its contents are overwritten by value
- Parameters
key (str) – attribute name
value (str) – attribute content
- property size: int¶
- property source: str¶
- property start: int¶
- property strand: str¶
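As a rough illustration of the record layout described for this class (8 fixed fields followed by semicolon-separated attributes), the sketch below parses one line by hand. It is not the GTFRecord implementation; `parse_gtf_line` and the example line are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_gtf_line(line: str) -> Tuple[List[str], Dict[str, str]]:
    """Split one GTF line into its 8 fixed fields and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    fixed = fields[:8]
    raw_attributes = fields[8] if len(fields) > 8 else ""
    attributes: Dict[str, str] = {}
    for item in raw_attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        # Each attribute looks like: key "value"
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return fixed, attributes


# Made-up record, for illustration only.
example_line = (
    "chr1\tsource_x\tgene\t100\t200\t.\t+\t.\t"
    'gene_id "G1"; gene_name "GENE_A";\n'
)
fixed_fields, attributes = parse_gtf_line(example_line)
seqname, source, feature, start, end, score, strand, frame = fixed_fields
```

Looking up a missing key in the plain dict raises KeyError, which matches the behaviour documented for get_attribute.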
- class sctools.gtf.Reader(files='-', mode='r', header_comment_char='#')[source]¶
Bases:
sctools.reader.Reader
GTF file iterator
- Parameters
files (Union[str, List], optional) – File(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).
header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)
- filter(retain_types: Iterable[str])[source]¶
Iterate over a GTF file, only yielding records in retain_types.
- property filenames: List[str]¶
- filter(retain_types: Iterable[str]) Generator [source]¶
Iterate over a GTF file, yielding only records whose feature type is in retain_types.
The feature type is the third GTF field (index 2 when counting from zero).
- Parameters
retain_types (Iterable[str]) – Record feature types to retain.
- Yields
gtf_record (Record) – gtf Record object
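A minimal sketch of this kind of filtering, working on raw lines instead of Record objects. The function name `filter_gtf_lines` is hypothetical, and skipping comment lines here is an assumption modelled on the header_comment_char parameter.

```python
from typing import Iterable, Iterator


def filter_gtf_lines(
    lines: Iterable[str], retain_types: Iterable[str], comment_char: str = "#"
) -> Iterator[str]:
    """Yield only the lines whose feature field (index 2) is in retain_types."""
    retain = set(retain_types)
    for line in lines:
        if line.startswith(comment_char):
            continue  # header or comment line
        fields = line.split("\t")
        if len(fields) > 2 and fields[2] in retain:
            yield line


# Made-up lines, for illustration only.
gtf_lines = [
    "#!header comment\n",
    "chr1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_name \"GENE_A\";\n",
    "chr1\tsrc\texon\t100\t150\t.\t+\t.\tgene_name \"GENE_A\";\n",
    "chr1\tsrc\ttranscript\t100\t200\t.\t+\t.\tgene_name \"GENE_A\";\n",
]
kept = list(filter_gtf_lines(gtf_lines, retain_types=["gene", "exon"]))
```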
- select_record_indices(indices: Set) Generator [source]¶
Iterate over provided indices only, skipping other records.
- Parameters
indices (Set[int]) – indices to include in the output
- Yields
record, str – records from file corresponding to indices
- property size: int¶
return the collective size of all files being read in bytes
- sctools.gtf.extract_extended_gene_names(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') Dict[str, List[tuple]] [source]¶
Extracts extended gene names from GTF file(s) and returns a map from gene names to their corresponding occurrence locations in the given file(s).
- Parameters
files (Union[str, List], optional) – File(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).
header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)
- Returns
A dictionary of chromosome names mapping to a List of tuples, each containing a range as the first element and a gene name as the second. Dict[str, List(Tuple((start,end), gene)))
- Return type
Dict[str, List[tuple]]
- sctools.gtf.extract_gene_exons(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') Dict[str, List[tuple]] [source]¶
Extracts gene exons from GTF file(s) and returns a map from gene names to the list of exons, in ascending order of start position, in the given file(s).
- Parameters
files (Union[str, List], optional) – File(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).
header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)
- Returns
A dictionary of chromosome names mapping to a List of tuples, each containing the exons in ascending order of their start positions. Dict[str, List(Tuple((start,end), gene)))
- Return type
Dict[str, List[tuple]]
- sctools.gtf.extract_gene_names(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') Dict[str, int] [source]¶
Extract gene names from GTF file(s) and returns a map from gene names to their corresponding occurrence orders in the given file(s).
- Parameters
files (Union[str, List], optional) – File(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).
header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)
- Returns
A map from gene names to their linear index
- Return type
Dict[str, int]
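The idea of mapping gene names to a linear index can be sketched as follows. This is only an illustration under stated assumptions: it considers only records whose feature type is gene, reads the gene_name attribute, and keeps the first occurrence of a repeated name. The real function may handle these cases differently, and `gene_name_index` is a hypothetical name.

```python
import re
from typing import Dict, Iterable

GENE_NAME_PATTERN = re.compile(r'gene_name "([^"]+)"')


def gene_name_index(lines: Iterable[str], comment_char: str = "#") -> Dict[str, int]:
    """Map each gene name to the order in which it first appears."""
    index: Dict[str, int] = {}
    for line in lines:
        if line.startswith(comment_char):
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumption: only 'gene' records with an attribute column are used.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_NAME_PATTERN.search(fields[8])
        if match and match.group(1) not in index:
            index[match.group(1)] = len(index)
    return index


# Made-up lines, for illustration only.
example_lines = [
    "# comment\n",
    'chr1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";\n',
    'chr1\tsrc\texon\t100\t150\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";\n',
    'chr2\tsrc\tgene\t300\t900\t.\t-\t.\tgene_id "G2"; gene_name "GENE_B";\n',
]
name_to_index = gene_name_index(example_lines)
```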
- sctools.gtf.get_mitochondrial_gene_names(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') Set[str] [source]¶
Extracts mitochondrial gene names from GTF file(s) and returns the set of mitochondrial gene ids that occur in the given file(s).
- Parameters
files (Union[str, List], optional) – File(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).
header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)
- Returns
A set of the mitochondrial gene ids
- Return type
Set[str]
sctools.platform module¶
Command Line Interface for SC Tools:¶
This module defines the command line interface for SC Tools. Tools are separated into those that are specific to particular chemistries (e.g. Smart-seq 2) or experimental platforms (e.g. 10x Genomics v2) and those that are general across any sequencing experiment.
Currently, only general modules and those used for 10x v2 are implemented
Classes¶
- GenericPlatform Class containing all general command line utilities
- TenXV2 Class containing 10x v2 specific command line utilities
- class sctools.platform.BarcodePlatform[source]¶
Bases:
sctools.platform.GenericPlatform
Command Line Interface for extracting and attaching barcodes with specified positions, generalizing TenXV2 attach barcodes
Sample, cell and/or molecule barcodes can be extracted and attached to an unmapped bam when the corresponding barcode’s start position and length are provided. The sample barcode is extracted from the index i7 fastq file and the cell and molecule barcodes are extracted from the r1 fastq file
This class defines several methods that are created as CLI tools when sctools is installed (see setup.py)
- cell_barcode¶
A data class that defines the start and end position of the cell barcode and the tags to assign the sequence and quality of the cell barcode
- Type
fastq.EmbeddedBarcode
- molecule_barcode¶
A data class that defines the start and end position of the molecule barcode and the tags to assign the sequence and quality of the molecule barcode
- Type
fastq.EmbeddedBarcode
- sample_barcode¶
A data class that defines the start and end position of the sample barcode and the tags to assign the sequence and quality of the sample barcode
- Type
fastq.EmbeddedBarcode
- attach_barcodes()[source]¶
Attach barcodes from the forward (r1) and optionally index (i1) fastq files to the reverse (r2) bam file
- classmethod attach_barcodes(args=None)[source]¶
Command line entrypoint for attaching barcodes to a bamfile.
- Parameters
args (Iterable[str], optional) – arguments list. The default value of None, when passed to parser.parse_args, causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod bam_to_count_matrix(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for constructing a count matrix from a tagged bam file.
Constructs a count matrix from an aligned bam file sorted by cell barcode, molecule barcode, and gene id.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
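The args convention used by these entrypoints relies on standard argparse behaviour: parse_args(None) reads sys.argv, while an explicit list is parsed as given, which makes the entrypoint callable from tests. The sketch below shows the pattern only. `PlatformSketch`, `count_entrypoint`, and the two flags are hypothetical and are not the sctools command line options.

```python
import argparse
from typing import Iterable, Optional


class PlatformSketch:
    """Hypothetical class illustrating the classmethod entrypoint pattern."""

    @classmethod
    def count_entrypoint(cls, args: Optional[Iterable[str]] = None) -> int:
        parser = argparse.ArgumentParser()
        # Hypothetical flags, chosen only for this illustration.
        parser.add_argument("-b", "--bam-file", required=True)
        parser.add_argument("-o", "--output-prefix", required=True)
        # With args=None, argparse falls back to sys.argv[1:].
        parsed = parser.parse_args(args)
        # A real entrypoint would do its work with parsed.bam_file and
        # parsed.output_prefix here.
        return 0


# Calling with an explicit list bypasses sys.argv, as a test would.
return_call = PlatformSketch.count_entrypoint(
    ["--bam-file", "example.bam", "--output-prefix", "counts"]
)
```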
- classmethod calculate_cell_metrics(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for calculating cell metrics from a sorted bamfile.
Writes metrics to .csv
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod calculate_gene_metrics(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for calculating gene metrics from a sorted bamfile.
Writes metrics to .csv
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- cell_barcode = None¶
- classmethod get_tags(raw_tags: Optional[Sequence[str]]) Iterable[str] ¶
- classmethod group_qc_outputs(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for parsing Picard metrics files and HISAT2 and RSEM statistics log files.
- Parameters
args (Iterable[str], optional) – arguments list: file_names (array of files), output_name (prefix of output file name), and metrics_type (Picard, PicardTable, HISAT2, RSEM and Core)
- Returns
return – return if the program completes successfully.
- Return type
0
- classmethod merge_cell_metrics(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for merging multiple cell metrics files.
Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod merge_count_matrices(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for merging multiple count matrices.
Merges multiple csr-format count matrices into a single csr matrix.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod merge_gene_metrics(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for merging multiple gene metrics files.
Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- molecule_barcode = None¶
- sample_barcode = None¶
- classmethod split_bam(args: Optional[Iterable] = None) int ¶
Command line entrypoint for splitting a bamfile into subfiles of equal size.
prints filenames of chunks to stdout
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod tag_sort_bam(args: Optional[Iterable] = None) int ¶
Command line entrypoint for sorting a bam file by zero or more tags, followed by queryname.
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod verify_bam_sort(args: Optional[Iterable] = None) int ¶
Command line entrypoint for verifying bam is properly sorted by zero or more tags, followed by queryname.
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- class sctools.platform.GenericPlatform[source]¶
Bases:
object
Platform-agnostic command line functions available in SC Tools.
- tag_sort_bam():
sort a bam file by zero or more tags and then by queryname
- verify_bam_sort():
verifies whether bam file is correctly sorted by given list of zero or more tags, then queryname
- split_bam()
split a bam file into subfiles of equal size
- calculate_gene_metrics()
calculate information about genes captured by a sequencing experiment
- calculate_cell_metrics()
calculate information about cells captured by a sequencing experiment
- merge_gene_metrics()
merge multiple gene metrics files into a single output
- merge_cell_metrics()
merge multiple cell metrics files into a single output
- bam_to_count_matrix()
construct a compressed sparse row count file from a tagged, aligned bam file
- merge_count_matrices()
merge multiple csr-format count matrices into a single csr matrix
- group_qc_outputs()
aggregate Picard, HISAT2 and RSEM QC statistics
- classmethod bam_to_count_matrix(args: Optional[Iterable[str]] = None) int [source]¶
Command line entrypoint for constructing a count matrix from a tagged bam file.
Constructs a count matrix from an aligned bam file sorted by cell barcode, molecule barcode, and gene id.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod calculate_cell_metrics(args: Optional[Iterable[str]] = None) int [source]¶
Command line entrypoint for calculating cell metrics from a sorted bamfile.
Writes metrics to .csv
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod calculate_gene_metrics(args: Optional[Iterable[str]] = None) int [source]¶
Command line entrypoint for calculating gene metrics from a sorted bamfile.
Writes metrics to .csv
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod group_qc_outputs(args: Optional[Iterable[str]] = None) int [source]¶
Command line entrypoint for parsing Picard metrics files and HISAT2 and RSEM statistics log files.
- Parameters
args (Iterable[str], optional) – arguments list: file_names (array of files), output_name (prefix of output file name), and metrics_type (Picard, PicardTable, HISAT2, RSEM and Core)
- Returns
return – return if the program completes successfully.
- Return type
0
- classmethod merge_cell_metrics(args: Optional[Iterable[str]] = None) int [source]¶
Command line entrypoint for merging multiple cell metrics files.
Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod merge_count_matrices(args: Optional[Iterable[str]] = None) int [source]¶
Command line entrypoint for merging multiple count matrices.
Merges multiple csr-format count matrices into a single csr matrix.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod merge_gene_metrics(args: Optional[Iterable[str]] = None) int [source]¶
Command line entrypoint for merging multiple gene metrics files.
Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod split_bam(args: Optional[Iterable] = None) int [source]¶
Command line entrypoint for splitting a bamfile into subfiles of equal size.
prints filenames of chunks to stdout
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod tag_sort_bam(args: Optional[Iterable] = None) int [source]¶
Command line entrypoint for sorting a bam file by zero or more tags, followed by queryname.
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod verify_bam_sort(args: Optional[Iterable] = None) int [source]¶
Command line entrypoint for verifying bam is properly sorted by zero or more tags, followed by queryname.
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
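The sort order used by tag_sort_bam and checked by verify_bam_sort (zero or more tags, then query name) can be pictured as a tuple sort key. The sketch below uses plain dicts as stand-ins for alignment records. The function names are hypothetical, and treating a missing tag as an empty string is an assumption, not documented behaviour.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]  # hypothetical stand-in for a pysam.AlignedSegment


def sort_key(record: Record, tags: Sequence[str]) -> Tuple[str, ...]:
    """Key made of the tag values in order, followed by the query name."""
    # Assumption: a record without the tag sorts as the empty string.
    return tuple(record.get(tag, "") for tag in tags) + (record["query_name"],)


def tag_sort(records: List[Record], tags: Sequence[str]) -> List[Record]:
    """Return the records sorted by the given tags, then by query name."""
    return sorted(records, key=lambda record: sort_key(record, tags))


def is_sorted(records: List[Record], tags: Sequence[str]) -> bool:
    """Check that consecutive keys never decrease."""
    keys = [sort_key(record, tags) for record in records]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Made-up records, for illustration only.
records = [
    {"query_name": "r2", "CB": "AAA"},
    {"query_name": "r1", "CB": "CCC"},
    {"query_name": "r1", "CB": "AAA"},
]
sorted_records = tag_sort(records, tags=["CB"])
```

With an empty tag list the key reduces to the query name alone, which corresponds to the "zero tags" case.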
- class sctools.platform.TenXV2[source]¶
Bases:
sctools.platform.GenericPlatform
Command Line Interface for 10x Genomics v2 RNA-sequencing programs
This class defines several methods that are created as CLI tools when sctools is installed (see setup.py)
- cell_barcode¶
A data class that defines the start and end position of the cell barcode and the tags to assign the sequence and quality of the cell barcode
- Type
fastq.EmbeddedBarcode
- molecule_barcode¶
A data class that defines the start and end position of the molecule barcode and the tags to assign the sequence and quality of the molecule barcode
- Type
fastq.EmbeddedBarcode
- sample_barcode¶
A data class that defines the start and end position of the sample barcode and the tags to assign the sequence and quality of the sample barcode
- Type
fastq.EmbeddedBarcode
- attach_barcodes()[source]¶
Attach barcodes from the forward (r1) and optionally index (i1) fastq files to the reverse (r2) bam file
- classmethod attach_barcodes(args=None)[source]¶
Command line entrypoint for attaching barcodes to a bamfile.
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod bam_to_count_matrix(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for constructing a count matrix from a tagged bam file.
Constructs a count matrix from an aligned bam file sorted by cell barcode, molecule barcode, and gene id.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod calculate_cell_metrics(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for calculating cell metrics from a sorted bamfile.
Writes metrics to .csv
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod calculate_gene_metrics(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for calculating gene metrics from a sorted bamfile.
Writes metrics to .csv
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- cell_barcode = Tag(start=0, end=16, sequence_tag='CR', quality_tag='CY')¶
- classmethod get_tags(raw_tags: Optional[Sequence[str]]) Iterable[str] ¶
- classmethod group_qc_outputs(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for parsing Picard metrics files and HISAT2 and RSEM statistics log files.
- Parameters
args (Iterable[str], optional) – arguments list: file_names (array of files), output_name (prefix of output file name), and metrics_type (Picard, PicardTable, HISAT2, RSEM and Core)
- Returns
return – return if the program completes successfully.
- Return type
0
- classmethod merge_cell_metrics(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for merging multiple cell metrics files.
Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod merge_count_matrices(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for merging multiple count matrices.
Merges multiple csr-format count matrices into a single csr matrix.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod merge_gene_metrics(args: Optional[Iterable[str]] = None) int ¶
Command line entrypoint for merging multiple gene metrics files.
Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.
- Parameters
args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- molecule_barcode = Tag(start=16, end=26, sequence_tag='UR', quality_tag='UY')¶
- sample_barcode = Tag(start=0, end=8, sequence_tag='SR', quality_tag='SY')¶
- classmethod split_bam(args: Optional[Iterable] = None) int ¶
Command line entrypoint for splitting a bamfile into subfiles of equal size.
prints filenames of chunks to stdout
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod tag_sort_bam(args: Optional[Iterable] = None) int ¶
Command line entrypoint for sorting a bam file by zero or more tags, followed by queryname.
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
- classmethod verify_bam_sort(args: Optional[Iterable] = None) int ¶
Command line entrypoint for verifying bam is properly sorted by zero or more tags, followed by queryname.
- Parameters
args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
- Returns
return_call – return call if the program completes successfully
- Return type
0
sctools.reader module¶
Sequence File Iterators¶
This module defines a general iterator and some helper functions for iterating over files that contain sequencing data
- sctools.infer_open(file_: str, mode: str)¶
helper function that determines the compression type of a file without relying on its extension
- sctools.zip_readers(*readers, indices=None)¶
helper function that iterates over one or more readers, optionally extracting only the records that correspond to indices
Classes¶
- Reader Basic reader that loops over one or more input files.
- class sctools.reader.Reader(files='-', mode='r', header_comment_char=None)[source]¶
Bases:
object
Basic reader object that seamlessly loops over multiple input files.
Is subclassed to create readers for specific file types (e.g. fastq, gtf, etc.)
- Parameters
files (Union[str, List], optional) – The file(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – The open mode for files. If ‘r’, yield string data, if ‘rb’, yield bytes data (default = ‘r’).
header_comment_char (str, optional) – If not None, skip lines beginning with this character (default = None).
- property filenames: List[str]¶
- select_record_indices(indices: Set) Generator [source]¶
Iterate over provided indices only, skipping other records.
- Parameters
indices (Set[int]) – indices to include in the output
- Yields
record, str – records from file corresponding to indices
- property size: int¶
return the collective size of all files being read in bytes
- sctools.reader.infer_open(file_: str, mode: str) Callable [source]¶
Helper function to infer the correct compression type of an input file
Identifies files that are .gz or .bz2 compressed without requiring file extensions
- Parameters
file_ (str) – the file to open
mode ({'r', 'rb'}) – the mode to open the file in. ‘r’ returns strings, ‘rb’ returns bytes
- Returns
open_function – the correct open function for the file’s compression with mode pre-set through functools partial
- Return type
Callable
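One common way to detect compression without looking at the extension is to read the leading magic bytes: gzip streams start with 0x1f 0x8b and bzip2 streams with "BZh". The sketch below follows that approach and returns an open function with the mode pre-set through functools.partial. It is an assumption that the library works this way, and `infer_open_sketch` is a hypothetical name.

```python
import bz2
import gzip
import os
import tempfile
from functools import partial
from typing import Callable

GZIP_MAGIC = b"\x1f\x8b"
BZ2_MAGIC = b"BZh"


def infer_open_sketch(file_: str, mode: str) -> Callable:
    """Pick an open function from the file's first bytes, not its name."""
    with open(file_, "rb") as handle:
        head = handle.read(3)
    if head.startswith(GZIP_MAGIC):
        opener = gzip.open
    elif head.startswith(BZ2_MAGIC):
        opener = bz2.open
    else:
        return partial(open, mode=mode)
    # gzip.open and bz2.open treat 'r' as binary, so text mode must be 'rt'.
    compressed_mode = "rt" if mode == "r" else mode
    return partial(opener, mode=compressed_mode)


# Usage: a gzip-compressed file with no extension is still read as text.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads_without_extension")
    with gzip.open(path, "wt") as out:
        out.write("@read1\n")
    open_fn = infer_open_sketch(path, "r")
    with open_fn(path) as handle:
        first_line = handle.readline()
```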
- sctools.reader.zip_readers(*readers, indices=None) Generator [source]¶
Zip together multiple reader objects, yielding records simultaneously.
If indices is passed, only return lines in file that correspond to indices
- Parameters
*readers (List[Reader]) – Reader objects to simultaneously iterate over
indices (Set[int], optional) – indices to include in the output
- Yields
records (Tuple[str]) – one record per reader passed
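The zipping and index selection described here can be sketched with the built-in zip and enumerate. The sketch accepts any iterables, not only Reader objects, and `zip_readers_sketch` is a hypothetical name.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple


def zip_readers_sketch(
    *readers: Iterable[str], indices: Optional[Set[int]] = None
) -> Iterator[Tuple[str, ...]]:
    """Yield one tuple per position, optionally keeping only chosen indices."""
    for index, records in enumerate(zip(*readers)):
        if indices is None or index in indices:
            yield records


# Made-up record names, for illustration only.
read1 = ["r1_a", "r1_b", "r1_c"]
read2 = ["r2_a", "r2_b", "r2_c"]
selected = list(zip_readers_sketch(read1, read2, indices={0, 2}))
everything = list(zip_readers_sketch(read1, read2))
```

Passing a set for indices keeps the membership test cheap when many records are skipped.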
sctools.stats module¶
Statistics Functions for Sequence Data Analysis¶
This module implements statistical functions and classes for sequence analysis
- sctools.base4_entropy(x: np.array, axis: int = 1)¶
calculate the entropy of a 4 x sequence length base frequency matrix
Classes¶
- OnlineGaussianSufficientStatistic Empirical (online) calculation of mean and variance
- class sctools.stats.OnlineGaussianSufficientStatistic[source]¶
Bases:
object
Implementation of Welford’s online mean and variance algorithm
- update(new_value: float)[source]¶
incorporate new_value into the online estimate of mean and variance
- mean()¶
return the mean value
- property mean: float¶
return the mean value
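For reference, Welford's algorithm updates a running mean and a running sum of squared deviations one value at a time. The sketch below is a generic version, not the class documented here. `OnlineMeanVariance` is a hypothetical name, and reporting the sample variance (dividing by n - 1) is an assumption.

```python
class OnlineMeanVariance:
    """Generic Welford accumulator for mean and variance."""

    def __init__(self) -> None:
        self.count = 0
        self._mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, new_value: float) -> None:
        """Fold one observation into the running estimates."""
        self.count += 1
        delta = new_value - self._mean
        self._mean += delta / self.count
        # Uses the updated mean, which keeps the update numerically stable.
        self._m2 += delta * (new_value - self._mean)

    @property
    def mean(self) -> float:
        return self._mean

    @property
    def variance(self) -> float:
        """Sample variance; undefined (nan) for fewer than two values."""
        if self.count < 2:
            return float("nan")
        return self._m2 / (self.count - 1)


stat = OnlineMeanVariance()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stat.update(value)
```

Because only three numbers are stored, the accumulator can process a stream of any length without keeping the values in memory.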
- sctools.stats.base4_entropy(x, axis=1)[source]¶
Calculate entropy in base four of a data matrix x
Useful for measuring DNA entropy (with 4 nucleotides) as the output is restricted to [0, 1]
- Parameters
x (np.ndarray) – array of dimension one or more containing numeric types
axis (int, optional) – axis to calculate entropy across. Values in this axis are treated as observation frequencies
- Returns
entropy – array of input dimension - 1 containing entropy values bounded in [0, 1]
- Return type
np.ndarray
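The base-four entropy of a frequency vector p is H = -sum(p_i * log4(p_i)), which lies in [0, 1] when there are four categories. The sketch below computes it in pure Python for a single vector of counts. Normalizing the counts to probabilities is an assumption, and `base4_entropy_1d` is a hypothetical name rather than the NumPy-based function documented above.

```python
import math
from typing import Sequence


def base4_entropy_1d(frequencies: Sequence[float]) -> float:
    """Entropy in base four of one vector of observation frequencies."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    entropy = 0.0
    for count in frequencies:
        if count > 0:  # a zero count contributes nothing to the sum
            probability = count / total
            entropy -= probability * math.log(probability, 4)
    return entropy


# Counts of A, C, G, T at one sequence position (made-up values).
uniform_position = base4_entropy_1d([25, 25, 25, 25])
fixed_position = base4_entropy_1d([100, 0, 0, 0])
two_base_position = base4_entropy_1d([50, 50, 0, 0])
```

A position where all four nucleotides are equally frequent reaches the maximum, and a position with a single nucleotide reaches the minimum.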