sctools package¶

Submodules¶

sctools.bam module¶

Tools for Manipulating SAM/BAM format files¶

This module provides functions and classes to subsample reads from bam files that correspond to specific chromosomes, split bam files into chunks, assign tags to bam files from paired fastq records, and iterate over sorted bam files by one or more tags.

This module makes heavy use of the pysam wrapper for HTSlib, a high-performance C library designed to manipulate sam files.

iter_tag_groups function to iterate over reads by an arbitrary tag

iter_cell_barcodes wrapper for iter_tag_groups that iterates over cell barcode tags

iter_genes wrapper for iter_tag_groups that iterates over gene tags

iter_molecule_barcodes wrapper for iter_tag_groups that iterates over molecule tags

sort_by_tags_and_queryname sort bam by given list of zero or more tags, followed by query name

verify_sort verifies whether bam is correctly sorted by given list of tags, then query name

Classes

SubsetAlignments class to extract reads specific to requested chromosome(s)

Tagger class to add tags to sam/bam records from paired fastq records

AlignmentSortOrder abstract class to represent alignment sort orders

QueryNameSortOrder alignment sort order by query name

TagSortableRecord class to facilitate sorting of pysam.AlignedSegments

SortError error raised when sorting is incorrect

References

htslib : https://github.com/samtools/htslib

class sctools.bam.AlignmentSortOrder[source]¶

Bases: object

The base class of alignment sort orders.

abstract property key_generator: Callable[pysam.libcalignedsegment.AlignedSegment, Any]¶: Returns a callable function that calculates a sort key from given pysam.AlignedSegment.

class sctools.bam.QueryNameSortOrder[source]¶

Bases: sctools.bam.AlignmentSortOrder

Alignment record sort order by query name.

static get_sort_key(alignment: pysam.libcalignedsegment.AlignedSegment) → str[source]¶

property key_generator¶: Returns a callable function that calculates a sort key from given pysam.AlignedSegment.

exception sctools.bam.SortError[source]¶

Bases: Exception

args¶

with_traceback()¶: Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class sctools.bam.SubsetAlignments(alignment_file: str, open_mode: Optional[str] = None)[source]¶

Bases: object

Wrapper for pysam/htslib that extracts reads corresponding to requested chromosome(s)

Parameters

alignment_file (str) – sam or bam file
open_mode ({'r', 'rb', None}, optional) – open mode for pysam.AlignmentFile. ‘r’ indicates a sam file, ‘rb’ indicates a bam file, and None attempts to autodetect based on the file suffix (Default = None)

indices_by_chromosome()[source]¶: returns indices to line numbers containing the requested number of reads for a specified chromosome

Notes

samtools is a good general-purpose tool that is capable of most subsampling tasks. It is a good idea to check the samtools documentation when approaching these types of tasks.

References

samtools documentation : http://www.htslib.org/doc/samtools.html

indices_by_chromosome(n_specific: int, chromosome: str, include_other: int = 0) → Union[List[int], Tuple[List[int], List[int]]][source]¶

Return the list of first n_specific indices of reads aligned to chromosome.

Parameters

n_specific (int) – Number of aligned reads to return indices for
chromosome (str) – Only reads from this chromosome are considered valid
include_other (int, optional) – The number of reads to include that are NOT aligned to chromosome. These can be aligned or unaligned reads (default = 0).

Returns

chromosome_indices (List[int]) – list of indices to reads aligning to chromosome
other_indices (List[int], optional) – list of indices to reads NOT aligning to chromosome, only returned if include_other is not 0.
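The selection described above can be sketched without pysam. In this minimal sketch, plain strings stand in for each read's reference name (None marks an unaligned read), and first_indices_by_chromosome is a hypothetical helper written for illustration, not the library's implementation.

```python
from typing import List, Optional, Tuple


def first_indices_by_chromosome(
    reference_names: List[Optional[str]],
    n_specific: int,
    chromosome: str,
    include_other: int = 0,
) -> Tuple[List[int], List[int]]:
    """Collect indices of the first n_specific reads on `chromosome`,
    plus up to include_other indices of reads that are not on it."""
    chromosome_indices: List[int] = []
    other_indices: List[int] = []
    for i, name in enumerate(reference_names):
        if name == chromosome:
            if len(chromosome_indices) < n_specific:
                chromosome_indices.append(i)
        elif len(other_indices) < include_other:
            other_indices.append(i)
        # Stop scanning once both quotas are filled.
        if len(chromosome_indices) == n_specific and len(other_indices) == include_other:
            break
    return chromosome_indices, other_indices


names = ["chr1", "chr19", None, "chr19", "chr2", "chr19"]
specific, other = first_indices_by_chromosome(names, 2, "chr19", include_other=1)
```

The sketch always returns both lists; the documented method returns only the first list when include_other is 0.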

class sctools.bam.TagSortableRecord(tag_keys: Iterable[str], tag_values: Iterable[str], query_name: str, record: Optional[pysam.libcalignedsegment.AlignedSegment] = None)[source]¶

Bases: object

Wrapper for pysam.AlignedSegment that facilitates sorting by tags and query name.

classmethod from_aligned_segment(record: pysam.libcalignedsegment.AlignedSegment, tag_keys: Iterable[str]) → sctools.bam.TagSortableRecord[source]¶: Create a TagSortableRecord from a pysam.AlignedSegment and list of tag keys

class sctools.bam.Tagger(bam_file: str)[source]¶

Bases: object

Add tags to a bam file from tag generators.

Parameters: bam_file (str) – Bam file that tags are to be added to.

tag()[source]¶: tag bam records given tag_generators (often generated from paired bam or fastq files) # todo this should probably be wrapped up in __init__ to make this more function-like

tag(output_bam_name: str, tag_generators) → None[source]¶

Add tags to bam_file.

Given a bam file and tag generators derived from files sharing the same sort order, adds tags to the .bam file, and writes the resulting file to output_bam_name.

Parameters

output_bam_name (str) – Name of output tagged bam.
tag_generators (List[fastq.TagGenerator]) – list of generators that yield fastq.Tag objects

sctools.bam.get_barcode_for_alignment(alignment: pysam.libcalignedsegment.AlignedSegment, tags: List[str], raise_missing: bool) → str[source]¶

Get the barcode for an Alignment

Parameters

alignment – pysam.AlignedSegment An Alignment from pysam.
tags – List[str] Tags in the bam that might contain barcodes. If multiple Tags are passed, will return the contents of the first tag that contains a barcode.
raise_missing – bool Raise an error if no barcodes can be found.

Returns

str A barcode for the alignment, or None if one is not found and raise_missing is False.
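The "first tag that contains a barcode" rule can be illustrated with a plain dictionary standing in for the tags of a pysam.AlignedSegment. This is a sketch under assumptions: first_barcode is a hypothetical name, and the use of RuntimeError for the missing case is an assumption borrowed from the split documentation, not confirmed for this function.

```python
from typing import Dict, List, Optional


def first_barcode(
    record_tags: Dict[str, str], tags: List[str], raise_missing: bool
) -> Optional[str]:
    """Return the value of the first tag in `tags` present on the record."""
    for tag in tags:
        value = record_tags.get(tag)
        if value is not None:
            return value
    if raise_missing:
        # Assumed exception type; the real function may raise something else.
        raise RuntimeError(f"no barcode found in tags {tags}")
    return None


# A record with only a raw barcode (CR) falls through to the second tag.
barcode = first_barcode({"CR": "ACGT"}, ["CB", "CR"], raise_missing=True)
```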

sctools.bam.get_barcodes_from_bam(in_bam: str, tags: List[str], raise_missing: bool) → Set[str][source]¶

Get all the distinct barcodes from a bam

Parameters

in_bam – str Input bam file.
tags – List[str] Tags in the bam that might contain barcodes.
raise_missing – bool Raise an error if no barcodes can be found.

Returns

set A set of barcodes found in the bam. This set will not contain a None value.

sctools.bam.get_tag_or_default(alignment: pysam.libcalignedsegment.AlignedSegment, tag_key: str, default: Optional[str] = None) → Optional[str][source]¶: Extracts the value associated to tag_key from alignment, and returns a default value if the tag is not present.

sctools.bam.iter_cell_barcodes(bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment]) → Generator[source]¶

Iterate over all the cells of a bam file sorted by cell.

Parameters

bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over

Yields

grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique cell barcode tag
current_tag (str) – the cell barcode that reads in the group all share

sctools.bam.iter_genes(bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment]) → Generator[source]¶

Iterate over all the genes of a bam file sorted by gene.

Parameters

bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over

Yields

grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique gene name tag
current_tag (str) – the gene id that reads in the group all share

sctools.bam.iter_molecule_barcodes(bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment]) → Generator[source]¶

Iterate over all the molecules of a bam file sorted by molecule.

Parameters

bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over

Yields

grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique molecule barcode tag
current_tag (str) – the molecule barcode that records in the group all share

sctools.bam.iter_tag_groups(tag: str, bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment], filter_null: bool = False) → Generator[source]¶

Iterates over reads and yields them grouped by the provided tag value

Parameters

tag (str) – BAM tag to group over
bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over
filter_null (bool, optional) – If False, all reads that lack the requested tag are yielded together. Else, all reads that lack the tag will be discarded (default = False).

Yields

grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique value of tag
current_tag (str) – the tag that reads in the group all share
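Because the input is expected to be sorted by the tag, grouping amounts to collecting runs of consecutive reads that share a value. The sketch below shows that idea with itertools.groupby; dictionaries stand in for pysam.AlignedSegment objects, group_by_tag is a hypothetical name, and each group is materialized as a list for simplicity.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical stand-in for pysam.AlignedSegment: a dict of tag name -> value.
Read = Dict[str, str]


def group_by_tag(
    tag: str, reads: Iterable[Read], filter_null: bool = False
) -> Iterator[Tuple[List[Read], Optional[str]]]:
    """Yield (reads, tag_value) for each run of consecutive reads sharing a value."""
    for value, group in groupby(reads, key=lambda read: read.get(tag)):
        if value is None and filter_null:
            continue  # discard reads that lack the tag
        yield list(group), value


reads = [
    {"name": "r1", "CB": "AAA"},
    {"name": "r2", "CB": "AAA"},
    {"name": "r3", "CB": "CCC"},
    {"name": "r4"},  # no CB tag
]
groups = [(value, [r["name"] for r in grp]) for grp, value in group_by_tag("CB", reads)]
tagged_only = [value for _, value in group_by_tag("CB", reads, filter_null=True)]
```

If the input is not sorted by the tag, the same value can appear in more than one group, which is why the sort order matters for these iterators.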

sctools.bam.merge_bams(bams: List[str]) → str[source]¶

Merge input bams using samtools.

This cannot be a local function within split because then Python “cannot pickle a local object”.

Parameters: bams (List[str]) – Name of the final bam followed by the bams to merge. Because of how it is called using multiprocessing, the bam basename is the first element of the list.

Returns: The output bam name.

sctools.bam.sort_by_tags_and_queryname(records: Iterable[pysam.libcalignedsegment.AlignedSegment], tag_keys: Iterable[str]) → Iterable[pysam.libcalignedsegment.AlignedSegment][source]¶: Sorts the given bam records by the given tags, followed by query name. If no tags are given, just sorts by query name.
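One way to picture this ordering is as a tuple sort key: the tag values in the given order, then the query name. The following is a minimal sketch with a hypothetical Read type in place of pysam.AlignedSegment; treating a missing tag as an empty string is an assumption made here, not documented behavior.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical stand-in for pysam.AlignedSegment."""
    query_name: str
    tags: Dict[str, str]


def sort_key(read: Read, tag_keys: Iterable[str]) -> Tuple[str, ...]:
    # Tag values first (missing tags sort as ""), query name last.
    return tuple(read.tags.get(key, "") for key in tag_keys) + (read.query_name,)


def sort_reads(reads: Iterable[Read], tag_keys: List[str]) -> List[Read]:
    return sorted(reads, key=lambda read: sort_key(read, tag_keys))


reads = [
    Read("q2", {"CB": "AAA"}),
    Read("q1", {"CB": "CCC"}),
    Read("q1", {"CB": "AAA"}),
    Read("q0", {}),
]
by_cell = [(r.tags.get("CB"), r.query_name) for r in sort_reads(reads, ["CB"])]
by_name = [r.query_name for r in sort_reads(reads, [])]
```

With an empty tag list the key reduces to the query name alone, matching the documented fallback.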

sctools.bam.split(in_bams: List[str], out_prefix: str, tags: List[str], approx_mb_per_split: float = 1000, raise_missing: bool = True, num_processes: Optional[int] = None) → List[str][source]¶

Split in_bams by tag into files of approximately approx_mb_per_split megabytes.

Parameters

in_bams (List[str]) – Input bam files.
out_prefix (str) – Prefix for all output files; output will be named as prefix_n where n is an integer equal to the chunk number.
tags (List[str]) – The bam tags to split on. The tags are checked in order, and sorting is done based on the first identified tag. Further tags are only checked if the first tag is missing. This is useful in cases where sorting is executed over a corrected barcode, but some records only have a raw barcode.
approx_mb_per_split (float) – The target file size for each chunk in mb
raise_missing (bool, optional) – if True, raise a RuntimeError if a record is encountered without a tag. Else silently discard the record (default = True)
num_processes (int, optional) – The number of processes to parallelize over. If not set, will use all available processes.

Returns

output_filenames – list of filenames of bam chunks

Return type

List[str]

Raises

ValueError – when tags is empty
RuntimeError – when raise_missing is true and any passed read contains no tags
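The chunking can be thought of as two steps: derive a number of chunks from the total input size, then map every barcode to exactly one chunk so that all records sharing a barcode land in the same file. The sketch below illustrates that idea only; assign_bins is a hypothetical helper and the round-robin assignment is an assumption, not necessarily how split distributes barcodes.

```python
import math
from typing import Dict, Iterable


def assign_bins(
    barcodes: Iterable[str], total_mb: float, approx_mb_per_split: float
) -> Dict[str, int]:
    """Map each distinct barcode to a bin index in [0, n_bins)."""
    n_bins = max(1, math.ceil(total_mb / approx_mb_per_split))
    # Sorting makes the assignment deterministic across runs.
    return {
        barcode: index % n_bins
        for index, barcode in enumerate(sorted(set(barcodes)))
    }


observed = ["CCC", "AAA", "GGG", "AAA"]
three_bins = assign_bins(observed, total_mb=2500, approx_mb_per_split=1000)
two_bins = assign_bins(observed, total_mb=1500, approx_mb_per_split=1000)
```

A mapping of this shape corresponds to the barcodes_to_bins argument of write_barcodes_to_bins.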

sctools.bam.verify_sort(records: Iterable[sctools.bam.TagSortableRecord], tag_keys: Iterable[str]) → None[source]¶: Raise AssertionError if the given records are not correctly sorted by the given tags and query name

sctools.bam.write_barcodes_to_bins(in_bam: str, tags: List[str], barcodes_to_bins: Dict[str, int], raise_missing: bool) → List[str][source]¶

Write barcodes to appropriate bins as defined by barcodes_to_bins

Parameters

in_bam – str The bam file to read.
tags – List[str] Tags in the bam that might contain barcodes.
barcodes_to_bins – Dict[str, int] A Dict from barcode to bin. All barcodes of the same type need to be written to the same bin. These numbered bins are merged after parallelization so that all alignments with the same barcode are in the same bam.
raise_missing – bool Raise an error if no barcodes can be found.

Returns

A list of paths to the written bins.

sctools.barcode module¶

Nucleotide Barcode Manipulation Tools¶

This module contains tools to characterize oligonucleotide barcodes and a simple hamming-based error-correction approach which corrects barcodes within a specified distance of a “whitelist” of expected barcodes.

Classes¶

Barcodes Class to characterize a set of barcodes

ErrorsToCorrectBarcodesMap Class to carry out error correction routines

class sctools.barcode.Barcodes(barcodes: Mapping[str, int], barcode_length: int)[source]¶

Bases: object

Container for a set of nucleotide barcodes.

Contained barcodes are encoded in 2bit representation for fast operations. Instances of this class can optionally be constructed from an iterable where barcodes can be present multiple times. In these cases, barcodes are analyzed based on their observed frequencies.

Parameters

barcodes (Mapping[str, int]) – dictionary-like mapping barcodes to the number of times they were observed
barcode_length (int) – the length of all barcodes in the set. Different-length barcodes are not supported.

See also

sctools.encodings.TwoBit

base_frequency(weighted=False) → numpy.ndarray[source]¶

return the frequency of each base at each position in the barcode set

Notes

weighting is currently not supported, and must be set to False or base_frequency will raise NotImplementedError # todo fix

Parameters: weighted (bool, optional) – if True, each barcode is counted once for each time it was observed (default = False)
Returns: frequencies – barcode_length x 4 2d numpy array
Return type: np.array
Raises: NotImplementedError – if weighted is True

effective_diversity(weighted=False) → numpy.ndarray[source]¶

Returns the effective base diversity of the barcode set by position.

maximum diversity for each position is 1, and represents a perfect split of 25% per base at a given position.

Parameters: weighted (bool, optional) – if True, each barcode is counted once for each time it was observed (default = False)
Returns: effective_diversity – 1-d array of size barcode_length containing floats in [0, 1]
Return type: np.array[float]
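A measure that reaches 1 when each base occurs 25% of the time at a position, and 0 when a single base dominates completely, is the Shannon entropy of the per-position base frequencies divided by log 4. The sketch below computes that quantity in plain Python on string barcodes; that the library uses exactly this formula is an assumption, and the function names here only mirror the documented ones for readability.

```python
import math
from typing import List

BASES = "ACGT"


def base_frequency(barcodes: List[str]) -> List[List[float]]:
    """Return a barcode_length x 4 table of base frequencies (unweighted)."""
    length = len(barcodes[0])
    table = []
    for position in range(length):
        column = [barcode[position] for barcode in barcodes]
        table.append([column.count(base) / len(column) for base in BASES])
    return table


def effective_diversity(barcodes: List[str]) -> List[float]:
    """Normalized Shannon entropy per position, in [0, 1]."""
    result = []
    for row in base_frequency(barcodes):
        entropy = -sum(p * math.log(p) for p in row if p > 0)
        result.append(entropy / math.log(4))
    return result


# Position 0 is perfectly balanced; position 1 is always "A".
diversity = effective_diversity(["AA", "CA", "GA", "TA"])
```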

classmethod from_iterable_bytes(iterable: Iterable[bytes], barcode_length: int)[source]¶

Construct a Barcodes instance from an iterable of bytes barcodes.

Parameters

iterable (Iterable[bytes]) – iterable of barcodes in bytes representation
barcode_length (int) – the length of the barcodes in iterable

Returns

barcodes – class object containing barcodes from a whitelist file

Return type

Barcodes

classmethod from_iterable_encoded(iterable: Iterable[int], barcode_length: int)[source]¶

Construct a Barcodes instance from an iterable of encoded barcodes.

Parameters

iterable (Iterable[int]) – iterable of barcodes encoded in TwoBit representation
barcode_length (int) – the length of the barcodes in iterable

Returns

barcodes – class object containing barcodes from a whitelist file

Return type

Barcodes

classmethod from_iterable_strings(iterable: Iterable[str], barcode_length: int)[source]¶

Construct a Barcodes instance from an iterable of string barcodes.

Parameters

iterable (Iterable[str]) – iterable of barcodes in string representation
barcode_length (int) – the length of the barcodes in iterable

Returns

barcodes – class object containing barcodes from a whitelist file

Return type

Barcodes

classmethod from_whitelist(file_: str, barcode_length: int)[source]¶

Creates a barcode set from a whitelist file.

Parameters

file_ (str) – location of the whitelist file. Should be formatted one barcode per line. Barcodes should be encoded in plain text (UTF-8, ASCII), not bit-encoded. Each barcode will be assigned a count of 1.
barcode_length (int) – Length of the barcodes in the file.

Returns

barcodes – class object containing barcodes from a whitelist file

Return type

Barcodes

summarize_hamming_distances() → Mapping[str, float][source]¶

Returns descriptive statistics on hamming distances between pairs of barcodes.

Returns: descriptive_statistics – minimum, 25th percentile, median, 75th percentile, maximum, and average hamming distance between all pairs of barcodes
Return type: Mapping[str, float]
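The statistics listed above can be reproduced on string barcodes with the standard library. This is a sketch for illustration: it works on strings rather than the 2-bit encoded values the class stores, the dictionary keys are hypothetical, and the percentile interpolation method is an arbitrary choice.

```python
from itertools import combinations
from statistics import mean, median, quantiles
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


def summarize_pairwise_distances(barcodes: List[str]) -> Dict[str, float]:
    distances = [hamming(a, b) for a, b in combinations(barcodes, 2)]
    q1, _, q3 = quantiles(distances, n=4, method="inclusive")
    return {
        "minimum": min(distances),
        "25th percentile": q1,
        "median": median(distances),
        "75th percentile": q3,
        "maximum": max(distances),
        "average": mean(distances),
    }


summary = summarize_pairwise_distances(["AAAA", "AAAT", "AATT", "TTTT"])
```

The number of pairs grows quadratically with the number of barcodes, so an all-pairs summary like this one is only practical for modestly sized sets.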

References

https://en.wikipedia.org/wiki/Hamming_distance

class sctools.barcode.ErrorsToCorrectBarcodesMap(errors_to_barcodes: Mapping[str, str])[source]¶

Bases: object

Correct any barcode that is within one hamming distance of a whitelisted barcode

Parameters: errors_to_barcodes (Mapping[str, str]) – dict-like mapping 1-base errors to the whitelist barcode that they could be generated from

get_corrected_barcode(barcode: str)[source]¶: Return a barcode if it is in the whitelist, or the corrected version if within edit distance 1

correct_bam(bam_file: str, output_bam_file: str)[source]¶: correct barcodes in a bam file, given a whitelist

References

https://en.wikipedia.org/wiki/Hamming_distance

correct_bam(bam_file: str, output_bam_file: str) → None[source]¶

Correct barcodes in a (potentially unaligned) bamfile, given a whitelist.

Parameters

bam_file (str) – BAM format file in same order as the fastq files
output_bam_file (str) – BAM format file containing cell, umi, and sample tags.

get_corrected_barcode(barcode: str) → str[source]¶

Return a barcode if it is in the whitelist, or the corrected version if within edit distance 1

Parameters: barcode (str) – the barcode to return the corrected version of. If the barcode is in the whitelist, the input barcode is returned unchanged.
Returns: corrected_barcode – corrected version of the barcode
Return type: str
Raises: KeyError – if the passed barcode is not within 1 hamming distance of any whitelist barcode

References

https://en.wikipedia.org/wiki/Hamming_distance

classmethod single_hamming_errors_from_whitelist(whitelist_file: str)[source]¶

Factory method to generate instance of class from a file containing “correct” barcodes.

Parameters: whitelist_file (str) – Text file containing barcode per line.
Returns: errors_to_barcodes_map – instance of cls, built from whitelist
Return type: ErrorsToCorrectBarcodesMap
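The correction scheme can be sketched as a precomputed lookup table: for every whitelist barcode, enumerate each single-base substitution and record which whitelist barcode it came from. The code below is a minimal illustration with hypothetical function names; including N among the substituted bases is an assumption, and when two whitelist barcodes are within distance 2 of each other the last one written wins.

```python
from typing import Dict, Iterable, Set


def single_base_errors(whitelist: Iterable[str]) -> Dict[str, str]:
    """Map every 1-substitution variant to its originating whitelist barcode."""
    errors: Dict[str, str] = {}
    for barcode in whitelist:
        for i, original in enumerate(barcode):
            for base in "ACGTN":
                if base != original:
                    errors[barcode[:i] + base + barcode[i + 1:]] = barcode
    return errors


def corrected(barcode: str, whitelist: Set[str], errors: Dict[str, str]) -> str:
    """Return the barcode itself if whitelisted, else its correction.

    Raises KeyError when the barcode is not within distance 1 of the whitelist.
    """
    if barcode in whitelist:
        return barcode
    return errors[barcode]


whitelist = {"ACGT", "TTTT"}
errors = single_base_errors(whitelist)
fixed = corrected("ACGA", whitelist, errors)
```

Building the table once makes each later lookup a single dictionary access, at the cost of memory proportional to whitelist size times barcode length.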

sctools.encodings module¶

Compressed Barcode Encoding Methods¶

This module defines several classes to encode DNA sequences in memory-efficient forms, using 2 bits to encode bases of a 4-letter DNA alphabet (ACGT) or 3 bits to encode a 5-letter DNA alphabet that includes the ambiguous call often included by Illumina base calling software (ACGTN). The classes also contain several methods useful for efficient querying and manipulation of the encoded sequence.

Classes¶

Encoding Encoder base class

ThreeBit Three bit DNA encoder / decoder

TwoBit Two bit DNA encoder / decoder

class sctools.encodings.Encoding[source]¶

Bases: object

encoding_map¶

Class that mimics a Mapping[bytes, int] where bytes must be a single byte encoded character (encoder)

Type: TwoBitEncodingMap

decoding_map¶

Dictionary that maps integers to bytes human-readable representations (decoder)

Type: Mapping[int, bytes]

bits_per_base¶

number of bits used to encode each base

Type: int

encode(bytes_encoded: bytes)[source]¶: encode a DNA string in a compressed representation

decode(integer_encoded: int)[source]¶: decode a compressed DNA string into a human readable bytes format

gc_content(integer_encoded: int)[source]¶: calculate the GC content of an encoded DNA string

hamming_distance(a: int, b: int)[source]¶: calculate the hamming distance between two encoded DNA strings

bits_per_base: int = NotImplemented¶

decode(integer_encoded: int) → bytes[source]¶

Decode an integer-encoded DNA string.

Parameters: integer_encoded (int) – Integer encoded DNA string
Returns: decoded – Bytes decoded DNA sequence
Return type: bytes

decoding_map: Mapping[int, AnyStr] = NotImplemented¶

classmethod encode(bytes_encoded: bytes) → int[source]¶

Encode a DNA bytes string.

Parameters: bytes_encoded (bytes) – bytes DNA string
Returns: encoded – Encoded DNA sequence
Return type: int

encoding_map: Mapping[AnyStr, int] = NotImplemented¶

gc_content(integer_encoded: int) → int[source]¶

Return the number of G or C nucleotides in integer_encoded

Parameters: integer_encoded (int) – Integer encoded DNA string
Returns: gc_content – number of bases in integer_encoded input that are G or C.
Return type: int

static hamming_distance(a, b) → int[source]¶

Calculate the hamming distance between two DNA sequences

The hamming distance counts the number of bases that are not the same nucleotide

Parameters

a (int) – integer encoded
b (int) – integer encoded

Returns

d – hamming distance between a and b

Return type

int
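For fixed-width encodings, the distance can be computed directly on the integers: XOR the two values and count the base-sized groups of bits that are non-zero. The sketch below assumes 2 bits per base and the A=0, C=1, T=2, G=3 assignment shown for TwoBit; hamming_distance_2bit is a hypothetical name, and the bit layout (first base in the highest bits) is an assumption.

```python
def hamming_distance_2bit(a: int, b: int) -> int:
    """Count the 2-bit groups in which two encoded sequences differ."""
    difference = a ^ b
    distance = 0
    while difference:
        if difference & 0b11:  # any differing bit means a different base
            distance += 1
        difference >>= 2
    return distance


acgt = 0b00_01_11_10  # A C G T
actt = 0b00_01_10_10  # A C T T
tcga = 0b10_01_11_00  # T C G A
```

Here hamming_distance_2bit(acgt, actt) compares two sequences that differ only at the third base, while acgt and tcga differ at the first and last.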

class sctools.encodings.ThreeBit(*args, **kwargs)[source]¶

Bases: sctools.encodings.Encoding

Encode a DNA sequence using a 3-bit encoding.

Since no bases are encoded as 0, an empty triplet is interpreted as the end of the encoded string; Three-bit encoding can be used to encode and decode strings without knowledge of their length.
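The length-free decoding can be sketched as follows, using the code assignments listed in this class's decoding_map (C=1, A=2, G=3, T=4, N=6). The sketch works on str rather than bytes and assumes the first base occupies the highest bits; both are simplifications made here, not statements about the library's internals.

```python
ENCODING = {"C": 1, "A": 2, "G": 3, "T": 4, "N": 6}
DECODING = {code: base for base, code in ENCODING.items()}


def encode_3bit(sequence: str) -> int:
    """Pack a DNA string into an integer, 3 bits per base."""
    encoded = 0
    for base in sequence:
        encoded = (encoded << 3) | ENCODING[base]
    return encoded


def decode_3bit(encoded: int) -> str:
    """Unpack until the remaining value is 0; no length argument is needed."""
    bases = []
    while encoded:
        bases.append(DECODING[encoded & 0b111])
        encoded >>= 3
    return "".join(reversed(bases))


round_trip = decode_3bit(encode_3bit("ACGTN"))
```

Because every base maps to a non-zero code, the leading bits of the integer are never ambiguous with padding.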

encoding_map¶

Class that mimics a Mapping[bytes, int] where bytes must be a single byte encoded character (encoder)

Type: ThreeBitEncodingMap

decoding_map¶

Dictionary that maps integers to bytes human-readable representations (decoder)

Type: Mapping[int, bytes]

bits_per_base¶

number of bits used to encode each base

Type: int

encode(bytes_encoded: bytes)[source]¶: encode a DNA string in a compressed representation

decode(integer_encoded: int)[source]¶: decode a compressed DNA string into a human readable bytes format

gc_content(integer_encoded: int)[source]¶: calculate the GC content of an encoded DNA string

hamming_distance(a: int, b: int)[source]¶: calculate the hamming distance between two encoded DNA strings

class ThreeBitEncodingMap[source]¶

Bases: object

Dict-like class that maps bytes to 3-bit integer representations

All IUPAC ambiguous codes are treated as “N”

map_ = {65: 2, 67: 1, 71: 3, 78: 6, 84: 4, 97: 2, 99: 1, 103: 3, 110: 6, 116: 4}¶

bits_per_base: int = 3¶

classmethod decode(integer_encoded: int) → bytes[source]¶

Decode an integer-encoded DNA string.

Parameters: integer_encoded (int) – Integer encoded DNA string
Returns: decoded – Bytes decoded DNA sequence
Return type: bytes

decoding_map: Mapping[int, bytes] = {1: b'C', 2: b'A', 3: b'G', 4: b'T', 6: b'N'}¶

classmethod encode(bytes_encoded: bytes) → int[source]¶

Encode a DNA bytes string.

Parameters: bytes_encoded (bytes) – bytes DNA string
Returns: encoded – Encoded DNA sequence
Return type: int

encoding_map: sctools.encodings.ThreeBit.ThreeBitEncodingMap = <sctools.encodings.ThreeBit.ThreeBitEncodingMap object>¶

classmethod gc_content(integer_encoded: int) → int[source]¶

Return the number of G or C nucleotides in integer_encoded

Parameters: integer_encoded (int) – Integer encoded DNA string
Returns: gc_content – number of bases in integer_encoded input that are G or C.
Return type: int

static hamming_distance(a: int, b: int) → int[source]¶

Calculate the hamming distance between two DNA sequences

The hamming distance counts the number of bases that are not the same nucleotide

Parameters

a (int) – integer encoded
b (int) – integer encoded

Returns

d – hamming distance between a and b

Return type

int

class sctools.encodings.TwoBit(sequence_length: int)[source]¶

Bases: sctools.encodings.Encoding

Encode a DNA sequence using a 2-bit encoding.

Two-bit encoding uses 0 for an encoded nucleotide. As such, it cannot distinguish between the end of sequence and trailing A nucleotides, and thus decoding these strings requires knowledge of their length. Therefore, it is only appropriate for encoding fixed sequence lengths.

In addition, in order to encode in 2-bit, N-nucleotides must be randomized to one of A, C, G, and T.

Parameters: sequence_length (int) – number of nucleotides that are being encoded
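The need for sequence_length can be seen in a small sketch that uses the code assignments from this class's decoding_map (A=0, C=1, T=2, G=3). It operates on str instead of bytes, places the first base in the highest bits, and omits the randomization of N; all three are assumptions or simplifications for illustration.

```python
ENCODING = {"A": 0, "C": 1, "T": 2, "G": 3}
DECODING = {code: base for base, code in ENCODING.items()}


def encode_2bit(sequence: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base."""
    encoded = 0
    for base in sequence:
        encoded = (encoded << 2) | ENCODING[base]
    return encoded


def decode_2bit(encoded: int, sequence_length: int) -> str:
    """Unpack exactly sequence_length bases, lowest bits last."""
    bases = []
    for _ in range(sequence_length):
        bases.append(DECODING[encoded & 0b11])
        encoded >>= 2
    return "".join(reversed(bases))


# "AAAC" and "C" encode to the same integer because A is 0.
same_value = encode_2bit("AAAC") == encode_2bit("C")
```

Only the length passed to decode_2bit tells the two sequences apart, which is why the class is constructed with a fixed sequence_length.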

encoding_map¶

Class that mimics a Mapping[bytes, int] where bytes must be a single byte encoded character (encoder)

Type: TwoBitEncodingMap

decoding_map¶

Dictionary that maps integers to bytes human-readable representations (decoder)

Type: Mapping[int, bytes]

bits_per_base¶

number of bits used to encode each base

Type: int

encode(bytes_encoded: bytes)[source]¶: encode a DNA string in a compressed representation

decode(integer_encoded: int)[source]¶: decode a compressed DNA string into a human readable bytes format

gc_content(integer_encoded: int)[source]¶: calculate the GC content of an encoded DNA string

hamming_distance(a: int, b: int)[source]¶: calculate the hamming distance between two encoded DNA strings

class TwoBitEncodingMap[source]¶

Bases: object

Dict-like class that maps bytes to 2-bit integer representations

Generates random nucleotides for ambiguous nucleotides e.g. N

iupac_ambiguous: Set[int] = {66, 68, 72, 75, 77, 78, 82, 83, 86, 87, 89, 98, 100, 104, 107, 109, 110, 114, 115, 118, 119, 121}¶

map_ = {65: 0, 67: 1, 71: 3, 84: 2, 97: 0, 99: 1, 103: 3, 116: 2}¶

bits_per_base: int = 2¶

decode(integer_encoded: int) → bytes[source]¶

Decode an integer-encoded DNA string.

Parameters: integer_encoded (int) – Integer encoded DNA string
Returns: decoded – Bytes decoded DNA sequence
Return type: bytes

decoding_map: Mapping[int, bytes] = {0: b'A', 1: b'C', 2: b'T', 3: b'G'}¶

classmethod encode(bytes_encoded: bytes) → int[source]¶

Encode a DNA bytes string.

Parameters: bytes_encoded (bytes) – bytes DNA string
Returns: encoded – Encoded DNA sequence
Return type: int

encoding_map: sctools.encodings.TwoBit.TwoBitEncodingMap = <sctools.encodings.TwoBit.TwoBitEncodingMap object>¶

gc_content(integer_encoded: int) → int[source]¶

Return the number of G or C nucleotides in integer_encoded

Parameters: integer_encoded (int) – Integer encoded DNA string
Returns: gc_content – number of bases in integer_encoded input that are G or C.
Return type: int

static hamming_distance(a: int, b: int) → int[source]¶

Calculate the hamming distance between two DNA sequences

The hamming distance counts the number of bases that are not the same nucleotide

Parameters

a (int) – integer encoded
b (int) – integer encoded

Returns

d – hamming distance between a and b

Return type

int

sctools.fastq module¶

Efficient Fastq Iterators and Representations¶

This module implements classes for representing fastq records, reading and writing them, and extracting parts of fastq sequence for transformation into bam format tags.

extract_barcode(record, embedded_barcode) extract a barcode, defined by embedded_barcode, from record

Classes

Record Represents fastq records (input as bytes)

StrRecord Represents fastq records (input as str)

Reader Opens and iterates over fastq files

EmbeddedBarcodeGenerator Generates barcodes from a fastq file

BarcodeGeneratorWithCorrectedCellBarcodes Generates (corrected) barcodes from a fastq file

References

https://en.wikipedia.org/wiki/FASTQ_format

class sctools.fastq.BarcodeGeneratorWithCorrectedCellBarcodes(fastq_files: Union[str, Iterable[str]], embedded_cell_barcode: sctools.fastq.Tag, whitelist: str, other_embedded_barcodes: Iterable[sctools.fastq.Tag] = (), *args, **kwargs)[source]¶

Bases: sctools.fastq.Reader

Generate barcodes from FASTQ file(s) from positions defined by EmbeddedBarcode(s)

Extracted barcode objects are produced in a form that is consumable by pysam’s bam and sam set_tag methods. In this class, one EmbeddedBarcode must be defined as an embedded_cell_barcode, which is checked against a whitelist and error corrected during generation

Parameters

fastq_files (str | List, optional) – FASTQ file or files to be read. (default = sys.stdin)
mode ({'r', 'rb'}, optional) – open mode for fastq files. If ‘r’, return string. If ‘rb’, return bytes (default = ‘r’)
whitelist (str) – whitelist file containing “correct” cell barcodes for an experiment
embedded_cell_barcode (EmbeddedBarcode) – EmbeddedBarcode containing information about the position and names of cell barcode tags
other_embedded_barcodes (Iterable[EmbeddedBarcode], optional) – tag objects defining start and end of the sequence containing the tag, and the tag identifiers for sequence and quality tags (default = ())

extract_cell_barcode(record: Record, cb: str)[source]¶

extract_cell_barcode(record: Tuple[str], cb: sctools.fastq.Tag)[source]¶

Extract a cell barcode from a fastq record

Parameters

record (Tuple[str]) – fastq record comprised of four strings: name, sequence, name2, and quality
cb (EmbeddedBarcode) – defines the position and tag identifier for a cell barcode

Returns

sequence_tag (Tuple[str, str, ‘Z’]) – raw sequence tag identifier, sequence, SAM tag type (‘Z’ implies a string tag)
quality_tag (Tuple[str, str, ‘Z’]) – quality tag identifier, quality, SAM tag type (‘Z’ implies a string tag)
corrected_tag (Optional[Tuple[str, str, ‘Z’]]) – Whitelist verified sequence tag. Only present if the raw sequence tag is in the whitelist or within 1 hamming distance of one of its barcodes

property filenames: List[str]¶

select_record_indices(indices: Set) → Generator¶

Iterate over provided indices only, skipping other records.

Parameters: indices (Set[int]) – indices to include in the output
Yields: record, str – records from file corresponding to indices

property size: int¶: return the collective size of all files being read in bytes

sctools.fastq.EmbeddedBarcode¶: alias of sctools.fastq.Tag

class sctools.fastq.EmbeddedBarcodeGenerator(fastq_files, embedded_barcodes, *args, **kwargs)[source]¶

Bases: sctools.fastq.Reader

Generate barcodes from FASTQ file(s) from positions defined by EmbeddedBarcode(s)

Extracted barcode objects are produced in a form that is consumable by pysam’s bam and sam set_tag methods.

Parameters

embedded_barcodes (Iterable[EmbeddedBarcode]) – tag objects defining start and end of the sequence containing the tag, and the tag identifiers for sequence and quality tags
fastq_files (str | List, optional) – FASTQ file or files to be read. (default = sys.stdin)
mode ({'r', 'rb'}, optional) – open mode for FASTQ files. If ‘r’, return string. If ‘rb’, return bytes (default = ‘r’)

property filenames: List[str]¶

select_record_indices(indices: Set) → Generator¶

Iterate over provided indices only, skipping other records.

Parameters: indices (Set[int]) – indices to include in the output
Yields: record, str – records from file corresponding to indices

property size: int¶: return the collective size of all files being read in bytes

class sctools.fastq.Reader(files='-', mode='r', header_comment_char=None)[source]¶

Bases: sctools.reader.Reader

Fastq Reader that defines some special methods for reading and summarizing FASTQ data.

Simple reader class that exposes an __iter__ and __len__ method

Examples

#todo add examples

See also

sctools.reader.Reader

References

https://en.wikipedia.org/wiki/FASTQ_format

property filenames: List[str]¶

select_record_indices(indices: Set) → Generator[source]¶

Iterate over provided indices only, skipping other records.

Parameters: indices (Set[int]) – indices to include in the output
Yields: record, str – records from file corresponding to indices
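Index-based selection over FASTQ data can be sketched with the standard library alone, treating every four consecutive lines as one record. The helpers below are hypothetical and do not use sctools; they assume well-formed input in which each record occupies exactly four lines.

```python
import io
from typing import Iterable, Iterator, Set, Tuple

FastqRecord = Tuple[str, str, str, str]  # name, sequence, name2, quality


def iter_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group an iterable of lines into 4-line FASTQ records."""
    iterator = iter(lines)
    # zip over the same iterator four times pulls four lines per record.
    yield from zip(iterator, iterator, iterator, iterator)


def select_records(lines: Iterable[str], indices: Set[int]) -> Iterator[FastqRecord]:
    """Yield only the records whose 0-based position is in `indices`."""
    for i, record in enumerate(iter_fastq(lines)):
        if i in indices:
            yield record


fastq_text = (
    "@r0\nACGT\n+\nIIII\n"
    "@r1\nTTTT\n+\nFFFF\n"
    "@r2\nGGGG\n+\nHHHH\n"
)
selected = list(select_records(io.StringIO(fastq_text), {0, 2}))
```

Passing indices as a set keeps each membership check constant-time while the file is streamed once from start to end.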

property size: int¶: return the collective size of all files being read in bytes

class sctools.fastq.Record(record: Iterable[AnyStr])[source]¶

Bases: object

Fastq Record.

Parameters: record (Iterable[bytes]) – Iterable of 4 bytes strings that comprise a fastq record

name¶

fastq record name

Type: bytes

sequence¶

fastq nucleotide sequence

Type: bytes

name2¶

second fastq record name field (rarely used)

Type: bytes

quality¶

base call quality for each nucleotide in sequence

Type: bytes

average_quality()[source]¶: The average quality of the fastq record

average_quality() → float[source]¶: return the average quality of this record

property name: AnyStr¶

property name2: AnyStr¶

property quality: AnyStr¶

property sequence: AnyStr¶

class sctools.fastq.StrRecord(record: Iterable[AnyStr])[source]¶

Bases: sctools.fastq.Record

Fastq Record.

Parameters: record (Iterable[str]) – Iterable of 4 strings that comprise a FASTQ record

name¶

FASTQ record name

Type: str

sequence¶

FASTQ nucleotide sequence

Type: str

name2¶

second FASTQ record name field (rarely used)

Type: str

quality¶

base call quality for each nucleotide in sequence

Type: str

average_quality()[source]¶: The average quality of the FASTQ record

average_quality() → float[source]¶: return the average quality of this record

property name: str¶

property name2: AnyStr¶

property quality: AnyStr¶

property sequence: AnyStr¶

sctools.fastq.extract_barcode(record, embedded_barcode) → Tuple[Tuple[str, str, str], Tuple[str, str, str]][source]¶

Extracts barcodes from a FASTQ record at positions defined by an EmbeddedBarcode object.

Parameters

record (FastqRecord) – Record to extract from
embedded_barcode (EmbeddedBarcode) – Defines the barcode start and end positions and the tag name for the sequence and quality tags

Returns

sequence_tag (Tuple[str, str, ‘Z’]) – sequence tag identifier, sequence, SAM tag type (‘Z’ implies a string tag)
quality_tag (Tuple[str, str, ‘Z’]) – quality tag identifier, quality, SAM tag type (‘Z’ implies a string tag)
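The return shape described above can be sketched with plain slicing. This is a minimal illustration, not the library's implementation: `BarcodeSpec` is a hypothetical stand-in for EmbeddedBarcode, `slice_barcode` is a hypothetical helper, and the read and quality strings are made up. The start, end, and tag values mirror the TenXV2 cell barcode listed later on this page.

```python
from typing import NamedTuple, Tuple


class BarcodeSpec(NamedTuple):
    """Hypothetical stand-in for EmbeddedBarcode (positions plus tag names)."""
    start: int
    end: int
    sequence_tag: str
    quality_tag: str


SamTag = Tuple[str, str, str]


def slice_barcode(sequence: str, quality: str, spec: BarcodeSpec) -> Tuple[SamTag, SamTag]:
    """Cut the barcode bases and qualities out of a read and wrap them as SAM tags."""
    barcode_sequence = sequence[spec.start:spec.end]
    barcode_quality = quality[spec.start:spec.end]
    # 'Z' marks a string-valued SAM tag.
    return (spec.sequence_tag, barcode_sequence, "Z"), (spec.quality_tag, barcode_quality, "Z")


cell = BarcodeSpec(start=0, end=16, sequence_tag="CR", quality_tag="CY")
read = "ACGT" * 4 + "TTTTGGGGCC"   # 16 barcode bases followed by 10 more bases
qual = "I" * 16 + "F" * 10
sequence_tag, quality_tag = slice_barcode(read, qual, cell)
print(sequence_tag, quality_tag)
```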

sctools.gtf module¶

GTF Records and Iterators¶

This module defines a GTF record class and a Reader class to iterate over GTF-format files

Classes¶

Record Data class that exposes GTF record fields by name

Reader GTF file reader that yields GTF Records

References

https://useast.ensembl.org/info/website/upload/gff.html

class sctools.gtf.GTFRecord(record: str)[source]¶

Bases: object

Data class for storing and interacting with GTF records

Subclassed to produce exon, transcript, and gene-specific record types. A GTF record has 8 fixed fields, followed by optional fields separated by semicolons (;). This class stores the optional fields in the attributes field and makes them accessible through get_attribute. Fixed fields are accessible by name.

Parameters: record (str) – an unparsed GTF record
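The record layout described above (8 fixed tab-separated fields followed by semicolon-separated attributes) can be sketched as follows. This is a simplified illustration rather than the class's actual parser: `parse_gtf_line` is a hypothetical helper, the example line is made up, and attribute values containing semicolons are not handled.

```python
from typing import Dict, Tuple

FIXED_FIELDS = ("seqname", "source", "feature", "start", "end", "score", "strand", "frame")


def parse_gtf_line(line: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split one GTF line into its fixed fields and its attribute dictionary."""
    columns = line.rstrip("\n").split("\t")
    fixed = dict(zip(FIXED_FIELDS, columns[:8]))
    attributes: Dict[str, str] = {}
    if len(columns) > 8:
        for item in columns[8].split(";"):
            item = item.strip()
            if not item:
                continue
            # Each attribute looks like: key "value"
            key, _, value = item.partition(" ")
            attributes[key] = value.strip().strip('"')
    return fixed, attributes


# Made-up record for illustration only.
line = 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "geneA";\n'
fixed, attributes = parse_gtf_line(line)
print(fixed["feature"], attributes["gene_name"])
```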

seqname¶

The name of the sequence (often chromosome) this record is found on.

Type: str

chromosome¶

Synonym for seqname.

Type: str

source¶

The group responsible for generating this annotation.

Type: str

feature¶

The type of record (e.g. gene, exon, …).

Type: str

start¶

The start position of this feature relative to the beginning of seqname.

Type: int

end¶

The end position of this feature relative to the beginning of seqname.

Type: int

score¶

The annotation score. Rarely used.

Type: str

strand¶

The strand of seqname that this annotation is found on

Type: {‘+’, ‘-‘}

frame¶

‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on

Type: {‘0’, ‘1’, ‘2’}

size¶

the number of nucleotides spanned by this feature

Type: int

get_attribute(key: str)[source]¶: attempt to retrieve a variable field with name equal to key

set_attribute(key: str, value: str)[source]¶: set variable field key equal to value. Overwrites key if already present.

property chromosome: str¶

property end: int¶

property feature: str¶

property frame: str¶

get_attribute(key) → str[source]¶

access an item from the attribute field of a GTF file.

Parameters: key (str) – Item to retrieve
Returns: value – Contents of variable attribute key
Return type: str
Raises: KeyError – if there is no variable attribute key associated with this record

property score: str¶

property seqname: str¶

set_attribute(key, value) → None[source]¶

Set variable attribute key equal to value

If attribute key is already set for this record, its contents are overwritten by value

Parameters

key (str) – attribute name
value (str) – attribute content

property size: int¶

property source: str¶

property start: int¶

property strand: str¶

class sctools.gtf.Reader(files='-', mode='r', header_comment_char='#')[source]¶

Bases: sctools.reader.Reader

GTF file iterator

Parameters

files (Union[str, List], optional) – File(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).
header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)

filter(retain_types: Iterable[str])[source]¶: Iterate over a GTF file, only yielding records in retain_types.

__iter__()[source]¶: iterate over GTF records in file, yielding Record objects

See also

sctools.reader.Reader

property filenames: List[str]¶

filter(retain_types: Iterable[str]) → Generator[source]¶

Iterate over a GTF file, yielding only records whose feature type is in retain_types.

The feature type is stored in the third GTF column (zero-based index 2).

Parameters: retain_types (Iterable[str]) – Record feature types to retain.
Yields: gtf_record (Record) – gtf Record object
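The filtering behaviour can be sketched on raw lines without the library. This is an illustrative generator under stated assumptions: `filter_gtf_lines` is a hypothetical function, it yields raw lines rather than Record objects, and the GTF text is made up.

```python
import io
from typing import Iterable, Iterator, TextIO


def filter_gtf_lines(handle: TextIO, retain_types: Iterable[str],
                     comment_char: str = "#") -> Iterator[str]:
    """Yield only the GTF lines whose feature type (third column) is retained."""
    retain = set(retain_types)
    for line in handle:
        if line.startswith(comment_char):
            continue  # skip header comment lines
        columns = line.split("\t")
        if len(columns) > 2 and columns[2] in retain:
            yield line


# Made-up GTF content for illustration only.
gtf_text = (
    "#!illustrative header\n"
    'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n'
    'chr1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "G1";\n'
    'chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1";\n'
)
kept = list(filter_gtf_lines(io.StringIO(gtf_text), ["gene", "exon"]))
print(len(kept))
```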

select_record_indices(indices: Set) → Generator[source]¶

Iterate over provided indices only, skipping other records.

Parameters: indices (Set[int]) – indices to include in the output
Yields: record, str – records from file corresponding to indices

property size: int¶: return the collective size of all files being read in bytes

sctools.gtf.extract_extended_gene_names(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') → Dict[str, List[tuple]][source]¶

Extracts extended gene names from GTF file(s) and returns a map from gene names to their corresponding occurrence locations in the given file(s).

Parameters

files (Union[str, List], optional) – File(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).
header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)

Returns

A dictionary of chromosome names mapping to a List of tuples, each containing a range as the first element and a gene name as the second: Dict[str, List[Tuple[Tuple[start, end], gene]]]

Return type

Dict[str, List[tuple]]

sctools.gtf.extract_gene_exons(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') → Dict[str, List[tuple]][source]¶

Extracts gene exons from GTF file(s) and returns a map from gene names to the list of their exons, in ascending order of start position.

Parameters

files (Union[str, List], optional) – File(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).
header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)

Returns

A dictionary of chromosome names mapping to a List of tuples, each containing the exons in ascending order of start position: Dict[str, List[Tuple[Tuple[start, end], gene]]]

Return type

Dict[str, List[tuple]]

sctools.gtf.extract_gene_names(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') → Dict[str, int][source]¶

Extracts gene names from GTF file(s) and returns a map from gene names to their corresponding occurrence order in the given file(s).

Parameters

files (Union[str, List], optional) – File(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).
header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)

Returns

A map from gene names to their linear index

Return type

Dict[str, int]
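A map from gene names to their linear index can be sketched as below. The details are assumptions rather than the library's documented behaviour: `index_gene_names` is a hypothetical helper, only records whose feature type is gene are counted, names are read from a gene_name attribute, and a repeated name keeps its first index. The GTF text is made up.

```python
import io
from typing import Dict, TextIO


def index_gene_names(handle: TextIO) -> Dict[str, int]:
    """Map each gene name to the order in which it first appears."""
    gene_index: Dict[str, int] = {}
    for line in handle:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "gene":
            continue  # assumption: only 'gene' records contribute
        for item in columns[8].split(";"):
            key, _, value = item.strip().partition(" ")
            if key == "gene_name":
                # setdefault keeps the first index if the name repeats
                gene_index.setdefault(value.strip('"'), len(gene_index))
    return gene_index


# Made-up GTF content for illustration only.
gtf_text = (
    'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "geneA";\n'
    'chr1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_name "geneA";\n'
    'chr2\tsrc\tgene\t50\t400\t.\t-\t.\tgene_id "G2"; gene_name "geneB";\n'
)
gene_index = index_gene_names(io.StringIO(gtf_text))
print(gene_index)
```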

sctools.gtf.get_mitochondrial_gene_names(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') → Set[str][source]¶

Extracts mitochondrial gene names from GTF file(s) and returns the set of mitochondrial gene ids that occur in the given file(s).

Parameters

files (Union[str, List], optional) – File(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).
header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)

Returns

A set of the mitochondrial gene ids

Return type

Set[str]

sctools.platform module¶

Command Line Interface for SC Tools:¶

This module defines the command line interface for SC Tools. Tools are separated into those that are specific to particular chemistries (e.g. Smart-seq 2) or experimental platforms (e.g. 10x Genomics v2) and those that are general across any sequencing experiment.

Currently, only general modules and those used for 10x v2 are implemented

Classes¶

GenericPlatform Class containing all general command line utilities

TenXV2 Class containing 10x v2 specific command line utilities

class sctools.platform.BarcodePlatform[source]¶

Bases: sctools.platform.GenericPlatform

Command Line Interface for extracting and attaching barcodes with specified positions: generalizing TenXV2 attach barcodes

Sample, cell and/or molecule barcodes can be extracted and attached to an unmapped bam when the corresponding barcode’s start position and length are provided. The sample barcode is extracted from the index i7 fastq file, and the cell and molecule barcodes are extracted from the r1 fastq file.

This class defines several methods that are created as CLI tools when sctools is installed (see setup.py)

cell_barcode¶

A data class that defines the start and end position of the cell barcode and the tags to assign the sequence and quality of the cell barcode

Type: fastq.EmbeddedBarcode

molecule_barcode¶

A data class that defines the start and end position of the molecule barcode and the tags to assign the sequence and quality of the molecule barcode

Type: fastq.EmbeddedBarcode

sample_barcode¶

A data class that defines the start and end position of the sample barcode and the tags to assign the sequence and quality of the sample barcode

Type: fastq.EmbeddedBarcode

attach_barcodes()[source]¶: Attach barcodes from the forward (r1) and optionally index (i1) fastq files to the reverse (r2) bam file

classmethod attach_barcodes(args=None)[source]¶

Command line entrypoint for attaching barcodes to a bamfile.

Parameters: args (Iterable[str], optional) – arguments list. The default value of None, when passed to parser.parse_args, causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod bam_to_count_matrix(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for constructing a count matrix from a tagged bam file.

Constructs a count matrix from an aligned bam file sorted by cell barcode, molecule barcode, and gene id.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod calculate_cell_metrics(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for calculating cell metrics from a sorted bamfile.

Writes metrics to .csv

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod calculate_gene_metrics(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for calculating gene metrics from a sorted bamfile.

Writes metrics to .csv

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

cell_barcode = None¶

classmethod get_tags(raw_tags: Optional[Sequence[str]]) → Iterable[str]¶

classmethod group_qc_outputs(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for parsing Picard metrics files and HISAT2 and RSEM statistics log files.

Parameters: args – file_names: array of files; output_name: prefix of output file name; metrics_type: Picard, PicardTable, HISAT2, RSEM and Core.

Returns: return – return if the program completes successfully.
Return type: 0

classmethod merge_cell_metrics(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for merging multiple cell metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod merge_count_matrices(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for merging multiple count matrices.

Merges multiple csr-format count matrices into a single csr matrix.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod merge_gene_metrics(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for merging multiple gene metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

molecule_barcode = None¶

sample_barcode = None¶

classmethod split_bam(args: Optional[Iterable] = None) → int¶

Command line entrypoint for splitting a bamfile into subfiles of equal size.

prints filenames of chunks to stdout

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod tag_sort_bam(args: Optional[Iterable] = None) → int¶

Command line entrypoint for sorting a bam file by zero or more tags, followed by queryname.

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod verify_bam_sort(args: Optional[Iterable] = None) → int¶

Command line entrypoint for verifying bam is properly sorted by zero or more tags, followed by queryname.

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

class sctools.platform.GenericPlatform[source]¶

Bases: object

Platform-agnostic command line functions available in SC Tools.

tag_sort_bam(): sort a bam file by zero or more tags and then by queryname
verify_bam_sort(): verify whether a bam file is correctly sorted by the given list of zero or more tags, then queryname
split_bam(): split a bam file into subfiles of equal size
calculate_gene_metrics(): calculate information about genes captured by a sequencing experiment
calculate_cell_metrics(): calculate information about cells captured by a sequencing experiment
merge_gene_metrics(): merge multiple gene metrics files into a single output
merge_cell_metrics(): merge multiple cell metrics files into a single output
bam_to_count_matrix(): construct a compressed sparse row count file from a tagged, aligned bam file
merge_count_matrices(): merge multiple csr-format count matrices into a single csr matrix
group_qc_outputs(): aggregate Picard, HISAT2 and RSEM QC statistics

classmethod bam_to_count_matrix(args: Optional[Iterable[str]] = None) → int[source]¶

Command line entrypoint for constructing a count matrix from a tagged bam file.

Constructs a count matrix from an aligned bam file sorted by cell barcode, molecule barcode, and gene id.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod calculate_cell_metrics(args: Optional[Iterable[str]] = None) → int[source]¶

Command line entrypoint for calculating cell metrics from a sorted bamfile.

Writes metrics to .csv

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod calculate_gene_metrics(args: Optional[Iterable[str]] = None) → int[source]¶

Command line entrypoint for calculating gene metrics from a sorted bamfile.

Writes metrics to .csv

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod get_tags(raw_tags: Optional[Sequence[str]]) → Iterable[str][source]¶

classmethod group_qc_outputs(args: Optional[Iterable[str]] = None) → int[source]¶

Command line entrypoint for parsing Picard metrics files and HISAT2 and RSEM statistics log files.

Parameters: args – file_names: array of files; output_name: prefix of output file name; metrics_type: Picard, PicardTable, HISAT2, RSEM and Core.

Returns: return – return if the program completes successfully.
Return type: 0

classmethod merge_cell_metrics(args: Optional[Iterable[str]] = None) → int[source]¶

Command line entrypoint for merging multiple cell metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod merge_count_matrices(args: Optional[Iterable[str]] = None) → int[source]¶

Command line entrypoint for merging multiple count matrices.

Merges multiple csr-format count matrices into a single csr matrix.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod merge_gene_metrics(args: Optional[Iterable[str]] = None) → int[source]¶

Command line entrypoint for merging multiple gene metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod split_bam(args: Optional[Iterable] = None) → int[source]¶

Command line entrypoint for splitting a bamfile into subfiles of equal size.

prints filenames of chunks to stdout

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod tag_sort_bam(args: Optional[Iterable] = None) → int[source]¶

Command line entrypoint for sorting a bam file by zero or more tags, followed by queryname.

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod verify_bam_sort(args: Optional[Iterable] = None) → int[source]¶

Command line entrypoint for verifying bam is properly sorted by zero or more tags, followed by queryname.

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0
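The entry points above share one calling convention: `args=None` lets argparse read the command line, while an explicit list makes the method callable from tests. The sketch below shows that pattern with a hypothetical class and hypothetical options; `ExamplePlatform`, `count_items`, `--input`, and `--number` are not part of sctools.

```python
import argparse
from typing import Iterable, Optional


class ExamplePlatform:
    """Hypothetical class illustrating the entry point pattern; not part of sctools."""

    @classmethod
    def count_items(cls, args: Optional[Iterable[str]] = None) -> int:
        parser = argparse.ArgumentParser(description="illustrative entry point")
        parser.add_argument("-i", "--input", required=True, help="input file name")
        parser.add_argument("-n", "--number", type=int, default=1)
        # With args=None, argparse falls back to sys.argv[1:].
        parsed = parser.parse_args(args)
        print(f"would process {parsed.input} {parsed.number} time(s)")
        return 0  # return code signalling success


# Passing a list bypasses sys.argv, which is what makes the method testable.
return_code = ExamplePlatform.count_items(["--input", "example.bam", "-n", "2"])
```

Returning an integer lets a console script wrapper hand the value to `sys.exit` if desired.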

class sctools.platform.TenXV2[source]¶

Bases: sctools.platform.GenericPlatform

Command Line Interface for 10x Genomics v2 RNA-sequencing programs

This class defines several methods that are created as CLI tools when sctools is installed (see setup.py)

cell_barcode¶

A data class that defines the start and end position of the cell barcode and the tags to assign the sequence and quality of the cell barcode

Type: fastq.EmbeddedBarcode

molecule_barcode¶

A data class that defines the start and end position of the molecule barcode and the tags to assign the sequence and quality of the molecule barcode

Type: fastq.EmbeddedBarcode

sample_barcode¶

A data class that defines the start and end position of the sample barcode and the tags to assign the sequence and quality of the sample barcode

Type: fastq.EmbeddedBarcode

attach_barcodes()[source]¶: Attach barcodes from the forward (r1) and optionally index (i1) fastq files to the reverse (r2) bam file

classmethod attach_barcodes(args=None)[source]¶

Command line entrypoint for attaching barcodes to a bamfile.

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod bam_to_count_matrix(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for constructing a count matrix from a tagged bam file.

Constructs a count matrix from an aligned bam file sorted by cell barcode, molecule barcode, and gene id.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod calculate_cell_metrics(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for calculating cell metrics from a sorted bamfile.

Writes metrics to .csv

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod calculate_gene_metrics(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for calculating gene metrics from a sorted bamfile.

Writes metrics to .csv

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

cell_barcode = Tag(start=0, end=16, sequence_tag='CR', quality_tag='CY')¶

classmethod get_tags(raw_tags: Optional[Sequence[str]]) → Iterable[str]¶

classmethod group_qc_outputs(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for parsing Picard metrics files and HISAT2 and RSEM statistics log files.

Parameters: args – file_names: array of files; output_name: prefix of output file name; metrics_type: Picard, PicardTable, HISAT2, RSEM and Core.

Returns: return – return if the program completes successfully.
Return type: 0

classmethod merge_cell_metrics(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for merging multiple cell metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod merge_count_matrices(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for merging multiple count matrices.

Merges multiple csr-format count matrices into a single csr matrix.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod merge_gene_metrics(args: Optional[Iterable[str]] = None) → int¶

Command line entrypoint for merging multiple gene metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters: args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

molecule_barcode = Tag(start=16, end=26, sequence_tag='UR', quality_tag='UY')¶

sample_barcode = Tag(start=0, end=8, sequence_tag='SR', quality_tag='SY')¶

classmethod split_bam(args: Optional[Iterable] = None) → int¶

Command line entrypoint for splitting a bamfile into subfiles of equal size.

prints filenames of chunks to stdout

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod tag_sort_bam(args: Optional[Iterable] = None) → int¶

Command line entrypoint for sorting a bam file by zero or more tags, followed by queryname.

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

classmethod verify_bam_sort(args: Optional[Iterable] = None) → int¶

Command line entrypoint for verifying bam is properly sorted by zero or more tags, followed by queryname.

Parameters: args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv
Returns: return_call – return call if the program completes successfully
Return type: 0

sctools.reader module¶

Sequence File Iterators¶

This module defines a general iterator and some helper functions for iterating over files that contain sequencing data

sctools.infer_open(file_: str, mode: str)¶: helper function that determines the compression type of a file without relying on its extension

sctools.zip_readers(*readers, indices=None)¶: helper function that iterates over one or more readers, optionally extracting only the records that correspond to indices

Classes¶

Reader Basic reader that loops over one or more input files.

class sctools.reader.Reader(files='-', mode='r', header_comment_char=None)[source]¶

Bases: object

Basic reader object that seamlessly loops over multiple input files.

Is subclassed to create readers for specific file types (e.g. fastq, gtf, etc.)

Parameters

files (Union[str, List], optional) – The file(s) to read. If ‘-‘, read sys.stdin (default = ‘-‘)
mode ({'r', 'rb'}, optional) – The open mode for files. If ‘r’, yield string data, if ‘rb’, yield bytes data (default = ‘r’).
header_comment_char (str, optional) – If not None, skip lines beginning with this character (default = None).

property filenames: List[str]¶

select_record_indices(indices: Set) → Generator[source]¶

Iterate over provided indices only, skipping other records.

Parameters: indices (Set[int]) – indices to include in the output
Yields: record, str – records from file corresponding to indices

property size: int¶: return the collective size of all files being read in bytes
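The idea of one reader looping over several files can be sketched as follows. This is a simplified, hypothetical class (`MultiFileReader`), not the library's Reader: it omits stdin and compressed input, and it skips every line that begins with the comment character.

```python
import os
import tempfile
from typing import Iterator, List, Optional, Union


class MultiFileReader:
    """Hypothetical simplified reader that chains lines from several files."""

    def __init__(self, files: Union[str, List[str]],
                 header_comment_char: Optional[str] = None) -> None:
        self._files = [files] if isinstance(files, str) else list(files)
        self._comment = header_comment_char

    @property
    def size(self) -> int:
        """Collective size of all files, in bytes."""
        return sum(os.path.getsize(name) for name in self._files)

    def __iter__(self) -> Iterator[str]:
        for name in self._files:
            with open(name, "r") as handle:
                for line in handle:
                    if self._comment is not None and line.startswith(self._comment):
                        continue
                    yield line


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.txt")
    second = os.path.join(tmp, "b.txt")
    with open(first, "wb") as out:
        out.write(b"#header\nrecord1\n")
    with open(second, "wb") as out:
        out.write(b"record2\n")
    reader = MultiFileReader([first, second], header_comment_char="#")
    lines = list(reader)
    total_size = reader.size
print(lines, total_size)
```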

sctools.reader.infer_open(file_: str, mode: str) → Callable[source]¶

Helper function to infer the correct compression type of an input file

Identifies files that are .gz or .bz2 compressed without requiring file extensions

Parameters

file_ (str) – the file to open
mode ({'r', 'rb'}) – the mode to open the file in. ‘r’ returns strings, ‘rb’ returns bytes

Returns

open_function – the correct open function for the file’s compression with mode pre-set through functools partial

Return type

Callable
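Detecting compression without relying on the extension usually means inspecting the first bytes of the file. The sketch below shows one way to do that using the gzip and bzip2 signatures; `sniff_open` is a hypothetical name, and this is not necessarily how infer_open is implemented.

```python
import bz2
import gzip
import os
import tempfile
from functools import partial
from typing import Callable


def sniff_open(path: str, mode: str = "r") -> Callable:
    """Return an open function chosen from the file's leading magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(3)
    # gzip.open and bz2.open need "rt" to return str; a bare "r" means binary there.
    compressed_mode = "rt" if mode == "r" else "rb"
    if magic[:2] == b"\x1f\x8b":   # gzip signature
        return partial(gzip.open, mode=compressed_mode)
    if magic == b"BZh":            # bzip2 signature
        return partial(bz2.open, mode=compressed_mode)
    return partial(open, mode=mode)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads_without_extension")
    with gzip.open(path, "wb") as out:
        out.write(b"@read1\nACGT\n")
    with sniff_open(path, "r")(path) as handle:
        text_line = handle.readline()
    with sniff_open(path, "rb")(path) as handle:
        bytes_line = handle.readline()
print(repr(text_line), repr(bytes_line))
```

Pre-setting the mode with `functools.partial` means the caller only has to supply the file name.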

sctools.reader.zip_readers(*readers, indices=None) → Generator[source]¶

Zip together multiple reader objects, yielding records simultaneously.

If indices is passed, only return lines in file that correspond to indices

Parameters

*readers (List[Reader]) – Reader objects to simultaneously iterate over
indices (Set[int], optional) – indices to include in the output

Yields

records (Tuple[str]) – one record per reader passed
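Zipping readers together, with optional index selection, can be sketched with plain iterables. The names `select_indices` and `zip_records` are hypothetical, and the sketch assumes indices are zero-based record positions, which this page does not state.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple


def select_indices(records: Iterable[str], indices: Set[int]) -> Iterator[str]:
    """Yield only the records whose zero-based position is in indices."""
    for position, record in enumerate(records):
        if position in indices:
            yield record


def zip_records(*readers: Iterable[str],
                indices: Optional[Set[int]] = None) -> Iterator[Tuple[str, ...]]:
    """Yield one tuple per position, holding one record from each reader."""
    if indices is None:
        iterators = [iter(reader) for reader in readers]
    else:
        iterators = [select_indices(reader, indices) for reader in readers]
    return zip(*iterators)


# Lists stand in for reader objects here.
r1 = ["a1", "a2", "a3"]
r2 = ["b1", "b2", "b3"]
all_pairs = list(zip_records(r1, r2))
chosen_pairs = list(zip_records(r1, r2, indices={0, 2}))
print(chosen_pairs)
```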

sctools.stats module¶

Statistics Functions for Sequence Data Analysis¶

This module implements statistical functions for sequence analysis

sctools.base4_entropy(x: np.array, axis: int = 1)¶: calculate the entropy of a 4 x sequence length base frequency matrix

Classes¶

OnlineGaussianSufficientStatistic Empirical (online) calculation of mean and variance

class sctools.stats.OnlineGaussianSufficientStatistic[source]¶

Bases: object

Implementation of Welford’s online mean and variance algorithm

update(new_value: float)[source]¶: incorporate new_value into the online estimate of mean and variance

mean()¶: return the mean value

calculate_variance()[source]¶: calculate and return the variance

mean_and_variance()[source]¶: return both mean and variance

property mean: float¶: return the mean value

mean_and_variance() → Tuple[float, float][source]¶: calculate and return the mean and variance

update(new_value: float) → None[source]¶
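Welford's algorithm keeps a running count, mean, and sum of squared deviations, so the data never has to be stored. The sketch below is a minimal version under stated assumptions: `RunningMeanVariance` is a hypothetical class, and it reports the sample variance (dividing by n - 1), which may differ from the choice made in the library.

```python
from typing import Tuple


class RunningMeanVariance:
    """Hypothetical minimal implementation of Welford's online algorithm."""

    def __init__(self) -> None:
        self._count = 0
        self._mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, new_value: float) -> None:
        self._count += 1
        delta = new_value - self._mean
        self._mean += delta / self._count
        # Uses the deviation from the old mean and from the updated mean.
        self._m2 += delta * (new_value - self._mean)

    @property
    def mean(self) -> float:
        return self._mean

    def calculate_variance(self) -> float:
        if self._count < 2:
            return float("nan")  # variance is undefined for fewer than 2 values
        return self._m2 / (self._count - 1)

    def mean_and_variance(self) -> Tuple[float, float]:
        return self.mean, self.calculate_variance()


stat = RunningMeanVariance()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stat.update(value)
mean, variance = stat.mean_and_variance()
print(mean, variance)
```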

sctools.stats.base4_entropy(x, axis=1)[source]¶

Calculate entropy in base four of a data matrix x

Useful for measuring DNA entropy (with 4 nucleotides) as the output is restricted to [0, 1]

Parameters

x (np.ndarray) – array of dimension one or more containing numeric types
axis (int, optional) – axis to calculate entropy across. Values in this axis are treated as observation frequencies

Returns

entropy – array of input dimension - 1 containing entropy values bounded in [0, 1]

Return type

np.ndarray
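Taking logarithms in base four is what bounds the result to [0, 1] for four nucleotides: a uniform distribution gives 1 and a single observed base gives 0. The sketch below shows this for one vector of observation frequencies using only the standard library; `base4_entropy_1d` is a hypothetical helper and does not handle the multi-dimensional arrays that the library function accepts.

```python
import math
from typing import Sequence


def base4_entropy_1d(frequencies: Sequence[float]) -> float:
    """Shannon entropy, in base 4, of one vector of observation frequencies."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    entropy = 0.0
    for value in frequencies:
        if value > 0:  # zero counts contribute nothing to the entropy
            probability = value / total
            entropy -= probability * math.log(probability, 4)
    return entropy


uniform = base4_entropy_1d([25, 25, 25, 25])   # all four bases equally frequent
single = base4_entropy_1d([10, 0, 0, 0])       # only one base observed
print(uniform, single)
```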