sctools package

Submodules

sctools.bam module

Tools for Manipulating SAM/BAM format files

This module provides functions and classes to subsample reads from bam files that correspond to specific chromosomes, split bam files into chunks, assign tags to bam files from paired fastq records, and iterate over sorted bam files by one or more tags.

This module makes heavy use of pysam, a wrapper for HTSlib, a high-performance C library designed to manipulate sam files.

iter_tag_groups                         function to iterate over reads by an arbitrary tag
iter_cell_barcodes                      wrapper for iter_tag_groups that iterates over cell barcode tags
iter_genes                              wrapper for iter_tag_groups that iterates over gene tags
iter_molecule_barcodes                  wrapper for iter_tag_groups that iterates over molecule tags
sort_by_tags_and_queryname              sort bam by given list of zero or more tags, followed by query name
verify_sort                             verifies whether bam is correctly sorted by given list of tags, then query name
Classes
-------
SubsetAlignments                        class to extract reads specific to requested chromosome(s)
Tagger                                  class to add tags to sam/bam records from paired fastq records
AlignmentSortOrder                      abstract class to represent alignment sort orders
QueryNameSortOrder                      alignment sort order by query name
TagSortableRecord                       class to facilitate sorting of pysam.AlignedSegments
SortError                               error raised when sorting is incorrect

References

htslib : https://github.com/samtools/htslib

class sctools.bam.AlignmentSortOrder[source]

Bases: object

The base class of alignment sort orders.

abstract property key_generator: Callable[pysam.libcalignedsegment.AlignedSegment, Any]

Returns a callable function that calculates a sort key from given pysam.AlignedSegment.

class sctools.bam.QueryNameSortOrder[source]

Bases: sctools.bam.AlignmentSortOrder

Alignment record sort order by query name.

static get_sort_key(alignment: pysam.libcalignedsegment.AlignedSegment) str[source]
property key_generator

Returns a callable function that calculates a sort key from given pysam.AlignedSegment.

exception sctools.bam.SortError[source]

Bases: Exception

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class sctools.bam.SubsetAlignments(alignment_file: str, open_mode: Optional[str] = None)[source]

Bases: object

Wrapper for pysam/htslib that extracts reads corresponding to requested chromosome(s)

Parameters
  • alignment_file (str) – sam or bam file

  • open_mode ({'r', 'rb', None}, optional) – open mode for pysam.AlignmentFile. ‘r’ indicates a sam file, ‘rb’ indicates a bam file, and None attempts to autodetect based on the file suffix (Default = None)

indices_by_chromosome()[source]

returns indices to line numbers containing the requested number of reads for a specified chromosome

Notes

samtools is a good general-purpose tool that is capable of most subsampling tasks. It is a good idea to check the samtools documentation when approaching these types of tasks.

References

samtools documentation : http://www.htslib.org/doc/samtools.html

indices_by_chromosome(n_specific: int, chromosome: str, include_other: int = 0) Union[List[int], Tuple[List[int], List[int]]][source]

Return the list of first n_specific indices of reads aligned to chromosome.

Parameters
  • n_specific (int) – Number of aligned reads to return indices for

  • chromosome (str) – Only reads from this chromosome are considered valid

  • include_other (int, optional) – The number of reads to include that are NOT aligned to chromosome. These can be aligned or unaligned reads (default = 0).

Returns

  • chromosome_indices (List[int]) – list of indices to reads aligning to chromosome

  • other_indices (List[int], optional) – list of indices to reads NOT aligning to chromosome, only returned if include_other is not 0.
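The selection described above can be sketched without pysam. This is a minimal illustration, not the library's implementation: the function name first_indices_by_chromosome is hypothetical, and reads are represented only by their reference names (None for unaligned reads). Unlike the documented method, the sketch always returns both lists.

```python
from typing import Iterable, List, Optional, Tuple


def first_indices_by_chromosome(
    reference_names: Iterable[Optional[str]],
    n_specific: int,
    chromosome: str,
    include_other: int = 0,
) -> Tuple[List[int], List[int]]:
    """Collect indices of the first reads on and off `chromosome`."""
    chromosome_indices: List[int] = []
    other_indices: List[int] = []
    for i, name in enumerate(reference_names):
        if name == chromosome:
            if len(chromosome_indices) < n_specific:
                chromosome_indices.append(i)
        elif len(other_indices) < include_other:
            # Aligned elsewhere or unaligned (name is None).
            other_indices.append(i)
        # Stop scanning once both quotas are filled.
        if len(chromosome_indices) == n_specific and len(other_indices) == include_other:
            break
    return chromosome_indices, other_indices


# Hypothetical reference names, one per read, in file order.
names = ["chr1", "chr19", None, "chr19", "chr2", "chr19"]
on_target, off_target = first_indices_by_chromosome(names, 2, "chr19", include_other=1)
```

Stopping early matters for large files: once both quotas are met, the rest of the file does not need to be read.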

class sctools.bam.TagSortableRecord(tag_keys: Iterable[str], tag_values: Iterable[str], query_name: str, record: Optional[pysam.libcalignedsegment.AlignedSegment] = None)[source]

Bases: object

Wrapper for pysam.AlignedSegment that facilitates sorting by tags and query name.

classmethod from_aligned_segment(record: pysam.libcalignedsegment.AlignedSegment, tag_keys: Iterable[str]) sctools.bam.TagSortableRecord[source]

Create a TagSortableRecord from a pysam.AlignedSegment and list of tag keys

class sctools.bam.Tagger(bam_file: str)[source]

Bases: object

Add tags to a bam file from tag generators.

Parameters

bam_file (str) – Bam file that tags are to be added to.

tag()[source]

Tag bam records given tag_generators (often generated from paired bam or fastq files).

tag(output_bam_name: str, tag_generators) None[source]

Add tags to bam_file.

Given a bam file and tag generators derived from files sharing the same sort order, adds tags to the .bam file, and writes the resulting file to output_bam_name.

Parameters
  • output_bam_name (str) – Name of output tagged bam.

  • tag_generators (List[fastq.TagGenerator]) – list of generators that yield fastq.Tag objects

sctools.bam.get_barcode_for_alignment(alignment: pysam.libcalignedsegment.AlignedSegment, tags: List[str], raise_missing: bool) str[source]

Get the barcode for an Alignment

Parameters
  • alignment – pysam.AlignedSegment An Alignment from pysam.

  • tags – List[str] Tags in the bam that might contain barcodes. If multiple Tags are passed, will return the contents of the first tag that contains a barcode.

  • raise_missing – bool Raise an error if no barcodes can be found.

Returns

str A barcode for the alignment, or None if one is not found and raise_missing is False.
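The "first tag that contains a barcode" rule can be illustrated with a plain dictionary standing in for the alignment's tags. The helper name first_barcode and the choice of RuntimeError are assumptions made for this sketch; the tag names CB and CR are example values only.

```python
from typing import Dict, List, Optional


def first_barcode(
    record_tags: Dict[str, str], tags: List[str], raise_missing: bool
) -> Optional[str]:
    """Return the value of the first tag in `tags` present on the record."""
    for tag in tags:
        value = record_tags.get(tag)
        if value is not None:
            return value
    if raise_missing:
        # Assumption: the error type is illustrative only.
        raise RuntimeError(f"no barcode found in tags {tags}")
    return None


# A record with only a raw barcode falls back to the second tag.
raw_only = first_barcode({"CR": "ACGT"}, ["CB", "CR"], raise_missing=True)
# A record with both tags returns the first one listed.
corrected = first_barcode({"CB": "ACGA", "CR": "ACGT"}, ["CB", "CR"], raise_missing=True)
missing = first_barcode({}, ["CB", "CR"], raise_missing=False)
```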

sctools.bam.get_barcodes_from_bam(in_bam: str, tags: List[str], raise_missing: bool) Set[str][source]

Get all the distinct barcodes from a bam

Parameters
  • in_bam – str Input bam file.

  • tags – List[str] Tags in the bam that might contain barcodes.

  • raise_missing – bool Raise an error if no barcodes can be found.

Returns

set A set of barcodes found in the bam. This set will not contain a None value.

sctools.bam.get_tag_or_default(alignment: pysam.libcalignedsegment.AlignedSegment, tag_key: str, default: Optional[str] = None) Optional[str][source]

Extracts the value associated to tag_key from alignment, and returns a default value if the tag is not present.

sctools.bam.iter_cell_barcodes(bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment]) Generator[source]

Iterate over all the cells of a bam file sorted by cell.

Parameters

bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over

Yields
  • grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique cell barcode tag

  • current_tag (str) – the cell barcode that reads in the group all share

sctools.bam.iter_genes(bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment]) Generator[source]

Iterate over all the genes of a bam file sorted by gene.

Parameters

bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over

Yields
  • grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique gene name tag

  • current_tag (str) – the gene id that reads in the group all share

sctools.bam.iter_molecule_barcodes(bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment]) Generator[source]

Iterate over all the molecules of a bam file sorted by molecule.

Parameters

bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over

Yields
  • grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique molecule barcode tag

  • current_tag (str) – the molecule barcode that records in the group all share

sctools.bam.iter_tag_groups(tag: str, bam_iterator: Iterator[pysam.libcalignedsegment.AlignedSegment], filter_null: bool = False) Generator[source]

Iterates over reads and yields them grouped by the provided tag value

Parameters
  • tag (str) – BAM tag to group over

  • bam_iterator (Iterator[pysam.AlignedSegment]) – open bam file that can be iterated over

  • filter_null (bool, optional) – If False, all reads that lack the requested tag are yielded together. Else, all reads that lack the tag will be discarded (default = False).

Yields
  • grouped_by_tag (Iterator[pysam.AlignedSegment]) – reads sharing a unique value of tag

  • current_tag (str) – the tag that reads in the group all share
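Grouping consecutive reads by a tag value can be sketched with itertools.groupby. This is an illustration under stated assumptions, not the library's code: group_by_tag is a hypothetical name, reads are plain dictionaries instead of pysam.AlignedSegment objects, and the input is assumed to be already sorted by the tag.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Read = Dict[str, str]  # hypothetical stand-in for pysam.AlignedSegment


def group_by_tag(
    tag: str, reads: Iterable[Read], filter_null: bool = False
) -> Iterator[Tuple[List[Read], Optional[str]]]:
    """Yield (reads, tag value) for each run of reads sharing a tag value."""
    for value, group in groupby(reads, key=lambda read: read.get(tag)):
        if value is None and filter_null:
            continue  # discard reads that lack the tag
        yield list(group), value


reads = [
    {"CB": "AAAC", "name": "r1"},
    {"CB": "AAAC", "name": "r2"},
    {"name": "r3"},  # no CB tag
    {"CB": "GGTT", "name": "r4"},
]
groups = [(cb, [r["name"] for r in grp]) for grp, cb in group_by_tag("CB", reads)]
filtered = [cb for _, cb in group_by_tag("CB", reads, filter_null=True)]
```

Because groupby only merges adjacent items, unsorted input would produce several groups for the same tag value, which is why the iterators in this module expect a sorted bam file.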

sctools.bam.merge_bams(bams: List[str]) str[source]

Merge input bams using samtools.

This cannot be a local function within split because then Python “cannot pickle a local object”.

Parameters

bams (List[str]) – Name of the final bam, followed by the bams to merge. Because of how this function is called using multiprocessing, the bam basename is the first element of the list.

Returns

The output bam name.

sctools.bam.sort_by_tags_and_queryname(records: Iterable[pysam.libcalignedsegment.AlignedSegment], tag_keys: Iterable[str]) Iterable[pysam.libcalignedsegment.AlignedSegment][source]

Sorts the given bam records by the given tags, followed by query name. If no tags are given, just sorts by query name.
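The sort order can be pictured as a tuple key: the tag values in the order given, followed by the query name. The sketch below uses dictionaries in place of pysam records; the names tag_sort_key and sort_reads are hypothetical, and treating a missing tag as an empty string is an assumption.

```python
from typing import Dict, Iterable, List, Tuple

Read = Dict[str, str]  # hypothetical stand-in for pysam.AlignedSegment


def tag_sort_key(read: Read, tag_keys: List[str]) -> Tuple[str, ...]:
    """Build a key of tag values followed by the query name."""
    # Assumption: a missing tag sorts as the empty string.
    return tuple(read.get(key, "") for key in tag_keys) + (read["query_name"],)


def sort_reads(reads: Iterable[Read], tag_keys: Iterable[str]) -> List[Read]:
    keys = list(tag_keys)  # materialize once; tag_keys may be a one-shot iterable
    return sorted(reads, key=lambda read: tag_sort_key(read, keys))


reads = [
    {"query_name": "q2", "CB": "AAAC"},
    {"query_name": "q1", "CB": "GGTT"},
    {"query_name": "q1", "CB": "AAAC"},
]
by_cell = [(r["CB"], r["query_name"]) for r in sort_reads(reads, ["CB"])]
by_name = [r["query_name"] for r in sort_reads(reads, [])]
```

With an empty list of tags the key reduces to the query name alone, matching the behavior described above.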

sctools.bam.split(in_bams: List[str], out_prefix: str, tags: List[str], approx_mb_per_split: float = 1000, raise_missing: bool = True, num_processes: Optional[int] = None) List[str][source]

split in_bams by tag into files of approx_mb_per_split

Parameters
  • in_bams (List[str]) – Input bam files.

  • out_prefix (str) – Prefix for all output files; output will be named as prefix_n where n is an integer equal to the chunk number.

  • tags (List[str]) – The bam tags to split on. The tags are checked in order, and sorting is done based on the first identified tag. Further tags are only checked if the first tag is missing. This is useful in cases where sorting is executed over a corrected barcode, but some records only have a raw barcode.

  • approx_mb_per_split (float) – The target file size for each chunk in mb

  • raise_missing (bool, optional) – if True, raise a RuntimeError if a record is encountered without a tag. Else silently discard the record (default = True)

  • num_processes (int, optional) – The number of processes to parallelize over. If not set, will use all available processes.

Returns

output_filenames – list of filenames of bam chunks

Return type

List[str]

Raises
  • ValueError – when tags is empty

  • RuntimeError – when raise_missing is true and any passed read contains no tags

sctools.bam.verify_sort(records: Iterable[sctools.bam.TagSortableRecord], tag_keys: Iterable[str]) None[source]

Raise AssertionError if the given records are not correctly sorted by the given tags and query name

sctools.bam.write_barcodes_to_bins(in_bam: str, tags: List[str], barcodes_to_bins: Dict[str, int], raise_missing: bool) List[str][source]

Write barcodes to appropriate bins as defined by barcodes_to_bins

Parameters
  • in_bam – str The bam file to read.

  • tags – List[str] Tags in the bam that might contain barcodes.

  • barcodes_to_bins – Dict[str, int] A Dict from barcode to bin. All barcodes of the same type need to be written to the same bin. These numbered bins are merged after parallelization so that all alignments with the same barcode are in the same bam.

  • raise_missing – bool Raise an error if no barcodes can be found.

Returns

A list of paths to the written bins.

sctools.barcode module

Nucleotide Barcode Manipulation Tools

This module contains tools to characterize oligonucleotide barcodes and a simple Hamming-distance-based error-correction approach which corrects barcodes within a specified distance of a “whitelist” of expected barcodes.

Classes

Barcodes                                Class to characterize a set of barcodes
ErrorsToCorrectBarcodesMap              Class to carry out error correction routines

class sctools.barcode.Barcodes(barcodes: Mapping[str, int], barcode_length: int)[source]

Bases: object

Container for a set of nucleotide barcodes.

Contained barcodes are encoded in 2bit representation for fast operations. Instances of this class can optionally be constructed from an iterable where barcodes can be present multiple times. In these cases, barcodes are analyzed based on their observed frequencies.

Parameters
  • barcodes (Mapping[str, int]) – dictionary-like mapping barcodes to the number of times they were observed

  • barcode_length (int) – the length of all barcodes in the set. Different-length barcodes are not supported.

base_frequency(weighted=False) numpy.ndarray[source]

return the frequency of each base at each position in the barcode set

Notes

weighting is currently not supported, and must be set to False or base_frequency will raise NotImplementedError # todo fix

Parameters

weighted (bool, optional) – if True, each barcode is counted once for each time it was observed (default = False)

Returns

frequencies – barcode_length x 4 2d numpy array

Return type

np.array

Raises

NotImplementedError – if weighted is True

effective_diversity(weighted=False) numpy.ndarray[source]

Returns the effective base diversity of the barcode set by position.

maximum diversity for each position is 1, and represents a perfect split of 25% per base at a given position.

Parameters

weighted (bool, optional) – if True, each barcode is counted once for each time it was observed (default = False)

Returns

effective_diversity – 1-d array of size barcode_length containing floats in [0, 1]

Return type

np.array[float]
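One plausible way to obtain a per-position diversity in [0, 1] that reaches 1 at a perfect 25% split is Shannon entropy with logarithm base 4. The sketch below uses that definition as an assumption; this reference does not show the library's exact formula, and positional_diversity is a hypothetical name operating on plain strings.

```python
import math
from collections import Counter
from typing import List, Sequence


def positional_diversity(barcodes: Sequence[str]) -> List[float]:
    """Base-4 Shannon entropy of the bases observed at each position."""
    length = len(barcodes[0])
    diversity = []
    for position in range(length):
        counts = Counter(barcode[position] for barcode in barcodes)
        total = sum(counts.values())
        entropy = -sum(
            (n / total) * math.log(n / total, 4) for n in counts.values()
        )
        diversity.append(entropy)
    return diversity


# Position 0 is evenly split over A, C, G, T; positions 1-3 never vary.
diversity = positional_diversity(["ACGT", "CCGT", "GCGT", "TCGT"])
```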

classmethod from_iterable_bytes(iterable: Iterable[bytes], barcode_length: int)[source]

Construct a Barcodes instance from an iterable of bytes barcodes.

Parameters
  • iterable (Iterable[bytes]) – iterable of barcodes in bytes representation

  • barcode_length (int) – the length of the barcodes in iterable

Returns

barcodes – class object containing barcodes from a whitelist file

Return type

Barcodes

classmethod from_iterable_encoded(iterable: Iterable[int], barcode_length: int)[source]

Construct a Barcodes instance from an iterable of encoded barcodes.

Parameters
  • iterable (Iterable[int]) – iterable of barcodes encoded in TwoBit representation

  • barcode_length (int) – the length of the barcodes in iterable

Returns

barcodes – class object containing barcodes from a whitelist file

Return type

Barcodes

classmethod from_iterable_strings(iterable: Iterable[str], barcode_length: int)[source]

Construct a Barcodes instance from an iterable of string barcodes.

Parameters
  • iterable (Iterable[str]) – iterable of barcodes in string representation

  • barcode_length (int) – the length of the barcodes in iterable

Returns

barcodes – class object containing barcodes from a whitelist file

Return type

Barcodes

classmethod from_whitelist(file_: str, barcode_length: int)[source]

Creates a barcode set from a whitelist file.

Parameters
  • file_ (str) – location of the whitelist file. Should be formatted one barcode per line. Barcodes should be encoded in plain text (UTF-8, ASCII), not bit-encoded. Each barcode will be assigned a count of 1.

  • barcode_length (int) – Length of the barcodes in the file.

Returns

barcodes – class object containing barcodes from a whitelist file

Return type

Barcodes

summarize_hamming_distances() Mapping[str, float][source]

Returns descriptive statistics on hamming distances between pairs of barcodes.

Returns

descriptive_statistics – minimum, 25th percentile, median, 75th percentile, maximum, and average hamming distance between all pairs of barcodes

Return type

Mapping[str, float]

References

https://en.wikipedia.org/wiki/Hamming_distance
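A pairwise Hamming distance summary can be sketched on plain strings with the standard library. The function names and the dictionary keys below are illustrative assumptions, and the percentile entries are omitted to keep the sketch short.

```python
from itertools import combinations
from statistics import mean, median
from typing import Dict, Sequence


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def summarize(barcodes: Sequence[str]) -> Dict[str, float]:
    """Descriptive statistics over all unordered pairs of barcodes."""
    distances = [hamming(a, b) for a, b in combinations(barcodes, 2)]
    return {
        "minimum": min(distances),
        "median": median(distances),
        "maximum": max(distances),
        "average": mean(distances),
    }


summary = summarize(["AAAA", "AAAT", "TTTT"])
```

The number of pairs grows quadratically with the number of barcodes, so a summary like this is practical for whitelists of moderate size only.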

class sctools.barcode.ErrorsToCorrectBarcodesMap(errors_to_barcodes: Mapping[str, str])[source]

Bases: object

Correct any barcode that is within one hamming distance of a whitelisted barcode

Parameters

errors_to_barcodes (Mapping[str, str]) – dict-like mapping 1-base errors to the whitelist barcode that they could be generated from

get_corrected_barcode(barcode: str)[source]

Return a barcode if it is in the whitelist, or the corrected version if within edit distance 1

correct_bam(bam_file: str, output_bam_file: str)[source]

correct barcodes in a bam file, given a whitelist

References

https://en.wikipedia.org/wiki/Hamming_distance

correct_bam(bam_file: str, output_bam_file: str) None[source]

Correct barcodes in a (potentially unaligned) bamfile, given a whitelist.

Parameters
  • bam_file (str) – BAM format file in same order as the fastq files

  • output_bam_file (str) – BAM format file containing cell, umi, and sample tags.

get_corrected_barcode(barcode: str) str[source]

Return a barcode if it is in the whitelist, or the corrected version if within edit distance 1

Parameters

barcode (str) – the barcode to return the corrected version of. If the barcode is in the whitelist, the input barcode is returned unchanged.

Returns

corrected_barcode – corrected version of the barcode

Return type

str

Raises

KeyError – if the passed barcode is not within 1 hamming distance of any whitelist barcode

References

https://en.wikipedia.org/wiki/Hamming_distance

classmethod single_hamming_errors_from_whitelist(whitelist_file: str)[source]

Factory method to generate instance of class from a file containing “correct” barcodes.

Parameters

whitelist_file (str) – Text file containing one barcode per line.

Returns

errors_to_barcodes_map – instance of cls, built from whitelist

Return type

ErrorsToCorrectBarcodesMap
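The idea behind an errors-to-barcodes map can be sketched as follows: enumerate every single-base substitution of each whitelist barcode and record which barcode it came from. The names single_error_map and correct are hypothetical, the alphabet ACGTN is an assumption, and the whitelist is passed as an in-memory set instead of a file.

```python
from typing import Dict, Iterable, Set


def single_error_map(whitelist: Iterable[str], alphabet: str = "ACGTN") -> Dict[str, str]:
    """Map every 1-substitution variant to its whitelist barcode."""
    errors_to_barcodes: Dict[str, str] = {}
    for barcode in whitelist:
        for i, original in enumerate(barcode):
            for base in alphabet:
                if base != original:
                    error = barcode[:i] + base + barcode[i + 1:]
                    # If two whitelist barcodes are close, the last one wins.
                    errors_to_barcodes[error] = barcode
    return errors_to_barcodes


def correct(barcode: str, whitelist: Set[str], error_map: Dict[str, str]) -> str:
    """Return the barcode unchanged if whitelisted, else its correction."""
    if barcode in whitelist:
        return barcode
    return error_map[barcode]  # KeyError if more than one mismatch away


whitelist = {"ACGT", "TTTT"}
error_map = single_error_map(whitelist)
exact = correct("ACGT", whitelist, error_map)
fixed = correct("ACGA", whitelist, error_map)
fixed_n = correct("NTTT", whitelist, error_map)
```

Precomputing the map trades memory for speed: each lookup is a single dictionary access, at the cost of storing roughly barcode_length × 4 entries per whitelist barcode.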

sctools.encodings module

Compressed Barcode Encoding Methods

This module defines several classes to encode DNA sequences in memory-efficient forms, using 2 bits to encode bases of a 4-letter DNA alphabet (ACGT) or 3 bits to encode a 5-letter DNA alphabet that includes the ambiguous call often included by Illumina base calling software (ACGTN). The classes also contain several methods useful for efficient querying and manipulation of the encoded sequence.

Classes

Encoding                                Encoder base class
ThreeBit                                Three bit DNA encoder / decoder
TwoBit                                  Two bit DNA encoder / decoder

class sctools.encodings.Encoding[source]

Bases: object

encoding_map

Class that mimics a Mapping[bytes, str] where bytes must be a single byte encoded character (encoder)

Type

TwoBitEncodingMap

decoding_map

Dictionary that maps integers to bytes human-readable representations (decoder)

Type

Mapping[int, bytes]

bits_per_base

number of bits used to encode each base

Type

int

encode(bytes_encoded: bytes)[source]

encode a DNA string in a compressed representation

decode(integer_encoded: int)[source]

decode a compressed DNA string into a human readable bytes format

gc_content(integer_encoded: int)[source]

calculate the GC content of an encoded DNA string

hamming_distance(a: int, b: int)[source]

calculate the hamming distance between two encoded DNA strings

bits_per_base: int = NotImplemented
decode(integer_encoded: int) bytes[source]

Decode a DNA bytes string.

Parameters

integer_encoded (int) – Integer encoded DNA string

Returns

decoded – Bytes decoded DNA sequence

Return type

bytes

decoding_map: Mapping[int, AnyStr] = NotImplemented
classmethod encode(bytes_encoded: bytes) int[source]

Encode a DNA bytes string.

Parameters

bytes_encoded (bytes) – bytes DNA string

Returns

encoded – Encoded DNA sequence

Return type

int

encoding_map: Mapping[AnyStr, int] = NotImplemented
gc_content(integer_encoded: int) int[source]

Return the number of G or C nucleotides in integer_encoded

Parameters

integer_encoded (int) – Integer encoded DNA string

Returns

number of bases in integer_encoded input that are G or C.

Return type

gc_content, int

static hamming_distance(a, b) int[source]

Calculate the hamming distance between two DNA sequences

The hamming distance counts the number of bases that are not the same nucleotide

Parameters
  • a (int) – integer encoded

  • b (int) – integer encoded

Returns

d – hamming distance between a and b

Return type

int

class sctools.encodings.ThreeBit(*args, **kwargs)[source]

Bases: sctools.encodings.Encoding

Encode a DNA sequence using a 3-bit encoding.

Since no bases are encoded as 0, an empty triplet is interpreted as the end of the encoded string; Three-bit encoding can be used to encode and decode strings without knowledge of their length.

encoding_map

Class that mimics a Mapping[bytes, str] where bytes must be a single byte encoded character (encoder)

Type

TwoBitEncodingMap

decoding_map

Dictionary that maps integers to bytes human-readable representations (decoder)

Type

Mapping[int, bytes]

bits_per_base

number of bits used to encode each base

Type

int

encode(bytes_encoded: bytes)[source]

encode a DNA string in a compressed representation

decode(integer_encoded: int)[source]

decode a compressed DNA string into a human readable bytes format

gc_content(integer_encoded: int)[source]

calculate the GC content of an encoded DNA string

hamming_distance(a: int, b: int)[source]

calculate the hamming distance between two encoded DNA strings

class ThreeBitEncodingMap[source]

Bases: object

Dict-like class that maps bytes to 3-bit integer representations

All IUPAC ambiguous codes are treated as “N”

map_ = {65: 2, 67: 1, 71: 3, 78: 6, 84: 4, 97: 2, 99: 1, 103: 3, 110: 6, 116: 4}
bits_per_base: int = 3
classmethod decode(integer_encoded: int) bytes[source]

Decode a DNA bytes string.

Parameters

integer_encoded (int) – Integer encoded DNA string

Returns

decoded – Bytes decoded DNA sequence

Return type

bytes

decoding_map: Mapping[int, bytes] = {1: b'C', 2: b'A', 3: b'G', 4: b'T', 6: b'N'}
classmethod encode(bytes_encoded: bytes) int[source]

Encode a DNA bytes string.

Parameters

bytes_encoded (bytes) – bytes DNA string

Returns

encoded – Encoded DNA sequence

Return type

int

encoding_map: sctools.encodings.ThreeBit.ThreeBitEncodingMap = <sctools.encodings.ThreeBit.ThreeBitEncodingMap object>
classmethod gc_content(integer_encoded: int) int[source]

Return the number of G or C nucleotides in integer_encoded

Parameters

integer_encoded (int) – Integer encoded DNA string

Returns

number of bases in integer_encoded input that are G or C.

Return type

gc_content, int

static hamming_distance(a: int, b: int) int[source]

Calculate the hamming distance between two DNA sequences

The hamming distance counts the number of bases that are not the same nucleotide

Parameters
  • a (int) – integer encoded

  • b (int) – integer encoded

Returns

d – hamming distance between a and b

Return type

int
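The property that a three-bit encoding can be decoded without knowing the sequence length can be illustrated with the maps listed above. This is a sketch, not the library's code: the function names encode3 and decode3 are hypothetical, and packing the first base into the most significant bits is an assumption about bit order. Any byte not in the map is treated as N here, which is broader than IUPAC codes alone.

```python
# Values copied from the map_ and decoding_map entries documented above.
MAP3 = {65: 2, 67: 1, 71: 3, 78: 6, 84: 4, 97: 2, 99: 1, 103: 3, 110: 6, 116: 4}
DECODING3 = {1: b"C", 2: b"A", 3: b"G", 4: b"T", 6: b"N"}


def encode3(sequence: bytes) -> int:
    """Pack each base into 3 bits, first base in the highest bits."""
    encoded = 0
    for char in sequence:  # iterating bytes yields integer ordinals
        encoded = (encoded << 3) | MAP3.get(char, 6)  # unknown -> N
    return encoded


def decode3(encoded: int) -> bytes:
    """Unpack 3 bits at a time until nothing but zero bits remain."""
    parts = []
    while encoded:  # no base encodes to 0, so 0 marks the end
        parts.append(DECODING3[encoded & 0b111])
        encoded >>= 3
    return b"".join(reversed(parts))


round_trip = decode3(encode3(b"aacgtn"))
ambiguous = decode3(encode3(b"R"))
```

Leading A bases survive the round trip because A is stored as 2, not 0.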

class sctools.encodings.TwoBit(sequence_length: int)[source]

Bases: sctools.encodings.Encoding

Encode a DNA sequence using a 2-bit encoding.

Two-bit encoding uses 0 for an encoded nucleotide. As such, it cannot distinguish between the end of sequence and trailing A nucleotides, and thus decoding these strings requires knowledge of their length. Therefore, it is only appropriate for encoding fixed sequence lengths.

In addition, in order to encode in 2-bit, N-nucleotides must be randomized to one of A, C, G, and T.

Parameters

sequence_length (int) – number of nucleotides that are being encoded

encoding_map

Class that mimics a Mapping[bytes, str] where bytes must be a single byte encoded character (encoder)

Type

TwoBitEncodingMap

decoding_map

Dictionary that maps integers to bytes human-readable representations (decoder)

Type

Mapping[int, bytes]

bits_per_base

number of bits used to encode each base

Type

int

encode(bytes_encoded: bytes)[source]

encode a DNA string in a compressed representation

decode(integer_encoded: int)[source]

decode a compressed DNA string into a human readable bytes format

gc_content(integer_encoded: int)[source]

calculate the GC content of an encoded DNA string

hamming_distance(a: int, b: int)[source]

calculate the hamming distance between two encoded DNA strings

class TwoBitEncodingMap[source]

Bases: object

Dict-like class that maps bytes to 2-bit integer representations

Generates random nucleotides for ambiguous nucleotides e.g. N

iupac_ambiguous: Set[int] = {66, 68, 72, 75, 77, 78, 82, 83, 86, 87, 89, 98, 100, 104, 107, 109, 110, 114, 115, 118, 119, 121}
map_ = {65: 0, 67: 1, 71: 3, 84: 2, 97: 0, 99: 1, 103: 3, 116: 2}
bits_per_base: int = 2
decode(integer_encoded: int) bytes[source]

Decode a DNA bytes string.

Parameters

integer_encoded (int) – Integer encoded DNA string

Returns

decoded – Bytes decoded DNA sequence

Return type

bytes

decoding_map: Mapping[int, bytes] = {0: b'A', 1: b'C', 2: b'T', 3: b'G'}
classmethod encode(bytes_encoded: bytes) int[source]

Encode a DNA bytes string.

Parameters

bytes_encoded (bytes) – bytes DNA string

Returns

encoded – Encoded DNA sequence

Return type

int

encoding_map: sctools.encodings.TwoBit.TwoBitEncodingMap = <sctools.encodings.TwoBit.TwoBitEncodingMap object>
gc_content(integer_encoded: int) int[source]

Return the number of G or C nucleotides in integer_encoded

Parameters

integer_encoded (int) – Integer encoded DNA string

Returns

number of bases in integer_encoded input that are G or C.

Return type

gc_content, int

static hamming_distance(a: int, b: int) int[source]

Calculate the hamming distance between two DNA sequences

The hamming distance counts the number of bases that are not the same nucleotide

Parameters
  • a (int) – integer encoded

  • b (int) – integer encoded

Returns

d – hamming distance between a and b

Return type

int
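A two-bit scheme using the maps listed above can be sketched as follows. The function names are hypothetical and the bit order (first base in the highest bits) is an assumption. Ambiguous bases are not handled here; the sketch raises KeyError where the documented class would substitute a random nucleotide.

```python
# Values copied from the map_ and decoding_map entries documented above.
MAP2 = {65: 0, 67: 1, 71: 3, 84: 2, 97: 0, 99: 1, 103: 3, 116: 2}
DECODING2 = {0: b"A", 1: b"C", 2: b"T", 3: b"G"}


def encode2(sequence: bytes) -> int:
    encoded = 0
    for char in sequence:
        encoded = (encoded << 2) | MAP2[char]  # KeyError on N and other codes
    return encoded


def decode2(encoded: int, sequence_length: int) -> bytes:
    """Decoding needs the length, because A is stored as 0."""
    parts = []
    for _ in range(sequence_length):
        parts.append(DECODING2[encoded & 0b11])
        encoded >>= 2
    return b"".join(reversed(parts))


def hamming_distance2(a: int, b: int) -> int:
    """Count 2-bit slots that differ between two encoded sequences."""
    difference = a ^ b
    distance = 0
    while difference:
        if difference & 0b11:
            distance += 1
        difference >>= 2
    return distance


def gc_count2(encoded: int, sequence_length: int) -> int:
    # C (01) and G (11) are the only codes with the low bit set.
    return sum((encoded >> (2 * i)) & 1 for i in range(sequence_length))


encoded = encode2(b"AACG")
full = decode2(encoded, 4)
truncated = decode2(encoded, 2)  # leading A bases are lost without the length
distance = hamming_distance2(encode2(b"ACGT"), encode2(b"ACTT"))
gc = gc_count2(encode2(b"ACGT"), 4)
```

The XOR in hamming_distance2 leaves a nonzero slot exactly where the two bases differ, so the comparison never needs to decode either sequence.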

sctools.fastq module

Efficient Fastq Iterators and Representations

This module implements classes for representing fastq records, reading and writing them, and extracting parts of fastq sequence for transformation into bam format tags.

extract_barcode(record, embedded_barcode)

extract a barcode, defined by embedded_barcode, from record

Classes
-------
Record                                      Represents fastq records (input as bytes)
StrRecord                                   Represents fastq records (input as str)
Reader                                      Opens and iterates over fastq files
EmbeddedBarcodeGenerator                    Generates barcodes from a fastq file
BarcodeGeneratorWithCorrectedCellBarcodes   Generates (corrected) barcodes from a fastq file

References

https://en.wikipedia.org/wiki/FASTQ_format

class sctools.fastq.BarcodeGeneratorWithCorrectedCellBarcodes(fastq_files: Union[str, Iterable[str]], embedded_cell_barcode: sctools.fastq.Tag, whitelist: str, other_embedded_barcodes: Iterable[sctools.fastq.Tag] = (), *args, **kwargs)[source]

Bases: sctools.fastq.Reader

Generate barcodes from FASTQ file(s) from positions defined by EmbeddedBarcode(s)

Extracted barcode objects are produced in a form that is consumable by pysam’s bam and sam set_tag methods. In this class, one EmbeddedBarcode must be defined as an embedded_cell_barcode, which is checked against a whitelist and error corrected during generation.

Parameters
  • fastq_files (str | List, optional) – FASTQ file or files to be read. (default = sys.stdin)

  • mode ({'r', 'rb'}, optional) – open mode for fastq files. If ‘r’, return string. If ‘rb’, return bytes (default = ‘r’)

  • whitelist (str) – whitelist file containing “correct” cell barcodes for an experiment

  • embedded_cell_barcode (EmbeddedBarcode) – EmbeddedBarcode containing information about the position and names of cell barcode tags

  • other_embedded_barcodes (Iterable[EmbeddedBarcode], optional) – tag objects defining start and end of the sequence containing the tag, and the tag identifiers for sequence and quality tags (default = None)

extract_cell_barcode(record: Record, cb: str)[source]
extract_cell_barcode(record: Tuple[str], cb: sctools.fastq.Tag)[source]

Extract a cell barcode from a fastq record

Parameters
  • record (Tuple[str]) – fastq record comprised of four strings: name, sequence, name2, and quality

  • cb (EmbeddedBarcode) – defines the position and tag identifier for a cell barcode

Returns

  • sequence_tag (Tuple[str, str, ‘Z’]) – raw sequence tag identifier, sequence, SAM tag type (‘Z’ implies a string tag)

  • quality_tag (Tuple[str, str, ‘Z’]) – quality tag identifier, quality, SAM tag type (‘Z’ implies a string tag)

  • corrected_tag (Optional[Tuple[str, str, ‘Z’]]) – Whitelist verified sequence tag. Only present if the raw sequence tag is in the whitelist or within 1 hamming distance of one of its barcodes

property filenames: List[str]
select_record_indices(indices: Set) Generator

Iterate over provided indices only, skipping other records.

Parameters

indices (Set[int]) – indices to include in the output

Yields

record, str – records from file corresponding to indices

property size: int

return the collective size of all files being read in bytes

sctools.fastq.EmbeddedBarcode

alias of sctools.fastq.Tag

class sctools.fastq.EmbeddedBarcodeGenerator(fastq_files, embedded_barcodes, *args, **kwargs)[source]

Bases: sctools.fastq.Reader

Generate barcodes from a FASTQ file(s) from positions defined by EmbeddedBarcode(s)

Extracted barcode objects are produced in a form that is consumable by pysam’s bam and sam set_tag methods.

Parameters
  • embedded_barcodes (Iterable[EmbeddedBarcode]) – tag objects defining start and end of the sequence containing the tag, and the tag identifiers for sequence and quality tags

  • fastq_files (str | List, optional) – FASTQ file or files to be read. (default = sys.stdin)

  • mode ({'r', 'rb'}, optional) – open mode for FASTQ files. If ‘r’, return string. If ‘rb’, return bytes (default = ‘r’)

property filenames: List[str]
select_record_indices(indices: Set) Generator

Iterate over provided indices only, skipping other records.

Parameters

indices (Set[int]) – indices to include in the output

Yields

record, str – records from file corresponding to indices

property size: int

return the collective size of all files being read in bytes

class sctools.fastq.Reader(files='-', mode='r', header_comment_char=None)[source]

Bases: sctools.reader.Reader

Fastq Reader that defines some special methods for reading and summarizing FASTQ data.

Simple reader class that exposes an __iter__ and __len__ method

References

https://en.wikipedia.org/wiki/FASTQ_format

property filenames: List[str]
select_record_indices(indices: Set) Generator[source]

Iterate over provided indices only, skipping other records.

Parameters

indices (Set[int]) – indices to include in the output

Yields

record, str – records from file corresponding to indices
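Selecting records by index amounts to enumerating the stream and keeping only the positions in the set. The sketch below shows that idea on a plain list; select_indices is a hypothetical name and the records are placeholder strings, not parsed FASTQ entries.

```python
from typing import Iterable, Iterator, Set, TypeVar

T = TypeVar("T")


def select_indices(records: Iterable[T], indices: Set[int]) -> Iterator[T]:
    """Yield only the records whose position is in `indices`."""
    for i, record in enumerate(records):
        if i in indices:  # set membership is O(1) per record
            yield record


records = ["rec0", "rec1", "rec2", "rec3"]
picked = list(select_indices(records, {3, 1}))
```

Records come out in file order regardless of the order in which the indices were supplied, since the input is read once from start to end.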

property size: int

return the collective size of all files being read in bytes

class sctools.fastq.Record(record: Iterable[AnyStr])[source]

Bases: object

Fastq Record.

Parameters

record (Iterable[bytes]) – Iterable of 4 bytes strings that comprise a fastq record

name

fastq record name

Type

bytes

sequence

fastq nucleotide sequence

Type

bytes

name2

second fastq record name field (rarely used)

Type

bytes

quality

base call quality for each nucleotide in sequence

Type

bytes

average_quality()[source]

The average quality of the fastq record

average_quality() float[source]

return the average quality of this record
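An average base quality can be computed from a quality string by converting each character to its Phred score. The sketch assumes the common Phred+33 (Sanger) offset, which this reference does not state, and the function name average_phred is hypothetical.

```python
def average_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 by default)."""
    if not quality:
        raise ValueError("quality string is empty")
    return sum(ord(char) - offset for char in quality) / len(quality)


high = average_phred("II")   # 'I' is ASCII 73, so each base scores 40
mixed = average_phred("!I")  # '!' is ASCII 33, so it scores 0
```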

property name: AnyStr
property name2: AnyStr
property quality: AnyStr
property sequence: AnyStr
class sctools.fastq.StrRecord(record: Iterable[AnyStr])[source]

Bases: sctools.fastq.Record

Fastq Record.

Parameters

record (Iterable[str]) – Iterable of 4 strings that comprise a FASTQ record

name

FASTQ record name

Type

str

sequence

FASTQ nucleotide sequence

Type

str

name2

second FASTQ record name field (rarely used)

Type

str

quality

base call quality for each nucleotide in sequence

Type

str

average_quality()[source]

The average quality of the FASTQ record

average_quality() float[source]

return the average quality of this record

property name: str
property name2: AnyStr
property quality: AnyStr
property sequence: AnyStr
sctools.fastq.extract_barcode(record, embedded_barcode) Tuple[Tuple[str, str, str], Tuple[str, str, str]][source]

Extracts barcodes from a FASTQ record at positions defined by an EmbeddedBarcode object.

Parameters
  • record (FastqRecord) – Record to extract from

  • embedded_barcode (EmbeddedBarcode) – Defines the barcode start and end positions and the tag name for the sequence and quality tags

Returns

  • sequence_tag (Tuple[str, str, ‘Z’]) – sequence tag identifier, sequence, SAM tag type (‘Z’ implies a string tag)

  • quality_tag (Tuple[str, str, ‘Z’]) – quality tag identifier, quality, SAM tag type (‘Z’ implies a string tag)
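A minimal sketch of the extraction described above, assuming the barcode object carries start, end, sequence_tag and quality_tag fields and that the record exposes sequence and quality strings. The namedtuples and the function name extract_barcode_sketch are stand-ins defined here for illustration, not the sctools classes.

```python
from collections import namedtuple

# Stand-in types (assumptions): field names mirror those mentioned in this documentation.
EmbeddedBarcode = namedtuple(
    "EmbeddedBarcode", ["start", "end", "sequence_tag", "quality_tag"]
)
FastqRecord = namedtuple("FastqRecord", ["name", "sequence", "name2", "quality"])


def extract_barcode_sketch(record, embedded_barcode):
    """Slice sequence and quality at the barcode position and build two SAM-style tags."""
    window = slice(embedded_barcode.start, embedded_barcode.end)
    sequence_tag = (embedded_barcode.sequence_tag, record.sequence[window], "Z")
    quality_tag = (embedded_barcode.quality_tag, record.quality[window], "Z")
    return sequence_tag, quality_tag


record = FastqRecord(
    name="@read0",
    sequence="ACGT" * 4 + "T" * 10,
    name2="+",
    quality="I" * 16 + "#" * 10,
)
cell_barcode = EmbeddedBarcode(start=0, end=16, sequence_tag="CR", quality_tag="CY")
sequence_tag, quality_tag = extract_barcode_sketch(record, cell_barcode)
```

Each returned tuple has the (tag, value, type) shape listed under Returns, with ‘Z’ marking a string tag.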

sctools.gtf module

GTF Records and Iterators

This module defines a GTF record class and a Reader class to iterate over GTF-format files

Classes

Record          Data class that exposes GTF record fields by name
Reader          GTF file reader that yields GTF Records

References

https://useast.ensembl.org/info/website/upload/gff.html

class sctools.gtf.GTFRecord(record: str)[source]

Bases: object

Data class for storing and interacting with GTF records

Subclassed to produce exon, transcript, and gene-specific record types. A GTF record has 8 fixed fields followed by optional fields separated by ‘;’, which this class stores in the attributes field and makes accessible through get_attribute. Fixed fields are accessible by name.

Parameters

record (str) – an unparsed GTF record

seqname

The name of the sequence (often chromosome) this record is found on.

Type

str

chromosome

Synonym for seqname.

Type

str

source

The group responsible for generating this annotation.

Type

str

feature

The type of record (e.g. gene, exon, …).

Type

str

start

The start position of this feature relative to the beginning of seqname.

Type

int

end

The end position of this feature relative to the beginning of seqname.

Type

int

score

The annotation score. Rarely used.

Type

str

strand

The strand of seqname that this annotation is found on

Type

{‘+’, ‘-’}

frame

‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, and so on

Type

{‘0’, ‘1’, ‘2’}

size

the number of nucleotides spanned by this feature

Type

int

get_attribute(key: str)[source]

attempt to retrieve a variable field with name equal to key

set_attribute(key: str, value: str)[source]

set variable field key equal to value. Overwrites key if already present.
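The record layout described above (8 fixed tab-separated fields followed by ‘;’-separated attributes) can be sketched as follows. This is a simplified standalone parser written for illustration, not the GTFRecord implementation; parse_gtf_line is a hypothetical name and the input line is toy data.

```python
from typing import Dict, Tuple

FIXED_FIELDS = [
    "seqname", "source", "feature", "start", "end", "score", "strand", "frame",
]


def parse_gtf_line(line: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split one GTF line into its fixed fields and its attribute key/value pairs."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        raise ValueError("expected 8 fixed fields plus an attribute column")
    fixed = dict(zip(FIXED_FIELDS, columns[:8]))
    attributes = {}
    # Simplification: assumes no ';' occurs inside a quoted attribute value.
    for item in columns[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return fixed, attributes


line = 'chr1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "G1"; gene_name "GENEA";\n'
fixed, attributes = parse_gtf_line(line)
```

Looking up a missing key in the attributes dictionary raises KeyError, which matches the behaviour documented for get_attribute.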

property chromosome: str
property end: int
property feature: str
property frame: str
get_attribute(key) str[source]

access an item from the attribute field of a GTF file.

Parameters

key (str) – Item to retrieve

Returns

value – Contents of variable attribute key

Return type

str

Raises

KeyError – if there is no variable attribute key associated with this record

property score: str
property seqname: str
set_attribute(key, value) None[source]

Set variable attribute key equal to value

If attribute key is already set for this record, its contents are overwritten by value

Parameters
  • key (str) – attribute name

  • value (str) – attribute content

property size: int
property source: str
property start: int
property strand: str
class sctools.gtf.Reader(files='-', mode='r', header_comment_char='#')[source]

Bases: sctools.reader.Reader

GTF file iterator

Parameters
  • files (Union[str, List], optional) – File(s) to read. If ‘-’, read sys.stdin (default = ‘-’)

  • mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).

  • header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)

filter(retain_types: Iterable[str])[source]

Iterate over a GTF file, only yielding records in retain_types.

__iter__()[source]

iterate over GTF records in file, yielding Record objects

property filenames: List[str]
filter(retain_types: Iterable[str]) Generator[source]

Iterate over a GTF file, yielding only records whose feature type is in retain_types.

Features are stored in GTF field 2 (zero-based; the third column).

Parameters

retain_types (Iterable[str]) – Record feature types to retain.

Yields

gtf_record (Record) – gtf Record object
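The filtering behaviour can be sketched on raw lines without the Reader class. The example assumes tab-separated GTF lines with the feature type in the third column; filter_gtf_lines is a hypothetical helper and the lines are toy data.

```python
from typing import Iterable, Iterator


def filter_gtf_lines(lines: Iterable[str], retain_types: Iterable[str]) -> Iterator[str]:
    """Yield GTF lines whose feature column (index 2) is in retain_types."""
    retain = set(retain_types)
    for line in lines:
        if line.startswith("#"):  # skip header comment lines
            continue
        columns = line.split("\t")
        if len(columns) > 2 and columns[2] in retain:
            yield line


gtf_lines = [
    "#!genome-build toy\n",
    'chr1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "G1";\n',
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n',
    'chr1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1";\n',
]
kept = list(filter_gtf_lines(gtf_lines, ["gene", "exon"]))
```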

select_record_indices(indices: Set) Generator[source]

Iterate over provided indices only, skipping other records.

Parameters

indices (Set[int]) – indices to include in the output

Yields

record, str – records from file corresponding to indices

property size: int

return the collective size of all files being read in bytes

sctools.gtf.extract_extended_gene_names(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') Dict[str, List[tuple]][source]

Extracts extended gene names from GTF file(s) and returns a map from gene names to their corresponding occurrence locations in the given file(s).

Parameters
  • files (Union[str, List], optional) – File(s) to read. If ‘-’, read sys.stdin (default = ‘-’)

  • mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).

  • header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)

Returns

A dictionary of chromosome names mapping to a List of tuples, each containing a range as the first element and a gene name as the second. Dict[str, List(Tuple((start,end), gene))]

Return type

Dict[str, List[tuple]]
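The shape of the returned dictionary can be illustrated with a small sketch. It assumes that only records whose feature is ‘gene’ contribute and that entries keep their order of appearance; both are assumptions, and gene_locations_by_chromosome is a hypothetical helper operating on pre-parsed toy rows rather than on GTF files.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (chromosome, feature, start, end, gene_name): a pre-parsed stand-in for GTF records.
GeneRow = Tuple[str, str, int, int, str]


def gene_locations_by_chromosome(
    rows: Iterable[GeneRow],
) -> Dict[str, List[Tuple[Tuple[int, int], str]]]:
    """Map each chromosome to a list of ((start, end), gene_name) tuples."""
    locations = defaultdict(list)
    for chromosome, feature, start, end, gene_name in rows:
        if feature != "gene":  # assumption: only gene records are used
            continue
        locations[chromosome].append(((start, end), gene_name))
    return dict(locations)


rows = [
    ("chr1", "gene", 100, 500, "GENEA"),
    ("chr1", "exon", 100, 200, "GENEA"),
    ("chr2", "gene", 50, 90, "GENEB"),
    ("chr1", "gene", 700, 900, "GENEC"),
]
gene_locations = gene_locations_by_chromosome(rows)
```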

sctools.gtf.extract_gene_exons(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') Dict[str, List[tuple]][source]

Extracts gene exons from GTF file(s) and returns a map from gene names to the list of their exons, in ascending order of start position, across the given file(s).

Parameters
  • files (Union[str, List], optional) – File(s) to read. If ‘-’, read sys.stdin (default = ‘-’)

  • mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).

  • header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)

Returns

A dictionary of chromosome names mapping to a List of tuples, each containing the exons in ascending order of their start positions. Dict[str, List(Tuple((start,end), gene))]

Return type

Dict[str, List[tuple]]

sctools.gtf.extract_gene_names(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') Dict[str, int][source]

Extracts gene names from GTF file(s) and returns a map from gene names to their corresponding occurrence orders in the given file(s).

Parameters
  • files (Union[str, List], optional) – File(s) to read. If ‘-’, read sys.stdin (default = ‘-’)

  • mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).

  • header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)

Returns

A map from gene names to their linear index

Return type

Dict[str, int]
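A map from gene names to a linear index can be built as sketched below. The handling of repeated names (keep the first index, ignore later occurrences) is an assumption made for this example, and index_gene_names is a hypothetical helper that takes names directly instead of reading GTF files.

```python
from typing import Dict, Iterable


def index_gene_names(gene_names: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct gene name the order in which it first appears."""
    gene_index: Dict[str, int] = {}
    for name in gene_names:
        if name not in gene_index:  # assumption: later duplicates are ignored
            gene_index[name] = len(gene_index)
    return gene_index


gene_name_to_index = index_gene_names(["GENEA", "GENEB", "GENEA", "GENEC"])
```

An index of this kind is typically used to place each gene in a fixed column of a count matrix.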

sctools.gtf.get_mitochondrial_gene_names(files: Union[str, List[str]] = '-', mode: str = 'r', header_comment_char: str = '#') Set[str][source]
Extracts mitochondrial gene names from GTF file(s) and returns the set of mitochondrial gene ids that occur in the given file(s).

Parameters
  • files (Union[str, List], optional) – File(s) to read. If ‘-’, read sys.stdin (default = ‘-’)

  • mode ({'r', 'rb'}, optional) – Open mode. If ‘r’, read strings. If ‘rb’, read bytes (default = ‘r’).

  • header_comment_char (str, optional) – lines beginning with this character are skipped (default = ‘#’)

Returns

A set of the mitochondrial gene ids

Return type

Set[str]
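The documentation does not say how mitochondrial genes are recognised. The sketch below assumes a common convention, a gene name starting with "mt-" in any letter case, purely for illustration; the pattern, the helper name mitochondrial_gene_ids and the toy pairs are all assumptions.

```python
import re
from typing import Iterable, Set, Tuple


def mitochondrial_gene_ids(
    genes: Iterable[Tuple[str, str]], pattern: str = r"^mt-"
) -> Set[str]:
    """Collect gene ids whose gene name matches pattern (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return {gene_id for gene_id, gene_name in genes if regex.match(gene_name)}


# (gene_id, gene_name) pairs; toy values, not taken from a real annotation.
genes = [("G1", "MT-TOY1"), ("G2", "GENEA"), ("G3", "mt-Toy2")]
mito_ids = mitochondrial_gene_ids(genes)
```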

sctools.platform module

Command Line Interface for SC Tools:

This module defines the command line interface for SC Tools. Tools are separated into those that are specific to particular chemistries (e.g. Smart-seq 2) or experimental platforms (e.g. 10x Genomics v2) and those that are general across any sequencing experiment.

Currently, only general modules and those used for 10x v2 are implemented

Classes

GenericPlatform Class containing all general command line utilities
TenXV2          Class containing 10x v2 specific command line utilities

class sctools.platform.BarcodePlatform[source]

Bases: sctools.platform.GenericPlatform

Command Line Interface for extracting and attaching barcodes at specified positions, generalizing the TenXV2 attach_barcodes entrypoint.

Sample, cell and/or molecule barcodes can be extracted and attached to an unmapped bam when the corresponding barcode’s start position and length are provided. The sample barcode is extracted from the index i7 fastq file and the cell and molecule barcodes are extracted from the r1 fastq file.

This class defines several methods that are created as CLI tools when sctools is installed (see setup.py)

cell_barcode

A data class that defines the start and end position of the cell barcode and the tags to assign the sequence and quality of the cell barcode

Type

fastq.EmbeddedBarcode

molecule_barcode

A data class that defines the start and end position of the molecule barcode and the tags to assign the sequence and quality of the molecule barcode

Type

fastq.EmbeddedBarcode

sample_barcode

A data class that defines the start and end position of the sample barcode and the tags to assign the sequence and quality of the sample barcode

Type

fastq.EmbeddedBarcode

attach_barcodes()[source]

Attach barcodes from the forward (r1) and optionally index (i1) fastq files to the reverse (r2) bam file

classmethod attach_barcodes(args=None)[source]

Command line entrypoint for attaching barcodes to a bamfile.

Parameters

args (Iterable[str], optional) – arguments list. The default value of None, when passed to parser.parse_args, causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0
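The args=None convention used by these entrypoints relies on standard argparse behaviour: parse_args(None) reads sys.argv, while an explicit list lets tests call the entrypoint directly. The class, method and flag names in this sketch are hypothetical and are not the sctools command line options.

```python
import argparse
from typing import Iterable, Optional


class ExamplePlatform:
    """Hypothetical stand-in showing the classmethod entrypoint pattern."""

    @classmethod
    def describe_bam(cls, args: Optional[Iterable[str]] = None) -> int:
        parser = argparse.ArgumentParser(description="toy entrypoint")
        parser.add_argument("-b", "--bamfile", required=True, help="input bam file")
        parser.add_argument("-p", "--output-prefix", default="chunk")
        # With args=None argparse falls back to sys.argv[1:].
        parsed = parser.parse_args(args)
        print(f"would process {parsed.bamfile} with prefix {parsed.output_prefix}")
        return 0  # return call signalling successful completion


exit_code = ExamplePlatform.describe_bam(["--bamfile", "input.bam"])
```

Passing the argument list explicitly, as in the last line, is what makes such entrypoints testable without touching the process arguments.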

classmethod bam_to_count_matrix(args: Optional[Iterable[str]] = None) int

Command line entrypoint for constructing a count matrix from a tagged bam file.

Constructs a count matrix from an aligned bam file sorted by cell barcode, molecule barcode, and gene id.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod calculate_cell_metrics(args: Optional[Iterable[str]] = None) int

Command line entrypoint for calculating cell metrics from a sorted bamfile.

Writes metrics to .csv

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod calculate_gene_metrics(args: Optional[Iterable[str]] = None) int

Command line entrypoint for calculating gene metrics from a sorted bamfile.

Writes metrics to .csv

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

cell_barcode = None
classmethod get_tags(raw_tags: Optional[Sequence[str]]) Iterable[str]
classmethod group_qc_outputs(args: Optional[Iterable[str]] = None) int

Command line entrypoint for parsing Picard metrics files and HISAT2 and RSEM statistics log files.

Parameters

args (Iterable[str], optional) – arguments list containing file_names (array of files), output_name (prefix of output file name), and metrics_type (Picard, PicardTable, HISAT2, RSEM or Core).

Returns

return – return if the program completes successfully.

Return type

0

classmethod merge_cell_metrics(args: Optional[Iterable[str]] = None) int

Command line entrypoint for merging multiple cell metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod merge_count_matrices(args: Optional[Iterable[str]] = None) int

Command line entrypoint for merging multiple count matrices.

Merges multiple csr-format count matrices into a single csr matrix.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod merge_gene_metrics(args: Optional[Iterable[str]] = None) int

Command line entrypoint for merging multiple gene metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

molecule_barcode = None
sample_barcode = None
classmethod split_bam(args: Optional[Iterable] = None) int

Command line entrypoint for splitting a bamfile into subfiles of equal size.

prints filenames of chunks to stdout

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod tag_sort_bam(args: Optional[Iterable] = None) int

Command line entrypoint for sorting a bam file by zero or more tags, followed by queryname.

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod verify_bam_sort(args: Optional[Iterable] = None) int

Command line entrypoint for verifying bam is properly sorted by zero or more tags, followed by queryname.

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

class sctools.platform.GenericPlatform[source]

Bases: object

Platform-agnostic command line functions available in SC Tools.

tag_sort_bam()

sort a bam file by zero or more tags and then by queryname

verify_bam_sort()

verifies whether bam file is correctly sorted by given list of zero or more tags, then queryname

split_bam()

split a bam file into subfiles of equal size

calculate_gene_metrics()

calculate information about genes captured by a sequencing experiment

calculate_cell_metrics()

calculate information about cells captured by a sequencing experiment

merge_gene_metrics()

merge multiple gene metrics files into a single output

merge_cell_metrics()

merge multiple cell metrics files into a single output

bam_to_count_matrix()

construct a compressed sparse row count file from a tagged, aligned bam file

merge_count_matrices()

merge multiple csr-format count matrices into a single csr matrix

group_qc_outputs()

aggregate Picard, HISAT2 and RSEM QC statistics

classmethod bam_to_count_matrix(args: Optional[Iterable[str]] = None) int[source]

Command line entrypoint for constructing a count matrix from a tagged bam file.

Constructs a count matrix from an aligned bam file sorted by cell barcode, molecule barcode, and gene id.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod calculate_cell_metrics(args: Optional[Iterable[str]] = None) int[source]

Command line entrypoint for calculating cell metrics from a sorted bamfile.

Writes metrics to .csv

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod calculate_gene_metrics(args: Optional[Iterable[str]] = None) int[source]

Command line entrypoint for calculating gene metrics from a sorted bamfile.

Writes metrics to .csv

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod get_tags(raw_tags: Optional[Sequence[str]]) Iterable[str][source]
classmethod group_qc_outputs(args: Optional[Iterable[str]] = None) int[source]

Command line entrypoint for parsing Picard metrics files and HISAT2 and RSEM statistics log files.

Parameters

args (Iterable[str], optional) – arguments list containing file_names (array of files), output_name (prefix of output file name), and metrics_type (Picard, PicardTable, HISAT2, RSEM or Core).

Returns

return – return if the program completes successfully.

Return type

0

classmethod merge_cell_metrics(args: Optional[Iterable[str]] = None) int[source]

Command line entrypoint for merging multiple cell metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod merge_count_matrices(args: Optional[Iterable[str]] = None) int[source]

Command line entrypoint for merging multiple count matrices.

Merges multiple csr-format count matrices into a single csr matrix.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod merge_gene_metrics(args: Optional[Iterable[str]] = None) int[source]

Command line entrypoint for merging multiple gene metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod split_bam(args: Optional[Iterable] = None) int[source]

Command line entrypoint for splitting a bamfile into subfiles of equal size.

prints filenames of chunks to stdout

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod tag_sort_bam(args: Optional[Iterable] = None) int[source]

Command line entrypoint for sorting a bam file by zero or more tags, followed by queryname.

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod verify_bam_sort(args: Optional[Iterable] = None) int[source]

Command line entrypoint for verifying bam is properly sorted by zero or more tags, followed by queryname.

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

class sctools.platform.TenXV2[source]

Bases: sctools.platform.GenericPlatform

Command Line Interface for 10x Genomics v2 RNA-sequencing programs

This class defines several methods that are created as CLI tools when sctools is installed (see setup.py)

cell_barcode

A data class that defines the start and end position of the cell barcode and the tags to assign the sequence and quality of the cell barcode

Type

fastq.EmbeddedBarcode

molecule_barcode

A data class that defines the start and end position of the molecule barcode and the tags to assign the sequence and quality of the molecule barcode

Type

fastq.EmbeddedBarcode

sample_barcode

A data class that defines the start and end position of the sample barcode and the tags to assign the sequence and quality of the sample barcode

Type

fastq.EmbeddedBarcode

attach_barcodes()[source]

Attach barcodes from the forward (r1) and optionally index (i1) fastq files to the reverse (r2) bam file

classmethod attach_barcodes(args=None)[source]

Command line entrypoint for attaching barcodes to a bamfile.

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod bam_to_count_matrix(args: Optional[Iterable[str]] = None) int

Command line entrypoint for constructing a count matrix from a tagged bam file.

Constructs a count matrix from an aligned bam file sorted by cell barcode, molecule barcode, and gene id.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod calculate_cell_metrics(args: Optional[Iterable[str]] = None) int

Command line entrypoint for calculating cell metrics from a sorted bamfile.

Writes metrics to .csv

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod calculate_gene_metrics(args: Optional[Iterable[str]] = None) int

Command line entrypoint for calculating gene metrics from a sorted bamfile.

Writes metrics to .csv

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

cell_barcode = Tag(start=0, end=16, sequence_tag='CR', quality_tag='CY')
classmethod get_tags(raw_tags: Optional[Sequence[str]]) Iterable[str]
classmethod group_qc_outputs(args: Optional[Iterable[str]] = None) int

Command line entrypoint for parsing Picard metrics files and HISAT2 and RSEM statistics log files.

Parameters

args (Iterable[str], optional) – arguments list containing file_names (array of files), output_name (prefix of output file name), and metrics_type (Picard, PicardTable, HISAT2, RSEM or Core).

Returns

return – return if the program completes successfully.

Return type

0

classmethod merge_cell_metrics(args: Optional[Iterable[str]] = None) int

Command line entrypoint for merging multiple cell metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod merge_count_matrices(args: Optional[Iterable[str]] = None) int

Command line entrypoint for merging multiple count matrices.

Merges multiple csr-format count matrices into a single csr matrix.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod merge_gene_metrics(args: Optional[Iterable[str]] = None) int

Command line entrypoint for merging multiple gene metrics files.

Merges multiple metrics inputs into a single metrics file that matches the shape and order of the generated count matrix.

Parameters

args (Iterable[str], optional) – Arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

molecule_barcode = Tag(start=16, end=26, sequence_tag='UR', quality_tag='UY')
sample_barcode = Tag(start=0, end=8, sequence_tag='SR', quality_tag='SY')
classmethod split_bam(args: Optional[Iterable] = None) int

Command line entrypoint for splitting a bamfile into subfiles of equal size.

prints filenames of chunks to stdout

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod tag_sort_bam(args: Optional[Iterable] = None) int

Command line entrypoint for sorting a bam file by zero or more tags, followed by queryname.

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

classmethod verify_bam_sort(args: Optional[Iterable] = None) int

Command line entrypoint for verifying bam is properly sorted by zero or more tags, followed by queryname.

Parameters

args (Iterable[str], optional) – arguments list, for testing (see test/test_entrypoints.py for example). The default value of None, when passed to parser.parse_args causes the parser to read sys.argv

Returns

return_call – return call if the program completes successfully

Return type

0

sctools.reader module

Sequence File Iterators

This module defines a general iterator and some helper functions for iterating over files that contain sequencing data

sctools.infer_open(file_: str, mode: str)

helper function that determines the compression type of a file without relying on its extension

sctools.zip_readers(*readers, indices=None)

helper function that iterates over one or more readers, optionally extracting only the records that correspond to indices

sctools.Classes()
-------
Reader          Basic reader that loops over one or more input files.
class sctools.reader.Reader(files='-', mode='r', header_comment_char=None)[source]

Bases: object

Basic reader object that seamlessly loops over multiple input files.

Is subclassed to create readers for specific file types (e.g. fastq, gtf, etc.)

Parameters
  • files (Union[str, List], optional) – The file(s) to read. If ‘-’, read sys.stdin (default = ‘-’)

  • mode ({'r', 'rb'}, optional) – The open mode for files. If ‘r’, yield string data, if ‘rb’, yield bytes data (default = ‘r’).

  • header_comment_char (str, optional) – If not None, skip lines beginning with this character (default = None).

property filenames: List[str]
select_record_indices(indices: Set) Generator[source]

Iterate over provided indices only, skipping other records.

Parameters

indices (Set[int]) – indices to include in the output

Yields

record, str – records from file corresponding to indices

property size: int

return the collective size of all files being read in bytes
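The idea of looping over several input files as if they were one can be sketched with a small class. This is a text-mode-only approximation written for illustration (no stdin, no compression, and every line starting with the comment character is skipped); MultiFileReader is a hypothetical name, not the sctools Reader.

```python
import os
import tempfile
from typing import Iterator, List, Optional, Union


class MultiFileReader:
    """Iterate over the lines of one or more text files in order (sketch)."""

    def __init__(self, files: Union[str, List[str]],
                 header_comment_char: Optional[str] = None):
        self._files = [files] if isinstance(files, str) else list(files)
        self._header_comment_char = header_comment_char

    @property
    def filenames(self) -> List[str]:
        return list(self._files)

    @property
    def size(self) -> int:
        """Collective size of all files in bytes."""
        return sum(os.path.getsize(name) for name in self._files)

    def __iter__(self) -> Iterator[str]:
        for name in self._files:
            with open(name, "r") as handle:
                for line in handle:
                    if (self._header_comment_char is not None
                            and line.startswith(self._header_comment_char)):
                        continue
                    yield line


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(["#header\na\nb\n", "c\n"]):
        path = os.path.join(tmp, f"part{i}.txt")
        with open(path, "w", newline="\n") as out:
            out.write(text)
        paths.append(path)
    reader = MultiFileReader(paths, header_comment_char="#")
    lines = list(reader)
    total_size = reader.size
```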

sctools.reader.infer_open(file_: str, mode: str) Callable[source]

Helper function to infer the correct compression type of an input file

Identifies files that are .gz or .bz2 compressed without requiring file extensions

Parameters
  • file_ (str) – the file to open

  • mode ({'r', 'rb'}) – the mode to open the file in. ‘r’ returns strings, ‘rb’ returns bytes

Returns

open_function – the correct open function for the file’s compression with mode pre-set through functools partial

Return type

Callable
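One way to detect compression without trusting the file extension is to inspect the leading magic bytes. The sketch below assumes that approach (gzip files start with 0x1f 0x8b, bzip2 files with "BZh"); whether sctools does exactly this is not stated here, and infer_open_sketch is a hypothetical name.

```python
import bz2
import gzip
import os
import tempfile
from functools import partial
from typing import Callable


def infer_open_sketch(file_: str, mode: str) -> Callable:
    """Return an open function suited to the file's compression, with mode pre-set."""
    with open(file_, "rb") as handle:
        magic = handle.read(3)
    # gzip.open and bz2.open treat 'r' as binary, so text reads need 'rt'.
    compressed_mode = "rt" if mode == "r" else "rb"
    if magic[:2] == b"\x1f\x8b":
        return partial(gzip.open, mode=compressed_mode)
    if magic == b"BZh":
        return partial(bz2.open, mode=compressed_mode)
    return partial(open, mode=mode)


with tempfile.TemporaryDirectory() as tmp:
    # The extension is deliberately misleading: the content is gzip-compressed.
    path = os.path.join(tmp, "reads.txt")
    with gzip.open(path, "wt") as out:
        out.write("ACGT\n")
    open_function = infer_open_sketch(path, "r")
    with open_function(path) as handle:
        first_line = handle.readline()
```

Returning a functools.partial keeps the call site uniform: the caller only supplies the file name, whatever the compression turned out to be.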

sctools.reader.zip_readers(*readers, indices=None) Generator[source]

Zip together multiple reader objects, yielding records simultaneously.

If indices is passed, only return lines in file that correspond to indices

Parameters
  • *readers (List[Reader]) – Reader objects to simultaneously iterate over

  • indices (Set[int], optional) – indices to include in the output

Yields

records (Tuple[str]) – one record per reader passed
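The behaviour described above can be approximated with the built-in zip. The function below is a sketch, named zip_readers_sketch to make clear it is not the library function, and it accepts any iterables in place of Reader objects.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple


def zip_readers_sketch(*readers: Iterable, indices: Optional[Set[int]] = None) -> Iterator[Tuple]:
    """Yield one tuple per position, holding one record from each reader."""
    zipped = zip(*readers)
    if indices is None:
        yield from zipped
    else:
        for i, records in enumerate(zipped):
            if i in indices:
                yield records


r1 = ["a1", "a2", "a3"]
r2 = ["b1", "b2", "b3"]
all_records = list(zip_readers_sketch(r1, r2))
subset = list(zip_readers_sketch(r1, r2, indices={1}))
```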

sctools.stats module

Statistics Functions for Sequence Data Analysis

This module implements statistical functions and classes for sequence analysis

sctools.base4_entropy(x: np.array, axis: int = 1)

calculate the entropy of a 4 x sequence length base frequency matrix

sctools.Classes()
-------
OnlineGaussianSufficientStatistic       Empirical (online) calculation of mean and variance
class sctools.stats.OnlineGaussianSufficientStatistic[source]

Bases: object

Implementation of Welford’s online mean and variance algorithm
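For readers unfamiliar with Welford’s algorithm, a minimal sketch follows. It updates the mean and the sum of squared deviations one value at a time, and it reports the sample variance (dividing by n - 1), which is an assumption about the convention rather than something stated here. WelfordSketch is a hypothetical class, not the sctools one.

```python
import math


class WelfordSketch:
    """Single-pass running mean and variance (Welford's algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, new_value: float) -> None:
        self.count += 1
        delta = new_value - self.mean
        self.mean += delta / self.count
        # Uses the deviation from the old mean and from the updated mean.
        self._m2 += delta * (new_value - self.mean)

    def variance(self) -> float:
        if self.count < 2:
            return math.nan  # assumption: undefined for fewer than two values
        return self._m2 / (self.count - 1)


stat = WelfordSketch()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stat.update(value)
```

Because only three numbers are stored, the estimate can be maintained over a stream of values without keeping them in memory.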

update(new_value: float)[source]

incorporate new_value into the online estimate of mean and variance

mean()

return the mean value

calculate_variance()[source]

calculate and return the variance

mean_and_variance()[source]

return both mean and variance

calculate_variance()[source]

calculate and return the variance

property mean: float

return the mean value

mean_and_variance() Tuple[float, float][source]

calculate and return the mean and variance

update(new_value: float) None[source]
sctools.stats.base4_entropy(x, axis=1)[source]

Calculate entropy in base four of a data matrix x

Useful for measuring DNA entropy (with 4 nucleotides) as the output is restricted to [0, 1]

Parameters
  • x (np.ndarray) – array of dimension one or more containing numeric types

  • axis (int, optional) – axis to calculate entropy across. Values in this axis are treated as observation frequencies

Returns

entropy – array of input dimension - 1 containing entropy values bounded in [0, 1]

Return type

np.ndarray
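To show why the result is bounded in [0, 1], the sketch below computes base-4 entropy for a single vector of four nucleotide frequencies in pure Python. It is a simplified illustration that does not use NumPy or an axis argument; base4_entropy_sketch is a hypothetical name.

```python
import math
from typing import Sequence


def base4_entropy_sketch(frequencies: Sequence[float]) -> float:
    """Shannon entropy with logarithm base 4 of one frequency vector."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    entropy = 0.0
    for count in frequencies:
        if count > 0:  # terms with zero frequency contribute nothing
            probability = count / total
            entropy -= probability * math.log(probability, 4)
    return entropy


uniform = base4_entropy_sketch([25, 25, 25, 25])   # all four bases equally likely
single_base = base4_entropy_sketch([10, 0, 0, 0])  # only one base observed
```

With four possible outcomes, the uniform case gives the maximum entropy, which the base-4 logarithm scales to one, while a position with a single observed base gives zero.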