sctools.metrics package

Submodules

sctools.metrics.aggregator module

Sequence Metric Aggregators

This module provides classes useful for aggregating metric information for individual cells or genes. These classes consume BAM files that have been pre-sorted such that all sequencing reads that correspond to the molecules of a cell (CellMetrics) or the molecules of a gene (GeneMetrics) are yielded sequentially.
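The pre-sorting requirement is what lets an aggregator be filled one entity at a time. Below is a minimal sketch of that idea, using plain tuples in place of pysam records. The record layout and the `group_by_cell` helper are hypothetical illustrations and are not part of sctools.

```python
from itertools import groupby
from operator import itemgetter

# Stand-ins for sorted BAM records: (cell barcode, gene, molecule barcode).
# Real input would be pysam.AlignedSegment objects carrying CB, GE and UB tags.
records = [
    ("AAAC", "GeneA", "UMI1"),
    ("AAAC", "GeneA", "UMI1"),
    ("AAAC", "GeneB", "UMI2"),
    ("CCCG", "GeneA", "UMI3"),
]


def group_by_cell(sorted_records):
    """Yield (cell barcode, list of records) for each run of equal barcodes."""
    for cell, group in groupby(sorted_records, key=itemgetter(0)):
        yield cell, list(group)


reads_per_cell = {cell: len(group) for cell, group in group_by_cell(records)}
print(reads_per_cell)
```

Because `groupby` only merges adjacent items, unsorted input would split one cell into several groups, which is why the sort order matters.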

Classes

Notes

This module can be rewritten with dataclass when python 3.7 stabilizes, see https://www.python.org/dev/peps/pep-0557/

class sctools.metrics.aggregator.CellMetrics[source]

Bases: MetricAggregator

Cell Metric Aggregator

Aggregator that captures metric information about a cell by parsing all of the molecules in an experiment that were annotated with a specific cell barcode, as recorded in the CB tag.

perfect_cell_barcodes

The number of reads whose cell barcodes contain no errors (tag CB == CR)

Type: int

reads_mapped_intergenic

The number of reads mapped to an intergenic region for this cell

Type: int

reads_mapped_too_many_loci

The number of reads that were mapped to too many loci across the genome and as a consequence, are reported unmapped by the aligner

Type: int

cell_barcode_fraction_bases_above_30_variance

The variance of the fraction of Illumina base calls for the cell barcode sequence that are greater than 30, across molecules

Type: float

cell_barcode_fraction_bases_above_30_mean

The average fraction of Illumina base calls for the cell barcode sequence that are greater than 30, across molecules

Type: float

n_genes

The number of genes detected by this cell

Type: int

genes_detected_multiple_observations

The number of genes that are observed by more than one read in this cell

Type: int

n_mitochondrial_genes

The number of mitochondrial genes detected by this cell

Type: int

n_mitochondrial_molecules

The number of molecules from mitochondrial genes detected for this cell

Type: int

pct_mitochondrial_molecules

The percentage of molecules from mitochondrial genes detected for this cell

Type: int

See also

GeneMetrics

finalize(mitochondrial_genes={})[source]

Calculate metrics that require information from all molecules of an entity

finalize() replaces attributes in-place that were initialized by the constructor as None with a value calculated across all molecule data that has been aggregated.

parse_extra_fields(tags: Sequence[str], record: AlignedSegment) → None[source]

Parses a record to extract gene-specific information

Gene-specific metric data is stored in-place in the MetricAggregator

Parameters

tags (Sequence[str]) – The GE, UB and CB tags that define this molecule
record (pysam.AlignedSegment) – SAM record to be parsed

parse_molecule(tags: Sequence[str], records: Iterable[AlignedSegment]) → None

Parse information from all records of a molecule.

The parsed information is stored in the MetricAggregator in-place.

Parameters

tags (Sequence[str]) – all the tags that define this molecule. one of {[CB, GE, UB], [GE, CB, UB]}
records (Iterable[pysam.AlignedSegment]) – the sam records associated with the molecule

class sctools.metrics.aggregator.GeneMetrics[source]

Bases: MetricAggregator

Gene Metric Aggregator

Aggregator that captures metric information about a gene by parsing all of the molecules in an experiment that were annotated with a specific gene ID, as recorded in the GE tag.

number_cells_detected_multiple

The number of cells which observe more than one read of this gene

Type: int

number_cells_expressing

The number of cells that detect this gene

Type: int

See also

CellMetrics

finalize()[source]

Calculate metrics that require information from all molecules of an entity

finalize() replaces attributes in-place that were initialized by the constructor as None with a value calculated across all molecule data that has been aggregated.

parse_extra_fields(tags: Sequence[str], record: AlignedSegment) → None[source]

Parses a record to extract cell-specific information

Cell-specific metric data is stored in-place in the MetricAggregator

Parameters

tags (Sequence[str]) – The CB, UB and GE tags that define this molecule
record (pysam.AlignedSegment) – SAM record to be parsed

parse_molecule(tags: Sequence[str], records: Iterable[AlignedSegment]) → None

Parse information from all records of a molecule.

The parsed information is stored in the MetricAggregator in-place.

Parameters

tags (Sequence[str]) – all the tags that define this molecule. one of {[CB, GE, UB], [GE, CB, UB]}
records (Iterable[pysam.AlignedSegment]) – the sam records associated with the molecule

class sctools.metrics.aggregator.MetricAggregator[source]

Bases: object

Metric Aggregator Base Class

The MetricAggregator class defines a set of metrics that can be extracted from an aligned bam file. It defines all the metrics that are general across genes and cells. This class is subclassed by GeneMetrics and CellMetrics, which define data-specific metrics in the parse_extra_fields method. An instance of GeneMetrics or CellMetrics is instantiated for each gene or cell in a bam file, respectively.

n_reads

The number of reads associated with this entity

Type: int

noise_reads

Number of reads that are categorized by 10x Genomics Cell Ranger as “noise”. Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides

Type: int, NotImplemented

perfect_molecule_barcodes

The number of reads with molecule barcodes that have no errors (tag UB == UR)

Type: int

reads_mapped_exonic

The number of reads for this entity that are mapped to exons

Type: int

reads_mapped_intronic

The number of reads for this entity that are mapped to introns

Type: int

reads_mapped_utr

The number of reads for this entity that are mapped to 3’ untranslated regions (UTRs)

Type: int

reads_mapped_uniquely

The number of reads mapped to a single unambiguous location in the genome

Type: int

reads_mapped_multiple

The number of reads mapped to multiple genomic positions with equal confidence

Type: int

duplicate_reads

The number of reads that are duplicates (see README.md for the definition of a duplicate)

Type: int

spliced_reads

The number of reads that overlap splicing junctions

Type: int

antisense_reads

The number of reads that are mapped to the antisense strand instead of the transcribed strand

Type: int

molecule_barcode_fraction_bases_above_30_mean

The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity

Type: float

molecule_barcode_fraction_bases_above_30_variance

The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity

Type: float

genomic_reads_fraction_bases_quality_above_30_mean

The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)

Type: float

genomic_reads_fraction_bases_quality_above_30_variance

The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)

Type: float

genomic_read_quality_mean

Average quality of Illumina base calls in the genomic reads corresponding to this entity

Type: float

genomic_read_quality_variance

Variance in quality of Illumina base calls in the genomic reads corresponding to this entity

Type: float

n_molecules

Number of molecules corresponding to this entity. See README.md for the definition of a Molecule

Type: float

n_fragments

Number of fragments corresponding to this entity. See README.md for the definition of a Fragment

Type: float

reads_per_molecule

The average number of reads associated with each molecule in this entity

Type: float

reads_per_fragment

The average number of reads associated with each fragment in this entity

Type: float

fragments_per_molecule

The average number of fragments associated with each molecule in this entity

Type: float

fragments_with_single_read_evidence

The number of fragments associated with this entity that are observed by only one read

Type: int

molecules_with_single_read_evidence

The number of molecules associated with this entity that are observed by only one read

Type: int

parse_extra_fields(tags, record), NotImplemented: Abstract method that must be implemented by subclasses. Called by parse_molecule() to gather information for subclass-specific metrics

parse_molecule(tags, record)[source]: Extract information from a set of sequencing reads that correspond to a molecule and store the data in the MetricAggregator class.

finalize()[source]: Some metrics cannot be calculated until all the information for an entity has been aggregated, for example, the number of fragments_per_molecule. Finalize calculates all such higher-order metrics

finalize() → None[source]

Calculate metrics that require information from all molecules of an entity

finalize() replaces attributes in-place that were initialized by the constructor as None with a value calculated across all molecule data that has been aggregated.
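The split between parsing and finalizing can be pictured with a stripped-down aggregator. This is only a sketch: `TinyAggregator` is a hypothetical class that borrows three attribute names from the list above, and it does not reproduce the sctools implementation.

```python
class TinyAggregator:
    """Hypothetical, stripped-down aggregator; not the sctools class."""

    def __init__(self):
        self.n_reads = 0
        self.n_molecules = 0
        # Ratio metrics start as None and are filled in by finalize().
        self.reads_per_molecule = None

    def parse_molecule(self, records):
        """Count one molecule and all of its reads."""
        self.n_molecules += 1
        self.n_reads += len(records)

    def finalize(self):
        """Compute metrics that need the complete counts."""
        if self.n_molecules:
            self.reads_per_molecule = self.n_reads / self.n_molecules


agg = TinyAggregator()
agg.parse_molecule(["read1", "read2", "read3"])
agg.parse_molecule(["read4"])
agg.finalize()
print(vars(agg))
```

Calling `vars()` on the finalized object yields the kind of key-to-value mapping that a metric writer can consume row by row.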

parse_extra_fields(tags: Sequence[str], record: AlignedSegment) → None[source]: Defined by subclasses to extract class-specific information from molecules

parse_molecule(tags: Sequence[str], records: Iterable[AlignedSegment]) → None[source]

Parse information from all records of a molecule.

The parsed information is stored in the MetricAggregator in-place.

Parameters

tags (Sequence[str]) – all the tags that define this molecule. one of {[CB, GE, UB], [GE, CB, UB]}
records (Iterable[pysam.AlignedSegment]) – the sam records associated with the molecule

sctools.metrics.gatherer module

Sequence Metric Gatherers

This module defines classes to gather metrics across the cells or genes of an experiment and write them to gzip-compressed csv files

Classes

`MetricGatherer`(bam_file, output_stem[, ...])	Gathers Metrics from an experiment
`GatherCellMetrics`(bam_file, output_stem[, ...])	Sequence Metric Gatherers
`GatherGeneMetrics`(bam_file, output_stem[, ...])	Sequence Metric Gatherers

class sctools.metrics.gatherer.GatherCellMetrics(bam_file: str, output_stem: str, mitochondrial_gene_ids: Set[str] = {}, compress: bool = True)[source]

Bases: MetricGatherer

Notes

bam_file must be sorted by gene (GE), molecule (UB), and cell (CB), where gene varies fastest.

Examples

>>> from sctools.metrics.gatherer import GatherCellMetrics
>>> import os, tempfile

>>> # example data
>>> bam_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'test', 'data', 'test.bam')
>>> temp_dir = tempfile.mkdtemp()
>>> g = GatherCellMetrics(bam_file=bam_file, output_stem=os.path.join(temp_dir, 'test'), compress=True)
>>> g.extract_metrics()

See also

GatherGeneMetrics

property bam_file: str: the bam file that metrics are generated from

extract_metrics(mode: str = 'rb') → None[source]

Extract cell metrics from self.bam_file

Parameters: mode (str, optional) – Open mode for self.bam. ‘r’ -> sam, ‘rb’ -> bam (default = ‘rb’).

class sctools.metrics.gatherer.GatherGeneMetrics(bam_file: str, output_stem: str, mitochondrial_gene_ids: Set[str] = {}, compress: bool = True)[source]

Bases: MetricGatherer

Notes

bam_file must be sorted by molecule (UB), cell (CB), and gene (GE), where molecule varies fastest.

Examples

>>> from sctools.metrics.gatherer import GatherGeneMetrics
>>> import os, tempfile

>>> # example data
>>> bam_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'test', 'data', 'test.bam')
>>> temp_dir = tempfile.mkdtemp()
>>> g = GatherGeneMetrics(bam_file=bam_file, output_stem=os.path.join(temp_dir, 'test'), compress=True)
>>> g.extract_metrics()

See also

GatherCellMetrics

property bam_file: str: the bam file that metrics are generated from

extract_metrics(mode: str = 'rb') → None[source]

Extract gene metrics from self.bam_file

Parameters: mode (str, optional) – Open mode for self.bam. ‘r’ -> sam, ‘rb’ -> bam (default = ‘rb’).

class sctools.metrics.gatherer.MetricGatherer(bam_file: str, output_stem: str, mitochondrial_gene_ids: Set[str] = {}, compress: bool = True)[source]

Bases: object

Gathers Metrics from an experiment

Because molecules tend to have relatively small numbers of reads, the memory footprint of this method is typically small (tens of megabytes).

Parameters

bam_file (str) – the bam file containing the reads that metrics should be calculated from. Can be a chunk of cells or an entire experiment
output_stem (str) – the file stem for the gzipped csv output

extract_metrics()[source]: extracts metrics from bam_file and writes them to output_stem.csv.gz

property bam_file: str: the bam file that metrics are generated from

extract_metrics(mode='rb') → None[source]

Extract metrics from the provided bam file and write the results to csv.

Parameters: mode ({'r', 'rb'}, default 'rb') – the open mode for pysam.AlignmentFile. ‘r’ indicates the input is a sam file, and ‘rb’ indicates a bam file.

sctools.metrics.merge module

Merge Sequence Metrics

This module defines classes to merge multiple metrics files that have been gathered from bam files containing disjoint sets of cells. This is a common use pattern, as sequencing datasets are often chunked to enable horizontal scaling using scatter-gather patterns.

Classes

MergeMetrics	Merge Metrics base class
MergeCellMetrics	Class to merge cell metrics
MergeGeneMetrics	Class to merge gene metrics

class sctools.metrics.merge.MergeCellMetrics(metric_files: Sequence[str], output_file: str)[source]

Bases: MergeMetrics

execute() → None[source]

Concatenate input cell metric files

Since bam files that metrics are calculated from contain disjoint sets of cells, cell metrics can simply be concatenated together.
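Concatenation of this kind amounts to keeping one header and appending the remaining rows. The sketch below works on in-memory CSV text rather than gzip-compressed files, and `concatenate_metrics` is a hypothetical helper, not the method sctools uses.

```python
import csv
import io


def concatenate_metrics(csv_texts):
    """Join CSV documents that share a header, keeping the header once."""
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    header = None
    for text in csv_texts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty chunks
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            raise ValueError("metric files have different columns")
        writer.writerows(rows[1:])
    return output.getvalue()


# Two chunks holding disjoint cells; the first column is the cell barcode.
chunk_1 = ",n_reads\nAAAC,10\n"
chunk_2 = ",n_reads\nCCCG,7\n"
merged = concatenate_metrics([chunk_1, chunk_2])
print(merged)
```

No values need to be recomputed here, because each cell appears in exactly one chunk.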

class sctools.metrics.merge.MergeGeneMetrics(metric_files: Sequence[str], output_file: str)[source]

Bases: MergeMetrics

execute() → None[source]

Merge input gene metric files

The bam files that metrics are calculated from contain disjoint sets of cells, each of which can measure the same genes. As a result, the metric values must be summed (count-based metrics), averaged (fractional, average, or variance metrics), or recalculated (metrics that depend on other metrics).
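The three merge strategies can be sketched for a single gene seen in two chunks. `merge_gene_rows` is a hypothetical helper, and weighting the average by read count is an assumption of this sketch rather than a description of how sctools combines averages.

```python
def merge_gene_rows(a, b):
    """Merge two metric rows for the same gene from disjoint cell chunks."""
    merged = {}
    # Count-based metrics are additive across chunks.
    for key in ("n_reads", "n_molecules"):
        merged[key] = a[key] + b[key]
    # Averages are combined as read-weighted means (an assumption of this sketch).
    merged["genomic_read_quality_mean"] = (
        a["genomic_read_quality_mean"] * a["n_reads"]
        + b["genomic_read_quality_mean"] * b["n_reads"]
    ) / merged["n_reads"]
    # Derived metrics are recalculated from the merged counts.
    merged["reads_per_molecule"] = merged["n_reads"] / merged["n_molecules"]
    return merged


chunk_1 = {"n_reads": 30, "n_molecules": 10,
           "genomic_read_quality_mean": 36.0, "reads_per_molecule": 3.0}
chunk_2 = {"n_reads": 10, "n_molecules": 5,
           "genomic_read_quality_mean": 32.0, "reads_per_molecule": 2.0}
merged = merge_gene_rows(chunk_1, chunk_2)
print(merged)
```

Recomputing `reads_per_molecule` from the summed counts gives 40 / 15, whereas averaging the two per-chunk ratios would give 2.5, which is why dependent metrics are recalculated rather than averaged.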

class sctools.metrics.merge.MergeMetrics(metric_files: Sequence[str], output_file: str)[source]

Bases: object

Merges multiple metrics files into a single gzip compressed csv file

Parameters

metric_files (Sequence[str]) – metrics files to merge
output_file (str) – file name for the merged output

execute()[source]: merge metrics files # todo this should probably be wrapped into __init__ to make this more like a function

execute() → None[source]

sctools.metrics.writer module

Metric Writers

This module defines a class to write metrics to csv as the data is generated, cell by cell or gene by gene. This strategy keeps memory usage low, as no more than a single molecule’s worth of sam records and one cell or gene’s worth of metric data are in-memory at a time.

Classes

MetricCSVWriter Class to write metrics to file

class sctools.metrics.writer.MetricCSVWriter(output_stem: str, compress=True)[source]

Bases: object

Writes metric information iteratively to (optionally compressed) csv.

Parameters

output_stem (str) – File stem for the output file.
compress (bool, optional) – Whether or not to compress the output file (default = True).

write_header()[source]: Write the metric header to file.

write()[source]: Write an array of cell or gene metrics to file.

close()[source]: Close the metric file.

close() → None[source]: Close the metrics file.

property filename: str: filename with correct suffix added

write(index: str, record: Mapping[str, Number]) → None[source]

Write the array of metric values for a cell or gene to file.

Parameters

index (str) – The name of the cell or gene that these metrics summarize
record (Mapping[str, Number]) – Output of vars() called on an sctools.metrics.aggregator.MetricAggregator instance, producing a dictionary of keys to metric values.

write_header(record: Mapping[str, Any]) → None[source]

Write the metric keys to file, producing the header line of the csv file.

Parameters: record (Mapping[str, Any]) – Output of vars() called on an sctools.metrics.aggregator.MetricAggregator instance, producing a dictionary of keys to metric values.
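The header-then-rows pattern can be sketched with the standard library. `TinyMetricWriter` is a hypothetical stand-in; the sorted column order and the empty first header cell are assumptions of this sketch, not documented behaviour of MetricCSVWriter.

```python
import csv
import gzip
import os
import tempfile


class TinyMetricWriter:
    """Hypothetical stand-in for a streaming metric writer; not the sctools class."""

    def __init__(self, output_stem, compress=True):
        self.filename = output_stem + (".csv.gz" if compress else ".csv")
        opener = gzip.open if compress else open
        self._handle = opener(self.filename, "wt", newline="")
        self._writer = csv.writer(self._handle)
        self._keys = None

    def write_header(self, record):
        # Fix the column order once; the first column holds the row index.
        self._keys = sorted(record)
        self._writer.writerow([""] + self._keys)

    def write(self, index, record):
        self._writer.writerow([index] + [record[k] for k in self._keys])

    def close(self):
        self._handle.close()


stem = os.path.join(tempfile.mkdtemp(), "cells")
writer = TinyMetricWriter(stem)
rows = {"AAAC": {"n_reads": 10, "n_genes": 4}, "CCCG": {"n_reads": 7, "n_genes": 2}}
for i, (cell, metrics) in enumerate(rows.items()):
    if i == 0:
        writer.write_header(metrics)  # header comes from the first record's keys
    writer.write(cell, metrics)
writer.close()

# Read the compressed file back to check what was written.
with gzip.open(writer.filename, "rt", newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```

Only one row of metrics is held in memory at a time, which mirrors the low-memory strategy described for this module.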