sctools.metrics package

Submodules

sctools.metrics.aggregator module

Sequence Metric Aggregators

This module provides classes useful for aggregating metric information for individual cells or genes. These classes consume BAM files that have been pre-sorted such that all sequencing reads that correspond to the molecules of a cell (CellMetrics) or the molecules of a gene (GeneMetrics) are yielded sequentially.
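
Because the input is pre-sorted, all records for one cell (or gene) arrive as one contiguous run and can be consumed group by group without loading the whole file. The sketch below illustrates that idea with plain tuples standing in for BAM records; the tuple layout and the `iter_groups` helper are assumptions made for illustration and are not part of sctools.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-ins for BAM records: (cell_barcode, molecule_barcode, gene)
records = [
    ("AAAC", "UMI1", "GeneA"),
    ("AAAC", "UMI1", "GeneA"),
    ("AAAC", "UMI2", "GeneB"),
    ("CCGT", "UMI1", "GeneA"),
]

def iter_groups(sorted_records, key_index=0):
    """Yield (key, records) for each contiguous run sharing the same key."""
    for key, group in groupby(sorted_records, key=itemgetter(key_index)):
        yield key, list(group)

# Only one group is held in memory at a time while iterating
reads_per_cell = {cell: len(group) for cell, group in iter_groups(records)}
```

Note that `groupby` only merges adjacent items, which is why the pre-sorting requirement matters: an unsorted input would split one cell into several groups.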

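Classes

  • MetricAggregator – Metric Aggregator Base Class

  • CellMetrics – Cell Metric Aggregator

  • GeneMetrics – Gene Metric Aggregator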

Notes

This module can be rewritten with dataclasses once Python 3.7 stabilizes; see PEP 557: https://www.python.org/dev/peps/pep-0557/

class sctools.metrics.aggregator.CellMetrics[source]

Bases: MetricAggregator

Cell Metric Aggregator

Aggregator that captures metric information about a cell by parsing all of the molecules in an experiment that were annotated with a specific cell barcode, as recorded in the CB tag.

perfect_cell_barcodes

The number of reads whose cell barcodes contain no errors (tag CB == CR)

Type

int

reads_mapped_intergenic

The number of reads mapped to an intergenic region for this cell

Type

int

reads_mapped_too_many_loci

The number of reads that were mapped to too many loci across the genome and, as a consequence, are reported as unmapped by the aligner

Type

int

cell_barcode_fraction_bases_above_30_variance

The variance, across molecules, of the fraction of Illumina base calls in the cell barcode sequence with quality scores greater than 30

Type

float

cell_barcode_fraction_bases_above_30_mean

The average, across molecules, of the fraction of Illumina base calls in the cell barcode sequence with quality scores greater than 30

Type

float
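
The two quality metrics above can be illustrated with a small calculation. This is a sketch under assumptions: the quality scores are made up, `fraction_above_30` is a hypothetical helper, and the population variance is used here although sctools may use a different estimator.

```python
from statistics import mean, pvariance

def fraction_above_30(qualities):
    """Fraction of Phred quality scores strictly greater than 30."""
    return sum(q > 30 for q in qualities) / len(qualities)

# Hypothetical cell barcode quality scores, one list per molecule
barcode_qualities = [
    [35, 36, 20, 38],  # 3 of 4 above 30
    [31, 32, 33, 34],  # 4 of 4 above 30
    [30, 10, 40, 40],  # 2 of 4 above 30 (a score of exactly 30 does not count)
]

fractions = [fraction_above_30(q) for q in barcode_qualities]
fraction_mean = mean(fractions)
fraction_variance = pvariance(fractions)  # assumption: population variance
```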

n_genes

The number of genes detected by this cell

Type

int

genes_detected_multiple_observations

The number of genes that are observed by more than one read in this cell

Type

int

n_mitochondrial_genes

The number of mitochondrial genes detected by this cell

Type

int

n_mitochondrial_molecules

The number of molecules from mitochondrial genes detected for this cell

Type

int

pct_mitochondrial_molecules

The percentage of molecules from mitochondrial genes detected for this cell

Type

int
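
As a rough illustration of how the gene-level counts above relate to one another, the following sketch derives them from a list of molecules for a single cell. The input layout, gene names, and mitochondrial gene set are assumptions for this example only; sctools computes these values from BAM records.

```python
from collections import Counter

# Hypothetical (gene, n_reads) observations, one entry per molecule of one cell
molecules = [("GeneA", 1), ("GeneA", 2), ("mt-Nd1", 1), ("GeneB", 1), ("mt-Co1", 3)]
mitochondrial_genes = {"mt-Nd1", "mt-Co1"}  # assumed set of mitochondrial gene IDs

# Total reads observed per gene in this cell
reads_per_gene = Counter()
for gene, n_reads in molecules:
    reads_per_gene[gene] += n_reads

n_genes = len(reads_per_gene)
genes_detected_multiple_observations = sum(1 for n in reads_per_gene.values() if n > 1)
n_mitochondrial_genes = sum(1 for g in reads_per_gene if g in mitochondrial_genes)
n_mitochondrial_molecules = sum(1 for gene, _ in molecules if gene in mitochondrial_genes)
pct_mitochondrial_molecules = 100.0 * n_mitochondrial_molecules / len(molecules)
```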

Metric Aggregator Base Class

The MetricAggregator class defines a set of metrics that can be extracted from an aligned bam file. It defines all the metrics that are general across genes and cells. This class is subclassed by GeneMetrics and CellMetrics, which define data-specific metrics in the parse_extra_fields method. An instance of GeneMetrics or CellMetrics is instantiated for each gene or cell in a bam file, respectively.

n_reads

The number of reads associated with this entity

Type

int

noise_reads

Number of reads that are categorized by 10x Genomics Cell Ranger as “noise”. Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides

Type

int, NotImplemented

perfect_molecule_barcodes

The number of reads with molecule barcodes that have no errors (molecule barcode tag == raw molecule barcode tag)

Type

int

reads_mapped_exonic

The number of reads for this entity that are mapped to exons

Type

int

reads_mapped_intronic

The number of reads for this entity that are mapped to introns

Type

int

reads_mapped_utr

The number of reads for this entity that are mapped to 3’ untranslated regions (UTRs)

Type

int

reads_mapped_uniquely

The number of reads mapped to a single unambiguous location in the genome

Type

int

reads_mapped_multiple

The number of reads mapped to multiple genomic positions with equal confidence (todo: verify that “equal confidence” is accurate)

Type

int

duplicate_reads

The number of reads that are duplicates (see README.md for the definition of a duplicate)

Type

int

spliced_reads

The number of reads that overlap splicing junctions

Type

int

antisense_reads

The number of reads that are mapped to the antisense strand instead of the transcribed strand

Type

int

molecule_barcode_fraction_bases_above_30_mean

The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity

Type

float

molecule_barcode_fraction_bases_above_30_variance

The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity

Type

float

genomic_reads_fraction_bases_quality_above_30_mean

The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)

Type

float

genomic_reads_fraction_bases_quality_above_30_variance

The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)

Type

float

genomic_read_quality_mean

Average quality of Illumina base calls in the genomic reads corresponding to this entity

Type

float

genomic_read_quality_variance

Variance in quality of Illumina base calls in the genomic reads corresponding to this entity

Type

float

n_molecules

Number of molecules corresponding to this entity. See README.md for the definition of a Molecule

Type

float

n_fragments

Number of fragments corresponding to this entity. See README.md for the definition of a Fragment

Type

float

reads_per_molecule

The average number of reads associated with each molecule in this entity

Type

float

reads_per_fragment

The average number of reads associated with each fragment in this entity

Type

float

fragments_per_molecule

The average number of fragments associated with each molecule in this entity

Type

float

fragments_with_single_read_evidence

The number of fragments associated with this entity that are observed by only one read

Type

int

molecules_with_single_read_evidence

The number of molecules associated with this entity that are observed by only one read

Type

int

parse_extra_fields(tags, record), NotImplemented

Abstract method that must be implemented by subclasses. Called by parse_molecule() to gather information for subclass-specific metrics

parse_molecule(tags, record)

Extract information from a set of sequencing reads that correspond to a molecule and store the data in the MetricAggregator class.

finalize()[source]

Some metrics cannot be calculated until all the information for an entity has been aggregated, for example, the number of fragments_per_molecule. Finalize calculates all such higher-order metrics

Examples

# todo implement me

See also

GeneMetrics

finalize(mitochondrial_genes={})[source]

Calculate metrics that require information from all molecules of an entity

finalize() replaces attributes in-place that were initialized by the constructor as None with a value calculated across all molecule data that has been aggregated.

parse_extra_fields(tags: Sequence[str], record: AlignedSegment) None[source]

Parses a record to extract gene-specific information

Gene-specific metric data is stored in-place in the MetricAggregator

Parameters
  • tags (Sequence[str]) – The GE, UB and CB tags that define this molecule

  • record (pysam.AlignedSegment) – SAM record to be parsed

parse_molecule(tags: Sequence[str], records: Iterable[AlignedSegment]) None

Parse information from all records of a molecule.

The parsed information is stored in the MetricAggregator in-place.

Parameters
  • tags (Sequence[str]) – all the tags that define this molecule. one of {[CB, GE, UB], [GE, CB, UB]}

  • records (Iterable[pysam.AlignedSegment]) – the sam records associated with the molecule

class sctools.metrics.aggregator.GeneMetrics[source]

Bases: MetricAggregator

Gene Metric Aggregator

Aggregator that captures metric information about a gene by parsing all of the molecules in an experiment that were annotated with a specific gene ID, as recorded in the GE tag.

number_cells_detected_multiple

The number of cells which observe more than one read of this gene

Type

int

number_cells_expressing

The number of cells that detect this gene

Type

int
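
The two gene-specific counts above can be pictured as tallies over the cell barcodes of the reads assigned to one gene. A minimal sketch, assuming a made-up list of barcodes rather than real BAM records:

```python
from collections import Counter

# Hypothetical cell barcodes, one per read observed for a single gene
cell_barcodes_of_reads = ["AAAC", "AAAC", "CCGT", "GGTA", "GGTA", "GGTA"]

reads_per_cell = Counter(cell_barcodes_of_reads)

# Cells with at least one read of this gene
number_cells_expressing = len(reads_per_cell)
# Cells with more than one read of this gene
number_cells_detected_multiple = sum(1 for n in reads_per_cell.values() if n > 1)
```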

Metric Aggregator Base Class

The MetricAggregator class defines a set of metrics that can be extracted from an aligned bam file. It defines all the metrics that are general across genes and cells. This class is subclassed by GeneMetrics and CellMetrics, which define data-specific metrics in the parse_extra_fields method. An instance of GeneMetrics or CellMetrics is instantiated for each gene or cell in a bam file, respectively.

n_reads

The number of reads associated with this entity

Type

int

noise_reads

Number of reads that are categorized by 10x Genomics Cell Ranger as “noise”. Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides

Type

int, NotImplemented

perfect_molecule_barcodes

The number of reads with molecule barcodes that have no errors (molecule barcode tag == raw molecule barcode tag)

Type

int

reads_mapped_exonic

The number of reads for this entity that are mapped to exons

Type

int

reads_mapped_intronic

The number of reads for this entity that are mapped to introns

Type

int

reads_mapped_utr

The number of reads for this entity that are mapped to 3’ untranslated regions (UTRs)

Type

int

reads_mapped_uniquely

The number of reads mapped to a single unambiguous location in the genome

Type

int

reads_mapped_multiple

The number of reads mapped to multiple genomic positions with equal confidence (todo: verify that “equal confidence” is accurate)

Type

int

duplicate_reads

The number of reads that are duplicates (see README.md for the definition of a duplicate)

Type

int

spliced_reads

The number of reads that overlap splicing junctions

Type

int

antisense_reads

The number of reads that are mapped to the antisense strand instead of the transcribed strand

Type

int

molecule_barcode_fraction_bases_above_30_mean

The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity

Type

float

molecule_barcode_fraction_bases_above_30_variance

The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity

Type

float

genomic_reads_fraction_bases_quality_above_30_mean

The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)

Type

float

genomic_reads_fraction_bases_quality_above_30_variance

The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)

Type

float

genomic_read_quality_mean

Average quality of Illumina base calls in the genomic reads corresponding to this entity

Type

float

genomic_read_quality_variance

Variance in quality of Illumina base calls in the genomic reads corresponding to this entity

Type

float

n_molecules

Number of molecules corresponding to this entity. See README.md for the definition of a Molecule

Type

float

n_fragments

Number of fragments corresponding to this entity. See README.md for the definition of a Fragment

Type

float

reads_per_molecule

The average number of reads associated with each molecule in this entity

Type

float

reads_per_fragment

The average number of reads associated with each fragment in this entity

Type

float

fragments_per_molecule

The average number of fragments associated with each molecule in this entity

Type

float

fragments_with_single_read_evidence

The number of fragments associated with this entity that are observed by only one read

Type

int

molecules_with_single_read_evidence

The number of molecules associated with this entity that are observed by only one read

Type

int

parse_extra_fields(tags, record), NotImplemented

Abstract method that must be implemented by subclasses. Called by parse_molecule() to gather information for subclass-specific metrics

parse_molecule(tags, record)

Extract information from a set of sequencing reads that correspond to a molecule and store the data in the MetricAggregator class.

finalize()[source]

Some metrics cannot be calculated until all the information for an entity has been aggregated, for example, the number of fragments_per_molecule. Finalize calculates all such higher-order metrics

Examples

# todo implement me

See also

CellMetrics

finalize()[source]

Calculate metrics that require information from all molecules of an entity

finalize() replaces attributes in-place that were initialized by the constructor as None with a value calculated across all molecule data that has been aggregated.

parse_extra_fields(tags: Sequence[str], record: AlignedSegment) None[source]

Parses a record to extract cell-specific information

Cell-specific metric data is stored in-place in the MetricAggregator

Parameters
  • tags (Sequence[str]) – The CB, UB and GE tags that define this molecule

  • record (pysam.AlignedSegment) – SAM record to be parsed

parse_molecule(tags: Sequence[str], records: Iterable[AlignedSegment]) None

Parse information from all records of a molecule.

The parsed information is stored in the MetricAggregator in-place.

Parameters
  • tags (Sequence[str]) – all the tags that define this molecule. one of {[CB, GE, UB], [GE, CB, UB]}

  • records (Iterable[pysam.AlignedSegment]) – the sam records associated with the molecule

class sctools.metrics.aggregator.MetricAggregator[source]

Bases: object

Metric Aggregator Base Class

The MetricAggregator class defines a set of metrics that can be extracted from an aligned bam file. It defines all the metrics that are general across genes and cells. This class is subclassed by GeneMetrics and CellMetrics, which define data-specific metrics in the parse_extra_fields method. An instance of GeneMetrics or CellMetrics is instantiated for each gene or cell in a bam file, respectively.

n_reads

The number of reads associated with this entity

Type

int

noise_reads

Number of reads that are categorized by 10x Genomics Cell Ranger as “noise”. Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides

Type

int, NotImplemented

perfect_molecule_barcodes

The number of reads with molecule barcodes that have no errors (molecule barcode tag == raw molecule barcode tag)

Type

int

reads_mapped_exonic

The number of reads for this entity that are mapped to exons

Type

int

reads_mapped_intronic

The number of reads for this entity that are mapped to introns

Type

int

reads_mapped_utr

The number of reads for this entity that are mapped to 3’ untranslated regions (UTRs)

Type

int

reads_mapped_uniquely

The number of reads mapped to a single unambiguous location in the genome

Type

int

reads_mapped_multiple

The number of reads mapped to multiple genomic positions with equal confidence (todo: verify that “equal confidence” is accurate)

Type

int

duplicate_reads

The number of reads that are duplicates (see README.md for the definition of a duplicate)

Type

int

spliced_reads

The number of reads that overlap splicing junctions

Type

int

antisense_reads

The number of reads that are mapped to the antisense strand instead of the transcribed strand

Type

int

molecule_barcode_fraction_bases_above_30_mean

The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity

Type

float

molecule_barcode_fraction_bases_above_30_variance

The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity

Type

float

genomic_reads_fraction_bases_quality_above_30_mean

The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)

Type

float

genomic_reads_fraction_bases_quality_above_30_variance

The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)

Type

float

genomic_read_quality_mean

Average quality of Illumina base calls in the genomic reads corresponding to this entity

Type

float

genomic_read_quality_variance

Variance in quality of Illumina base calls in the genomic reads corresponding to this entity

Type

float

n_molecules

Number of molecules corresponding to this entity. See README.md for the definition of a Molecule

Type

float

n_fragments

Number of fragments corresponding to this entity. See README.md for the definition of a Fragment

Type

float

reads_per_molecule

The average number of reads associated with each molecule in this entity

Type

float

reads_per_fragment

The average number of reads associated with each fragment in this entity

Type

float

fragments_per_molecule

The average number of fragments associated with each molecule in this entity

Type

float

fragments_with_single_read_evidence

The number of fragments associated with this entity that are observed by only one read

Type

int

molecules_with_single_read_evidence

The number of molecules associated with this entity that are observed by only one read

Type

int

parse_extra_fields(tags, record), NotImplemented

Abstract method that must be implemented by subclasses. Called by parse_molecule() to gather information for subclass-specific metrics

parse_molecule(tags, record)[source]

Extract information from a set of sequencing reads that correspond to a molecule and store the data in the MetricAggregator class.

finalize()[source]

Some metrics cannot be calculated until all the information for an entity has been aggregated, for example, the number of fragments_per_molecule. Finalize calculates all such higher-order metrics

finalize() None[source]

Calculate metrics that require information from all molecules of an entity

finalize() replaces attributes in-place that were initialized by the constructor as None with a value calculated across all molecule data that has been aggregated.
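
The pattern described here, placeholders set to None that are filled in once all data has been seen, might look like the following. `TinyAggregator` is a hypothetical class written for this illustration; the real MetricAggregator tracks many more metrics and may compute them differently.

```python
class TinyAggregator:
    """Hypothetical aggregator, far simpler than sctools' MetricAggregator."""

    def __init__(self):
        # Counts that are incremented while molecules are parsed
        self.n_reads = 0
        self.n_molecules = 0
        self.n_fragments = 0
        # Derived metrics are unknown until all data has been aggregated
        self.reads_per_molecule = None
        self.reads_per_fragment = None
        self.fragments_per_molecule = None

    def finalize(self):
        """Replace the None placeholders with ratios of the final counts."""
        self.reads_per_molecule = self.n_reads / self.n_molecules
        self.reads_per_fragment = self.n_reads / self.n_fragments
        self.fragments_per_molecule = self.n_fragments / self.n_molecules

agg = TinyAggregator()
agg.n_reads, agg.n_molecules, agg.n_fragments = 12, 3, 6  # pretend parsing is done
agg.finalize()
```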

parse_extra_fields(tags: Sequence[str], record: AlignedSegment) None[source]

Defined by subclasses to extract class-specific information from molecules

parse_molecule(tags: Sequence[str], records: Iterable[AlignedSegment]) None[source]

Parse information from all records of a molecule.

The parsed information is stored in the MetricAggregator in-place.

Parameters
  • tags (Sequence[str]) – all the tags that define this molecule. one of {[CB, GE, UB], [GE, CB, UB]}

  • records (Iterable[pysam.AlignedSegment]) – the sam records associated with the molecule
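
The division of labor between parse_molecule and parse_extra_fields resembles a template method: the base class handles metrics common to every entity and delegates the rest to a subclass hook. The sketch below shows that structure only; `BaseAggregator` and `CellLikeAggregator` are hypothetical names, strings stand in for pysam records, and the tag order [CB, GE, UB] is one of the two orders listed above.

```python
from typing import Iterable, Sequence

class BaseAggregator:
    """Hypothetical base class mirroring the parse_molecule / parse_extra_fields split."""

    def __init__(self):
        self.n_reads = 0
        self.n_molecules = 0

    def parse_extra_fields(self, tags: Sequence[str], record) -> None:
        raise NotImplementedError  # subclasses supply entity-specific metrics

    def parse_molecule(self, tags: Sequence[str], records: Iterable) -> None:
        for record in records:
            self.n_reads += 1  # general metric, shared by all subclasses
            self.parse_extra_fields(tags, record)  # subclass hook
        self.n_molecules += 1

class CellLikeAggregator(BaseAggregator):
    def __init__(self):
        super().__init__()
        self.genes_seen = set()

    def parse_extra_fields(self, tags, record):
        # tags assumed ordered [CB, GE, UB] in this sketch, so tags[1] is the gene
        self.genes_seen.add(tags[1])

agg = CellLikeAggregator()
agg.parse_molecule(["AAAC", "GeneA", "UMI1"], records=["read1", "read2"])
agg.parse_molecule(["AAAC", "GeneB", "UMI2"], records=["read3"])
```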

sctools.metrics.gatherer module

Sequence Metric Gatherers

This module defines classes to gather metrics across the cells or genes of an experiment and write them to gzip-compressed csv files

Classes

  • MetricGatherer(bam_file, output_stem[, ...]) – Gathers metrics from an experiment

  • GatherCellMetrics(bam_file, output_stem[, ...]) – Gathers cell metrics from an experiment

  • GatherGeneMetrics(bam_file, output_stem[, ...]) – Gathers gene metrics from an experiment

class sctools.metrics.gatherer.GatherCellMetrics(bam_file: str, output_stem: str, mitochondrial_gene_ids: Set[str] = {}, compress: bool = True)[source]

Bases: MetricGatherer

Notes

bam_file must be sorted by gene (GE), molecule (UB), and cell (CB), where gene varies fastest.

Examples

>>> from sctools.metrics.gatherer import GatherCellMetrics
>>> import os, tempfile
>>> # example data, located relative to this file
>>> bam_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'test', 'data', 'test.bam')
>>> temp_dir = tempfile.mkdtemp()
>>> g = GatherCellMetrics(bam_file=bam_file, output_stem=os.path.join(temp_dir, 'test'), compress=True)
>>> g.extract_metrics()

See also

GatherGeneMetrics, sctools.metrics.aggregator, sctools.metrics.merge, sctools.metrics.writer
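
One reading of the sort requirement for cell metrics, "gene varies fastest", is that the cell barcode is the slowest-changing key, followed by the molecule barcode, with the gene changing most often. The sketch below applies that ordering to dictionaries of tag values; the data is made up, and real inputs would be sorted by a BAM-aware tool rather than in Python.

```python
# Hypothetical records represented as dicts of tag values
records = [
    {"CB": "CCGT", "UB": "UMI1", "GE": "GeneA"},
    {"CB": "AAAC", "UB": "UMI2", "GE": "GeneB"},
    {"CB": "AAAC", "UB": "UMI1", "GE": "GeneB"},
    {"CB": "AAAC", "UB": "UMI1", "GE": "GeneA"},
]

# Slowest-varying key first: cell, then molecule, then gene
ordered = sorted(records, key=lambda r: (r["CB"], r["UB"], r["GE"]))
cells_in_order = [r["CB"] for r in ordered]
```

With this ordering, all records of a cell are contiguous, and within a cell all records of a molecule are contiguous.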

property bam_file: str

the bam file that metrics are generated from

extract_metrics(mode: str = 'rb') None[source]

Extract cell metrics from self.bam_file

Parameters

mode (str, optional) – Open mode for self.bam_file. ‘r’ -> sam, ‘rb’ -> bam (default = ‘rb’).

class sctools.metrics.gatherer.GatherGeneMetrics(bam_file: str, output_stem: str, mitochondrial_gene_ids: Set[str] = {}, compress: bool = True)[source]

Bases: MetricGatherer

Notes

bam_file must be sorted by molecule (UB), cell (CB), and gene (GE), where molecule varies fastest.

Examples

>>> from sctools.metrics.gatherer import GatherGeneMetrics
>>> import os, tempfile
>>> # example data, located relative to this file
>>> bam_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'test', 'data', 'test.bam')
>>> temp_dir = tempfile.mkdtemp()
>>> g = GatherGeneMetrics(bam_file=bam_file, output_stem=os.path.join(temp_dir, 'test'), compress=True)
>>> g.extract_metrics()

See also

GatherCellMetrics, sctools.metrics.aggregator, sctools.metrics.merge, sctools.metrics.writer

property bam_file: str

the bam file that metrics are generated from

extract_metrics(mode: str = 'rb') None[source]

Extract gene metrics from self.bam_file

Parameters

mode (str, optional) – Open mode for self.bam_file. ‘r’ -> sam, ‘rb’ -> bam (default = ‘rb’).

class sctools.metrics.gatherer.MetricGatherer(bam_file: str, output_stem: str, mitochondrial_gene_ids: Set[str] = {}, compress: bool = True)[source]

Bases: object

Gathers Metrics from an experiment

Because molecules tend to have relatively small numbers of reads, the memory footprint of this method is typically small (tens of megabytes).

Parameters
  • bam_file (str) – the bam file containing the reads that metrics should be calculated from. Can be a chunk of cells or an entire experiment

  • output_stem (str) – the file stem for the gzipped csv output

extract_metrics()[source]

extracts metrics from bam_file and writes them to output_stem.csv.gz

property bam_file: str

the bam file that metrics are generated from

extract_metrics(mode='rb') None[source]

extract metrics from the provided bam file and write the results to csv.

Parameters

mode ({'r', 'rb'}, default 'rb') – the open mode for pysam.AlignmentFile. ‘r’ indicates the input is a sam file, and ‘rb’ indicates a bam file.
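
When the input type is not known in advance, the mode could be chosen from the file extension before calling extract_metrics. The helper below is hypothetical and not part of sctools; it assumes that files are named with a .bam or .sam extension.

```python
import os

def infer_open_mode(path: str) -> str:
    """Return 'rb' for .bam files and 'r' for .sam files (hypothetical helper)."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".bam":
        return "rb"
    if extension == ".sam":
        return "r"
    raise ValueError(f"cannot infer open mode from extension {extension!r}")

mode = infer_open_mode("experiment.bam")
# The result could then be passed on, e.g. gatherer.extract_metrics(mode=mode)
```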

sctools.metrics.merge module

Merge Sequence Metrics

This module defines classes to merge multiple metrics files that have been gathered from bam files containing disjoint sets of cells. This is a common use pattern, as sequencing datasets are often chunked to enable horizontal scaling using scatter-gather patterns.

Classes

  • MergeMetrics – Merge Metrics base class

  • MergeCellMetrics – Class to merge cell metrics

  • MergeGeneMetrics – Class to merge gene metrics

class sctools.metrics.merge.MergeCellMetrics(metric_files: Sequence[str], output_file: str)[source]

Bases: MergeMetrics

execute() None[source]

Concatenate input cell metric files

Since bam files that metrics are calculated from contain disjoint sets of cells, cell metrics can simply be concatenated together.
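
Concatenation of disjoint cell metric chunks can be sketched with the standard library. The CSV contents, column names, and the `concatenate_csv` helper are assumptions for illustration; the actual class reads gzip-compressed files and may be implemented differently.

```python
import csv
import io

# Two hypothetical cell-metric chunks with identical headers and disjoint cells
chunk_1 = ",n_reads,n_molecules\nAAAC,10,4\nCCGT,7,3\n"
chunk_2 = ",n_reads,n_molecules\nGGTA,5,2\n"

def concatenate_csv(chunks):
    """Concatenate CSV texts, keeping the header of the first chunk only."""
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for i, chunk in enumerate(chunks):
        rows = list(csv.reader(io.StringIO(chunk)))
        # Skip the repeated header line in every chunk after the first
        writer.writerows(rows if i == 0 else rows[1:])
    return output.getvalue()

merged = concatenate_csv([chunk_1, chunk_2])
```

No values need to be recomputed because each cell appears in exactly one chunk.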

class sctools.metrics.merge.MergeGeneMetrics(metric_files: Sequence[str], output_file: str)[source]

Bases: MergeMetrics

execute() None[source]

Merge input gene metric files

The bam files that metrics are calculated from contain disjoint sets of cells, each of which can measure the same genes. As a result, the metric values must be summed (count-based metrics), averaged (fractional, average, or variance metrics), or recalculated (metrics that depend on other metrics).
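
To make the difference between summed and recalculated metrics concrete, here is a minimal sketch that merges two chunks of metrics for the same gene. The dictionaries and the `merge_gene_metrics` helper are hypothetical; only two count metrics and one dependent metric are shown, and averaged metrics are left out.

```python
# Hypothetical per-chunk metrics for the same gene
chunk_a = {"n_reads": 10, "n_molecules": 4, "reads_per_molecule": 2.5}
chunk_b = {"n_reads": 6, "n_molecules": 4, "reads_per_molecule": 1.5}

def merge_gene_metrics(a, b):
    """Sum count metrics, then recalculate the metric that depends on them."""
    merged = {
        "n_reads": a["n_reads"] + b["n_reads"],
        "n_molecules": a["n_molecules"] + b["n_molecules"],
    }
    # Recalculated from the merged counts rather than averaged across chunks
    merged["reads_per_molecule"] = merged["n_reads"] / merged["n_molecules"]
    return merged

merged = merge_gene_metrics(chunk_a, chunk_b)
```

Recalculating from the merged counts keeps the dependent metric consistent with them, which a plain average of the per-chunk ratios would not guarantee when chunk sizes differ.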

class sctools.metrics.merge.MergeMetrics(metric_files: Sequence[str], output_file: str)[source]

Bases: object

Merges multiple metrics files into a single gzip compressed csv file

Parameters
  • metric_files (Sequence[str]) – metrics files to merge

  • output_file (str) – file name for the merged output

execute()[source]

Merge metrics files (todo: this should probably be wrapped into __init__ to make this more like a function)

execute() None[source]

sctools.metrics.writer module

Metric Writers

This module defines a class to write metrics to csv as the data is generated, cell by cell or gene by gene. This strategy keeps memory usage low, as no more than a single molecule’s worth of sam records and one cell or gene’s worth of metric data are in-memory at a time.

Classes

  • MetricCSVWriter – Class to write metrics to file

class sctools.metrics.writer.MetricCSVWriter(output_stem: str, compress=True)[source]

Bases: object

Writes metric information iteratively to (optionally compressed) csv.

Parameters
  • output_stem (str) – File stem for the output file.

  • compress (bool, optional) – Whether or not to compress the output file (default = True).

write_header()[source]

Write the metric header to file.

write()[source]

Write an array of cell or gene metrics to file.

close()[source]

Close the metric file.

close() None[source]

Close the metrics file.

property filename: str

filename with correct suffix added

write(index: str, record: Mapping[str, Number]) None[source]

Write the array of metric values for a cell or gene to file.

Parameters
  • index (str) – The name of the cell or gene that these metrics summarize

  • record (Mapping[str, Number]) – Output of vars() called on an sctools.metrics.aggregator.MetricAggregator instance, producing a dictionary of keys to metric values.

write_header(record: Mapping[str, Any]) None[source]

Write the metric keys to file, producing the header line of the csv file.

Parameters

record (Mapping[str, Any]) – Output of vars() called on an sctools.metrics.aggregator.MetricAggregator instance, producing a dictionary of keys to metric values.
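
The write-as-you-go strategy, combined with vars() as the source of both the header and the rows, can be sketched as follows. `TinyCSVWriter` and `Metrics` are hypothetical classes for illustration, not the sctools implementation, and the leading empty header cell for the index column is an assumed layout.

```python
import csv
import gzip
import os
import tempfile

class TinyCSVWriter:
    """Hypothetical row-at-a-time writer; not the sctools MetricCSVWriter."""

    def __init__(self, output_stem, compress=True):
        self.filename = output_stem + (".csv.gz" if compress else ".csv")
        opener = gzip.open if compress else open
        self._handle = opener(self.filename, "wt", newline="")
        self._writer = csv.writer(self._handle)
        self._keys = None

    def write_header(self, record):
        self._keys = list(record)
        self._writer.writerow([""] + self._keys)  # assumed: empty cell above the index column

    def write(self, index, record):
        self._writer.writerow([index] + [record[k] for k in self._keys])

    def close(self):
        self._handle.close()

class Metrics:
    """Hypothetical stand-in for a finalized aggregator instance."""

    def __init__(self, n_reads, n_molecules):
        self.n_reads = n_reads
        self.n_molecules = n_molecules

stem = os.path.join(tempfile.mkdtemp(), "cells")
writer = TinyCSVWriter(stem)
first = Metrics(10, 4)
writer.write_header(vars(first))  # vars() gives a dict of attribute names to values
writer.write("AAAC", vars(first))
writer.write("CCGT", vars(Metrics(7, 3)))
writer.close()

# Read the compressed file back to check its contents
with gzip.open(writer.filename, "rt", newline="") as f:
    rows = list(csv.reader(f))
```

Each row is written as soon as its entity is finalized, so only one entity's metrics need to be held in memory at a time.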