sctools.metrics package¶
Submodules¶
sctools.metrics.aggregator module¶
Sequence Metric Aggregators¶
This module provides classes useful for aggregating metric information for individual cells or genes. These classes consume BAM files that have been pre-sorted such that all sequencing reads that correspond to the molecules of a cell (CellMetrics) or the molecules of a gene (GeneMetrics) are yielded sequentially.
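The pre-sorting requirement means that all reads of one cell (or gene) arrive as a single consecutive run, so they can be consumed in one pass. The sketch below illustrates that idea with plain tuples; the record layout and the `iter_cells` helper are hypothetical stand-ins, not part of sctools, and real input would be `pysam.AlignedSegment` objects from a tag-sorted BAM file.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-ins for BAM records: (cell barcode, molecule barcode, gene, read name).
records = [
    ("AAAC", "TTGA", "GENE1", "read1"),
    ("AAAC", "TTGA", "GENE1", "read2"),
    ("AAAC", "CCGT", "GENE2", "read3"),
    ("GGTA", "ACGT", "GENE1", "read4"),
]


def iter_cells(sorted_records):
    """Yield (cell_barcode, reads) for each run of consecutive records sharing a barcode."""
    for cell_barcode, group in groupby(sorted_records, key=itemgetter(0)):
        yield cell_barcode, list(group)


# Because the input is sorted, each cell is seen exactly once.
reads_per_cell = {cell: len(reads) for cell, reads in iter_cells(records)}
print(reads_per_cell)
```

If the input were not sorted, `groupby` would yield the same barcode more than once, which is why the sort order matters.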
Classes¶
Notes
This module can be rewritten with dataclasses once Python 3.7 stabilizes; see https://www.python.org/dev/peps/pep-0557/
- class sctools.metrics.aggregator.CellMetrics[source]¶
Bases:
sctools.metrics.aggregator.MetricAggregator
Cell Metric Aggregator
Aggregator that captures metric information about a cell by parsing all of the molecules in an experiment that were annotated with a specific cell barcode, as recorded in the CB tag.
- perfect_cell_barcodes¶
The number of reads whose cell barcodes contain no errors (tag CB == CR)
- Type
int
- reads_mapped_intergenic¶
The number of reads mapped to an intergenic region for this cell
- Type
int
- reads_mapped_too_many_loci¶
The number of reads that were mapped to too many loci across the genome and as a consequence, are reported unmapped by the aligner
- Type
int
- cell_barcode_fraction_bases_above_30_variance¶
The variance of the fraction of Illumina base calls for the cell barcode sequence that are greater than 30, across molecules
- Type
float
- cell_barcode_fraction_bases_above_30_mean¶
The average fraction of Illumina base calls for the cell barcode sequence that are greater than 30, across molecules
- Type
float
- n_genes¶
The number of genes detected by this cell
- Type
int
- genes_detected_multiple_observations¶
The number of genes that are observed by more than one read in this cell
- Type
int
- n_mitochondrial_genes¶
The number of mitochondrial genes detected by this cell
- Type
int
- n_mitochondrial_molecules¶
The number of molecules from mitochondrial genes detected for this cell
- Type
int
- pct_mitochondrial_molecules¶
The percentage of molecules from mitochondrial genes detected for this cell
- Type
int
Metric Aggregator Base Class
The MetricAggregator class defines a set of metrics that can be extracted from an aligned bam file. It defines all the metrics that are general across genes and cells. This class is subclassed by GeneMetrics and CellMetrics, which define data-specific metrics in the parse_extra_fields method. An instance of GeneMetrics or CellMetrics is instantiated for each gene or cell in a bam file, respectively.
- n_reads¶
The number of reads associated with this entity
- Type
int
- noise_reads¶
Number of reads that are categorized by 10x genomics cellranger as “noise”. Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides
- Type
int, NotImplemented
- perfect_molecule_barcodes¶
The number of reads with molecule barcodes that have no errors (molecule barcode tag == raw molecule barcode tag, i.e. UB == UR)
- Type
int
- reads_mapped_exonic¶
The number of reads for this entity that are mapped to exons
- Type
int
- reads_mapped_intronic¶
The number of reads for this entity that are mapped to introns
- Type
int
- reads_mapped_utr¶
The number of reads for this entity that are mapped to 3’ untranslated regions (UTRs)
- Type
int
- reads_mapped_uniquely¶
The number of reads mapped to a single unambiguous location in the genome
- Type
int
- reads_mapped_multiple¶
The number of reads mapped to multiple genomic positions with equal confidence
- Type
int
- duplicate_reads¶
The number of reads that are duplicates (see README.md for the definition of a duplicate)
- Type
int
- spliced_reads¶
The number of reads that overlap splicing junctions
- Type
int
- antisense_reads¶
The number of reads that are mapped to the antisense strand instead of the transcribed strand
- Type
int
- molecule_barcode_fraction_bases_above_30_mean¶
The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity
- Type
float
- molecule_barcode_fraction_bases_above_30_variance¶
The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity
- Type
float
- genomic_reads_fraction_bases_quality_above_30_mean¶
The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)
- Type
float
- genomic_reads_fraction_bases_quality_above_30_variance¶
The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)
- Type
float
- genomic_read_quality_mean¶
Average quality of Illumina base calls in the genomic reads corresponding to this entity
- Type
float
- genomic_read_quality_variance¶
Variance in quality of Illumina base calls in the genomic reads corresponding to this entity
- Type
float
- n_molecules¶
Number of molecules corresponding to this entity. See README.md for the definition of a Molecule
- Type
float
- n_fragments¶
Number of fragments corresponding to this entity. See README.md for the definition of a Fragment
- Type
float
- reads_per_molecule¶
The average number of reads associated with each molecule in this entity
- Type
float
- reads_per_fragment¶
The average number of reads associated with each fragment in this entity
- Type
float
- fragments_per_molecule¶
The average number of fragments associated with each molecule in this entity
- Type
float
- fragments_with_single_read_evidence¶
The number of fragments associated with this entity that are observed by only one read
- Type
int
- molecules_with_single_read_evidence¶
The number of molecules associated with this entity that are observed by only one read
- Type
int
- parse_extra_fields(tags, record), NotImplemented
Abstract method that must be implemented by subclasses. Called by parse_molecule() to gather information for subclass-specific metrics
- parse_molecule(tags, record)¶
Extract information from a set of sequencing reads that correspond to a molecule and store the data in the MetricAggregator class.
- finalize()[source]¶
Some metrics cannot be calculated until all the information for an entity has been aggregated, for example, the number of fragments_per_molecule. Finalize calculates all such higher-order metrics
See also
GeneMetrics
- extra_docs = '\n Examples\n --------\n # todo implement me\n\n See Also\n --------\n GeneMetrics\n\n '¶
- finalize(mitochondrial_genes={})[source]¶
Calculate metrics that require information from all molecules of an entity
finalize() replaces attributes in-place that were initialized by the constructor as None with a value calculated across all molecule data that has been aggregated.
- parse_extra_fields(tags: Sequence[str], record: pysam.libcalignedsegment.AlignedSegment) None [source]¶
Parses a record to extract gene-specific information
Gene-specific metric data is stored in-place in the MetricAggregator
- Parameters
tags (Sequence[str]) – The GE, UB and CB tags that define this molecule
record (pysam.AlignedSegment) – SAM record to be parsed
- parse_molecule(tags: Sequence[str], records: Iterable[pysam.libcalignedsegment.AlignedSegment]) None ¶
Parse information from all records of a molecule.
The parsed information is stored in the MetricAggregator in-place.
- Parameters
tags (Sequence[str]) – all the tags that define this molecule. one of {[CB, GE, UB], [GE, CB, UB]}
records (Iterable[pysam.AlignedSegment]) – the sam records associated with the molecule
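Several of the cell-level attributes above can only be derived once every molecule of the cell has been seen. The following is a minimal sketch of that finalization step, assuming one gene ID is recorded per molecule; the `summarize_cell` function, its inputs, and the choice of a 0 to 100 percentage scale are illustrative assumptions rather than the library's implementation.

```python
def summarize_cell(molecule_genes, n_reads, mitochondrial_genes=frozenset()):
    """Derive cell-level metrics from a list holding one gene ID per molecule.

    Hypothetical helper: the real aggregator accumulates these counts while
    parsing BAM records and fills in derived values in finalize().
    """
    n_molecules = len(molecule_genes)
    detected_genes = set(molecule_genes)
    n_mito_molecules = sum(1 for gene in molecule_genes if gene in mitochondrial_genes)
    return {
        "n_genes": len(detected_genes),
        "n_mitochondrial_genes": len(detected_genes & set(mitochondrial_genes)),
        "n_mitochondrial_molecules": n_mito_molecules,
        # Assumption: expressed on a 0-100 scale; empty cells report 0.0.
        "pct_mitochondrial_molecules": (
            100.0 * n_mito_molecules / n_molecules if n_molecules else 0.0
        ),
        "reads_per_molecule": n_reads / n_molecules if n_molecules else float("nan"),
    }


summary = summarize_cell(
    ["MT-CO1", "MT-CO1", "ACTB", "GAPDH"],
    n_reads=10,
    mitochondrial_genes={"MT-CO1", "MT-ND1"},
)
print(summary)
```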
- class sctools.metrics.aggregator.GeneMetrics[source]¶
Bases:
sctools.metrics.aggregator.MetricAggregator
Gene Metric Aggregator
Aggregator that captures metric information about a gene by parsing all of the molecules in an experiment that were annotated with a specific gene ID, as recorded in the GE tag.
- number_cells_detected_multiple¶
The number of cells which observe more than one read of this gene
- Type
int
- number_cells_expressing¶
The number of cells that detect this gene
- Type
int
Metric Aggregator Base Class
The MetricAggregator class defines a set of metrics that can be extracted from an aligned bam file. It defines all the metrics that are general across genes and cells. This class is subclassed by GeneMetrics and CellMetrics, which define data-specific metrics in the parse_extra_fields method. An instance of GeneMetrics or CellMetrics is instantiated for each gene or cell in a bam file, respectively.
- n_reads¶
The number of reads associated with this entity
- Type
int
- noise_reads¶
Number of reads that are categorized by 10x genomics cellranger as “noise”. Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides
- Type
int, NotImplemented
- perfect_molecule_barcodes¶
The number of reads with molecule barcodes that have no errors (molecule barcode tag == raw molecule barcode tag, i.e. UB == UR)
- Type
int
- reads_mapped_exonic¶
The number of reads for this entity that are mapped to exons
- Type
int
- reads_mapped_intronic¶
The number of reads for this entity that are mapped to introns
- Type
int
- reads_mapped_utr¶
The number of reads for this entity that are mapped to 3’ untranslated regions (UTRs)
- Type
int
- reads_mapped_uniquely¶
The number of reads mapped to a single unambiguous location in the genome
- Type
int
- reads_mapped_multiple¶
The number of reads mapped to multiple genomic positions with equal confidence
- Type
int
- duplicate_reads¶
The number of reads that are duplicates (see README.md for the definition of a duplicate)
- Type
int
- spliced_reads¶
The number of reads that overlap splicing junctions
- Type
int
- antisense_reads¶
The number of reads that are mapped to the antisense strand instead of the transcribed strand
- Type
int
- molecule_barcode_fraction_bases_above_30_mean¶
The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity
- Type
float
- molecule_barcode_fraction_bases_above_30_variance¶
The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity
- Type
float
- genomic_reads_fraction_bases_quality_above_30_mean¶
The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)
- Type
float
- genomic_reads_fraction_bases_quality_above_30_variance¶
The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)
- Type
float
- genomic_read_quality_mean¶
Average quality of Illumina base calls in the genomic reads corresponding to this entity
- Type
float
- genomic_read_quality_variance¶
Variance in quality of Illumina base calls in the genomic reads corresponding to this entity
- Type
float
- n_molecules¶
Number of molecules corresponding to this entity. See README.md for the definition of a Molecule
- Type
float
- n_fragments¶
Number of fragments corresponding to this entity. See README.md for the definition of a Fragment
- Type
float
- reads_per_molecule¶
The average number of reads associated with each molecule in this entity
- Type
float
- reads_per_fragment¶
The average number of reads associated with each fragment in this entity
- Type
float
- fragments_per_molecule¶
The average number of fragments associated with each molecule in this entity
- Type
float
- fragments_with_single_read_evidence¶
The number of fragments associated with this entity that are observed by only one read
- Type
int
- molecules_with_single_read_evidence¶
The number of molecules associated with this entity that are observed by only one read
- Type
int
- parse_extra_fields(tags, record), NotImplemented
Abstract method that must be implemented by subclasses. Called by parse_molecule() to gather information for subclass-specific metrics
- parse_molecule(tags, record)¶
Extract information from a set of sequencing reads that correspond to a molecule and store the data in the MetricAggregator class.
- finalize()[source]¶
Some metrics cannot be calculated until all the information for an entity has been aggregated, for example, the number of fragments_per_molecule. Finalize calculates all such higher-order metrics
See also
CellMetrics
- extra_docs = '\n Examples\n --------\n # todo implement me\n\n See Also\n --------\n CellMetrics\n\n '¶
- finalize()[source]¶
Calculate metrics that require information from all molecules of an entity
finalize() replaces attributes in-place that were initialized by the constructor as None with a value calculated across all molecule data that has been aggregated.
- parse_extra_fields(tags: Sequence[str], record: pysam.libcalignedsegment.AlignedSegment) None [source]¶
Parses a record to extract cell-specific information
Cell-specific metric data is stored in-place in the MetricAggregator
- Parameters
tags (Sequence[str]) – The CB, UB and GE tags that define this molecule
record (pysam.AlignedSegment) – SAM record to be parsed
- parse_molecule(tags: Sequence[str], records: Iterable[pysam.libcalignedsegment.AlignedSegment]) None ¶
Parse information from all records of a molecule.
The parsed information is stored in the MetricAggregator in-place.
- Parameters
tags (Sequence[str]) – all the tags that define this molecule. one of {[CB, GE, UB], [GE, CB, UB]}
records (Iterable[pysam.AlignedSegment]) – the sam records associated with the molecule
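The two gene-specific attributes, number_cells_expressing and number_cells_detected_multiple, both follow from a per-cell read count for the gene. A small sketch of that bookkeeping is shown here; the input list of cell barcodes is made up for illustration, and the use of `collections.Counter` is an assumption about one convenient way to do it, not a description of the sctools internals.

```python
from collections import Counter

# Hypothetical input: the cell barcode (CB) of every read assigned to one gene.
cell_barcodes_of_reads = ["AAAC", "AAAC", "GGTA", "CCTT", "CCTT", "CCTT"]

reads_per_cell = Counter(cell_barcodes_of_reads)

# Cells with at least one read of this gene.
number_cells_expressing = len(reads_per_cell)
# Cells that observe more than one read of this gene.
number_cells_detected_multiple = sum(1 for n in reads_per_cell.values() if n > 1)

print(number_cells_expressing, number_cells_detected_multiple)
```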
- class sctools.metrics.aggregator.MetricAggregator[source]¶
Bases:
object
Metric Aggregator Base Class
The MetricAggregator class defines a set of metrics that can be extracted from an aligned bam file. It defines all the metrics that are general across genes and cells. This class is subclassed by GeneMetrics and CellMetrics, which define data-specific metrics in the parse_extra_fields method. An instance of GeneMetrics or CellMetrics is instantiated for each gene or cell in a bam file, respectively.
- n_reads¶
The number of reads associated with this entity
- Type
int
- noise_reads¶
Number of reads that are categorized by 10x genomics cellranger as “noise”. Refers to long polymers, or reads with high numbers of N (ambiguous) nucleotides
- Type
int, NotImplemented
- perfect_molecule_barcodes¶
The number of reads with molecule barcodes that have no errors (molecule barcode tag == raw molecule barcode tag, i.e. UB == UR)
- Type
int
- reads_mapped_exonic¶
The number of reads for this entity that are mapped to exons
- Type
int
- reads_mapped_intronic¶
The number of reads for this entity that are mapped to introns
- Type
int
- reads_mapped_utr¶
The number of reads for this entity that are mapped to 3’ untranslated regions (UTRs)
- Type
int
- reads_mapped_uniquely¶
The number of reads mapped to a single unambiguous location in the genome
- Type
int
- reads_mapped_multiple¶
The number of reads mapped to multiple genomic positions with equal confidence
- Type
int
- duplicate_reads¶
The number of reads that are duplicates (see README.md for the definition of a duplicate)
- Type
int
- spliced_reads¶
The number of reads that overlap splicing junctions
- Type
int
- antisense_reads¶
The number of reads that are mapped to the antisense strand instead of the transcribed strand
- Type
int
- molecule_barcode_fraction_bases_above_30_mean¶
The average fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity
- Type
float
- molecule_barcode_fraction_bases_above_30_variance¶
The variance in the fraction of bases in molecule barcodes that receive quality scores greater than 30 across the reads of this entity
- Type
float
- genomic_reads_fraction_bases_quality_above_30_mean¶
The average fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)
- Type
float
- genomic_reads_fraction_bases_quality_above_30_variance¶
The variance in the fraction of bases in the genomic read that receive quality scores greater than 30 across the reads of this entity (included for 10x cell ranger count comparison)
- Type
float
- genomic_read_quality_mean¶
Average quality of Illumina base calls in the genomic reads corresponding to this entity
- Type
float
- genomic_read_quality_variance¶
Variance in quality of Illumina base calls in the genomic reads corresponding to this entity
- Type
float
- n_molecules¶
Number of molecules corresponding to this entity. See README.md for the definition of a Molecule
- Type
float
- n_fragments¶
Number of fragments corresponding to this entity. See README.md for the definition of a Fragment
- Type
float
- reads_per_molecule¶
The average number of reads associated with each molecule in this entity
- Type
float
- reads_per_fragment¶
The average number of reads associated with each fragment in this entity
- Type
float
- fragments_per_molecule¶
The average number of fragments associated with each molecule in this entity
- Type
float
- fragments_with_single_read_evidence¶
The number of fragments associated with this entity that are observed by only one read
- Type
int
- molecules_with_single_read_evidence¶
The number of molecules associated with this entity that are observed by only one read
- Type
int
- parse_extra_fields(tags, record), NotImplemented
Abstract method that must be implemented by subclasses. Called by parse_molecule() to gather information for subclass-specific metrics
- parse_molecule(tags, record)[source]¶
Extract information from a set of sequencing reads that correspond to a molecule and store the data in the MetricAggregator class.
- finalize()[source]¶
Some metrics cannot be calculated until all the information for an entity has been aggregated, for example, the number of fragments_per_molecule. Finalize calculates all such higher-order metrics
- finalize() None [source]¶
Calculate metrics that require information from all molecules of an entity
finalize() replaces attributes in-place that were initialized by the constructor as None with a value calculated across all molecule data that has been aggregated.
- parse_extra_fields(tags: Sequence[str], record: pysam.libcalignedsegment.AlignedSegment) None [source]¶
Defined by subclasses to extract class-specific information from molecules
- parse_molecule(tags: Sequence[str], records: Iterable[pysam.libcalignedsegment.AlignedSegment]) None [source]¶
Parse information from all records of a molecule.
The parsed information is stored in the MetricAggregator in-place.
- Parameters
tags (Sequence[str]) – all the tags that define this molecule. one of {[CB, GE, UB], [GE, CB, UB]}
records (Iterable[pysam.AlignedSegment]) – the sam records associated with the molecule
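The paired `_mean` and `_variance` attributes above summarize a per-read quantity, such as the fraction of bases with quality above 30, without keeping every read in memory. One way to do that is Welford's online algorithm, sketched below. The `OnlineMeanVariance` class and `fraction_above` helper are hypothetical names, and the use of the sample variance (dividing by n - 1) is an assumption; the library may compute these statistics differently.

```python
class OnlineMeanVariance:
    """Running mean and sample variance using Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Assumption: sample variance; undefined for fewer than two values.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


def fraction_above(qualities, threshold=30):
    """Fraction of base qualities strictly greater than the threshold."""
    return sum(q > threshold for q in qualities) / len(qualities)


stats = OnlineMeanVariance()
# Hypothetical per-read base quality scores.
for read_qualities in ([35, 36, 20, 40], [10, 31, 32, 33], [40, 40, 40, 40]):
    stats.update(fraction_above(read_qualities))

print(stats.mean, stats.variance)
```

Each update touches only three numbers, so the memory cost is constant regardless of how many reads an entity has.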
sctools.metrics.gatherer module¶
Sequence Metric Gatherers¶
This module defines classes to gather metrics across the cells or genes of an experiment and write them to gzip-compressed csv files
Classes¶
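MetricGatherer(bam_file, output_stem, …): Gathers Metrics from an experiment
GatherCellMetrics(bam_file, output_stem, …): Sequence Metric Gatherers
GatherGeneMetrics(bam_file, output_stem, …): Sequence Metric Gatherers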
- class sctools.metrics.gatherer.GatherCellMetrics(bam_file: str, output_stem: str, mitochondrial_gene_ids: Set[str] = {}, compress: bool = True)[source]¶
Bases:
sctools.metrics.gatherer.MetricGatherer
Notes
bam_file must be sorted by gene (GE), molecule (UB), and cell (CB), where gene varies fastest.
Examples
>>> from sctools.metrics.gatherer import GatherCellMetrics
>>> import os, tempfile
>>> # example data
>>> bam_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'test', 'data', 'test.bam')
>>> temp_dir = tempfile.mkdtemp()
>>> g = GatherCellMetrics(bam_file=bam_file, output_stem=os.path.join(temp_dir, 'test'), compress=True)
>>> g.extract_metrics()
See also
GatherGeneMetrics, sctools.metrics.aggregator, sctools.metrics.merge, sctools.metrics.writer
- property bam_file: str¶
the bam file that metrics are generated from
- extra_docs = "\n Notes\n -----\n ``bam_file`` must be sorted by gene (``GE``), molecule (``UB``), and cell (``CB``), where gene\n varies fastest.\n\n Examples\n --------\n >>> from sctools.metrics.gatherer import GatherCellMetrics\n >>> import os, tempfile\n\n >>> # example data\n >>> bam_file = os.path.abspath(__file__) + '../test/data/test.bam'\n >>> temp_dir = tempfile.mkdtemp()\n >>> g = GatherCellMetrics(bam_file=bam_file, output_stem=temp_dir + 'test', compress=True)\n >>> g.extract_metrics()\n\n See Also\n --------\n GatherGeneMetrics\n\n "¶
- class sctools.metrics.gatherer.GatherGeneMetrics(bam_file: str, output_stem: str, mitochondrial_gene_ids: Set[str] = {}, compress: bool = True)[source]¶
Bases:
sctools.metrics.gatherer.MetricGatherer
Notes
bam_file must be sorted by molecule (UB), cell (CB), and gene (GE), where molecule varies fastest.
Examples
>>> from sctools.metrics.gatherer import GatherGeneMetrics
>>> import os, tempfile
>>> # example data
>>> bam_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'test', 'data', 'test.bam')
>>> temp_dir = tempfile.mkdtemp()
>>> g = GatherGeneMetrics(bam_file=bam_file, output_stem=os.path.join(temp_dir, 'test'), compress=True)
>>> g.extract_metrics()
See also
GatherCellMetrics, sctools.metrics.aggregator, sctools.metrics.merge, sctools.metrics.writer
- property bam_file: str¶
the bam file that metrics are generated from
- extra_docs = "\n Notes\n -----\n ``bam_file`` must be sorted by molecule (``UB``), cell (``CB``), and gene (``GE``), where\n molecule varies fastest.\n\n Examples\n --------\n >>> from sctools.metrics.gatherer import GatherCellMetrics\n >>> import os, tempfile\n\n >>> # example data\n >>> bam_file = os.path.abspath(__file__) + '../test/data/test.bam'\n >>> temp_dir = tempfile.mkdtemp()\n >>> g = GatherCellMetrics(bam_file=bam_file, output_stem=temp_dir + 'test', compress=True)\n >>> g.extract_metrics()\n\n See Also\n --------\n GatherGeneMetrics\n\n "¶
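The two gatherers state different sort orders for their input. The sketch below expresses both as Python sort keys over made-up tag dictionaries, reading "varies fastest" as "least significant sort key". That reading, the record layout, and the two key functions are assumptions for illustration; real BAM files would be sorted by external tooling before being passed to a gatherer.

```python
# Hypothetical records holding only the CB, UB and GE tags of each read.
records = [
    {"CB": "GGTA", "UB": "ACGT", "GE": "GENE1"},
    {"CB": "AAAC", "UB": "TTGA", "GE": "GENE2"},
    {"CB": "AAAC", "UB": "CCGT", "GE": "GENE1"},
    {"CB": "AAAC", "UB": "TTGA", "GE": "GENE1"},
]


def cell_metrics_order(record):
    """Cell is the outermost key and gene varies fastest."""
    return (record["CB"], record["UB"], record["GE"])


def gene_metrics_order(record):
    """Gene is the outermost key and molecule varies fastest."""
    return (record["GE"], record["CB"], record["UB"])


for_cells = [cell_metrics_order(r) for r in sorted(records, key=cell_metrics_order)]
for_genes = [gene_metrics_order(r) for r in sorted(records, key=gene_metrics_order)]

print(for_cells)
print(for_genes)
```

In both orders, all reads of the outermost entity are contiguous, which is what lets a gatherer process one cell or gene at a time.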
- class sctools.metrics.gatherer.MetricGatherer(bam_file: str, output_stem: str, mitochondrial_gene_ids: Set[str] = {}, compress: bool = True)[source]¶
Bases:
object
Gathers Metrics from an experiment
Because molecules tend to have relatively small numbers of reads, the memory footprint of this method is typically small (tens of megabytes).
- Parameters
bam_file (str) – the bam file containing the reads that metrics should be calculated from. Can be a chunk of cells or an entire experiment
output_stem (str) – the file stem for the gzipped csv output
- property bam_file: str¶
the bam file that metrics are generated from
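The small memory footprint described above follows from processing one molecule at a time and one entity at a time. The sketch below shows that control flow with nested `itertools.groupby` calls and a toy aggregator. `ToyCellAggregator` and `gather` are hypothetical names that only mirror the parse_molecule() and finalize() steps documented for the aggregators; they are not the sctools classes.

```python
from itertools import groupby


class ToyCellAggregator:
    """Hypothetical stand-in for an aggregator; tracks only two counts."""

    def __init__(self):
        self.n_reads = 0
        self.n_molecules = 0
        self.reads_per_molecule = None  # filled in by finalize()

    def parse_molecule(self, tags, records):
        self.n_molecules += 1
        self.n_reads += len(list(records))

    def finalize(self):
        self.reads_per_molecule = self.n_reads / self.n_molecules


def gather(sorted_records):
    """Yield (cell, metrics) from (CB, UB, GE) tuples sorted by that key."""
    for cell, cell_records in groupby(sorted_records, key=lambda r: r[0]):
        aggregator = ToyCellAggregator()
        # Identical tag triples form one molecule; only that run is held in memory.
        for tags, molecule_records in groupby(cell_records, key=lambda r: r):
            aggregator.parse_molecule(tags, molecule_records)
        aggregator.finalize()
        yield cell, vars(aggregator)


records = sorted([
    ("AAAC", "TTGA", "GENE1"),
    ("AAAC", "TTGA", "GENE1"),
    ("AAAC", "CCGT", "GENE2"),
    ("GGTA", "ACGT", "GENE1"),
])
rows = dict(gather(records))
print(rows)
```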
sctools.metrics.merge module¶
Merge Sequence Metrics¶
This module defines classes to merge multiple metrics files that have been gathered from bam files containing disjoint sets of cells. This is a common use pattern, as sequencing datasets are often chunked to enable horizontal scaling using scatter-gather patterns.
Classes¶
MergeMetrics: Merge Metrics base class
MergeCellMetrics: Class to merge cell metrics
MergeGeneMetrics: Class to merge gene metrics
- class sctools.metrics.merge.MergeCellMetrics(metric_files: Sequence[str], output_file: str)[source]¶
- class sctools.metrics.merge.MergeGeneMetrics(metric_files: Sequence[str], output_file: str)[source]¶
Bases:
sctools.metrics.merge.MergeMetrics
- execute() None [source]¶
Merge input gene metric files
The bam files that metrics are calculated from contain disjoint sets of cells, each of which can measure the same genes. As a result, the metric values must be summed (count-based metrics), averaged (fractional, average, or variance metrics), or recalculated (metrics that depend on other metrics).
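The three merge strategies named above can be illustrated for a single gene observed in two chunks. In this sketch, counts are summed, a mean-type metric is combined as a read-weighted average, and a ratio is recalculated from the merged counts. The `merge_gene_rows` function and the choice of `n_reads` as the weight are assumptions made for illustration; the library may weight or combine these values differently.

```python
def merge_gene_rows(rows):
    """Merge per-chunk metric dictionaries that describe the same gene."""
    # Count-based metrics: sum across chunks.
    n_reads = sum(row["n_reads"] for row in rows)
    n_molecules = sum(row["n_molecules"] for row in rows)
    # Average-type metric: weight each chunk by its number of reads (assumption).
    quality_mean = (
        sum(row["genomic_read_quality_mean"] * row["n_reads"] for row in rows) / n_reads
    )
    return {
        "n_reads": n_reads,
        "n_molecules": n_molecules,
        "genomic_read_quality_mean": quality_mean,
        # Dependent metric: recalculated from the merged counts.
        "reads_per_molecule": n_reads / n_molecules,
    }


chunk_a = {"n_reads": 10, "n_molecules": 5, "genomic_read_quality_mean": 30.0}
chunk_b = {"n_reads": 30, "n_molecules": 10, "genomic_read_quality_mean": 34.0}
merged = merge_gene_rows([chunk_a, chunk_b])
print(merged)
```

Note that averaging the two per-chunk reads_per_molecule values (2.0 and 3.0) would give 2.5, while recalculating from the merged counts gives 40 / 15, which is why dependent metrics are recomputed rather than averaged.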
- class sctools.metrics.merge.MergeMetrics(metric_files: Sequence[str], output_file: str)[source]¶
Bases:
object
Merges multiple metrics files into a single gzip compressed csv file
- Parameters
metric_files (Sequence[str]) – metrics files to merge
output_file (str) – file name for the merged output
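When the input files describe disjoint sets of cells, as in the scatter-gather pattern mentioned earlier, one plausible way to merge cell-level files is to concatenate their rows under a single header. The sketch below does this with the standard library only. The file names, column layout, and the `concatenate_metrics` helper are hypothetical, and this is not a claim about how MergeCellMetrics is implemented.

```python
import csv
import gzip
import os
import tempfile


def write_metrics(path, rows):
    """Write rows (header first) to a gzip-compressed csv file."""
    with gzip.open(path, "wt", newline="") as handle:
        csv.writer(handle).writerows(rows)


def concatenate_metrics(metric_files, output_file):
    """Concatenate gzip-compressed csv files, keeping only the first header."""
    with gzip.open(output_file, "wt", newline="") as out:
        writer = csv.writer(out)
        for i, path in enumerate(metric_files):
            with gzip.open(path, "rt", newline="") as handle:
                reader = csv.reader(handle)
                header = next(reader)
                if i == 0:
                    writer.writerow(header)
                writer.writerows(reader)


workdir = tempfile.mkdtemp()
chunk_1 = os.path.join(workdir, "chunk_1.csv.gz")
chunk_2 = os.path.join(workdir, "chunk_2.csv.gz")
write_metrics(chunk_1, [["", "n_reads"], ["AAAC", "3"]])
write_metrics(chunk_2, [["", "n_reads"], ["GGTA", "1"]])

merged_path = os.path.join(workdir, "merged.csv.gz")
concatenate_metrics([chunk_1, chunk_2], merged_path)

with gzip.open(merged_path, "rt", newline="") as handle:
    merged_rows = list(csv.reader(handle))
print(merged_rows)
```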
sctools.metrics.writer module¶
Metric Writers¶
This module defines a class to write metrics to csv as the data is generated, cell by cell or gene by gene. This strategy keeps memory usage low, as no more than a single molecule’s worth of sam records and one cell or gene’s worth of metric data are in-memory at a time.
Classes¶
MetricCSVWriter Class to write metrics to file
- class sctools.metrics.writer.MetricCSVWriter(output_stem: str, compress=True)[source]¶
Bases:
object
Writes metric information iteratively to (optionally compressed) csv.
- Parameters
output_stem (str) – File stem for the output file.
compress (bool, optional) – Whether or not to compress the output file (default = True).
- property filename: str¶
filename with correct suffix added
- write(index: str, record: Mapping[str, numbers.Number]) None [source]¶
Write the array of metric values for a cell or gene to file.
- Parameters
index (str) – The name of the cell or gene that these metrics summarize
record (Mapping[str, Number]) – Output of vars() called on an sctools.metrics.aggregator.MetricAggregator instance, producing a dictionary of keys to metric values.
- write_header(record: Mapping[str, Any]) None [source]¶
Write the metric keys to file, producing the header line of the csv file.
- Parameters
record (Mapping[str, Any]) – Output of vars() called on an sctools.metrics.aggregator.MetricAggregator instance, producing a dictionary of keys to metric values.
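The write_header() and write() methods both take the output of `vars()` on an aggregator, so the header and each row come from the same attribute dictionary. The sketch below mimics that flow with the standard library. `ToyMetricWriter` and `ToyAggregator` are hypothetical stand-ins, and the `.csv.gz` suffix and the empty header cell for the index column are assumptions, not documented behavior of MetricCSVWriter.

```python
import csv
import gzip
import os
import tempfile


class ToyMetricWriter:
    """Hypothetical stand-in that mimics the write_header()/write() flow."""

    def __init__(self, output_stem, compress=True):
        # Assumption: the suffix is derived from the compress flag.
        self.filename = output_stem + (".csv.gz" if compress else ".csv")
        opener = gzip.open if compress else open
        self._handle = opener(self.filename, "wt", newline="")
        self._writer = csv.writer(self._handle)

    def write_header(self, record):
        # The leading empty cell labels the index column.
        self._writer.writerow([""] + list(record.keys()))

    def write(self, index, record):
        self._writer.writerow([index] + list(record.values()))

    def close(self):
        self._handle.close()


class ToyAggregator:
    """Hypothetical aggregator with two already-computed metrics."""

    def __init__(self):
        self.n_reads = 3
        self.n_molecules = 2


writer = ToyMetricWriter(os.path.join(tempfile.mkdtemp(), "cells"), compress=True)
aggregator = ToyAggregator()
writer.write_header(vars(aggregator))  # header comes from the attribute names
writer.write("AAAC", vars(aggregator))  # one row per cell or gene
writer.close()

with gzip.open(writer.filename, "rt", newline="") as handle:
    rows = list(csv.reader(handle))
print(rows)
```

Writing each row as soon as an entity is finalized means the full table never has to be held in memory.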