Filter Low-Abundance Transcripts in a TSENATAnalysis Object
Source: R/s4_functions.R
S4 wrapper for .filter_se() that filters low-abundance transcripts
directly within a TSENATAnalysis object. This maintains the consistent
S4 workflow pattern where functions accept and return analysis objects.
Usage
filter_analysis(
analysis,
min_tpm = 1,
tpm_assay_name = NULL,
min_samples = 5L,
stringency = "medium",
pair_col = NULL,
min_tx_per_gene = 2L,
min_isoform_abundance = NULL,
assay_name = "counts",
subset_n_genes = NULL,
subset_genes = NULL,
subset_n_samples = NULL,
subset_samples = NULL,
subset_select_by = c("variance", "mean", "random"),
subset_seed = 42,
subset_min_count = NULL,
verbose = FALSE
)
Arguments
- analysis
A TSENATAnalysis S4 object containing the SummarizedExperiment to be filtered.
- min_tpm
Numeric TPM threshold (default 1.0). Keeps transcripts with TPM >= min_tpm in >= min_samples samples. Ignored if stringency is specified.
- tpm_assay_name
Character; name of assay containing TPM data (default: NULL). If NULL, searches for TPM assay automatically.
- min_samples
Numeric. Minimum number of samples in which a transcript must be present (default: 5). Ignored if stringency is specified.
- stringency
Character. Filtering stringency level: 'soft' (permissive), 'medium' (balanced), or 'severe' (stringent). When specified, auto-estimates min_samples, min_tpm, min_tx_per_gene, and min_isoform_abundance from the data. Requires pair_col in colData for paired designs. User-provided values for any parameter override stringency defaults. Default: 'medium' (balanced filtering recommended for most analyses).
- pair_col
Character; column name in colData containing pair IDs for paired designs. Default: NULL (auto-detect if needed).
- min_tx_per_gene
Integer; minimum number of transcripts per gene required (default 2L). Single-transcript genes are always kept. Ignored if stringency is specified; when specified, automatically adjusted based on the stringency level.
- min_isoform_abundance
Numeric in [0, 1]; minimum relative abundance threshold for isoforms within each gene. Implements Soneson et al. (2016) filtering. Default behavior: if stringency is specified, uses the stringency-based default (soft: 0.01, medium: 0.05, severe: 0.15); if stringency is NULL, uses the default 0.05 (5%); if explicitly provided, overrides any stringency default. Set to 0 or NULL (post-stringency processing) to skip isoform-level filtering.
- assay_name
Character; name or index of the assay to use for filtering (default: 'counts'). Deprecated: use tpm_assay_name instead.
- subset_n_genes
Integer; optional number of genes to retain after filtering. If provided, genes are selected based on subset_select_by. Default: NULL.
- subset_genes
Character vector; optional specific genes to retain after filtering. Default: NULL.
- subset_n_samples
Integer; optional number of samples to retain after filtering. If provided, samples are selected (balanced by condition if available). Default: NULL.
- subset_samples
Character vector; optional specific samples to retain after filtering. Default: NULL.
- subset_select_by
Character; gene selection method for subset_n_genes: 'variance' (highest variance), 'mean' (highest mean expression), or 'random'. Default: 'variance'.
- subset_seed
Integer; random seed for reproducibility when subset_select_by = 'random'. Default: 42.
- subset_min_count
Numeric; optional minimum count threshold applied during subsetting. Default: NULL.
- verbose
Logical. If TRUE, print filtering progress and summary statistics (default: FALSE).
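The two abundance rules described for min_tpm/min_samples and min_isoform_abundance can be illustrated on a toy matrix in base R. This is only a sketch and not the code inside .filter_se(): the matrix, the tx2gene vector, and the rule "an isoform passes if its share of the gene total reaches the threshold in at least one sample" are assumptions made for illustration.

```r
# Toy TPM matrix: rows are transcripts, columns are samples (made-up values).
tpm <- matrix(
  c(5, 6, 7, 8,     # tx1: abundant in every sample
    0, 0, 2, 0,     # tx2: above threshold in a single sample
    1, 1, 1, 0.5),  # tx3: at threshold in three samples
  nrow = 3, byrow = TRUE,
  dimnames = list(c("tx1", "tx2", "tx3"), paste0("s", 1:4))
)
# Hypothetical transcript-to-gene map for the toy data.
tx2gene <- c(tx1 = "geneA", tx2 = "geneA", tx3 = "geneB")

min_tpm <- 1
min_samples <- 3L

# Rule from the min_tpm description: TPM >= min_tpm in >= min_samples samples.
keep_tpm <- rowSums(tpm >= min_tpm) >= min_samples

# Relative isoform abundance: each transcript's share of its gene total, per sample.
gene_totals <- rowsum(tpm, group = tx2gene)              # genes x samples
iso_frac <- tpm / gene_totals[tx2gene, , drop = FALSE]   # transcripts x samples
iso_frac[is.nan(iso_frac)] <- 0                          # gene total of zero in a sample

min_isoform_abundance <- 0.05
# Assumed rule: keep an isoform whose share reaches the threshold in at least one sample.
keep_iso <- apply(iso_frac, 1, max) >= min_isoform_abundance

kept <- rownames(tpm)[keep_tpm & keep_iso]
kept
```

In this toy case tx2 fails the TPM rule because it reaches 1 TPM in only one sample, even though its share of geneA would pass the isoform threshold.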
Value
Invisibly returns the modified analysis object with filtered
SummarizedExperiment in the @se slot. The filtering operation
modifies the analysis object in-place while maintaining all other slots
(results, metadata, etc.).
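What an invisible return value means in practice can be shown with a small stand-in function. filter_toy() below is hypothetical and unrelated to the package; it only demonstrates base R's invisible() and withVisible().

```r
# Hypothetical stand-in for a function that returns its result invisibly.
filter_toy <- function(x, min_value) {
  invisible(x[x >= min_value])
}

# withVisible() reports whether a value would be auto-printed at the console.
res <- withVisible(filter_toy(c(0.2, 3, 5), min_value = 1))
res$visible

# The value is still returned, so it can be captured by assignment.
filtered <- filter_toy(c(0.2, 3, 5), min_value = 1)
filtered
```

Calling such a function at the console prints nothing; the Examples section captures the result with `analysis <- filter_analysis(...)`.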
Details
This wrapper applies .filter_se() to the SummarizedExperiment within
the TSENATAnalysis object, optionally followed by subsetting parameters.
The filtering and subsetting operations are applied in sequence:
1. Extracts the SE from analysis@se
2. Filters using .filter_se() with specified filtering parameters (default: 'medium' stringency)
3. If any subset parameters are provided, applies gene/sample selection
to select specific genes and/or samples
4. Stores the filtered/subsetted SE back in analysis@se
5. Returns the modified analysis object invisibly
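The gene selection in step 3 (subset_n_genes together with subset_select_by and subset_seed) can be sketched in base R as follows. select_genes() is a hypothetical helper written for this illustration; how the package scores and ranks genes internally is not stated in this page, so treat the scoring below as an assumption.

```r
# Hypothetical helper: pick n genes by variance, mean, or at random.
select_genes <- function(expr, n, by = c("variance", "mean", "random"), seed = 42) {
  by <- match.arg(by)
  n <- min(n, nrow(expr))
  if (by == "random") {
    set.seed(seed)  # fixes the global RNG state so the draw is reproducible
    return(sort(sample(rownames(expr), n)))
  }
  score <- if (by == "variance") apply(expr, 1, var) else rowMeans(expr)
  names(sort(score, decreasing = TRUE))[seq_len(n)]
}

# Toy gene-level expression matrix (made-up values).
expr <- matrix(
  c(10, 10, 10, 10,   # g1: high mean, zero variance
     0,  8,  0,  8,   # g2: high variance
     1,  2,  1,  2),  # g3: low mean, low variance
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), paste0("s", 1:4))
)

top_var  <- select_genes(expr, 2, by = "variance")
top_mean <- select_genes(expr, 1, by = "mean")
rand_a   <- select_genes(expr, 2, by = "random", seed = 1)
rand_b   <- select_genes(expr, 2, by = "random", seed = 1)
```

Ranking by variance and ranking by mean can favor different genes (here g2 versus g1), and the random choice is only repeatable when the same seed is supplied.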
**Default Filtering (stringency = 'medium'):** By default, filtering applies balanced stringency: it requires transcripts to be present in >= 50% of samples and a minimum isoform abundance of 5%, balancing noise reduction with preservation of isoform diversity for reliable entropy calculations.
**Important:** Filtering should be performed BEFORE computing diversity, divergence, or LM interaction results. If called after analysis results have been computed, those results will be based on unfiltered data and may not align with the filtered SE dimensions.
See also
build_analysis for creating a new analysis object
Examples
# Create test analysis and filter
data(readcounts)
readcounts <- as.matrix(readcounts)
mode(readcounts) <- 'numeric'
metadata_df <- read.table(
  system.file('extdata', 'metadata.tsv', package = 'TSENAT'),
  header = TRUE, sep = '\t'
)
gff3_dataset <- system.file('extdata', 'annotation.gff3.gz', package = 'TSENAT')
config <- TSENAT_config(sample_col = 'sample', condition_col = 'condition')
# tpm and effective_length are not created in this snippet; they are assumed
# to be available as datasets shipped with the package.
analysis <- build_analysis(
  readcounts = readcounts, tx2gene = gff3_dataset, metadata = metadata_df,
  config = config, tpm = tpm, effective_length = effective_length
)
analysis <- filter_analysis(analysis, stringency = 'medium')