genetools package

Submodules

genetools.helpers module

Pandas/Numpy common recipes.

genetools.helpers.barcode_split(obs_names, separator='-', colname_barcode='barcode', colname_library='library_id')[source]

Split single cell barcodes such as ATGC-1 into a barcode column with value “ATGC” and a library ID column with value 1.

Recommended usage with scanpy: adata.obs = horizontal_concat(adata.obs, barcode_split(adata.obs_names))

Parameters:
  • obs_names (pandas.Series or pandas.Index) – Cell barcodes with a library ID suffix.
  • separator (str, optional) – library ID separator, defaults to ‘-’
  • colname_barcode (str, optional) – output column name containing barcode without library ID suffix, defaults to ‘barcode’
  • colname_library (str, optional) – output column name containing library ID suffix as an int, defaults to ‘library_id’
Returns:

Two-column dataframe containing barcode prefix and library ID suffix.

Return type:

pandas.DataFrame
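
The splitting behavior described above can be sketched with plain pandas. This is an illustration, not the genetools implementation: the helper name split_barcodes_sketch is hypothetical, and returning a default integer index is an assumption.

```python
import pandas as pd


def split_barcodes_sketch(obs_names, separator="-",
                          colname_barcode="barcode",
                          colname_library="library_id"):
    """Illustrative only: split 'ATGC-1' into barcode 'ATGC' and library ID 1."""
    names = pd.Series(list(obs_names), dtype=str)
    # Split once from the right so only the library ID suffix is removed.
    parts = names.str.rsplit(separator, n=1, expand=True)
    parts.columns = [colname_barcode, colname_library]
    # The library ID suffix is documented as an int.
    parts[colname_library] = parts[colname_library].astype(int)
    return parts


example = split_barcodes_sketch(pd.Index(["ATGC-1", "GGCA-2"]))
```

Splitting from the right keeps any earlier occurrences of the separator inside the barcode column.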

genetools.helpers.get_off_diagonal_values(arr)[source]

Get off-diagonal values of a numpy 2d array as a flattened 1d array.

Parameters:
  • arr (numpy.ndarray) – input numpy 2d array
Returns:

flattened 1d array of non-diagonal values only

Return type:

numpy.ndarray

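
One way to obtain the off-diagonal values is a boolean mask, sketched below with numpy. The helper name off_diagonal_sketch is hypothetical and the library may implement this differently.

```python
import numpy as np


def off_diagonal_sketch(arr):
    """Illustrative only: return every entry that is not on the main diagonal."""
    arr = np.asarray(arr)
    # Boolean mask that is True everywhere except on the main diagonal.
    mask = ~np.eye(arr.shape[0], arr.shape[1], dtype=bool)
    # Boolean indexing returns the selected values flattened in row-major order.
    return arr[mask]


values = off_diagonal_sketch(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```
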
genetools.helpers.horizontal_concat(df_left, df_right)[source]

Concatenate df_right horizontally to df_left, with no checks for whether the indexes match, but confirming final shape.

Parameters:
  • df_left (pandas.DataFrame or pandas.Series) – Left data
  • df_right (pandas.DataFrame or pandas.Series) – Right data
Returns:

Copied dataframe with df_right’s columns glued onto the right side of df_left’s columns

Return type:

pandas.DataFrame
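
A rough sketch of concatenating without index checks but with a final shape check, using pandas.concat. The helper name horizontal_concat_sketch and the exact form of the shape check are assumptions made for illustration.

```python
import pandas as pd


def horizontal_concat_sketch(df_left, df_right):
    """Illustrative only: glue columns side by side, then confirm the shape."""
    left = pd.DataFrame(df_left)
    right = pd.DataFrame(df_right)
    combined = pd.concat([left, right], axis=1)
    # If the indexes disagree, concat adds rows; the shape check catches that.
    expected_shape = (left.shape[0], left.shape[1] + right.shape[1])
    if combined.shape != expected_shape:
        raise ValueError(
            f"Expected shape {expected_shape}, got {combined.shape}"
        )
    return combined


aligned = horizontal_concat_sketch(
    pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"b": [3, 4]})
)

try:
    horizontal_concat_sketch(
        pd.DataFrame({"a": [1, 2]}),
        pd.DataFrame({"b": [3, 4]}, index=[5, 6]),
    )
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The shape check is what surfaces silently misaligned indexes, since pandas would otherwise pad the result with missing values.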

genetools.helpers.make_slurm_command(script, job_name, log_path, env=None, options={}, job_group_name='', wrap_script=True)[source]

Generate slurm sbatch command. Should be pipe-able straight to bash.

Automatic log filenames will take the format:
  • {{ log_path }}/{{ job_group_name (optional) }}/{{ job_name }}.out for stdout
  • {{ log_path }}/{{ job_group_name (optional) }}/{{ job_name }}.err for stderr

You can override automatic log filenames by manually supplying “output” and “error” values in the options dict.

Parameters:
  • script (str) – path to an executable script, or inline script (if wrap_script is True)
  • job_name (str) – job name, used for naming log files
  • log_path (str) – destination for log files.
  • env (dict, optional) – any environment variables to pass to script, defaults to None
  • options (dict, optional) – any CLI options for sbatch, defaults to {}
  • job_group_name (str, optional) – optional group name for this job and related jobs, used for naming log files, defaults to “”
  • wrap_script (bool, optional) – whether the script is inline as opposed to a file on disk, defaults to True
Returns:

an sbatch command

Return type:

str
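
The log filename convention and the override rule described above can be sketched with the standard library alone. Both helper names below are hypothetical, and this does not reproduce the sbatch command that the real function generates.

```python
import posixpath


def default_log_paths(log_path, job_name, job_group_name=""):
    """Hypothetical helper: build the documented stdout/stderr log paths."""
    # posixpath.join adds no extra directory level for an empty group name.
    base = posixpath.join(log_path, job_group_name, job_name)
    return base + ".out", base + ".err"


def resolve_log_options(log_path, job_name, options=None, job_group_name=""):
    """Hypothetical helper: explicit "output"/"error" entries win over defaults."""
    stdout_path, stderr_path = default_log_paths(log_path, job_name, job_group_name)
    resolved = {"output": stdout_path, "error": stderr_path}
    # User-supplied options override the automatic log filenames.
    resolved.update(options or {})
    return resolved


automatic = resolve_log_options("logs", "align", job_group_name="batch1")
overridden = resolve_log_options("logs", "align", options={"output": "custom.out"})
```
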

genetools.helpers.merge_into_left(left, right, **kwargs)[source]

Defensively merge [right] series or dataframe into [left] by index, preserving [left]’s index exactly. [right] data will be reordered to match [left] index.

Parameters:
  • left (pandas.DataFrame or pandas.Series) – left data whose index will be preserved
  • right (pandas.DataFrame or pandas.Series) – right data which will be reordered based on left index.
  • **kwargs – passed to pandas.merge
Returns:

left-merged DataFrame with [left]’s index

Return type:

pandas.DataFrame
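
A minimal sketch of an index-preserving left merge with pandas.merge. The helper name merge_into_left_sketch is hypothetical, it omits the **kwargs pass-through, and the final index check is one possible reading of "defensively".

```python
import pandas as pd


def merge_into_left_sketch(left, right):
    """Illustrative only: left-merge by index and verify the index survived."""
    left = pd.DataFrame(left)
    right = pd.DataFrame(right)
    merged = pd.merge(
        left, right, how="left", left_index=True, right_index=True, sort=False
    )
    # Duplicate index values in `right` would add rows; fail loudly instead.
    if not merged.index.equals(left.index):
        raise ValueError("Merge changed the left index")
    return merged


left = pd.DataFrame({"x": [1, 2, 3]}, index=["c", "a", "b"])
right = pd.Series([10, 20, 30], index=["a", "b", "c"], name="y")
merged = merge_into_left_sketch(left, right)
```

Here the right-hand values are reordered to follow the left index order c, a, b.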

genetools.helpers.rename_duplicates(series, delim='-')[source]

Rename duplicate values to be unique. [‘a’, ‘a’] will become [‘a’, ‘a-1’], for example.

Parameters:
  • series (pandas.Series) – series with values to rename
  • delim (str, optional) – delimiter before duplicate-number index, defaults to “-”
Returns:

series where original duplicates have been renamed to -1, -2, etc.

Return type:

pandas.Series
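
The renaming rule can be sketched with a grouped cumulative count. The helper name rename_duplicates_sketch is hypothetical, and this sketch does not guard against a renamed value colliding with a value that already exists in the series.

```python
import pandas as pd


def rename_duplicates_sketch(series, delim="-"):
    """Illustrative only: ['a', 'a'] becomes ['a', 'a-1']."""
    # Number each occurrence within its group of equal values: 0, 1, 2, ...
    counts = series.groupby(series).cumcount()
    # First occurrences keep their name; repeats get delim + occurrence number.
    suffix = (delim + counts.astype(str)).where(counts > 0, "")
    return series.astype(str) + suffix


renamed = rename_duplicates_sketch(pd.Series(["a", "a", "b", "a"]))
```
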

genetools.helpers.vertical_concat(df_top, df_bottom, reset_index=False)[source]

Concatenate df_bottom vertically to df_top, with no checks for whether the columns match, but confirming final shape.

Parameters:
  • df_top (pandas.DataFrame) – Top data
  • df_bottom (pandas.DataFrame) – Bottom data
  • reset_index (bool, optional) – Reset index values after concat, defaults to False
Returns:

Copied dataframe with df_bottom’s rows glued onto the bottom of df_top’s rows

Return type:

pandas.DataFrame

genetools.plots module

genetools.plots.horizontal_stacked_bar_plot(data, index_key, hue_key, value_key, palette=None, figsize=(8, 8), normalize=True)[source]

Horizontal stacked bar chart.

Note: the figure will grow beyond the figsize parameter setting, because the legend is pulled outside the figure. So you must save with fig.savefig('filename', bbox_inches='tight'). This is provided automatically by genetools.plots.savefig(fig, 'filename').

See https://observablehq.com/@d3/stacked-normalized-horizontal-bar for inspiration and colors.

Parameters:
  • data (pandas.DataFrame) – Plot data containing at minimum the columns identified by [index_key], [hue_key], and [value_key].
  • index_key (str) – Column name defining the rows.
  • hue_key (str) – Column name defining the horizontal bar categories.
  • value_key (str) – Column name defining the bar sizes.
  • palette (matplotlib palette name, list of colors, or dict mapping hue values to colors, optional) – Color palette, defaults to None (in which case default palette used)
  • figsize (tuple, optional) – figure size, defaults to (8, 8)
  • normalize (bool, optional) – Normalize each row’s frequencies to sum to 1, defaults to True
Raises:

ValueError – Must specify correct number of colors if supplying a custom palette

Returns:

matplotlib figure and axes

Return type:

(matplotlib.Figure, matplotlib.Axes)

genetools.plots.savefig(fig, *args, **kwargs)[source]

Save figure with tight bounding box. From https://github.com/mwaskom/seaborn/blob/master/seaborn/axisgrid.py#L33

genetools.plots.umap_scatter(data, umap_1_key, umap_2_key, hue_key, continuous_hue=False, label_key=None, marker_size=15, figsize=(8, 8), discrete_palette=None, continuous_cmap='viridis', label_z_order=10, label_color='k', label_alpha=0.5, label_size=20)[source]

Simple umap scatter plot, with legend outside figure.

Note, for discrete hues (continuous_hue=False): the figure will grow beyond the figsize parameter setting, because the legend is pulled outside the figure. So you must save with fig.savefig('filename', bbox_inches='tight'). This is provided automatically by genetools.plots.savefig(fig, 'filename').

If using with scanpy, to get UMAP data from adata.obsm into adata.obs, try:

data = helpers.horizontal_concat(adata.obs, adata.obsm.to_df()[['X_umap1', 'X_umap2']])

Parameters:
  • data (pandas.DataFrame) – input data, e.g. anndata.obs
  • umap_1_key (string) – column name with first dimension of UMAP
  • umap_2_key (string) – column name with second dimension of UMAP
  • hue_key (string) – column name with hue that will be used to color points
  • continuous_hue (bool, optional) – whether hue column takes continuous values and colorbar should be shown, defaults to False
  • label_key (string, optional) – column name with optional cluster labels, defaults to None
  • marker_size (int, optional) – marker size, defaults to 15
  • figsize (tuple, optional) – figure size, defaults to (8, 8)
  • discrete_palette (matplotlib palette name, list of colors, or dict mapping hue values to colors, optional) – color palette for discrete hues, defaults to None
  • continuous_cmap (str or matplotlib.colors.Colormap, optional) – colormap for continuous hues, defaults to ‘viridis’
  • label_z_order (int, optional) – z-index for cluster labels, defaults to 10
  • label_color (str, optional) – color for cluster labels, defaults to ‘k’
  • label_alpha (float, optional) – opacity for cluster labels, defaults to 0.5
  • label_size (int, optional) – size of cluster labels, defaults to 20
Returns:

matplotlib figure and axes

Return type:

(matplotlib.Figure, matplotlib.Axes)

genetools.scanpy_helpers module

Scanpy common recipes.

genetools.scanpy_helpers.clr_normalize(adata, axis=0, inplace=True)[source]

Centered log ratio transformation for Cite-seq data, normalizing:

  • each protein’s count vectors across cells (axis=0, normalizing each column of the cells x proteins matrix, default)
  • or the antibody count vector for each cell (axis=1, normalizing each row of the cells x proteins matrix)

This is a wrapper of genetools.stats.clr_normalize(matrix, axis).

Parameters:
  • adata (anndata.AnnData) – Protein counts anndata
  • axis (int, optional) – normalize each antibody independently (axis=0) or normalize each cell independently (axis=1), defaults to 0
  • inplace (bool, optional) – whether to modify input anndata, defaults to True
Returns:

Transformed anndata

Return type:

anndata.AnnData

genetools.scanpy_helpers.find_all_markers(adata, cluster_key, pval_cutoff=0.05, log2fc_min=0.25, key_added='rank_genes_groups', test='wilcoxon', use_raw=True)[source]

Find differentially expressed marker genes for each group of cells.

Parameters:
  • adata (anndata.AnnData) – Scanpy/anndata object
  • cluster_key (str) – The adata.obs column name that defines groups for finding distinguishing marker genes.
  • pval_cutoff (float, optional) – Only return markers that have an adjusted p-value below this threshold. Defaults to 0.05. Set to None to disable filtering.
  • log2fc_min (float, optional) – Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. Defaults to 0.25. Set to None to disable filtering.
  • key_added (str, optional) – The key in adata.uns under which the results are saved, defaults to “rank_genes_groups”
  • test (str, optional) – Statistical test to use, defaults to “wilcoxon” (Wilcoxon rank-sum test), see scanpy.tl.rank_genes_groups documentation for other options
  • use_raw (bool, optional) – Use raw attribute of adata if present, defaults to True
Returns:

Dataframe with ranked marker genes for each cluster. Important columns: gene, rank, [cluster_key] (same as argument value)

Return type:

pandas.DataFrame

genetools.stats module

genetools.stats.accept_series(func)[source]

Decorator that lets a function accept a pandas Series in place of a numpy array and returns the result with the original Series index.
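
A decorator of this kind might look like the sketch below. The names accept_series_sketch and double are hypothetical, and preserving the Series name is an assumption rather than documented behavior.

```python
import functools

import numpy as np
import pandas as pd


def accept_series_sketch(func):
    """Illustrative only: unwrap a Series, call func, rewrap with the index."""
    @functools.wraps(func)
    def wrapper(values, *args, **kwargs):
        if isinstance(values, pd.Series):
            result = func(values.to_numpy(), *args, **kwargs)
            # Restore the original index (and, by assumption, the name).
            return pd.Series(result, index=values.index, name=values.name)
        # Anything else is passed through untouched.
        return func(values, *args, **kwargs)
    return wrapper


@accept_series_sketch
def double(arr):
    return arr * 2


from_series = double(pd.Series([1, 2], index=["x", "y"]))
from_array = double(np.array([1, 2]))
```
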

genetools.stats.clr_normalize(mat, axis=0)[source]

Centered log ratio transformation for Cite-seq data, normalizing:

  • each protein’s count vectors across cells (axis=0, normalizing each column of the cells x proteins matrix, default)
  • or the antibody count vector for each cell (axis=1, normalizing each row of the cells x proteins matrix)

To use with anndata: genetools.scanpy_helpers.clr_normalize(adata, axis)

Parameters:
  • mat (numpy array or scipy sparse matrix) – Counts matrix (cells x proteins)
  • axis (int, optional) – normalize each antibody independently (axis=0) or normalize each cell independently (axis=1), defaults to 0
Returns:

Transformed counts matrix

Return type:

numpy array

Notes:

Seurat’s modified version, in R:

log1p(x = x / (exp(x = sum(log1p(x = x[x > 0]), na.rm = TRUE) / length(x = x))))

This is almost the same as log(x) - 1/D * log(product of x’s), which is the same as log(x) - log( [product of x’s] ^ (1/D) ), where D = len(x).

The general definition is:

import numpy as np
from scipy.stats.mstats import gmean
clr = np.log(x) - np.log(gmean(x))

But geometric mean only applies to positive numbers (otherwise the inner product will be 0). So you want to use pseudocounts or drop 0 counts. That’s what Seurat’s modification does.

  • See also https://github.com/theislab/scanpy/pull/1117 for other approaches.

  • Do you run this normalization cell-wise or gene-wise (i.e. protein-wise)? See the discussion at https://github.com/satijalab/seurat/issues/871#issuecomment-431414099:

    > Unfortunately there is not a single answer. In some cases, cell-based normalization fails. This is because cell-normalization makes an assumption that the total ADT counts should be constant across cells. That can become a significant issue if you have cell populations in your data, but did not add protein markers for them (this is also an issue for scRNA-seq, but is significantly mitigated because at least you measure many genes).
    >
    > However, gene-based normalization can fail when there is significant heterogeneity in sequencing depth, or cell size. The optimal strategy depends on the AB panel, and heterogeneity of your sample.

In this implementation, protein-wise is axis=0 and cell-wise is axis=1. The default is protein-wise (axis=0), i.e. each protein is normalized independently, which matches Seurat’s default.
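
The Seurat-style formula quoted in these notes can be translated to numpy roughly as follows. This is a sketch for dense, non-negative count matrices only; the helper name clr_sketch is hypothetical and the library's own code may differ, for example in how it handles sparse input.

```python
import numpy as np


def clr_sketch(mat, axis=0):
    """Illustrative only: log1p(x / exp(sum(log1p(x[x > 0])) / len(x)))."""
    mat = np.asarray(mat, dtype=float)
    # log1p(0) == 0, so zero counts add nothing to the sum, which matches
    # restricting the sum to x > 0 for non-negative counts.
    log_sum = np.log1p(mat).sum(axis=axis, keepdims=True)
    # Divide by the full vector length, zeros included.
    scale = np.exp(log_sum / mat.shape[axis])
    return np.log1p(mat / scale)


counts = np.array([[1.0, 3.0], [0.0, 3.0]])  # cells x proteins
protein_wise = clr_sketch(counts, axis=0)
cell_wise = clr_sketch(counts, axis=1)
```

With axis=0 each column (protein) is scaled by its own factor; with axis=1 each row (cell) is.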

genetools.stats.coclustering(cluster_ids_1, cluster_ids_2)[source]

Compute coclustering percentage between two sets of cluster IDs for the same cells: Of the cell pairs clustered together by either or both methods, what percentage are clustered together by both methods?

(The clusters are allowed to have different names across methods, and don’t necessarily need to be ints.)

Parameters:
  • cluster_ids_1 (numpy array-like) – One set of cluster IDs.
  • cluster_ids_2 (numpy array-like) – Another set of cluster IDs.
Returns:

Percentage of cell pairs clustered together by one or both methods that are also clustered together by the other method.

Return type:

float
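
The definition above amounts to a ratio over cell pairs, which the following standard-library sketch spells out. The helper name coclustering_sketch is hypothetical; it returns a fraction between 0 and 1 (whether the library scales to 0-100 is not stated here), and it loops over all pairs, so it is only practical for small inputs.

```python
from itertools import combinations


def coclustering_sketch(cluster_ids_1, cluster_ids_2):
    """Illustrative only: pairs together in both / pairs together in either."""
    ids_1 = list(cluster_ids_1)
    ids_2 = list(cluster_ids_2)
    together_in_both = 0
    together_in_either = 0
    # Visit every unordered pair of distinct cells exactly once.
    for i, j in combinations(range(len(ids_1)), 2):
        same_1 = ids_1[i] == ids_1[j]
        same_2 = ids_2[i] == ids_2[j]
        together_in_both += same_1 and same_2
        together_in_either += same_1 or same_2
    if together_in_either == 0:
        return float("nan")
    return together_in_both / together_in_either


score = coclustering_sketch([0, 0, 1, 1], ["a", "a", "a", "b"])
```

In this example four pairs are grouped together by at least one labeling and one pair by both, giving 0.25. Cluster names never need to match across the two labelings because only within-labeling equality is compared.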

genetools.stats.intersect_marker_genes(reference_data, query_data, low_confidence_threshold=0.035, low_confidence_suffix='?')[source]

Map cluster marker genes against reference lists to find top hits.

query_data and reference_data should both be dictionaries where:
  • keys are cluster names or IDs
  • values are lists of genes associated with that cluster

Or if you have a dataframe where each row contains a cluster ID and a gene name, you can convert to dict with df.groupby('cluster')['gene'].apply(list).to_dict()

Usage with an anndata/scanpy object on groups defined by adata.obs['louvain']:

# find marker genes for all clusters
cluster_markers_df = genetools.scanpy_helpers.find_all_markers(adata, cluster_key='louvain')

# convert to dict of clusters -> gene lists mapping
cluster_marker_lists = cluster_markers_df.groupby('louvain')['gene'].apply(list).to_dict()

# intersect with known marker gene lists
results, label_map, low_confidence_percentage = genetools.stats.intersect_marker_genes(reference_marker_lists, cluster_marker_lists)

# rename clusters in your anndata/scanpy object
adata.obs['louvain_annotated'] = adata.obs['louvain'].copy().cat.rename_categories(label_map)

Behavior:
  • Intersection scores are normalized to marker gene list sizes.
  • Resulting duplicate cluster names are renamed, ensuring that N original query clusters will map to N renamed clusters.
Parameters:
  • reference_data (dict) – reference marker gene lists
  • query_data (dict) – query marker gene lists
  • low_confidence_threshold (float, optional) – Minimal difference between top and subsequent hits for a confident call, defaults to 0.035
  • low_confidence_suffix (str, optional) – Suffix for low-confidence cluster renamings, defaults to “?”
Returns:

dataframe with cluster mapping details, a dictionary for renaming query cluster names, and percentage of low-confidence calls.

Return type:

(pandas.DataFrame, dict, float) tuple

genetools.stats.normalize_columns(df)[source]

Make columns sum to 1.

Parameters:
  • df (pandas.DataFrame) – dataframe
Returns:

column-normalized dataframe

Return type:

pandas.DataFrame

genetools.stats.normalize_rows(df)[source]

Make rows sum to 1.

Parameters:
  • df (pandas.DataFrame) – dataframe
Returns:

row-normalized dataframe

Return type:

pandas.DataFrame

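
For reference, row and column normalization of the kind done by normalize_rows and normalize_columns can be written directly in pandas as below. This is a sketch of the arithmetic, not the library's code.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 3.0], "b": [1.0, 1.0]})

# Columns sum to 1: divide each column by its own total.
by_column = df / df.sum(axis=0)

# Rows sum to 1: divide each row by its own total (align totals on the index).
by_row = df.div(df.sum(axis=1), axis=0)
```
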
genetools.stats.percentile_normalize(values)[source]

Percentile normalize.

Parameters:
  • values (numpy.ndarray or pandas.Series) – values to normalize
Returns:

percentile-normalized values

Return type:

numpy.ndarray or pandas.Series

genetools.stats.rank_normalize(values)[source]

Rank normalize, starting with rank 1. All ranks must be unique.

Parameters:
  • values (numpy.ndarray or pandas.Series) – values to normalize
Returns:

rank-normalized values

Return type:

numpy.ndarray or pandas.Series
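
One way to produce unique ranks starting at 1 is sketched below with numpy. The helper name rank_sketch is hypothetical, and breaking ties by position is an assumption about what "all ranks must be unique" means; the library may treat ties differently.

```python
import numpy as np


def rank_sketch(values):
    """Illustrative only: smallest value gets rank 1, ranks never repeat."""
    values = np.asarray(values)
    # Stable sort: tied values keep their original relative order.
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    # The element at sorted position k receives rank k + 1.
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


ranks = rank_sketch([30, 10, 20])
```
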

Module contents

Top-level package for genetools.