genetools package

Submodules

genetools.helpers module

Pandas/Numpy common recipes.

genetools.helpers.barcode_split(obs_names, separator='-', colname_barcode='barcode', colname_library='library_id')[source]

Split single cell barcodes such as ATGC-1 into a barcode column with value “ATGC” and a library ID column with value 1.

Recommended usage with scanpy:

adata.obs = genetools.helpers.horizontal_concat(
    adata.obs,
    genetools.helpers.barcode_split(adata.obs_names)
)

Parameters
  • obs_names (pandas.Series or pandas.Index) – Cell barcodes with a library ID suffix.

  • separator (str, optional) – library ID separator, defaults to ‘-’

  • colname_barcode (str, optional) – output column name containing barcode without library ID suffix, defaults to ‘barcode’

  • colname_library (str, optional) – output column name containing library ID suffix as an int, defaults to ‘library_id’

Returns

Two-column dataframe containing barcode prefix and library ID suffix.

Return type

pandas.DataFrame
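
As a rough illustration of the split described above, here is a pure-Python sketch. The helper name split_barcode is hypothetical, and splitting on the last separator only is an assumption; the library itself works on a pandas Series or Index and may differ in detail.

```python
def split_barcode(obs_name, separator="-"):
    """Hypothetical helper: split 'ATGC-1' into ('ATGC', 1)."""
    # Assumption: split on the LAST separator, so a separator that appears
    # inside the barcode itself stays part of the barcode.
    barcode, library_id = obs_name.rsplit(separator, 1)
    return barcode, int(library_id)

rows = [split_barcode(name) for name in ["ATGC-1", "GGCA-2"]]
print(rows)  # [('ATGC', 1), ('GGCA', 2)]
```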

genetools.helpers.get_off_diagonal_values(arr)[source]

Get off-diagonal values of a numpy 2d array as a flattened 1d array.

Parameters

arr (numpy.ndarray) – input numpy 2d array

Returns

flattened 1d array of non-diagonal values only

Return type

numpy.ndarray
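
The idea can be sketched without numpy on a square list of lists. The helper name and the row-by-row output order are assumptions. With numpy, a boolean mask such as arr[~np.eye(arr.shape[0], dtype=bool)] selects the same entries of a square array.

```python
def off_diagonal_values(matrix):
    """Hypothetical helper: off-diagonal entries of a square matrix, row by row."""
    return [
        value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if i != j  # skip the diagonal entries
    ]

matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(off_diagonal_values(matrix))  # [2, 3, 4, 6, 7, 8]
```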

genetools.helpers.horizontal_concat(df_left, df_right)[source]

Concatenate df_right horizontally to df_left, with no checks for whether the indexes match, but confirming final shape.

Parameters
  • df_left (pandas.DataFrame or pandas.Series) – Left data

  • df_right (pandas.DataFrame or pandas.Series) – Right data

Returns

Copied dataframe with df_right’s columns glued onto the right side of df_left’s columns

Return type

pandas.DataFrame

genetools.helpers.make_slurm_command(script, job_name, log_path, env=None, options={}, job_group_name='', wrap_script=True)[source]

Generate slurm sbatch command. Should be pipe-able straight to bash.

Automatic log filenames will take the format:
  • {{ log_path }}/{{ job_group_name (optional) }}/{{ job_name }}.out for stdout

  • {{ log_path }}/{{ job_group_name (optional) }}/{{ job_name }}.err for stderr

You can override automatic log filenames by manually supplying “output” and “error” values in the options dict.

Parameters
  • script (str) – path to an executable script, or inline script (if wrap_script is True)

  • job_name (str) – job name, used for naming log files

  • log_path (str) – destination for log files.

  • env (dict, optional) – any environment variables to pass to script, defaults to None

  • options (dict, optional) – any CLI options for sbatch, defaults to {}

  • job_group_name (str, optional) – optional group name for this job and related jobs, used for naming log files, defaults to “”

  • wrap_script (bool, optional) – whether the script is inline as opposed to a file on disk, defaults to True

Returns

an sbatch command

Return type

str
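
The log filename convention above can be sketched as follows. Both helper names are hypothetical, and the sketch covers only the path logic, not the sbatch command that the library builds.

```python
import posixpath

def default_log_paths(log_path, job_name, job_group_name=""):
    """Hypothetical helper: stdout/stderr paths following the documented pattern."""
    # posixpath.join collapses an empty group name, so the group
    # directory only appears when job_group_name is given.
    base = posixpath.join(log_path, job_group_name, job_name)
    return {"output": base + ".out", "error": base + ".err"}

def resolve_log_options(log_path, job_name, job_group_name="", options=None):
    """Caller-supplied 'output' and 'error' values override the automatic names."""
    resolved = default_log_paths(log_path, job_name, job_group_name)
    resolved.update(options or {})
    return resolved

print(default_log_paths("logs", "align", "batch1"))
print(resolve_log_options("logs", "align", options={"output": "custom.out"}))
```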

genetools.helpers.merge_into_left(left, right, **kwargs)[source]

Defensively merge right series or dataframe into left by index, preserving left’s index exactly. right data will be reordered to match left index.

Parameters
  • left (pandas.DataFrame or pandas.Series) – left data whose index will be preserved

  • right (pandas.DataFrame or pandas.Series) – right data which will be reordered based on left index.

  • **kwargs – passed to pandas.merge

Returns

left-merged DataFrame with left’s index

Return type

pandas.DataFrame

genetools.helpers.rename_duplicates(series, delim='-')[source]

Rename duplicate values to be unique. ['a', 'a'] will become ['a', 'a-1'], for example.

Parameters
  • series (pandas.Series) – series with values to rename

  • delim (str, optional) – delimiter before duplicate-number index, defaults to “-”

Returns

series where original duplicates have been renamed to -1, -2, etc.

Return type

pandas.Series
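
A minimal pure-Python sketch of the renaming rule, assuming the first occurrence is left unchanged and later ones are numbered from 1. The function name is hypothetical. The sketch does not guard against a renamed value colliding with a value that already exists.

```python
from collections import Counter

def rename_duplicates_sketch(values, delim="-"):
    """Hypothetical helper: make repeated values unique by appending a counter."""
    seen = Counter()
    renamed = []
    for value in values:
        count = seen[value]
        # First occurrence keeps its name; later ones get -1, -2, ...
        renamed.append(value if count == 0 else f"{value}{delim}{count}")
        seen[value] += 1
    return renamed

print(rename_duplicates_sketch(["a", "a"]))  # ['a', 'a-1']
```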

genetools.helpers.vertical_concat(df_top, df_bottom, reset_index=False)[source]

Concatenate df_bottom vertically to df_top, with no checks for whether the columns match, but confirming final shape.

Parameters
  • df_top (pandas.DataFrame) – Top data

  • df_bottom (pandas.DataFrame) – Bottom data

  • reset_index (bool, optional) – Reset index values after concat, defaults to False

Returns

Copied dataframe with df_bottom’s rows glued onto the bottom of df_top’s rows

Return type

pandas.DataFrame

genetools.palette module

class genetools.palette.HueValueStyle(color: str, marker: Optional[str] = None, marker_size_scale_factor: float = 1.0, legend_size_scale_factor: float = 1.0, facecolors: Optional[str] = None, edgecolors: Optional[str] = None, linewidths: Optional[float] = None, zorder: int = 1, alpha: Optional[float] = None, hatch: Optional[str] = None)[source]

Bases: object

Describes how to style a particular value (category) of a categorical hue column.

Use palettes mapping hue values to HueValueStyles to make plots with different marker shapes, transparencies, z-orders, etc. for different groups.

The plotting functions accept a hue_key, which identifies a dataframe column that contains hue values. They also accept a palette mapping each hue value to a HueValueStyle that defines not just the color to use for that hue value, but also other styles:

  • Scatterplot marker shape, primary color, face color, edge color, line width, and transparency.

  • Rectangle/barplot color and hatch pattern.

  • Size scale factor for scatterplot markers and legend entries. (The palette of HueValueStyles is defined separately from choosing marker size, and can be plotted at any selected base marker size.)

Here’s an example of assigning a custom HueValueStyle to a hue value in a color palette. This defines a custom unfilled shape, a custom z-order, and more:

palette = {
    "group_A": genetools.palette.HueValueStyle(
        color=sns.color_palette("bright")[0],
        edgecolors=sns.color_palette("bright")[0],
        facecolors="none",
        marker="^",
        marker_size_scale_factor=1.5,
        linewidths=1.5,
        zorder=10,
    ),
    ...
}

For face and edge colors, None is the default value; to disable them, set to string 'none'.

alpha: float = None[source]
apply_defaults(defaults: genetools.palette.HueValueStyle)[source]

Returns a new HueValueStyle with defaults applied: any values missing from this style are filled with the values from another HueValueStyle.

Use case: supply global style defaults for an entire scatterplot, then override with customizations in any individual hue value style.
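
The fill-in behavior can be sketched with a small stand-in dataclass. The Style class below is hypothetical and has only a few fields. Treating None as "missing" is an assumption based on the None defaults in the signature above.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional

@dataclass(frozen=True)
class Style:
    """Hypothetical stand-in for HueValueStyle with a few fields."""
    color: str
    marker: Optional[str] = None
    alpha: Optional[float] = None
    zorder: int = 1

    def apply_defaults(self, defaults: "Style") -> "Style":
        # Collect only the fields this style leaves unset (None) ...
        missing = {
            f.name: getattr(defaults, f.name)
            for f in fields(self)
            if getattr(self, f.name) is None
        }
        # ... and return a copy with those fields taken from the defaults.
        return replace(self, **missing)

plot_defaults = Style(color="gray", marker="o", alpha=0.5)
custom = Style(color="red", marker="^")
merged = custom.apply_defaults(plot_defaults)
print(merged)
```

Here the custom marker and color survive, while alpha is taken from the plot-wide defaults.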

color: str[source]
edgecolors: str = None[source]
facecolors: str = None[source]
classmethod from_color(s)[source]

Construct from color string only; keep all other marker parameters set to defaults. If already a HueValueStyle, pass through without modification.

hatch: str = None[source]
static huestyles_to_colors_dict(d: dict) dict[source]

Cast any HueValueStyle values in dict to be color strings.

legend_size_scale_factor: float = 1.0[source]
linewidths: float = None[source]
marker: str = None[source]
marker_size_scale_factor: float = 1.0[source]
render_rectangle_props()[source]

Returns kwargs to pass to ax.bar() to apply this style.

render_scatter_continuous_props(marker_size=None)[source]

Returns kwargs to pass to ax.scatter() to apply this style, in the context of continuous cmap scatterplots.

render_scatter_legend_props()[source]

Returns kwargs to pass to ax.legend() to apply this style.

render_scatter_props(marker_size=None)[source]

Returns kwargs to pass to ax.scatter() to apply this style.

zorder: int = 1[source]
genetools.palette.convert_palette_list_to_dict(palette, hue_names, sort_hues=True)[source]

If palette is a list, convert it to a dict, assigning a color to each value in hue_names (with sort enabled by default).

If palette is already a dict, pass it through with no changes.
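
A sketch of this conversion, under assumptions: the function name is hypothetical, hue names are unique, and raising ValueError when the list is too short is a guess at sensible behavior rather than a documented guarantee.

```python
def palette_list_to_dict(palette, hue_names, sort_hues=True):
    """Hypothetical helper: assign one color per hue value."""
    if isinstance(palette, dict):
        return palette  # already a mapping: pass through unchanged
    hues = sorted(hue_names) if sort_hues else list(hue_names)
    if len(palette) < len(hues):
        raise ValueError("Not enough colors for the hue values")
    # Pair hue values with colors in order.
    return dict(zip(hues, palette))

print(palette_list_to_dict(["red", "blue"], ["b", "a"]))
```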

genetools.plots module

genetools.plots.add_sample_size_to_labels(labels: list, data: pandas.core.frame.DataFrame, hue_key: str) list[source]

Add sample size to tick labels on any plot with categorical groups.

Sample size for each label is extracted from the hue_key column of dataframe data.

Pairs well with genetools.plots.wrap_tick_labels(ax).

Example usage:

ax.set_xticklabels(
    genetools.plots.add_sample_size_to_labels(
        ax.get_xticklabels(),
        df,
        "Group"
    )
)

Parameters
  • labels (list) – list of tick labels corresponding to groups in data[hue_key]

  • data (pd.DataFrame) – dataset with categorical groups

  • hue_key (str) – column name specifying categorical groups in dataset data

Returns

modified tick labels with group sample sizes attached

Return type

list
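
The counting step can be sketched with plain strings in place of matplotlib tick labels. The helper name and the "(n=...)" label format are assumptions; the format the library actually produces may differ.

```python
from collections import Counter

def add_sample_sizes(labels, group_values):
    """Hypothetical helper: append each group's size to its label."""
    counts = Counter(group_values)  # group name -> number of rows
    return [f"{label}\n(n={counts[label]})" for label in labels]

groups = ["A", "A", "B"]  # stands in for data[hue_key]
print(add_sample_sizes(["A", "B"], groups))
```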

genetools.plots.savefig(fig: matplotlib.figure.Figure, *args, **kwargs)[source]

Save figure with smart defaults:

  • Tight bounding box – necessary for legends outside of figure

  • Deterministic PDF output by fixing SOURCE_DATE_EPOCH to Jan 1, 2000

  • Editable text objects when outputting a vector PDF

Example usage: genetools.plots.savefig(fig, "my_plot.png", dpi=300).

Any positional or keyword arguments are passed to matplotlib.pyplot.savefig.

Parameters

fig (matplotlib.figure.Figure) – Figure to save.
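
For reference, the fixed timestamp can be computed as below. This sketch assumes the date is interpreted as midnight UTC, and it only shows how the environment variable could be set; how the library sets and restores it is not shown here.

```python
import os
from datetime import datetime, timezone

# Jan 1, 2000 at midnight UTC, as seconds since the Unix epoch.
epoch_2000 = int(datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp())

# SOURCE_DATE_EPOCH is read as a string of integer seconds.
os.environ["SOURCE_DATE_EPOCH"] = str(epoch_2000)
print(os.environ["SOURCE_DATE_EPOCH"])  # 946684800
```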

genetools.plots.scatterplot(data, x_axis_key, y_axis_key, hue_key, continuous_hue=False, continuous_cmap='viridis', discrete_palette: Optional[Union[Dict[str, Union[genetools.palette.HueValueStyle, str]], List[Union[genetools.palette.HueValueStyle, str]]]] = None, ax: Optional[matplotlib.axes._axes.Axes] = None, figsize=(8, 8), marker_size=25, alpha=1.0, na_color='lightgray', marker='o', marker_edge_color='none', enable_legend=True, legend_hues=None, legend_title=None, sort_legend_hues=True, autoscale=True, equal_aspect_ratio=False, plotnonfinite=False, label_key=None, label_z_order=100, label_color='k', label_alpha=0.8, label_size=15, remove_x_ticks=False, remove_y_ticks=False, tight_layout=True, despine=True, **kwargs) Tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes][source]

Scatterplot colored by a discrete or continuous “hue” grouping variable.

For discrete hues, pass continuous_hue=False and a dictionary of colors and/or HueValueStyle objects in discrete_palette.

Figure size will grow beyond the figsize parameter setting, because the legend is pulled out of figure. So you must use fig.savefig('filename', bbox_inches='tight'). This is provided automatically by genetools.plots.savefig(fig, 'filename').

If using with scanpy, to get umap data from adata.obsm into adata.obs, try:

data = adata.obs.assign(umap_1=adata.obsm["X_umap"][:, 0], umap_2=adata.obsm["X_umap"][:, 1])

Parameters
  • data (pandas.DataFrame) – Input data, e.g. anndata.obs

  • x_axis_key (str) – Column name to plot on X axis

  • y_axis_key (str) – Column name to plot on Y axis

  • hue_key (str) – Column name with hue groups that will be used to color points

  • continuous_hue (bool, optional) – Whether the hue column takes continuous or discrete/categorical values, defaults to False.

  • continuous_cmap (str, optional) – Colormap to use for plotting continuous hue grouping variable, defaults to “viridis”

  • discrete_palette (Union[ Dict[str, Union[HueValueStyle, str]], List[Union[HueValueStyle, str]] ], optional) – Palette of colors and/or HueValueStyle objects to use for plotting discrete/categorical hue groups, defaults to None. Supply a matplotlib palette name, list of colors, or dict mapping hue values to colors or to HueValueStyle objects (or a mix of the two).

  • ax (matplotlib.axes.Axes, optional) – Existing matplotlib Axes to plot on, defaults to None

  • figsize (tuple, optional) – Size of figure to generate if no existing ax was provided, defaults to (8, 8)

  • marker_size (int, optional) – Base marker size. May be scaled by individual HueValueStyles. Defaults to 25

  • alpha (float, optional) – Default point transparency, unless overridden by a HueValueStyle, defaults to 1.0

  • na_color (str, optional) – Fallback color to use for discrete hue categories that do not have an assigned style in discrete_palette, defaults to “lightgray”

  • marker (str, optional) – Default marker style, unless overridden by a HueValueStyle, defaults to “o”. For plots with many points, try “.” instead.

  • marker_edge_color (str, optional) – Default marker edge color, unless overridden by a HueValueStyle, defaults to “none” (no edge border drawn). Another common choice is “face”, so the edge color matches the face color.

  • enable_legend (bool, optional) – Whether legend (or colorbar if continuous_hue) should be drawn. Defaults to True. May want to disable if plotting multiple subplots/panels.

  • legend_hues (list, optional) – Optionally override the list of hue values to include in legend, e.g. to add any hue values missing from the plotted subset of data; defaults to None

  • legend_title (str, optional) – Specify a title for the legend. Defaults to None, in which case the hue_key is used.

  • sort_legend_hues (bool, optional) – Enable sorting of legend hues, defaults to True

  • autoscale (bool, optional) – Enable automatic zoom in, defaults to True

  • equal_aspect_ratio (bool, optional) – Plot with equal aspect ratio, defaults to False

  • plotnonfinite (bool, optional) – For continuous hues, whether to plot points with inf or nan value, defaults to False

  • label_key (str, optional) – Optional column name specifying group text labels to superimpose on plot, defaults to None

  • label_z_order (int, optional) – Z-index for superimposed group text labels, defaults to 100

  • label_color (str, optional) – Color for superimposed group text labels, defaults to “k”

  • label_alpha (float, optional) – Opacity for superimposed group text labels, defaults to 0.8

  • label_size (int, optional) – Text size of superimposed group labels, defaults to 15

  • remove_x_ticks (bool, optional) – Remove X axis tick marks and labels, defaults to False

  • remove_y_ticks (bool, optional) – Remove Y axis tick marks and labels, defaults to False

  • tight_layout (bool, optional) – whether to format the figure with tight_layout, defaults to True

  • despine (bool, optional) – whether to despine (remove the top and right figure borders), defaults to True

Raises

ValueError – Must specify correct number of colors if supplying a custom palette

Returns

Matplotlib Figure and Axes

Return type

Tuple[matplotlib.figure.Figure, matplotlib.axes.Axes]

genetools.plots.stacked_bar_plot(data, index_key, hue_key, value_key=None, ax: Optional[matplotlib.axes._axes.Axes] = None, figsize=(8, 8), normalize=True, vertical=False, palette: Optional[Union[Dict[str, Union[genetools.palette.HueValueStyle, str]], List[Union[genetools.palette.HueValueStyle, str]]]] = None, na_color='lightgray', hue_order=None, axis_label='Frequency', enable_legend=True, legend_title=None) Tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes][source]

Stacked bar chart.

The index_key groups form the bars, and the hue_key groups subdivide the bars. The value_key determines the subdivision sizes, and is computed automatically if not provided.

See https://observablehq.com/@d3/stacked-normalized-horizontal-bar for inspiration and colors.

Figure size will grow beyond the figsize parameter setting, because the legend is pulled out of figure. So you must use fig.savefig('filename', bbox_inches='tight'). This is provided automatically by genetools.plots.savefig(fig, 'filename').

Parameters
  • data (pandas.DataFrame) – Plot data containing at minimum the columns identified by index_key, hue_key, and optionally value_key.

  • index_key (str) – Column name defining the bars (one bar per group; rows in the default horizontal layout).

  • hue_key (str) – Column name defining the categories that subdivide each bar.

  • value_key (str, optional) – Column name defining the subdivision sizes. If not supplied, this method will calculate group frequencies automatically.

  • ax (matplotlib.axes.Axes, optional) – Existing matplotlib Axes to plot on, defaults to None

  • figsize (tuple, optional) – Size of figure to generate if no existing ax was provided, defaults to (8, 8)

  • normalize (bool, optional) – Normalize each row’s frequencies to sum to 1, defaults to True

  • vertical (bool, optional) – Plot stacked bars vertically, defaults to False (horizontal)

  • palette (Union[ Dict[str, Union[HueValueStyle, str]], List[Union[HueValueStyle, str]] ], optional) – Palette of colors and/or HueValueStyle objects to style the bars corresponding to each hue value, defaults to None (in which case a default palette is used). Supply a matplotlib palette name, list of colors, or dict mapping hue values to colors or to HueValueStyle objects (or a mix of the two).

  • na_color (str, optional) – Fallback color to use for hue values that do not have an assigned style in palette, defaults to “lightgray”

  • hue_order (list, optional) – Optionally specify order of bar subdivisions. This order is applied from the beginning (bottom or left) to the end (top or right) of the bar. Defaults to None

  • axis_label (str, optional) – Label for the axis along which the frequency values are drawn, defaults to “Frequency”

  • enable_legend (bool, optional) – Whether legend should be drawn. Defaults to True. May want to disable if plotting multiple subplots/panels.

  • legend_title (str, optional) – Specify a title for the legend. Defaults to None, in which case the hue_key is used.

Raises

ValueError – Must specify correct number of colors if supplying a custom palette

Returns

Matplotlib Figure and Axes

Return type

(matplotlib.figure.Figure, matplotlib.axes.Axes)
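
When value_key is omitted, the subdivision sizes are group frequencies. A pure-Python sketch of that calculation is below. The helper name and the list-of-dicts input are assumptions made for illustration; the library operates on a DataFrame.

```python
from collections import Counter, defaultdict

def group_frequencies(records, index_key, hue_key, normalize=True):
    """Hypothetical helper: size of each hue subdivision within each bar."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[index_key]][record[hue_key]] += 1
    if not normalize:
        return {index: dict(hues) for index, hues in counts.items()}
    # Normalize so the subdivisions of each bar sum to 1.
    return {
        index: {hue: n / sum(hues.values()) for hue, n in hues.items()}
        for index, hues in counts.items()
    }

records = [
    {"sample": "s1", "cell_type": "B"},
    {"sample": "s1", "cell_type": "T"},
    {"sample": "s1", "cell_type": "T"},
    {"sample": "s2", "cell_type": "B"},
]
print(group_frequencies(records, "sample", "cell_type"))
```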

genetools.plots.wrap_tick_labels(ax: matplotlib.axes._axes.Axes, wrap_x_axis=True, wrap_y_axis=True, wrap_amount=20) matplotlib.axes._axes.Axes[source]

Add text wrapping to tick labels on x and/or y axes on any plot.

May override existing line breaks in tick labels.

Parameters
  • ax (matplotlib.axes.Axes) – existing plot with tick labels to be wrapped

  • wrap_x_axis (bool, optional) – whether to wrap x-axis tick labels, defaults to True

  • wrap_y_axis (bool, optional) – whether to wrap y-axis tick labels, defaults to True

  • wrap_amount (int, optional) – length of each line of text, defaults to 20

Returns

plot with modified tick labels

Return type

matplotlib.axes.Axes
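
The wrapping of a single label can be sketched with the standard library. Whether the library uses textwrap internally is an assumption. The sketch also shows why existing line breaks may be lost: textwrap.fill replaces whitespace, including newlines, before re-wrapping.

```python
import textwrap

def wrap_label(label, wrap_amount=20):
    """Hypothetical helper: wrap one tick label to lines of at most wrap_amount characters."""
    return textwrap.fill(label, width=wrap_amount)

wrapped = wrap_label("Naive CD4 T cells from donor A")
print(wrapped)

# An existing line break is replaced by a space before wrapping.
print(repr(wrap_label("a\nb")))
```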

genetools.scanpy_helpers module

Scanpy common recipes.

genetools.scanpy_helpers.clr_normalize(adata, axis=0, inplace=True)[source]

Centered log ratio transformation for Cite-seq data, normalizing:

  • each protein’s count vectors across cells (axis=0, normalizing each column of the cells x proteins matrix, default)

  • or the antibody count vector for each cell (axis=1, normalizing each row of the cells x proteins matrix)

This is a wrapper of genetools.stats.clr_normalize(matrix, axis).

Parameters
  • adata (anndata.AnnData) – Protein counts anndata

  • axis (int, optional) – normalize each antibody independently (axis=0) or normalize each cell independently (axis=1), defaults to 0

  • inplace (bool, optional) – whether to modify input anndata, defaults to True

Returns

Transformed anndata

Return type

anndata.AnnData

genetools.scanpy_helpers.find_all_markers(adata, cluster_key, pval_cutoff=0.05, log2fc_min=0.25, key_added='rank_genes_groups', test='wilcoxon', use_raw=True)[source]

Find differentially expressed marker genes for each group of cells.

Parameters
  • adata (anndata.AnnData) – Scanpy/anndata object

  • cluster_key (str) – The adata.obs column name that defines groups for finding distinguishing marker genes.

  • pval_cutoff (float, optional) – Only return markers that have an adjusted p-value below this threshold. Defaults to 0.05. Set to None to disable filtering.

  • log2fc_min (float, optional) – Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. Defaults to 0.25. Set to None to disable filtering.

  • key_added (str, optional) – The key in adata.uns under which results are saved, defaults to “rank_genes_groups”

  • test (str, optional) – Statistical test to use, defaults to “wilcoxon” (Wilcoxon rank-sum test), see scanpy.tl.rank_genes_groups documentation for other options

  • use_raw (bool, optional) – Use raw attribute of adata if present, defaults to True

Returns

Dataframe with ranked marker genes for each cluster. Important columns: gene, rank, [cluster_key] (same as argument value)

Return type

pandas.DataFrame

genetools.stats module

genetools.stats.accept_series(func)[source]

Decorator that lets a function accept a pandas Series in place of a numpy array and return its result with the original Series index.

genetools.stats.clr_normalize(mat, axis=0)[source]

Centered log ratio transformation for Cite-seq data, normalizing:

  • each protein’s count vectors across cells (axis=0, normalizing each column of the cells x proteins matrix, default)

  • or the antibody count vector for each cell (axis=1, normalizing each row of the cells x proteins matrix)

To use with anndata: genetools.scanpy_helpers.clr_normalize(adata, axis)

Notes:

This is almost the same as log(x) - 1/D * sum( log(x_i) ), which is the same as log(x) - log( [ product of x's ] ^ (1/D) ), where D = len(x).

The general definition is:

from scipy.stats.mstats import gmean
return np.log(x) - np.log(gmean(x))

But the geometric mean only applies to positive numbers (otherwise the inner product will be 0, and its log is undefined). So you want to use pseudocounts or drop 0 counts. That’s what Seurat’s modification does.

Unfortunately there is no single answer as to which axis to normalize over. In some cases, cell-based normalization fails. This is because cell-normalization makes an assumption that the total ADT counts should be constant across cells. That can become a significant issue if you have cell populations in your data, but did not add protein markers for them (this is also an issue for scRNA-seq, but is significantly mitigated because at least you measure many genes).

However, gene-based (here, protein-wise) normalization can fail when there is significant heterogeneity in sequencing depth or cell size. The optimal strategy depends on the antibody panel and the heterogeneity of your sample.

In this implementation, protein-wise is axis=0 and cell-wise is axis=1. The default is “protein-wise” (axis=0), i.e. normalize each protein independently, which matches Seurat’s default.

Parameters
  • mat (numpy array or scipy sparse matrix) – Counts matrix (cells x proteins)

  • axis (int, optional) – normalize each antibody independently (axis=0) or normalize each cell independently (axis=1), defaults to 0

Returns

Transformed counts matrix

Return type

numpy array
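
The general definition in the notes can be sketched in pure Python. This sketch assumes strictly positive counts and applies no pseudocount, so it is not the Seurat-style modification described above and is not the library's implementation. The helper names are hypothetical.

```python
import math

def clr(values):
    """Centered log ratio of one vector: log(x) minus log of the geometric mean."""
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)  # equals log(gmean(values))
    return [lv - mean_log for lv in logs]

def clr_matrix(mat, axis=0):
    """Apply clr to each column (axis=0, protein-wise) or each row (axis=1, cell-wise)."""
    if axis == 0:
        columns = [clr(col) for col in zip(*mat)]
        return [list(row) for row in zip(*columns)]  # transpose back
    return [clr(row) for row in mat]

# cells x proteins, strictly positive
counts = [[1.0, 10.0], [4.0, 10.0], [16.0, 10.0]]
by_protein = clr_matrix(counts, axis=0)
by_cell = clr_matrix(counts, axis=1)
print(by_protein)
```

Each normalized vector sums to zero, which is a quick sanity check on the axis choice.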

genetools.stats.coclustering(cluster_ids_1, cluster_ids_2)[source]

Compute coclustering percentage between two sets of cluster IDs for the same cells: Of the cell pairs clustered together by either or both methods, what percentage are clustered together by both methods?

(The clusters are allowed to have different names across methods, and don’t necessarily need to be ints.)

Parameters
  • cluster_ids_1 (numpy array-like) – One set of cluster IDs.

  • cluster_ids_2 (numpy array-like) – Another set of cluster IDs.

Returns

Percentage of cell pairs clustered together by at least one method that are clustered together by both methods.

Return type

float
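
The pair-counting definition can be sketched by brute force. The helper name is hypothetical. The sketch returns a fraction between 0 and 1; whether the library reports 0 to 1 or 0 to 100 is not stated here. The loop is quadratic in the number of cells, so it is only meant for small illustrative inputs.

```python
from itertools import combinations

def coclustering_fraction(cluster_ids_1, cluster_ids_2):
    """Hypothetical helper: pairs together in both / pairs together in either."""
    together_either = 0
    together_both = 0
    for i, j in combinations(range(len(cluster_ids_1)), 2):
        same_1 = cluster_ids_1[i] == cluster_ids_1[j]
        same_2 = cluster_ids_2[i] == cluster_ids_2[j]
        together_either += same_1 or same_2
        together_both += same_1 and same_2
    if together_either == 0:
        return float("nan")  # no pair is co-clustered by either method
    return together_both / together_either

# Cluster names differ between methods; only the groupings matter.
print(coclustering_fraction([0, 0, 1, 1], ["a", "a", "b", "c"]))  # 0.5
```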

genetools.stats.intersect_marker_genes(reference_data, query_data, low_confidence_threshold=0.035, low_confidence_suffix='?')[source]

Map cluster marker genes against reference lists to find top hits.

query_data and reference_data should both be dictionaries where:
  • keys are cluster names or IDs

  • values are lists of genes associated with that cluster

Or if you have a dataframe where each row contains a cluster ID and a gene name, you can convert to dict with df.groupby('cluster')['gene'].apply(list).to_dict()

Usage with an anndata/scanpy object on groups defined by adata.obs['louvain']:

# find marker genes for all clusters
cluster_markers_df = genetools.scanpy_helpers.find_all_markers(adata, cluster_key='louvain')

# convert to dict of clusters -> gene lists mapping
cluster_marker_lists = cluster_markers_df.groupby('louvain')['gene'].apply(list).to_dict()

# intersect with known marker gene lists
results, label_map, low_confidence_percentage = genetools.stats.intersect_marker_genes(reference_marker_lists, cluster_marker_lists)

# rename clusters in your anndata/scanpy object
adata.obs['louvain_annotated'] = adata.obs['louvain'].copy().cat.rename_categories(label_map)

Behavior:
  • Intersection scores are normalized to marker gene list sizes.

  • Resulting duplicate cluster names are renamed, ensuring that N original query clusters will map to N renamed clusters.

Parameters
  • reference_data (dict) – reference marker gene lists

  • query_data (dict) – query marker gene lists

  • low_confidence_threshold (float, optional) – Minimal difference between top and subsequent hits for a confident call, defaults to 0.035

  • low_confidence_suffix (str, optional) – Suffix for low-confidence cluster renamings, defaults to “?”

Returns

dataframe with cluster mapping details, a dictionary for renaming query cluster names, and percentage of low-confidence calls.

Return type

(pandas.DataFrame, dict, float) tuple
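
To make the low-confidence rule concrete, here is one plausible reading as a sketch. The scoring formula (overlap divided by reference list size) is an assumption, as are the helper names and the example gene lists; the library's actual normalization and duplicate renaming are not reproduced.

```python
def score_hits(query_genes, reference_data):
    """Hypothetical scoring: overlap size divided by reference list size."""
    query = set(query_genes)
    scores = {
        name: len(query & set(genes)) / len(set(genes))
        for name, genes in reference_data.items()
        if genes  # skip empty reference lists
    }
    # Best hit first; ties keep the reference dict order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def label_cluster(query_genes, reference_data,
                  low_confidence_threshold=0.035, low_confidence_suffix="?"):
    ranked = score_hits(query_genes, reference_data)
    top_name, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # A small gap between the top two hits marks the call as low confidence.
    if top_score - runner_up < low_confidence_threshold:
        return top_name + low_confidence_suffix
    return top_name

reference = {
    "T cell": ["CD3E", "CD3D", "IL7R", "CD2"],
    "B cell": ["CD79A", "MS4A1", "CD19", "CD79B"],
}
print(label_cluster(["CD3E", "CD3D", "IL7R", "GNLY"], reference))
print(label_cluster(["CD3E", "CD79A"], reference))
```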

genetools.stats.normalize_columns(df)[source]

Make columns sum to 1.

Parameters

df (pandas.DataFrame) – dataframe

Returns

column-normalized dataframe

Return type

pandas.DataFrame

genetools.stats.normalize_rows(df)[source]

Make rows sum to 1.

Parameters

df (pandas.DataFrame) – dataframe

Returns

row-normalized dataframe

Return type

pandas.DataFrame
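
Column and row normalization differ only in which sums are used as divisors. A pure-Python sketch on a list of lists is below; the helper names are hypothetical. With pandas, the equivalent expressions would be df / df.sum(axis=0) for columns and df.div(df.sum(axis=1), axis=0) for rows.

```python
def columns_sum_to_one(table):
    """Divide every value by its column total."""
    col_sums = [sum(col) for col in zip(*table)]
    return [[value / total for value, total in zip(row, col_sums)] for row in table]

def rows_sum_to_one(table):
    """Divide every value by its row total."""
    return [[value / sum(row) for value in row] for row in table]

table = [[1, 3], [2, 2]]
print(rows_sum_to_one(table))  # [[0.25, 0.75], [0.5, 0.5]]
print(columns_sum_to_one(table))
```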

genetools.stats.percentile_normalize(values)[source]

Percentile normalize.

Parameters

values (numpy.ndarray or pandas.Series) – values to normalize

Returns

percentile-normalized values

Return type

numpy.ndarray or pandas.Series

genetools.stats.rank_normalize(values)[source]

Rank normalize, starting with rank 1. All ranks must be unique.

Parameters

values (numpy.ndarray or pandas.Series) – values to normalize

Returns

rank-normalized values

Return type

numpy.ndarray or pandas.Series
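
A minimal sketch of ranking that starts at 1, with a hypothetical helper name. Ties are broken here by input position so that every rank is unique, which is an assumption about how the uniqueness requirement might be met.

```python
def rank_from_one(values):
    """Hypothetical helper: rank of each value, smallest value gets rank 1."""
    # Indices of the values in ascending order (stable for ties).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

print(rank_from_one([30, 10, 20]))  # [3, 1, 2]
```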

Module contents

Top-level package for genetools.