genetools package

Submodules

genetools.arrays module

genetools.arrays.convert_matrix_to_one_element_per_row(arr: ndarray) DataFrame[source]#

Record each element of a 2D matrix as one entry in a dataframe, storing the row and column IDs along with the value.
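As an illustration of the "one element per row" layout described above, here is a minimal pure-Python sketch. It is not the library's implementation (which takes an ndarray and returns a DataFrame); the function name and the keys row_id, col_id, and value are assumptions chosen for this example.

```python
def matrix_to_records(matrix):
    """Flatten a 2D matrix (list of rows) into one record per element."""
    return [
        {"row_id": i, "col_id": j, "value": value}
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
    ]


# A 2x2 matrix becomes four records, each carrying its row and column position.
records = matrix_to_records([[1, 2], [3, 4]])
```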

genetools.arrays.get_top_n(df: DataFrame, col: str, n: int) DataFrame[source]#

Get the top n entries of dataframe df, ranked by column col.

genetools.arrays.get_top_n_percent(df: DataFrame, col: str, fraction: float) DataFrame[source]#

Get the top fraction of entries of dataframe df, ranked by column col.

genetools.arrays.get_trim_both_sides_mask(a: ndarray | DataFrame, proportiontocut: float, axis: int = 0) ndarray[source]#

Returns a mask that applies a consistent trim-both-sides operation learned on one array.

Suppose you have a data array and a weights array, and you want to trimboth() the data array while keeping the element weights aligned.

Solution:

trimming_mask = genetools.arrays.get_trim_both_sides_mask(data, proportiontocut=0.1)
trimmed_data, trimmed_weights = data[trimming_mask], weights[trimming_mask]

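To make the idea of a reusable trimming mask concrete, here is a minimal pure-Python sketch for a 1D list. It is not the library's implementation. It assumes the number of elements cut from each end is rounded down, and the function name is hypothetical.

```python
def trim_both_sides_mask(values, proportiontocut):
    """Return a boolean mask that drops the lowest and highest values."""
    n = len(values)
    n_cut = int(proportiontocut * n)  # assumption: round down, cut this many per side
    # Indices ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    kept = set(order[n_cut : n - n_cut])
    return [i in kept for i in range(n)]


data = [50, 1, 3, 2, 4, 100, 5, 6, 7, 8]
weights = [0.5, 0.1, 0.3, 0.2, 0.4, 1.0, 0.5, 0.6, 0.7, 0.8]

mask = trim_both_sides_mask(data, proportiontocut=0.1)
# The same mask is applied to both lists, so data and weights stay aligned.
trimmed_data = [d for d, keep in zip(data, mask) if keep]
trimmed_weights = [w for w, keep in zip(weights, mask) if keep]
```

Because the mask is computed once from the data, the original element order is preserved and any other aligned array can be filtered the same way.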
genetools.arrays.groupby_apply_weighted_value_counts(df: DataFrame, *groupby_args, category_column_name: str, weight_column_name: str, normalize: bool = False, **groupby_kwargs) Series[source]#

This is to be used instead of:

(
    df.groupby(["columnA", "columnB"], observed=True)
    .apply(
        lambda grp: genetools.arrays.weighted_value_counts(
            grp,
            "category_column",
            "weight_column",
            normalize=True,
        )
    )
)

The preferred call is:

genetools.arrays.groupby_apply_weighted_value_counts(
    df,
    ["columnA", "columnB"],
    observed=True,
    category_column_name="category_column",
    weight_column_name="weight_column",
    normalize=True
)

It’s the same behavior with extra checks to make sure the output format has the expected shape: it should be a Series, like in standard groupby-value_counts.

(Sometimes Pandas returns a DataFrame instead of a Series.)

genetools.arrays.make_consensus_sequence(sequences: ndarray | Series | List[str], frequencies: ndarray | Series | List[int]) str[source]#

Get weighted mode for each character across a set of equal-length input strings.

genetools.arrays.make_consensus_vector(matrix: ndarray, frequencies: ndarray | List[int] | Series) ndarray[source]#

Get weighted mode for each position across a set of equal-length vectors.
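The consensus functions above can be pictured as a weighted mode taken column by column. The following pure-Python sketch illustrates that idea for strings. It is not the library's implementation; the function names are hypothetical, and ties are broken by whichever character is seen first, which is an assumption.

```python
from collections import defaultdict


def weighted_mode(values, weights):
    """Return the value with the largest total weight."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    # max() returns the first key encountered among ties.
    return max(totals, key=totals.get)


def consensus_sequence(sequences, frequencies):
    """Weighted mode at each position across equal-length strings."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must all have the same length")
    # zip(*sequences) yields one tuple of characters per position.
    return "".join(
        weighted_mode(column, frequencies) for column in zip(*sequences)
    )


# The first sequence carries most of the weight, so it wins at every position.
consensus = consensus_sequence(["ACGT", "ACGA", "TCGA"], [5, 1, 1])
```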

genetools.arrays.make_dummy_variables_in_specific_order(values: Series | List[str], expected_list: List[str], allow_missing_entries: bool) DataFrame[source]#

Create dummy variables in a defined order. All the values are confirmed to be in the “expected order” list. If an entry from the “expected order” list is not present, it is still included as a dummy variable with all 0s if allow_missing_entries is True; otherwise an error is raised.
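The behavior described above can be sketched in pure Python as follows. This is an illustration only, not the library's code: the function name is hypothetical, each output row is a dict rather than a DataFrame row, and the use of ValueError is an assumption.

```python
def dummy_variables_in_order(values, expected_list, allow_missing_entries):
    """One-hot encode values with columns in the order of expected_list."""
    present = set(values)
    unexpected = present - set(expected_list)
    if unexpected:
        raise ValueError(f"Unexpected values: {sorted(unexpected)}")
    missing = [category for category in expected_list if category not in present]
    if missing and not allow_missing_entries:
        raise ValueError(f"Missing expected entries: {missing}")
    # Dict insertion order follows expected_list, so column order is fixed.
    return [
        {category: int(value == category) for category in expected_list}
        for value in values
    ]


# "c" never occurs, so it becomes an all-zero column.
rows = dummy_variables_in_order(
    ["b", "a", "b"], ["a", "b", "c"], allow_missing_entries=True
)
```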

genetools.arrays.masked_argmax(masked_arr: MaskedArray, axis: int | None = None) ndarray | float | int[source]#

Argmax on a masked array. Returns NaN for any row/column (depending on the axis setting) that is all NaNs.

genetools.arrays.masked_argmin(masked_arr: MaskedArray, axis: int | None = None) ndarray | float | int[source]#

Argmin on a masked array. Returns NaN for any row/column (depending on the axis setting) that is all NaNs.

genetools.arrays.numeric_vectors_to_character_arrays(arr: ndarray) ndarray[source]#

Reverse operation of strings_to_numeric_vectors

genetools.arrays.strings_to_character_arrays(strs: ndarray | List[str] | Series, validate_equal_lengths: bool = True) ndarray[source]#

Create character matrix by “viewing” strings as 1-character string arrays, then reshaping

genetools.arrays.strings_to_numeric_vectors(strs: ndarray | List[str] | Series, validate_equal_lengths: bool = True) ndarray[source]#

Convert strings to numeric vectors (one entry per character)

genetools.arrays.weighted_median(values: ndarray, weights: ndarray) int | float[source]#

Weighted median: factor in the weights when finding center of the array.
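One common way to define a weighted median is to sort the values and return the first one at which the cumulative weight reaches half of the total weight. The sketch below uses that definition. It is an assumption about the convention, not the library's implementation, which may handle ties or interpolation differently.

```python
def weighted_median(values, weights):
    """Return the first sorted value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("values must not be empty")


# The heavily weighted value pulls the median toward itself.
heavy_tail = weighted_median([1, 2, 3, 100], [1, 1, 1, 10])
# With equal weights and an odd count, this matches the ordinary median.
plain = weighted_median([3, 1, 2], [1, 1, 1])
```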

genetools.arrays.weighted_mode(arr: list | ndarray | Series, weights: List[int] | ndarray | Series) Any[source]#

Get weighted mode (most common value) in array. Faster than sklearn.utils.extmath.weighted_mode but does not support axis vectorization.

genetools.arrays.weighted_value_counts(df: DataFrame, category_column_name: str, weight_column_name: str, normalize: bool = False, **groupby_kwargs) Series[source]#

Weighted value counts. Basically a sum of weight_column_name within each category_column_name group.

If normalize is True (default False), the returned value counts will sum to 1.

The optional groupby_kwargs are passed to the groupby procedure. For example, if category_column_name is a Categorical, passing observed=True will make sure that any unused categories are not included in the value counts.
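The "sum of weights within each category" idea can be sketched without pandas as follows. This is an illustration, not the library's implementation; the function here takes two parallel lists instead of a DataFrame and column names.

```python
from collections import defaultdict


def weighted_value_counts(categories, weights, normalize=False):
    """Sum the weights within each category; optionally scale to sum to 1."""
    totals = defaultdict(float)
    for category, weight in zip(categories, weights):
        totals[category] += weight
    if normalize:
        grand_total = sum(totals.values())
        return {category: total / grand_total for category, total in totals.items()}
    return dict(totals)


counts = weighted_value_counts(["x", "y", "x"], [1.0, 2.0, 3.0])
fractions = weighted_value_counts(["x", "y", "x"], [1.0, 2.0, 3.0], normalize=True)
```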

genetools.helpers module

Pandas/Numpy common recipes.

genetools.helpers.apply_pipeline_transforms(fit_pipeline: Pipeline, data: ndarray | DataFrame) ndarray | DataFrame[source]#

Apply all transformations in an already-fit sklearn Pipeline, except the final estimator.

Only use this for pipelines that have an estimator (e.g. a classifier) as the last step. In these cases, we cannot call pipeline.transform(), so use this method instead to apply all steps except the final classifier/regressor/estimator.

Otherwise, if you have a pipeline consisting only of transformers, just call .transform() on the pipeline.

genetools.helpers.barcode_split(obs_names, separator='-', colname_barcode='barcode', colname_library='library_id')[source]#

Split single cell barcodes such as ATGC-1 into a barcode column with value “ATGC” and a library ID column with value 1.

Recommended usage with scanpy:

adata.obs = genetools.helpers.horizontal_concat(
    adata.obs,
    genetools.helpers.barcode_split(adata.obs_names)
)

Parameters:
  • obs_names (pandas.Series or pandas.Index) – Cell barcodes with a library ID suffix.

  • separator (str, optional) – library ID separator, defaults to ‘-’

  • colname_barcode (str, optional) – output column name containing barcode without library ID suffix, defaults to ‘barcode’

  • colname_library (str, optional) – output column name containing library ID suffix as an int, defaults to ‘library_id’

Returns:

Two-column dataframe containing barcode prefix and library ID suffix.

Return type:

pandas.DataFrame
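For a single barcode string, the split described above can be sketched as follows. The helper name is hypothetical, and splitting on the last occurrence of the separator is an assumption rather than confirmed library behavior.

```python
def split_barcode(obs_name, separator="-"):
    """Split 'ATGC-1' into the barcode 'ATGC' and the integer library ID 1."""
    # rsplit with maxsplit=1 splits only at the last separator.
    barcode, library_id = obs_name.rsplit(separator, 1)
    return barcode, int(library_id)


barcode, library_id = split_barcode("ATGC-1")
```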

genetools.helpers.get_off_diagonal_values(arr)[source]#

Get off-diagonal values of a numpy 2d array as a flattened 1d array.

Parameters:

arr (numpy.ndarray) – input numpy 2d array

Returns:

flattened 1d array of non-diagonal values only

Return type:

numpy.ndarray
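As a small illustration of what "off-diagonal values" means, here is a pure-Python sketch on a nested list. The row-major output order is an assumption, and the library itself operates on a numpy array.

```python
def off_diagonal_values(matrix):
    """Return every element whose row index differs from its column index."""
    return [
        value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if i != j
    ]


# The diagonal entries 1, 5, and 9 are dropped.
values = off_diagonal_values([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```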

genetools.helpers.horizontal_concat(df_left, df_right)[source]#

Concatenate df_right horizontally to df_left, with no checks for whether the indexes match, but confirming final shape.

Parameters:
  • df_left (pandas.DataFrame or pandas.Series) – Left data

  • df_right (pandas.DataFrame or pandas.Series) – Right data

Returns:

Copied dataframe with df_right’s columns glued onto the right side of df_left’s columns

Return type:

pandas.DataFrame

genetools.helpers.make_slurm_command(script, job_name, log_path, env=None, options={}, job_group_name='', wrap_script=True)[source]#

Generate slurm sbatch command. Should be pipe-able straight to bash.

Automatic log filenames will take the format:
  • {{ log_path }}/{{ job_group_name (optional) }}/{{ job_name }}.out for stdout

  • {{ log_path }}/{{ job_group_name (optional) }}/{{ job_name }}.err for stderr

You can override automatic log filenames by manually supplying “output” and “error” values in the options dict.

Parameters:
  • script (str) – path to an executable script, or inline script (if wrap_script is True)

  • job_name (str) – job name, used for naming log files

  • log_path (str) – destination for log files.

  • env (dict, optional) – any environment variables to pass to script, defaults to None

  • options (dict, optional) – any CLI options for sbatch, defaults to {}

  • job_group_name (str, optional) – optional group name for this job and related jobs, used for naming log files, defaults to “”

  • wrap_script (bool, optional) – whether the script is inline as opposed to a file on disk, defaults to True

Returns:

an sbatch command

Return type:

str
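The log filename convention described above can be sketched as a small path helper. This illustrates only the naming scheme and the "output"/"error" overrides; it does not build the sbatch command, and the helper name is hypothetical.

```python
import posixpath


def resolve_log_paths(log_path, job_name, job_group_name="", options=None):
    """Return (stdout_path, stderr_path) following the naming scheme above."""
    options = options or {}
    # Skip the group directory when no group name is given.
    parts = [log_path, job_group_name] if job_group_name else [log_path]
    log_dir = posixpath.join(*parts)
    # Explicit "output" / "error" entries in options take precedence.
    stdout_path = options.get("output", posixpath.join(log_dir, f"{job_name}.out"))
    stderr_path = options.get("error", posixpath.join(log_dir, f"{job_name}.err"))
    return stdout_path, stderr_path


grouped = resolve_log_paths("logs", "align", job_group_name="batch1")
ungrouped = resolve_log_paths("logs", "align")
overridden = resolve_log_paths("logs", "align", options={"output": "custom.out"})
```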

genetools.helpers.merge_into_left(left, right, **kwargs)[source]#

Defensively merge right series or dataframe into left by index, preserving left’s index exactly. right data will be reordered to match left index.

Parameters:
  • left (pandas.DataFrame or pandas.Series) – left data whose index will be preserved

  • right (pandas.DataFrame or pandas.Series) – right data which will be reordered based on left index.

  • **kwargs – passed to pandas.merge

Returns:

left-merged DataFrame with left’s index

Return type:

pandas.DataFrame

genetools.helpers.parallel_groupby_apply(df_grouped: DataFrameGroupBy, func: Callable, **kwargs) Series[source]#

Parallelize apply() on a pandas groupby object.

Each subprocess is given one group to process. This approach isn’t appropriate if your applied function is very fast but you have many, many groups. In that scenario, the parallelization of groups will simply introduce a lot of unnecessary overhead. Make sure to benchmark with and without parallelization. You may want to first split the full dataframe into big chunks containing many groups, then run groupby-apply on each chunk in parallel.

Also, transferring big groups to subprocesses can be slow. Again consider chunking the dataframe first.

Func cannot be a lambda, since lambda functions can’t be pickled for subprocesses.

Kwargs are passed to joblib.Parallel(...)
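The lambda restriction above comes from how Python's standard pickle module serializes functions: by reference to an importable name, which a lambda lacks. The sketch below demonstrates this with the standard library only. The helper name is hypothetical, and some joblib backends use other serializers, so the exact behavior can depend on the backend.

```python
import pickle


def is_picklable(func):
    """Report whether the standard pickle module can serialize func."""
    try:
        pickle.dumps(func)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A function reachable by name (here the builtin len) pickles by reference.
named_ok = is_picklable(len)
# A lambda has no importable name, so standard pickling fails.
lambda_ok = is_picklable(lambda grp: grp)
```

In practice this means defining the applied function with def at module level before passing it as func.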

genetools.helpers.rename_duplicates(series, delim='-')[source]#

Rename duplicate values to be unique. ['a', 'a'] will become ['a', 'a-1'], for example.

Parameters:
  • series (pandas.Series) – series with values to rename

  • delim (str, optional) – delimiter before duplicate-number index, defaults to “-”

Returns:

series where original duplicates have been renamed to -1, -2, etc.

Return type:

pandas.Series
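The renaming rule above (['a', 'a'] becomes ['a', 'a-1']) can be sketched for a plain list as follows. This is an illustration, not the library's implementation, and it does not guard against a renamed value colliding with an existing one such as a literal 'a-1'.

```python
from collections import defaultdict


def rename_duplicates(values, delim="-"):
    """Append a running index to the second and later occurrences of a value."""
    seen = defaultdict(int)
    renamed = []
    for value in values:
        count = seen[value]
        # The first occurrence keeps its name; later ones get delim + index.
        renamed.append(value if count == 0 else f"{value}{delim}{count}")
        seen[value] += 1
    return renamed


renamed = rename_duplicates(["a", "a", "b", "a"])
```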

genetools.helpers.vertical_concat(df_top, df_bottom, reset_index=False)[source]#

Concatenate df_bottom vertically to df_top, with no checks for whether the columns match, but confirming final shape.

Parameters:
  • df_top (pandas.DataFrame) – Top data

  • df_bottom (pandas.DataFrame) – Bottom data

  • reset_index (bool, optional) – Reset index values after concat, defaults to False

Returns:

Copied dataframe with df_bottom’s rows glued onto the bottom of df_top’s rows

Return type:

pandas.DataFrame

genetools.palette module

class genetools.palette.HueValueStyle(color: str, marker: str | None = None, marker_size_scale_factor: float = 1.0, legend_size_scale_factor: float = 1.0, facecolors: str | None = None, edgecolors: str | None = None, linewidths: float | None = None, zorder: int = 1, alpha: float | None = None, hatch: str | None = None)[source]#

Bases: object

Describes how to style a particular value (category) of a categorical hue column.

Use palettes mapping hue values to HueValueStyles to make plots with different marker shapes, transparencies, z-orders, etc. for different groups.

The plotting functions accept a hue_key, which identifies a dataframe column that contains hue values. They also accept a palette mapping each hue value to a HueValueStyle that defines not just the color to use for that hue value, but also other styles:

  • Scatterplot marker shape, primary color, face color, edge color, line width, and transparency.

  • Rectangle/barplot color and hatch pattern.

  • Size scale factor for scatterplot markers and legend entries. (The palette of HueValueStyles is defined separately from choosing marker size, and can be plotted at any selected base marker size.)

Here’s an example of assigning a custom HueValueStyle to a hue value in a color palette. This defines a custom unfilled shape, a custom z-order, and more:

palette = {
    "group_A": genetools.palette.HueValueStyle(
        color=sns.color_palette("bright")[0],
        edgecolors=sns.color_palette("bright")[0],
        facecolors="none",
        marker="^",
        marker_size_scale_factor=1.5,
        linewidths=1.5,
        zorder=10,
    ),
    ...
}

For face and edge colors, None is the default value; to disable them, set to string 'none'.

alpha: float = None[source]#

apply_defaults(defaults: HueValueStyle)[source]#

Returns a new HueValueStyle in which any missing values of this style are filled with the values from another HueValueStyle (the defaults).

Use case: supply global style defaults for an entire scatterplot, then override with customizations in any individual hue value style.
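The "fill missing values from defaults" pattern can be sketched with a small dataclass. The Style class below is a hypothetical stand-in with only a few fields, and treating None as "missing" is an assumption about how HueValueStyle decides which values to fill.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class Style:
    """Hypothetical, reduced stand-in for a hue value style."""

    color: str
    marker: Optional[str] = None
    alpha: Optional[float] = None
    zorder: int = 1

    def apply_defaults(self, defaults: "Style") -> "Style":
        # Collect only the fields that are unset (None) on this style.
        missing = {
            f.name: getattr(defaults, f.name)
            for f in fields(self)
            if getattr(self, f.name) is None
        }
        # replace() builds a new object; the original is left untouched.
        return replace(self, **missing)


custom = Style(color="red", marker="^")
global_defaults = Style(color="black", marker="o", alpha=0.5)
merged = custom.apply_defaults(global_defaults)
```

Here the customized marker and color survive, while the unset alpha is taken from the global defaults.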

color: str[source]#

edgecolors: str = None[source]#

facecolors: str = None[source]#

classmethod from_color(s)[source]#

Construct from color string only; keep all other marker parameters set to defaults. If already a HueValueStyle, pass through without modification.

hatch: str = None[source]#

static huestyles_to_colors_dict(d: dict) dict[source]#

Cast any HueValueStyle values in dict to be color strings.

legend_size_scale_factor: float = 1.0[source]#

linewidths: float = None[source]#

marker: str = None[source]#

marker_size_scale_factor: float = 1.0[source]#

render_rectangle_props()[source]#

Returns kwargs to pass to ax.bar() to apply this style.

render_scatter_continuous_props(marker_size=None)[source]#

Returns kwargs to pass to ax.scatter() to apply this style, in the context of continuous cmap scatterplots.

render_scatter_legend_props()[source]#

Returns kwargs to pass to ax.legend() to apply this style.

render_scatter_props(marker_size=None)[source]#

Returns kwargs to pass to ax.scatter() to apply this style.

zorder: int = 1[source]#

genetools.palette.convert_palette_list_to_dict(palette, hue_names, sort_hues=True)[source]#

If palette is a list, convert it to a dict, assigning a color to each value in hue_names (with sort enabled by default).

If palette is already a dict, pass it through with no changes.
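A minimal sketch of the list-to-dict conversion described above might look like this. It is not the library's code: the function name is hypothetical, and raising ValueError when the palette is too short is an assumption.

```python
def palette_list_to_dict(palette, hue_names, sort_hues=True):
    """Assign one palette entry to each hue name; pass dicts through unchanged."""
    if isinstance(palette, dict):
        return palette
    hue_names = list(hue_names)
    if sort_hues:
        hue_names = sorted(hue_names)
    if len(palette) < len(hue_names):
        raise ValueError("palette has fewer colors than hue names")
    # Pair hue names with colors in order; extra colors are ignored.
    return dict(zip(hue_names, palette))


sorted_mapping = palette_list_to_dict(["red", "blue"], ["b", "a"])
unsorted_mapping = palette_list_to_dict(["red", "blue"], ["b", "a"], sort_hues=False)
```

Sorting the hue names first makes the color assignment independent of the order in which groups happen to appear in the data.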

genetools.plots module

genetools.plots.add_sample_size_to_labels(labels: list, data: DataFrame, hue_key: str) list[source]#

Add sample size to tick labels on any plot with categorical groups.

Sample size for each label is extracted from the hue_key column of dataframe data.

Pairs well with genetools.plots.wrap_tick_labels(ax).

Example usage:

ax.set_xticklabels(
    genetools.plots.add_sample_size_to_labels(
        ax.get_xticklabels(),
        df,
        "Group"
    )
)

Parameters:
  • labels (list) – list of tick labels corresponding to groups in data[hue_key]

  • data (pd.DataFrame) – dataset with categorical groups

  • hue_key (str) – column name specifying categorical groups in dataset data

Returns:

modified tick labels with group sample sizes attached

Return type:

list

genetools.plots.add_sample_size_to_legend(ax: Axes, data: DataFrame, hue_key: str) Axes[source]#

Add sample size to legend labels on any plot with categorical hues.

Sample size for each label is extracted from the hue_key column of dataframe data.

Example usage:

fig, ax = genetools.plots.scatterplot(
    data=df,
    x_axis_key="x",
    y_axis_key="y",
    hue_key="Group"
)
genetools.plots.add_sample_size_to_legend(
    ax=ax,
    data=df,
    hue_key="Group"
)

Parameters:
  • ax (matplotlib.axes.Axes) – matplotlib Axes for existing plot

  • data (pd.DataFrame) – dataset with categorical groups

  • hue_key (str) – column name specifying categorical groups in dataset data

Returns:

matplotlib Axes with modified legend labels with group sample sizes attached

Return type:

matplotlib.axes.Axes

genetools.plots.get_point_size(sample_size: int, maximum_size: float = 100) float[source]#

Get scatterplot point size based on sample size (from scanpy), capped at maximum_size.
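As a rough sketch of the idea, scanpy's scatterplots default to a point size of 120000 divided by the number of cells. Assuming this function follows the same heuristic (an assumption, not confirmed by this page), the capped version would look like:

```python
def point_size(sample_size, maximum_size=100.0):
    """Shrink points as the sample grows, but never exceed maximum_size."""
    # 120000 / n is the assumed scanpy-style heuristic.
    return min(120000 / sample_size, maximum_size)


small_sample = point_size(500)      # heuristic gives 240, so the cap applies
large_sample = point_size(10000)    # heuristic gives 12, below the cap
```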

genetools.plots.plot_color_and_size_dotplot(data: DataFrame, x_axis_key: str, y_axis_key: str, value_key: str, color_cmap: str | Colormap | None = None, color_and_size_vmin: float | None = None, color_and_size_vmax: float | None = None, color_and_size_vcenter: float | None = None, figsize: Tuple[float, float] | None = None, legend_text: str | None = None, extend_legend_to_vmin_vmax: bool = False, representative_values_for_legend: List[float] | None = None, min_marker_size: int = 1, marker_size_scale_factor: int = 100, grid: bool = True) Tuple[Figure, Axes][source]#

Plot dotplot heatmap showing a key as both color and size.

genetools.plots.plot_confusion_matrix(df: DataFrame, ax: Axes | None = None, figsize: Tuple[float, float] | None = None, outside_borders=True, inside_border_width=0.5, wrap_labels_amount: int | None = 15, wrap_x_axis_labels=True, wrap_y_axis_labels=True, draw_colorbar=False, cmap='Blues') Tuple[Figure, Axes][source]#

genetools.plots.plot_triangular_heatmap(df: DataFrame, cmap='Blues', colorbar_label='Value', figsize=(8, 6), vmin=None, vmax=None, annot=True, fmt='.2g') Tuple[Figure, Axes][source]#

Plot lower triangular heatmap.

Often followed with:

genetools.plots.wrap_tick_labels(
    ax, wrap_x_axis=True, wrap_y_axis=True, wrap_amount=10
)

genetools.plots.plot_two_key_color_and_size_dotplot(data: DataFrame, x_axis_key: str, y_axis_key: str, color_key: str, size_key: str, color_cmap: str | Colormap | None = None, color_vmin: float | None = None, color_vmax: float | None = None, color_vcenter: float | None = None, figsize: Tuple[float, float] | None = None, size_vmin: float | None = None, size_vmax: float | None = None, size_vcenter: float | None = None, extend_size_legend_to_vmin_vmax: bool = False, representative_sizes_for_legend: List[float] | None = None, inverse_size: bool = False, color_legend_text: str | None = None, size_legend_text: str | None = None, shared_legend_title: str | None = None, min_marker_size: int = 1, marker_size_scale_factor: int = 100, grid: bool = True) Tuple[Figure, Axes][source]#

Plot dotplot heatmap showing two keys together.

Example with mean and standard deviation: Circle color represents the mean. Circle size represents stability (inverse of standard deviation). Suggestions for this use case:

  • Pass mean key as color_key and standard deviation key as size_key.

  • Set inverse_size=True. Big circles are trustworthy/stable across the average, while little circles aren’t.

  • Set color_legend_text="Mean", size_legend_text="Inverse std. dev."

  • Set min_marker_size=20 so that the smallest circle for zero standard deviation is still visible

  • With a diverging colormap (e.g. color_cmap="RdBu_r", color_vcenter=0), bold circles are strong effects, while near-white circles are weak effects

genetools.plots.savefig(fig: Figure, *args, **kwargs)[source]#

Save figure with smart defaults:

  • Tight bounding box – necessary for legends outside of figure

  • Deterministic PDF output by fixing SOURCE_DATE_EPOCH to Jan 1, 2000

  • Editable text objects when outputting a vector PDF

Example usage: genetools.plots.savefig(fig, "my_plot.png", dpi=300).

Any positional or keyword arguments are passed to matplotlib.pyplot.savefig.

Parameters:

fig (matplotlib.figure.Figure) – Figure to save.

genetools.plots.scatterplot(data: DataFrame, x_axis_key: str, y_axis_key: str, hue_key: str | None = None, continuous_hue=False, continuous_cmap='viridis', discrete_palette: Dict[str, HueValueStyle | str] | List[HueValueStyle | str] | None = None, ax: Axes | None = None, figsize=(8, 8), marker_size=25, alpha: float = 1.0, na_color='lightgray', marker: str = 'o', marker_edge_color: str = 'none', marker_zorder: int = 1, marker_size_scale_factor: float = 1.0, legend_size_scale_factor: float = 1.0, marker_face_color: str | None = None, marker_linewidths: float | None = None, enable_legend=True, legend_hues: List[str] | None = None, legend_title: str | None = None, sort_legend_hues=True, autoscale=True, equal_aspect_ratio=False, plotnonfinite=False, remove_x_ticks=False, remove_y_ticks=False, tight_layout=True, despine=True, **kwargs) Tuple[Figure, Axes][source]#

Scatterplot colored by a discrete or continuous “hue” grouping variable.

For discrete hues, pass continuous_hue = False and a dictionary of colors and/or HueValueStyle objects in discrete_palette.

Figure size will grow beyond the figsize parameter setting, because the legend is pulled out of figure. So you must use fig.savefig('filename', bbox_inches='tight'). This is provided automatically by genetools.plots.savefig(fig, 'filename').

If using with scanpy, to join umap data from adata.obsm with other plot data in adata.obs, try:

data = adata.obs.assign(umap_1=adata.obsm["X_umap"][:, 0], umap_2=adata.obsm["X_umap"][:, 1])

If hue_key = None, then all points will be colored by na_color and styled with parameters alpha, marker, marker_size, marker_zorder, and marker_edge_color. The legend will be disabled.

Parameters:
  • data (pandas.DataFrame) – Input data, e.g. anndata.obs

  • x_axis_key (str) – Column name to plot on X axis

  • y_axis_key (str) – Column name to plot on Y axis

  • hue_key (str, optional) – Column name with hue groups that will be used to color points. defaults to None to color all points consistently.

  • continuous_hue (bool, optional) – Whether the hue column takes continuous or discrete/categorical values, defaults to False.

  • continuous_cmap (str, optional) – Colormap to use for plotting continuous hue grouping variable, defaults to “viridis”

  • discrete_palette (Union[ Dict[str, Union[HueValueStyle, str]], List[Union[HueValueStyle, str]] ], optional) – Palette of colors and/or HueValueStyle objects to use for plotting discrete/categorical hue groups, defaults to None. Supply a matplotlib palette name, list of colors, or dict mapping hue values to colors or to HueValueStyle objects (or a mix of the two).

  • ax (matplotlib.axes.Axes, optional) – Existing matplotlib Axes to plot on, defaults to None

  • figsize (tuple, optional) – Size of figure to generate if no existing ax was provided, defaults to (8, 8)

  • marker_size (int, optional) – Base marker size. May be scaled by individual HueValueStyles. Defaults to 25

  • alpha (float, optional) – Default point transparency, unless overridden by a HueValueStyle, defaults to 1.0

  • na_color (str, optional) – Fallback color to use for discrete hue categories that do not have an assigned style in discrete_palette, defaults to “lightgray”

  • marker (str, optional) – Default marker style, unless overridden by a HueValueStyle, defaults to “o”. For plots with many points, try “.” instead.

  • marker_edge_color (str, optional) – Default marker edge color, unless overridden by a HueValueStyle, defaults to “none” (no edge border drawn). Another common choice is “face”, so the edge color matches the face color.

  • marker_zorder (int, optional) – Default marker z-order, unless overridden by a HueValueStyle, defaults to 1

  • marker_size_scale_factor (float, optional) – Default marker size scale factor, unless overridden by a HueValueStyle, defaults to 1.0

  • legend_size_scale_factor (float, optional) – Default legend size scale factor, unless overridden by a HueValueStyle, defaults to 1.0

  • marker_face_color (str, optional) – Default marker face color, unless overridden by a HueValueStyle, defaults to None (uses point color).

  • marker_linewidths (float, optional) – Default marker line widths, unless overridden by a HueValueStyle, defaults to None

  • enable_legend (bool, optional) – Whether legend (or colorbar if continuous_hue) should be drawn. Defaults to True. May want to disable if plotting multiple subplots/panels.

  • legend_hues (list, optional) – Optionally override the list of hue values to include in legend, e.g. to add any hue values missing from the plotted subset of data; defaults to None

  • legend_title (str, optional) – Specify a title for the legend. Defaults to None, in which case the hue_key is used.

  • sort_legend_hues (bool, optional) – Enable sorting of legend hues, defaults to True

  • autoscale (bool, optional) – Enable automatic zoom in, defaults to True

  • equal_aspect_ratio (bool, optional) – Plot with equal aspect ratio, defaults to False

  • plotnonfinite (bool, optional) – For continuous hues, whether to plot points with inf or nan value, defaults to False

  • remove_x_ticks (bool, optional) – Remove X axis tick marks and labels, defaults to False

  • remove_y_ticks (bool, optional) – Remove Y axis tick marks and labels, defaults to False

  • tight_layout (bool, optional) – whether to format the figure with tight_layout, defaults to True

  • despine (bool, optional) – whether to despine (remove the top and right figure borders), defaults to True

Raises:

ValueError – Must specify correct number of colors if supplying a custom palette

Returns:

Matplotlib Figure and Axes

Return type:

Tuple[matplotlib.figure.Figure, matplotlib.axes.Axes]

genetools.plots.stacked_bar_plot(data, index_key, hue_key, value_key: str | None = None, ax: Axes | None = None, figsize=(8, 8), normalize=True, vertical=False, palette: Dict[str, HueValueStyle | str] | List[HueValueStyle | str] | None = None, na_color='lightgray', hue_order=None, axis_label='Frequency', enable_legend=True, legend_title=None) Tuple[Figure, Axes][source]#

Stacked bar chart.

The index_key groups form the bars, and the hue_key groups subdivide the bars. The value_key determines the subdivision sizes, and is computed automatically if not provided.

See https://observablehq.com/@d3/stacked-normalized-horizontal-bar for inspiration and colors.

Figure size will grow beyond the figsize parameter setting, because the legend is pulled out of figure. So you must use fig.savefig('filename', bbox_inches='tight'). This is provided automatically by genetools.plots.savefig(fig, 'filename').

Parameters:
  • data (pandas.DataFrame) – Plot data containing at minimum the columns identified by index_key, hue_key, and optionally value_key.

  • index_key (str) – Column name defining the rows.

  • hue_key (str) – Column name defining the horizontal bar categories.

  • value_key (str, optional) – Column name defining the bar sizes. If not supplied, this method will calculate group frequencies automatically.

  • ax (matplotlib.axes.Axes, optional) – Existing matplotlib Axes to plot on, defaults to None

  • figsize (tuple, optional) – Size of figure to generate if no existing ax was provided, defaults to (8, 8)

  • normalize (bool, optional) – Normalize each row’s frequencies to sum to 1, defaults to True

  • vertical (bool, optional) – Plot stacked bars vertically, defaults to False (horizontal)

  • palette (Union[ Dict[str, Union[HueValueStyle, str]], List[Union[HueValueStyle, str]] ], optional) – Palette of colors and/or HueValueStyle objects to style the bars corresponding to each hue value, defaults to None (in which case default palette used). Supply a matplotlib palette name, list of colors, or dict mapping hue values to colors or to HueValueStyle objects (or a mix of the two).

  • na_color (str, optional) – Fallback color to use for hue values that do not have an assigned style in palette, defaults to “lightgray”

  • hue_order (list, optional) – Optionally specify order of bar subdivisions. This order is applied from the beginning (bottom or left) to the end (top or right) of the bar. Defaults to None

  • axis_label (str, optional) – Label for the axis along which the frequency values are drawn, defaults to “Frequency”

  • enable_legend (bool, optional) – Whether legend should be drawn. Defaults to True. May want to disable if plotting multiple subplots/panels.

  • legend_title (str, optional) – Specify a title for the legend. Defaults to None, in which case the hue_key is used.

Raises:

ValueError – Must specify correct number of colors if supplying a custom palette

Returns:

Matplotlib Figure and Axes

Return type:

(matplotlib.figure.Figure, matplotlib.axes.Axes)

genetools.plots.superimpose_group_labels(ax: Axes, data: DataFrame, x_axis_key: str, y_axis_key: str, label_key: str, label_z_order=100, label_color='k', label_alpha=0.8, label_size=15) Axes[source]#

Add group (cluster) labels to existing plot.

Parameters:
  • ax (matplotlib.axes.Axes) – matplotlib Axes for existing plot

  • data (pd.DataFrame) – dataset containing the x_axis_key, y_axis_key, and label_key columns

  • x_axis_key (str) – Column name to plot on X axis

  • y_axis_key (str) – Column name to plot on Y axis

  • label_key (str) – Column name specifying categorical group text labels to superimpose on plot

  • label_z_order (int, optional) – Z-index for superimposed group text labels, defaults to 100

  • label_color (str, optional) – Color for superimposed group text labels, defaults to “k”

  • label_alpha (float, optional) – Opacity for superimposed group text labels, defaults to 0.8

  • label_size (int, optional) – Text size of superimposed group labels, defaults to 15

Returns:

matplotlib Axes with superimposed group labels

Return type:

matplotlib.axes.Axes

genetools.plots.two_class_relative_density_plot(data: DataFrame, x_key: str, y_key: str, hue_key: str, positive_class: str, colorbar_label: str | None = None, quantile: float | None = 0.5, figsize=(8, 8), n_bins=50, range=None, continuous_cmap: str = 'RdBu_r', cmap_vcenter: float | None = 0.5, balanced_class_weights=True) Tuple[Figure, Axes, str][source]#

Two-class relative density plot. For alternatives, see contour KDEs in seaborn’s displot function. (For general 2D density plots, see plt.hexbin, sns.jointplot, and plt.hist2d.)

genetools.plots.wrap_tick_labels(ax: Axes, wrap_x_axis=True, wrap_y_axis=True, wrap_amount=20, break_characters=['/']) Axes[source]#

Add text wrapping to tick labels on x and/or y axes on any plot.

May override existing line breaks in tick labels.

Parameters:
  • ax (matplotlib.axes.Axes) – existing plot with tick labels to be wrapped

  • wrap_x_axis (bool, optional) – whether to wrap x-axis tick labels, defaults to True

  • wrap_y_axis (bool, optional) – whether to wrap y-axis tick labels, defaults to True

  • wrap_amount (int, optional) – length of each line of text, defaults to 20

  • break_characters (list, optional) – characters at which to encourage breaking text into lines, defaults to [‘/’]. Set to None or [] to disable.

Returns:

plot with modified tick labels

Return type:

matplotlib.axes.Axes
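For a single label string, wrapping with preferred break characters could be sketched with the standard textwrap module as below. This is only an illustration: the helper name is hypothetical, and inserting a space after each break character to create a wrap opportunity is an assumption, not necessarily how the library does it.

```python
import textwrap


def wrap_label(text, wrap_amount=20, break_characters=("/",)):
    """Wrap one label, allowing line breaks after the given characters."""
    for character in break_characters or []:
        # A following space gives textwrap a place to break the line.
        text = text.replace(character, character + " ")
    # textwrap.fill also collapses existing newlines into spaces, which
    # matches the note above about overriding existing line breaks.
    return textwrap.fill(text, width=wrap_amount)


wrapped = wrap_label("T cells/NK cells/monocytes", wrap_amount=10)
```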

genetools.scanpy_helpers module

Scanpy common recipes.

genetools.scanpy_helpers.clr_normalize(adata, axis=0, inplace=True)[source]#

Centered log ratio transformation for CITE-seq data, normalizing:

  • each protein’s count vectors across cells (axis=0, normalizing each column of the cells x proteins matrix, default)

  • or the antibody count vector for each cell (axis=1, normalizing each row of the cells x proteins matrix)

This is a wrapper of genetools.stats.clr_normalize(matrix, axis).

Parameters:
  • adata (anndata.AnnData) – Protein counts anndata

  • axis (int, optional) – normalize each antibody independently (axis=0) or normalize each cell independently (axis=1), defaults to 0

  • inplace (bool, optional) – whether to modify input anndata, defaults to True

Returns:

Transformed anndata

Return type:

anndata.AnnData

genetools.scanpy_helpers.find_all_markers(adata, cluster_key, pval_cutoff=0.05, log2fc_min=0.25, key_added='rank_genes_groups', test='wilcoxon', use_raw=True)[source]#

Find differentially expressed marker genes for each group of cells.

Parameters:
  • adata (anndata.AnnData) – Scanpy/anndata object

  • cluster_key (str) – The adata.obs column name that defines groups for finding distinguishing marker genes.

  • pval_cutoff (float, optional) – Only return markers that have an adjusted p-value below this threshold. Defaults to 0.05. Set to None to disable filtering.

  • log2fc_min (float, optional) – Limit testing to genes which show, on average, at least X-fold difference (log-scale) between the two groups of cells. Defaults to 0.25. Set to None to disable filtering.

  • key_added (str, optional) – The key in adata.uns under which the results are saved, defaults to “rank_genes_groups”

  • test (str, optional) – Statistical test to use, defaults to “wilcoxon” (Wilcoxon rank-sum test), see scanpy.tl.rank_genes_groups documentation for other options

  • use_raw (bool, optional) – Use raw attribute of adata if present, defaults to True

Returns:

Dataframe with ranked marker genes for each cluster. Important columns: gene, rank, [cluster_key] (same as argument value)

Return type:

pandas.DataFrame

genetools.scanpy_helpers.pca_anndata(adata: AnnData, pca_transformer: PCA | IncrementalPCA | None = None, n_components=None, inplace=True, **kwargs) Tuple[AnnData, PCA | IncrementalPCA][source]#

PCA anndata, like with scanpy.pp.pca. Accepts pre-computed PCA transformer, so you can apply the same PCA to multiple anndatas.

Args:

  • pca_transformer: pre-defined preprocessing transformer to run PCA on adata.X

  • n_components: number of PCA components

  • inplace: whether to modify input adata in place

Returns: adata, pca_transformer

genetools.scanpy_helpers.pca_train_and_test_anndatas(adata_train, adata_test=None, n_components=None, inplace=True, **kwargs)[source]#

PCA train anndata (like with scanpy.pp.pca), then apply same PCA to test anndata – as opposed to PCAing them independently.

If adata_test isn’t supplied, this just runs PCA on adata_train alone.
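
The fit-on-train, apply-to-test idea can be illustrated without anndata. The following is a minimal sketch using a plain numpy SVD on hypothetical random matrices; it is not the library's implementation (pca_anndata's signature references scikit-learn's PCA and IncrementalPCA), and all variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data: 50 training cells and 10 test cells, 4 features each.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 4))
X_test = rng.normal(size=(10, 4))

# Fit PCA on the training matrix only: center, then take the top right singular vectors.
train_mean = X_train.mean(axis=0)
_, _, vt = np.linalg.svd(X_train - train_mean, full_matrices=False)
components = vt[:2]  # keep 2 principal components

# Apply the same centering and rotation to both matrices.
train_pcs = (X_train - train_mean) @ components.T
test_pcs = (X_test - train_mean) @ components.T
```

Because the test matrix is centered with the training mean and projected onto the training components, both sets of coordinates live in the same space.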

genetools.scanpy_helpers.scale_anndata(adata: AnnData, scale_transformer: StandardScaler | None = None, inplace=False, set_raw=False, **kwargs) Tuple[AnnData, StandardScaler][source]#

Scale anndata, like with scanpy.pp.scale. Accepts pre-computed StandardScaler preprocessing transformer, so you can apply the same scaling to multiple anndatas.

Args:

  • scale_transformer: pre-defined preprocessing transformer to scale adata.X

  • inplace: whether to modify input adata in place

  • set_raw: whether to set adata.raw equal to input adata

Returns: adata, scale_transformer

genetools.scanpy_helpers.scale_train_and_test_anndatas(adata_train, adata_test=None, inplace=False, set_raw=False, **kwargs)[source]#

Scale train anndata (like with scanpy.pp.scale), then apply same scaling to test anndata – as opposed to scaling them independently.

If adata_test isn’t supplied, this just scales adata_train independently.
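
To see why sharing the scaling matters, here is a small numpy-only sketch. It assumes standard scaling means subtracting the per-feature mean and dividing by the per-feature standard deviation, as scikit-learn's StandardScaler does by default; the matrices and variable names are hypothetical.

```python
import numpy as np

# Hypothetical data: 100 training cells and 20 test cells, 3 features each.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Learn the scaling parameters from the training matrix only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same parameters to both matrices.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

The training features end up with mean 0 and standard deviation 1, while the test features are only approximately standardized, since their scaling was learned elsewhere.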

genetools.scanpy_helpers.umap_anndata(adata, umap_transformer=None, n_neighbors: int | None = None, n_components: int | None = None, inplace=True, use_rapids=False, use_pca=False, **kwargs)[source]#

UMAP anndata, like with scanpy.tl.umap. Accepts pre-computed UMAP transformer, so you can apply the same UMAP to multiple anndatas.

Args:

  • umap_transformer: pre-defined preprocessing transformer to run UMAP on adata.X

  • n_components: number of UMAP components

  • inplace: whether to modify input adata in place

Anndata should already be scaled.

Returns: adata, umap_transformer

genetools.scanpy_helpers.umap_train_and_test_anndatas(adata_train, adata_test=None, n_neighbors: int | None = None, n_components: int | None = None, inplace=True, use_rapids=False, use_pca=False, **kwargs)[source]#

UMAP train anndata (like with scanpy.tl.umap), then apply same UMAP to test anndata – as opposed to running UMAP on them independently.

If adata_test isn’t supplied, this just runs UMAP on adata_train alone.

genetools.stats module#

class genetools.stats.ConfusionMatrixValues(true_positives: float, false_negatives: float, false_positives: float, true_negatives: float)[source]#

Bases: object

false_negatives: float[source]#
false_positives: float[source]#
true_negatives: float[source]#
true_positives: float[source]#

genetools.stats.accept_series(func)[source]#

Decorator that lets a function accept a pandas Series in place of a numpy array and return its result with the original Series index.

genetools.stats.clr_normalize(mat, axis=0)[source]#

Centered log ratio transformation for Cite-seq data, normalizing:

  • each protein’s count vectors across cells (axis=0, normalizing each column of the cells x proteins matrix, default)

  • or the antibody count vector for each cell (axis=1, normalizing each row of the cells x proteins matrix)

To use with anndata: genetools.scanpy_helpers.clr_normalize(adata, axis)

Notes:

This is almost the same as log(x) - 1/D * log(product of x's), which is the same as log(x) - log( [product of x's] ^ (1/D) ), where D = len(x).

The general definition is:

import numpy as np
from scipy.stats.mstats import gmean
def clr(x):  # log of each value minus log of the geometric mean
    return np.log(x) - np.log(gmean(x))

But the geometric mean only applies to positive numbers (a single zero count makes the inner product 0). So you need to add pseudocounts or drop zero counts. That’s what Seurat’s modification does.

Unfortunately there is not a single answer. In some cases, cell-based normalization fails. This is because cell-normalization makes an assumption that the total ADT counts should be constant across cells. That can become a significant issue if you have cell populations in your data, but did not add protein markers for them (this is also an issue for scRNA-seq, but is significantly mitigated because at least you measure many genes).

However, gene-based normalization can fail when there is significant heterogeneity in sequencing depth, or cell size. The optimal strategy depends on the AB panel, and heterogeneity of your sample.

In this implementation, protein-wise is axis=0 and cell-wise is axis=1. The default is “protein-wise” (axis=0), i.e. normalize each protein independently, which matches Seurat’s default.

Parameters:
  • mat (numpy array or scipy sparse matrix) – Counts matrix (cells x proteins)

  • axis (int, optional) – normalize each antibody independently (axis=0) or normalize each cell independently (axis=1), defaults to 0

Returns:

Transformed counts matrix

Return type:

numpy array
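
As an illustration of the two axis choices described above, here is a minimal textbook-style sketch. It assumes a pseudocount of 1 to handle zero counts and is not necessarily the exact formula this function uses; clr_sketch and its pseudocount parameter are hypothetical names.

```python
import numpy as np

def clr_sketch(mat, axis=0, pseudocount=1.0):
    """Textbook CLR: log(x) minus the mean of log(x) along `axis` (hypothetical helper)."""
    # The pseudocount is an assumption made here so that zero counts have a finite log.
    log_mat = np.log(np.asarray(mat, dtype=float) + pseudocount)
    # Subtracting the mean of the logs equals subtracting the log of the geometric mean.
    return log_mat - log_mat.mean(axis=axis, keepdims=True)

# Toy counts matrix: 3 cells (rows) x 2 proteins (columns).
counts = np.array([[0, 10], [5, 20], [50, 40]])

protein_wise = clr_sketch(counts, axis=0)  # each column is centered across cells
cell_wise = clr_sketch(counts, axis=1)     # each row is centered across proteins
```

With axis=0 every protein column sums to zero after the transform; with axis=1 every cell row does.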

genetools.stats.coclustering(cluster_ids_1, cluster_ids_2)[source]#

Compute coclustering percentage between two sets of cluster IDs for the same cells: Of the cell pairs clustered together by either or both methods, what percentage are clustered together by both methods?

(The clusters are allowed to have different names across methods, and don’t necessarily need to be ints.)

Parameters:
  • cluster_ids_1 (numpy array-like) – One set of cluster IDs.

  • cluster_ids_2 (numpy array-like) – Another set of cluster IDs.

Returns:

Percentage of cell pairs clustered together by one or both methods that are also clustered together by the other method.

Return type:

float
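
The definition above can be spelled out as a brute-force sketch over all cell pairs. This is a hypothetical pure-Python helper written from the description, not the library's implementation, and it returns a fraction between 0 and 1; whether the library reports a fraction or a value out of 100 is not stated here.

```python
from itertools import combinations

def coclustering_sketch(cluster_ids_1, cluster_ids_2):
    """Fraction of co-clustered cell pairs shared by both clusterings (hypothetical helper)."""
    together_either = 0
    together_both = 0
    for i, j in combinations(range(len(cluster_ids_1)), 2):
        same_1 = cluster_ids_1[i] == cluster_ids_1[j]
        same_2 = cluster_ids_2[i] == cluster_ids_2[j]
        if same_1 or same_2:
            together_either += 1
            if same_1 and same_2:
                together_both += 1
    # If no pair is ever clustered together, the ratio is undefined.
    if together_either == 0:
        return float("nan")
    return together_both / together_either

# Cluster names differ between methods and need not be ints.
score = coclustering_sketch(["a", "a", "b", "b"], [1, 1, 1, 2])
```

In this toy example four pairs are clustered together by at least one method and one pair by both, so the score is 0.25.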

genetools.stats.interpolate_prc(y_true, y_score, n_points=1000)[source]#

Interpolate precision-recall curve.

Returns:

  • interpolated recall

  • interpolated precision

  • original recall

  • original precision

To plot:

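import matplotlib.pyplot as plt
from genetools.stats import interpolate_prc

# y_true and y_score are your labels and scores; unpack in the order listed under "Returns"
interpolated_recall, interpolated_precision, recall, precision = interpolate_prc(y_true, y_score)
fig, ax = plt.subplots(figsize=(4, 4))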
plt.plot(
    recall,
    precision,
    label="Real",
    color="k",
    alpha=0.5,
    drawstyle="steps-post",
)
plt.plot(
    interpolated_recall,
    interpolated_precision,
    label="Interpolated",
    color="r",
    alpha=0.5,
    drawstyle="steps-post",
)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(bbox_to_anchor=(1, 1))

genetools.stats.interpolate_roc(y_true, y_score, n_points=1000)[source]#

Interpolate receiver operating characteristic curve.

Returns:

  • interpolated FPR

  • interpolated TPR

  • original FPR

  • original TPR

To plot:

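import matplotlib.pyplot as plt
from genetools.stats import interpolate_roc

# y_true and y_score are your labels and scores; unpack in the order listed under "Returns"
interpolated_fpr, interpolated_tpr, fpr, tpr = interpolate_roc(y_true, y_score)
fig, ax = plt.subplots(figsize=(4, 4))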
plt.plot(
    fpr,
    tpr,
    label="Real",
    color="k",
    alpha=0.5,
    drawstyle="steps-post",
)
plt.plot(
    interpolated_fpr,
    interpolated_tpr,
    label="Interpolated",
    color="r",
    alpha=0.5,
    drawstyle="steps-post",
)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend(bbox_to_anchor=(1, 1))
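
For intuition about what interpolating a curve onto a fixed number of points means, here is a minimal sketch that resamples a hand-made ROC curve onto an evenly spaced FPR grid with linear interpolation. The interpolation scheme is an assumption made for illustration; the library may interpolate differently.

```python
import numpy as np

# Hypothetical ROC points, with FPR strictly increasing.
fpr = np.array([0.0, 0.2, 0.6, 1.0])
tpr = np.array([0.0, 0.6, 0.8, 1.0])

# Resample onto an evenly spaced grid (6 points here; the function's n_points plays this role).
interpolated_fpr = np.linspace(0.0, 1.0, 6)
interpolated_tpr = np.interp(interpolated_fpr, fpr, tpr)
```

A shared grid like this makes it possible to average or compare curves from different models point by point.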

genetools.stats.intersect_marker_genes(reference_data, query_data, low_confidence_threshold=0.035, low_confidence_suffix='?')[source]#

Map cluster marker genes against reference lists to find top hits.

query_data and reference_data should both be dictionaries where:
  • keys are cluster names or IDs

  • values are lists of genes associated with that cluster

Or if you have a dataframe where each row contains a cluster ID and a gene name, you can convert to dict with df.groupby('cluster')['gene'].apply(list).to_dict()

Usage with an anndata/scanpy object on groups defined by adata.obs['louvain']:

# find marker genes for all clusters
cluster_markers_df = genetools.scanpy_helpers.find_all_markers(adata, cluster_key='louvain')

# convert to dict of clusters -> gene lists mapping
cluster_marker_lists = cluster_markers_df.groupby('louvain')['gene'].apply(list).to_dict()

# intersect with known marker gene lists
results, label_map, low_confidence_percentage = genetools.stats.intersect_marker_genes(reference_marker_lists, cluster_marker_lists)

# rename clusters in your anndata/scanpy object
adata.obs['louvain_annotated'] = adata.obs['louvain'].copy().cat.rename_categories(label_map)

Behavior:
  • Intersection scores are normalized to marker gene list sizes.

  • Resulting duplicate cluster names are renamed, ensuring that N original query clusters will map to N renamed clusters.

Parameters:
  • reference_data (dict) – reference marker gene lists

  • query_data (dict) – query marker gene lists

  • low_confidence_threshold (float, optional) – Minimal difference between top and subsequent hits for a confident call, defaults to 0.035

  • low_confidence_suffix (str, optional) – Suffix for low-confidence cluster renamings, defaults to “?”

Returns:

dataframe with cluster mapping details, a dictionary for renaming query cluster names, and percentage of low-confidence calls.

Return type:

(pandas.DataFrame, dict, float) tuple
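
To make the roles of low_confidence_threshold and low_confidence_suffix concrete, here is a simplified sketch of one plausible scoring scheme. It assumes the score is the overlap size divided by the reference list size, which may differ from the library's actual normalization; score_cluster and call_label are hypothetical helpers.

```python
def score_cluster(query_genes, reference_data):
    """Rank reference clusters by overlap with one query gene list (hypothetical helper)."""
    query = set(query_genes)
    scores = {
        name: len(query & set(genes)) / len(set(genes))  # assumed normalization
        for name, genes in reference_data.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def call_label(ranked, low_confidence_threshold=0.035, low_confidence_suffix="?"):
    """Pick the top hit, flagging it when the runner-up scores almost as well."""
    top_name, top_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_score - runner_up_score < low_confidence_threshold:
        return top_name + low_confidence_suffix
    return top_name

reference = {
    "T cell": ["CD3D", "CD3E", "CD2", "IL7R"],
    "B cell": ["CD79A", "MS4A1", "CD19", "CD79B"],
}
confident = call_label(score_cluster(["CD3D", "CD3E", "IL7R", "GNLY"], reference))
ambiguous = call_label(score_cluster(["CD3D", "CD79A"], reference))
```

The first query overlaps only the T cell list and gets a plain label; the second overlaps both lists equally, so its label carries the low-confidence suffix.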

genetools.stats.make_confusion_matrix(y_true: ndarray | list, y_pred: ndarray | list, true_label: str, pred_label: str, label_order: List[str] | None = None) DataFrame[source]#

Make a confusion matrix. Pairs with genetools.plots.plot_confusion_matrix.

genetools.stats.normalize_columns(df)[source]#

Make columns sum to 1. If a column is all zeroes, this will return NaNs for the column entries, since there is no way to make the column sum to 1.

Parameters:

df (pandas.DataFrame) – dataframe

Returns:

column-normalized dataframe

Return type:

pandas.DataFrame

genetools.stats.normalize_rows(df)[source]#

Make rows sum to 1. If a row is all zeroes, this will return NaNs for the row entries, since there is no way to make the row sum to 1.

Parameters:

df (pandas.DataFrame) – dataframe

Returns:

row-normalized dataframe

Return type:

pandas.DataFrame
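
The all-zero behavior described above follows directly from dividing by a zero sum. Here is a small numpy sketch of row normalization on a hypothetical matrix; it illustrates the arithmetic only and is not the pandas-based implementation.

```python
import numpy as np

# Hypothetical matrix whose middle row is all zeroes.
mat = np.array([[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]])

# Divide each row by its sum; 0 / 0 yields NaN, so silence the floating-point warnings.
with np.errstate(invalid="ignore", divide="ignore"):
    row_normalized = mat / mat.sum(axis=1, keepdims=True)
```

Rows with a nonzero total now sum to 1, and the all-zero row becomes NaN. Column normalization is the same operation with axis=0.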

genetools.stats.percentile_normalize(values)[source]#

Percentile normalize.

Parameters:

values (numpy.ndarray or pandas.Series) – values to normalize

Returns:

percentile-normalized values

Return type:

numpy.ndarray or pandas.Series

genetools.stats.rank_normalize(values)[source]#

Rank normalize, starting with rank 1. All ranks must be unique.

Parameters:

values (numpy.ndarray or pandas.Series) – values to normalize

Returns:

rank-normalized values

Return type:

numpy.ndarray or pandas.Series
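
A minimal sketch of ranking that starts at 1 and yields unique ranks, using a double argsort on a hypothetical array. This shows one way to get such ranks and is not necessarily how the library computes them.

```python
import numpy as np

values = np.array([0.3, 0.1, 0.9])

# The first argsort gives the sorting order; the second gives each element's position in it.
ranks = np.argsort(np.argsort(values)) + 1  # smallest value gets rank 1
```

Here the ranks are [2, 1, 3]. With tied values, a double argsort still assigns distinct ranks, ordering the ties arbitrarily unless a stable sort is requested.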

genetools.stats.run_sigmoid_if_binary_and_softmax_if_multiclass(arr: ndarray) ndarray[source]#

Convert logits to probabilities using sigmoid if binary, or softmax if multiclass.

Binary is standard logistic regression. Input should have shape (n_samples, ) (as computed by decision_function() in sklearn), but the output will follow predict_proba’s conventional output shape of (n_samples, 2).

Multiclass approach is softmax regression as in R glmnet and sklearn. See https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_set_of_independent_binary_regressions
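
The binary case can be sketched as follows: a 1d vector of logits is passed through a sigmoid and expanded to the two-column shape described above. binary_logits_to_proba is a hypothetical helper, and the column order (negative class first, positive class second) is assumed from scikit-learn's predict_proba convention.

```python
import numpy as np

def binary_logits_to_proba(logits):
    """Sigmoid over 1d logits, returned with shape (n_samples, 2) (hypothetical helper)."""
    positive = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    # Assumed column order: [P(negative class), P(positive class)].
    return np.column_stack([1.0 - positive, positive])

proba = binary_logits_to_proba([0.0, 2.0, -2.0])
```

A logit of 0 maps to probabilities of 0.5 for each class, and every row sums to 1.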

genetools.stats.softmax(arr: ndarray) ndarray[source]#

Softmax a 1d vector, or all rows of a 2d array. Ensures the result sums to 1:

  • if input was a 1d vector, returns a 1d vector summing to 1.

  • if input was a 2d array, returns a 2d array where every row sums to 1.
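
A minimal sketch of a softmax with the two behaviors listed above. softmax_sketch is a hypothetical helper; subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the result.

```python
import numpy as np

def softmax_sketch(arr):
    """Softmax over the last axis: a 1d vector, or each row of a 2d array (hypothetical helper)."""
    arr = np.asarray(arr, dtype=float)
    # Shifting by the maximum avoids overflow in exp() for large logits.
    shifted = arr - arr.max(axis=-1, keepdims=True)
    exponentiated = np.exp(shifted)
    return exponentiated / exponentiated.sum(axis=-1, keepdims=True)

vector_result = softmax_sketch([1.0, 2.0, 3.0])
matrix_result = softmax_sketch([[1.0, 2.0, 3.0], [1000.0, 1001.0, 1002.0]])
```

Both rows of the 2d example give the same probabilities, because softmax depends only on the differences between logits.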

genetools.stats.unpack_confusion_matrix(y_true, y_pred, positive_label, negative_label)[source]#
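
No description is given for this function. Judging only from its signature and the fields of ConfusionMatrixValues, it presumably counts the four cells of a binary confusion matrix. The sketch below shows that counting with hypothetical names (ConfusionCounts, unpack_counts); it is an assumption about the behavior, not the library's code.

```python
from typing import NamedTuple

class ConfusionCounts(NamedTuple):
    """Hypothetical stand-in mirroring the fields of ConfusionMatrixValues."""
    true_positives: float
    false_negatives: float
    false_positives: float
    true_negatives: float

def unpack_counts(y_true, y_pred, positive_label, negative_label):
    """Count the four binary confusion matrix cells (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))

    def count(true_label, pred_label):
        # Number of samples with this (true, predicted) label combination.
        return float(sum(t == true_label and p == pred_label for t, p in pairs))

    return ConfusionCounts(
        true_positives=count(positive_label, positive_label),
        false_negatives=count(positive_label, negative_label),
        false_positives=count(negative_label, positive_label),
        true_negatives=count(negative_label, negative_label),
    )

counts = unpack_counts(
    ["pos", "pos", "neg", "neg", "pos"],
    ["pos", "neg", "neg", "pos", "pos"],
    positive_label="pos",
    negative_label="neg",
)
```

In this toy example there are two true positives and one each of false negatives, false positives, and true negatives.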

Module contents#

Top-level package for genetools.