Skip to content

Conversation

@ilia-kats
Copy link
Collaborator

@ilia-kats ilia-kats commented Oct 6, 2025

  • don't raise exception if mods is used together with common or prefixed (push) / common, nonunique,
    or unique (pull): There is nothing in the logic preventing that
  • don't raise if columns is used together with common or prefixed (push) / common, nonunique, or unique, warn instead
  • bugfix for push_attr: correct ordering of pushed column
  • minor code cleanup

@ilia-kats ilia-kats requested a review from gtca October 6, 2025 14:28
@codecov-commenter
Copy link

codecov-commenter commented Oct 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@ddbcfa4). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #105   +/-   ##
=======================================
  Coverage        ?   96.78%           
=======================================
  Files           ?       13           
  Lines           ?      872           
  Branches        ?        0           
=======================================
  Hits            ?      844           
  Misses          ?       28           
  Partials        ?        0           
Files with missing lines Coverage Δ
tests/test_pull_push.py 99.52% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ilia-kats ilia-kats force-pushed the push_pull_fixes branch 2 times, most recently from aad3562 to 2cc418f Compare October 7, 2025 11:49
@ilia-kats ilia-kats requested a review from ilan-gold October 9, 2025 16:11
Copy link
Collaborator

@gtca gtca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes!

I am also wondering if this fixes specific use cases that were failing before, can we add them to the our tests?

assert v is None, f"Cannot use {k} with columns."
if v is not None:
warnings.warn(
f"Both columns and {k} given. Columns take precedence, {k} will be ignored",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would something like this improve readability? (I am not sure we have a consistent policy for formatting in such cases.)

Both `columns=...` and `{k}=True` were given. <...>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If yes, this is also true for similar warnings in other parts of the PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be a bit misleading here, since the warning will also be emitted if k=False or any other value which is not the None default. Perhaps something like

Both `columns=...` and `{k}={locals()[k]}` were given...

? But I'm not sure if that brings the message across that it should just be not passed at all (as in leave the None default).

@ilia-kats
Copy link
Collaborator Author

ilia-kats commented Oct 10, 2025

I am also wondering if this fixes specific use cases that were failing before, can we add them to the our tests?

I did add a test for the one bugfix that this contains (the ordering of pushed columns). The rest of this PR is more quality of life / logic improvements. I can add tests for those, but I'd prefer to do that as part of a larger effort to improve test coverage.

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally just found it a bit hard to reason about what things are done to the columns, so maybe a class would help?

common
If True, pull common columns.
Common columns do not have modality prefixes.
Pull from all modalities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if you have a column that is present in all modalities (AnnData objects) and is named, say batch, it will be pulled and its name in the global dataframe will also be batch, so without any modalitiy prefix.

Comment on lines 1905 to 1907
only_drop
If True, drop the columns but do not actually pull them.
Forces drop=True. False by default.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this argument turns this function into a "drop columns" function instead of a "pull columns" function? Made add some context to why one might want to do this (deleting data from your underlying AnnData stores strikes me as a bad idea, but I'm sure there's a reason to do this in some sort of structure manner)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was originally implemented by @gtca, so he's best qualified to answer, but I think if you have a bunch of AnnDatas coming from somewhere (e.g. constructed from the output of a proprietary pipeline or some publication), there are often metadata that are either useless (the same value for all observations) or that you just don't care about in your analysis, in which case one can drop them to reduce visual clutter (e.g. when printing the dataframe in a notebook or tab-completing column names).

if only_drop:
drop = True

cols = _classify_attr_columns(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cols should be a class, not just a dictionary, with methods to encapsulate the below iterations (and it's sub-dictionary value as well to handle the modcols logic)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I would agree with you, but this used at exactly one place in the entire codebase, so I think a class is a bit overkill at this point.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I would consider making it a named tuple for performance reasons.


if mods is not None:
cols = [col for col in cols if col["prefix"] in mods]
cols = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e., with cols as a class this could be

prefix_to_cols = cols.filter_by_name_or_derived_name(colums)

(I would also advocate changing the name from cols to prefix_to_cols to avoid confusion with columns)

Comment on lines 99 to 102
{"name": "global", "prefix": "", "derived_name": "global", "class": "common"},
{"name": "mod1:annotation", "prefix": "mod1", "derived_name": "annotation", "class": "prefixed"},
{"name": "mod2:annotation", "prefix": "mod2", "derived_name": "annotation", "class": "prefixed"},
{"name": "mod1:unique", "prefix": "mod1", "derived_name": "annotation", "class": "prefixed"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class seems redundant, more class-behavior. Isn't just a check on name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general yes, but having this as a separate attribute makes the filtering in _push_attr() much easier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a big refactor but this gets back to my point about using classes. It just feels like _classify_attr_columns and _classify_prefixed_columns do such similar things and return such weak types that refactoring around using classes would be a good idea.

It's tough to say without doing it myself, but offhand, I don't see why this should return a list and not the same thing as _classify_attr_columns with methods for getting the needed information that might distinguish them. For example, moving prefix here into the key of the dictionary returned in classify_attr_columns seems. Even there, in that return type the name and derived_name are just slightly different than name and prefix here it seems.

I just generally don't like untyped dictionaries and magic strings and especially in this case, it seems like there's a decent amount of overlapping functionality. But if it's too much for this PR, that's fine. I think stronger typing on the return types is a good start.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, there is now a custom class encapsulating one column and its logic. But moving all the filtering logic etc. into another class feels like overengineering to me tbh. If anything, I would get rid of _classify_attr_columns and just move the few lines into _pull_attr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up to you. It's not so much about code correctness as it is about readability for the next person. For example:

https://github.com/scverse/mudata/pull/105/files#diff-5fc9f5c31eeb9fbab11538b609e263905fcc3e03d32bc2436469f26ded6514afR2261-R2267

or

https://github.com/scverse/mudata/pull/105/files#diff-5fc9f5c31eeb9fbab11538b609e263905fcc3e03d32bc2436469f26ded6514afR2012-R2026

I find these a little difficult to reason about, especially in the absence of comment strings.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some more comments throughout.

@ilia-kats ilia-kats force-pushed the push_pull_fixes branch 5 times, most recently from 1072cf6 to 0dc33db Compare October 13, 2025 09:12
- don't raise exception if mods is used together with common or prefixed
  there is nothing in the logic preventing it
- don't raise if columns is used together with common or prefixed, warn
  instead
- minor code cleanup
- don't raise exception if mods is used together with common, nonunique,
  or unique: there is nothing in the logic preventing it
- don't raise if columns is used together with common, nonunique, or
  unique, warn instead
- fix ordering of pushed column
- minor code cleanup
Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely like the MetadataColumn class!

Comment on lines 99 to 102
{"name": "global", "prefix": "", "derived_name": "global", "class": "common"},
{"name": "mod1:annotation", "prefix": "mod1", "derived_name": "annotation", "class": "prefixed"},
{"name": "mod2:annotation", "prefix": "mod2", "derived_name": "annotation", "class": "prefixed"},
{"name": "mod1:unique", "prefix": "mod1", "derived_name": "annotation", "class": "prefixed"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up to you. It's not so much about code correctness as it is about readability for the next person. For example:

https://github.com/scverse/mudata/pull/105/files#diff-5fc9f5c31eeb9fbab11538b609e263905fcc3e03d32bc2436469f26ded6514afR2261-R2267

or

https://github.com/scverse/mudata/pull/105/files#diff-5fc9f5c31eeb9fbab11538b609e263905fcc3e03d32bc2436469f26ded6514afR2012-R2026

I find these a little difficult to reason about, especially in the absence of comment strings.

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there explicit tests for the fixes? For example:

don't raise exception if mods is used together with common or prefixed (push) / common, nonunique, or unique (pull): There is nothing in the logic preventing that

deosn't seem to have a test. I searched through the file and mods is None by default and the other two instances of it in test_push_pull are only with columns=...

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
@ilia-kats
Copy link
Collaborator Author

Are there explicit tests for the fixes? For example:

There is a test for push_attr: correct ordering of pushed column, which is the only actual fix in here. The rest is more about behavior changes / being more permissive in terms of what it accepts. I can add some tests for that now, but I'd prefer to do that as part of a larger effort to get our test coverage up (which is on my todo list, but depends on #104 and #106 being merged).

Comment on lines +174 to +175
== mdata.mod[m].obs["common_obs_col"].to_numpy()[modmap[mask] - 1]
).all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, the reason is that 0 encodes missing (why not NA?) so the obsmap indexing starts at 1 and then you need to shift by 1. Is this right?

Unrelated to the PR, why are these not nullables?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. When we first implemented this, AnnData's IO did not support nullables yet, so we went with this.

@ilia-kats ilia-kats merged commit 0681f45 into scverse:main Oct 24, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants