-
Notifications
You must be signed in to change notification settings - Fork 21
fixes for push_attr/pull_attr #105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
143a0a1 to
2e797b1
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #105 +/- ##
=======================================
Coverage ? 96.78%
=======================================
Files ? 13
Lines ? 872
Branches ? 0
=======================================
Hits ? 844
Misses ? 28
Partials ? 0
🚀 New features to boost your workflow:
|
aad3562 to
2cc418f
Compare
gtca
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes!
I am also wondering if this fixes specific use cases that were failing before, can we add them to the our tests?
| assert v is None, f"Cannot use {k} with columns." | ||
| if v is not None: | ||
| warnings.warn( | ||
| f"Both columns and {k} given. Columns take precedence, {k} will be ignored", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would something like this improve readability? (I am not sure we have a consistent policy for formatting in such cases.)
Both `columns=...` and `{k}=True` were given. <...>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If yes, this is also true for similar warnings in other parts of the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be a bit misleading here, since the warning will also be emitted if k=False or any other value which is not the None default. Perhaps something like
Both `columns=...` and `{k}={locals()[k]}` were given...
? But I'm not sure if that brings the message across that it should just be not passed at all (as in leave the None default).
2cc418f to
ca406d7
Compare
I did add a test for the one bugfix that this contains (the ordering of pushed columns). The rest of this PR is more quality of life / logic improvements. I can add tests for those, but I'd prefer to do that as part of a larger effort to improve test coverage. |
ilan-gold
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally just found it a bit hard to reason about what things are done to the columns, so maybe a class would help?
| common | ||
| If True, pull common columns. | ||
| Common columns do not have modality prefixes. | ||
| Pull from all modalities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if you have a column that is present in all modalities (AnnData objects) and is named, say batch, it will be pulled and its name in the global dataframe will also be batch, so without any modalitiy prefix.
| only_drop | ||
| If True, drop the columns but do not actually pull them. | ||
| Forces drop=True. False by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this argument turns this function into a "drop columns" function instead of a "pull columns" function? Made add some context to why one might want to do this (deleting data from your underlying AnnData stores strikes me as a bad idea, but I'm sure there's a reason to do this in some sort of structure manner)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was originally implemented by @gtca, so he's best qualified to answer, but I think if you have a bunch of AnnDatas coming from somewhere (e.g. constructed from the output of a proprietary pipeline or some publication), there are often metadata that are either useless (the same value for all observations) or that you just don't care about in your analysis, in which case one can drop them to reduce visual clutter (e.g. when printing the dataframe in a notebook or tab-completing column names).
src/mudata/_core/mudata.py
Outdated
| if only_drop: | ||
| drop = True | ||
|
|
||
| cols = _classify_attr_columns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think cols should be a class, not just a dictionary, with methods to encapsulate the below iterations (and it's sub-dictionary value as well to handle the modcols logic)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I would agree with you, but this used at exactly one place in the entire codebase, so I think a class is a bit overkill at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I would consider making it a named tuple for performance reasons.
|
|
||
| if mods is not None: | ||
| cols = [col for col in cols if col["prefix"] in mods] | ||
| cols = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e., with cols as a class this could be
prefix_to_cols = cols.filter_by_name_or_derived_name(colums)(I would also advocate changing the name from cols to prefix_to_cols to avoid confusion with columns)
src/mudata/_core/utils.py
Outdated
| {"name": "global", "prefix": "", "derived_name": "global", "class": "common"}, | ||
| {"name": "mod1:annotation", "prefix": "mod1", "derived_name": "annotation", "class": "prefixed"}, | ||
| {"name": "mod2:annotation", "prefix": "mod2", "derived_name": "annotation", "class": "prefixed"}, | ||
| {"name": "mod1:unique", "prefix": "mod1", "derived_name": "annotation", "class": "prefixed"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class seems redundant, more class-behavior. Isn't just a check on name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general yes, but having this as a separate attribute makes the filtering in _push_attr() much easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a big refactor but this gets back to my point about using classes. It just feels like _classify_attr_columns and _classify_prefixed_columns do such similar things and return such weak types that refactoring around using classes would be a good idea.
It's tough to say without doing it myself, but offhand, I don't see why this should return a list and not the same thing as _classify_attr_columns with methods for getting the needed information that might distinguish them. For example, moving prefix here into the key of the dictionary returned in classify_attr_columns seems. Even there, in that return type the name and derived_name are just slightly different than name and prefix here it seems.
I just generally don't like untyped dictionaries and magic strings and especially in this case, it seems like there's a decent amount of overlapping functionality. But if it's too much for this PR, that's fine. I think stronger typing on the return types is a good start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, there is now a custom class encapsulating one column and its logic. But moving all the filtering logic etc. into another class feels like overengineering to me tbh. If anything, I would get rid of _classify_attr_columns and just move the few lines into _pull_attr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's up to you. It's not so much about code correctness as it is about readability for the next person. For example:
or
I find these a little difficult to reason about, especially in the absence of comment strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some more comments throughout.
1072cf6 to
0dc33db
Compare
- don't raise exception if mods is used together with common or prefixed there is nothing in the logic preventing it - don't raise if columns is used together with common or prefixed, warn instead - minor code cleanup
- don't raise exception if mods is used together with common, nonunique, or unique: there is nothing in the logic preventing it - don't raise if columns is used together with common, nonunique, or unique, warn instead - fix ordering of pushed column - minor code cleanup
0dc33db to
2b26ebf
Compare
ilan-gold
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely like the MetadataColumn class!
src/mudata/_core/utils.py
Outdated
| {"name": "global", "prefix": "", "derived_name": "global", "class": "common"}, | ||
| {"name": "mod1:annotation", "prefix": "mod1", "derived_name": "annotation", "class": "prefixed"}, | ||
| {"name": "mod2:annotation", "prefix": "mod2", "derived_name": "annotation", "class": "prefixed"}, | ||
| {"name": "mod1:unique", "prefix": "mod1", "derived_name": "annotation", "class": "prefixed"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's up to you. It's not so much about code correctness as it is about readability for the next person. For example:
or
I find these a little difficult to reason about, especially in the absence of comment strings.
_pull_attr/_push_attr
4e2177d to
21b2e83
Compare
ilan-gold
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there explicit tests for the fixes? For example:
don't raise exception if mods is used together with common or prefixed (push) / common, nonunique, or unique (pull): There is nothing in the logic preventing that
deosn't seem to have a test. I searched through the file and mods is None by default and the other two instances of it in test_push_pull are only with columns=...
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
There is a test for |
| == mdata.mod[m].obs["common_obs_col"].to_numpy()[modmap[mask] - 1] | ||
| ).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand, the reason is that 0 encodes missing (why not NA?) so the obsmap indexing starts at 1 and then you need to shift by 1. Is this right?
Unrelated to the PR, why are these not nullables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly. When we first implemented this, AnnData's IO did not support nullables yet, so we went with this.
or unique (pull): There is nothing in the logic preventing that