My favorite function here is more_itertools.one. Especially in something like a unit test, where ValueErrors from unexpected conditions are desirable, we can use it to turn code like
results = list(get_some_stuff(...))
assert len(results) == 1
result = results[0]
into
result = one(get_some_stuff(...))
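(A quick sketch of the failure modes, in case it's not obvious why this beats indexing — by default one raises a ValueError whenever the iterable doesn't contain exactly one item:)

from more_itertools import one

result = one([42])   # 42
one([])              # raises ValueError (empty iterable)
one([1, 2])          # raises ValueError (more than one item)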
I guess you could also use tuple-unpacking:
result, = get_some_stuff(...)
But the syntax is awkward for unpacking a single item. Doesn't that trailing comma just look implausible? (Also, I've worked with type-checkers that will complain when a tuple-unpacking could potentially fail, while one has a clear type signature, Iterable[T] -> T.)
That’s not the same though. Your unpacking allows for any non-empty iterable, while OP's only allows an iterable with exactly one item, or else it throws an exception.
The proposal seemed very close to getting shipped alongside https://github.com/tc39/proposal-iterator-helpers while basically accepting many of the constraints of current async iteration (one-at-a-time consumption). But the folks involved recognized that concurrency needs had evolved, and decided to hold back & keep iterating toward something better.
I feel like a lot of the easily visible mood on the web (against the web) is that there's too much, that stuff is just piled in. But I see a lot of caring & deliberation & trying to get shit right & good. Sometimes that too can be maddening, but ultimately with the web there aren't really do-overs, & the deliberation is good.
Disclaimer: this code was written several years ago with few downstream users; not all of these are super high-performing, and they have not been extensively tested.
Your nice work on the JS itertools port has a todo for a "better tee". This was my fault because the old "rough equivalent" code in the Python docs was too obscure and didn't provide a good emulation.
Here is an update that should be much easier to convert to JS:
def tee(iterable, n=2):
    iterator = iter(iterable)
    # Each cell is a two-element list: [value, next_cell].  A None in the
    # second slot means "not yet fetched from the underlying iterator".
    shared_link = [None, None]
    return tuple(_tee(iterator, shared_link) for _ in range(n))

def _tee(iterator, link):
    try:
        while True:
            if link[1] is None:
                # First consumer to reach this cell pulls the next value
                # and creates the following (empty) cell.
                link[0] = next(iterator)
                link[1] = [None, None]
            value, link = link
            yield value
    except StopIteration:
        return
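A quick usage sketch, for anyone who wants to sanity-check the behavior (the returned iterators advance independently and share already-seen values through the linked cells):

it1, it2 = tee(range(5))
print(next(it1), next(it1))   # 0 1
print(list(it2))              # [0, 1, 2, 3, 4]
print(list(it1))              # [2, 3, 4]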
> But the folks involved recognized that concurrency needs had evolved, and decided to hold back & keep iterating toward something better
I'm not sure if it was this proposal or another one in a similar space, but I've recently heard about several async improvements that were woefully under-spec'd, and would likely have caused much more harm than good due to all the edge cases that were missed.
This library is my Python productivity secret weapon. So many things I've needed to implement in the past are now just a matter of chaining functions from itertools, functools, and this.
Nice! These can make code a ton simpler. Also no Python dependencies, which is a requirement for my adopting it. Would love to see this brought into the standard lib at some point.
It’s possible but tends not to be common, for a multitude of reasons. The biggest issue is that library updates become tied to Python's release cycle, which doesn’t provide a lot of flexibility. A package would have to be exceptionally stable to be a reasonable candidate.
That’s not actually true. While dataclasses took most of its inspiration from attrs, there are many features of attrs that were deliberately not implemented in dataclasses, just so it could “fit” in the stdlib.
Or maybe you mean the backport of dataclasses to 3.6 that is available on PyPI? That actually came after dataclasses was added to 3.7.
Yes, I think it might have been a slight design mistake to make the variadic version the default. I've only very rarely used it, whereas I use chain.from_iterable a lot.
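For anyone who hasn't hit the distinction, a quick illustration (stdlib only):

from itertools import chain

rows = [[1, 2], [3, 4], [5, 6]]

# Variadic form: each iterable is its own positional argument.
list(chain([1, 2], [3, 4], [5, 6]))   # [1, 2, 3, 4, 5, 6]

# from_iterable: takes a single iterable of iterables, which is what you
# usually have when the rows are built dynamically.
list(chain.from_iterable(rows))       # [1, 2, 3, 4, 5, 6]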
Maybe in some cases, but the performance characteristics are way different. The functions in `more_itertools` return lazy generators, but it looks like `np.flatten` materializes the results in an ndarray.
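A small illustration of the difference (I'm using more_itertools.flatten and the ndarray.flatten method here, which is presumably what "np.flatten" refers to above):

from more_itertools import flatten
import numpy as np

lazy = flatten([[1, 2], [3, 4]])   # lazy iterator; nothing is consumed yet
print(next(lazy))                  # 1

eager = np.array([[1, 2], [3, 4]]).flatten()   # fully materialized ndarray
print(eager)                                   # [1 2 3 4]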
np is the standard alias for numpy, probably the most popular numerical and array processing library for python. So, no, not part of the standard lib at all. But a universal import for most users of the language in any science/stats/ml environment. That said, still a surprising place from which to import a basic stream processing function.
I was frustrated by the itertools design, because the chain of operations reads from the inside out. The iterator design in Scala is much friendlier to me.
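To make the complaint concrete, here's roughly what "inside-out" looks like in Python, with the Scala-style left-to-right equivalent shown only as a comment for comparison:

from itertools import islice, chain

nested = [[1, 2], [3, 4], [5, 6]]

# Reads inside-out: you have to start from the innermost call.
first_four = list(islice(chain.from_iterable(nested), 4))   # [1, 2, 3, 4]

# Scala-ish equivalent reads left to right: nested.flatten.take(4)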
Usually, I'd cast my arrays into a pandas DF and then use the equivalent dataframe operations. To me, pandas and numpy might as well be part of the python stdlib.
How should I reason about the tradeoff of using something like this vs. pandas/numpy? Especially with NumPy 2.0 supporting the string dtype.
I promise I mean no offense by this, but this is so comically absurd. Like, you know it's not a cast, right? I.e., that you're constructing pandas DataFrames.
> How should I reason about the tradeoff of using something like this vs. pandas/numpy?
For small sizes, operations on native types will be faster than the construction of complex objects.
Also, my gripe with DFs is that they aren't typed (in the typing-module sense) by column. Maybe that's changed though? It's been a while.
The only way to understand what's going on with DF code is to step through it in a debugger. I know they can be much faster, but man, you pay a maintainability price!
This is incorrect: each column in a pandas DF can have a separate type (what you're asking for is compatibility with Python's type-hinting on a per-column basis, though, which is different), and you can understand the code without needing a debugger: I use pandas regularly and I've never needed to use a debugger on pandas.
(Sure, it's easy to write obfuscated pandas, and it sometimes has version-specific bugs or deprecations that need to be hacked around in a way that compromises readability, and sometimes the API has churn and naming choices that are non-trivial. But that's miles from "only way to understand is with a debugger".
If you want to claim otherwise, post a counterexample on SO (or Codidact) and post the link here.)
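To make the distinction concrete, a tiny sketch (made-up sample data): the per-column types do exist at runtime, they just aren't visible to a static type checker.

import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "age": [1, 2]})
print(df.dtypes)   # name: object, age: int64 -- real per-column dtypes at runtime
# A static type checker, however, only sees "DataFrame", with no column information.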
Yeah, that's what I meant. I would like per column type-hinting so that data frames are type-checked along with the rest of our stuff and everything is explicit.
I don't have anything I can show because the stuff I was working on was commercial and I don't code Pandas for fun at home ;)
The code I was maintaining / updating had long pipelines, had lots of folding, and would drift in and out of numpy quite a bit.
Part of the issue was my unfamiliarity with Pandas, for sure. But if I just picked a random function in the code, I would have no idea as to the shape of the data flowing in and out, without reading up and down the callstack to see what columns are in play.
For type-hinting on dataframe-like objects, people recommend pandera [0].
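Roughly, a schema declaration there looks something like this (a sketch from memory, so treat the details as approximate; recent pandera versions call the base class DataFrameModel, older ones SchemaModel):

import pandera as pa
from pandera.typing import DataFrame, Series

class Users(pa.DataFrameModel):
    name: Series[str]
    age: Series[int] = pa.Field(ge=0)

@pa.check_types
def load_users(path: str) -> DataFrame[Users]:
    ...   # whatever builds the frame; validated against the schema at runtime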
> The code I was maintaining / updating had long pipelines, had lots of folding, and would drift in and out of numpy quite a bit.
(Protein folding?)
Anyway, yeah, if your codebase is a large proprietary pipeline that thunks to and from pandas-numpy, then now I understand you. But that's your very specific use case. The claim "The only way to understand what's going on with DF code is to step through it in a debugger" is, in general, an overstatement.
I happen to know a book or two that might help with Pandas.
(Disclaimer: I wrote three of them and spend a good deal of my time helping others level up their Pandas. Spent this morning helping a medical AI company with Pandas.)
My tasks aren't usually bottlenecked by the df creation operation. To me, the convenience offered by dfs outstrips the compute hit. However, if this is an order of magnitude difference, then it would push me to adopt the more-itertools formulation.
> However, if this is an order of magnitude difference, then it would push me to adopt the more-itertools formulation.
My friend, it's much worse than a single order of magnitude for small inputs:
import time
import pandas as pd

ls = list(range(10))

# Plain list comprehension over native ints.
b = time.monotonic_ns()
odds = [v for v in ls if v % 2]
e = time.monotonic_ns() - b
print(f"{e=}")

# The pandas version pays for DataFrame construction before doing any work.
bb = time.monotonic_ns()
df = pd.DataFrame(ls)
odds = df[df % 2 == 1]
ee = time.monotonic_ns() - bb
print(f"{ee=}")

print("ratio", ee/e)
>>> e=1166
>>> ee=656792
>>> ratio 563.2864493996569
I would like to see some kind of query AST for this stuff, with query-engine semantics, so that its ops can be fused together for efficiency. Something like a Clojure transducer, for example.
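For what it's worth, here's a toy, purely illustrative sketch of what transducer-style fusion could look like in plain Python (all names here are made up; this isn't from more-itertools): each stage rewrites the reducer, so the whole pipeline runs as a single pass with no intermediate lists.

from functools import reduce

def mapping(f):
    # Transform the reducer so each item is mapped before being accumulated.
    return lambda reducer: lambda acc, x: reducer(acc, f(x))

def filtering(pred):
    # Transform the reducer so items failing the predicate are skipped.
    return lambda reducer: lambda acc, x: reducer(acc, x) if pred(x) else acc

def transduce(xform, reducer, init, iterable):
    return reduce(xform(reducer), iterable, init)

# Keep odds, then square them -- fused into one traversal.
xform = lambda r: filtering(lambda x: x % 2)(mapping(lambda x: x * x)(r))
print(transduce(xform, lambda acc, x: acc + [x], [], range(10)))
# [1, 9, 25, 49, 81]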