More Itertools (more-itertools.readthedocs.io)
206 points by stereoabuse on May 27, 2024 | 49 comments


I've implemented the "chunked" iterator a million times. Glad to see I can just import this next time.
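For reference, it's roughly:

  from more_itertools import chunked

  list(chunked([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]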


Since Python 3.12, the builtin itertools includes a batched function: https://docs.python.org/3/library/itertools.html#itertools.b...
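For example (note that it yields tuples rather than lists):

  from itertools import batched  # Python 3.12+

  list(batched([1, 2, 3, 4, 5], 2))  # [(1, 2), (3, 4), (5,)]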


Even better! Thanks :)


If you like this sort of thing, why not check out “boltons” - things that should be built into Python?

https://pypi.org/project/boltons/


My favorite function here is more_itertools.one. Especially in something like a unit test, where ValueErrors from unexpected conditions are desirable, we can use it to turn code like

  results = list(get_some_stuff(...))
  assert len(results) == 1
  result = results[0]
into

  result = one(get_some_stuff(...))
I guess you could also use tuple-unpacking:

  result, = get_some_stuff(...)
But the syntax for unpacking a single item is awkward. Isn't that trailing comma easy to miss? (Also, I've worked with type-checkers that will complain when a tuple-unpacking could potentially fail, while one has a clear type signature, Iterable[T] -> T.)
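For illustration, one raises on both failure modes:

  from more_itertools import one

  one([42])    # 42
  one([])      # raises ValueError (too few items)
  one([1, 2])  # raises ValueError (more than one item)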


You can also do

  [result] = get_some_stuff(...)


Do tuple unpacking like this:

  result, *_ = iterable()


That’s not the same though. Your unpacking allows any non-empty iterable, while OP's only allows an iterable with exactly one item, or else it throws an exception.
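To illustrate the difference:

  x, *_ = [1, 2, 3]  # fine: x == 1, the rest is discarded
  (x,) = [1, 2, 3]   # ValueError: too many values to unpack
  (x,) = []          # ValueError: not enough values to unpack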


Shout out to JavaScript massively delaying https://github.com/tc39/proposal-async-iterator-helpers at the 23rd hour.

The proposal seemed very close to getting shipped alongside https://github.com/tc39/proposal-iterator-helpers, while basically accepting many of the constraints of current async iteration (one-at-a-time consumption). But the folks involved accepted that concurrency needs had evolved & decided to hold back & keep iterating toward something better.

I feel like a lot of the easily visible mood on the web (against the web) is that there's too much, that stuff just gets piled in. But I see a lot of caring & deliberation & trying to get shit right & good. Sometimes that too can be maddening, but ultimately with the web there aren't really re-dos, & the deliberation is good.


You can implement quite a lot of Python's itertools in JavaScript without too much trouble. For instance, https://observablehq.com/@jrus/itertools

Disclaimer: this code was written several years ago, has had few downstream users, not all of it is particularly high-performing, and it has not been extensively tested.


Your nice work on the JS itertools port has a todo for a "better tee". This was my fault because the old "rough equivalent" code in the Python docs was too obscure and didn't provide a good emulation.

Here is an update that should be much easier to convert to JS:

        def tee(iterable, n=2):
            iterator = iter(iterable)
            # shared_link is a [value, next] cell shared by every branch;
            # whichever branch runs ahead fills in the next cell.
            shared_link = [None, None]
            return tuple(_tee(iterator, shared_link) for _ in range(n))

        def _tee(iterator, link):
            try:
                while True:
                    if link[1] is None:
                        # This branch is at the frontier: pull one value
                        # from the shared iterator and start a new cell.
                        link[0] = next(iterator)
                        link[1] = [None, None]
                    value, link = link
                    yield value
            except StopIteration:
                return
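A quick check that each branch sees the full stream:

        a, b = tee('ABC')
        print(list(a), list(b))  # ['A', 'B', 'C'] ['A', 'B', 'C']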


Thanks! And thanks, Raymond, for all your hard work over the years!


> But the folks involved accepted that concurrency needs had evolved & decided to hold back & keep iterating toward something better

I'm not sure if it was this proposal or another one in a similar space, but I've recently heard about several async improvements that were woefully under-spec'd, and would likely have caused much more harm than good due to all the edge cases that were missed.


This library is my Python productivity secret weapon. So many things I've needed to implement in the past are now just a matter of chaining functions from itertools, functools, and this.


Nice! These can make code a ton simpler. It also has no dependencies of its own, which is a requirement for me to adopt it. Would love to see this brought into the standard lib at some point.


What's the process for adding these to Python's stdlib? Is it even possible to adopt a whole library such as this one?


Yes. unittest.mock used to be a third-party library.

For an idea of the process followed, look up PEP 417 (a Python Enhancement Proposal).


Thank you!


It’s possible but tends not to be common, for a multitude of reasons. The biggest issue is that library updates become tied to Python's own release cycle, which doesn’t provide a lot of flexibility. A package would have to be exceptionally stable to be a reasonable candidate.


It must be possible, because the 'dataclasses' library used to be third-party.


That’s not actually true. While dataclasses took most of its inspiration from attrs, there are many features of attrs that were deliberately not implemented in dataclasses, just so it could “fit” in the stdlib.

Or maybe you mean the backport of dataclasses to 3.6 that is available on PyPI? That actually came after dataclasses was added to 3.7.

Source: I wrote dataclasses.


> I wrote dataclasses.

Much appreciated!


Thank you for correcting me! I must be thinking of another library


it has always annoyed me that flatten isn't already part of itertools


ok itertools has chain.from_iterable but that name is hard to remember


Yes, I think it might have been a slight design mistake to make the variadic version the default. I've only very rarely used it, whereas I use chain.from_iterable a lot.


Amen. For a language that goes on about "flat is better than nested", you have to jump through too many hoops to get your stuff flattened.


It's there in the form of chain:

  from itertools import chain

  flatten = chain.from_iterable

Ref: pytudes - https://github.com/norvig/pytudes/blob/main/ipynb/Advent-202...
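For example:

  >>> list(flatten([[1, 2], [3], [4, 5]]))
  [1, 2, 3, 4, 5]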


Is np.flatten not a workable option in some cases?


Maybe in some cases, but the performance characteristics are way different. The functions in `more_itertools` return lazy generators, but it looks like `np.flatten` materializes the results in an ndarray.
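For example, more_itertools.flatten hands back an iterator and does no work until it's consumed:

  from more_itertools import flatten

  it = flatten([[1, 2], [3]])
  next(it)  # 1 -- items are produced lazily, one at a time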


Is np part of itertools?


np is the standard alias for numpy, probably the most popular numerical and array processing library for python. So, no, not part of the standard lib at all. But a universal import for most users of the language in any science/stats/ml environment. That said, still a surprising place from which to import a basic stream processing function.


I was frustrated by the itertools design, because the chain of operations reads from the inside out. The iterator design in Scala is much friendlier to me.

https://pybites.circle.so/c/python-discussion/functional-com...
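For example, a made-up pipeline has to be read from the innermost call outward:

  from itertools import islice

  # read order: range -> map -> filter -> islice -> list
  result = list(islice(filter(lambda x: x % 2, map(lambda x: x * x, range(100))), 5))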


itertools is a gem and has been since the 2.7 days. Glad to see people waking up to its powerful abstractions.


itertools (iterators) and collections (data structures) are both underrated modules in stdlib.


And are both written by Raymond Hettinger


itertools and more-itertools are two different libs


This looks great.

Usually, I'd cast my arrays into a pandas DF and then use the equivalent dataframe operations. To me, pandas and numpy might as well be part of the Python stdlib.

How should I reason about the tradeoff of using something like this vs pandas/numpy ? Esp. with Numpy 2.0 supporting the string dtype.


> Usally, I'd cast my arrays into a pandas DF

I promise I mean no offense by this, but this is so comically absurd. Like, you know it's not a cast, right? I.e., that you're constructing pandas dataframes.

> How should I reason about the tradeoff of using something like this vs pandas/numpy ?

For small sizes, operations on native types will be faster than the construction of complex objects.


Also, my grief with DFs is that they aren't typed (in the typing-module sense) by column. Maybe that's changed though? It's been a while.

The only way to understand what's going on with DF code is to step through it in a debugger. I know they can be much faster, but man, you pay a maintainability price!


This is incorrect: each column in a pandas DF can have a separate type (what you're asking for is compatibility with Python's type-hinting on a per-column basis, though, which is different), and you can debug the code without needing a debugger: I use pandas regularly and I've never needed to use a debugger on it.
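For example, the per-column types are right there on the frame:

  import pandas as pd

  df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
  df.dtypes  # a: int64, b: object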

(Sure, it's easy to write obfuscated pandas, and it sometimes has version-specific bugs or deprecations which need to be hacked around in a way that compromises readability, and sometimes the API has renamings and changes that are non-trivial. But that's miles from "the only way to understand it is with a debugger". If you want to claim otherwise, post a counterexample on SO (or Codidact) and post the link here.)


Yeah, that's what I meant. I would like per column type-hinting so that data frames are type-checked along with the rest of our stuff and everything is explicit.

I don't have anything I can show because the stuff I was working on was commercial and I don't code Pandas for fun at home ;)

The code I was maintaining / updating had long pipelines, had lots of folding, and would drift in and out of numpy quite a bit.

Part of the issue was my unfamiliarity with Pandas, for sure. But if I just picked a random function in the code, I would have no idea as to the shape of the data flowing in and out, without reading up and down the callstack to see what columns are in play.

Breakpoint and then look at the data, every time!


For type-hinting on dataframe-like objects, people recommend pandera [0].
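A minimal sketch (the column names here are made up):

  import pandas as pd
  import pandera as pa

  schema = pa.DataFrameSchema({
      "id": pa.Column(int),
      "score": pa.Column(float),
  })
  schema.validate(pd.DataFrame({"id": [1, 2], "score": [0.5, 0.9]}))  # raises a SchemaError on mismatch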

> The code I was maintaining / updating had long pipelines, had lots of folding, and would drift in and out of numpy quite a bit.

(Protein folding?)

Anyway, yeah, if your codebase is a large proprietary pipeline that thunks to and from pandas-numpy, then now I understand you. But that's your very specific use case. The claim "The only way to understand what's going on with DF code is to step it in a debugger" is, in general, an overstatement.

[0]: https://pandera.readthedocs.io/en/stable/


They effectively are since each column is a series, which is typed.


I happen to know a book or two that might help with Pandas.

(Disclaimer: I wrote three of them and spend a good deal of my time helping others level up their Pandas. Spent this morning helping a medical AI company with Pandas.)


No offense taken.

My tasks aren't usually bottlenecked by the df creation operation. To me, the convenience offered by dfs outstrips the compute hit. However, if this is an order-of-magnitude difference, then it would push me to adopt the more-itertools formulation.


> However, if this is an order of magnitude difference , then it would push me to adopt the more-itertools formulation.

My friend, it's much worse than a single order of magnitude for small inputs:

    import time
    import pandas as pd

    ls = list(range(10))

    b = time.monotonic_ns()
    odds = [v for v in ls if v % 2]
    e = time.monotonic_ns() - b
    print(f"{e=}")

    bb = time.monotonic_ns()
    df = pd.DataFrame(ls)
    odds = df[df % 2 == 1]
    ee = time.monotonic_ns() - bb
    print(f"{ee=}")
    print("ratio", ee/e)

    >>> e=1166
    >>> ee=656792
    >>> ratio 563.2864493996569


My experience is also that numpy and pandas can add 1-2 seconds to Python startup time (which is terrible for the testing experience).
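If you want to see where that time goes, CPython's -X importtime flag prints a per-module import breakdown (to stderr):

  python -X importtime -c "import pandas"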


This is really helpful. Thank you.

I would like to see some kind of query AST for this stuff, with query-engine semantics, so that its ops can be fused together for efficiency. Something like a Clojure transducer.



