More Itertools (more-itertools.readthedocs.io)
206 points by stereoabuse on May 27, 2024 | 49 comments


I've implemented the "chunked" iterator a million times. Glad to see I can just import this next time.
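For reference, it's roughly:

  from more_itertools import chunked

  list(chunked([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]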


Since Python 3.12, the builtin itertools includes a batched function: https://docs.python.org/3/library/itertools.html#itertools.b...
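For example (note that it yields tuples rather than lists):

  from itertools import batched  # Python 3.12+

  list(batched([1, 2, 3, 4, 5], 2))  # [(1, 2), (3, 4), (5,)]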


Even better! Thanks :)


If you like this sort of thing, why not check out “boltons” - things that should be built into Python?

https://pypi.org/project/boltons/


My favorite function here is more_itertools.one. Especially in something like a unit test, where ValueErrors from unexpected conditions are desirable, we can use it to turn code like

  results = list(get_some_stuff(...))
  assert len(results) == 1
  result = results[0]
into

  result = one(get_some_stuff(...))
I guess you could also use tuple-unpacking:

  result, = get_some_stuff(...)
But the syntax for unpacking a single item is awkward. Isn't that trailing comma easy to miss? (Also, I've worked with type-checkers that will complain when a tuple-unpacking could potentially fail, while one has a clear type signature, Iterable[T] -> T.)
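For illustration, one raises on both failure modes:

  from more_itertools import one

  one([42])    # 42
  one([])      # raises ValueError (too few items)
  one([1, 2])  # raises ValueError (more than one item)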


You can also do

  [result] = get_some_stuff(...)


Do tuple unpacking like this:

  result, *_ = iterable()


That’s not the same though. Your unpacking allows any non-empty iterable, while OP's only allows an iterable with exactly one item, or else it throws an exception.
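To illustrate the difference:

  x, *_ = [1, 2, 3]  # fine: x == 1, the rest is discarded
  (x,) = [1, 2, 3]   # ValueError: too many values to unpack
  (x,) = []          # ValueError: not enough values to unpack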


Shout out to JavaScript massively delaying https://github.com/tc39/proposal-async-iterator-helpers at the 23rd hour.

The proposal seemed very close to getting shipped alongside https://github.com/tc39/proposal-iterator-helpers, while basically accepting many of the constraints of current async iteration (one-at-a-time consumption). But the folks involved accepted that concurrency needs had evolved & decided to hold back & keep iterating toward something better.

I feel like a lot of the easily visible mood on the web (against the web) is that there's too much, that stuff just gets piled in. But I see a lot of caring & deliberation & trying to get shit right & good. Sometimes that too can be maddening, but ultimately with the web there aren't really re-dos, & the deliberation is good.


You can implement quite a lot of Python's itertools in JavaScript without too much trouble. For instance, https://observablehq.com/@jrus/itertools

Disclaimer: this code was written several years ago, has had few downstream users, not all of it is particularly high-performing, and it has not been extensively tested.


Your nice work on the JS itertools port has a todo for a "better tee". This was my fault because the old "rough equivalent" code in the Python docs was too obscure and didn't provide a good emulation.

Here is an update that should be much easier to convert to JS:

        def tee(iterable, n=2):
            iterator = iter(iterable)
            # shared_link is a [value, next] cell shared by every branch;
            # whichever branch runs ahead fills in the next cell.
            shared_link = [None, None]
            return tuple(_tee(iterator, shared_link) for _ in range(n))

        def _tee(iterator, link):
            try:
                while True:
                    if link[1] is None:
                        # This branch is at the frontier: pull one value
                        # from the shared iterator and start a new cell.
                        link[0] = next(iterator)
                        link[1] = [None, None]
                    value, link = link
                    yield value
            except StopIteration:
                return
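A quick check that each branch sees the full stream:

        a, b = tee('ABC')
        print(list(a), list(b))  # ['A', 'B', 'C'] ['A', 'B', 'C']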


Thanks! And thanks, Raymond, for all your hard work over the years!


> But the folks involved accepted that concurrency needs had evolved & decided to hold back & keep iterating toward something better

I'm not sure if it was this proposal or another one in a similar space, but I've recently heard about several async improvements that were woefully under-spec'd, and would likely have caused much more harm than good due to all the edge cases that were missed.


This library is my Python productivity secret weapon. So many things I've needed to implement in the past are now just a matter of chaining functions from itertools, functools, and this.


Nice! These can make code a ton simpler. It also has no dependencies of its own, which is a requirement for me to adopt it. Would love to see this brought into the standard lib at some point.


What's the process for adding these to Python's stdlib? Is it even possible to adopt a whole library such as this one?


Yes. unittest.mock used to be a third-party library.

For an idea of the process followed, look up PEP 417 (a Python Enhancement Proposal).


Thank you!


It’s possible but tends not to be common, for a multitude of reasons. The biggest issue is that library updates become tied to Python's own release cycle, which doesn’t provide a lot of flexibility. A package would have to be exceptionally stable to be a reasonable candidate.


It must be possible, because the 'dataclasses' library used to be third-party.


That’s not actually true. While dataclasses took most of its inspiration from attrs, there are many features of attrs that were deliberately not implemented in dataclasses, just so it could “fit” in the stdlib.

Or maybe you mean the backport of dataclasses to 3.6 that is available on PyPI? That actually came after dataclasses was added to 3.7.

Source: I wrote dataclasses.


> I wrote dataclasses.

Much appreciated!


Thank you for correcting me! I must be thinking of another library


it has always annoyed me that flatten isn't already part of itertools


ok itertools has chain.from_iterable but that name is hard to remember


Yes, I think it might have been a slight design mistake to make the variadic version the default. I've only very rarely used it, whereas I use chain.from_iterable a lot.


Amen. For a language that goes on about "flat is better than nested", you have to jump through too many hoops to get your stuff flattened.


It's there in the form of chain:

  from itertools import chain

  flatten = chain.from_iterable

Ref: pytudes - https://github.com/norvig/pytudes/blob/main/ipynb/Advent-202...
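For example:

  >>> list(flatten([[1, 2], [3], [4, 5]]))
  [1, 2, 3, 4, 5]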


Is np.flatten not a workable option in some cases?


Maybe in some cases, but the performance characteristics are way different. The functions in `more_itertools` return lazy generators, but it looks like `np.flatten` materializes the results in an ndarray.
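For example, more_itertools.flatten hands back an iterator and does no work until it's consumed:

  from more_itertools import flatten

  it = flatten([[1, 2], [3]])
  next(it)  # 1 -- items are produced lazily, one at a time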


Is np part of itertools?


np is the standard alias for numpy, probably the most popular numerical and array processing library for python. So, no, not part of the standard lib at all. But a universal import for most users of the language in any science/stats/ml environment. That said, still a surprising place from which to import a basic stream processing function.


I was frustrated by the itertools design, because the chain of operations reads from the inside out. The iterator design in Scala is much friendlier to me.

https://pybites.circle.so/c/python-discussion/functional-com...
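For example, a made-up pipeline has to be read from the innermost call outward:

  from itertools import islice

  # read order: range -> map -> filter -> islice -> list
  result = list(islice(filter(lambda x: x % 2, map(lambda x: x * x, range(100))), 5))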


itertools is a gem and has been since the 2.7 days. Glad to see people waking up to its powerful abstractions.


itertools (iterators) and collections (data structures) are both underrated modules in stdlib.


And are both written by Raymond Hettinger


itertools and more-itertools are two different libs


This looks great.

Usually, I'd cast my arrays into a pandas DF and then use the equivalent dataframe operations. To me, pandas and numpy might as well be part of the Python stdlib.

How should I reason about the tradeoff of using something like this vs pandas/numpy ? Esp. with Numpy 2.0 supporting the string dtype.


> Usally, I'd cast my arrays into a pandas DF

I promise I mean no offense by this, but this is so comically absurd. Like, you know it's not a cast, right? I.e., that you're constructing pandas dataframes.

> How should I reason about the tradeoff of using something like this vs pandas/numpy ?

For small sizes, operations on native types will be faster than the construction of complex objects.


Also, my grief with DFs is that they aren't typed (in the typing-module sense) by column. Maybe that's changed though? It's been a while.

The only way to understand what's going on with DF code is to step through it in a debugger. I know they can be much faster, but man, you pay a maintainability price!


This is incorrect: each column in a pandas DF can have a separate type (what you're asking for is compatibility with Python's type-hinting on a per-column basis, though, which is different), and you can debug the code without needing a debugger: I use pandas regularly and I've never needed to use a debugger on it.
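For example, the per-column types are right there on the frame:

  import pandas as pd

  df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
  df.dtypes  # a: int64, b: object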

(Sure, it's easy to write obfuscated pandas, and it sometimes has version-specific bugs or deprecations which need to be hacked around in a way that compromises readability, and sometimes the API has renamings and changes that are non-trivial. But that's miles from "the only way to understand it is with a debugger". If you want to claim otherwise, post a counterexample on SO (or Codidact) and post the link here.)


Yeah, that's what I meant. I would like per column type-hinting so that data frames are type-checked along with the rest of our stuff and everything is explicit.

I don't have anything I can show because the stuff I was working on was commercial and I don't code Pandas for fun at home ;)

The code I was maintaining / updating had long pipelines, had lots of folding, and would drift in and out of numpy quite a bit.

Part of the issue was my unfamiliarity with Pandas, for sure. But if I just picked a random function in the code, I would have no idea as to the shape of the data flowing in and out, without reading up and down the callstack to see what columns are in play.

Breakpoint and then look at the data, every time!


For type-hinting on dataframe-like objects, people recommend pandera [0].
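A minimal sketch (the column names here are made up):

  import pandas as pd
  import pandera as pa

  schema = pa.DataFrameSchema({
      "id": pa.Column(int),
      "score": pa.Column(float),
  })
  schema.validate(pd.DataFrame({"id": [1, 2], "score": [0.5, 0.9]}))  # raises a SchemaError on mismatch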

> The code I was maintaining / updating had long pipelines, had lots of folding, and would drift in and out of numpy quite a bit.

(Protein folding?)

Anyway, yeah, if your codebase is a large proprietary pipeline that thunks to and from pandas-numpy, then now I understand you. But that's your very specific use case. The claim "The only way to understand what's going on with DF code is to step it in a debugger" is, in general, an overstatement.

[0]: https://pandera.readthedocs.io/en/stable/


They effectively are since each column is a series, which is typed.


I happen to know a book or two that might help with Pandas.

(Disclaimer: I wrote three of them and spend a good deal of my time helping others level up their Pandas. Spent this morning helping a medical AI company with Pandas.)


No offense taken.

My tasks aren't usually bottlenecked by the df creation operation. To me, the convenience offered by dfs outstrips the compute hit. However, if this is an order-of-magnitude difference, then it would push me to adopt the more-itertools formulation.


> However, if this is an order of magnitude difference , then it would push me to adopt the more-itertools formulation.

My friend, it's much worse than a single order of magnitude for small inputs:

    import time
    import pandas as pd

    ls = list(range(10))

    b = time.monotonic_ns()
    odds = [v for v in ls if v % 2]
    e = time.monotonic_ns() - b
    print(f"{e=}")

    bb = time.monotonic_ns()
    df = pd.DataFrame(ls)
    odds = df[df % 2 == 1]
    ee = time.monotonic_ns() - bb
    print(f"{ee=}")
    print("ratio", ee/e)

    >>> e=1166
    >>> ee=656792
    >>> ratio 563.2864493996569


My experience is also that numpy and pandas can add 1-2 seconds to Python startup time (which is terrible for the testing experience).
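If you want to see where that time goes, CPython's -X importtime flag prints a per-module import breakdown (to stderr):

  python -X importtime -c "import pandas"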


This is really helpful. Thank you.

I would like to see some kind of query AST for this stuff, with query-engine semantics, so that its ops can be fused together for efficiency. Something like a Clojure transducer.



