The namedtuple example is wrong. The constructor requires all of the parameters, and an attribute cannot be set:
>>> from collections import namedtuple
>>> LightObject = namedtuple('LightObject', ['shortname', 'otherprop'])
>>> m = LightObject()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __new__() takes exactly 3 arguments (1 given)
>>> m = LightObject("first", "second")
>>> m.shortname
'first'
>>> m.shortname = 'athing'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
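If the article wanted optional fields and "modifiable" instances, the documented namedtuple features get you most of the way there; a sketch continuing the session above, using a prototype instance for defaults and _replace() for modified copies:

>>> default_light = LightObject(shortname=None, otherprop=None)
>>> m2 = default_light._replace(shortname='athing')  # returns a new tuple, never mutates
>>> m2.shortname
'athing'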
Also, bare try/excepts as in:
try:
    # get API data
    data = db.find(id='foo')  # may raise exception
except:
    # log the failure and bail out
    log.warn("Could not retrieve FOO")
    return
are really bad. The failure might be caused by a ^C or MemoryError, or even a SystemExit, should db.find() desire to do that.
Instead, qualify it by catching Exception:
try:
    # get API data
    data = db.find(id='foo')  # may raise exception
except Exception:
    # log the failure and bail out
    log.warn("Could not retrieve FOO")
    return
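This works because, since Python 2.5, KeyboardInterrupt and SystemExit inherit from BaseException rather than Exception, so "except Exception" lets them propagate. A quick check:

>>> issubclass(KeyboardInterrupt, Exception)
False
>>> issubclass(SystemExit, Exception)
False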
It's also poor form to "return True" in the __exit__ method of the context manager. If there is no exception then that's not needed at all, and if there is an exception ... well, that code will swallow AttributeError and NameError and ZeroDivisionError, and leave people confused as to the source of the error.
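To illustrate (a minimal sketch; the class names are mine):

class SwallowsEverything(object):
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        # Returning True unconditionally suppresses *any* exception raised
        # in the with-block, including plain bugs like AttributeError.
        return True

class SuppressesOnlyValueError(object):
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        # Suppress only what we actually know how to handle;
        # everything else propagates normally.
        return exc_type is not None and issubclass(exc_type, ValueError)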
Shouldn't you qualify the exception class as precisely as you can?
If you can't, I'd expect you to catch the specific error classes you can handle and re-raise everything else.
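Something like this, reusing the db and log names from the snippet above, with a hypothetical DatabaseError standing in for whatever the database layer actually raises:

try:
    data = db.find(id='foo')
except DatabaseError:  # hypothetical: the one failure we can handle
    log.warn("Could not retrieve FOO")
    return
except Exception:
    log.exception("Unexpected error retrieving FOO")
    raise  # re-raise anything we don't understand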
However, I don't know the larger context. If this were at the top level of a web-services handler where a "return None" indicates a 500 Internal Server Error, then logging the database failure and stopping is likely acceptable.
Even then, I wouldn't call it good code. However, the goal of this essay seemed to be to give the minimal example, and a more complete example would have required introducing a fake database module with its own exception type. I believe that would have obscured the intent.
I would have preferred real, working code. In this case, with sqlite3. That's fundamentally a pedagogical choice though.
Always glad to see tips, tricks, and otherwise for Python. But for anyone checking out Python who sees how useful the defaultdict construct is but doesn't necessarily need nested attributes, the Counter class[0] has been available for some time now. If you just want to keep track of, well, counts, it's very handy and versatile.
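For example, a quick sketch:

>>> from collections import Counter
>>> counts = Counter(['spam', 'eggs', 'spam', 'spam', 'ham'])
>>> counts['spam']
3
>>> counts['missing']  # missing keys count as zero, no KeyError
0
>>> counts.most_common(1)
[('spam', 3)]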
Use contextlib to write your context managers.
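The @contextmanager decorator turns a generator into a context manager, which is usually shorter than hand-writing __enter__ and __exit__; a minimal sketch (the filename is just an example):

from contextlib import contextmanager

@contextmanager
def opened(path):
    f = open(path)
    try:
        yield f      # the body of the with-block runs here
    finally:
        f.close()    # runs even if the body raises

with opened('hugefile.txt') as f:
    first_line = next(f)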
Thought the exception handling code was going to call out the all-encompassing use of 'except:' but nope. Don't do that.
I think I have bad karma with Python. On every Python project I've needed to touch, I've lost a lot of time dealing with mixed spaces and tabs scattered throughout every source file. I know I'm just unlucky; this can't be normal.
You can start by converting all tabs into 8 spaces. This can be tricky if some string literals contain tab characters, but embedding literal tabs in strings is a bad idea in the first place; use "\t" instead.
Don't mix tabs and spaces to get the same indentation level. Python 3 prohibits it. With Python 2, use "-t" or "-tt", which respectively warn and raise an exception if both spaces and tabs are used inconsistently in the same block.
Thanks for the reminder about integer division changing! I had seen some code in our (2.7) codebase that used (a/b), and I had wondered if I should be explicit about using math.floor, but this is even better.
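For anyone else on 2.7, the usual combination (a sketch) is the __future__ import plus // wherever floor division is actually intended:

from __future__ import division  # makes / mean true division, as in Python 3

print(7 / 2)   # 3.5
print(7 // 2)  # 3 -- floor division, same meaning in Python 2 and 3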
How would you create that hypothetical recursive defaultdict with defaults of defaultdicts? Is such a construct possible without creating a new defaultdefaultdict class?
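For what it's worth, a self-referential factory function does the trick without a new class; a sketch:

>>> from collections import defaultdict
>>> def tree():
...     return defaultdict(tree)
...
>>> d = tree()
>>> d['a']['b']['c'] = 1  # intermediate levels spring into existence
>>> d['a']['b']['c']
1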
For what it's worth, I found the "bad" example in addict's documentation[1] to be better than the "good" one. It's explicit about the structure of the data.
Also, I found the name and the tagline to be distasteful.
I have used Python for years; how did I never encounter Python's collections module? I have implemented partially functional versions of some of these myself. Looks like I gotta brush up on the batteries included!
set() is a great way to deduplicate small lists, but it's important to note that it requires O(n) extra space (in-place sorting can avoid this overhead, but is more complex).
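For reference, a sketch of what sort-then-compact dedupe looks like, which also illustrates the extra complexity (and note that list.sort itself may use temporary space internally):

def dedupe_in_place(items):
    # Sort so duplicates are adjacent, then compact the list in place.
    items.sort()
    write = 0
    for read in range(len(items)):
        if write == 0 or items[read] != items[write - 1]:
            items[write] = items[read]
            write += 1
    del items[write:]  # drop the leftover tail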
Sorting is almost always the wrong way to do it. (Wordy and slow).
It is fragile design to write code that depends on the data being 1) so large that you don't have room for a set(), BUT 2) small enough for an in-memory sort. (IOW, the almost-out-of-memory case invariably degrades over time into flat-out-of-memory.)
Another thought: people seem to place too much concern about the size of various data structures rather than thinking about the data itself. Python containers don't contain anything; they just hold references. (Usually, the weight of a bucket of water is mostly the water, not the bucket itself.)
Finally, if your task is to dedup a lot of data, it doesn't make sense to read it all into memory in the first place (which you would need for a sorting approach). It is better to dedup it using a set as you read in the data:
# Only the unique lines are ever kept in memory
with open('hugefile.txt') as f:
    uniq_lines = set(f)
Dude, sorry to go off like this, but the advice you gave is almost always the wrong way to do it.
Just to note: if it's possible that your worst case is that all lines are unique, then you've effectively read the entire file into memory. But this just gets back to your point:
> rather than thinking about the data itself
…i.e., in this case, we need to realize that the set will require memory on the order of the number of unique items, and that in the worst case, that's O(n).
Of course, if you can guarantee that the size of unique set will fit into memory, then you're fine. But you might need a bit of knowledge about the data to do so.
You're still reading all of the lines into memory in that example as soon as you call `set(f)` (which is basically equivalent to `set(f.readlines())`), though, which may not necessarily be what you want.
No, that's not at all what's happening. f.readlines() creates and returns a full list of all lines, loaded into memory. But set(f) uses f as an iterator, which reads chunks of the file and yields one line at a time, which can then be inserted into the set, de-duping on the fly. Your parent is correct (and clever).
It was incorrect of me to say it was equivalent, but if all lines in the file are unique (ignoring the parent comment's assumption that there will be lots of duplicates) then you're still creating a large object in memory. I'm not sure of the space overhead of sets/hash tables vs. lists, but I imagine they'd be pretty close.
But of course that will be unavoidable if you want fast deduplication. You could do fully lazy deduplication using a generator, but you'd have to avoid using sets.
If you're going from a generator object (or any other lazy iterator, like the file object in this case) -> set object, there will be a memory usage increase with each additional unique line read.
What I meant is you could in theory process a generator and omit duplicates without any real memory usage (even with a file of millions of lines) by chaining generators together. This would be slower than a set but much more memory efficient.
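A sketch of one way to read "chaining generators together"; note it still keeps one small filter generator alive per unique value, and the repeated filtering is where the slowdown comes from:

def dedupe(iterable):
    it = iter(iterable)
    while True:
        try:
            x = next(it)
        except StopIteration:
            return
        yield x
        # Wrap the remaining stream in a filter that drops this value.
        it = (y for y in it if y != x)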
I think you may have gotten ahead of yourself in your explanation, but to be clear, it's the "with" statement that causes the line iteration that then feeds into set(f). To say that "set(f) uses f as an iterator" might imply to some that set() is causing the iteration, which isn't true.
Sorry, this post is also filled with misinformation.
* The with-statement only causes the file to be closed after use. It has nothing to do with iteration.
* Files themselves are self-iterable (anything that loops over a file object uses a line-at-a-time iterator). Even writing "for line in f: ..." causes you to loop a line at a time without the whole file being in memory all at once.
* Sets are just one of many objects that take an iterable as an argument: min(f), max(f), list(f), tuple(f), etc.
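For instance:

with open('hugefile.txt') as f:
    longest = max(f, key=len)  # streams line by line; never the whole file at once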
Ah! Thanks for the correction. Somewhere along the way I picked up the belief that in addition to the file-closing aspect, "with" ensured a line-at-a-time iterator, akin to adding on "for line in f:".
I'm not sure I understand your third comment though. As I understand it, iterables have an __iter__ method. __iter__ methods return iterators. So an iterator for the iterable f traverses f and sends f's values to, in this case, set(). I believe we agree there. My initial concern was simply that the statement could be read "The iterator, set(), uses f.", and I don't see where the disagreement arises.