Python Tips and Traps

dalke · on Jan 19, 2015

The namedtuple example is wrong. The constructor requires all of the parameters, and an attribute cannot be set:

    >>> from collections import namedtuple
    >>> LightObject = namedtuple('LightObject', ['shortname', 'otherprop'])
    >>> m = LightObject()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: __new__() takes exactly 3 arguments (1 given)
    >>> m = LightObject("first", "second")
    >>> m.shortname
    'first'
    >>> m.shortname = 'athing'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: can't set attribute

Also, bare try/excepts as in:

    try:
      # get API data
      data = db.find(id='foo') # may raise exception
    except:
      # log the failure and bail out
      log.warn("Could not retrieve FOO")
      return

are really bad. The failure might be caused by a ^C or MemoryError, or even a SystemExit, should db.find() desire to do that.

Instead, qualify it by catching Exception:

    try:
      # get API data
      data = db.find(id='foo') # may raise exception
    except Exception:
      # log the failure and bail out
      log.warn("Could not retrieve FOO")
      return

It's also poor form to "return True" in the exit method of the context manager. If there is no exception then that's not needed at all, and if there is an exception ... well, that code will swallow AttributeError and NameError and ZeroDivisionError, and leave people confused as to the source of the error.

ryan_sb · on Jan 19, 2015

Thanks for correcting me, I'll get those updated ASAP.

pdonis · on Jan 19, 2015

A good rule of thumb: if you show actual code, run it before you post it!

timdierks · on Jan 20, 2015

I frequently want ad-hoc slots to create a structure of related values without the verbosity of a dict with static keys. What's the pythonic method?

E.g. when I want to carry a foo, a bar, and a baz around, but namedtuple isn't right (e.g. I need mutation). I'd prefer

    config = Someclass()
    config.foo = 1
    config.bar = "quux"
    config.baz = 3.3

and then using config.bar, etc.

over:

    config = { 'foo': 1, 'bar': "quux", 'baz': 3.3 }

and then using config['bar']

mkesper · on Jan 20, 2015

Shouldn't you qualify the exception class as properly as you can? If you can't I'd expect to check for certain error classes you can handle and else reraise the Exception.

dalke · on Jan 20, 2015

Yes, I would do that in nearly every case.

However, I don't know the larger context. If this were at the top-level of a web services handler where a "return None" indicates a 400 - Internal Server Error, then logging the database failure and stopping is likely acceptable.

Even then, I wouldn't call it good code. However, the goal of this essay seemed to be to give the minimal example, and a more complete example would have required introducing a fake database module with its own exception type. I believe that would have obscured the intent.

I would have preferred real, working code. In this case, with sqlite3. That's fundamentally a pedagogical choice though.

bdevine · on Jan 19, 2015

Always glad to see tips, tricks, and otherwise for Python. But for anyone checking out Python who sees how useful the defaultdict construct is but doesn't necessarily need nested attributes, the Counter class[0] has been available for some time now. If you just want to keep track of, well, counts, it's very handy and versatile.

[0] https://docs.python.org/3.1/library/collections.html#collect...

ryan_sb · on Jan 20, 2015

Totally true, I'll add that in

kasabali · on Jan 19, 2015

If you want to read hundreds of these look no further than Python Cookbook [0].

[0] http://chimera.labs.oreilly.com/books/1230000000393/

task_queue · on Jan 19, 2015

Use contextlib to write your context managers. Thought the exception handling code was going to catch the all-encompassing use of 'except:' but nope. Don't do that.

ryan_sb · on Jan 20, 2015

I don't know how I've missed out on contextlib all this time. I'll update that section.

_hyn3 · on Jan 19, 2015

Excellent article. Loved the specific examples for collection types, esp defaultdict and namedtuples!

rhapsodyv · on Jan 19, 2015

I think I have a karma with python. Every python project a needed to touch I lost a lot of time dealing with mixed spaces and tabs scattered throughout every source file. I really know I am unlucky, this cannot be normal.

dalke · on Jan 20, 2015

Don't use tabs. See https://www.python.org/dev/peps/pep-0008/#tabs-or-spaces which says to prefer spaces.

You can start by converting all tabs into 8 spaces. This can be tricky should some strings have tabs. That's a bad idea in the first place. Use "\t".

Don't mix tabs and spaces to get the same indentation level. Python 3 prohibits it. With Python 2 use "-t" or "-tt", which respectively warns and raises an exception if both spaces and tabs are used in the same block.

rhapsodyv · on Jan 20, 2015

Yeah, I ended converting all the project to spaces.

TheLoneWolfling · on Jan 20, 2015

Have a save hook that converts everything to <tabs or spaces>.

pjmlp · on Jan 20, 2015

Some of us just use Python friendly editors.

undergrowth54 · on Jan 20, 2015

You are correct that this is not normal. Both sublimetext and vim make this easy to avoid. I recommend sublime

gknoy · on Jan 19, 2015

Thanks for the reminder about integer division changing! I had seen some code in our (2.7) codebase that used (a/b), and I had wondered if I should be explicit about using math.floor, but this is even better.

mbauman · on Jan 19, 2015

How would you create that hypothetical recurise defaultdict with defaults of defaultdicts? Is such a construct possible without creating a new defaultdefaultdict class?

birken · on Jan 19, 2015

  >> nested_dd = collections.defaultdict(lambda: nested_dd)
  >> nested_dd['a']['b']['c']['d'] = 'hello'

jonathanpoulter · on Jan 19, 2015

For reference, this is called autovivification. http://en.wikipedia.org/wiki/Autovivification#Python

birken · on Jan 19, 2015

Oops, messed this up (too late to edit). Thanks choochootrain for the correction:

  >>> NestedDD = lambda: collections.defaultdict(NestedDD)
  >>> nested_dd = NestedDD()
  >>> nested_dd['a']['b']['c']['d'] = 'hello'

does the proper thing.

bdr · on Jan 19, 2015

I don't think that's what they wanted:

    >>> nested_dd['d']
    'hello'

choochootrain · on Jan 19, 2015

try nested_dd = lambda: defaultdict(nested_dd)

_hyn3 · on Jan 19, 2015

Just. Wow.

rnhmjoj · on Jan 20, 2015

  from collections import defaultdict

  defaultobj = lambda: type('defaultobj', (defaultdict,), {
      '__getattr__': lambda self, x:    self.__getitem__(x),
      '__setattr__': lambda self, x, v: self.__setitem__(x, v)
      })(defaultobj)

  names = defaultobj()
  names.mammalia.primates.homo['H. Sapiens'] = 'Human being'

  print(names)

The same but also allows dot notation.

heydenberk · on Jan 19, 2015

For what it's worth, I found the "bad" example in addict's documentation[1] to be better than the "good" one. It's explicit about the structure of the data.

Also, I found the name and the tagline to be distasteful.

[1] https://github.com/mewwts/addict#addict---the-python-dict-th...

_hyn3 · on Jan 19, 2015

Thanks for the awesome link and agreed, distasteful. (could be worse.. at least it's not 'pimp my python!')

At least you won't forget.

     pip install addict

tekromancr · on Jan 20, 2015

I have used python for years, how did I never encounter python's collections module? I have implemented functionally partial versions of some of these. Looks like I gotta brush up on the batteries included!

pjmlp · on Jan 20, 2015

While you are at it, check the itertools, functools as well, in case you also don't know them.

tekromancr · on Jan 21, 2015

itertools I know and love, functools is a mystery I should look into.

settrans · on Jan 19, 2015

set() is a great way to deduplicate small lists, but it's important to note that it requires O(n) extra space (in-place sorting can avoid this overhead, but is more complex).

raymondh · on Jan 19, 2015

Sorting is almost always the wrong way to do it. (Wordy and slow).

It is fragile design to write code that depends on 1) the data is so large that you don't have room for a set() BUT 2) it is small enough for an in-memory sort. (IOW, the almost-out-of-memory case invariably degrades over time to flat-out-of-memory).

Another thought: people seem to place too much concern about about the size of various data structures rather than thinking about the data itself. Python containers don't contain anything, they just hold references. (Usually, the weight of a bucket of water is mostly the water, not the bucket itself).

Finally, if your task is to dedup a lot of data, it doesn't make sense to read it all into memory in the first place (which you would need for a sorting approach). It is better dedup it using a set as you read in the data:

    # Only the unique lines are ever kept in memory
    with open('hugefile.txt') as f:
        uniq_lines = set(f)

Dude, sorry to go off like this, but the advice you gave is almost always the wrong way to do it.

deathanatos · on Jan 20, 2015

Just to note, but if it's possible that your worst case is that all lines are unique, then you've effectively read the entire file into memory. But this just gets back to your point:

> rather than thinking about the data itself

…i.e, in this case, we need to realize that the set will require on the order of the size of how many unique items there are, and that in the worst case, that's O(n).

Of course, if you can guarantee that the size of unique set will fit into memory, then you're fine. But you might need a bit of knowledge about the data to do so.

meowface · on Jan 19, 2015

You're still reading all of the lines into memory in that example as soon as you call `set(f)` (which is basically equivalent to set(f.readlines()), though, which may not necessarily be what you want.

jzwinck · on Jan 20, 2015

No, that's not at all what's happening. f.readlines() creates and returns a full list of all lines, loaded into memory. But set(f) uses f as an iterator, which reads chunks of the file and yields one line at a time, which can then be inserted into the set, de-duping on the fly. Your parent is correct (and clever).

meowface · on Jan 20, 2015

It was incorrect of me to say it was equivalent, but if all lines in the file are unique (ignoring the parent comment's assumption there will be lots of duplicates) then you're still creating a large object in memory. I'm not sure of the space of sets/hash tables vs. lists, but I imagine they'd be prety close.

But of course that will be unavoidable if you want fast deduplication. You could do fully lazy deduplication using a generator, but you'd have to avoid using sets.

DasIch · on Jan 20, 2015

If all lines in the file are unique, memory usage stays constant after the first line is read.

A set by definition contains only unique items, that is all items in a set are different from each other.

meowface · on Jan 22, 2015

If you're going from generator object (or any other lazy iterator, like the file object in this case) -> set object, there will be a memory usage increase with each additional line read.

What I meant is you could in theory process a generator and omit duplicates without any real memory usage (even with a file of millions of lines) by chaining generators together. This would be slower than a set but much more memory efficient.

bdevine · on Jan 20, 2015

I think you may have gotten ahead of yourself in your explanation, but to be clear, it's the "with" statement that causes the line iteration that then feeds into set(f). To say that "set(f) uses f as an iterator" might imply to some that set() is causing the iteration, which isn't true.

ETA: See raymondh's post below.

raymondh · on Jan 20, 2015

Sorry, this post is also filled with misinformation.

* The with-statement only causes the file to be closed after use. It has nothing to do with iteration.

* Files themselves are self-iterable (anything that loops over a file object uses a line-at-a-time iterator). Even writing "for line in f: ..." causes you to loop a line at a time without the whole file being in memory all at once.

* Sets are just one of many objects that takes an iterable as an argument: min(f), max(f), list(f), tuple(f), etc.

bdevine · on Jan 20, 2015

Ah! Thanks for the correction. Somewhere along the way I picked up the belief that in addition to the file-closing aspect, "with" ensured a line-at-a-time iterator, akin to adding on "for line in f:".

I'm not sure I understand your third comment though. As I understand it, iterables have an __iter__ method. __iter__ methods return iterators. So an iterator for the iterable f traverses f and sends f's values to, in this case, set(). I believe we agree there. My initial concern was simply that the statement could be read "The iterator, set(), uses f.", and I don't see where the disagreement arises.

beagle3 · on Jan 20, 2015

I'm not sure about the internals of set(f), but the logically equivalent

    s = set()
    for x in f: s.add(x)

Does not read everything into memory - by using iterators, it only reads one line at a time.

raymondh · on Jan 19, 2015

Nice article. It well expresses the delight that comes from seeing the expressiveness of just a handful of tools that fit well together.