Hacker News
Python Tips and Traps (airpair.com)
201 points by ryan_sb on Jan 19, 2015 | 45 comments



The namedtuple example is wrong. The constructor requires all of the parameters, and an attribute cannot be set:

    >>> from collections import namedtuple
    >>> LightObject = namedtuple('LightObject', ['shortname', 'otherprop'])
    >>> m = LightObject()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: __new__() takes exactly 3 arguments (1 given)
    >>> m = LightObject("first", "second")
    >>> m.shortname
    'first'
    >>> m.shortname = 'athing'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: can't set attribute
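For what it's worth, if a field really does need to "change", namedtuple provides _replace, which builds a new instance with the given fields swapped out:

    >>> m2 = m._replace(shortname='athing')  # new tuple; m itself is untouched
    >>> m2.shortname
    'athing'
    >>> m.shortname
    'first'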
Also, bare try/excepts as in:

    try:
      # get API data
      data = db.find(id='foo') # may raise exception
    except:
      # log the failure and bail out
      log.warn("Could not retrieve FOO")
      return
are really bad. The failure might be caused by a ^C (KeyboardInterrupt) or a MemoryError, or even a SystemExit, should db.find() desire to raise one.

Instead, qualify it by catching Exception:

    try:
      # get API data
      data = db.find(id='foo') # may raise exception
    except Exception:
      # log the failure and bail out
      log.warn("Could not retrieve FOO")
      return

It's also poor form to "return True" in the __exit__ method of the context manager. If there is no exception then it's not needed at all, and if there is an exception ... well, that code will swallow AttributeError and NameError and ZeroDivisionError, and leave people confused as to the source of the error.
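
To make that concrete, here is a minimal sketch (a made-up manager, not the article's code) of how a truthy return from __exit__ suppresses whatever was raised in the block:

    class Swallower(object):
        def __enter__(self):
            return self
        def __exit__(self, exc_type, exc_value, traceback):
            return True  # truthy: tells Python to suppress the exception

    with Swallower():
        1 / 0  # this ZeroDivisionError silently vanishes
    print("still running; the error is gone")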


Thanks for correcting me, I'll get those updated ASAP.


A good rule of thumb: if you show actual code, run it before you post it!


I frequently want ad-hoc slots to create a structure of related values without the verbosity of a dict with static keys. What's the pythonic method?

E.g. when I want to carry a foo, a bar, and a baz around, but namedtuple isn't right (e.g. I need mutation). I'd prefer

    config = Someclass()
    config.foo = 1
    config.bar = "quux"
    config.baz = 3.3
and then using config.bar, etc.

over:

    config = { 'foo': 1, 'bar': "quux", 'baz': 3.3 }
and then using config['bar']
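
For what it's worth, the closest thing I've found so far is types.SimpleNamespace (assuming Python 3.3+), though I don't know whether it counts as canonical:

    >>> import types
    >>> config = types.SimpleNamespace()
    >>> config.foo = 1
    >>> config.bar = "quux"
    >>> config.bar
    'quux'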


Shouldn't you qualify the exception class as precisely as you can? If you can't, I'd expect you to catch the specific error classes you can handle and re-raise everything else.


Yes, I would do that in nearly every case.

However, I don't know the larger context. If this were at the top level of a web services handler where a "return None" results in a 500 Internal Server Error, then logging the database failure and stopping is likely acceptable.

Even then, I wouldn't call it good code. However, the goal of this essay seemed to be to give the minimal example, and a more complete example would have required introducing a fake database module with its own exception type. I believe that would have obscured the intent.

I would have preferred real, working code. In this case, with sqlite3. That's fundamentally a pedagogical choice though.
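
Something along these lines, say (a sketch only; the sqlite3 connection and table are my assumptions, not the article's):

    import logging
    import sqlite3

    log = logging.getLogger(__name__)

    def get_foo(conn):
        try:
            row = conn.execute("SELECT data FROM items WHERE id = ?",
                               ('foo',)).fetchone()
        except sqlite3.Error:  # only database errors are handled here
            log.warning("Could not retrieve FOO", exc_info=True)
            return None
        return row  # KeyboardInterrupt, SystemExit, etc. still propagate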


Always glad to see tips, tricks, and otherwise for Python. But for anyone checking out Python who sees how useful the defaultdict construct is but doesn't necessarily need nested attributes, the Counter class[0] has been available for some time now. If you just want to keep track of, well, counts, it's very handy and versatile.
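
For instance, a quick sketch of the tallying it saves you from writing by hand:

    >>> from collections import Counter
    >>> words = "the quick brown fox jumps over the lazy dog".split()
    >>> Counter(words).most_common(1)
    [('the', 2)]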

[0] https://docs.python.org/3.1/library/collections.html#collect...


Totally true, I'll add that in


If you want to read hundreds of these look no further than Python Cookbook [0].

[0] http://chimera.labs.oreilly.com/books/1230000000393/


Use contextlib to write your context managers. I thought the exception handling section was going to call out the all-encompassing use of 'except:', but nope. Don't do that.
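
For anyone who hasn't seen it, a minimal sketch (my own toy example, not the article's) of the @contextmanager style:

    from contextlib import contextmanager

    @contextmanager
    def tag(name):
        print("<%s>" % name)
        try:
            yield  # the body of the with-block runs here
        finally:
            print("</%s>" % name)  # cleanup runs even if the body raises

    with tag("h1"):
        print("hello")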


I don't know how I've missed out on contextlib all this time. I'll update that section.


Excellent article. Loved the specific examples for collection types, esp defaultdict and namedtuples!


I think I have bad karma with Python. On every Python project I've needed to touch, I've lost a lot of time dealing with mixed spaces and tabs scattered throughout every source file. I know I'm just unlucky; this can't be normal.


Don't use tabs. See https://www.python.org/dev/peps/pep-0008/#tabs-or-spaces which says to prefer spaces.

You can start by converting all tabs into 8 spaces. This can be tricky if some string literals contain literal tab characters; embedding them is a bad idea in the first place. Use "\t" instead.

Don't mix tabs and spaces to get the same indentation level; Python 3 prohibits it. With Python 2, use "-t" or "-tt", which respectively warn and raise an exception when spaces and tabs are mixed in the same block.
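
The conversion itself is nearly a one-liner with str.expandtabs (a sketch; the filename is made up, and per the caveat above it also expands tabs inside string literals):

    # rewrite a file with tabs expanded to 8-column tab stops
    with open('module.py') as f:
        fixed = f.read().expandtabs(8)
    with open('module.py', 'w') as f:
        f.write(fixed)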


Yeah, I ended up converting the whole project to spaces.


Have a save hook that converts everything to <tabs or spaces>.


Some of us just use Python friendly editors.


You are correct that this is not normal. Both Sublime Text and vim make this easy to avoid. I recommend Sublime.


Thanks for the reminder about integer division changing! I had seen some code in our (2.7) codebase that used (a/b), and I had wondered if I should be explicit about using math.floor, but this is even better.
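
For anyone else maintaining 2.7 code, the usual belt-and-braces approach (a quick sketch):

    >>> from __future__ import division  # Python 3 semantics on 2.7
    >>> 7 / 2   # true division now
    3.5
    >>> 7 // 2  # floor division; same result in 2 and 3
    3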


How would you create that hypothetical recursive defaultdict with defaults of defaultdicts? Is such a construct possible without creating a new defaultdefaultdict class?


  >>> import collections
  >>> nested_dd = collections.defaultdict(lambda: nested_dd)
  >>> nested_dd['a']['b']['c']['d'] = 'hello'


For reference, this is called autovivification. http://en.wikipedia.org/wiki/Autovivification#Python


Oops, messed this up (too late to edit). Thanks choochootrain for the correction:

  >>> NestedDD = lambda: collections.defaultdict(NestedDD)
  >>> nested_dd = NestedDD()
  >>> nested_dd['a']['b']['c']['d'] = 'hello'
does the proper thing.


I don't think that's what they wanted:

    >>> nested_dd['d']
    'hello'


try nested_dd = lambda: defaultdict(nested_dd)


Just. Wow.


  from collections import defaultdict

  defaultobj = lambda: type('defaultobj', (defaultdict,), {
      '__getattr__': lambda self, x:    self.__getitem__(x),
      '__setattr__': lambda self, x, v: self.__setitem__(x, v)
      })(defaultobj)

  names = defaultobj()
  names.mammalia.primates.homo['H. Sapiens'] = 'Human being'

  print(names)
The same but also allows dot notation.


For what it's worth, I found the "bad" example in addict's documentation[1] to be better than the "good" one. It's explicit about the structure of the data.

Also, I found the name and the tagline to be distasteful.

[1] https://github.com/mewwts/addict#addict---the-python-dict-th...


Thanks for the awesome link and agreed, distasteful. (could be worse.. at least it's not 'pimp my python!')

At least you won't forget.

     pip install addict


I have used Python for years; how did I never encounter the collections module? I have implemented partially functional versions of some of these myself. Looks like I gotta brush up on the batteries included!


While you're at it, check out itertools and functools as well, in case you don't know them either.
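
A couple of highlights, as a quick sketch (lru_cache assumes Python 3.2+):

    from functools import partial, lru_cache

    int_from_hex = partial(int, base=16)  # freeze an argument
    print(int_from_hex('ff'))             # 255

    @lru_cache(maxsize=None)              # memoize a pure function
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    print(fib(100))  # fast, thanks to the cache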


itertools I know and love, functools is a mystery I should look into.


set() is a great way to deduplicate small lists, but it's important to note that it requires O(n) extra space (in-place sorting can avoid this overhead, but is more complex).


Sorting is almost always the wrong way to do it. (Wordy and slow).

It is fragile design to write code that depends on the data being 1) so large that you don't have room for a set() BUT 2) small enough for an in-memory sort. (IOW, the almost-out-of-memory case invariably degrades over time to flat-out-of-memory.)

Another thought: people seem to place too much concern on the size of various data structures rather than thinking about the data itself. Python containers don't contain anything; they just hold references. (Usually, the weight of a bucket of water is mostly the water, not the bucket itself.)

Finally, if your task is to dedup a lot of data, it doesn't make sense to read it all into memory in the first place (which you would need to do for a sorting approach). It is better to dedup it using a set as you read in the data:

    # Only the unique lines are ever kept in memory
    with open('hugefile.txt') as f:
        uniq_lines = set(f)
Dude, sorry to go off like this, but the advice you gave is almost always the wrong way to do it.


Just to note: if it's possible that in your worst case all lines are unique, then you've effectively read the entire file into memory. But this just gets back to your point:

> rather than thinking about the data itself

…i.e., in this case, we need to realize that the set will require space on the order of the number of unique items, and that in the worst case, that's O(n).

Of course, if you can guarantee that the set of unique items will fit into memory, then you're fine. But you might need a bit of knowledge about the data to do so.


You're still reading all of the lines into memory in that example as soon as you call `set(f)` (which is basically equivalent to `set(f.readlines())`), though, which may not necessarily be what you want.


No, that's not at all what's happening. f.readlines() creates and returns a full list of all lines, loaded into memory. But set(f) uses f as an iterator, which reads chunks of the file and yields one line at a time, which can then be inserted into the set, de-duping on the fly. Your parent is correct (and clever).


It was incorrect of me to say it was equivalent, but if all lines in the file are unique (ignoring the parent comment's assumption that there will be lots of duplicates), then you're still creating a large object in memory. I'm not sure of the space overhead of sets/hash tables vs. lists, but I imagine they'd be pretty close.

But of course that will be unavoidable if you want fast deduplication. You could do fully lazy deduplication using a generator, but you'd have to avoid using sets.


If all lines in the file are identical, memory usage stays constant after the first line is read.

A set by definition contains only unique items, that is all items in a set are different from each other.


If you're going from generator object (or any other lazy iterator, like the file object in this case) -> set object, there will be a memory usage increase with each additional line read.

What I meant is you could in theory process a generator and omit duplicates without any real memory usage (even with a file of millions of lines) by chaining generators together. This would be slower than a set but much more memory efficient.
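
To sketch what I mean, assuming the input happens to be sorted so that duplicates sit next to each other, itertools.groupby can drop them without building a set (a toy of mine; it obviously can't handle unsorted input):

    from itertools import groupby

    def dedup_sorted(lines):
        # yields each distinct line once; assumes duplicates are adjacent
        for line, _group in groupby(lines):
            yield line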


I think you may have gotten ahead of yourself in your explanation, but to be clear, it's the "with" statement that causes the line iteration that then feeds into set(f). To say that "set(f) uses f as an iterator" might imply to some that set() is causing the iteration, which isn't true.

ETA: See raymondh's post below.


Sorry, this post is also filled with misinformation.

* The with-statement only causes the file to be closed after use. It has nothing to do with iteration.

* Files themselves are self-iterable (anything that loops over a file object uses a line-at-a-time iterator). Even writing "for line in f: ..." causes you to loop a line at a time without the whole file being in memory all at once.

* Sets are just one of many objects that take an iterable as an argument: min(f), max(f), list(f), tuple(f), etc.


Ah! Thanks for the correction. Somewhere along the way I picked up the belief that in addition to the file-closing aspect, "with" ensured a line-at-a-time iterator, akin to adding on "for line in f:".

I'm not sure I understand your third comment though. As I understand it, iterables have an __iter__ method. __iter__ methods return iterators. So an iterator for the iterable f traverses f and sends f's values to, in this case, set(). I believe we agree there. My initial concern was simply that the statement could be read "The iterator, set(), uses f.", and I don't see where the disagreement arises.


I'm not sure about the internals of set(f), but the logically equivalent

    s = set()
    for x in f: s.add(x)
does not read everything into memory; by using iterators, it only reads one line at a time.


Nice article. It captures well the delight that comes from seeing the expressiveness of just a handful of tools that fit well together.



