Hacker News | runT1ME's comments

I really think this is where pure FP shines.

If you look at the architecture of something like Apache Beam, while you describe your computations in a language like Java or Python, you're really using a DSL for creating an immutable DAG that processes chunks of immutable data, persisting each chunk (being loose with terms here) for durability, 'modifying' it in the immutable sense, and then passing it on to the next edge, and so on.

In a single standalone system, many argue that a purely immutable conceptual approach has pros and cons. In the Big Data world, I've never heard anyone argue for an imperative approach. Immutability is necessary, and as soon as you need immutability you want all the other goodies that facilitate it in a clean way like monads, lenses, and the like.


Intriguing, but I think I lost you a little in the details, and I'm really dying to understand your perspective.

    1) creating an immutable DAG that processes chunks of immutable data, 
    2) persisting each chunk (being loose with terms here) for durability, 
    3) 'modifying' it in the immutable sense 
    4) and then passing it on to the next edge and so on
Isn't #4 the persisting part, not #2? Maybe I'm confused by what you mean by "persist"? Render the instructions (mutate the data)?

And when you say "'modifying' it in the immutable sense", does this mean that the original data stays untouched, but "instructions" on how to modify the data get "pinned" to the data?

Then every other subsequent step in the DAG is just modifying the "instructions"?

If I understand you correctly, your point about how this is where FP shines is really illustrated in #4. You end up passing instructions around instead of the data. But how FP instructions can be modified and modified again really baffles me. The closest I can get to understanding it in practical terms is by using something like Clojure and thinking about how Clojure is just data: you can modify, or build upon, Clojure by modifying it like you modify data. I struggle to extend this to another language like Scala or JavaScript.


Let's say your immutable DAG has four different 'transformations' (think either a map or a fold on the data). Beam or Spark will take a chunk of your data, partition it somehow (so you get parallelism), and do a transform on the data.

Now if you're doing a bunch of computations across tens or hundreds of machines, you don't want to fail the whole thing because one hypervisor crashes or there's a JVM segfault or whatnot. So each step is usually saved to disk via checkpoint and then moved along to the next transformation in batches, so if the subsequent transformation is where it dies, you have a safe spot to restart that particular computation from.

In addition, your (chunk of) data might need to be shipped to an additional machine to be transformed, so you're not so much 'updating' data as you are creating a new, space-efficient representation of the changes and shuffling those along the various steps.
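
As a toy sketch (not how Beam or Spark actually implement checkpointing; the file layout, stage names, and transform functions here are all made up), the save-then-move-along pattern might look like:

```python
import json
import os
import tempfile

def run_stage(stage_name, transform, chunks, checkpoint_dir):
    """Apply `transform` to each chunk, persisting every result before it
    is handed to the next stage, so a crash downstream can restart here."""
    out = []
    for i, chunk in enumerate(chunks):
        path = os.path.join(checkpoint_dir, f"{stage_name}-{i}.json")
        if os.path.exists(path):          # checkpoint found: recover, don't recompute
            with open(path) as f:
                result = json.load(f)
        else:
            result = transform(chunk)     # the actual work
            with open(path, "w") as f:    # the durability point
                json.dump(result, f)
        out.append(result)
    return out

# A two-stage "DAG": double every element, then sum each chunk.
ckpt = tempfile.mkdtemp()
chunks = [[1, 2], [3, 4]]
doubled = run_stage("double", lambda c: [x * 2 for x in c], chunks, ckpt)
summed = run_stage("sum", sum, doubled, ckpt)
print(summed)  # [6, 14]
```

Re-running the script with the same checkpoint directory skips the recompute entirely, which is the point: the failure of a later stage never forces redoing an earlier one.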


Ok, makes sense now. Thanks for taking the time to explain.


FP?


Functional programming


Besides performance, what other cons exist for immutable data structures in a single standalone system?


It's awkward to work with deeply nested structures. Think about having a map of customer objects which each have a list of addresses, and you need to update everyone's first address to be capitalized for some reason. You'd really want a fully fleshed-out lens library, and maybe even macros for auto-generating lenses for your concrete data structures, to make it easy to 'update' (via a persistent structure) without having to deal with all the cruft of doing it by hand.
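
To make the pain concrete, here's a hand-rolled Python sketch of a minimal "lens" (just get/set pairs composed by hand; all names and the data shape are hypothetical, and real lens libraries are far more general) doing that capitalize-the-first-address update without mutating the original:

```python
from functools import reduce

# A "lens" here is just a (getter, setter) pair for one step into a structure.
def key_lens(k):
    return (lambda d: d[k],
            lambda d, v: {**d, k: v})               # copy-on-write dict update

def index_lens(i):
    return (lambda xs: xs[i],
            lambda xs, v: xs[:i] + [v] + xs[i + 1:])  # copy-on-write list update

def compose(*lenses):
    """Compose lenses into (get, over): `over` rebuilds each layer on the way out."""
    def get(s):
        return reduce(lambda acc, l: l[0](acc), lenses, s)
    def over(s, f, ls=lenses):
        g, st = ls[0]
        if len(ls) == 1:
            return st(s, f(g(s)))
        return st(s, over(g(s), f, ls[1:]))
    return get, over

customer = {"name": "ada", "addresses": [{"street": "main st"}]}
_, over_first_street = compose(key_lens("addresses"), index_lens(0), key_lens("street"))
updated = over_first_street(customer, str.capitalize)
print(updated["addresses"][0]["street"])   # "Main st"
print(customer["addresses"][0]["street"])  # original untouched: "main st"
```

Even this toy version shows why you'd want a library (or macros) to generate the lenses: the per-field boilerplate is mechanical but endless when done by hand.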


That is the use case that we have to work with quite often. In the end, we use something along the lines of

  (update-fn (fn [{:keys [address] :as customer}]
               (if address
                 (update customer :address clojure.string/capitalize)
                 customer))
             customers)
which pattern matches on some object (map) and does the processing. We find it less fragile than specifying an explicit path to an element, and it can also work in a polymorphic fashion. On the other hand, there is a risk of a false positive (modifying an address you shouldn't), but you can mitigate that by checking for an additional set of fields that are specific to that object (map).


I agree that you want specialised mini-DSLs for querying and updating highly nested structures (... which you should probably avoid creating anyway), but for such a simple task you could just do this (Clojure example):

    (update-vals customers #(update-in % [:address 0] str/capitalize))


In the JavaScript world, there is ImmutableJS, which uses this style of updates. However, there is a better approach, which is what ImmerJS uses: you write a function in which you can mutate the target data structure, wrap it in a "produce" function, and the library will pick up on those mutations and create a new copy of the data that is structurally shared with the original.

https://immerjs.github.io/immer/


Yeah, that is basically what that Clojure code does. All of the data structures in Clojure are persistent and use structural sharing.


That Clojure code isn't quite grasping the essence of ImmerJS. The whole point of ImmerJS is that JS has nice, built-in syntax for in-place mutation and we can reuse that syntax to generate updates to an immutable data structure so long as we scope the mutation syntax so that outside of an individual block of mutation syntax everything stays immutable. That it is implemented with JS proxies is something of an implementation detail (it could e.g. be implemented with linear or affine types or something like Haskell's ST type).

In this sense it's closer to Clojure's transients, if Clojure came prepackaged with special mutation syntax (notably assignment via =) and if transient-ness could be propagated transitively to all nested structures in a data structure.


Memory usage: the fancier data structures in newer FP frameworks and languages (many of them pioneered by Okasaki) are often very memory-hungry.


Nothing wrong with using incredibly un-fancy data structures, like arrays, in an FP way. You just need to make sure that you don't do operations that modify individual elements, but work in large batches. So no for loops inserting elements one by one, but map, filter, flatmap...

Most good FP data structures will be tree-like, but with a low branching factor and chunky leaves. Now of course that does not save you from lots of memory consumption if you use a language/platform where even primitives are boxed, like the JVM. But that is a completely separate topic.
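
A quick Python sketch of the batch style described above (the order records and the price constant are made up): each step produces a whole new list instead of mutating elements one by one.

```python
orders = [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 0}, {"sku": "a", "qty": 5}]

# No element-by-element mutation: filter, map, fold, each over the whole batch.
nonempty = [o for o in orders if o["qty"] > 0]                 # filter
lines = [{**o, "total": o["qty"] * 10} for o in nonempty]      # map (price of 10 is invented)
grand = sum(l["total"] for l in lines)                         # fold
print(grand)  # 70
```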


Sharing immutable data is hard. With mutable data you throw a lock on it and mutate in place. How do you share mutations when you can't change anything?

To be clear it is certainly a solvable problem (as this post shows) but it can make things difficult.


Sharing mutable data without a mutex (which suffers from unbounded contention) is hard. Approaches that work include updating persistent data structures and sharing the new copy, or sharing diff objects over a lock-free queue.
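
A small Python sketch of the share-the-new-copy pattern: one writer publishes immutable snapshots over a queue. (Python's queue.Queue is lock-based internally, so this illustrates the shape of the approach, not a genuinely lock-free implementation.)

```python
import queue
import threading

snapshots = queue.Queue()  # stands in for a lock-free channel

def writer():
    state = ()                     # an immutable tuple snapshot
    for x in range(3):
        state = state + (x,)       # build a *new* snapshot; the old one is unchanged
        snapshots.put(state)       # share the copy; no lock around the data itself

def reader(seen):
    for _ in range(3):
        seen.append(snapshots.get())  # each snapshot is safe to keep forever

seen = []
t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader, args=(seen,))
t1.start(); t2.start(); t1.join(); t2.join()
print(seen)  # [(0,), (0, 1), (0, 1, 2)]
```

Because every published value is immutable, the reader can never observe a half-applied mutation, which is the property the mutex was buying you.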


Sharing a new copy isn't always easy though. That is why lenses are a thing. Unfortunately lenses are (last I checked) still not quite easy to grow.

The point is neither copying the whole structure nor diffs are as easy as mutex + inner mutation.


Sharing the copy is not what's solved by lenses. In a certain sense lenses are just a way to try to do away with the boilerplate of updating nested immutable data structures. It is very easy to write this boilerplate (in the sense that it is straightforward and hard to mess up). It's just mind-numbingly tedious.

Although I don't know what you mean by growing lenses.

Mutex + inner mutation is no easier than CAS (which is the usual solution with concurrent writing of immutable data structures) a la Java AtomicReference or STM (another popular one) and in my opinion significantly harder as soon as you have multiple mutexes.


CAS is a massive pain in the ass for complex changes. Lenses are a way to point to part of the object, which can in theory make it less painful, as you can redo a change without necessarily redoing the work to set up the change. "Growing" just refers to growing a codebase that uses them by adding more of them, is all.

I am not against FP I think the point of "it is good if you can fix performance" is very accurate. I just think it is also important to acknowledge that the paradigm can be more complex in situations where it is supposed to help leading to a mixed bag.

Similar to distributed databases. Your database can now be phenomenally powerful but you can't do a stored proc anymore without losing that power.

It can be a very effective and totally worth it but it isn't a pure win necessarily.


> CAS is a massive pain in the ass for complex changes.

Why is that?


What are you CASing? If the object is too large you will have contention to the point that you might as well be single threaded. If the object is too small you now have to CAS multiple things which is far from trivial.

CAS is a primitive, it isn't complex itself but it can be complex to work with once you are talking non trivial work. Just like locks. A global lock is dumb simple but a real locking system can be as complex as you will let it.


> If the object is too large you will have contention to the point that you might as well be single threaded. If the object is too small you now have to CAS multiple things which is far from trivial.

Right but both of those things are true for locks too right? CAS seems no harder than locks.


Retry logic is hidden for locks while in your face for CAS.

Additionally while multiple locks is annoying it is way easier than multiple CAS. "Have a global order for locks" is the hard but solvable problem for multiple locks. For CAS if you need to CAS two dependent things you... I don't know it depends.


> Retry logic is hidden for locks while in your face for CAS.

It's hidden in both cases (usually CAS instructions are hidden behind an `update` interface as in Java's Atomic* family rather than directly used in the same way that generally a lock is an interface for a TAS instruction/spin lock + upgrading to wait queue).
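
As a sketch of how an update interface hides the retry loop (CPython exposes no hardware CAS, so compare_and_set below is emulated with a lock purely for illustration; the class is modeled loosely on Java's AtomicReference but is not its API):

```python
import threading

class AtomicRef:
    """AtomicReference-style interface; the CAS itself is emulated with a
    lock here because CPython exposes no hardware compare-and-swap."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._lock:               # stands in for a single CAS instruction
            if self._value is expected:
                self._value = new
                return True
            return False

    def update(self, f):
        """The retry loop lives here, behind the interface, not at call sites."""
        while True:
            cur = self.get()
            if self.compare_and_set(cur, f(cur)):
                return

ref = AtomicRef(0)
threads = [threading.Thread(target=lambda: [ref.update(lambda x: x + 1)
                                            for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(ref.get())  # 4000
```

Callers just say `ref.update(f)`, the same way callers of a lock just say "acquire"; in both cases the contention handling is someone else's problem.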

> For CAS if you need to CAS two dependent things you... I don't know it depends.

You use nested atomic references. Just like locks it's not a great way of doing things, and an analogous problem to ordering locks rears its head (by virtue of nesting you cannot mess up ordering in the strict sense, but you can accidentally "cross the boundaries" of two atomics in an update that goes against the nesting order), but it's doable in the same way as locks.

The usual CAS-like but better approach is STM (which is where immutability really shines).

I'm still not seeing how CAS is any harder than locks.


IMO the hard part of immutable structures is mutation (solved by https://lib.rs/crates/im), the hard part of diffing is writing one "command" subclass per type of operation (which can be more or less manageable), and the hard part of mutexes is remembering to lock the mutex or rwlock in every single access (Rust fixes this), avoiding priority inversions (Rust doesn't help), and avoiding deadlocks (Rust doesn't help).


This is how clojure does it: https://clojure.org/reference/atoms

It's very costly though.


Is it particularly costly compared to locks?

Under the hood it's effectively a single CAS instruction that loops on failure (which only occurs under contention, but then you have waiting with locks too).


I don't have time (nor the endurance) for a 90 minute run, so could you combine the two? I often do a 14 hour fast and then go for a 20 minute run combined with some calisthenics on my lunchbreak before eating. How can you tell your body is in a ketogenic state?

I can definitely say the first few times I exerted a ton of energy after not eating for 12+ hours I felt like I had the flu for a few hours, but now I don't really get that.


They sell test strips that supposedly detect ketone bodies in urine although I don't know if they're reliable.


Personally, my breath starts smelling like shit when I enter a ketogenic state.


Would love to get your take on whether Python is popular in large organizations because of the existing libraries or because of the language aesthetic itself.

To an outsider who knows a bit of Python because of Airflow and Spark, it seems Python has become a popular metaprogramming language that various disparate ecosystems have all adopted.

PySpark is Python but kind of its own language, and much of the processing happens outside the Python runtime. I think you could say the same for people doing NumPy or Pandas.

TensorFlow probably even more so, where I believe Python is the most popular interface, but you're really just writing a program that runs somewhere else. Again, this is conceptually the same with Airflow, though I believe the runtime is Python.

If my hypothesis is overall correct, and that a majority (or large share) of python programmers aren't sharing the same ecosystem, libraries and packages, then shifting to a different runtime or language that just encapsulates a subset of the language is much less challenging.

This is the opposite of the JVM's problem, where no one really likes Java the language, so everyone tries to create enjoyable (and productive) languages for the JVM to keep using the libraries and the runtime optimizations.


This isn’t legal. A corporation that leases someone a car and a house is paying them a taxable income, regardless of it not being a cash income.


When I would commute from Orange County to Santa Clara, the flight was often ~40 minutes. Combined with takeoff and landing, it didn't make a lot of sense to pull out an additional device, but it was easy enough to watch a short TV show on my phone.


Depends on the language (I don't use it for Java), but for Scala and Python I definitely prefer it. I'd definitely recommend using tmux with Vim; I've mapped my tmux navigation to mirror Vim's window navigation, and I can keep a terminal and REPL open and easily copy-paste between Vim and my tmux windows.


It might only be wrong for programs that have never existed and never will exist.


Sure, that's possible, but the opposite is also possible: it might be wrong for most programs we actually write. Well, to be fair, there is some upper bound for any program running on a real CPU.


>Sure, that's possible, but the opposite is also possible: it might be wrong for most programs we actually write.

We could confirm this though! It's not like we can't find out if a given program halts or is inconsistent. Gödel talks about it in his letter to von Neumann.


There are programs for which we can check this, but there is no general procedure to check if any program halts. Even ignoring the halting problem itself, say we analyze a program and realize it halts iff P=NP, or pi to the e is transcendental, or pi's decimal expansion at position Graham's number is divisible by 3. Will that program halt? It might be very hard to say.

More promisingly, there are ways to construct programs such they will halt, using total languages (though not every problem can be solved with such a limitation).


We can write a general procedure to see if any given program halts within n steps, however.
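
A bounded check like that is perfectly decidable. A toy Python sketch, modeling a "program" as a generator that yields once per step (the two example programs are invented for illustration):

```python
def halts_within(program, n):
    """Run a step-function `program` (a generator factory) for at most n steps.
    Returns True if it finishes within the budget, False otherwise.
    This bounded question is decidable; whether it *ever* halts is not."""
    it = program()
    for _ in range(n):
        try:
            next(it)
        except StopIteration:
            return True
    return False

def terminates():        # halts after 3 steps
    for _ in range(3):
        yield

def loops_forever():     # never halts
    while True:
        yield

print(halts_within(terminates, 10))     # True
print(halts_within(loops_forever, 10))  # False
```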


>Only one thread actually mutates anything, and it's the same one over and over

Sounds great in theory and usually results in a fantastic average case, but you get a hot partition and suddenly you can't work-share and things go south. It's not a perfect solution.


If you have a single serialized+synchronous business context in which activities must be processed, you literally have no other option than to use a single physical core if you want to go fast.

The trick is to minimize the raw amount of bytes that must enter into that synchronous context. Maybe the account disclosure PDFs can be referred to by some GUID token in AWS S3, but the actual decimal account balance/transaction facts should be included in the event data stream.

One other option is to talk to the business and see if you can break their 1 gigantic synchronous context into multiple smaller ones that can progress independently.


There is no working around Amdahls law unless you redesign your app.


I think it's important to point out that you can do calculus and arithmetic and only ever construct proofs of provable statements. The possibility of deriving unprovable but true statements in an axiomatic system doesn't mean they're common, or that you can't prove all the statements one has encountered thus far.


> The day rates look high, but they have to pay self-employment taxes and health insurance

Most SWEs doing 'contract work' aren't actually 1099 contractors; what it means is they work for an intermediary and are then 'contracted out' to a big company with a recognizable name (Google/BofA/Verizon/IBM). These jobs are W2, so you don't pay self-employment tax, they usually have mediocre insurance options, and you lose your job as soon as the work dries up.

It's simply a way for a company to staff up quickly without going through as strict of a hiring committee and without any obligation to keep said staff around if plans change.

It's often easier for a department to get a req for a senior/staff contractor (and make salary exceptions) than for an FTE, so what happens is departments will offer 200, 250 an hour for temp work but are only able to hire engineers at 150k a year because their 'employee' req is slotted lower.

