I don't know all the details; perhaps one of their employees on here can provide a better, non-marketing-spin explanation (please no "we're building an intelligence operating system" nonsense again).
My understanding is that when they deploy, they build a pretty sophisticated ontology for the customer. It's not a comprehensive ontology, but one that's domain-specific to that customer -- which actually has the potential to work really well. So for example, the FBI would get one that targets law enforcement (people, places, addresses, license plates, violations, prior convictions, gangs, etc.), while the Army might get one that's focused around military operations (people, places, rank, equipment, unit, etc.).
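To make that concrete, my mental model of one of these per-customer ontologies is basically just a set of entity types with attributes, plus the relationships allowed between them. This is purely my own sketch -- every class and field name here is made up by me, not Palantir's actual schema:

```java
// Hypothetical sketch of a domain-specific ontology: entity types,
// their attributes, and the relationships allowed between them.
import java.util.*;

class EntityType {
    final String name;              // e.g. "Person", "Vehicle", "Gang"
    final Set<String> attributes;   // e.g. "name", "priorConvictions"
    EntityType(String name, String... attrs) {
        this.name = name;
        this.attributes = new LinkedHashSet<>(Arrays.asList(attrs));
    }
}

class RelationshipType {
    final String name, fromType, toType;   // e.g. MEMBER_OF: Person -> Gang
    RelationshipType(String name, String from, String to) {
        this.name = name; this.fromType = from; this.toType = to;
    }
}

class Ontology {
    final Map<String, EntityType> entityTypes = new LinkedHashMap<>();
    final List<RelationshipType> relationshipTypes = new ArrayList<>();

    void addEntityType(EntityType t) { entityTypes.put(t.name, t); }
    void addRelationshipType(RelationshipType r) { relationshipTypes.add(r); }

    // The ontology is mutable: new attributes can be bolted on later
    // (this matters for the "place of birth" example further down).
    void addAttribute(String typeName, String attribute) {
        entityTypes.get(typeName).attributes.add(attribute);
    }
}

class LawEnforcementOntologyDemo {
    public static void main(String[] args) {
        Ontology le = new Ontology();
        le.addEntityType(new EntityType("Person", "name", "aliases", "priorConvictions"));
        le.addEntityType(new EntityType("Vehicle", "licensePlate", "make", "model"));
        le.addEntityType(new EntityType("Gang", "name", "territory"));
        le.addRelationshipType(new RelationshipType("MEMBER_OF", "Person", "Gang"));
        le.addRelationshipType(new RelationshipType("REGISTERED_TO", "Vehicle", "Person"));
    }
}
```

The Army version would be the same machinery with different types plugged in (Unit, Rank, Equipment, etc.).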
They then write custom Java software that connects one of the customer's databases to theirs, and probably either mirrors it and links it all to some metadata, or, if it's federated, generates a pile of metadata with pointers/links back to the original data. The metadata itself can actually be very comprehensive. For example, assume the original data is a news report: that report might have everything from tagged people and their attributes to photos of some of the people, videos, maps, whatever (actually I think they just store a list of entity IDs and then store the rest of the junk oriented around the entities, but it's not that important). But basically it's a custom ETL tool for each data source.
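I'd guess the shape of one of those per-source connectors is roughly this: pull records out of the customer system and emit metadata records that point back at the original row rather than copying the raw data wholesale. Again, all of the names here (MetadataRecord, PhoneBookConnector, the table layout) are my invention, not anything Palantir actually ships:

```java
// Rough sketch of a per-source ETL connector: read records from a
// customer database and emit metadata that keeps a pointer back to
// the source record instead of duplicating the raw data.
import java.sql.*;
import java.util.*;

class MetadataRecord {
    String sourceSystem;     // e.g. "county_phone_book"
    String sourceRecordId;   // pointer/link back to the original row
    String entityType;       // the ontology type this record was mapped to
    Map<String, String> attributes = new HashMap<>();
}

class PhoneBookConnector {
    List<MetadataRecord> extract(Connection db) throws SQLException {
        List<MetadataRecord> out = new ArrayList<>();
        try (Statement st = db.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT id, name, phone, address FROM listings")) {
            while (rs.next()) {
                MetadataRecord m = new MetadataRecord();
                m.sourceSystem = "county_phone_book";
                m.sourceRecordId = rs.getString("id");  // federated: keep a pointer, not a copy
                m.entityType = "Person";
                m.attributes.put("name", rs.getString("name"));
                m.attributes.put("phoneNumber", rs.getString("phone"));
                m.attributes.put("address", rs.getString("address"));
                out.add(m);
            }
        }
        return out;
    }
}
```

One connector like this per data source, each one hand-tuned to that source's schema -- which is why the integrations are expensive.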
I believe it's in that code that they also tell the Palantir enterprise backend how to map individual fields from the incoming data to entity types in the ontology. Assume instead of news articles you are connecting to a phone book. You have to map names to people, numbers to phone numbers, and addresses to locations (or whatever they're called in that particular instance of the ontology at that site), etc. If it's the yellow pages, you can map names to businesses instead of people, and so on. The ontology itself is mutable, so they can decide post-facto that they would like to add a new attribute to a person, say "place of birth". So if their knowledge base has phone books AND birth certificates, when you go to inspect an entity, like a person, it'll retrieve the place of birth and build you a nice dossier of that person.
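My guess is that the field-to-ontology mapping ends up as declarative config living alongside that connector code, so the same kind of source can be pointed at different entity types depending on the site. A minimal sketch, with invented names, assuming the mapping really is just "source column -> ontology attribute":

```java
// Hypothetical field -> ontology mappings for a few sources. The same
// kind of source (a directory listing) can map to different entity
// types depending on that site's ontology.
import java.util.*;

class FieldMapping {
    final String entityType;                     // target type in this site's ontology
    final Map<String, String> fieldToAttribute;  // source column -> ontology attribute
    FieldMapping(String entityType, Map<String, String> fields) {
        this.entityType = entityType;
        this.fieldToAttribute = fields;
    }
}

class MappingConfig {
    static final FieldMapping WHITE_PAGES = new FieldMapping("Person",
            Map.of("name", "name", "phone", "phoneNumber", "address", "location"));

    // Same columns, different target type for the yellow pages.
    static final FieldMapping YELLOW_PAGES = new FieldMapping("Business",
            Map.of("name", "businessName", "phone", "phoneNumber", "address", "location"));

    // Post-facto ontology change: birth certificates populate a new
    // Person attribute ("placeOfBirth") that older sources never fill in.
    static final FieldMapping BIRTH_CERTIFICATES = new FieldMapping("Person",
            Map.of("child_name", "name", "pob", "placeOfBirth", "dob", "dateOfBirth"));
}
```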
In cases where they're consuming unstructured data (the lion's share of government reports), there are no fields to map to the ontology. So analysts have to sift through each report and do the mapping with the front-end tool while they are conducting analysis. I know of at least one site that's in the process of hiring a bunch of low-paid data-entry people simply to go through and do this tag-and-map operation on their reports.
If, ahead of time, the reports have been run through an enterprise named-entity extractor, they can leverage that output to populate the knowledge base. But in practice, quality is low, and named-entity extractors tend not to do a good job of determining the different subclasses of entities. For example, you'll get a giant list of people from a document, but no indication of whether a particular person is a scientist, a politician or a terrorist (and all three of those categories might be entity types in the ontology for that site). In addition, the key factor here is the relationships between entities and between entities and their attributes, and most named-entity extractors do an even poorer job of determining that than of just finding the entities. So the default at most sites is to have shifts of analysts manually tagging and mapping documents.
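To illustrate the gap: a typical extractor hands you coarse mentions like ("John Smith", PERSON), while the ontology wants a subclass plus relationships, and that last step is the part a human currently has to supply. A toy sketch (my own names, not any real NER library's API):

```java
// The gap between NER output and what the ontology wants: coarse spans
// come out of the extractor, but subclassing and relationships have to
// be filled in by an analyst.
import java.util.*;

class NerMention {
    final String text;        // "John Smith"
    final String coarseType;  // "PERSON", "LOCATION", "ORG" -- that's all you get
    NerMention(String text, String coarseType) { this.text = text; this.coarseType = coarseType; }
}

class ResolvedEntity {
    String ontologyType;                                    // Scientist? Politician? Terrorist? NER can't say.
    final Map<String, String> attributes = new HashMap<>();
    final List<String> relationships = new ArrayList<>();   // e.g. "MEMBER_OF -> some Gang"
}

class ManualTagging {
    // The "shift of analysts" step: promote a coarse mention into a typed
    // entity. The subclass choice is a human judgment call.
    ResolvedEntity resolve(NerMention mention, String analystChosenType) {
        ResolvedEntity e = new ResolvedEntity();
        e.ontologyType = analystChosenType;
        e.attributes.put("name", mention.text);
        return e;
    }
}
```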
On the front-end, when you search for a person in the little search box, you not only get documents that name might appear in, but you can also call up a dossier on them with all the various little attributes and other bits and pieces filled in for you. Imagine a police record like you see on TV, with the person's photos and other facts and figures and such -- except that it's generated dynamically based on all the tagged and mapped metadata, and can be composed of data from several different sources at once, like a phone book, a DMV database and a million news reports.
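Presumably the dossier view is just a merge over every metadata record that resolved to the same entity ID, with provenance kept so you can see which source each fact came from. A minimal sketch of that idea (DossierBuilder and friends are my invention):

```java
// Sketch of dossier assembly: collect every fact keyed to the same
// entity ID across sources, and render them with their provenance.
import java.util.*;

class Fact {
    final String attribute, value, source;
    Fact(String attribute, String value, String source) {
        this.attribute = attribute; this.value = value; this.source = source;
    }
}

class DossierBuilder {
    // entityId -> all facts about that entity, across all sources
    private final Map<String, List<Fact>> factsByEntity = new HashMap<>();

    void addFact(String entityId, Fact f) {
        factsByEntity.computeIfAbsent(entityId, k -> new ArrayList<>()).add(f);
    }

    // The "police record" view: one attribute per line, with its source.
    String buildDossier(String entityId) {
        StringBuilder sb = new StringBuilder("Dossier for " + entityId + "\n");
        for (Fact f : factsByEntity.getOrDefault(entityId, List.of())) {
            sb.append("  ").append(f.attribute).append(": ").append(f.value)
              .append("   [from ").append(f.source).append("]\n");
        }
        return sb.toString();
    }
}
```

The interesting part is that the same builder doesn't care whether a fact came from a structured connector or from an analyst's manual tag -- once it's mapped to the ontology, it all merges into the same view.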