GCP is pretty pleasant overall. The API, command line, and UI are all solid, and the vast majority of functionality is supported in all three places. Like every provider, they have a few services that suck, but overall Google has done a great job on developer experience.
There are several use cases where ML can help even if it isn't perfect, or is even just barely better than random. Here is one example from NLP/search.
Let's say you have a product search engine and you analyzed the logged queries. What you find is a very long tail of queries that are only searched once or twice. In most cases, the queries are either misspellings, synonyms that aren't in the product text, or long queries that describe the product with generic keywords. And the queries either return zero results or junk.
If you apply text classification for product category to these long-tail queries, the search results improve and will likely yield a boost in sales, because users can finally find what they searched for. Even if the model is only 60% accurate, it still helps, because more queries return useful results than before. However, you don't apply a model with 60% accuracy to your top N queries, because it could ruin results that already work and reduce sales.
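As a rough sketch, pulling those candidate queries out of the logs is a simple aggregation; `query_log` and `result_count` are hypothetical names here, so adapt to your schema:

    -- Find long-tail queries that never returned results: searched at most
    -- twice and always came back empty. These are the candidates for the
    -- category classifier; the top queries are deliberately left alone.
    select query, count(*) as times_searched
    from query_log
    group by query
    having count(*) <= 2
       and max(result_count) = 0;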
Knowing when to use ML is just as important as improving its accuracy.
I am not against ML. I have built useful ML models.
I am against GPT-3.
For that matter, I was interested in AGI 7 years before it got ‘cool’. Back then I was called a crackpot; now I say the people at lesswrong are crackpots.
If you can write a SQL query, or a set of SQL queries, to do your transformation, then you can use DBT. DBT doesn't do the transformation itself; rather, it helps you manage all the dependencies between your SQL models. Whether you can use SQL depends on your data and your database/warehouse functionality. For example, JSON parsing support is pretty good now in many databases and warehouses. If your objects can be represented as JSON, you can write SQL via DBT to parse the objects into columns and tables.
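For example, a minimal sketch in Postgres-flavored SQL (the `raw_events` table and field names are made up, and jsonb operators differ across warehouses):

    -- Flatten a hypothetical raw_events(payload jsonb) table into typed columns.
    select
        payload->>'event_id'                    as event_id,
        (payload->>'occurred_at')::timestamptz  as occurred_at,
        payload->'user'->>'id'                  as user_id
    from raw_events;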
I think using a data warehouse as your data lake or lakehouse is optimal, even for data that isn't relational. Storage is now so cheap, and decoupled from compute costs for several providers, that I don't give it a second thought. You get a fast, scalable SQL interface, which is still nice and useful even for non-relational data. Then all, or most, of the transformations needed for analysis can be pure SQL using a tool like DBT. In my experience, it greatly simplifies the entire pipeline.
I don't get it... Looks to me like DBT is a Python SQL wrapper / big library that among other things includes an SQL generator / something else like that -- but not "pure" SQL?
DBT has two main innovations. First, everything is a SELECT statement and DBT handles all the DDL for you (you can still handle the DDL yourself if you have a special case). Second, the ref/source macros build a DAG of all your models, so you don't have to think about build order. There are other innovations, but those are the main ones.
You can give it truly pure SQL in both models and scripts, mixing in Jinja only when you need it for dynamic models. But I'd recommend at least using ref/source.
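For instance, a model file is just a SELECT, and ref/source are what place it in the DAG (the model and source names here are made up):

    -- models/orders_enriched.sql
    select
        o.order_id,
        o.amount,
        c.segment
    from {{ ref('stg_orders') }} as o             -- depends on another DBT model
    join {{ source('crm', 'customers') }} as c    -- depends on a declared raw source
      on c.customer_id = o.customer_id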
I guess it depends on what you mean by "simple". The algorithms are complex, but there are good tools that implement them. I would imagine smaller companies would use off-the-shelf tooling, and I would argue that is simpler. Vector embeddings are unbelievably powerful and often yield better results than classical methods when you pair one of the good tools with pretrained embeddings.
Specifically for search, I use them to completely replace stemming, synonyms, etc. in ES. I match the query's embedding to the document embeddings, find the top 1000 or so, then ask ES for the BM25 score for those 1000. I combine the embedding match score with BM25, recency, etc. for the final rank. The results are so much better than with stemming and friends, and it's overall simpler because I can use off-the-shelf tooling and the data pipeline is simpler.
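Roughly, the two-stage idea looks like this if you sketch it in Postgres + pgvector instead of ES, with ts_rank standing in for BM25 (stock Postgres doesn't have BM25); the table, dimensions, and weights are all illustrative:

    -- Stage 1: top-1000 nearest neighbors by embedding.
    -- Stage 2: blend the semantic score with a lexical score.
    -- docs(id, body, embedding vector(384)) is hypothetical; <=> is pgvector's
    -- cosine distance; :query_embedding / :query_text are bind placeholders.
    with candidates as (
        select id, body, 1 - (embedding <=> :query_embedding) as semantic_score
        from docs
        order by embedding <=> :query_embedding
        limit 1000
    )
    select id,
           0.7 * semantic_score
         + 0.3 * ts_rank(to_tsvector('english', body),
                         plainto_tsquery('english', :query_text)) as final_score
    from candidates
    order by final_score desc
    limit 20;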
> I match the query's embedding to the document embeddings,
I assume the doc size is relatively small? Otherwise a document may contain too many different topics, which makes it hard to differentiate between queries.
For my search use case, documents are mostly single-topic and less than 10 pages. However, I have found embeddings still work surprisingly well for longer documents with a few topics in them. But yes, multi-topic documents can certainly be an issue. Segmentation by sentence, paragraph, or page can help here. I believe there are ML-based topic segmentation algorithms too, but that certainly starts making it less simple.
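Paragraph segmentation can be as blunt as splitting on blank lines before embedding each chunk separately; a sketch against a hypothetical `docs` table:

    -- One row per paragraph, order preserved; embed each chunk on its own.
    select d.id, s.ord as paragraph_no, s.chunk
    from docs d,
         regexp_split_to_table(d.body, E'\n{2,}') with ordinality as s(chunk, ord);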
Yes, I know for sure. Postgres search is essentially an easier to use regex engine. If you have a recall-only use case and/or a small dataset, that works great. As soon as you need multiple languages, advanced autocomplete, misspelling detection, large documents, large datasets, custom scoring, etc., you need Solr or ES.
While I don't doubt that you know your use case and weighed/tried the options:
> Postgres search is essentially an easier to use regex engine.
I'm not sure exactly what you meant to convey here, but if you're searching with LIKE or `~`, you're not doing Postgres's proper Full Text Search. You should be dealing with tsvectors[0].
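For the record, the tsvector version looks something like this (`docs` is a stand-in table):

    -- Proper FTS: an expression GIN index plus a tsquery match, not LIKE/~.
    create index docs_fts_idx on docs using gin (to_tsvector('english', body));

    select id
    from docs
    where to_tsvector('english', body) @@ plainto_tsquery('english', 'running shoes');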
> As soon as you need multiple languages
Postgres FTS supports multiple languages and you can create your own configurations[1]
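The configuration is just a parameter, and you can list what your database has installed:

    -- Stemming and stop words follow the configuration you pass in.
    select to_tsvector('french', 'les chaussures de course');

    -- Text search configurations available in this database.
    select cfgname from pg_ts_config;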
> advanced autocomplete
I'm not sure what "advanced" autocomplete is, but you can get pretty fast trigram searches going[2] (back to LIKE/ILIKE here, but obviously this is an isolated use case). In the end, I'd expect autocomplete results to mostly not hit your DB anyway (maybe I'm naive, but that feels like a caching > cache invalidation > cache pushdown problem to me).
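A minimal trigram setup, with a made-up `products` table:

    -- pg_trgm lets a GIN index accelerate ILIKE and similarity matching.
    create extension if not exists pg_trgm;
    create index products_name_trgm on products using gin (name gin_trgm_ops);

    select name
    from products
    where name ilike '%runn%'
    order by similarity(name, 'runn') desc
    limit 10;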
> misspelling detection
The pg_similarity extension[3] might be of some help here, but it may require some wrangling.
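pg_trgm alone already gets you a crude version (pg_similarity adds more distance measures); `products` is again made up:

    -- % is pg_trgm's similarity operator; the cutoff is configurable via
    -- the pg_trgm.similarity_threshold setting.
    select name, similarity(name, 'snekers') as sim
    from products
    where name % 'snekers'    -- a misspelling of "sneakers"
    order by sim desc;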
> large documents, large datasets,
PG has TOAST[4] for large values, and can obviously scale (maybe not necessarily great at it) -- see pg_partman/Timescale/Citus/etc.
> custom scoring
Postgres only has basic ranking features[5], but you can of course write your own functions and extend it.
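e.g. blending ts_rank with a simple recency decay is plain SQL; the weights and the `published_at` column are illustrative:

    -- Lexical relevance plus a recency term; tune the constants to taste.
    select id,
           ts_rank(to_tsvector('english', body),
                   plainto_tsquery('english', 'running shoes'))
         + 0.1 / (1 + extract(epoch from now() - published_at) / 86400.0) as score
    from docs
    order by score desc
    limit 20;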
Solr/ES are definitely the right tools for the job (tm) when the job is search, but you can get surprisingly far with Postgres. I'd argue that many use cases don't actually want/need a perfect full text search solution -- it's often minor features that turn into overkill fests, with ops people learning/figuring out how to properly manage and scale an ES cluster and falling into pitfalls along the way.
I disagree based on the trends, especially in the US where few, if any, new coal plants are being built and most existing ones are scheduled to be converted to natural gas or shut down. But let's assume that a significant amount of coal will continue to be burned over the next 20 years. Do you think we should stop innovation in other sectors until we are off coal?
Most innovation doesn't need massive amounts of electricity. EVs are the only notable exception, but they also offset significant CO2 emissions.
Further, the amount of electricity generated from fossil fuels depends heavily on overall electricity demand. Keeping older, less efficient, and more expensive power plants operating is very much a response to electricity demand, not an inherent lifespan independent of the electricity market.
I grew up in Cincinnati. You are correct; however, traffic is bad enough now that a subway or light rail would make a big difference. Unfortunately, it is unlikely the voters would approve a tax increase to support that.
It's not that people wouldn't approve a tax increase to support it. The issue is that every politician in the state would attack a light rail project for the city.
The city managed to complete a streetcar after ten years, but it faced opposition from the mayor, the governor, and a federal congressman. The congressman in question even pushed a law that would prohibit any federal funding from going to the operation of the project. How fucked up is that? A congressman preventing federal funding from flowing into his own district.
Cincinnati in particular also suffers from a hodgepodge of local government entities that don't really get along. The bus system is run at the county level, but the city limits end before that. Anecdotally, a friend of mine claims that his family voted against the early-2000s light-rail plan on the basis that it would have resulted in the City of Deer Park being downgraded to a village.
But is it fair to compare light rail with subways? Seems to me light rail is the worst of all worlds: slow, low carrying capacity, subject to the same traffic as cars, and probably expensive.
The political opposition to the project had nothing to do with preferring a technically superior form of public transportation. Instead, it was opposition to public transportation in general.
Light rail works great downtown. In particular, slowness is a good thing for local businesses: fast travel, like cars, doesn't encourage people to come to your store, it encourages them to drive past it. This is why e.g. Melbourne has a free transit zone in the CBD.
I agree. If you wait until traffic is bad enough, transit might end up getting pushed -- but only to appease people who want to maintain their car-driven lifestyle. No one sitting in their car on the freeway thinks they're the "other" person who should be taking the bus or train. The next step is outright fighting any growth, if they haven't already.
We have no data for mRNA vaccines except those trials. We don't know if the historical data on traditional vaccines are applicable here. As such, it is also possible that a single dose isn't good enough, or that it is good enough but only for a couple of months. We just don't know and it would be way worse if it turns out we have to re-vaccinate everyone because we were impatient.
> We have no data for mRNA vaccines except those trials. We don't know if the historical data on traditional vaccines are applicable here.
This is not how decision making under uncertainty works. When you don't have rigorous proof of whether something works or doesn't, you have to make educated guesses based on various priors and make the percentage play. You can't refuse to incorporate priors into your decision making just because you don't have a peer-reviewed p<0.05 study validating them.
Uncertainty about the result cuts both ways. You can't claim that we can't do X over Y because we don't have rigorous proof that X is better than Y, when we don't have rigorous proof that Y is better than X either. Regardless of what you do, you're taking a leap of faith.
> We just don't know and it would be way worse if it turns out we have to re-vaccinate everyone because we were impatient.
If there's probability p that it doesn't work and we have to spend X extra months re-vaccinating everyone, that's an expected delay of p * X. But if it does work, then not pursuing the single-dose strategy delays the vaccination schedule by X' months.
If p * X < (1-p) * X', then the former is a perfectly acceptable risk.
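To make that concrete with made-up numbers: if p = 0.25 and X = X' = 6 months, the single-dose gamble costs 0.25 * 6 = 1.5 expected months versus 0.75 * 6 = 4.5 expected months for waiting, so it wins in expectation. The real disagreement is over the value of p.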
I personally try to make decisions using probability. I understand where you are coming from but there is one factor missing in your analysis: this situation is literally life and death. That changes the math a bit to something more akin to "better safe than sorry" in my opinion. We have found a guaranteed path out of this mess. There may be other faster paths that save more lives, but it could also end up killing millions more too. I'm all for experimenting but I take issue with making the experiment the policy when lives are on the line.
We DO have a good understanding of the human immune system, and of how the mRNA vaccines interact with it. The effectiveness shown in the trials confirms those theories.
Sure, there could be some odd unforeseen effect. If so we'd learn as we go.
> it would be way worse if it turns out we have to re-vaccinate everyone
Worse than 3000 people dying every day? Because of the extra expense of making more doses?
> We DO have a good understanding of the human immune system...
As someone who is an active researcher in this field, I can tell you that this is just not true. The immune system is VERY complex and we know very little about it. We have only had the tools to begin to systematically probe it for a few years.
I know what you're getting at in this context, but in reality, when you start digging into how the immune system works, it's shocking how little is understood about it.
"There’s a joke about immunology, which Jessica Metcalf of Princeton recently told me. An immunologist and a cardiologist are kidnapped. The kidnappers threaten to shoot one of them, but promise to spare whoever has made the greater contribution to humanity.
The cardiologist says, “Well, I’ve identified drugs that have saved the lives of millions of people.” Impressed, the kidnappers turn to the immunologist. “What have you done?” they ask.
The immunologist says, “The thing is, the immune system is very complicated …”"