In case someone is looking for historical weather data for ML training and prediction, I created an open-source weather API which continuously archives weather data.
Past and forecast data from multiple numerical weather models can be combined with ML to achieve better forecast skill than any individual model. Because each input model is bound by physics, the resulting ML model should be stable.
I just quit photographing weddings (and other stuff) this year. It's a job where the forecast really impacts you, so you tend to pay attention.
The number of brides I've had to calm down when rain was forecast for their day is pretty high. In my experience, in my region, precipitation forecasts more than 3 days out are worthless except for when it's supposed to rain for several days straight. Temperature/wind is better but it can still swing one way or the other significantly.
For other types of shoots I'd tell people that ideally we'd postpone on the day of, and only to start worrying about it the day before the shoot.
I'm in Minnesota, so our weather is quite a bit more dynamic than many regions, for what it's worth.
They're very cautious about naming a "best" model though!
> Weather forecasting is a multi-faceted problem with a variety of use cases. No single metric fits all those use cases. Therefore, it is important to look at a number of different metrics and consider how the forecast will be applied.
I would like to see an independent forecast comparison tool similar to Forecast Advisor, but one that evaluates numerical weather models directly. However, getting reliable ground-truth data on a global scale is a challenge.
Since Open-Meteo continuously downloads every weather model run, the resulting time series closely resembles assimilated gridded data. GraphCast relies on the same data to initialize each weather model run. By comparing past forecasts to future assimilated data, we can assess how much a weather model deviates from the "truth," eliminating the need for weather station data for comparison. This same principle is also applied to validate GraphCast.
Moreover, storing past weather model runs can enhance forecasts. For instance, if a weather model consistently predicts high temperatures for a specific large-scale weather pattern, a machine learning model (or a simple multilinear regression) can be trained to mitigate such biases. This improvement can be done for a single location with minimal computational effort.
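To make that concrete, here is a minimal sketch of such a per-location correction on synthetic data, using scikit-learn's plain linear regression. The feature choice (raw forecast plus hour-of-day harmonics) is just an illustration, not what Open-Meteo does internally.

```python
# Minimal sketch of per-location bias correction on synthetic data.
# Assumes you already have an archived forecast series and a later
# "assimilated"/measured series for the same grid point.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 24 * 365                                       # one year of hourly values
truth = 10 + 8 * np.sin(np.arange(n) * 2 * np.pi / (24 * 365))
forecast = truth + 1.5 + rng.normal(0.0, 1.0, n)   # model runs ~1.5 degrees too warm

# Features: the raw forecast plus hour-of-day harmonics, so the regression
# can also pick up a diurnal bias pattern. More predictors are easy to add.
hour = np.arange(n) % 24
X = np.column_stack([
    forecast,
    np.sin(hour / 24 * 2 * np.pi),
    np.cos(hour / 24 * 2 * np.pi),
])

model = LinearRegression().fit(X, truth)
corrected = model.predict(X)

print(f"raw bias:       {(forecast - truth).mean():+.2f}")
print(f"corrected bias: {(corrected - truth).mean():+.2f}")
```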
How did you handle missing data? I’ve used NOAA data a few times and I’m always surprised at how many days of historical data are missing. They have also stopped recording in certain locations and then started in new locations over time, making it hard to get solid historical weather information.
It can take up to 10 min to generate a report - I had a spinner before but people just left the page. So I implemented a way to send it to them instead. I’ve never used the emails for anything else than that. Try it with a 10 min disposable email address if you like. Thanks for your feedback!
Ok, seems like your UI is not coming from a place of malice. However, springing an email input form on users at the final step is a very widespread UI dark pattern, so if nothing else, please let people know that you will ask for their email before they start interacting with your forms.
I see the limit for non-commercial use should be "less than 10.000 daily API calls". Technically 2 is less than 10.000, I know, but still I decided to drop you a comment. :)
I confirm, open-meteo is awesome and has a great API (and API playground!).
And it's the only source I know of that offers 2 weeks of hourly forecasts (I understand that at that point they are more likely to just show a general trend, but it still looks spectacular).
Thank you, I didn't know!
I'd love to, but I'd need another 24 hours in a day to also process the data - I'm glad I can build on a work of others and use the friendly APIs :).
This is awesome. I was trying to do a weather project a while ago, but couldn't find an API to suit my needs for the life of me. It looks like yours still doesn't have exactly everything I'd want but it still has plenty. Mainly UV index is something I've been trying to find wide historical data for, but it seems like it just might not be out there. I do see you have solar radiation, so I wonder if I could calculate it using that data. But I believe UV index also takes into account things like local air pollution and ozone forecast as well.
Both APIs use weather models from NOAA GFS and HRRR, providing accurate forecasts in North America. HRRR updates every hour, capturing recent showers and storms in the upcoming hours. PirateWeather gained popularity last year as a replacement for the Dark Sky API when Dark Sky servers were shut down.
With Open-Meteo, I'm working to integrate more weather models, offering access not only to current forecasts but also past data. For Europe and South-East Asia, high-resolution models from 7 different weather services improve forecast accuracy compared to global models. The data covers not only common weather variables like temperature, wind, and precipitation but also includes information on wind at higher altitudes, solar radiation forecasts, and soil properties.
Using custom compression methods, large historical weather datasets like ERA5 are compressed from 20 TB to 4 TB, making them accessible through a time-series API. All data is stored in local files; no database set-up required. If you're interested in creating your own weather API, Docker images are provided, and you can download open data from NOAA GFS or other weather models.
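As a quick example, pulling a slice of that archive looks roughly like this. Endpoint and parameter names follow the public documentation at the time of writing; check the API docs before relying on them.

```python
# Fetch one year of hourly 2 m temperature for Berlin from the historical
# weather API. Endpoint and parameter names follow the public documentation;
# verify them against https://open-meteo.com/ before building on this.
import requests

resp = requests.get(
    "https://archive-api.open-meteo.com/v1/archive",
    params={
        "latitude": 52.52,
        "longitude": 13.41,
        "start_date": "2022-01-01",
        "end_date": "2022-12-31",
        "hourly": "temperature_2m",
    },
    timeout=60,
)
resp.raise_for_status()
data = resp.json()
print(len(data["hourly"]["time"]), "hourly values returned")
```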
This is great. I am very curious about the architectural decisions you've taken here. Is there a blog post / article about them? 80 yrs of historical data -- are you storing that somewhere in PG and the APIs are just fetching it? If so, what indices have you set up to make APIs fetch faster etc. I just fetched 1960 to 2022 in about 12 secs.
Traditional database systems struggle to handle gridded data efficiently. Using PG with time-based indices is memory- and storage-intensive. It works well for a limited number of locations, but global weather models at 9-12 km resolution have 4 to 6 million grid-cells.
I am exploiting the homogeneity of gridded data. In a 2D field, calculating the data position for a geographical coordinate is straightforward. Once you add time as a third dimension, you can pick any timestamp at any point on earth. To optimize read speed, all time steps are stored sequentially on disk in a rotated/transposed OLAP cube.
Although the data now consists of millions of floating-point values without accompanying attributes like timestamps or geographical coordinates, the storage requirements are still high. Open-Meteo chunks data into small portions, each covering 10 locations and 2 weeks of data. Each block is individually compressed using an optimized compression scheme.
While this approach isn't groundbreaking and is supported by file formats like NetCDF, Zarr, or HDF5, the challenge lies in efficiently working with multiple weather models and updating data with each new weather model run every few hours.
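To make the layout idea concrete, here is a toy version of the addressing arithmetic. It is not the actual Open-Meteo file format, just the principle of turning coordinates into positions and chunk numbers; grid and chunk sizes are examples.

```python
# Toy illustration of the "transposed OLAP cube" idea: the full time series of
# each grid cell is stored contiguously, and cells are grouped into small
# chunks that are compressed independently. Grid and chunk sizes are examples.
NX, NY = 2878, 1441                  # example global grid (~4 million cells)
NT = 24 * 14                         # two weeks of hourly time steps
LOCATIONS_PER_CHUNK = 10

def location_index(x: int, y: int) -> int:
    """Flatten a 2D grid coordinate into a single location index."""
    return y * NX + x

def value_position(x: int, y: int, t: int) -> int:
    """Position of a single value when each location's time steps are sequential."""
    return location_index(x, y) * NT + t

def chunk_number(x: int, y: int) -> int:
    """Which compressed chunk holds this location's time series."""
    return location_index(x, y) // LOCATIONS_PER_CHUNK

# Reading two weeks of data for one coordinate touches a single chunk instead
# of scanning millions of indexed rows in a relational database.
print(value_position(100, 200, 12), chunk_number(100, 200))
```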
I always suspect that they don't tell me the actual temperature. Maybe I am totally wrong, but I suspect it. I need to get my own physical thermometer, not the digital one, in my room and outside my house, and keep a camera focused on it, so that later I can speed up the video and see how much the temperature varied the previous night.
this is really cool, I've been looking for good snow-related weather APIs for my business. I tried looking on the site, but how does it work, being coordinates-based?
I'm used to working with different weather stations, e.g. seeing different snowfall prediction at the bottom of a mountain, halfway up, and at the top, where the coordinates are quite similar.
You'll need a local weather expert to assist, as terrain, geography and other hyper-local factors create forecasting unpredictability. For example, Jay Peak in VT has its own weather, the road in has no snow, but it's a raging snowstorm on the mountain.
Extreme weather is predicted by numerical weather models. Correctly representing hurricanes has driven development of the NOAA GFS model for decades.
Open-Meteo focuses on providing access to weather data for single locations or small areas. If you look at data for coastal areas, forecast and past weather data will show severe winds. Storm tracks or maps are not available, but might be implemented in the future.
KML files for storm tracks are still the best way to go. You could calculate storm tracks yourself for other weather models like DWD ICON, ECMWF IFS or MeteoFrance ARPEGE, but storm tracks based on GFS ensembles are easy to use with sufficient accuracy.
Appreciate the response. Do you know of any services that provide what I described in the previous comments? I'm specifically interested in extreme weather conditions and their visual representation (hurricanes, tornadoes, hail, etc.) with API capabilities.
Go to:
nhc.noaa.gov/gis
There's a list of data and products with kmls and kmzs and geojsons and all sorts of stuff. I haven't actually used the API for retrieving these, but NOAA has a pretty solid track record with data dissemination.
I have heard the same regarding 5xx errors in the past couple of months. I am also working on an open-source weather API: https://open-meteo.com/. It covers most of WeatherKit's features and offers more flexibility. You can either use the public API endpoint or even consider hosting your own API endpoint.
Forecast quality should be comparable as the API uses open-data weather forecasts from the American weather service NOAA (GFS and HRRR models) with hourly updates. Depending on the region, weather models from other national weather services are used. Those open-data weather models are commonly used among the most popular weather APIs although without any attribution.
Hi, creator of open-meteo.com here! I am using a wider range of weather models to better cover Europe, Northern Africa and Asia. North America is covered as well with GFS+HRRR and even weather models from the Canadian weather service.
In contrast to PirateWeather, I am using compressed local files so API nodes can be run more easily without a huge AWS bill. Compression is especially important for large historical weather datasets like ERA5 or the 10 km version, ERA5-Land.
open-meteo.com looks awesome. I've been messing around writing a snow forecast app for skiing/snowboarding for a while now and the main thing I'm missing is historical snowfall data. Do these data sources exist in a machine readable format and I've just not been able to find them? If so, would you ever consider adding precip + kind of precip to your historical API?
Snowfall is already available in the historical weather API. Because the resolution of long-term weather reanalysis data is fairly limited, snow analysis for individual mountain slopes or peaks may not be that accurate.
If you only want to analyse the past few weeks to get the date of the last snowfall and how much powder might be there, use the forecast API and the "past_days" parameter to get a continuous time-series of past high-resolution weather forecasts.
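A rough sketch of that request, with variable and parameter names as documented on the site (verify them; the coordinates are just an example):

```python
# Pull a continuous series of recent snowfall: the forecast API with
# "past_days" returns past high-resolution forecasts plus the upcoming days
# in one time series. Variable names follow the public docs; double-check them.
import requests

resp = requests.get(
    "https://api.open-meteo.com/v1/forecast",
    params={
        "latitude": 46.43,                       # example coordinates near a ski area
        "longitude": 6.94,
        "hourly": "snowfall,snow_depth,temperature_2m",
        "past_days": 14,
    },
    timeout=30,
)
resp.raise_for_status()
hourly = resp.json()["hourly"]

# Most recent hour with snowfall > 0; note the series also contains forecast
# hours, so filter by the current time if you only want what already fell.
last_snow = max(
    (t for t, s in zip(hourly["time"], hourly["snowfall"]) if s and s > 0),
    default=None,
)
print("last snowfall:", last_snow)
```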
Open-Meteo has offered free weather APIs for a while now. Archiving data was not an option, because forecast data alone already required 300 GB of storage.
In the past couple of weeks, I started to look at fast and efficient compression algorithms like zstd, brotli or lz4. All of them performed rather poorly on time-series weather data.
After a lot of trial and error, I found a couple of pre-processing steps that improve the compression ratio a lot (a small sketch follows the list):
1) Scaling data to reasonable values. Temperature has an accuracy of 0.1° at best. I simply round everything to 0.05 instead of keeping the highest possible floating point precision.
2) A temperature time-series increases and decreases by small values. 0.4° warmer, then 0.2° colder. Only storing deltas improves compression performance.
3) Data are highly spatially correlated. If the temperature is rising in one "grid-cell", it is rising in the neighbouring grid-cells as well. Simply subtract one grid-cell's time-series from the next grid-cell's. This step especially yielded a large boost.
4) Although zstd performs quite well with this encoded data, other integer compression algorithms have far better compression and decompression speeds. Namely I am using FastPFor.
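Here is a minimal sketch of steps 1 and 2 on a synthetic temperature series. zlib only stands in to show the effect on compressed size; the real setup feeds the integer output to FastPFor instead.

```python
# Pre-processing sketch: quantise to fixed-point integers, then delta-encode.
# zlib is used as a stand-in compressor to show the size effect; the actual
# pipeline uses FastPFor on the integers instead.
import zlib
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
temps = 12 + 8 * np.sin(hours * 2 * np.pi / 24) + rng.normal(0, 0.3, hours.size)

# Step 1: round to 0.05 degree steps and store as small integers (units of 1/20 degree).
scaled = np.round(temps / 0.05).astype(np.int32)

# Step 2: only store the difference to the previous value.
deltas = np.diff(scaled, prepend=scaled[:1])

raw       = zlib.compress(temps.astype(np.float32).tobytes(), 9)
quantised = zlib.compress(scaled.tobytes(), 9)
encoded   = zlib.compress(deltas.tobytes(), 9)
print(len(raw), len(quantised), len(encoded))   # sizes typically shrink at each step
```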
With that compression approach, an archive became possible. One week of weather forecast data should be around 10 GB compressed. With that, I can easily maintain a very long archive.
Amazing you were able to get the data from 300 GB down to 10 GB, impressive!
Radar data: I find the most obvious gaps in predicting short-term weather are related to radar data. Obviously radar datasets would require massive storage space, but I'm curious whether you have run across any free sources for archival radar data or APIs for real-time streams, or open-source code for scraping existing services' radar feeds.
`300 GB` to `10 GB` was a bit over-optimistic ;-) The 300 GB already included 3 weeks of data. `100 GB` to `10 GB` is a more realistic number.
Many weather variables like precipitation or pressure are very easy to compress. Variables like solar radiation are more dynamic and therefore less efficient to compress.
Getting radar data is horrible... In some countries like the US or Germany it is easy, but many other countries do not offer open-data radar access. For the time being, I will integrate more open datasets first.
I wonder if with bit-packing you could achieve an even higher ratio: each temperature has 3 digits and a range of roughly -51.2 to 51.2, so with 1 sign bit and 9 value bits you could store 3 temperatures in a 32-bit integer. Deltas might consume even less range, though they might need an extra bit tweak. AFAIK FastPFor also does something similar with SIMD, but from what I understand, time is not your main concern.
Edit: just read the 0.4°/0.2° delta part. If I understand right, putting only 1 real temperature and then deltas could fit 12 values in the first int and 16 in the following ints? Quick napkin math, could be wrong :)
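For what it's worth, the 10-bit version of that napkin math does work out. A toy pack/unpack, purely illustrative and unrelated to the actual file format:

```python
# 0.1 degree steps over roughly -51.2 .. +51.1 need 10 bits per value
# (1024 levels), so three values fit into one 32-bit integer with 2 bits spare.
def pack3(a: float, b: float, c: float) -> int:
    packed = 0
    for shift, t in zip((0, 10, 20), (a, b, c)):
        v = round(t * 10) + 512              # shift -51.2..51.1 into 0..1023
        assert 0 <= v < 1024, "outside the assumed range"
        packed |= v << shift
    return packed

def unpack3(word: int) -> tuple:
    return tuple(round(((word >> s) & 0x3FF) / 10 - 51.2, 1) for s in (0, 10, 20))

w = pack3(-3.4, 21.7, 51.1)
print(hex(w), unpack3(w))                    # recovers (-3.4, 21.7, 51.1)
```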
Yes, it is a combination of delta coding, zigzag, bitpacking and outlier detection.
It only works well for integer compression. For text-based data, the results are not useful.
SIMD and decompression speed are important aspects. All forecast APIs use the compressed files as well. Previously I was using mmap'ed float16 grids, which were faster, but took significantly more space.
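The zigzag part in isolation is tiny; roughly the mapping below (a sketch, not the actual implementation), which turns small signed deltas into small unsigned integers that bit-pack well:

```python
# Zigzag maps small signed deltas (-1, 1, -2, 2, ...) onto small unsigned
# integers (1, 2, 3, 4, ...), which then need very few bits when packed.
def zigzag_encode(v: int) -> int:
    return (v << 1) ^ (v >> 63)              # assumes values fit in 64-bit signed range

def zigzag_decode(u: int) -> int:
    return (u >> 1) ^ -(u & 1)

deltas = [0, 4, -2, 1, -1, 3]
encoded = [zigzag_encode(d) for d in deltas]
print(encoded)                               # [0, 8, 3, 2, 1, 6] - all small, non-negative
assert [zigzag_decode(u) for u in encoded] == deltas
```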
Would flac work for compression? Given the weather data is a time series of numbers it could be represented as audio. It would then automatically do the difference encoding thing you’re doing.
If you encoded nearby grid cells as audio channels, flac would even handle the correlation like it does for stereo audio.
> 3) Data are highly spatially correlated. If the temperature is rising in one "grid-cell", it is rising in the neighbouring grid-cells as well. Simply subtract one grid-cell's time-series from the next grid-cell's. This step especially yielded a large boost.
Sure. I bundle a small rectangle of neighbouring locations, like 5x5 (= 25 locations). The actual weather model may have a grid of 2878x1441 cells (4 million).
Inside the 5x5 chunk, I subtract all grid-cells from the center grid-cell. The borders will then contain only the difference to the center grid-cell.
Because the values of neighbouring grid-cells are similar, the resulting deltas are very small and better compressible.
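In code, that step is essentially a broadcasted subtraction. A small synthetic illustration of why the residuals stay small (not the real chunk code):

```python
# 5x5 spatial delta: subtract the centre cell's time series from every cell in
# the chunk, so the surrounding cells only need to store small residuals.
import numpy as np

rng = np.random.default_rng(1)
nt = 24 * 14                                     # two weeks of hourly steps
base = 10 + 8 * np.sin(np.arange(nt) * 2 * np.pi / 24)

# Neighbouring cells share the same weather plus a little local variation.
chunk = base + rng.normal(0, 0.2, (5, 5, nt))

centre = chunk[2, 2]
residuals = chunk - centre                       # broadcasts over all 25 cells

# Store the centre series unchanged and only the small residuals for the rest.
print("typical magnitude, raw values:", round(float(np.abs(chunk).mean()), 2))
print("typical magnitude, residuals :", round(float(np.abs(residuals).mean()), 2))
```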
(1) Maybe it’s just me, but the “current jobs” are only available in German, if you switch to English, Spanish or French — the page gets translated, but the three “current jobs” drop down lists get removed; super confusing, since it gets reset to German if you click “current jobs” from any of the other pages;
(2) HN is an English site, would be nice if you were linking to the English page, not German;
(3) if you’re affiliated with the company, which I believe you are, you should say so and noting it in your profile with contact information would be nice too.
(4) Reminder that HN has free job postings every month if you are affiliated with the company:
I have not. It looks promising, as it seems to offer multi-dimensional data storage and some compression features.
I must also admit, that I like my simple approach of just keeping data in compressed local files. With fast SSDs it is super easy to scale and fault tolerant. Nodes can just `rsync` data to keep up to date.
In the past I used InfluxDB, TimescaleDB and ClickHouseDB. They also offer good solutions for time-series data, but add a lot of maintenance overhead.
Thanks for your response. I have no affiliation, it just piqued my interest.
> I must also admit, that I like my simple approach of just keeping data in compressed local files. With fast SSDs it is super easy to scale and fault tolerant. Nodes can just `rsync` data to keep up to date.
Actually, they are historical weather forecasts, but assembled into a continuous time-series.
Storing each weather forecast individually, to enable a performance evaluation of "how good is a forecast 5 days out", would require a lot of storage. Some local weather models update every 6 hours.
But even with a continuous time-series, you can already tell how good or bad a forecast is compared to measurements. Assuming your measurements are correct ;-)
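A minimal sketch of that comparison, assuming you already have your own hourly measurements keyed by timestamp:

```python
# Compare an archived forecast series against your own measurements by
# aligning timestamps and computing the mean absolute error. The measurement
# data here is assumed to come from your own station.
import numpy as np

def mean_absolute_error(forecast: dict, measured: dict) -> float:
    """Both dicts map ISO timestamps to values; only shared hours are compared."""
    common = sorted(set(forecast) & set(measured))
    f = np.array([forecast[t] for t in common])
    m = np.array([measured[t] for t in common])
    return float(np.abs(f - m).mean())

forecast = {"2024-01-01T00:00": 2.1, "2024-01-01T01:00": 1.8}
measured = {"2024-01-01T00:00": 2.4, "2024-01-01T01:00": 1.5}
print(round(mean_absolute_error(forecast, measured), 2))   # 0.3
```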
This would still be a remarkable dataset for learning. And worth the storage. Though it might need other inputs as well (like pressure zone etc.) to escape potential biases.
Past and forecast data from multiple numerical weather models can be combined with ML to achieve better forecast skill than any individual model. Because each input model is bound by physics, the resulting ML model should be stable.
See: https://open-meteo.com