
Seriously, a log-log plot? Even there, the dispersion around your correlation is seriously wide. And then look at the same data on a linear plot.


The linear plot in that Wolfram link is messed up. It doesn't show all the data (caps out at 800 billion GDP). Here's a corrected linear plot, from the script that I linked (commenting out the log-log scaling):

https://ibb.co/9bBgwH8

There is clearly a correlation, even on linear. It's a little messy, but it's undeniably there.

The starting point for this discussion was the relationship between a country's size and population and its power and influence. The correlation between area and GDP demonstrates that there is a meaningful relationship.

Btw, what is your specific complaint about a log-log plot? Country data points for area and GDP span many orders of magnitude, which makes it harder to visualize any patterns on a linear plot.
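To make the orders-of-magnitude point concrete, here is a minimal sketch of where points land along a linear axis versus a log axis. The `areas_km2` values and both helper functions are made up for illustration; this is not the real country data.

```python
import math

# Hypothetical values spanning several orders of magnitude (NOT real country data).
areas_km2 = [2.0, 300.0, 45_000.0, 1_200_000.0, 17_000_000.0]

def linear_positions(values):
    """Fraction of the axis length at which each value lands on a linear axis."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_positions(values):
    """Same as above, but on a log10 axis (all values must be positive)."""
    logs = [math.log10(v) for v in values]
    lo, hi = min(logs), max(logs)
    return [(x - lo) / (hi - lo) for x in logs]

lin = linear_positions(areas_km2)
log = log_positions(areas_km2)

# On the linear axis most points are squeezed against the left edge;
# on the log axis they are spread across the whole range.
for value, lin_pos, log_pos in zip(areas_km2, lin, log):
    print(f"{value:>12,.0f}  linear={lin_pos:.4f}  log={log_pos:.4f}")
```

With values like these, every point except the largest sits in the first tenth of the linear axis, which is why patterns among smaller countries are hard to see without log scaling.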

I also don't understand your point about the dispersion. The correlation and trend are pretty clear. No one said the correlation was 99%.

Edit: I've calculated Pearson's correlation coefficient for this data [1]. The result is 0.82, which indicates a strong positive correlation.

[1] https://en.wikipedia.org/wiki/Pearson_correlation_coefficien...
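For anyone who wants to check a number like this by hand, here is a rough sketch of Pearson's r computed straight from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the sample lists are hypothetical, and this is not the script used for the 0.82 figure.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up toy data, only to exercise the function.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
print(pearson_r(xs, ys))
```

Note that r measures linear association on whatever scale you feed it, so computing it on raw values and on log-transformed values will generally give different numbers.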


> The result is 0.82, which indicates a strong positive correlation.

datamash gave me 0.52 for Pearson. Which is "eh, maybe".


That's weird, are you looking only at the top 10 countries?

I've reproduced dwaltrib's results using World Bank data on 251 countries, and I get a Pearson's r of 0.82 and a p-value of 5.6e-61 (!). I.e. a strong correlation, with high confidence. It makes sense too -- larger countries generally have more people, and more people generally generate more economic activity.

Code if you want to try yourself:

    import pandas as pd
    from scipy import stats

    # World Bank indicator downloads: GDP (current US$) and land area (sq. km),
    # both indexed by country name so the rows line up.
    gdp = pd.read_csv("~/Downloads/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_5551501.csv").set_index("Country Name")
    land_area = pd.read_csv("~/Downloads/API_AG.LND.TOTL.K2_DS2_en_csv_v2_5552158.csv").set_index("Country Name")

    # Take the 2020 column from each file and drop rows missing either value.
    gdp["GDP"] = gdp["2020"]
    gdp["Land"] = land_area["2020"]
    gdp = gdp.dropna(subset=["GDP", "Land"])

    print(stats.pearsonr(gdp.Land, gdp.GDP))
    # Output: PearsonRResult(statistic=0.8151313879150333, pvalue=5.621180589722219e-61)


Thanks for double checking. I ran out of steam on the convo heh.



