Data Resources
The following list gives links to a selection of data
repositories that are used by data-driven software developers
and data scientists:
- Simplemaps World Cities Database is a narrowly focused
resource that provides high-quality data regarding cities, countries
and populations. The free dataset they provide may be useful for many
projects that involve analysis of geographically related data.
- Our World in Data is both a journal and a data repository.
It is focussed on the geographic distribution of human-related attributes,
such as life expectancy, health and economics. It provides a very good way to
investigate data analysis and visualisation since all the charts in the
publications enable you to download the data from which they were created.
- The UK Met Office website hosts several data sources
relating to weather and climate.
The data, and information on how to use it, can be obtained via their
DataPoint service web page.
- The US Geological Survey (USGS) provides many
geological and geographic datasets, including live stream data
of several kinds. (Detailed geographic data relating to Great
Britain is available from the UK's Ordnance Survey via their
open data resource, but it is a little more difficult to access and
utilise than the USGS data.)
- The US National Aeronautics and Space Administration
(NASA) provides a data resource hub
that enables access to a huge variety of scientific
data. This of course includes data relating to astronomy and space
science, but also covers data from all fields of physical science.
The site provides a visualisation of the different kinds of data
that are available.
- Kaggle is currently the largest and
most widely used resource providing datasets for data science.
It is not just a dataset repository but also supports a community
within which data scientists can share and explore each other's
data. What is more, it provides a cloud-based Jupyter notebook
environment and computational resources so that researchers
can implement data analysis projects within Kaggle itself.
It also facilitates browsing and discovery of datasets
via its search interface.
Kaggle is very much oriented towards machine learning (ML)
approaches to data analysis and provides many features that
support this.
You will need to create an account on Kaggle in order to
make full use of its facilities, but that is not
necessary for just searching or browsing the datasets.
- data.world is similar to Kaggle in
supporting a data sharing community rather than just being
a repository. It differs from Kaggle in that it is not
particularly oriented to ML. It is geared towards more
established ways of accessing and analysing data, and provides
facilities for accessing remotely stored data by means
of queries formulated in SQL or SPARQL.
- For those interested in using machine learning techniques for data analysis
in Python, the Scikit-learn
package is an excellent framework within which to conduct such research.
It also provides a detailed user guide and tutorials,
as well as access to many datasets.
- The prestigious science journal Nature curates
a web page of recommended data repositories,
which lists and describes a large number of repositories, covering a
very wide range of disciplines.
Whereas the links suggested above were somewhat biased
towards geographic and physical sciences, Nature's list of
recommended repositories also contains many links to
repositories of biological, medical, and social sciences
data.
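
Several of the repositories listed above offer their data as CSV downloads, so a first look at a dataset often needs nothing more than the Python standard library. The following is a minimal sketch, not taken from any of the sites themselves: the column names `city`, `country` and `population` and the sample rows are made up for illustration, and a real file from any of these sources may well use different headers.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for a downloaded CSV file.
# The column names are assumptions chosen for illustration only.
SAMPLE_CSV = """city,country,population
Alphaville,Examplia,1200000
Betatown,Examplia,300000
Gammaport,Samplestan,
Deltaburg,Samplestan,750000
"""


def population_by_country(csv_file):
    """Sum the population column per country, skipping rows with no value."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        value = (row.get("population") or "").strip()
        if not value:
            continue  # this row has no population figure
        totals[row["country"]] += int(float(value))
    return dict(totals)


totals = population_by_country(io.StringIO(SAMPLE_CSV))
for country, total in sorted(totals.items()):
    print(f"{country}: {total}")
```

To try this on a real download, the `io.StringIO(SAMPLE_CSV)` argument could be replaced with a file opened via `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the file, with the column names adjusted to match its header row.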