You are reading tutorials
Author: Ioannis Prapas
Data Scientist

Analyze Geospatial Data in Python: GeoPandas and Shapely

This article is the first out of three of our geospatial series.

Geospatial data have a lot of value. Our Geospatial series will teach you how to extract this value as a data scientist.

  • This 1st article introduces you to the mindset and tools needed to deal with geospatial data. It also includes a reincarnation of what has become known as the first spatial data analysis ever conducted: John Snow's investigation of the 1854 Broad Street cholera outbreak.
  • The 2nd article will dive deeper into the geospatial python framework by showing you how to conduct your own spatial analysis.
  • The 3rd article will apply machine learning to geospatial data.

Intro to Geospatial Data

Geospatial data describe any object or feature on Earth's surface. Common examples include:

  • Where should a brand locate its next store?
  • How does the weather impact regional sales?
  • What's the best route to take in a car?
  • Which area will be hit hardest by a hurricane?
  • How does ice cap melting relate to carbon emissions?
  • Which areas will be at the highest risk of fires?

Answers to these questions are valuable, making spatial data skills a great addition to any data scientist's toolset.

The Basics

Let's start by learning to speak the language of geospatial data. At the end of this section, you will know about:

  1. Vector vs. raster data
  2. Geographic Reference Systems (CRS)
  3. The difference between Georeferencing and Geocoding.


Vector data represent geometries in the world. When you open a navigation map, you see vector data. The road network, the buildings, the restaurants, and ATMs are all vectors with their associated attributes.

Note: Vectors are mathematical objects. Unlike rasters, you can zoom into vectors without losing resolution.

There are three main types of vector data:

  1. Points
  2. Lines. Connecting points creates a line.
  3. Polygons. Connecting lines with an enclosed area generate a polygon.

We can use vectors to present features and properties on the Earth’s surface. You'll most often see vectors stored in shapefiles (.shp).

Specific attributes that define properties will generally accompany vectors. For example, properties of a building (e.g., its name, address, price, date built) can accompany a polygon.

Vectors make up a navigation map


Raster data is a grid of pixels. Each pixel within a raster has a value, such as color, height, temperature, wind velocity, or other measurements.

Whereas the default view in Google maps contains vectors, the satellite view contains raster satellite images stitched together. Each pixel in the satellite image has a value/color associated with it. Each pixel in an elevation map represents a specific height. Raster $=$ Image with Pixels

Agriculture fields of the southern Limpopo region, South Africa.
Source: Flickr

These are not your usual images. They contain RGB data that our eyes can see, and multispectral or even hyperspectral information from outside the visible electromagnetic spectrum. Instead of being limited to only 3 channels/colors (RGB), we can get images with many channels.

Things that are invisible to the naked eye, absorbing only a small part of the electromagnetic spectrum, can be revealed in other electromagnetic frequencies.


Raster VS Vector Table

Points, Lines, PolygonsPixels
Geometric Objects, Infinitely ScalableFixed Grid, Fixed Resolution
.svg, .shp.jpg, .png, .tif

Coordinate Reference System (CRS)

To identify exact locations on the surface of the Earth, we use a geographic coordinate system.

For example, try searching for 37.971441, 23.725665 on Google Maps. Those two numbers point to an exact place - the Parthenon in Athens, Greece. The two numbers are coordinates defined by the CRS.

Even though the Earth is a 3-dimensional sphere, we use a 2-dimensional coordinate system of longitude (vertical lines running north-south) and latitude (horizontal lines running east-west) to identify a position on the Earth's surface. Converting a 3D sphere (the globe) into a 2D coordinate system introduces some distortions. We will explore those distortions in the next section on Map Projections.


Note: No CRS is perfect

Any choice of CRS involves a tradeoff that distorts one or all of the following:

  • shape
  • scale/distance
  • area

Very Important!!! Most mistakes in geospatial analyses come from choosing the wrong CRS for the desired operation. If you do not want to spend days and nights debugging, read this section thoroughly!

Common CRS pitfalls:

  • Mixing coordinate systems: When combining datasets, the spatial objects MUST have the same reference system. Be sure to convert everything to the same CRS. We show you how to perform this conversion below.
  • Calculating areas: Use an equal-area CRS before measuring a shape's area.
  • Calculating distances: Use an equidistant CRS when calculating distances between objects.

Map Projections


A map projection flattens a globe's surface by transforming coordinates from the Earth's curved surface into a flat plane. Because the Earth is not flat (I hope we agree here), any projection of the Earth into a 2D plane is a mere approximation of reality.

In reality, the Earth is a geoid, meaning an irregularly-shaped ball that is not quite a sphere. The most well-known projection is the Mercator projection. As shown in the above gif, the Mercator projection inflates objects that are far from the equator.

These inflations lead to some surprising revelations of our ignorance, like how the USA, China, India, and Europe all fit inside Africa.


Georeferencing is the process of assigning coordinates to vectors or rasters to project them on a model of the Earth’s surface. It is what allows us to create layers of maps.

With just a click within Google Maps, you can change seamlessly from satellite view to road network view. Georeferencing makes that switch possible.


Geocoding is the process of converting a human-readable address into a set of geographic coordinates. For example, the Parthenon in Athens, Greece, is at latitude 37.988579 and longitude 23.7285437.

There several libraries that handle geocoding for you. In Python, geopandas has a geocoding utility that we'll cover in the following article.

Python geospatial libraries

In this article, we'll learn about geopandas and shapely, two of the most useful libraries for geospatial analysis with Python.

Shapely - a library that allows manipulation and analysis of planar geometry objects. pip install shapely.

Geopandas - a library that allows you to process shapefiles representing tabular data (like pandas), where every row is associated with a geometry. It provides access to many spatial functions for applying geometries, plotting maps, and geocoding. Geopandas internally uses shapely for defining geometries.


What can you put into geometry?

The basic shapely objects are points, lines, and polygons, but you can also define multiple objects in the same object. Then you have multipoints, multilines and multipolygons. These are useful for objects defined by various geometries, such as countries with islands.

Let's see how that looks:

import shapely
from shapely.geometry import Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon

Shapely defines a point by its x, y coordinates, like so:


We can calculate the distance between shapely objects, such as two points:

a = Point(0, 0)
b = Point(1, 0)

Multiple points can be placed into a single object:

MultiPoint([(0,0), (0,1), (1,1), (1,0)])

A sequence of points form a line object:

line = LineString([(0,0),(1,2), (0,1)])

The length and bounds of a line are available with the length and bounds attributes:

print(f'Length of line {line.length}')
print(f'Bounds of line {line.bounds}')
Length of line 3.6502815398728847
Bounds of line (0.0, 0.0, 1.0, 2.0)

A polygon is also defined by a series of points:

pol = Polygon([(0,0), (0,1), (1,1), (1,0)])

Polygons also have helpful attributes, such as area:


There are other useful functions where geometries interact, such as checking if the polygon pol intersects with the line from above:


We can also calculate the intersection:


But what is this object?


It's a GeometryCollection, which is a collection of different types of geometries.

Pretty straightforward and intuitive so far! You can do so much more with the shapely library, so be sure to check the docs.

Geopandas basics

Another tool for working with geospatial data is geopandas. As we know, pandas DataFrames represent tabular datasets. Similarly, geopandas DataFrames represent tabular data with two extensions:

  • The geometry column defines a point, line, or polygon associated with the rest of the columns. This column is a collection of shapely objects. Whatever you can do with shapely objects, you can also do with the geometry object.
  • The CRS is the coordinate reference system of the geometry column that tells us where a point, line, or polygon lies on the Earth's surface. Geopandas maps a geometry onto the Earth's surface (e.g., WGS84).

Let's see this in practice.

How to install geopandas on Windows

The easiest way to install geopandas on Windows is to use Anaconda with the following command: conda install -c conda-forge geopandas

If you don't have Anaconda, there are several dependencies you need to install first for geopandas to install via pip successfully.

1) Find your Python version

Open command prompt and type python. You will see a similar version info printout to this:

Python 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)] on win32

This printout tells me that I have Python 3.8, 64 bit (AMD64), which we'll need to keep in mind for the next steps.

2) Install the GDAL dependency

geopandas requires GDAL, and you can obtain a wheel of GDAL for your system here:

You will see a list of different versions, and you need to pick the version that corresponds to the Python version you found in step one. At the time of this writing, the latest version of GDAL is 3.3.0, but this version did not successfully install despite my Python version matching. Instead, I went with one version lower: 3.2.3.

Since my Python version is Python 3.8, 64-bit, this corresponds to the GDAL wheel GDAL‑3.3.0‑cp38‑cp38‑win_amd64.whl. Notice that the cp38 and amd64 match my Python version.

Once you download the wheel, you can install it using pip by first using command prompt to go to the directory where the wheel is located, then run the following install command:

pip install GDAL‑3.3.0‑cp38‑cp38‑win_amd64.whl

This command will differ depending on the GDAL version you downloaded. If the install completes without errors, you can now install Fiona in the next step.

3) Install the Fiona dependency

geopandas also requires Fiona, and you can obtain a wheel of Fiona for your system here:

Again, you will see different wheel options, and like GDAL from the previous step, you need to match your Python version. In this case, the most recent Fiona version is 1.8.20, which worked with my system. The wheel that corresponds matches my Python version is Fiona‑1.8.20‑cp38‑cp38‑win_amd64.whl.

Similar to GDAL, you can install the Fiona wheel with pip like so:

pip install Fiona‑1.8.20‑cp38‑cp38‑win_amd64.whl

Make sure to replace the wheel with your version. If you don't see any errors from running this command, geopandas should install successfully.

4) Install geopandas

Now that you have GDAL and Fiona, you should be able to run the following command to install geopandas:

pip install geopandas

You can then confirm the install was successful by opening a Python interpreter and running import geopandas. This import should not result in any exceptions.

# !pip install geopandas
import matplotlib
import geopandas as gpd

Loading data

Let's start by loading a dataset shipped with geopandas, called 'naturalearth_lowres'. This dataset includes the geometry of each country in the world, accompanied by some further details such as Population and GDP estimates.

world_gdf = gpd.read_file(
0920938OceaniaFijiFJI8374.0MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
153950935AfricaTanzaniaTZA150600.0POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2603253AfricaW. SaharaESH906.5POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
335623680North AmericaCanadaCAN1674000.0MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4326625791North AmericaUnited States of AmericaUSA18560000.0MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
1727111024EuropeSerbiaSRB101800.0POLYGON ((18.82982 45.90887, 18.82984 45.90888...
173642550EuropeMontenegroMNE10610.0POLYGON ((20.07070 42.58863, 19.80161 42.50009...
1741895250EuropeKosovo-9918490.0POLYGON ((20.59025 41.85541, 20.52295 42.21787...
1751218208North AmericaTrinidad and TobagoTTO43570.0POLYGON ((-61.68000 10.76000, -61.10500 10.890...
17613026129AfricaS. SudanSSD20880.0POLYGON ((30.83385 3.50917, 29.95350 4.17370, ...

177 rows × 6 columns

If you ignore the geometry column (a shapely object), this looks like a regular dataframe. All the columns are pretty much self-explanatory.


The dataframe also includes a CRS that maps the polygons defined in the geometry column to the Earth's surface.
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In our case, the CRS is EPSG:4326. That CRS uses Latitude and Longitude in degrees as coordinates.


Components of a CRS

Datum - The reference system, which in our case defines the starting point of measurement (Prime Meridian) and the model of the shape of the Earth (Ellipsoid). The most common Datum is WGS84, but it is not the only one.

Area of use - In our case, the are of use is the whole world, but there are many CRS that are optimized for a particular area of interest.

Axes and Units - Usually, longitude and latitude are measured in degrees. Units for x, y coordinates are often measured in meters.

Let's see an application for which we have to change the CRS.

Let's measure the population density of each country! We can measure the area of each geometry but bear in mind that we need first need to convert to an equal-area projection that uses meters as units.

world_gdf = world_gdf.to_crs("+proj=eck4 +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs")
<Projected CRS: +proj=eck4 +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +un ...>
Name: unknown
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: unknown
- method: Eckert IV
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

We can now calculate each country's population density by dividing the population estimate by the area.

Note: We can access the area of the geometries as we would regular columns. Although no column contains geometry areas, the area is an attribute of the geometry objects.

world_gdf['pop_density'] = world_gdf.pop_est / world_gdf.area * 10**6

world_gdf.sort_values(by='pop_density', ascending=False)
99157826578AsiaBangladeshBGD628400.00POLYGON ((8455037.031 2862141.705, 8469605.972...1174.967806
794543126AsiaPalestinePSE21220.77POLYGON ((3127401.561 4023733.541, 3087561.638...899.418534
14023508428AsiaTaiwanTWN1127000.00POLYGON ((11034560.069 3156825.603, 11032285.2...681.899108
776229794AsiaLebanonLBN85160.00POLYGON ((3141154.397 4236334.349, 3117804.289...615.543551
9651181299AsiaSouth KoreaKOR1929000.00POLYGON ((10835604.955 4755864.739, 10836040.9...515.848728
973068243AsiaMongoliaMNG37000.00POLYGON ((7032142.671 6000941.853, 7107939.605...1.987486
202931South AmericaFalkland Is.FLK281.80POLYGON ((-4814015.486 -6253920.632, -4740858....0.179343
2257713North AmericaGreenlandGRL2173.00POLYGON ((-2555525.099 8347965.820, -2346518.8...0.026295
23140Seven seas (open ocean)Fr. S. Antarctic LandsATF16.00POLYGON ((5550199.759 -5932855.132, 5589906.67...0.012091
1594050AntarcticaAntarcticaATA810.00MULTIPOLYGON (((-2870542.982 -8180812.656, -28...0.000331

177 rows × 7 columns

Notice how the geometry objects now have values that are in totally different units than before.

Just looking at the dataframe above, we can quickly identify the outliers. Bangladesh has a population density of around $ 1175 \space persons / km^2$. Antarctica has a near-zero population density, with only 810 people living in a vast space.

It's always better to visualize maps, though. So let's visualize!


We can call .plot() on world_gdf just like a pandas dataframe:

figsize = (20, 11)

world_gdf.plot('pop_density', legend=True, figsize=figsize);

The above map doesn't look very helpful, so let's make it better by doing the following:

  1. Change to the Mercator projection since it's more familiar. Using the parameter to_crs('epsg:4326') can do that.
  2. Convert the colorbar to a logscale, which can be achieved using matplotlib.colors.LogNorm(vmin=world.pop_density.min(), vmax=world.pop_density.max())

We can pass different arguments to the plot function as you would directly on matplotlib.

norm = matplotlib.colors.LogNorm(vmin=world_gdf.pop_density.min(), vmax=world_gdf.pop_density.max())


Up until now, we've gone over the basics of shapely and geopandas, but now it's time we move to a complete case study.

Case Study: The source of the 1854 cholera outbreak

We'll use modern Python tools to redo John Snow's analysis identifying the source of the 1854 cholera outbreak on London's Broad Street. In contrast to his Game of Thrones counterpart, London's John Snow did now something: the source of cholera. He learned it doing the first-ever geospatial analysis!

For this example, we'll use the data from Robin's blog. Robin did the work to digitize Snow's original map and data.

Let's first retrieve the data and unzip it in our current directory:



Let's see what's inside:

!ls SnowGIS/

Ignore the file extensions for a moment, and let's see what we have here.

Vector Data

  • Cholera_Deaths : number of deaths at a given spatial coordinate
  • Pumps : location of water pumps

We can ignore the other files for the vector data and only deal with the '.shp' files. Called shapefiles, .shp is a standard format for vector objects.

Raster Data

  • OSMap_Grayscale : raster - georeferenced grayscale map of the area from OpenStreet Maps (OSM)
  • OSMap : raster - georeferenced map of the area from OpenStreet Maps (OSM)
  • SnowMap : raster - digitized and georeferenced John Snow's original map

We can ignore the other files for the raster data and only deal with the '.tif' files. '.tif' is the most common format for storing raster and image data.

How to install contextily on Windows

If you followed the Windows installation guide for geopandas earlier in the article, installing contextily will be very similar.

1) Install rasterio dependency

First, go to and download the correct wheel for your Python version Then run pip install rasterio‑1.2.6‑cp38‑cp38‑win_amd64.whl but using the wheel version you downloaded in the previous step.

2) Install contextily with pip install contextily

You should now be able to use pip to install contextily.

!pip install rasterio
!pip install contextily

We've already imported geopandas and matplotlib, so all we need for the rest of the analysis is to import contextily for plotting maps. Let's import those now:

import contextily as ctx

Reading in data

Let's read in the Cholera_Death.shp and Pumps.shp files into geopandas:

deaths_df = gpd.read_file('SnowGIS/Cholera_Deaths.shp')
pumps_df = gpd.read_file('SnowGIS/Pumps.shp')
003POINT (529308.741 181031.352)
102POINT (529312.164 181025.172)
201POINT (529314.382 181020.294)
301POINT (529317.380 181014.259)
404POINT (529320.675 181007.872)

The output looks exactly like a pandas dataframe. The only difference with geopandas' dataframes is the geometry column, which is our vector dataset's essence. In our case, it includes the point coordinates of the deaths as John Snow logged them.

Let's see what the CRS data looks like:
<Projected CRS: EPSG:27700>
Name: OSGB 1936 / British National Grid
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: United Kingdom (UK) - offshore to boundary of UKCS within 49°45'N to 61°N and 9°W to 2°E; onshore Great Britain (England, Wales and Scotland). Isle of Man onshore.
- bounds: (-9.0, 49.75, 2.01, 61.01)
Coordinate Operation:
- name: British National Grid
- method: Transverse Mercator
Datum: OSGB 1936
- Ellipsoid: Airy 1830
- Prime Meridian: Greenwich

The other difference is that correctly defined shapefiles include metadata articulating their Coordinate Reference System (CRS). In this case, it is EPSG:27700.

Let's now briefly look at the pumps data:

00POINT (529396.539 181025.063)
10POINT (529192.538 181079.391)
20POINT (529183.740 181193.735)
30POINT (529748.911 180924.207)
40POINT (529613.205 180896.804)
50POINT (529453.586 180826.353)
60POINT (529593.727 180660.455)
70POINT (529296.104 180794.849)

Similarly, pumps_df holds the positions of the water pumps near Broad Street.

Here are the pumps CRS data:
<Projected CRS: EPSG:27700>
Name: OSGB 1936 / British National Grid
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: United Kingdom (UK) - offshore to boundary of UKCS within 49°45'N to 61°N and 9°W to 2°E; onshore Great Britain (England, Wales and Scotland). Isle of Man onshore.
- bounds: (-9.0, 49.75, 2.01, 61.01)
Coordinate Operation:
- name: British National Grid
- method: Transverse Mercator
Datum: OSGB 1936
- Ellipsoid: Airy 1830
- Prime Meridian: Greenwich


When dealing with geospatial data, you should make sure all your sources have the same CRS. I cannot stress this enough. It is probably the most common source of all mistakes when dealing with geospatial data.

Plotting the outbreak

We can now plot the deaths and pumps data on a map of London's Broad Street.

We'll start building the plot by first charting deaths:

ax = deaths_df.plot(column='Count', alpha=0.5, edgecolor='k', legend=True)

With a reference to ax, we can then plot the pumps in their locations, marking them with a red X. Let's also make the figure larger.

ax = deaths_df.plot(column='Count', figsize=(15, 15), alpha=0.5, edgecolor='k', legend=True)
pumps_df.plot(ax=ax, marker='x', color='red', markersize=50)

We would now like to show a map of London's Broad Street underneath the data. This is where we can use contextily to read the CRS data:

ax = deaths_df.plot(column='Count', figsize=(15, 15), alpha=0.5, edgecolor='k', legend=True)
pumps_df.plot(ax=ax, marker='x', color='red', markersize=50)

    # CRS definition. Without the line below, the map stops making sense,

Now let's look at the same data, but on John Snow's original map. We can do this by changing the source parameter to SnowMap.tif, like so:

ax = deaths_df.plot(column='Count', figsize=(15, 15), alpha=0.5, edgecolor='k', legend=True)
pumps_df.plot(ax=ax, marker='x', color='red', markersize=50);

    # Using the original map, hand-drawn by Snow

John Snow understood that most cholera deaths were clustered around a specific water pump at the intersection of Broad Street and Lexington Street (red X near the middle of the map). He attributed the outbreak to an infected water supply at that pump.

It's interesting to see how little the area has changed since 1854.


In this article, we have had a small glimpse of what you can do with geospatial data:

  • We covered the basic notions that you need to understand to work with geospatial data. You should know the difference between a vector vs. raster and between geocoding vs. georeferencing. You also learned about projections, CRSs, and that Africa is HUGE!
  • We covered the basics of shapely and geopandas, allowing us to work with geospatial vectors.
  • Lastly, we reincarnated the first geospatial analysis. We found the infected water pump that was the source of the 1854 cholera outbreak in London.

Follow us for the following articles where we:

  1. dive deeper into geopandas, preparing and analyzing a geospatial dataset
  2. do machine learning with geospatial data!

After this series, you'll be ready to carry out your own spatial analysis and identify patterns in our world!

More Resources

A good place to find free spatial datasets is rtwilson's list of free spatial data sources.

You can download free satellite imagery from NASA's portal or Copernicus.

Consider enrolling in a course to learn more about how to handle spatial data. Check out Geo University. Or start with the Geospatial Analysis course on Coursera.

Take the internet's best data science courses Learn More

Meet the Authors


MEng in Electrical and Computer Engineering from NTUA Athens. MSc in Big Data Management and Analytics from ULB Brussels, UPC Barcelona and TUB Berlin. Has software engineering experience working at the European Organization for Nuclear Research (CERN). Currently he is working as a Research Data Scientist on a Deep Learning based fire risk prediction system.

Editor: Rhys Lewis
Brendan Martin
Editor: Brendan Martin
Founder of LearnDataSci

Get updates in your inbox

Join over 7,500 data science learners.