Getting Started with GEOGRAPHY Data—Why Don’t Planes Fly in a Straight Line?
Last June, Snowflake released geospatial data using the GEOGRAPHY data type to general availability. Since then, we’ve seen the number of geospatial queries rise significantly. However, the increased popularity brought more questions from Snowflake users on how to get started using geospatial data to enrich their business insights. This blog post will help anyone who is new to geospatial data understand the basics and get started in Snowflake. It also serves as a great pre-read for someone who wants to follow this geospatial-focused quickstart guide to get their feet wet.
GEOGRAPHY basics (aka, why don’t planes fly in a straight line?)
We’re all familiar with looking at visual representations of the earth. Depending on your age, you may remember flat poster maps decorating your classroom walls. Maybe your history teacher had a globe on their desk. Or maybe once upon a time you kept a book of maps in your car for road trips. Still, most of us today are accustomed to looking at our phones for anything related to where we are in the world, or where we’re going.
Let’s focus on that globe on your teacher’s desk for a moment. As a kid (or even as an adult!), you probably spun that globe, maybe letting your fingers glide across the smooth surface as the spherical object rotated round and round. You probably also found your location on the globe and looked at how you might travel to some other interesting country or continent. Perhaps you traced your finger along the path from your current location to that other place. If you did, chances are you didn’t trace the shortest path possible!
Look at the image above. Imagine if an airline told you that when flying from Berlin to San Francisco, the plane’s path would travel over Iceland, Greenland, and Canada. You’d think they were nuts, right? Clearly that arc is a longer flight than just simply drawing a straight line with your finger from Berlin to San Francisco like this:
That looks so much shorter! Ah, but it’s not. And the reason gets to the heart of understanding the GEOGRAPHY data type in Snowflake: the difference between geodesic and euclidean measurement. Simply put, geodesic measurement is based on a 3D spherical shape and measures distance over a curvature, whereas euclidean measurement is based on the flat 2D shape most of us are used to seeing in the above image. When we look at the line in the first image from more of a spherical perspective, though, we see why a plane would want to fly on that flight path:
The GEOGRAPHY data type in Snowflake accounts for this 3D curvature, using the latest World Geodetic System, WGS 84, standard for representing the earth as a spherical object using a latitude and longitude-based coordinate system. Let’s dive deeper into the GEOGRAPHY data type support in more detail (and if you’re a euclidean-2D-kinda-person, don’t fret—we’ll look at using that method in part 2).
Geospatial formats, functions, and object types
To understand the GEOGRAPHY data type, it helps to understand three fundamental concepts:
- Geospatial data formats
- Geospatial object types (also called “dimensions”)
- Geospatial functions and how to apply them
Let’s start with geospatial data formats. You can think of these formats as both input formats (loading data into the GEOGRAPHY data type) and output formats (how the data looks when queried from a GEOGRAPHY-typed column). There are three primary formats that Snowflake supports: (E)WKT - (Extended) Well-Known Text; (E)WKB - (Extended) Well-Known Binary; and GeoJSON. The table below shows the three formats and extended variants where applicable, as well as a query output representation of the longitude and latitude of the Bellagio Fountains (since the Snowflake Summit 2022 is in Las Vegas this year):
You should also note in the above table the use of the session parameter “GEOGRAPHY_OUTPUT_FORMAT” which will control how a GEOGRAPHY output will display in Snowflake or in a file after an unload operation.
In the above output representations of the Bellagio Fountains, the output is showing a particular object type, or geospatial dimension type: a POINT. There are many different object types, some of which are singular objects, and others which are groupings of multiple objects, as shown in the table below:
If you’re having nightmarish visions of grade school geometry class, don’t fret—Snowflake has your back on how to do all of those area, perimeter, and intersection calculations. Geospatial functions know how to operate on a GEOGRAPHY-typed object, and they make the calculations easy. Let’s take a look at the categories of geospatial functions and a few examples of each, and we’ll follow that with some examples of how to use them in SQL. If you want the full list of functions, you can find them here.
In addition to the different types of available functions, there are different ways we can use these functions: in the SELECT clause, in the query predicate, and in a JOIN clause. Let’s look at some examples of each usage, with different functions used in each instance.
Query examples
Conversion in the SELECT
We showed this example earlier in the format section, but let’s look at it again for its use of TO_GEOGRAPHY in the SELECT clause. What’s happening here is that we’re passing a string (‘POINT (-115.174016 36.112650)’) that is properly formatted as geospatial data and we’re converting it into a true geospatial data type object that can be used with geospatial functions, as we’ll see in a bit. Without wrapping that string in the TO_GEOGRAPHY function, Snowflake has no idea that the string is geospatial data. And for the purposes of this example, showing how the output of this query is actually geospatial data in GeoJSON drives the conversion point further home, since we’re not just regurgitating the WKT string.
A quick note about our string: the coordinates -115.174016 36.112650 were found by using one of a multitude of websites on the internet that output longitude and latitude coordinates when you click a point on the interactive map. As you’ll see later, we also have the Bellagio Fountains as a POLYGON shape in a table. If you need high-quality and ready-to-query geospatial data, Snowflake Marketplace has many geospatial-centric listings that you can access very quickly within your Snowflake account.
Conversion and measurement in the SELECT
The first query above was pretty simple—it didn’t even query a table. As the comment in line 4 describes, we are trying to find the row in a table of buildings that is closest to our Bellagio Fountains POINT by measuring the distance between the fountains to every row in the table, sorting the distances in ascending order, then limiting our results to the first (closest) row. Spoiler alert: the Bellagio Hotel is the closest building to the Bellagio Fountains!
Some notes about what’s happening here:
- ST_MAKEPOINT is just like TO_GEOGRAPHY, but specifically for POINTs. The syntax is a little different in that you can simply feed it longitude and latitude values (in that order) without the extra text around it, but the result is the same.
- ST_DISTANCE calculates the distance in meters between two objects: our fountains POINT created in the bullet above, and the shape column in our buildings table, which is itself a GEOGRAPHY-typed column.
- Note that we can perform a sort on the number calculated in the above bullet just like any other number in Snowflake.
Relationship comparisons in the WHERE clause
Instead of calculating the measurement of something, we’re now going to use a function to look at how objects relate to each other. The query above is similar to the distance calculation we just did, but instead of finding the shortest distance between two objects, we’re looking to see if any objects share space with each other, or intersect, which is a condition that we can evaluate with a WHERE clause. Either they do intersect or they don’t.
Some notes about what’s happening here:
- This query uses the same buildings table from before, but introduces a second table, all_objects. Where the former is a specific point-of-interest type, the latter contains POLYGONS for all points-of-interest, and it just so happens there is a POLYGON for the Bellagio Fountains, which is way_id=22971030. You were spared the research to locate this specific row.
- ST_INTERSECTS then looks at each shape in buildings and determines if it intersects with the shape found by running the subquery against all_objects. If the answer is TRUE, the row is returned. If it is FALSE, the row is not. You do not need to specify =TRUE in the condition when searching for TRUE results, as it is implied, but you would need to include =FALSE if you wanted to return the rows that do not intersect.
Relationship comparisons in a JOIN clause
In the previous relationship query, a subquery was run to return the shape of a specific object, then compare that object to all the rows in a given table. But what if we want to compare many rows in one table against many rows in another table? That’s what this query does by using a geospatial function as the condition in a join between two tables, rather than just using a simple t1.column = t2.column condition, and it can be written in two ways.
Some notes about what’s happening here:
- ST_DWITHIN is looking to see if two objects are within a certain meters distance from each other. It is comparable to writing ST_DISTANCE() <= [meters]. In this case, our two objects are shapes within our sports and cemetery tables (don’t read too much into those two table choices—purely coincidental!).
- The query starting at line 18 is written with the join condition expressed in the WHERE clause. The query starting at line 23 is written using the ANSI SQL JOIN clause, and both are equally accurate and performant. You may be thinking to yourself, though, that the JOIN in this example is a full cartesian product CROSS JOIN, and you would be correct. However, you should not specify a CROSS JOIN in geospatial joins in Snowflake, as Snowflake does not use a slow CROSS JOIN in a geospatial join of this nature. Instead, we have a more optimized join technique for these types of queries.
The example queries you’ve seen in this blog post are just a taste of what’s possible. The quickstart is an excellent place to learn how to use more geospatial functions than what we’ve covered here, but hopefully reading this post gives you a good foundation.
Stay tuned for part 2
In summary, Snowflake introduced geospatial GEOGRAPHY data type functionality, which allows users to represent data in Snowflake using common geospatial formats and use specialized functions to query that data. GEOGRAPHY uses the WGS84 standard, which treats the earth as a sphere and calculates measurements using geodesic arcs along a curve instead of a two-dimensional straight line. In part 2, we’ll explore how Snowflake is expanding this functionality to accommodate queries against more types of geospatial data.