Geo-Spatial Search

DSE Version: 6.0

Transcript: 

Hello, this is Joe Chu.  In this video we will be talking about geo-spatial search. We finally get to one of the coolest  features in DSE Search.

While the focus of this video is geo-spatial search, let's back up a moment and clarify a broader concept, spatial search, which lacks the geo part. Spatial search is a way to evaluate and compare the spatial geometry between two or more geometrical shapes, which can be anything from a point, to a line, triangles, rectangles, and even more complex polygons. With the smallest unit being the point, everything else is therefore described as a series of points as well.

With this capability, DSE Search is able to do things like filter results for geometrical shapes that exist on, within, or outside another geometric shape. We can sort or boost scores based on the distance between two points or shapes, or by comparing the relative area between two shapes. Spatial search can also generate a 2D grid of facet counts that can be used for heatmap generation or point-plotting.

Spatial search itself does not need to rely on any specific context; it only relies on the geometry to do its comparison. On the other hand, geo-spatial search typically focuses on a geographical or locational context, and tends to work in just two dimensions.

Before we can do any geo-spatial searching, you'll probably want to know what geospatial data looks like first. Usually the data is something that represents an object's location as defined by a coordinate system. The coordinates can be either absolute, where the points of an object are fixed somewhere in coordinate space, or relative, where the points of an object are given as offsets from other points. Although we have described objects so far as geometric shapes, they typically represent something meaningful, such as a particular address, the location of vehicles, points of interest, or even your own current location.

There are several standards used for representing geo-spatial data, but the one used in DataStax Enterprise is the Well-Known Text markup language, or WKT. This is a format that allows points and other geometric shapes to be defined within a two-dimensional, signed Cartesian coordinate system. You can think of that as those x,y coordinate planes from your old math classes, but it can also represent our own world, which uses latitude and longitude coordinates.

DataStax Enterprise has three data types that are used specifically for storing WKT object data, which are the PointType, the LineStringType, and the PolygonType.  Let's go over these one by one.

The PointType is an object that represents a point in space, and is described using a coordinate pair, which is its location on the x axis and on the y axis. Depending on the coordinate space used to describe the point, the values for the coordinate pair can be described as an integer or as a decimal, and DSE allows for both to be stored. When working with the PointType within CQL, you work with it as a string, which starts with the word point, and then contains the coordinate pair enclosed in parentheses, as shown in this example here. Note that with lat-long coordinates, the best practice is to actually describe the point with the longitude first, as that is what describes the x coordinate, and then with the latitude, which would be the y coordinate.
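Since the longitude-first ordering is easy to get backwards, here is a small Python sketch of how an application might build the WKT string for a point from a lat-long pair. The function name point_wkt and the sample coordinates are made up for illustration; they are not part of DSE or any driver.

```python
def point_wkt(latitude, longitude):
    """Format a lat-long pair as a WKT point string.

    WKT expects x first and then y, so the longitude is written
    before the latitude.
    """
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    return f"POINT({longitude} {latitude})"


# Illustrative coordinates only (hypothetical location).
print(point_wkt(37.3875, -121.9637))
```

Taking the arguments in the familiar lat-long order and swapping them inside the function keeps the swap in one place, rather than relying on every caller to remember it.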

An example of geo-spatial data with a point would be a place of interest, such as the location of our own DataStax headquarters, shown on the map here.

The next data type is the LineStringType, which is a line object that is made up of two or more points. The object will start with the word linestring, and then contain all of the point coordinates for the line, enclosed in parentheses and separated by commas, as seen in this example. Search will automatically calculate the line as a vector, using the points provided, so you do not need to use a large number of points for a straight line. It doesn't matter if the points are provided starting from one end of the line or from the other end. However, it does require the points to be provided in a specific order so that the line doesn't intersect itself.

An example where a linestring might be used would be to describe something such as a boundary, a road, or a river.

The last geo-spatial data type in DataStax Enterprise is the PolygonType, and this is what is used to represent all other shapes, such as triangles, squares, or other complex shapes. The PolygonType data would start with the word polygon, and then contain the point coordinates for an outer bound, which represents the outline of the polygon, and an optional inner bound. For example, if you are trying to describe a polygon with a hole in it, the inner bound is used to define the points that make up the hole. To organize all of this, there is a set of parentheses that encloses all of the polygon coordinates. Within that, there is a second set of parentheses that holds the outer bound coordinates, followed by another set of parentheses that has the inner bound coordinates. If you don't have inner bounds, the coordinates would just be enclosed in two sets of parentheses.

All of the bounds should be closed, which means that the polygon is defined with the start and the end being the same point. Don't worry if you forget the last point though, because DSE can automatically add that for you, so long as the polygon contains a minimum of four points. Best practice would also indicate that the points for the outer bound should be written in a counter-clockwise direction, and the inner bounds in a clockwise direction, though DSE does not enforce this.
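If you would rather check these conventions in your application before writing the data, a rough Python sketch could look like this. It closes a ring when the last point is missing and uses the shoelace formula to tell counter-clockwise from clockwise. The helper names are hypothetical, and this is not what DSE does internally.

```python
def close_ring(ring):
    """Return a copy of the ring that ends on its starting point."""
    points = list(ring)
    if points[0] != points[-1]:
        points.append(points[0])
    return points


def signed_area(ring):
    """Shoelace formula: positive for counter-clockwise, negative for clockwise."""
    closed = close_ring(ring)
    total = 0.0
    for (x1, y1), (x2, y2) in zip(closed, closed[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_counter_clockwise(ring):
    """True when the ring's points run counter-clockwise."""
    return signed_area(ring) > 0


outer = [(0, 0), (10, 0), (10, 10), (0, 10)]  # last point left off on purpose
print(close_ring(outer))
print(is_counter_clockwise(outer))
```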

That's a lot of requirements to keep in mind when trying to visualize what a polygon looks like, so let's take a look at an example! Here is a polygon that has outer bounds defined that would make up a rectangle, as seen on the map here. Note that the coordinates are enclosed by just two sets of parentheses, so there is no inner bound defined for this polygon.

This example here shows a polygon with an inner bound. Inside the first set, there are two additional sets of parentheses, separated by a comma, with the first being the outer bound and second being the inner bound. This would look like the shape on the map here, with a rectangular hole somewhat in the middle.
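To make the nesting of the parentheses easier to see, here is a minimal Python sketch that assembles the WKT string for a polygon with an optional inner bound. The function names and the coordinates are invented for this illustration and are not the values from the slide.

```python
def ring_text(ring):
    """Format one bound as '(x y, x y, ...)', closing it if needed."""
    points = list(ring)
    if points[0] != points[-1]:
        points.append(points[0])
    return "(" + ", ".join(f"{x} {y}" for x, y in points) + ")"


def polygon_wkt(outer, holes=()):
    """Build a WKT polygon: the outer bound first, then any inner bounds."""
    rings = [outer, *holes]
    return "POLYGON(" + ", ".join(ring_text(ring) for ring in rings) + ")"


outer = [(0, 0), (10, 0), (10, 10), (0, 10)]  # counter-clockwise
hole = [(2, 2), (2, 4), (4, 4), (4, 2)]       # clockwise

print(polygon_wkt(outer))
print(polygon_wkt(outer, [hole]))
```

The first call prints a polygon with just two sets of parentheses, and the second adds the inner bound as an extra parenthesized list after a comma.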

Let's also take a quick look at an example of a CQL table with these geospatial datatypes and some insert operations that will be adding some geospatial data.

We have a new table here called geospatial, in the killrvideo keyspace, which contains a point column, a linestring column, and a polygon column, using their respective datatypes. Note the single quotes that enclose the name of each datatype, as they are needed to properly invoke those data types.

In the three insert statements at the bottom, we are adding a row with the geospatial data in the WKT format that we've seen earlier. Since they are treated as text, they'll also need to be enclosed in single-quotes, which is CQL's way of denoting a string.

Once there is geospatial data available in the database, then we can actually do some geospatial searching! Of course, there is some configuration that we need on the Search end before we can start, particularly with the search index schema. Unfortunately, even up to DataStax Enterprise 6, DSE is not able to automatically generate a schema for tables with geospatial data types, and so the mapping must be done manually. When creating a search index on a table with geospatial columns, make sure to use CREATE SEARCH INDEX with the lenient option set to true, which will allow DSE to avoid trying to automatically map the geospatial columns, as that would lead to an error.

Since we do need to do the mapping manually, you'll want to know what field types are available to use for the search index schema. Solr actually has several different field types to use for spatial search and geospatial search, some being older and even deprecated. The one that we're really interested in though is the SpatialRecursivePrefixTreeFieldType, or RPT, since that explicitly supports data formatted as Well-Known Text and is what we should use for the PointType and LineStringType. If you're already somewhat familiar with geo-spatial search in Solr, you might notice that other popular field types such as LatLonPointSpatialField are not on the list here. This is because these field types exist only in newer versions of Solr that did not make it into the current integration with DataStax Enterprise 6.

Now, since we know the field type to use, we can add that to the search index schema as shown in this example on the top. Note the class is SpatialRecursivePrefixTreeFieldType, and the field type will also need several other attributes defined: distErrPct, geo, and distanceUnits. distErrPct defines the precision of shapes other than points, and determines how much disk space the shape uses and how long it takes to index. More precise shapes (with a smaller value for distErrPct) will use more disk space and take longer to index. Larger values will make queries faster, but may not be as accurate. The geo attribute indicates whether the coordinate data is lat-long, when set to true, or not lat-long, when set to false. The distanceUnits attribute defines what unit to use for distance measurements, which can be degrees, kilometers, or miles.

With an RPT field type defined, the other two examples here show declaring a field in the search index schema for the point column and linestring column in our killrvideo.geospatial table.

After a search index reload and rebuild, you'll be ready to run geospatial searches with those two columns.

Now if you've really been paying attention, you might have noticed that we haven't said anything about the polygon column. That's because there's just a little bit of extra work that needs to be done to enable support for polygons. Enter the JTS Topology Suite library, a Java library used for creating and manipulating vector geometry, which we'll need to be able to index and search on polygon data. This is a jar file that will need to be downloaded and installed in the Solr library directory, which exists somewhere in your DSE installation. For package installs that would be in /usr/share/dse/solr/lib, and for a tarball installation, the relative path would be resources/solr/lib. Note that for DSE 6, version 1.13 of the JTS jar is required. Once it is installed, you should also restart DataStax Enterprise so that the jar can be included in the classpath. Finally, there is a spatialContextFactory parameter for the RPT field type that we are using in the search index schema, which should be added and set to the value org.locationtech.spatial4j.context.jts.JtsSpatialContextFactory.

It may sound like a bunch of extra work in order to use polygons, but the resulting field type is also fully compatible with our PointType and LineStringType, so we really only need to do this once.

Here is an example of the ALTER SEARCH INDEX SCHEMA command that will add the RPT field type, this time including the spatialContextFactory parameter. Afterwards, the same field type can be used for any of the three geospatial datatypes, as seen in these field declarations below.

Finally, we can get to the queries! Typically you'll find that geospatial queries are best used as a filter query; in most cases geo-spatial is used to support some main search by filtering out results that are not in a specific location or bounded by a distance.

There are three predicates, or functions, that you can use for comparing two shapes: Intersects, IsWithin, and Contains. Intersects can calculate if any part of a shape intersects, or touches, another shape. IsWithin determines if a shape is completely within another shape. Contains is similar, but in reverse: it determines if the second shape is completely within the first shape.
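To make the difference between the three predicates concrete, here is a rough Python sketch that applies them to simple axis-aligned rectangles. It only illustrates the semantics just described; the Rect type and the function names are made up for this illustration, and this is not how DSE Search or Solr evaluates the predicates internally.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle described by its min and max corners."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersects(a, b):
    """True when any part of a overlaps or touches b."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


def is_within(a, b):
    """True when a lies completely inside b."""
    return (b.min_x <= a.min_x and a.max_x <= b.max_x
            and b.min_y <= a.min_y and a.max_y <= b.max_y)


def contains(a, b):
    """The reverse of is_within: True when b lies completely inside a."""
    return is_within(b, a)


big = Rect(0, 0, 10, 10)
small = Rect(2, 2, 4, 4)
edge = Rect(8, 8, 12, 12)

print(intersects(big, edge), is_within(edge, big))   # overlaps, but not within
print(is_within(small, big), contains(big, small))   # same relationship, reversed
```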

Aside from your saved geospatial data, you might also need to arbitrarily define shapes to use for queries. This can be done with WKT, and can be used to define rectangles, circles, and other complex shapes. Note that the radius used in some of these shapes is specifically in lat-long degrees, and cannot be changed to use other distance units.

Here are some basic examples using the predicates that we've described. The first example uses IsWithin to find all rows or documents whose value in the point column is within a line that we describe as running from the point 10,30 to 20,40.

The second example here shows a query that searches for anything, but filters to include only the rows or documents whose linestring value intersects a certain point, with the coordinates 10 30.

The last example is another query that filters rows on the polygon column, so that only those that contain or overlap the point 10 30 are included in the results.
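For reference, the three filters just described can be put together as plain strings. The Python sketch below assumes the usual Solr form of a spatial filter, which is the field name followed by the quoted predicate and WKT shape, and wraps it in the JSON form that DSE accepts for solr_query. The helper names are hypothetical, and the column names come from our killrvideo.geospatial table.

```python
import json

SPATIAL_PREDICATES = ("Intersects", "IsWithin", "Contains")


def spatial_filter(field, predicate, wkt):
    """Build a filter such as point:"IsWithin(LINESTRING(10 30, 20 40))"."""
    if predicate not in SPATIAL_PREDICATES:
        raise ValueError(f"unknown predicate: {predicate}")
    return f'{field}:"{predicate}({wkt})"'


def solr_query_json(query, filter_query):
    """Wrap a main query and one filter query as a JSON string."""
    return json.dumps({"q": query, "fq": filter_query})


examples = [
    spatial_filter("point", "IsWithin", "LINESTRING(10 30, 20 40)"),
    spatial_filter("linestring", "Intersects", "POINT(10 30)"),
    spatial_filter("polygon", "Contains", "POINT(10 30)"),
]
for fq in examples:
    print(solr_query_json("*:*", fq))
```

Letting json.dumps handle the escaping of the inner double quotes avoids hand-writing backslashes inside the query string.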

You might realize that these are not lat-long coordinates that we are using, but again, this is just a basic example.

Now let's take a look at some applications of geo-spatial queries with specific functions we might want to be able to perform in an application. The first here is finding all of the objects within a polygon. This is a commonly used one which can filter out all of the objects that are not within the polygon, which could represent neighborhoods, cities, states, and so forth. In other applications, the polygon might represent cellular coverage. One other application would be to have a polygon with min and max x and y coordinates, which can be used to show all points of interest on the visible section of a map (described by the min and max x and y). If the map zooms in or out, the min and max coordinates change, and the search will run again to repopulate the points of interest on the map.
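As a sketch of the map use case, this is roughly how an application could turn the visible section of a map into a WKT polygon each time the view changes. The function name and the sample coordinates (a box loosely around New York, chosen only for illustration) are assumptions and not part of DSE.

```python
def viewport_polygon_wkt(min_x, min_y, max_x, max_y):
    """Build a closed, counter-clockwise rectangle for the visible map area."""
    if min_x > max_x or min_y > max_y:
        raise ValueError("min values must not exceed max values")
    corners = [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),  # repeat the first corner to close the bound
    ]
    return "POLYGON((" + ", ".join(f"{x} {y}" for x, y in corners) + "))"


# x is longitude and y is latitude when working with lat-long data.
print(viewport_polygon_wkt(-74.3, 40.5, -73.7, 40.9))
```

When the map zooms or pans, the application would call the function again with the new min and max values and rerun the search with the new polygon.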

In our query example here, let's say that an application has the ability to search for movie theatres, and can find movie theatres within a certain area, let's say in a city like New York. The query for that would be to do a search on theatres, but then include a filter query that checks if the lat-long coordinate for the theatres is within a polygon, which would be the geospatial boundary for the city of New York.

Another common search is to find all objects within a certain distance of a particular point, which might be your current location, for example. This would be used for those types of functions that find coffee shops or gas stations near you. This is done by using the point location to create a circle, where the radius is some maximum distance you want to search. A search is then done to find all objects within the circle.

In the example here, we are now looking for movie theaters within 10 kilometers of our current location. A point representing the current location is used, along with the distance value, to create a circle. The filter query then uses IsWithin to search for all objects within the circle. Note that the value for the radius is in lat-long degrees, and can be fairly complicated to convert to and from miles and kilometers. We won't get into it here, but do a search on great-circle distance if you're interested in the technical challenges.
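For a rough idea of what that conversion involves, here is a Python sketch that treats the Earth as a perfect sphere. The mean radius constant and the function names are assumptions made for this illustration; real conversions may need more care, and this is not the calculation DSE Search performs.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a spherical approximation


def km_to_degrees(km):
    """Convert a distance along a great circle into degrees of arc."""
    return math.degrees(km / EARTH_RADIUS_KM)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat-long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A 10 km radius is a little under a tenth of a degree.
print(round(km_to_degrees(10), 4))
# One degree of longitude along the equator, in kilometers.
print(round(haversine_km(0, 0, 0, 1), 2))
```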

Another way to find all objects within a certain distance is by using the Solr geofilt function. This can be much easier to do as it doesn't require much calculation. It uses three parameters: sfield, which is the name of the field that contains the geospatial data you're comparing; pt, which represents the center of the area you are searching; and d, which is the distance, or how far away from the center you are searching. With the pt parameter, the point coordinates can be either in lat-long order, making sure you have a comma in between, or in long-lat order, using a space. The unit for the d distance is based on the distanceUnits parameter set for the field type.
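Here is a small Python sketch of how those three parameters fit together, assuming the standard Solr local-params form of the geofilt filter. The helper name, the field name point, and the sample coordinates are hypothetical, and the pt value is written in lat-long order with a comma.

```python
def geofilt(sfield, latitude, longitude, distance):
    """Build a geofilt filter string with pt given as 'lat,long'."""
    return f"{{!geofilt sfield={sfield} pt={latitude},{longitude} d={distance}}}"


# The unit of d follows the distanceUnits set on the field type,
# so 10 means 10 kilometers only if distanceUnits is kilometers.
print(geofilt("point", 40.7, -74.0, 10))
```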

As an aside, you might want to keep in mind that there is also a bbox function, which uses a rectangle instead of a circle for the bounds it will search in. This means that objects may be included that are technically outside of the set distance. It is used, though, because it is computationally faster, which means that results can be returned sooner.
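To see why a bounding box can let in objects that are beyond the set distance, here is a simplified Python sketch on a flat x,y plane. It ignores the curvature of the Earth and is only meant to show the geometry; the function names are invented and this is not how Solr implements geofilt or bbox.

```python
import math


def in_circle(px, py, cx, cy, d):
    """True when the point is no further than d from the center."""
    return math.hypot(px - cx, py - cy) <= d


def in_bbox(px, py, cx, cy, d):
    """True when the point falls inside the square that encloses the circle."""
    return abs(px - cx) <= d and abs(py - cy) <= d


# A point near a corner of the box is inside the box but outside the circle.
print(in_bbox(9, 9, 0, 0, 10), in_circle(9, 9, 0, 0, 10))
# A point close to the center passes both checks.
print(in_bbox(3, 4, 0, 0, 10), in_circle(3, 4, 0, 0, 10))
```

The box check only needs two comparisons per axis, which hints at why it is the cheaper of the two.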
