DotSpatial Data Structures

May 26, 2010 at 11:32 PM

The System.Spatial.Data structures cannot simply reuse our MapWindow FeatureSet design, because that model has many interdependencies between different areas, such as topology.  The new System.Spatial.Data structures should be as stand-alone as possible, with no entanglement with the analytical functions of topology.  I propose that our .NET data provider classes plan on reasonable compatibility with the .NET bindings for GDAL and OGR.  However, following a model pioneered in MapWindow, I recommend that they allow for extensibility and have no fixed dependence on unmanaged library sets.  Instead, a pair of libraries would work in tandem.  First, a System.Spatial.Data library provides the data provider interface, a data manager class that works with external libraries implementing that interface, and the in-memory structures for features, rasters, and images.  Then, using the dynamically loaded run-time dependency model, an infinitely extensible collection of optional plug-ins provides the specific instances that extend the data support.  One example might depend on GDAL and OGR.  Ideally, FDO could serve as an alternative to OGR, or perhaps even work in tandem with it.

I recommend against the OGC object model as used by NTS for this functionality, because in .NET the heavy use of many small classes requires extra memory and is very slow for building graphics paths and drawing.  I have also heard that libraries like GEOS are much faster than NTS, which leads me to suspect we should think primarily in terms of raw data access for display engines, which can then be adapted to topology or analysis mechanisms on demand.  It has also proven useful to be able to load the geometric features without loading all the attributes.  For shapefiles, at any rate, this can drastically improve memory use and loading performance.
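The two-library split described above might look roughly like the following. This is only a language-neutral sketch (written in Python for brevity, though the real library would be .NET); all class and method names here are hypothetical, not part of any existing API.

```python
# Hypothetical sketch: a core library defines the provider interface and a
# data manager that knows nothing about concrete formats; dynamically loaded
# plug-ins (e.g. a GDAL/OGR wrapper) register their own implementations.
from abc import ABC, abstractmethod


class FeatureProvider(ABC):
    """Interface a data plug-in implements for one or more file formats."""

    @abstractmethod
    def extensions(self):
        """File extensions this provider can open, e.g. ['.shp']."""

    @abstractmethod
    def open(self, path):
        """Return an object exposing feature access for the given file."""


class DataManager:
    """Core-library dispatcher; has no fixed dependence on any format."""

    def __init__(self):
        self._providers = []

    def register(self, provider):
        # Called by dynamically loaded plug-ins at start-up.
        self._providers.append(provider)

    def open(self, path):
        ext = path[path.rfind('.'):].lower()
        for p in self._providers:
            if ext in p.extensions():
                return p.open(path)
        raise ValueError("no provider registered for " + ext)


class ShapefileProvider(FeatureProvider):
    """Stand-in for one plug-in; a real one would read the file."""

    def extensions(self):
        return ['.shp']

    def open(self, path):
        return "shapefile handle for " + path  # placeholder


manager = DataManager()
manager.register(ShapefileProvider())
handle = manager.open("roads.shp")
```

The point of the sketch is that the manager only ever sees the interface, so support for a new format is purely additive: drop in a plug-in, register it, and no core code changes.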

Things that MapWindow does not provide, but that I believe should be included in System.Spatial.Data from the start, are the ability to query by extent or by shape index directly from the file, rather than attempting to load everything into memory.  Drawing from memory is faster but, especially in the case of images, it is not generally possible to hold all the content in memory.  So even if there is a performance hit, we would prefer our layers to be able to work directly with the providers.
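The query-by-extent and query-by-index surface argued for here could be sketched as below. The "file" is faked with an in-memory list and every name is hypothetical; the point is only the shape of the API, where the layer asks the provider for just the shapes it needs instead of requiring everything to be resident.

```python
def intersects(a, b):
    """Axis-aligned extent test; extents are (xmin, ymin, xmax, ymax)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])


class FileBackedProvider:
    """Hypothetical provider that reads shapes on demand."""

    def __init__(self, shapes):
        # Stand-in for a file on disk plus its spatial/offset index;
        # each entry is (bounds, coordinates).
        self._shapes = shapes

    def get_shape(self, index):
        # A real provider would seek to the record offset and read one
        # shape, leaving attributes and all other shapes on disk.
        return self._shapes[index]

    def query(self, extent):
        # Lazily yield only shapes whose bounds intersect the query
        # window; nothing outside the window is materialized.
        for i, (bounds, coords) in enumerate(self._shapes):
            if intersects(bounds, extent):
                yield i, coords


provider = FileBackedProvider([
    ((0, 0, 1, 1), [(0.5, 0.5)]),
    ((5, 5, 6, 6), [(5.5, 5.5)]),
])
hits = list(provider.query((0, 0, 2, 2)))  # only the first shape matches
```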

I was considering using the term "Feature," but I think it has been so overused that at this point it carries very specific connotations in other areas.  It still might be the best fit, however.

  • FeatureProvider - A class that can handle queries by shape index or spatial extent and returns in-memory features
  • FeatureSet - An in-memory result (like a DataSet)
  • Feature - shape vector content plus attributes
    • One option is to store the WKB content as a blob in the attributes.  The disadvantage is that shapefiles must convert their vector content into WKB, but this works better for SQL Spatial and the like.
    • Another is to use a class that pairs a DataRow with the geometric content as a tuple.
    • Finally, we could access the vector content completely independently of the attributes and use the shape index for coordination.
  • Vertices
    • Could be an interleaved array of XY values with separate, optional arrays of Z and M values (following the shapefile format)
    • Indexing is messy with the above structure.  Perhaps it is worth it to have a tall array of X values, a tall array of Y values, and a tall array of Z values.
  • Vertex Grouping
    • Tall vertex arrays that are contiguous for the entire featureset.  Fast for reprojecting point shapefiles, where you simply pass the entire array (or pair of arrays) in one call.
    • Tall arrays, but with separate arrays stored for each "shape" in the shapefile.
      • + Easier for rearranging in-memory shapes.
      • - Slow for reprojecting points, which now requires a separate reproject call for every vertex.
      • - Double the memory required for point shapefiles, which have so many shapes.
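To make the vertex-layout trade-off concrete, here is a side-by-side sketch of the two options, using Python's `array` module as a stand-in for flat .NET double[] arrays (the helper names are illustrative only):

```python
# Three 2-D points: (1,2), (3,4), (5,6), stored two ways.
from array import array

# Option 1: interleaved XY, shapefile-style.
# Vertex i lives at positions 2*i and 2*i + 1.
xy = array('d', [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

def vertex_interleaved(i):
    return xy[2 * i], xy[2 * i + 1]

# Option 2: separate tall arrays.  Vertex i is simply (xs[i], ys[i]),
# and Z / M become optional parallel arrays rather than a wider stride.
xs = array('d', [1.0, 3.0, 5.0])
ys = array('d', [2.0, 4.0, 6.0])

def vertex_separate(i):
    return xs[i], ys[i]

# Both layouts recover the same point, but the separate arrays avoid the
# stride arithmetic, which is where interleaving gets error prone once
# optional Z and M values are mixed in.
assert vertex_interleaved(1) == vertex_separate(1) == (3.0, 4.0)
```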

I'd like people's feedback, especially on the vector data structures and the best way to proceed here.  If you were doing analysis, I think you would want to be able to cycle through all the vertices with as little effort as possible.  I suppose that for reprojecting point shapefiles, a single array could be built, the reprojection done, and the vertices then updated.  The memory hit would still be there, but I feel that is less of a problem if we are thinking from the start in terms of file-based systems.  So my current vote would be tall arrays that are separate for each shape, with X in one array, Y in another, and Z and M in separate arrays as well.  This makes indexing much less confusing and error prone than interleaving the vertices or combining all the shapes into a single array.  It will likely be a little slower for accessing shapefiles, however, since we have to separate out the X and Y values.  I also don't like having a tall array of short arrays where each coordinate is stored as its own array: it guarantees a lot of extra memory and doesn't offer much benefit.  Nor am I a fan of lists, because access time for lists is slower than for arrays, and for drawing, speed is critical.  Anyway, let me know what you guys think.
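The gather-reproject-scatter workaround mentioned for per-shape arrays could look like this sketch. The projection here is a dummy offset standing in for a real transform call, and all names are hypothetical:

```python
from array import array

def reproject_in_place(values, offset):
    # Stand-in for one bulk call into a projection library.
    for i in range(len(values)):
        values[i] += offset

# Per-shape X arrays (e.g. three shapes of a point shapefile).
shape_xs = [array('d', [1.0]), array('d', [2.0]), array('d', [3.0])]

# Gather into a single tall array so the projection library is
# called once, not once per shape.
flat = array('d')
for s in shape_xs:
    flat.extend(s)

reproject_in_place(flat, 100.0)

# Scatter the transformed values back to the per-shape arrays.
pos = 0
for s in shape_xs:
    for j in range(len(s)):
        s[j] = flat[pos]
        pos += 1
```

This keeps the per-shape layout for editing while still paying only one call's overhead for the bulk transform; the cost is the temporary tall array, which, as noted above, matters less if the design assumes file-based access anyway.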

May 31, 2010 at 11:16 AM

Hi Shade1974, are you ever on IRC? We should try to arrange an informal chat sometime. Cheers, jd

Sep 1, 2010 at 8:53 AM
Edited Sep 1, 2010 at 8:55 AM

I hope DotSpatial.Data will soon support FDO, or be refactored to support SQL Spatial, Oracle, and PostGIS.

Sep 1, 2010 at 3:14 PM

In the comparisons I've done recently with other components, current DotSpatial memory use is very high. Loading and displaying a shapefile of 200K points gave roughly the following additional memory use and fitted-display times:

  • DotSpatial: 80MB, 1.7 sec
  • Map Suite: 40MB, 5.4 sec
  • Tatuk: 15MB, 1.4 sec

We're trying to replace Map Objects, which required 13MB with a fitted display of just 0.4 sec.

The memory usage numbers above are rough, as I just used Task Manager to measure.