DBF EOF

Developer
Dec 3, 2010 at 2:11 PM

The DotSpatial.Data.AttributeTable.Open(string filename) method tries to detect deleted records by making the following check:


if (fi.Length == (_headerLength + 1) + _numRecords * _recordLength)
{
    _hasDeletedRecords = false;
    // No deleted rows detected
    return;
}

If that test fails, the Open method goes on to inspect the entire DBF looking for deleted records, which can be time-consuming. This logic assumes there is an EOF marker (1Ah) at the end of the file (that's where the +1 comes in). Every file I've seen generated by ESRI software does indeed include the EOF. Some of my company's code does not, so I thought fair enough... I'll change my company's code to ensure there is always an EOF, which will prevent the Open method from needlessly looking for non-existent deleted rows. But then I tried adding a feature using the FeatureSource code, and it did not append the EOF either.

So, what should I do?

A. Change the logic in Open to not be so picky about the file length?

B. Make sure all code that writes to DBF tacks on the EOF?

I vote for A.
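
For what it's worth, option B would be a trivial fix anyway. Something like this sketch (dbfPath is just a placeholder, and it assumes a non-empty file with System.IO in scope) appends the marker only when it is missing:

using (var stream = new FileStream(dbfPath, FileMode.Open, FileAccess.ReadWrite))
{
    // Check the last byte; append the dBASE EOF marker (1Ah) only if absent.
    stream.Seek(-1, SeekOrigin.End);
    if (stream.ReadByte() != 0x1A)
    {
        stream.WriteByte(0x1A);
    }
}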

This actually brings up a larger question. I'm mulling over the idea of somehow providing the information about deleted rows so that Open does not have to look for them every time, in cases where the application already knows what has been deleted. Not sure how to pull that off yet.

Kyle

Developer
Dec 3, 2010 at 3:45 PM
Yes, Kyle, all very good ideas. Adding in at least a tolerance for the file length with or without the EOF byte (_headerLength versus _headerLength + 1) would be good, especially if we could write it like:

long length = _headerLength + _numRecords * _recordLength;
if (HasEOF(length)) length += 1;
if (fi.Length != length)
{
    FindDeletedRecords();
}

HasEOF is a made-up function that tests whether the last byte of the file is the EOF marker you mentioned. Also, it would not be a bad idea to cache the deleted rows, perhaps encoded somewhere. The problem is that we can't trust the cache: if the DBF is edited outside of our software, the cache will not be correctly updated. However, we could use the length discrepancy to validate the cache. In other words, after doing our EOF test, we know there are, say, 15 extra rows. If our cache has a list of 15 deleted rows, that looks good, but to verify we need to use the cached index values to jump to those 15 rows and test them. If all 15 rows are marked with the deleted-record flag (*), then we can move on using the current cache. If any of them are not deleted, or the total count in the cache doesn't match the expected 15 rows, then the slow process needs to be used to look for deleted rows and rebuild the cache.
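
Here is a rough sketch of both pieces; everything in it is hypothetical, and the field names just mirror the ones in AttributeTable:

using System.Collections.Generic;
using System.IO;

// Hypothetical helper -- a sketch, not the actual AttributeTable code.
class DeletedRowChecker
{
    private readonly int _headerLength;
    private readonly int _recordLength;

    public DeletedRowChecker(int headerLength, int recordLength)
    {
        _headerLength = headerLength;
        _recordLength = recordLength;
    }

    // True if the byte just past the record data is the 1Ah EOF marker.
    public bool HasEOF(FileStream stream, long expectedDataLength)
    {
        if (stream.Length <= expectedDataLength) return false;
        stream.Seek(expectedDataLength, SeekOrigin.Begin);
        return stream.ReadByte() == 0x1A;
    }

    // True if the cache agrees with the file: the counts match and every
    // cached index points at a row whose first byte is '*' (deleted flag).
    public bool ValidateCache(FileStream stream, IList<int> cachedDeletedRows, int expectedDeletedCount)
    {
        if (cachedDeletedRows.Count != expectedDeletedCount) return false;
        foreach (int row in cachedDeletedRows)
        {
            stream.Seek(_headerLength + (long)row * _recordLength, SeekOrigin.Begin);
            if (stream.ReadByte() != '*') return false;
        }
        return true;
    }
}

If ValidateCache comes back false, we fall through to the slow scan and rebuild the cache.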

Anyway, those are my thoughts on it for now.

Ted


Developer
Dec 3, 2010 at 8:35 PM

I understand. Those are all very good suggestions. So, the deleted rows cache would be in memory, right? I think that is OK, as it is just a list of integers. If size got to be an issue (after deleting tons of features), we could dump the cache once it exceeds a certain threshold.

I'll think about it some more and maybe implement it next week if I can do it in a low-risk fashion.

Kyle


Developer
Dec 3, 2010 at 9:12 PM
Edited Dec 6, 2010 at 6:44 PM

There already is an in-memory cache; I was thinking maybe an XML shapefile add-on that could save the things we want to cache.

Ted

Developer
Dec 3, 2010 at 9:27 PM

Duh… right… needs to be a file.


Developer
Dec 6, 2010 at 6:36 PM

XML would be flexible, allowing us to cache additional information later if needed, but as usual I'm concerned about performance. The whole reason for the caching is speed, and deserializing XML sounds slow to me.


Developer
Dec 6, 2010 at 6:51 PM
Edited Dec 6, 2010 at 6:51 PM

XML allows us to store a blob (binary large object) as a value. So my recommendation is an XML setup like:

<DeletedRows>
    <Count>34</Count>
    <ByteSize>4</ByteSize>
    <IndexValues>xxxxxxxxxxx</IndexValues>
</DeletedRows>

This makes it just readable enough to know what is being tracked, allows XML to be used for the addition of different things later, and provides the blob aspect for speed. Buffer.BlockCopy can be used to convert the blob from an array of bytes into an array of 32-bit integers in the case above (a ByteSize of 4, i.e. four bytes per index). Having the ByteSize element would also allow for switching to long or something else later. It also gives us a validation check: if Count * ByteSize doesn't match the decoded blob's length in bytes, you may consider the cache corrupted and discard it.
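
Something like this sketch would do the decode and the validation in one step (the method and parameter names are made up for illustration):

using System;

// Hypothetical sketch: decode the IndexValues blob back into 32-bit
// row indices. 'blob' would come from decoding the element text
// (e.g. base64); 'count' and 'byteSize' come from Count and ByteSize.
static int[] DecodeDeletedRowCache(byte[] blob, int count, int byteSize)
{
    // Size mismatch means the cache is corrupted; returning null
    // tells the caller to discard it and fall back to the full scan.
    if (count * byteSize != blob.Length) return null;

    int[] rows = new int[count];
    Buffer.BlockCopy(blob, 0, rows, 0, blob.Length); // raw bytes -> ints
    return rows;
}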

Ted