Core Data Import Performance

When creating the PointsStalker data importer, I started by following a few different importing tutorials and eventually got it all figured out. As the datasets I was importing grew from test batches of a few thousand records to 18,000+, I realized I had a scaling problem. The rough numbers said that the import speed was dropping by a factor of 40 to 60 over the course of a run, so a complete import was taking 8 to 10 minutes for 18,000 athletes and their points histories.

After a few months of tolerating the problem, I decided to dig into it. Since the initial speed was reasonably good, I ruled out the import logic as a major culprit. Because the performance degraded over time, I suspected the problem was related to how data was being loaded into memory. In response I did some unscientific parametric testing of the data buffer size and the Core Data save interval, but the various values had almost no effect on the overall processing time, so I started to look for other potential problems.

The next point of interest was an NSMutableDictionary that is used to look up the managed object ID for each athlete as new points list data is imported. I thought lookups might be taking a disproportionate amount of time because there are so many athletes to search through, but the points importing was in fact faster than the athlete importing, despite the points import logic being more complicated. That suggested the lookups themselves were not the bottleneck.

That conclusion was backed up by an article indicating that NSMutableDictionary lookups should usually be O(1) (constant time). The article also mentioned hashing and creating dictionaries with a pre-defined capacity, but concluded that +dictionaryWithCapacity: was ineffective. This got me thinking about the cost of adding objects to the NSMutableDictionary, and with a little searching I turned up a discussion on Cocoabuilder explaining that once a dictionary fills to 67% of its current capacity it is re-hashed, which can hurt performance when many objects are added. To avoid this, the discussion suggests going through the underlying Core Foundation type and creating a pre-sized CFMutableDictionary, which can be used as an NSMutableDictionary, using:

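// Manual reference counting shown; under ARC, wrap the call in CFBridgingRelease(...) instead of the plain cast.
NSMutableDictionary *dict = (NSMutableDictionary *)CFDictionaryCreateMutable(NULL, 1024, &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);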

This change alone cut the import time to 1.5 minutes. To me, that's a huge decrease! Before the CFDictionary change I was wary of making changes that would require regenerating the database, but having MUCH shorter imports has removed that concern.
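
For anyone who wants to check whether pre-sizing matters on their own system, here is a rough timing sketch. It is not the PointsStalker importer: the TimeInserts helper, the "athlete-N" strings, and the choice of NSNumber keys are all made up for illustration, and it assumes ARC. The CFDictionary.h header describes the capacity argument as a hint the implementation may ignore, so the gap between the two dictionaries may be much smaller (or absent) on newer OS versions than what I saw.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the importer): inserts `count` key/value
// pairs into `dict` and returns the elapsed wall-clock time in seconds.
static NSTimeInterval TimeInserts(NSMutableDictionary *dict, NSUInteger count) {
    NSDate *start = [NSDate date];
    for (NSUInteger i = 0; i < count; i++) {
        dict[@(i)] = [NSString stringWithFormat:@"athlete-%lu", (unsigned long)i];
    }
    // timeIntervalSinceNow is negative for a date in the past, so negate it.
    return -[start timeIntervalSinceNow];
}

int main(void) {
    @autoreleasepool {
        // Roughly the number of athletes mentioned above.
        const NSUInteger count = 18000;

        // Default dictionary: starts small and grows as entries are added.
        NSMutableDictionary *plain = [NSMutableDictionary dictionary];

        // Dictionary created through Core Foundation with a capacity hint.
        // CFBridgingRelease transfers ownership of the CF object to ARC.
        CFMutableDictionaryRef cf = CFDictionaryCreateMutable(
            kCFAllocatorDefault, (CFIndex)count,
            &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);
        NSMutableDictionary *presized = CFBridgingRelease(cf);

        NSLog(@"default:   %.4f s", TimeInserts(plain, count));
        NSLog(@"pre-sized: %.4f s", TimeInserts(presized, count));
    }
    return 0;
}
```

Saved as bench.m, this should build on a Mac with something like: clang -fobjc-arc -framework Foundation bench.m -o bench. A single run like this is noisy, so treat the numbers as a sanity check rather than a benchmark.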

(Now that I've eliminated a major limiting factor, maybe I should go back and test those data buffer and save interval parameters again.)