iCloud Complications, Part 2: De-duplication

Today I’m continuing my series of iCloud/Core Data-related posts with a discussion of duplicate data, how and why it occurs, and what you can do about it in your apps. As with my previous post in this series, today I’m sticking to how things are supposed to work and sometimes actually do.

The problem of duplicate data isn’t actually specific to iCloud. It can happen in any case where people might create data on different devices and where your app wants to sync that data between the devices. Some sync solutions may try to solve the problem for you, but iCloud does not.

Why are Duplicates Possible?

In short: Core Data doesn’t care if you create multiple objects with identical data fields. You can, if you like, create any number of instances where every attribute and relation is exactly the same.

That’s not completely true– there’s always a unique managed object ID, which keeps each instance distinct. But object IDs do not sync, so their uniqueness is not relevant to managing duplicates when syncing. Syncing via iCloud (and other services) sends managed object attributes and relationships only– basically, anything you configure in the model editor. The object ID can’t sync because it’s automatically created when an object is inserted in a persistent store. And as I discussed last time (/icloud-complications), when you sync an instance to a new device, you insert it into a new persistent store. As a result, the one guaranteed-unique detail on a managed object is no longer available.

How Can Syncing Cause Duplicates?

The simple, relatively benign case of duplicate data goes something like this:

  1. Someone using your app creates a new record on a device that’s not connected to the internet at the time. Like, they’re on a plane on one of the airlines I usually end up on that don’t have in-flight Wi-Fi.

  2. Later on, while that first device is still offline, they enter the same record again on a different device. Their office Mac, maybe.

  3. When their mobile device comes back online, your syncing system notices both of these new instances and duly syncs them to other devices. Since Core Data doesn’t care that they’re duplicates, they both appear on all of the user’s devices.

In this case you might argue that you should just ignore it, because the potential number of duplicates is low and because the user probably won’t be surprised. It’s not a great solution, but worse things happen in apps all the time. You could do better, though.

There’s a different scenario that can be much, much worse. What happens if someday you decide to migrate from one syncing system to a different one? Maybe you’re using a different sync service now but one day decide iCloud is reliable enough. Or maybe, you have an app using iCloud but its current issues get to be too much to deal with and you go elsewhere. What then? Something like this may well play out:

  1. The user upgrades your app on their iPhone, and all of their existing local data gets migrated to the new service.

  2. The user upgrades your app on their iPad, and all of their existing local data gets migrated to the new service. Except that it’s already there, only Core Data doesn’t care that it’s creating duplicates and now you have two copies of every single instance.

  3. The user upgrades your app on their Mac, and now there are three copies of everything.

…repeat as many times as the user has devices that use your app…

Yeah, this happened to me. Unlike the previous scenario, there’s no even half-reasonable way to argue that this is OK.

What Can I Do About It?

The classic approach to avoiding duplicates in a persistent store is to not create them in the first place. Before creating a new instance, check whether an equivalent one already exists; if it does, update the existing copy when that makes sense, but don’t create a new one. Doing this efficiently is pretty well covered in Apple’s Efficiently Importing Data document, in the Implementing Find-or-Create Efficiently section. (Update: Sadly Apple has taken down this document and I don’t have a copy I can link from here. The nearest thing available in 2022 seems to be Loading and Displaying a Large Data Feed).
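
To make the idea concrete, here’s a minimal sketch of the basic find-or-create check (not the batched version from Apple’s document). It assumes a Person entity with a string emailAddress attribute, the same names used by the Apple sample discussed later in this post:

// A minimal find-or-create sketch, not the batched version from Apple's
// document. Assumes a Person entity with a string emailAddress attribute.
- (NSManagedObject *)personWithEmailAddress:(NSString *)emailAddress
                                  inContext:(NSManagedObjectContext *)moc
{
    NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Person"];
    [request setPredicate:[NSPredicate predicateWithFormat:@"emailAddress == %@", emailAddress]];
    [request setFetchLimit:1];

    NSError *error = nil;
    NSArray *matches = [moc executeFetchRequest:request error:&error];
    if ([matches count] > 0) {
        // An equivalent instance already exists; update it instead of creating a duplicate.
        return [matches objectAtIndex:0];
    }

    NSManagedObject *person = [NSEntityDescription insertNewObjectForEntityForName:@"Person"
                                                             inManagedObjectContext:moc];
    [person setValue:emailAddress forKey:@"emailAddress"];
    return person;
}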

It won’t help you with iCloud, though. When iCloud has new incoming data, it imports the new changes to your data store first and tells you about them afterward. You find out about this when NSPersistentStoreDidImportUbiquitousContentChangesNotification gets posted. There is no corresponding will import notification or anything like a should import delegate call that might allow you to veto changes. And since Core Data doesn’t care if you create a duplicate, duplicates are created and you’re left to clean up the mess. In short: find the duplicates yourself.
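
For reference, here’s roughly what watching for that notification looks like. The deDuplicateAfterImport: method is a placeholder for whatever duplicate detection you end up writing, and _psc and _mainContext are assumed to be your persistent store coordinator and main context:

// Somewhere during Core Data setup: watch for iCloud imports on the
// persistent store coordinator (_psc here). deDuplicateAfterImport: is a
// placeholder for whatever duplicate detection you end up writing.
- (void)startWatchingForImports
{
    [[NSNotificationCenter defaultCenter]
        addObserver:self
           selector:@selector(storeDidImportChanges:)
               name:NSPersistentStoreDidImportUbiquitousContentChangesNotification
             object:_psc];
}

- (void)storeDidImportChanges:(NSNotification *)notification
{
    // The import already happened; merge the changes into the main context,
    // then go looking for any duplicates the import may have created.
    [_mainContext performBlock:^{
        [_mainContext mergeChangesFromContextDidSaveNotification:notification];
        [self deDuplicateAfterImport:notification];
    }];
}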

The hardest part may be deciding what actually constitutes a duplicate entry. It’s usually the case that many or most attributes in an entity description could reasonably have the same value in different instances. The easiest case is when your entities have some field, maybe hidden from the user, that contains a UUID or some other guaranteed-unique value. That field ends up serving as a unique key for quickly finding duplicates, and it’s the only scenario I know of with Core Data where it’s reasonable to create your own unique ID independent of the object ID. But: it only helps with the duplicate storm that comes after changing sync mechanisms. It’s useless for a duplicate record the user created on different devices.
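
If you go the UUID route, one way to fill in such a field is in awakeFromInsert on your NSManagedObject subclass, so every new instance gets a value automatically. This sketch assumes a string attribute named recordUUID, which happens to be the name the Apple sample below uses:

// In an NSManagedObject subclass: give every new instance a unique value
// at insert time. Assumes the model has a string attribute named recordUUID.
- (void)awakeFromInsert
{
    [super awakeFromInsert];
    // Use the primitive accessor so this doesn't register as a user change.
    [self setPrimitiveValue:[[NSUUID UUID] UUIDString] forKey:@"recordUUID"];
}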

The next-hardest part is figuring out what to do about the duplicates. You probably want to delete all but one copy, but it’s not as simple as that:

  • What if the duplicates are not exactly the same? Unless objects must be absolutely identical to be considered dupes, it’s possible that some attributes differ. Just deleting objects risks losing unique data. You may need to try to merge conflicting data to avoid loss, which is a tricky issue on its own (there’s a rough sketch of one merge approach after this list).

  • What if you delete different objects on different devices? Your syncing system will propagate deletes. If you delete different instances on different devices, you risk having those delete propagations remove all instances, not all but one. Whatever scheme you use to delete objects needs to delete the same objects in every case, on every device.
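
As promised above, here’s a rough sketch of one possible merge step: before deleting a losing duplicate, copy over any attribute value it has that the winner is missing. This is my own illustration, not part of Apple’s sample, and it deliberately punts on conflicting non-nil values:

// A hypothetical merge step, not part of Apple's sample: before deleting the
// losing duplicate, copy over any attribute value it has that the winner is
// missing. Conflicting non-nil values are left alone; resolving those takes
// app-specific rules.
- (void)mergeAttributesFromObject:(NSManagedObject *)loser intoObject:(NSManagedObject *)winner
{
    NSDictionary *attributesByName = [[winner entity] attributesByName];
    for (NSString *key in attributesByName) {
        if ([winner valueForKey:key] == nil && [loser valueForKey:key] != nil) {
            [winner setValue:[loser valueForKey:key] forKey:key];
        }
    }
}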

At WWDC 2012’s Using iCloud with Core Data session, Apple presented SharedCoreData, sample code for detecting duplicates. It’s a decent example if you can detect dupes based on a single attribute of an entity. As far as I know it’s only available from Apple in the WWDC 2012 Sample Code bundle. I’m going to go over the salient points of that project as it relates to duplicate removal. Update, May 19, 2022: Ten years later I’m amazed that the sample code bundle link still works. Since I don’t trust that it will continue to work, SharedCoreData is also on GitHub.

In this case the app defines a Person entity and considers two instances to be duplicates if they have the same emailAddress. The first step is finding every case where the same emailAddress appears twice:

NSError *error = nil;
NSManagedObjectContext *moc = [[NSManagedObjectContext alloc] init];
[moc setPersistentStoreCoordinator:_psc];

NSFetchRequest *fr = [[NSFetchRequest alloc] initWithEntityName:@"Person"];
[fr setIncludesPendingChanges:NO];

NSExpression *countExpr = [NSExpression expressionWithFormat:@"count:(emailAddress)"];
NSExpressionDescription *countExprDesc = [[NSExpressionDescription alloc] init];
[countExprDesc setName:@"count"];
[countExprDesc setExpression:countExpr];
[countExprDesc setExpressionResultType:NSInteger64AttributeType];

NSAttributeDescription *emailAttr = [[[[[_psc managedObjectModel] entitiesByName] objectForKey:@"Person"] propertiesByName] objectForKey:@"emailAddress"];
[fr setPropertiesToFetch:[NSArray arrayWithObjects:emailAttr, countExprDesc, nil]];
[fr setPropertiesToGroupBy:[NSArray arrayWithObject:emailAttr]];

[fr setResultType:NSDictionaryResultType];

NSArray *countDictionaries = [moc executeFetchRequest:fr error:&error];

The NSExpression finds not the email addresses but the number of times each address is used. The count function combined with the call to setPropertiesToGroupBy: means that the fetch will return an array of NSDictionary instances, each of which contains an emailAddress and a count of the number of times that address was found.

For those of you who speak SQL, all of this is conceptually almost the same as doing this:

SELECT emailAddress, COUNT(emailAddress) FROM Person GROUP BY emailAddress;

The code then iterates over this array to get only email addresses where count is greater than 1. It gathers these into an array named emailsWithDupes.
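
That filtering step isn’t in the excerpt above, but it’s a simple loop, something like this, where the dictionary keys match the names configured on the fetch request:

// Keep only the addresses that appear more than once. The @"count" and
// @"emailAddress" keys match the names set up on the fetch request above.
NSMutableArray *emailsWithDupes = [NSMutableArray array];
for (NSDictionary *dict in countDictionaries) {
    if ([[dict objectForKey:@"count"] integerValue] > 1) {
        [emailsWithDupes addObject:[dict objectForKey:@"emailAddress"]];
    }
}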

Next comes finding the full NSManagedObject for each of these duplicate email addresses. This is pretty straightforward Core Data, fetching every Person using one of the addresses found above:

fr = [NSFetchRequest fetchRequestWithEntityName:@"Person"];
[fr setIncludesPendingChanges:NO];

NSPredicate *p = [NSPredicate predicateWithFormat:@"emailAddress IN (%@)", emailsWithDupes];
[fr setPredicate:p];

NSSortDescriptor *emailSort = [NSSortDescriptor sortDescriptorWithKey:@"emailAddress" ascending:YES];
[fr setSortDescriptors:[NSArray arrayWithObject:emailSort]];

NSArray *dupes = [moc executeFetchRequest:fr error:&error];

This gets every object with a duplicated email address, sorted by address. That’s followed by a loop which removes all but one instance of each object with the same address.

For the SQL users, the above is conceptually the same as:

SELECT * FROM Person WHERE emailAddress IN (...email addresses found above...)
    ORDER BY emailAddress;

The code then runs through this list, choosing winning objects and deleting the losers. An interesting detail is how the code decides which objects to delete:

if ([person.emailAddress isEqualToString:prevPerson.emailAddress]) {
    if ([person.recordUUID compare:prevPerson.recordUUID] == NSOrderedAscending) {
        [moc deleteObject:person];
    } else {
        [moc deleteObject:prevPerson];
        prevPerson = person;
    }
}

Normally you wouldn’t compare UUIDs like this, but in this case it’s a good way to ensure that you get the same result on every device. If you’ve got a field like this, you could use that to look for duplicates instead of something like an email address. Finding duplicate email addresses covers the case where the user has created duplicate instances. Finding them by something like the UUID is better suited to handling the duplicate storm that can result from changing sync mechanisms.
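
For context, the surrounding loop looks roughly like this. This is paraphrased rather than copied from the sample, and for a large data set you’d probably also want to save in batches rather than once at the end:

// Paraphrased, not copied verbatim from the sample: walk the results sorted
// by email address, comparing each Person to the previous one and deleting
// the loser whenever the addresses match.
Person *prevPerson = nil;
for (Person *person in dupes) {
    if (prevPerson != nil && [person.emailAddress isEqualToString:prevPerson.emailAddress]) {
        if ([person.recordUUID compare:prevPerson.recordUUID] == NSOrderedAscending) {
            [moc deleteObject:person];
        } else {
            [moc deleteObject:prevPerson];
            prevPerson = person;
        }
    } else {
        prevPerson = person;
    }
}

NSError *saveError = nil;
if (![moc save:&saveError]) {
    NSLog(@"Error saving after removing duplicates: %@", saveError);
}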

If you find duplicates like this, that is, based on a specific user-visible attribute, make sure that the duplicate rules are clear to the user. You don’t want your users to unknowingly create new data that you’re just going to turn around and delete as a duplicate. If your app will weed out duplicate instances by scanning for duplicate email addresses, don’t let the user go ahead and create them in the first place.

Optimizing the Detection Process

The code above scans the entire data store to see if any duplicates occur, anywhere. At times that’s the best approach. But in many cases it’s possible to speed things up by avoiding unnecessary work:

  • The incoming change notification has separate lists of inserted, updated, and deleted objects. If you’ve scanned for duplicates in the past, you know there won’t be any new duplicates unless at least one new object has been inserted. Skip duplicate detection when only updates and/or deletions arrive (there’s a sketch of this check after this list).

  • When incoming objects are inserted, optimize duplicate detection based on those objects’ attributes. In this example, don’t find every Person, just find those whose email address matches one of the incoming objects. That is, add a predicate to the original fetch request that looks something like:

NSArray *incomingEmailAddresses = // initialize based on the incoming change notification
NSPredicate *predicate = [NSPredicate predicateWithFormat:@"emailAddress IN %@", incomingEmailAddresses];
  • You may well have more than one entity that you need to check for duplicates. Don’t bother checking entities where duplicates would be removed by a deletion rule from some other entity. For example, in this case, suppose the Person entity had a one-to-one relationship to a separate Address entity. If the delete rule for this relationship is cascade, you don’t need to bother looking for duplicate Address instances, because they’ll be deleted along with the duplicate Person.
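
Putting the first of those optimizations into code, the deDuplicateAfterImport: placeholder from earlier can bail out as soon as it sees that nothing was inserted. Here findAndRemoveDuplicatesForInsertedObjectIDs: is another placeholder standing in for the duplicate scan described above:

- (void)deDuplicateAfterImport:(NSNotification *)notification
{
    // The userInfo sets hold NSManagedObjectID instances, not managed objects.
    NSSet *insertedIDs = [[notification userInfo] objectForKey:NSInsertedObjectsKey];
    if ([insertedIDs count] == 0) {
        // Only updates and/or deletions arrived, so no new duplicates are possible.
        return;
    }
    // Otherwise run the duplicate scan described above, ideally restricted to
    // the attribute values found on the newly inserted objects.
    // (findAndRemoveDuplicatesForInsertedObjectIDs: is a placeholder, not a real API.)
    [self findAndRemoveDuplicatesForInsertedObjectIDs:insertedIDs];
}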

Next time: I’m waiting…

In the next episode of iCloud Complications I’ll discuss why the documented, recommended approach to using iCloud can lead to nightmarish delays at app launch time, even if everything is working. I’ll suggest some alternate schemes that can help.