Increasing Performance for Complex Join

I have a complicated calculation which goes like this:

Table A: one record per employee per month
Table B: one record per paycheck per employee. Paychecks come in different categories, as indicated by Paycheck::PaycheckType

The goal:
In Table A, calculate a field arrears which is equal to the # of days since the most recent paycheck, for one specific type of paycheck.

Table A and B are related with a complex relationship:

  • TableA::employeeID = TableB::employeeID
  • TableA::kType = TableB::PaycheckType
  • TableA::Date >= TableB::PaycheckDate
  • Sorted on TableB::PaycheckDate (reverse)

Then the calculation in Table A is fairly simple:

  • TableA::Arrears = (TableA::Date - TableB::PaycheckDate)

This works because the >= relationship means that TableA only sees paychecks from the past or present (but not future). The sorting (by date, reversed) ensures that the first related record TableA sees is the most recent paycheck.

Only problem: this is SLOW - I've figured out that this one calculation and relationship is responsible for about 80% of the time it takes to update the employee table each month. This is not surprising, since the relationship has 3 predicates and is sorted.

Can anyone see a clever way to optimize this?

One idea would be to keep the latest paycheck date in the main table, so for each employee you will have one date. Another idea (or combined) would be to use a scheduled script each day to adjust the days and have "static" data for the days field.

1 Like

I'm not sure that would work, since TableA is a historical table (one record per employee per month) and needs to handle retroactive data changes (e.g. missing paycheck data from the past comes in after the fact).

I was thinking that instead of having the relationship sorted by date (descending) to grab the most recent record, I could use a Max(TableB::PaycheckDate) calculation instead. I don't know which would be faster.

Hi @xochi

My suspicion is that the relationship performs slowly due to one particular predicate: The inequality constraint for the dates

Thus, if I were in your shoes, I would first test the above hypothesis by removing that constraint, and just seeing if the relationship performs faster. Without the constraint, of course, the data will be incorrect, but this suggestion is just for the purpose of seeing what effect that particular constraint has on performance. I believe that removing that particular constraint (the date inequality) could cause a dramatic change in performance -- but, I am not positively certain about this.

If removing the constraint does make a huge change in performance, then I would begin considering approaches which can work without that constraint, such as:

  • Is it possible to define month and year fields in the "right hand" table, which could be used to enforce "no future values" because you have some additional knowledge about the feasible range in which the most recent payment would happen?

  • A strategy where you use the relationship without the constraint to grab a small set of data of multiple possible matching candidates, which you then refine to eliminate the future values?

  • Some combination of the above, or some other idea offered by anyone else?

But again: The first thing I would do is to see if it is possible to determine whether that inequality predicate on the date is the precise cause of the slowness in the relationship.

As always, I hope this may help.

5 Likes

Good ideas - please keep 'em coming 🙂

Other thoughts:

  • this special paycheck is rare, so perhaps 95% of employees never have it. Given this, perhaps I could first do a search on the Paycheck table (Table B), then GoToRelatedRecords back to the EmployeeMonth table (A) and only update those records (see the script sketch at the end of this post). This might be much faster, but has the downside of more scripting complexity.
  • Or, maybe I can write the formula in a way that bypasses that relationship entirely? Something like this:
SpecialPaycheckArrears = 
Case(
    thisEmployeeNeverHasSpecialPaychecks ; "-1" ;
    // otherwise, do the calculations through the very slow relationship
    ...
)

I believe the FileMaker calculation engine does short-circuiting, right? By that I mean that in a Case() statement, once a test evaluates to true, the remaining branches are ignored.
This would require me to maintain this additional field, but that might be something very easy to update during the Paycheck import process...
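For the first bullet above, a rough server-side script sketch might look like this (the layout names and the "Special" type value are assumptions, not real schema):

# Find only the rare paycheck type, then jump to just the affected EmployeeMonth records
Go to Layout [ "TableB" ]
Enter Find Mode [ Pause: Off ]
Set Field [ TableB::PaycheckType ; "Special" ]   # placeholder for the actual special type
Perform Find []
Go to Related Record [ Show only related records ; Match all records in current found set ; From table: "TableA" ; Using layout: "TableA" ]
# ... loop over this (much smaller) found set and refresh the Arrears field ...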

1 Like

How about adding a table C: one record per employee per check type.

A and C are related this way:

  • A::employeeID = C::employeeID

C and B are related this way:

  • B::PaycheckType = C::PaycheckType
  • B::employeeID = C::employeeID

You update C::mostRecentCheckDate every time you create or update a record in B, assuming the date is in the present or the past. You put the arrears calculation in C, making it a stored calculation, one per check type per employee.
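A minimal sketch of how C could be maintained and used (field names here are illustrative assumptions; as suggested earlier in the thread, an arrears value based on the current date would need a daily refresh to stay correct):

// Run from the script that creates or edits a TableB record:
Set Field [ C::mostRecentCheckDate ;
    If ( TableB::PaycheckDate ≤ Get ( CurrentDate )
         and ( IsEmpty ( C::mostRecentCheckDate ) or TableB::PaycheckDate > C::mostRecentCheckDate ) ;
         TableB::PaycheckDate ;
         C::mostRecentCheckDate ) ]

// C::Arrears, stored, refreshed whenever mostRecentCheckDate changes:
Get ( CurrentDate ) - C::mostRecentCheckDate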

Hope this makes sense and hope this helps.

2 Likes

Yes - that is true.

Another method is to cache the data needed for the 5% as an overnight process so that dynamic calcs can grab the cache.

1 Like

@steve_ssh is correct, it's the date predicate that is the killer. The >, <, ≥ and ≤ operators are always the most expensive predicates; it's worse with timestamps, but dates are right up there with poor performance. The reason is simply that an = is an easy check on the index, whereas any ≥ or ≤ takes more time to traverse the index.

These predicates perform worse the more records you have in table B, so you will notice performance degrading over time. It actually makes no difference how many records are found through your relationship; the performance has more to do with the overall number of records in the table. More records = larger index = slower evaluation of the relationship.

Sorting might be playing a smaller role in the performance. The Max or Min functions aren't going to help at all because they still require ordering of records to return the result, which is comparable to the sort on the relationship.

One thing you could try is not using a sort at all, and instead using the Last function to get the last record in the relationship. Obviously this is fraught with some danger, and it assumes every record in B is created in chronological order. If you create records out of order then Last may not necessarily be the most recent date and this technique won't work.
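In calc form that might be as simple as the following, with the same caveat that it is only valid if TableB records are created in chronological order:

// Unsorted relationship; Last() grabs the last related record
TableA::Date - Last ( TableB::PaycheckDate )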

In my experience, FileMaker relationships aren't very well optimised. In your case, you would expect the two "=" predicates to evaluate first, reducing the number of potential results, with the ≥ predicate applied last to that reduced set, but based on what I have seen, this is not the case.


Another option is deferring the performance hit to an area where it has less of an impact. Here is an example of this.

Consider making the calc on table A a stored value via an auto-enter calc. It still uses the same calculation and relationship. It could be triggered using a special "refresh" field, referenced in the auto-enter purely for triggering purposes. Thus each time "refresh" is set to 1, the auto-enter fires.
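As a sketch, the auto-enter calculation might reference the flag like this (the "RefreshFlag" field name is just an assumption):

// TableA::Arrears, auto-enter calculation, "Do not replace existing value" switched off
Let ( trigger = TableA::RefreshFlag ;      // referenced only so that setting the flag re-fires the auto-enter
      TableA::Date - TableB::PaycheckDate  // same slow relationship as before, but evaluated far less often
)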

The key now is ensuring that this auto entered value is kept current. Any point in the solution that has the potential to alter the value of this calculation must instead update the refresh flag. So if a record is added, deleted, or the date is set in table B, the appropriate record in table A is updated, keeping the result relevant.

This defers the performance hit to the create/delete/modify actions on records in B, rather than paying it when evaluating the calc in A.

Often, this approach (while more work) can yield perceived performance benefits. If for example you are running a big report that uses this calculation, you might be required to evaluate it for hundreds or thousands of records in order to produce the report - resulting in a large run time. This stored approach will make the report almost immediate to run. The actual time saved is distributed to the multiple smaller operations such as add/edit/delete which might occur over a period of weeks or months and so has less of an impact.


Outside of that, going back to my earlier point about performance, you might find this calculation will run faster if you are able to cull the number of records in table B. If it is a large table, consider archiving old data. The fewer records you have in this table, the better.

4 Likes

Thank you all for the great feedback and ideas. Here is some followup info:

  • The database is about 10GB total
  • Table A and B each have about 10 million records
  • all relevant fields are regular stored (and indexed where needed) fields populated by triggered auto-enter calculations
  • the update is a batch process that runs server-side
  • the update took about 8 hours (on a Core i9 Mac mini) on FMS 18
  • On a new M2 Mac running FMS 20, it runs in about 2 hours
  • Two hours is acceptable, but once you see a 4x speedup you start to wonder how much faster it can be. :slight_smile:

When I get a chance I'll do some tests and report back with more info.

Way back in v14 we processed telephone records. They arrived as 2-4 GB CSV files. The provider had a nasty habit of aggregating data sets with slightly different columns of data, and of modifying the column sequences.

That caused problems on import and imports were taking many hours. We developed pre-processing scripts in Perl. The main aim of the script was to identify rows of data that did not need secondary processing. Those rows could be imported quickly. It reduced import times enormously.

I'm not sure I understand this correctly, but if I do, what you could possibly do is create an indexed calc field in Table B called something like

TableB::PaycheckDateOnlyIfTypeIsOddball_calc

Then instead of looking for >= on TableB::PaycheckDate, you search on the new indexed calculation.
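For what it's worth, that calculation might be something as simple as this (the "Oddball" value is a placeholder for the real paycheck type):

// TableB::PaycheckDateOnlyIfTypeIsOddball_calc -- stored, indexed calculation with a Date result
If ( TableB::PaycheckType = "Oddball" ; TableB::PaycheckDate )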

I'm not sure if this would be faster, but might be worth trying.

I'd also second whoever had the idea of caching this data somewhere/somehow: a script which runs overnight and sets a flag in TableA::HasOddCheckType, with some way to manually trigger it to run on the server.

Another option would be to create a script which runs this search on the server and then returns a list of TableB primary keys to you, which you then display using a simple join from a global field (holding all the returned keys) to the table you want.

Also be sure that all fields you're searching and sorting on are indexed.

In terms of figuring out the timing, you could also split this into two steps: remove the sort from the relationship, then go to related records, sort them there, and see how long the sort itself takes. There may be ways to optimize that.

1 Like

A complexity of this calculation (and the reason for using the >= relationship) is that the TableA calculation needs to look only at paycheck dates in Table B that are from the past or present.

Here is a simplified example (leaving out most other fields):

Table A
One record per employee per month
ID	Date	Arrears
1	2024-01-01	0
1	2024-02-01	1
1	2024-03-01	2
1	2024-04-01	3
1	2024-05-01	0
1	2024-06-01	1


Table B
One record per paycheck 
ID	Date
1	2024-01-01
1	2024-05-01

Here, the employee had paychecks in January and May. Thus, for February through April the Arrears (shown in months here, for simplicity) increases. However, in May it resets to zero, then begins climbing again in June.

Another idea I'm considering: is it possible to do this with some sort of numeric / mathematical operation?

Imagine a calculated key field in both tables which is equal to the employee ID plus the year:

TableB::specialKeyField = ID & "-" & Year(Date)

Then, with a single predicate match I could do this in TableA:

Let(
[
  n = Count ( TableB::id ) ;   // # of related paychecks in this calendar year
  m = Month ( TableA::Date ) ; // month (1-12) of this record
  arrears = n - m
] ;
Case(
  arrears < 0 ; 0 ;
  arrears
)
)

This decidedly does not work: it assumes that arrears are calculated on a calendar-year basis (they are not), it fails to handle future dates properly, and it fails when there are gaps in the paycheck dates.

However, I'm wondering if some variation of this technique might work?

Another variation to consider, depending on the constraints of how the solution must be implemented:

Do as much as you can (or want to) using a relationship, as long as that relationship does not utilize the inequality predicate.

Then take the process over the finish line by using a scripted Perform Find or a scripted Constrain Found Set.

The reason that I suggest this is because I have noticed that, in some cases, using Find mode to match inequalities has fared much better for me than using a relationship, which has, almost without exception, been a bottleneck.

Understood that having to utilize scripting might be a bummer if there were hopes of doing this all just via calculation engine and relationships.
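To make that concrete, a rough scripted version could look something like this (the layout name and exact field names are assumptions):

# Capture the match values before leaving the TableA record
Set Variable [ $emp ; Value: TableA::employeeID ]
Set Variable [ $type ; Value: TableA::kType ]
Set Variable [ $cutoff ; Value: TableA::Date ]
Go to Layout [ "TableB" ]
Enter Find Mode [ Pause: Off ]
Set Field [ TableB::employeeID ; $emp ]
Set Field [ TableB::PaycheckType ; $type ]
Set Field [ TableB::PaycheckDate ; "≤" & $cutoff ]   # the inequality lives in the find, not the relationship
Perform Find []
Sort Records [ Restore ; No dialog ]                 # sort by PaycheckDate descending; first record is the most recent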

1 Like

When in doubt, create more data

If you are able to store the arrears integer (you only need to set a flag on the record at which arrears equals zero), then you can count the number of subsequent records in that table.

Working the other way around, you could flag records in Table A that occur after a "zero" record. These can be found easily (and obviously the flags get cleared during processing).

You should look at Honza's excellent technique here - quite clever stuff.

1 Like

@Glitchtraker thank you. The relevant technique from the video is this:

  • instead of using a < or > relationship,
  • use a multi-key match field

For example:

Table A
One record per employee per month
ID	Date	Arrears	cMatchDates
1	2024-01-01	0	2024-01-01
					2023-12-01
					2023-11-01
					2023-10-01
					[... etc... ]

The cMatchDates field would contain a List (e.g. paragraph-separated values) of all the dates you want to match. The relationship is then created like this:
TableA::cMatchDates = TableB::Date

Because you are using an equality operator, the relationship is much faster.

Note: this technique only works when the list of dates is finite and small(ish). In my case, payroll is monthly, and our rules only care about arrears between 1 and 12 months, which means the multi-key match field would have at most 12 values in it. When I get some time, I'll test this out and report back on performance.
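For the record, the cMatchDates calc could be built roughly like this (assumes monthly payroll with dates normalized to the first of the month on both sides, and a 12-month arrears window):

// TableA::cMatchDates -- return-separated list of this month and the 11 preceding month-start dates
Let ( d = TableA::Date ;
  While ( [ i = 0 ; out = "" ] ;
          i < 12 ;
          [ out = List ( out ; Date ( Month ( d ) - i ; 1 ; Year ( d ) ) ) ; i = i + 1 ] ;
          out )
)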

3 Likes

Yup this will be much more performant

I may be missing something obvious, but could you not use a simplified relationship

  • TableA::employeeID = TableB::employeeID
  • TableA::kType = TableB::PaycheckType

and modify the calculation to be something like

  • TableA::Arrears =
    Let(
    [
      ~dateA = TableA::Date ;
      ~dateB = Max ( TableB::PaycheckDate )
    ] ;
      If ( ~dateA >= ~dateB ; ~dateA - ~dateB )
    )

I don't know if this would be faster, but it might be worth a try...

@apjcs That formula would work, but only for the current month. A key design goal is that these calculations need to update properly even if triggered on old data. For example, in a record for 2024-06-01, if there are paychecks dated 2024-01-01 and 2024-07-01, the one from the future (2024-07-01) needs to be ignored. Your formula would not handle this case properly.

Another similar way of handling it is to use that same relationship (ID only) and then use a While loop to iterate over the dates, ignoring dates from the future.
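Something like this simplified sketch, assuming the relationship already filters by employeeID and PaycheckType:

// Loop over the related paycheck dates and keep the most recent one that is not in the future
Let ( [ cutoff = TableA::Date ; n = Count ( TableB::PaycheckDate ) ] ;
  While ( [ i = 1 ; best = "" ] ;
          i ≤ n ;
          [ d = GetNthRecord ( TableB::PaycheckDate ; i ) ;
            best = If ( d ≤ cutoff and ( IsEmpty ( best ) or d > best ) ; d ; best ) ;
            i = i + 1 ] ;
          If ( not IsEmpty ( best ) ; cutoff - best ) )
)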

Initial testing suggests using a While loop may not be any faster. I suspect this is because we have a lot of payroll data (some people have 10+ years of monthly data) and this is a fairly "fat" (wide) table with many fields. This means that for each TableA record we end up pulling 120+ related records over the network, then iterating over each one - which is a lot of computation.

I see. I didn't fully understand the problem. Is there a reason not to use Perform Script on Server to avoid transferring loads of records over the network?

You might consider creating a narrow table that has a one-to-one relationship with the paycheck records but only includes the date field (and any others relevant to this process).

Have you tried ExecuteSQL?
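For example, something along these lines (table and field names assumed from the thread; ExecuteSQL has its own performance quirks against large hosted tables, so it would need testing):

// Most recent qualifying paycheck date on or before the TableA record's date
ExecuteSQL (
  "SELECT MAX ( PaycheckDate )
   FROM TableB
   WHERE employeeID = ? AND PaycheckType = ? AND PaycheckDate <= ?" ;
  "" ; "" ;
  TableA::employeeID ; TableA::kType ; TableA::Date )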