We’ve got a client with a large-ish file: 150 connections at any time and a few hundred thousand records in the main table. We are aware that there are some things we could have designed better - the main table has 200 or so fields, for example… (on the other hand, we kept stored calculations to a minimum, so at least that’s not too bad, I guess). We’ve got some refactoring time scheduled for later in the month.
The solution is accessed via CWP (PHP) quite heavily, especially to run scripts and finds, with requests every few seconds. It’s also accessed via the Data API. PSOS is used extensively to offload tasks to the server, as well as script schedules.
Every few days, the server stalls. Requests from FMPro users take ages to resolve. I cannot see anything in the Top Call Stats that would trigger this; I can see some PSOS scripts/schedules taking a very long time to complete, but that’s usually at the same time as, or after, I hear a complaint.
So, I wonder if using a worker machine here would alleviate the issue by letting the server deal only with PSOS scripts, schedules and the Data API, while the CWP requests are handled by the worker.
The FileMaker Server 18 Guide specifies that a worker machine will only help with hosting WebDirect solutions, but somehow I’d hoped it would also help with CWP?
Many thanks for your ideas/inputs!
Welcome to the soup!
Did you have a look at the FMS stats?
I’d start a checklist with things to look at:
- scheduled scripts that stall or end with an error, or time out
- memory consumption
- events in logs that precede the stall
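For that last point, rather than scrolling the Event.log by hand, a small script can pull out just the window before a stall. This is only a rough sketch in Python - the timestamp format and sample lines below are assumptions, so check them against your actual log layout:

```python
from datetime import datetime, timedelta

def events_before_stall(lines, stall_time, window_minutes=10):
    """Return log lines whose timestamp falls in the window before the stall.

    Assumes each line starts with a 'YYYY-MM-DD HH:MM:SS' timestamp;
    adjust the format string if your log differs.
    """
    window_start = stall_time - timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        try:
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        if window_start <= ts < stall_time:
            hits.append(line)
    return hits

# Made-up sample lines, just to show the shape of the output.
sample = [
    "2019-11-04 09:58:12\tInformation\tSchedule 'Nightly' completed.",
    "2019-11-04 10:02:45\tWarning\tClient disconnected abnormally.",
    "2019-11-04 10:20:00\tInformation\tUnrelated later event.",
]
for hit in events_before_stall(sample, datetime(2019, 11, 4, 10, 5)):
    print(hit)
```

Run it against each stall time you’ve recorded and compare the windows; a candidate culprit is something that shows up in most of them.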
Thanks for your answer. That’s what I did. I can add that memory consumption is very stable at 60%. No particular event in the logs precedes the stall. I can see that finds on the main table take longer, and in the Top Call Stats log I can see scripts timing out or taking several seconds to complete, but no single one seems to trigger the stall.
However, the client noted that the server always seems to stall when they’re a bit busier in stores, so I think the stall is due to the server reaching capacity. CPU usage never drops below 25% when it occurs, but I wouldn’t have thought 25% was a big number.
That’s why I wonder if having a worker machine could help, by having the master not compute as much?
And other stats, like events or fmdapi? A small server at a customer of mine reaches 80 to 90% about once a week. It never stalls. What kind of server hardware and OS is in use?
There are script calls from several sources (web clients, FMPA, Data API) - could it be that two scripts run concurrently and produce a lock situation?
Could you find out which scripts are active at the time of stall (access and event log)?
Events and Access are really busy, but I could identify scripts that take a while around the time of, or after, the stall. That didn’t help me much, because they’re the PSOS scripts everyone is using all day long in stores… And indeed, scripts are always running concurrently when stores are busy. They might produce a lock situation, I guess, as they’re acting on the same 3 tables, but in theory they shouldn’t be on the same records.
The server is hosted on AWS, on a t2.2xlarge instance with Windows Server 2016 Datacenter installed. We’ve got 32GB of RAM and an 8-core CPU, with plenty of space on the drive where the data is stored.
Edit: Not sure if it’s relevant, but the solution uses data separation, and all CWP/Data API access takes place on the UI file.
@Chloe, the only guess I can make, if no scripts visibly fail, is that with FMS 18 the server received improvements in multi-threading. It is a long shot, but maybe there is an issue related to that. Only the platform vendor could tell…
@AndyHibbs, did you observe anything similar to this on your servers?
Yes, and that’s all the more puzzling.
We switched to FMS18 at the very end of September because FMS17 was struggling so much… At that point it stalled pretty much every day. We thought multithreading would make handling concurrent requests from users/PSOS/CWP easier. It seemed to be the case for a month or so, but now we’re starting to see the server stall again at busy times.
We’re going to try to throw more processing power at it, moving out of AWS, to see if that makes a difference. But I’m afraid that might not be a sustainable solution.
Many thanks for your help!
You’re welcome! May I suggest that you change the title of your post to something like ‘FMS v18 stalling at high workload and multiple scripts launched’. This may attract the attention of fellow developers who have encountered the same issue.
My initial query was to gather opinions about whether having a worker machine would help or not, thus the title. But we steered away from that, so I’ve changed the title to the one you suggested - it sounds better.
Have a good evening (if you’re in Europe)!
Yes, I am located in good old Switzerland .
While I was at it, I reviewed the forum member list and spotted some people with way more FMS experience under their belt than I have. Maybe someone can chime in.
With 150 concurrent users calling PSoS scripts all the time, 8 cores are not enough, on top of all the other overhead/serving…
Also, 32 GB is pretty small, depending on your file size. I would definitely upgrade to more cores, add a worker machine, etc. If PSoS scripts are still timing out after that, then write your own monitor to serialize their execution.
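To sketch the shape of such a monitor (in Python rather than FileMaker, purely as an illustration): jobs go into a queue and a single worker drains it, so only one heavy script runs at a time. In FileMaker terms the queue would be a jobs table and the worker a server schedule polling it; the job names below are made up.

```python
import queue
import threading

jobs = queue.Queue()
results = []

def worker():
    """Drain the queue one job at a time, serializing execution."""
    while True:
        job = jobs.get()
        if job is None:                   # sentinel: shut the worker down
            break
        results.append(f"ran {job}")      # stand-in for 'Perform Script on Server'
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

# Clients enqueue work instead of firing PSoS directly.
for name in ("rebuild_totals", "sync_store_7", "rebuild_totals"):
    jobs.put(name)

jobs.put(None)                            # tell the worker to stop
t.join()
print(results)
```

The point of the design is that bursty demand becomes a backlog instead of N concurrent server-side scripts fighting over the same tables; the trade-off is latency, since callers wait their turn.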
Apparently, Data API processing failures have been reported once on the FM community forum, where overload just dropped further calls - I would watch what’s going on there and get in touch with tech support at CII / FMI.
Wouldn’t dropping calls because of overload trigger entries in the events log?
Try turning off FMS 18 restore function
… and make sure you’re running the latest update, FMS 18.0.3, which should address some known issues.
Turning off Page Locking and Startup Restoration may take them back to stalling the same as 17 did.
@Chloe - a worker machine may help. It’s hard to tell without seeing the stats, etc.
Processing power can definitely be a bottleneck. Also, check your free space on the server. Remember the restoration log can take an additional 8GB.
Thanks all for the ideas, I’ve taken good note of them.
We’re moving to a 20 Core x 2.5 GHz CPU + 192GB RAM server tonight.
The size of the data file is about 40GB, but I didn’t think memory was an issue because the cache hit rate was always 100% and memory usage stayed constant at 60%.
I have 286 GB of free space on the C:/ drive where FMS and the data sit, so that looks like plenty. Backups are stored on a separate drive.
I’ve watched the Data API logs and couldn’t see dropped calls there - but I’ll check again; it’s so easy to miss something when scrolling logs.
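When I check again, I might script it rather than scroll - something along these lines. The sample lines and the regex are assumptions about the log layout (an access-log style with an HTTP status after the request), not the real fmdapi format, so the pattern would need adjusting:

```python
import re

def status_counts(lines):
    """Tally HTTP status codes per log line, e.g. to spot 5xx spikes.

    Assumes each line contains '" <3-digit status> ' after the request,
    as in common access-log formats; adjust the pattern to your file.
    """
    counts = {}
    for line in lines:
        m = re.search(r'"\s(\d{3})\s', line)
        if m:
            code = m.group(1)
            counts[code] = counts.get(code, 0) + 1
    return counts

# Made-up sample lines, just to show the idea.
sample = [
    '10.0.0.5 - - [04/Nov/2019:10:02:11] "GET /fmi/data/v1/... HTTP/1.1" 200 512',
    '10.0.0.5 - - [04/Nov/2019:10:02:12] "POST /fmi/data/v1/... HTTP/1.1" 503 87',
]
print(status_counts(sample))
```

A sudden rise in non-200 counts around a stall time would be much harder to miss than individual lines in a scrolling log.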
We are running FMS 18.104.22.168, but I’ve made sure 18.0.3 was installed on the new server. I don’t really want to switch startup restoration off - I feel like it’s a genuinely useful feature we’d want to keep. But it’s an idea to keep in mind.
Many thanks to everyone who chimed in.
Your messages do make me really hopeful that the switch to a more powerful server may finally solve the problem. We’ll see in the next few weeks!
Apologies for the slow response - we’ve got 2 separate deadlines here in the next 2 days and are burning the midnight and dawn oil at the moment.
We’ve only got 1 FMS18 server in a production environment at the moment, plus our development server. The upgrade process from FMS v15 to v16 was so painful that we buried our heads in the sand for a while, but we are working towards upgrading other servers in the near future. This has been partly accelerated by more than one client accidentally upgrading to macOS Catalina, which has had severe repercussions in places.
We’ve had no identifiable problems with either FMS18 installations, but neither are under a heavy load.
We continue to have to restart the FMSE on our busiest v16 server; sometimes it will go for a couple of weeks without a problem and then fall over twice in a day or so. We have some PSOS, but mostly scheduled scripts. This is running on a Windows 2012 virtual machine with 6 vCPUs, 15GB RAM and a very fast SSD.
The only thing I can add to the advice given is that, although the acknowledged priority is to provide enough vCPUs, our experience has been that lack of RAM was the bigger culprit in the server actually failing or stalling. This is buried somewhere in the original community forum, but goodness knows where it would be now.
Looking forward to catching up on the expanding posts here soon, but back to the mouseface now.
No problem Andy,
Thanks for your contribution, and good luck with the deadlines!