DPM 2016 with MBS – issues again

Hi folks!

It has been a long time since I published my post about fixing the slowness of DPM 2016 with Modern Backup Storage: [Solved] Slow/hangs ReFS/DPM 2016 with Modern Backup Storage. Unfortunately, it took us a few more months to solve it completely. Both Microsoft teams (DPM and Storage) worked on this case.

Basically, I ran into the same problems I had before:

  • DPM jobs never complete on time
  • Some DPM jobs fail
  • DPM servers hang (no RDP, no WinRM, but ping still responds)

We considered reverting our backup servers to DPM 2012 R2, but there are a lot of drawbacks:

  • We would hit the LDM limitation again, which was the reason we decided to migrate to DPM 2016 with Modern Backup Storage in the first place
  • DPM 2012 R2 is an old product
  • You would need to reinstall not only DPM, but the operating system as well.

I don't want to dive into the details of our communication with Microsoft, so I'll just post the recommendations that helped us:

Increase RAM

If you ask me or Microsoft how much RAM you need, nobody can give you an exact answer. It really depends on your workload. If you back up only a few VMs or Exchange databases, maybe 32 GB will be enough for you. In my case there are hundreds of VMs and Exchange databases, which is a completely different story. ReFS likes a lot of RAM 😊

Back to our case. Our servers were purchased with 24 GB/32 GB of RAM. That is perfectly fine for DPM 2012 R2, but it is not enough for DPM 2016 with Modern Backup Storage. So we doubled the RAM to 64 GB on some servers. Maybe it's overkill, but there is no way to calculate the correct RAM size in advance.
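If you want to see how much headroom a server actually has while jobs are running, a couple of standard Windows performance counters are enough. Here is a minimal sketch with Get-Counter; the counters are standard, but the interval and sample count are just my guess, so adjust them to your backup window:

# Sample free memory and non-paged pool every 30 seconds for one hour while backup jobs run
Get-Counter -Counter '\Memory\Available MBytes', '\Memory\Pool Nonpaged Bytes' -SampleInterval 30 -MaxSamples 120 | ForEach-Object { $_.CounterSamples | Select-Object Path, CookedValue }

If available memory keeps falling towards zero during large jobs, adding RAM is the first thing to try.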

Disable Storage Calculation

I don't suggest disabling storage calculation before increasing RAM. In my case the servers became unstable and I had to enable storage calculation again. Follow this link for details – https://docs.microsoft.com/en-us/system-center/dpm/dpm-release-notes?view=sc-dpm-1801

Disable storage calculation – Program Files\Microsoft System Center 2016\DPM\DPM\bin\Manage-DPMDSStorageSizeUpdate.ps1 StopSizeAutoUpdate

Enable storage calculation – Program Files\Microsoft System Center 2016\DPM\DPM\bin\Manage-DPMDSStorageSizeUpdate.ps1 StartSizeAutoUpdate
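Because the script sits under a path with spaces, I run it from an elevated PowerShell prompt with the call operator. The path below assumes a default DPM 2016 installation, so adjust it if DPM is installed somewhere else:

# Stop the automatic volume size calculation (default DPM 2016 installation path assumed)
& "C:\Program Files\Microsoft System Center 2016\DPM\DPM\bin\Manage-DPMDSStorageSizeUpdate.ps1" StopSizeAutoUpdate

# Turn it back on later, for example if the server becomes unstable
& "C:\Program Files\Microsoft System Center 2016\DPM\DPM\bin\Manage-DPMDSStorageSizeUpdate.ps1" StartSizeAutoUpdate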

Configure WMI handle count

Check your Windows event logs and you will probably see that some WMI operations have failed. Increase the WMI quota up to 8 GB / 12 GB (the default setting for Windows Server 2016 is 4 GB); this article explains everything in detail – https://blogs.technet.microsoft.com/askperf/2014/08/12/wmi-how-to-troubleshoot-wmi-high-handle-count/

I suggest increasing the WMI quota even if you don't see any errors in the logs.
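For reference, the quota lives in the __ProviderHostQuotaConfiguration WMI class and can be read and changed from PowerShell. The snippet below is only a sketch: the MemoryPerHost property and the 8 GB value are my interpretation of the recommendation above, so verify them against the linked article before applying anything:

# Show the current WMI provider host quota settings
$quota = Get-WmiObject -Namespace root -Class __ProviderHostQuotaConfiguration
$quota | Select-Object MemoryPerHost, MemoryAllHosts, HandlesPerHost, ThreadsPerHost

# Raise the per-host memory quota to 8 GB (value taken from this post, not an official figure)
$quota.MemoryPerHost = 8GB
$quota.Put()
# A restart of the Winmgmt service (or a reboot) is needed before the new quota takes effect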

Optional: set ReFS and DPM registry keys

I think that with a larger RAM size you don't need to tune ReFS and you can live with the default settings.

But I will post my registry settings anyway; just remember that your environment is different from mine and you will need to adapt them.

ReFS

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
"DisableDeleteNotification"=dword:00000000
"FilterSupportedFeaturesMode"=dword:00000000
"LongPathsEnabled"=dword:00000000
"NtfsAllowExtendedCharacter8dot3Rename"=dword:00000000
"NtfsBugcheckOnCorrupt"=dword:00000000
"NtfsDisable8dot3NameCreation"=dword:00000002
"NtfsDisableCompression"=dword:00000000
"NtfsDisableEncryption"=dword:00000000
"NtfsDisableLastAccessUpdate"=dword:00000001
"NtfsDisableLfsDowngrade"=dword:00000000
"NtfsDisableVolsnapHints"=dword:00000000
"NtfsEncryptPagingFile"=dword:00000000
"NtfsMemoryUsage"=dword:00000000
"NtfsMftZoneReservation"=dword:00000000
"NtfsQuotaNotifyRate"=dword:00000e10
"RefsDisableLastAccessUpdate"=dword:00000001
"ScrubMode"=dword:00000001
"SymlinkLocalToLocalEvaluation"=dword:00000001
"SymlinkLocalToRemoteEvaluation"=dword:00000001
"SymlinkRemoteToLocalEvaluation"=dword:00000000
"SymlinkRemoteToRemoteEvaluation"=dword:00000000
"UdfsCloseSessionOnEject"=dword:00000003
"UdfsSoftwareDefectManagement"=dword:00000000
"Win31FileSystem"=dword:00000000
"Win95TruncatedExtensions"=dword:00000001
"RefsEnableInlineTrim"=dword:00000001
"RefsEnableLargeWorkingSetTrim"=dword:00000001
"RefsNumberOfChunksToTrim"=dword:00000004
"RefsDisableCachedPins"=dword:00000001
"RefsProcessedDeleteQueueEntryCountThreshold"=dword:00002710

DPM

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Configuration\DiskStorage]
"DisableReFSStorageComputation"="1"
"DuplicateExtentBatchSizeInMB"=dword:00000064
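If you prefer not to import a whole .reg file, the same values can be set one by one from PowerShell. This is just a sketch covering a few of the values above (the data matches the exports; 0x64 is 100 in decimal):

# A few of the ReFS tuning values under the FileSystem key
$fs = 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem'
Set-ItemProperty -Path $fs -Name RefsEnableLargeWorkingSetTrim -Value 1 -Type DWord
Set-ItemProperty -Path $fs -Name RefsNumberOfChunksToTrim -Value 4 -Type DWord
Set-ItemProperty -Path $fs -Name RefsDisableCachedPins -Value 1 -Type DWord

# The DPM storage values from the export above
$dpm = 'HKLM:\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Configuration\DiskStorage'
Set-ItemProperty -Path $dpm -Name DisableReFSStorageComputation -Value '1' -Type String
Set-ItemProperty -Path $dpm -Name DuplicateExtentBatchSizeInMB -Value 100 -Type DWord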

 

What is the final result?

The servers do not hang anymore. No more failed jobs. Most of the jobs complete on time, but some of them are delayed by a few hours. Maybe I have to play with the registry keys to improve the situation. But it's much better than before.


5 Comments

  1. We are using DPM 2019 and Windows Server 2019 with a ReFS setup. We have not applied any of the registry changes you have indicated in this post; however, we have done the ones from your previous post.

    We have a 12 TB VM and a 23 TB VM that we back up with our DPM server. From the looks of it, after a lot of testing, we can get to about 14 TB backed up before the DPM server crashes and reboots. The reason for this is that the DPM server demands more and more memory as the backup progresses; it will actually consume every bit of memory the server has until nothing is left, and then it crashes, which is not very smart. The DPM server has around 100 GB of memory to work with. It seems that, under our current configuration, backing up 30+ TB across 2 VMs will require 200 GB+ on the DPM server. What I find interesting is that if we only back up the 11 TB VM it does complete, and the DPM server is stable until the next night's backups kick in and there are many smaller ones. Since the 11 TB backup consumed around 70-90 GB of RAM and never released it, the next round of backups consistently crashes the server, as it seems there is no memory left to work with. So while we can bring the server up to 200 GB of memory, and I do believe this will allow us to back up the 30 TB of VMs, I am not confident the DPM server will release the memory when it's finished with the jobs, as we have already experienced. Do you know if any of the changes above target this issue, or have any recommendations? Thanks

    1. Hi Brian!
      Thanks for your question.
      Well, I don't use any registry customization anymore, just the Windows defaults.
      You should check the following:
      – install the latest drivers and firmware for the RAID controller
      – make sure the RAID write cache is configured correctly (Write-Back or Always Write Back)
      – make sure all HDDs in the RAID are fine: no bad blocks, no controller problems, etc. (a quick check is shown after this list)
      – make sure the network link has enough capacity. We upgraded ours from 1 Gb to 10 Gb, and I'm pretty sure it helped a lot. Imagine the situation when a lot of tasks sit in the queue because of a slow network link; of course you will experience the memory leak then.
      – don't use big DPM volumes; as far as I know, the MS recommendation is that up to 30 TB is OK. We use 40-50 TB with basically no difference, but it's always good practice to have 2-4 DPM volumes to balance the storage load.
      – sometimes a DPM storage reset helps. On some of our servers we had to completely reformat the ReFS drives, and currently we have no issues with 64 GB of RAM. I can't promise nothing will happen in the future.
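      A quick way to check the disks and ReFS volumes is something like this (standard Windows Storage cmdlets, nothing DPM-specific, just a sketch):

      # Physical disk health behind the backup storage
      Get-PhysicalDisk | Select-Object FriendlyName, MediaType, HealthStatus, OperationalStatus

      # ReFS backup volumes and remaining free space
      Get-Volume | Where-Object FileSystem -eq 'ReFS' | Select-Object DriveLetter, FileSystemLabel, @{n='SizeTB';e={[math]::Round($_.Size/1TB,1)}}, @{n='FreeTB';e={[math]::Round($_.SizeRemaining/1TB,1)}}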

      Your situation is practically the same as ours was when a) the servers had 24-32 GB of RAM and b) we hadn't reformatted ReFS. Large VMs are always a problem for us too. Can you make it work with smaller VHDXs?

      In the latest MS recommendations you can find SSD caching to optimize ReFS operations. We didn't try it.

      Best regards,
      Dmitry

  2. Thanks for your feedback, Dmitry.

    The DPM server has 120 GB of RAM, and it runs out of memory around 16 TB into the backup of the larger 23 TB VM. The DPM server's memory usage goes up every few minutes as the backup gets further along. It will use everything the server has until there is nothing left, then it crashes; we can repeat this over and over.
    The server is brand new with plenty of horsepower, using Dell MD1400s, which I have used with many DPM implementations in the past without issues. The server has a 20 Gb network team and connects to a 40 Gb backbone, so there are no bandwidth issues here.

    We are going to double the memory in the server to 256 GB. It seems ridiculous, but our calculations indicate it should provide the room needed to finish the backup of 30-40 TB before running out of memory and crashing. That still does not resolve the issue that the server never releases the memory after the backup is finished; as it stands, a reboot is the only thing that seems to release it. This memory demand only seems necessary during the initial snapshots of the large VMs; after that it's only small changes, so memory demand is marginal at most. It's possible a restore would also require the memory, but we have not tested that.

    1. Hi Brian!
      I have never backed up very large VMs; the maximum was 6-8 TB.
      I suggest you try Veeam backup software. I heard they have solved these issues.

      Best regards,
      Dmitry
