calling "await UnitOfWork.CompleteAsync()" in a Hangfire background job appears to be stuck #11629


sedulen created

Good afternoon,

I'm on an older version of ABP (v4.10), and while I am working on upgrading to a newer version, I'm running into a new and rather challenging issue.

I have been using Hangfire for my background jobs for years now, and only very recently, one job in particular seems to be getting stuck.

The background job is extremely simple. It takes a list of files that a user has selected, packages them up into a zipfile, and then sends a user a link to download the zipfile and sends the user who submitted the request a notification that the zipfile was sent.

In an effort to troubleshoot the issue and ensure I don't send emails while the zipfile is still being constructed, I am explicitly beginning and completing the UoW using UnitOfWorkManager.

Here is my code (logging removed for readability)

  public class BuildZipfileBackgroundJob : AsyncBackgroundJob<BuildZipfileArgs>, ITransientDependency
  {
    // ...  properties & constructor removed to condense the code...
    
    protected override async Task ExecuteAsync(BuildZipfileArgs args)
    {
        if (args.CreateNewPackage)
        {
            using (var uow = _unitOfWorkManager.Begin())
            {
                using (_unitOfWorkManager.Current.SetTenantId(args.TenantId))
                {
                    using (AbpSession.Use(args.TenantId, null))
                    {
                        await _zipFileManager.BuildZipFileAsync(args.RequestId);
                        await uow.CompleteAsync();
                    }
                }
            }
        }
        if (args.Send)
        {
            using (var uow = _unitOfWorkManager.Begin())
            {
                using (_unitOfWorkManager.Current.SetTenantId(args.TenantId))
                {
                    using (AbpSession.Use(args.TenantId, null))
                    {
                        await _zipFileManager.SendZipFileAsync(args.RequestId);
                        await uow.CompleteAsync();
                    }
                }
            }
        }
    }
  }

As I've been trying to troubleshoot what is happening, I have added Logger statements, and what I am seeing is that the first await uow.CompleteAsync() never completes. My logging stops at that point, and I never see any further logging for this job.

Additionally, if other users queue up more requests for this background job, the issue continues to grow, and more Hangfire jobs become stuck.

I have read some about the Configuration.UnitOfWork.Timeout setting, and right now I'm not changing that value in my startup configuration.
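
For reference, this is roughly where that setting would live if I did decide to change it (the value below is purely illustrative, not something I have deployed):

    public override void PreInitialize()
    {
        // Illustrative only: raise the default timeout applied to every unit of work.
        // I am NOT currently setting this in my startup configuration.
        Configuration.UnitOfWork.Timeout = TimeSpan.FromMinutes(10);
    }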

What I'm seeing is that the job hangs for hours, so Hangfire thinks the job is an orphaned job and queues another instance of the same job, which further compounds the issue. Ultimately, Hangfire queues up the same job ~5-6x, which then causes problems with Hangfire polling its JobQueue table, and that breaks the entire Hangfire queue processing, leaving jobs enqueued but never processing.

What I'm struggling with is why await uow.CompleteAsync() gets stuck and never completes. It seems like a transaction lock or deadlock could be causing the problem, but I've been really struggling to figure out the root cause (and resolution).

Given my version of ABP (v4.10.0), I'm thinking perhaps it's the use of AsyncHelper. Since my class inherits from AsyncBackgroundJob, the job execution is:

        public override void Execute(TArgs args)
        {
            AsyncHelper.RunSync(() => ExecuteAsync(args));
        }
 

I don't think I have a way around this, and I have many other background jobs that inherit from AsyncBackgroundJob that do not have this problem.

I've thought about using UnitOfWorkOptions when calling .Begin(), and adjusting the Timeout and the Scope attributes, but I feel like I'm just swatting blindly.
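
For example, something along these lines is what I had in mind (the values are guesses, not something I've actually tried):

    using (var uow = _unitOfWorkManager.Begin(new UnitOfWorkOptions
    {
        // Illustrative values only - I haven't settled on what would actually help
        Scope = TransactionScopeOption.RequiresNew,
        Timeout = TimeSpan.FromMinutes(5)
    }))
    {
        // ... same body as in the job above ...
        await uow.CompleteAsync();
    }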

My environment is deployed in Azure, using Azure Storage for files and Azure SQL for the RDBMS.

Any ideas or suggestions you can offer would be greatly appreciated! Thanks, -Brian


4 Answer(s)
  • ismcagdas created (Support Team)

    Hi,

    Is it possible to share the source code of the BuildZipFileAsync method? This might not be related to UoW.

  • sedulen created

    Hi @ismcagdas ,

    Thank you for the reply. Unfortunately, it's not possible to share the source code.

    The explanation of the code is that the args.RequestId is a record in a table. There is a second table that identifies the list of documents to be zip'd up as part of this request. So the BuildZipFileAsync method takes the requestId, gets the list of documents to be zip'd up for this request, and then iterates over the list.

    For each document, it gets the Stream of the document from our storage provider (Azure Blob Storage) and copies the stream into the zipfile as a new entry.

    The method also compiles a "readme.txt" that lists all of the documents and some additional metadata about each file.

    Ultimately that zipfile is retained as another Stream, which is sent back to our storage provider for persistent storage, and then the id of that new file is what is passed on to the notification so that the user notification can reference the zipfile.

    The only additional code that I have in place is for handling the streams. Since these zipfiles could contain hundreds of files, I didn't want to deal with potential memory pressure issues, so I try to never hold onto a file as a MemoryStream. Instead I use FileStreams in a dedicated temporary folder on that node. I have this wrapped in a StreamFactory class. I do associate the streams that are in-use with the UnitOfWork, so when that UnitOfWork is disposed, I ensure that those FileStreams are 0'd out if I can, and that they are properly disposed of.
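
    To give a rough idea of the shape of that code (this is only an illustration of the approach, not the actual ZipFileManager/StreamFactory source; _storageProvider.OpenReadAsync and _tempFolder are stand-in names):

        // Illustration only (System.IO.Compression): build the zip over a temp
        // FileStream rather than a MemoryStream, copying each document stream in as an entry.
        var tempPath = Path.Combine(_tempFolder, $"{requestId}.zip");
        using (var zipStream = new FileStream(tempPath, FileMode.Create, FileAccess.ReadWrite))
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Create))
        {
            foreach (var document in documents)
            {
                var entry = archive.CreateEntry(document.FileName, CompressionLevel.Optimal);
                using (var entryStream = entry.Open())
                using (var blobStream = await _storageProvider.OpenReadAsync(document.Id))
                {
                    await blobStream.CopyToAsync(entryStream);
                }
            }
        }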

    This StreamFactory code & strategy has been in place for years, and is used extensively throughout my application, so if this were the root cause of my issue, I would expect to be seeing similar issues in other features of my application. So I'm doubtful that this is it, but I also don't want to rule anything out.

    I did some testing within my company yesterday with regards to concurrency, where we triggered ~10-15 of these requests all at once, and we did not observe any issues. So far this issue has only appeared in production. I have 2 non-production Azure-hosted environments plus I have my local laptop environment which I can expose to other users through ngrok.io, and we can't reproduce this issue anywhere else, which leads me further towards something environmental.

    For what is actually being committed in that await uow.CompleteAsync(); statement, at the RDBMS level, I am inserting 1 new record into 1 table, and updating another record in another table with the ID (FK). (Basically - add a new document, and then tell the "request" what document represents the zipfile that was just generated). So the SQL workload should be extremely lightweight.
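
    In code terms, the committed work is roughly this shape (the entity and repository names are made up for illustration, not my real ones):

        // Illustration only: one insert plus one FK update per request
        var zipDocumentId = await _documentRepository.InsertAndGetIdAsync(zipDocument);
        var request = await _requestRepository.GetAsync(args.RequestId);
        request.ZipFileDocumentId = zipDocumentId;   // flushed when the UoW completes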

    Recognizing that I'm on older versions, I'm looking at upgrading ABP from v4.10 to v4.21, but in the collective release notes, I'm not seeing anything that would affect this. I'm also looking at upgrading Hangfire from v1.7.27 to v1.7.35, but again I don't see anything that would affect this.

    I had been running ABP v4.5 for a very long time, working on a plan to upgrade to v8.x "soon". Earlier this year, I ran into a SQL connection pool starvation issue in PROD, and determined that it was caused by using AsyncHelper.RunSync. Increasing the minimum thread pool size resolved the immediate issue, and that's when I decided to upgrade from v4.5 to v4.10, as you did great work at that time to reduce the usage of AsyncHelper.RunSync.
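
    (For completeness, the thread pool change was just the standard knob applied at startup; the numbers below are illustrative, not the exact values I used:)

        // Illustrative values: raise the minimum worker/IOCP threads so that
        // AsyncHelper.RunSync blocking doesn't starve the pool as quickly
        ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200);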

    I am going to continue to look at the health of our Azure SQL database, to see if we have anything interfering that may cause the UnitOfWork to hang. I am also going to continue to look at ConnectionTimeout, CommandTimeout, and TransactionTimeout settings, as well as my Hangfire configuration. Honestly, I don't mind if the zipfile creation fails; I can implement retry logic here for resilience. What bothers me is that the transaction seems to hang indefinitely, to the point where Hangfire thinks the running job has been orphaned and queues another instance of the same job, and this hanging job locks/blocks other instances of the same job.

    Thanks again for the reply. Let me know if anything else comes to mind.

    -Brian

  • ismcagdas created (Support Team)

    Hi,

    For each document, get the Stream of the document from our storage provider (Azure Blob Storage) and copy the stream into the zipfile as a new entry.

    I think the problem might be related to this part. It is hard to make a prediction without seeing the source code.

  • sedulen created

    @ismcagdas ,

    To follow up on my post from last month, I have resolved the issue by reviewing & refactoring my AsyncBackgroundJob code. I was still using a legacy implementation that called .Execute, which wrapped the call to .ExecuteAsync in AsyncHelper.RunSync.

    Now my background jobs call .ExecuteAsync directly, and Hangfire awaits the task natively. This issue has not occurred since deploying that update to production.
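
    For anyone hitting the same thing, the change boils down to something like this on the enqueue side (a sketch only; my actual wiring goes through our own wrapper around Hangfire):

        // Before (legacy): Hangfire invoked the synchronous wrapper, which ran
        // ExecuteAsync via AsyncHelper.RunSync
        // BackgroundJob.Enqueue<BuildZipfileBackgroundJob>(job => job.Execute(args));

        // After: enqueue the async method itself so Hangfire awaits the returned Task
        BackgroundJob.Enqueue<BuildZipfileBackgroundJob>(job => job.ExecuteAsync(args));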

    Cheers! -Brian