A while back I wrote about an issue with COM and .Net Interops. Well, a related issue has arisen. The specifics still seem a bit fuzzy and a solution hasn’t been implemented yet.
The most reproducible scenario goes like this: User A logs in and opens the web app. User B also logs in and opens the web app. User A decides not to complete the form and logs out. User B continues to use the web application, and when it refreshes to get a list of codes, the web app dies. Obviously, this is very disconcerting, and it is a scenario that didn’t get tested a lot.
There’s one tester and one developer. Each person opens one session and runs through the app. In this mode the error is difficult to reproduce, though we suspect it is possible. I banged on the app very hard this way and could not make it fail. However, the scenario above is very simple and causes lots of problems.
From a technical point of view, it seems very difficult for the two sessions to interfere with one another. Yet everything seems to confirm that they do. When session B initializes it gets a reference to a CodeManager and holds that at the class level. It seems that when session A closes all its references it also closes the CodeManager reference of session B. Session B only gets a CodeManager when it initializes, the first time the page is displayed. After that it assumes the reference is still good, which is incorrect in this case. The call to get codes generates an RCW error.
A few solutions have been tried at this point. The first: if there aren’t enough CodeManager references, make sure there are extras. That worked in the past, but not this time. I saw it generate the error, and then session B logged out and released 3 CodeManagers.
Another solution is to move the scope of the CodeManager variable from class to method. This works, but increases the number of CodeManagers created and destroyed. This is not a strong point of .Net/COM. In fact, some of the existing calls should be re-evaluated to move the calls from method to class scope.
The solution that seems to work the best is to detect the error with try/catch and re-reference the CodeManager the same way as when the object is initialized. From then on the code would execute as expected. This looks like the best solution, but it requires refactoring the code to put suspect calls to CodeManager in a new function. This new function needs to call itself again if the first time fails due to an RCW error and throw an exception if it doesn’t. It’s also not elegant or attractive looking in the code.
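The try/catch idea can be sketched roughly as follows. This is a Python simulation, not the actual .Net code: RcwError, CodeManager, Page and the factory argument are hypothetical stand-ins, and the sketch assumes that "throw an exception" means giving up when the second attempt also fails.

```python
class RcwError(Exception):
    """Stand-in for the error .Net raises when the COM object behind a reference is gone."""


class CodeManager:
    """Hypothetical stand-in for the COM CodeManager."""

    def __init__(self):
        self.alive = True

    def get_codes(self):
        if not self.alive:
            raise RcwError("COM object has been released")
        return ["A1", "B2"]


class Page:
    """Holds a class-level CodeManager reference, the way session B does."""

    def __init__(self, factory):
        self._factory = factory     # how a fresh CodeManager is obtained (assumed)
        self._manager = factory()   # acquired once, when the page initializes

    def get_codes(self, retried=False):
        try:
            return self._manager.get_codes()
        except RcwError:
            if retried:
                raise               # the retry failed as well: surface the error
            # Re-reference the CodeManager as at initialization, then call again.
            self._manager = self._factory()
            return self.get_codes(retried=True)


page = Page(CodeManager)
page._manager.alive = False   # simulate session A's logout killing B's reference
codes = page.get_codes()      # first call fails internally, the retry succeeds
```

Every suspect CodeManager call has to go through a wrapper like get_codes, which is where the refactoring cost comes from.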
We’ll see what happens.
Update 1 (2/7/07):
Today was spent exploring solutions to this problem that would 1) resolve it and 2) do so with the fewest code changes, to reduce the amount of testing necessary. The try/catch method above definitely resolves the issue. However, in this case it requires more refactoring than I like in a patch.
The solution I went with was to move the variable declaration and disposal from class level to method level. Testing showed that this was still susceptible to .Net losing its COM object. Amazingly, I saw one instance where a variable was declared, the next line called a method, and .Net generated an RCW error. Another time, IIS crashed with a pop-up box from the operating system about a memory error. So, in some cases simply moving to a method level variable was insufficient to address the issue.
What always seemed to work was using a full reference to the root object like “codeList = MasterCOM.CodeManager.GetCode()”. The problem with this solution is that it makes an additional COM reference, but leaves .Net without an object to dispose. Every call to GetCode() increases the reference count.
My solution was to use both methods. Switch from class to method variables. Don’t use them. Use the full reference from the root object. Instead of releasing the method variable once, use a FinalRelease that releases multiple times if necessary. This isn’t the FinalRelease from .Net 2.0, but a custom one that releases all but 1 reference. That reference is disposed when the root object is disposed.
So, when an object wants a code list it creates 2 CodeManagers. 1 is released. The next object wants a code list. 2 CodeManagers are created and both are released. The application closes the root object and releases 1 CodeManager. All the blessed COM object counts balance (equal 0 on application close). And no RCW errors are reported.
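The bookkeeping above can be checked with a small simulation. This is a Python sketch under assumptions, not the real patch: ComObject, add_ref, release and release_all_but are hypothetical names, and it assumes each request adds two references to one shared count (the method variable plus the full reference through the root object).

```python
class ComObject:
    """Hypothetical stand-in for a COM object with a reference count."""

    def __init__(self):
        self.ref_count = 0

    def add_ref(self):
        self.ref_count += 1
        return self.ref_count

    def release(self):
        if self.ref_count == 0:
            raise RuntimeError("released more times than referenced")
        self.ref_count -= 1
        return self.ref_count


def release_all_but(obj, keep=1):
    """Custom 'FinalRelease': release until only `keep` references remain.

    Returns how many references were released.
    """
    released = 0
    while obj.ref_count > keep:
        obj.release()
        released += 1
    return released


manager = ComObject()
log = []
for _ in range(2):        # two objects each ask for a code list
    manager.add_ref()     # method-level variable (declared, not used)
    manager.add_ref()     # full reference through the root object
    log.append(release_all_but(manager, keep=1))
log.append(release_all_but(manager, keep=0))  # root object disposed on close
```

Under these assumptions log ends up as [1, 2, 1], matching the 1, 2 and 1 releases described above, and manager.ref_count is back to 0.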
Update 2 (2/9/07):
The patch seems to be running very well. Before the patch, load testing showed ~80 errors in 30 minutes. Some were due to timeouts, which means more errors could have happened in the same amount of time. After the patch, there were 2 errors due to timeouts. It’s very exciting when a critical defect just evaporates. In the end, I don’t understand why it works. Hopefully, we’ll get to spend a few days on that.