.NET Core application hosted from PaasV1

What?! Why?! These days!? You are probably wondering …

There are many service owners running their services on Azure PaasV1 – aka Cloud Services. There are several reasons why this is still needed, e.g. compliance requirements.

If you are in a similar space and want to leverage the power of the .NET Core runtime, read on.

It's not possible to write worker roles on .NET Core. By default, PaasV1 hosts the worker role inside the WaWorkerHost process, which runs on the full .NET Framework runtime. If we want to leverage .NET Core, we need to take another path.

Let’s explore the path

The trick is using the ProgramEntryPoint element from the Azure ServiceDefinition file. You can read more about the whole schema here. It's enough to add the desired application into the package and then execute it. The Azure worker hosting is able to track the process.

ProgramEntryPoint

The .NET Core publish command is able to export the whole application into a folder with all the files needed to run it. What's more, .NET Core 3.0 preview 5 comes with the possibility to publish just one file containing the whole application. That's great, isn't it?

OK, we have a .NET Core application and it's running within PaasV1. We need to integrate the application into the Azure PaasV1 hosting runtime, in other words leverage Microsoft.WindowsAzure.ServiceRuntime. In .NET Core, as a NuGet package! Not possible.

But there is a way.

There is a package called Unofficial.Microsoft.WindowsAzure.ServiceRuntime.Msshrtmi filling this need. It's basically a package containing managed wrappers around native modules, full of P/Invokes. AFAIK, it was not possible to use such so-called mixed (C++/CLI) assemblies from .NET Core directly. It looks like the situation has changed with .NET Core 3.0 preview 6 on Windows OS.

That’s all.

Show me the code!

.NET Core application

In the sample application we just initialize the role manager interop, and then we are able to read the configuration from the cscfg, register callbacks for role status checks, shutdown, etc.
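A minimal sketch of such an entry point, assuming the msshrtmi wrapper exposes the familiar RoleEnvironment API (the setting name and messages are illustrative):

using System;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class Program
{
    public static void Main()
    {
        // Read a setting from the .cscfg file (the setting name is illustrative).
        var setting = RoleEnvironment.GetConfigurationSettingValue("MySetting");
        Console.WriteLine($"MySetting = {setting}");

        // Register callbacks for role status checks and shutdown.
        RoleEnvironment.StatusCheck += (s, e) => Console.WriteLine($"Status check, current status: {e.Status}");
        RoleEnvironment.Stopping += (s, e) => Console.WriteLine("Role is stopping...");

        // Keep the process alive; the PaasV1 host tracks it via ProgramEntryPoint.
        Thread.Sleep(Timeout.Infinite);
    }
}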

.NET Core publish

This command publishes the whole .NET Core application into one executable file.
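Assuming the worker targets Windows, the publish command could look like this (runtime identifier and configuration are just examples):

dotnet publish -c Release -r win-x64 /p:PublishSingleFile=true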

 

Azure service definition

See ProgramEntryPoint tag starting the application.
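A trimmed sketch of the ServiceDefinition (role name, VM size and executable name are placeholders):

<ServiceDefinition name="AzureCloudServiceNetCore" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <WorkerRole name="WorkerRole1" vmsize="Small">
    <Runtime executionContext="limited">
      <EntryPoint>
        <!-- Starts the published .NET Core executable; the host tracks this process. -->
        <ProgramEntryPoint commandLine="NetCoreConsoleApp.exe" setReadyOnProcessStart="true" />
      </EntryPoint>
    </Runtime>
  </WorkerRole>
</ServiceDefinition>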

 

Result

Lines with the prefix WaWorkerHost.exe Information: 0 : NetCoreConsoleApp come from the .NET Core application. We are able to read the configuration settings, respond to role status checks, react to shutdown callbacks, and more.

Complete sample

The whole sample is downloadable from https://github.com/kadukf/blog.msdn/tree/master/.NETCore/AzureCloudServiceNetCore

 

.NET Core 3.0 rocks! Happy coding 😉


Isolated ASP.NET Core MVC testing

The tests should be isolated so that we can run them in any order and in parallel.

Introduction

Recently I was working on a feature targeting ASP.NET Core MVC. I was able to test it at the class level. These tests were pure unit tests and all was fine.

In order to validate that it works correctly, I wanted to be sure that it works as expected when integrated into ASP.NET Core MVC, especially that it works in conjunction with controllers. So I decided to expand the scope of my test. I call this an integration unit test.
I found ASP.NET Core to be quite powerful for writing isolated tests spanning MVC handlers, formatters, etc. together with controllers.

 

Coding

Let’s have a custom formatter which we need to test.
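As an illustration (the real formatter may differ), here is a minimal input formatter that reads a request body with no Content-Type header as JSON; the class name and details are made up for this sketch:

using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc.Formatters;
using Newtonsoft.Json;

// Hypothetical formatter: reads a request body with no Content-Type header as JSON.
public class MissingContentTypeJsonInputFormatter : TextInputFormatter
{
    public MissingContentTypeJsonInputFormatter()
    {
        SupportedMediaTypes.Add("application/json");
        SupportedEncodings.Add(Encoding.UTF8);
    }

    // Take over only when the request does not specify a content type.
    public override bool CanRead(InputFormatterContext context)
        => string.IsNullOrEmpty(context.HttpContext.Request.ContentType);

    public override async Task<InputFormatterResult> ReadRequestBodyAsync(
        InputFormatterContext context, Encoding encoding)
    {
        using (var reader = new StreamReader(context.HttpContext.Request.Body, encoding))
        {
            var body = await reader.ReadToEndAsync();
            var model = JsonConvert.DeserializeObject(body, context.ModelType);
            return await InputFormatterResult.SuccessAsync(model);
        }
    }
}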

Let’s see how to use it in the startup class:
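Registration in Startup could look roughly like this (ASP.NET Core 2.x style, using the formatter sketched above):

using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddMvc(options =>
        {
            // Put the custom formatter first so it is evaluated before the built-in ones.
            options.InputFormatters.Insert(0, new MissingContentTypeJsonInputFormatter());
        });
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseMvc();
    }
}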

Let’s write an integration test for it.

First we need a controller which will be used for testing. The aim of the controller is to return the input data.
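For example, a simple echo controller (the name, route and model are illustrative):

using Microsoft.AspNetCore.Mvc;

[Route("api/[controller]")]
public class EchoController : Controller
{
    // Returns the deserialized input back so the test can assert on it.
    [HttpPost]
    public IActionResult Post([FromBody] EchoModel model) => Ok(model);
}

public class EchoModel
{
    public string Name { get; set; }
}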

Test

Let's write the test finally! The goal is to test that a POST request without a content type specified is processed the same way as it would be when using content-type: application/json.
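A sketch of such a test using TestServer and xUnit, reusing the hypothetical Startup and EchoController from above:

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.TestHost;
using Xunit;

public class MissingContentTypeTests
{
    [Fact]
    public async Task Post_WithoutContentType_IsHandledLikeJson()
    {
        var server = new TestServer(new WebHostBuilder().UseStartup<Startup>());
        var client = server.CreateClient();

        // Build a JSON body but strip the Content-Type header on purpose.
        var content = new StringContent("{\"name\":\"test\"}", Encoding.UTF8);
        content.Headers.ContentType = null;

        var response = await client.PostAsync("/api/echo", content);

        response.EnsureSuccessStatusCode();
        Assert.Contains("test", await response.Content.ReadAsStringAsync());
    }
}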

 

All works fine until there are several such tests with different controllers and routes. Just a side note: when using ASP.NET Core MVC, the Microsoft.AspNetCore.Mvc.Controllers.ControllerFeatureProvider class is responsible for discovering the controllers to be used.

ASP.NET Core is greatly customizable, and that also covers the controller discovery process.

 

Test Isolation

In order to have highly isolated integration tests, we need to discover only the specific controllers for our tests. This is possible by configuring another MVC component, Microsoft.AspNetCore.Mvc.ApplicationParts.ApplicationPartManager. Let's see:
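One way to do it is to swap the default ControllerFeatureProvider for one that admits only an explicit list of controllers. The helper names below are mine, not part of the framework:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using Microsoft.AspNetCore.Mvc.ApplicationParts;
using Microsoft.AspNetCore.Mvc.Controllers;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical feature provider exposing only an explicit set of controllers.
public class FixedControllerFeatureProvider : ControllerFeatureProvider
{
    private readonly HashSet<TypeInfo> _allowed;

    public FixedControllerFeatureProvider(IEnumerable<Type> controllerTypes)
    {
        _allowed = new HashSet<TypeInfo>(controllerTypes.Select(t => t.GetTypeInfo()));
    }

    protected override bool IsController(TypeInfo typeInfo) => _allowed.Contains(typeInfo);
}

public static class MvcBuilderExtensions
{
    // Replaces the default controller discovery with a fixed set of controllers.
    public static IMvcBuilder UseControllers(this IMvcBuilder builder, params Type[] controllerTypes)
    {
        return builder.ConfigureApplicationPartManager(manager =>
        {
            var defaultProvider = manager.FeatureProviders.OfType<ControllerFeatureProvider>().FirstOrDefault();
            if (defaultProvider != null)
            {
                manager.FeatureProviders.Remove(defaultProvider);
            }

            manager.FeatureProviders.Add(new FixedControllerFeatureProvider(controllerTypes));
        });
    }
}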

I created it as an extension method so I can use it from the tests.

Final version

Now, the integration test is fully isolated and contains only the controllers needed for the test.
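Putting it together, the isolated test builds its own host and registers only the controller it needs (again using the hypothetical helpers from above):

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.TestHost;
using Microsoft.Extensions.DependencyInjection;
using Xunit;

public class IsolatedEchoTests
{
    [Fact]
    public async Task Post_WithoutContentType_IsHandledLikeJson()
    {
        var builder = new WebHostBuilder()
            .ConfigureServices(services =>
                services.AddMvc(o => o.InputFormatters.Insert(0, new MissingContentTypeJsonInputFormatter()))
                        .UseControllers(typeof(EchoController))) // only this controller is discovered
            .Configure(app => app.UseMvc());

        var client = new TestServer(builder).CreateClient();

        var content = new StringContent("{\"name\":\"test\"}", Encoding.UTF8);
        content.Headers.ContentType = null;

        var response = await client.PostAsync("/api/echo", content);
        response.EnsureSuccessStatusCode();
    }
}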

 

Summary

As usual, the whole sample is located at https://github.com/kadukf/blog.msdn/tree/master/.NET/ControllerFeatureProvider

Happy coding!

CosmosDB change feed support for manual checkpoint

It's been quite a long time since I wrote the last post about the change feed processor. Let's continue the series. This time, let's look into the manual checkpoint feature.

Basics

Let's refresh the knowledge. The CosmosDB change feed processor SDK is a library for dispatching document-related changes (inserts, updates) as a stream of documents. The documents can be split into multiple partitions (determined by the partition key). The change feed processor is able to read the changes from all partitions, and the reading is executed in batches. In order to move forward with reading the documents, the change feed processor needs to track the last processed batch. The tracking is done by storing the continuation token from the last processed batch – that's the "checkpoint" method. The checkpoints are kept in the lease documents.

Automatic checkpoint

By default, the change feed processor checkpoints the processed batches automatically. Updating the lease document takes some time and costs some request units (RUs). In order to fine-tune it, it’s possible to set the frequency of this process.
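For illustration, the frequency can be tuned via ChangeFeedProcessorOptions.CheckpointFrequency; the values below are placeholders, not recommendations:

var options = new ChangeFeedProcessorOptions
{
    CheckpointFrequency = new CheckpointFrequency
    {
        ProcessedDocumentCount = 100,          // checkpoint after every 100 processed documents...
        TimeInterval = TimeSpan.FromMinutes(5) // ...or every 5 minutes, whichever comes first
    }
};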

Let’s look at the change feed processing logic for one partition in the pseudo steps:

  1. read the change feed from the last known “continuation token”
  2. dispatch the documents for processing by calling IObserver.ProcessChangesAsync
  3. wait for the observer to process the whole batch
  4. if it's time to checkpoint (based on the configured checkpoint frequency), update the lease document with the continuation token from the last processed batch of documents. That's the so-called "checkpoint" step.
  5. repeat from step 1

The whole process is shown on the following picture:

In most cases, the above algorithm works without issues. But do you see a possible problem?

Stuck partition problem

It's step 3. The automatic checkpoint algorithm waits until the whole batch is processed and only then checkpoints. While it waits, no other documents are read and hence none are processed. It could happen that, e.g. in a batch with 10 documents, there is one document whose processing takes several minutes. In such a case, the whole change feed partition processing is "paused" and you can't do anything about it. Such a situation can be easily detected when the change feed estimator is used (for more information see my previous post). See the following graph showing the remaining work on such a stuck partition:

I simulated that the processing of one document takes several hours. This stopped the reading from the change feed (the number of polling reads per second dropped to zero). The remaining work started to grow. If there were a defined SLA stating that the events are scheduled for handling within 2 seconds from the time they were inserted into the collection, then the system would not be able to meet it.

Solution – manual checkpoint

The solution is to use the manual checkpoint. It's possible to turn off the automatic checkpoint process. See the following code:
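In the v2 library this is the ExplicitCheckpoint flag; roughly:

var options = new ChangeFeedProcessorOptions
{
    CheckpointFrequency = new CheckpointFrequency
    {
        // The library no longer checkpoints on its own; the observer must call CheckpointAsync.
        ExplicitCheckpoint = true
    }
};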

By doing this, the resulting lease document is never automatically updated. See the following:

That's not good. After a process restart, the change feed would start back from the beginning. We need to checkpoint manually. That's possible using the CheckpointAsync method exposed on the IChangeFeedObserverContext interface. But when? We cannot wait for the whole batch to be processed and then call CheckpointAsync. A simple solution could be like this:

  1. read the change feed from the last known “continuation token”
  2. dispatch the documents for processing by calling IObserver.ProcessChangesAsync
  3. schedule the processing of all documents; this will create as many Tasks as there are documents,
  4. register the scheduled Tasks and the IChangeFeedObserverContext instance into a manual periodic checkpointer
  5. repeat from step 1

See the sample code of the observer:
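Here is a minimal sketch of the idea against the v2 IChangeFeedObserver interface; the checkpointer, the 5-second period and the ProcessDocumentAsync placeholder are my assumptions, not the library's API:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.ChangeFeedProcessor.FeedProcessing;

// Sketch of an observer that schedules the work and checkpoints periodically.
public class ManualCheckpointObserver : IChangeFeedObserver
{
    private readonly ConcurrentQueue<(Task Batch, IChangeFeedObserverContext Context)> _pending
        = new ConcurrentQueue<(Task Batch, IChangeFeedObserverContext Context)>();

    private CancellationTokenSource _cts;
    private Task _checkpointer;

    public Task OpenAsync(IChangeFeedObserverContext context)
    {
        // Start the periodic checkpointer when the partition observer is opened.
        _cts = new CancellationTokenSource();
        _checkpointer = RunCheckpointerAsync(_cts.Token);
        return Task.CompletedTask;
    }

    public async Task CloseAsync(IChangeFeedObserverContext context, ChangeFeedObserverCloseReason reason)
    {
        // Stop the checkpointer when the partition observer is closed.
        _cts.Cancel();
        await _checkpointer;
    }

    public Task ProcessChangesAsync(IChangeFeedObserverContext context, IReadOnlyList<Document> docs, CancellationToken cancellationToken)
    {
        // Schedule one Task per document and enqueue the whole batch for later checkpointing.
        var batch = Task.WhenAll(docs.Select(ProcessDocumentAsync));
        _pending.Enqueue((batch, context));

        // Return immediately so the feed keeps flowing while the documents are processed.
        return Task.CompletedTask;
    }

    private async Task RunCheckpointerAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            // Checkpoint finished batches in order, so the continuation token never jumps ahead.
            while (_pending.TryPeek(out var entry) && entry.Batch.IsCompleted)
            {
                _pending.TryDequeue(out entry);
                await entry.Context.CheckpointAsync();
            }

            await Task.Delay(TimeSpan.FromSeconds(5), token).ContinueWith(_ => { }); // swallow cancellation
        }
    }

    private Task ProcessDocumentAsync(Document document) => Task.Delay(100); // placeholder for the real work
}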

The code above is for demonstration purposes. It integrates a periodic partition checkpointer into the observer. The checkpointer is started when the observer is opened and stopped when the observer is closed. In between, when the batches are dispatched for processing, the processing tasks are scheduled and the whole batch is enqueued into the checkpointer. The checkpointer periodically checks for the last processed batch in the queue. For the running example, check the sample here.

After running the sample, the result is:

And the console would look like:

Pros and cons of the solution

The above solution solves the issue with the stuck partition, and that's great (advantage). The change feed processor keeps reading the changes while the previously read documents are still being processed. But if reading the documents and scheduling the work is much faster than the actual processing, it's possible that the instance will be overloaded (disadvantage). It's necessary to limit the number of concurrent document processings, e.g. using a semaphore.
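A minimal sketch of such throttling with a SemaphoreSlim (the limit of 16 and the processing placeholder are arbitrary):

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;

public static class ThrottledProcessing
{
    // At most 16 documents are processed concurrently per instance.
    private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(16);

    public static async Task ProcessDocumentThrottledAsync(Document document)
    {
        await Throttle.WaitAsync();
        try
        {
            await ProcessDocumentAsync(document); // the actual document processing
        }
        finally
        {
            Throttle.Release();
        }
    }

    private static Task ProcessDocumentAsync(Document document) => Task.Delay(100); // placeholder
}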

That's all for now. In future posts I'll describe the pitfalls of change feed processing in general.

CosmosDB change feed processor 2.1 released (monitoring enhancements)

There is a new version of the CosmosDB change feed processor library released, version 2.1, and the two main enhancements are:

  1. Upgrade to Microsoft.Azure.DocumentDB 2.0 nuget package
  2. Extensions to the monitoring

 

Microsoft.Azure.DocumentDB upgrade

The change feed library was built against the latest SDK 2.0. In short: the new SDK version brings several improvements, and one of the most important changes is the new multiplexing connection model. More about it in a future blog post.

 

Extensions to the monitoring

When you take change feed processing seriously, I mean production ready, you need to be sure that the processing of the feed is working as expected. You need to be sure that "the feed is flowing", which means the feed is in progress and, as soon as there is a document change (insert, replace), the change feed processor receives it.

 

I wrote about the monitoring in my previous post. The whole monitoring is built on the so-called "remaining work estimator", which estimates the number of documents remaining till the end of the feed. There was an improvement introduced into the library.

It's possible to get the estimation per partition now! Why does it matter? Because it gives you better visibility into the system. You are able to see which partition is left behind and by how far.

Let’s see how to create the remaining work estimator instance:
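Roughly, the estimator is created through the builder; the account, database and collection names below are placeholders:

using System;
using Microsoft.Azure.Documents.ChangeFeedProcessor;

var feedCollectionInfo = new DocumentCollectionInfo
{
    Uri = new Uri("https://<account>.documents.azure.com:443/"),
    MasterKey = "<key>",
    DatabaseName = "db",
    CollectionName = "data"
};

var leaseCollectionInfo = new DocumentCollectionInfo
{
    Uri = new Uri("https://<account>.documents.azure.com:443/"),
    MasterKey = "<key>",
    DatabaseName = "db",
    CollectionName = "leases"
};

var estimator = await new ChangeFeedProcessorBuilder()
    .WithHostName("monitoring-host")
    .WithFeedCollection(feedCollectionInfo)
    .WithLeaseCollection(leaseCollectionInfo)
    .BuildEstimatorAsync();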

And let’s see how to use it:
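The per-partition estimation (new in 2.1) can then be queried like this. I believe the member names are GetEstimatedRemainingWorkPerPartitionAsync, PartitionKeyRangeId and RemainingWork, but treat them as assumptions and check the package:

// One entry per partition key range with its estimated backlog.
var perPartition = await estimator.GetEstimatedRemainingWorkPerPartitionAsync();
foreach (var work in perPartition)
{
    Console.WriteLine($"Partition {work.PartitionKeyRangeId}: ~{work.RemainingWork} documents behind");
}

// The total across all partitions is still available as before.
var total = await estimator.GetEstimatedRemainingWork();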

And that’s the result:

 

 

Real life scenario

I'm working on an event sourcing system powered by CosmosDB (I'll be writing about it in future posts) and we are heavily dependent on the change feed. We need to be sure the system works 24/7. We need to be prepared for failures also when using the change feed, in this case when a partition's processing is stuck. So, we are monitoring the change feed on several levels.

 

 

Who processes what

We monitor which process consumes which partition. In other words, we are able to say which process is consuming which partitions. In practice, we record the following metric: data center, role, instance id, database account, partitions.

 

What’s the estimated work left to process

We have a runner which runs periodically and reports the estimated work left per account/partition. That's the input for the graphs and alerting. It is built on top of the estimator shown in this post. If any estimated work hits a limit, it means we have a problem with a "stuck" partition.

 

See the graph of the simulation from our test environment:

 

The red dotted line is the alert level. Once we are alerted, we are able to see which partitions are stuck. Because we record who processes what, we are able to find out which instance was processing the partition last and diagnose the issue.

 

That’s all for now, happy monitoring!

 

 

Previous posts:

  1. New Azure CosmosDB change feed processor released!
  2. Logging in CosmosDB change feed processor
  3. CosmosDB change feed monitoring
  4. CosmosDB change feed processor – leases and work balancing strategy

CosmosDB change feed processor – leases and work balancing strategy

Work distribution is one of the core change feed processor features. The purpose of the change feed processor is to receive and dispatch all the documents from all partitions of the feed collection for further processing. In case of high load, it's necessary to scale out the number of processors so that the documents are processed within the desired SLA.

As you can see, the change feed processing system has to deal with this. The system has to distribute the work between the instances somehow. That's the aim of the so-called "partitions balancing strategy". The change feed processor library contains a default implementation of this strategy, "EqualPartitionsBalancingStrategy".

 

 

EqualPartitionsBalancingStrategy

This strategy tries to distribute the work evenly between all active hosts. The algorithm is based on the lease documents.

Just a recap (taken from my first post in this series). There are two collections in the game:

  • Data collection – also known as the feed collection. It's the collection of the data over which the change feed is processed. It can be created as a partitioned collection so that the data are split into separate physical locations.
  • Lease collection – it's the collection where the change feed processor keeps the pointer to the last processed document per partition.

 

Let’s see an example of such lease document:

{
  "id": "account.documents.azure.com_EH9XAA==_EH9XAP7EOwA=..2",
  "PartitionId": "2",
  "Owner": "unique_owner_id",
  "ContinuationToken": ""2468759"",
  "properties": {},
  "timestamp": "2018-08-01T20:43:54.9476294Z"
}


There will be as many lease documents created as there are partitions in the feed collection.

Id

Each lease document has “id” which is in the following form:

{prefix}{feed database account}_{feed database resource id}_{collection resource id}..{partition range ID}

The prefix is used to differentiate the change feed processor groups if they use the same lease collection. The rest of the parameters in the "id" are self-explanatory.

 

Owner

That's a string which uniquely identifies the processor instance. Make sure this string is unique per processor instance. Otherwise, the same documents will be processed as many times as you have processor instances. We used to name it the following way: {deployment id}_{processor name}_{instance id}

 

Continuation token

That’s the pointer of the last successfully processed document change on the feed.

 

Timestamp

That’s the epoch timestamp of the last change feed checkpoint time or heartbeat time. This timestamp is used by the work balancing strategy to determine whether the lease is expired or not.

If the lease is expired, then it’s available to be taken over by any change feed processor instance.

If the lease is not expired, it means that the instance is active and the feed is processed.

The aim of the strategy is to distribute the work evenly between all active instances.

 

Properties

That's a free-form dictionary of string to string. The properties are the way the strategy executions communicate with each other. It gives you the possibility to create more advanced strategies, e.g. when shifting the load between "A/B" deployments.

The properties can be manipulated by the strategy execution and then they will be persisted.

 

Configuration – Lease expiration interval

The strategy is configured by ChangeFeedProcessorOptions.LeaseExpirationInterval. It's the interval for which the lease is taken by a particular owner. It's also the interval determining how often the strategy is executed.
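A hedged sketch of setting it through the builder (the values are illustrative, not recommendations; the collection info objects and MyObserver are placeholders for your own):

var options = new ChangeFeedProcessorOptions
{
    LeaseExpirationInterval = TimeSpan.FromSeconds(60), // lease is considered expired after this interval
    LeaseRenewInterval = TimeSpan.FromSeconds(17),      // how often an owner renews its leases
    LeaseAcquireInterval = TimeSpan.FromSeconds(13)     // how often an instance looks for leases to take
};

var processor = await new ChangeFeedProcessorBuilder()
    .WithHostName("unique_owner_id")
    .WithFeedCollection(feedCollectionInfo)
    .WithLeaseCollection(leaseCollectionInfo)
    .WithProcessorOptions(options)
    .WithObserver<MyObserver>() // placeholder for your IChangeFeedObserver implementation
    .BuildAsync();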

So, if the lease is not renewed or checkpointed within this interval, it will be identified as expired and the ownership of the partition will move to another change feed processor instance.

If the processor is stopped, the lease is freed up. It means that the partition won't be processed until some processor instance takes it over (which depends on the lease expiration interval).

There are special situations when you need to enhance the default work balancing strategy. Imagine you stop an instance. The partitions processed by the stopped instance are not processed until some other instance takes them over. This "gap" depends on the expiration interval. So, if you have a strictly time-limited SLA, you will need to write a custom strategy leveraging the properties. Ping me if you need more information about it.

Bear in mind, this is a really rare condition and in almost all cases you will be more than happy with the default strategy. We run quite a heavy system using the change feed processor with the default strategy without issues.

 

Conclusion, consequences

In case a change feed processor instance crashes and then starts again within the lease expiration time, it will take back all previously owned leases immediately after start.

In case the instance is gracefully stopped, the owner is reset, which means the lease is free to be owned by another change feed processor instance.

In case the number of change feed processor instances is scaled out, the new ideal work balance is calculated based on the new set of active instances. The new instance(s) will try to take ownership of partitions one at a time per execution cycle, which is configured by the lease expiration interval.

Keep in mind that the leases are stored in the lease collection. The collection has its own throughput (== money). In case you set a low expiration interval, you can end up with throttling issues because the leases are scanned once per lease expiration interval by each instance. This can cause the leases not to be taken for a longer time and their processing to be delayed.

 

Previous posts:

  1. New Azure CosmosDB change feed processor released!
  2. Logging in CosmosDB change feed processor
  3. CosmosDB change feed monitoring