This is a continuation of another thread: "What's better: modern built-in motherboard sound chip or old sound card?"
I know NT isn't real-time, but 50ms for an IRQ to be handled sounds ludicrous. And AFAIK, data processing isn't done directly in the IRQ handler, instead some state information is saved and passed down as an IRP, and the IRQ handler itself finishes quickly. Iirc Linux does somewhat the same by having "high" and "low" parts of their IRQ handlers.-f0dder
In a real-time driver, there are high, middle, and low parts.
The highest (hardware interrupt) part strictly services the hardware and schedules the software interrupts and their service priorities (usually 0-63). It may schedule several software interrupts, since different parts of the job (I/O initialization, continued I/O service, I/O completion) may require different priorities. It will also grab the data and cache it if there is any. It's usually about 15 instructions or less and, of course, it's reentrant code.
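As a rough sketch of what a first-tier handler looks like (the DEV_* register macros, the rtos_post_swi() call, and the priority numbers are all placeholders invented for illustration, not any particular RTOS's API):

    #include <stdint.h>

    /* Hypothetical first-tier (hardware interrupt) handler.  It only
     * identifies the interrupt, caches any data, posts software interrupts
     * for the later tiers at their service priorities, and acknowledges
     * the device.  Everything here must be reentrant. */
    #define RING_SIZE 256
    #define RING_MASK (RING_SIZE - 1)

    static volatile uint8_t  rx_ring[RING_SIZE];
    static volatile unsigned rx_head;

    void dev_hw_isr(void)
    {
        uint32_t status = DEV_READ(DEV_STATUS_REG);          /* identify the cause */

        if (status & DEV_RX_READY) {
            rx_ring[rx_head++ & RING_MASK] = DEV_READ(DEV_DATA_REG);   /* cache it */
            rtos_post_swi(SWI_RX_SERVICE, 8);      /* continued service, mid tier  */
        }
        if (status & DEV_IO_DONE)
            rtos_post_swi(SWI_IO_COMPLETE, 40);    /* completion work, lower prio  */

        DEV_WRITE(DEV_STATUS_REG, status);         /* acknowledge/clear            */
    }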
The middle-tier routines service the data. This must also be reentrant code, so 95% of the system calls can't be made from this level. Obviously, no C library calls can be made from this level either, since the C library isn't reentrant. If necessary, this level will schedule a completion routine to be executed at the next level.
For the lowest tier (completion routines), the OS does save all processor registers automatically so there's high context-switch overhead entering this tier. The good news is that your code does not have to be reentrant, so all the system calls are available to you as well as the C library.
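Continuing that made-up sketch, the practical difference between the middle and lowest tiers is what you're allowed to call (frame_assemble(), rtos_post_completion(), rtos_signal() and friends are placeholders, not a real API):

    #include <stdio.h>

    /* rx_ring and rx_head come from the first-tier sketch above. */
    static volatile unsigned rx_tail;

    void io_complete(void);   /* forward declaration for the lowest tier */

    /* Middle tier: must be reentrant, so no C library and almost no system
     * calls.  It services the cached data and, when a transfer finishes,
     * schedules a completion routine rather than doing the heavy work here. */
    void swi_rx_service(void)
    {
        while (rx_tail != rx_head)
            frame_assemble(rx_ring[rx_tail++ & RING_MASK]);   /* reentrant code */

        if (frame_complete())
            rtos_post_completion(io_complete, 40);   /* defer to the lowest tier */
    }

    /* Lowest tier (completion routine): the OS has already saved every
     * register for us, so this code need not be reentrant and the full
     * system-call interface plus the C library are available. */
    void io_complete(void)
    {
        printf("frame received, %u bytes\n", frame_length());  /* C library OK  */
        rtos_signal(&app_waiting_task);                 /* wake the application */
    }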
It's interesting to note the service rates of the tiers: highest, 3000 interrupts/sec; middle, 300 interrupts/sec; lowest, 30 interrupts/sec. Note that the maximum service rate of the lowest tier is the same in a real-time OS as in a conventional OS. That's because both have the same context-switch overhead at this level, since both are saving/restoring all the registers.
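Just to turn those rates into per-event time budgets (simple arithmetic on the figures above):

    3000 interrupts/sec  ->  about 333 microseconds per event at the highest tier
     300 interrupts/sec  ->  about 3.3 milliseconds per event at the middle tier
      30 interrupts/sec  ->  about 33 milliseconds per event at the lowest tier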
For (real-time) rate monotonic scheduling, we want each completion routine to have its own unique priority so there's deterministic (ordered) execution. That's why real-time OSes (RTOSes) have so many priorities.
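To make the rate-monotonic idea concrete, here is a textbook-style task set (the numbers are invented for illustration, not taken from any real system). Rate-monotonic assignment simply gives the shortest-period completion routine the highest unique priority:

    Task   Period   Worst-case runtime   RMS priority (0 = highest)
    A      10 ms    1 ms                 0
    B      25 ms    2 ms                 1
    C      50 ms    5 ms                 2

    Utilization U = 1/10 + 2/25 + 5/50 = 0.28
    Liu-Layland bound for 3 tasks: 3 * (2^(1/3) - 1) is about 0.78
    Since 0.28 <= 0.78, every task is guaranteed to meet its deadline
    under fixed rate-monotonic priorities.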
Windows is sluggish at handling interrupts. I've had problems with National Instruments multifunction I/O cards giving me 50 ms service rates, and National says there's nothing they can do about it. I admit these laboratory machines have a lot of I/O going on in them, though. That's why National offers a 486 processor running the Phar Lap OS (an RTOS) for real-time control needs on Windows.
Edit: I just realized this was a driver-service problem with several Windows 95 machines. The "native" Windows 2000 driver model should perform much better.
Hadn't heard about real-time NT, are you sure you're not thinking of NT embedded?-f0dder
We are definitely talking about the same product. In 2000, it was called Real-time Windows NT, but now Microsoft is calling it Windows Embedded. I just visited their website:
http://msdn.microsof...mbedded/default.aspx
It's a scalable version of Windows, in that you can scale its memory footprint, which is important. I think it's still over 500K when really scaled down, but my information on this spec is old (1997).
Just because something is embedded doesn't mean it has to be hard real-time.-f0dder
I agree. It is possible to do hard real-time in software, but I honestly believe hard real-time tasks are better done in hardware today, because design tools for FPGAs are so easy to use now. In addition, some SoC chips (Excalibur) incorporate both a processor and an FPGA on the same chip, so doing both firmware and a gate-array design does not increase the chip count.
Iirc there's also just one scheduler in the whole of NT, used for both usermode and kernelmode stuff - although there's a distinction between usermode and kernelmode threads. The "scheduler" also isn't a separate modular part, it's interwoven in most of the NT kernel because of its particular design.-f0dder
If that's true, then that's really bad design. Please tell me that's not true. In the application layer, you have two things to deal with that you don't have in the driver layer. One is protection switches (with the Memory Management Unit, MMU), and the other is semaphore testing and processing in the scheduler, which is really messy and a big overhead. Some would also include resource awareness (what resources are tied up by waiting processes), but I'm counting that case under semaphore management here.
In contrast, the driver scheduler has none of this overhead. That makes it really lean and mean, which is something we really want in all OSes. The typical OS implementation (and I think Linux works this way) is to let the high-overhead application-layer scheduler run as a driver-level task at the lowest priority, 63. All other driver-level tasks run at priorities 0-62, so the high-overhead scheduler only runs once they have completed.
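A cartoon of that arrangement (every name here is invented; this is the shape of the idea, not any real kernel's code):

    /* Driver-level tasks own priorities 0-62 and are cheap to dispatch: no
     * MMU protection switch and no semaphore bookkeeping.  The heavyweight
     * application-layer scheduler is simply the priority-63 task, so it only
     * runs once everything at the driver level has completed. */
    void kernel_dispatch_loop(void)
    {
        for (;;) {
            int prio = highest_pending_priority();   /* 0 = highest, 63 = lowest */

            if (prio < 63)
                run_driver_task(prio);               /* lean and mean            */
            else
                run_application_scheduler();         /* protection switch, MMU,  */
                                                     /* semaphore processing     */
        }
    }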
As for priority levels, there's 32 of them, with one being REALTIME. While that priority isn't strictly "realtime" by computer science terms,...-f0dder
I follow what you're saying, but I wouldn't look at it that way. All first-tier (hardware interrupt) driver tasks must complete first. Afterwards, all second-tier driver tasks must complete, and there are no special priorities for these. After that, priorities 0-31 for the main scheduler get attention, where priority 0 is the real-time completion routine (which I "think" is swappable like anything else in the application layer, but maybe there's an exception here). The point is that Windows places its completion routines in protected mode, which means more context-switch overhead (with the MMU), but they would be easier to write and debug than if they were in the driver layer.
Unlike Windows, most OSes require you to reload the entire OS if you enlarge the driver for any reason. This makes developing in the driver layer inconvenient. Although placing the completion routine in the application layer means more context-switch overhead (MMU register switches for protected mode), it is handier for development.
Most RTOS application designs don't even have MMU hardware, so doing completion routines in the third tier of the driver layer makes sense since the application layer isn't protected anyway.