Before I get any deeper into this (oops, too late) I think the best thing is to establish a common baseline so everybody (including me) knows what I'm talking about. Like anything technical, it's essential that everybody is speaking the same language, so a lot of this will be pretty rudimentary. Also, see my signature.
Basic Terminology
- Frame: A frame is the smallest group of samples you should need to be concerned with. Don't think pictures (like video frames) but rather data frames like in networking. Each group of samples has a header both to provide metadata and for muxing and decoding.
- Video Frame: Every video frame, regardless of what standard is used for encoding, contains all the samples for 1 entire picture. The terms are basically interchangeable.
- Audio Frame: The number of samples in an audio frame is determined by the relevant encoding standard. Any further details will be handled by the relevant DirectShow filters so this is already more than you probably need to know.
- Stream: A stream is a sequence of video or audio samples.
- Elementary Stream: This is a stream of video or audio frames. Some files appear to be containers (eg MP3) but are really just elementary streams with additional information tacked on.
- Raw Stream: A raw stream consists of nothing but samples. It typically has no file header and there are no frames. I only mention this because H.264 does not use frames (except in the video sense, obviously), so except for special circumstances (which we shouldn't have to worry about) it should always be in a container.
- Mux: Multiplexing multiple streams together is typically referred to as muxing. This is also similar to multiplexing in a data network except that the frames have to be in sync in terms of timing. Typically this means alternating between 1 video frame and multiple audio frames. This should be handled automatically but I figured it was worth explaining.
- Container: When you mux streams together you put them into a container. The container has its own header to store metadata about the streams so they can be separated correctly later.
- Demux: Predictably demultiplexing the individual streams from a container is better known as demuxing.
- Encoding: Just like any other type of information video and audio have to be encoded in some standardized data format for processing by a computer. Encoding and format are not the same thing (see container) but they're commonly used interchangeably. If you're going to do that just make sure you're clear that's what you mean. If you want to be extra clear it's safer to call it encoding, standard, or encoding standard but even I'm not that much of a language nazi.
- Encoding Standard: Some encoding standards are inseparable from the software used to create them. For example QuickTime refers to both the encoding standard and the encoder. However the most common encoding standards are entirely separate from the software itself. It makes things easier if you can separate them in your head. For example there are encoders called DivX and XviD but both encode video according to standards defined in MPEG-4 Part 2. It is not DivX or XviD video, but rather MPEG-4 ASP or MPEG-4 SP video.
- Definition: Definition is the accuracy of a sample or group of samples. In other words the amount of detail captured. Once a stream is captured the maximum definition is set. It can never be increased but can be decreased.
- Resolution: Resolution is the precision of a sample or group of samples. In other words the amount of detail encoded (stored) for a sample or group of samples. Increasing the resolution does not increase definition but decreasing the resolution permanently decreases the definition.
- Interpolation: If you're familiar with this mathematical term that's all this is. Otherwise just think of it as a mathematical educated guess. It's used to create new information, typically for upscaling to a higher resolution.
Picture Groups
One way to reduce the size of a video stream is to avoid duplicating details which don't change from one frame to the next. This is particularly relevant to screen capture, where it's common for several sequential frames to be exactly the same and many others to be nearly so. This section describes how this can be done by grouping pictures together. To best understand this information I recommend you read through the descriptions, attempt to follow the explanation which follows, and then repeat as many times as necessary.
- I-frame: An Intra picture, more commonly known as an I-frame, is a full picture equivalent to a normal image file.
- Delta Frame: Delta frame is the generic name for any frame which describes changes compared to one or more other frames. It cannot be decoded by itself. Any other frames it references must be decoded first. These may be Intra frames or other Delta frames - often both.
- P Frame: A Predicted picture, more commonly called a P-frame, uses only references to previous frames. These are the most efficient delta frames in terms of encoding and decoding efficiency but the least efficient in terms of file size.
- B Frame: A Bidirectional picture, or B-frame, describes changes relative to both previous and future frames. They are (on average) the most efficient in terms of file size but the least efficient in terms of encoding and decoding efficiency.
- Keyframe: In the most basic terms a Keyframe is an I frame where video decoding can begin without referencing any previous frames. Although keyframe and I-frame are sometimes used interchangeably, not every I-frame is automatically a keyframe. Also be aware that whether a particular I-frame is also a keyframe may be a function of a particular application and not determined exclusively by the properties of the video stream.
- GOP: A Group Of Pictures, or GOP, is a sequence of frames beginning with an I-frame which is followed by 1 or more P-frames and/or B-frames and sometimes also additional I-frames.
- Open GOP: A GOP is considered open if it includes one or more delta frames which reference frames from preceding and/or subsequent GOPs.
- Closed GOP: A closed GOP is entirely self-contained. No frame in a closed GOP references a frame from another GOP. At the very least a closed GOP cannot end with a B-frame.
This is an example of a fairly simple GOP structure, like what you might find in an MPEG-2 file. At the top is the order these frames will play. At the bottom is the order they will be encoded, decoded, and stored. Most of the I and P frames are encoded (and must be decoded) out of order because otherwise the encoder or decoder won't have the necessary information for the preceding B frames.
It's also worth mentioning that the first GOP (00-05) ends with a B frame, so it cannot be closed, while the second (07-09) ends with a P frame, so it can be.
For capturing you may use P frames (depending primarily on your choice of encoder) but never B frames, because they're not suitable for realtime encoding - the encoder has to buffer and wait for future frames before it can encode them, which adds delay. When you reach the final step of encoding for upload (or most other distribution methods) you will rely heavily on B frames to retain maximum quality at a minimal file size.
Containers
- Matroska: Matroska is a free (as in speech, beer, and patent encumbrance) universal multimedia container. If Matroska doesn't support it you almost certainly shouldn't be using it. The only downside is a lack of support in video editing software, and possibly a lack of support by online video services, although YouTube is happy to accept Matroska uploads. If you want to use one of the big commercial editors Matroska will give you problems. Matroska files typically have an MKV extension for video or muxed video and audio. The container can also be used for audio alone, although the MKA extension is more common for that.
- WebM: WebM is Google's open source media container. Actually it's just a subset of Matroska. However, unlike Matroska, you can't put just any stream you want into it. It's intended specifically for VP8 video, a successor to the On2 VP6 codec that Flash Video used. Google bought On2, the company behind VP8, a couple of years back to create an open source competitor to H.264. That hasn't happened yet. This one is more interesting than useful.
- MP4: As the name suggests, MP4 is the official MPEG-4 container. You can put H.264 or MPEG-4 ASP (DivX, XviD, and the like) video in this container and MP3, AAC, or AC3 audio. Other than storing H.264 streams (because they're a pain without a container) I don't have much use for it, because I use LPCM or FLAC audio in my YouTube videos. If you prefer MP3 or AAC it's not a bad choice. Theoretically you can also put MPEG-2 video in it, but I don't know how well software supports that and I don't bother with MPEG-2 any more. The file extension should always be MP4, but you will see M4V used for video-only files and M4A for audio (thank Apple for that one).
- RIFF: RIFF is an ancient and generic container format for any type of data. Because it is not specific to any particular type of data it uses chunks instead of frames. A chunk carries only a minimal header (an identifier and a size); the metadata a frame header would normally hold is instead kept in 1 or more indexes elsewhere in the file. Each index amounts to a master header for a group of chunks.
- WAV: The WAV container is an implementation of RIFF specifically designed for audio streams. Although it can hold various types of audio you shouldn't be using it for anything except LPCM. For practical purposes it's enough to know that a WAV file can be treated like an elementary audio stream.
- AVI: The AVI container is another specialized application of the RIFF format. It stands for Audio Video Interleave; interleaving is just another way to describe muxing. Like all RIFF-based formats, AVI files store the metadata (equivalent to frame headers) in monolithic indexes - in a classic AVI file the main index sits at the end, after the stream data. In our case the indexes would cover 2 streams - 1 each for video and audio.
The relatively primitive nature of AVI's chunk-based approach makes it unsuitable for B-frames, because VfW was designed around the assumption that frames are stored in display order. It can be done, but it is always a hack. That's why VfW has generally been ignored by the x264 developers.
- AVI 1.0: The original AVI specification, commonly referred to as AVI 1.0, is the only type of AVI file you can work with via the equally ancient VfW interface. It is officially limited to a file size of 2GB, although with some trickery AVI 1.0 files up to 4GB can be created and read. You may notice that these sizes exactly match the limitations of the FAT16 and FAT32 file systems respectively.
- AVI 2.0/OpenDML: AVI 2.0 files use a Matrox extension of the AVI 1.0 standard called OpenDML which removes the 2GB/4GB file size limitation. There are other technical differences, but at the end of the day those are the problem of your DirectShow filters. As long as we're going through DirectX (which makes use of VfW components but not VfW itself) this should be the only type of AVI file you need to worry about.
- ASF: This is Microsoft's proprietary container for Windows Media (WMV and WMA). If you capture to these formats you'll use it for initial storage, then convert to something else and switch containers. Almost nobody uses ASF because almost nobody uses WMV, and WMA is used almost exclusively by Windows Media Center. The extension, predictably enough, is ASF.
Video Standards
- AVC/H.264: MPEG-4 Part 10 defines an advanced video encoding standard better known as H.264 (the ITU designation), MPEG-4 AVC, or simply AVC (Advanced Video Coding). It is far and away the best video encoding standard in terms of quality vs. bitrate (file size), but due to the complexity required it is also relatively CPU intensive to encode, and potentially also to decode. It also has a lossless mode which is particularly suited to low motion video like typical screen captures. H.264 streams are raw rather than elementary, so they should always be stored in a container. Although they can go in MPEG PS or MPEG TS containers, MP4 or Matroska are typically used.
- WMV 9: Windows Media Video is a family of video encoding standards of which only WMV 9 is of any particular interest, as it is designed for everything from screen capture software to streaming. The VC-1 standard (SMPTE 421M, seen with the WMV3 and WVC1 FourCCs) used on some Blu-ray discs is essentially a standardized version of WMV 9. At relatively high bitrates the quality is comparable to H.264. Being a proprietary Microsoft technology, WMV streams are stored almost exclusively in Microsoft's ASF container.
- CamStudio Lossless: CamStudio Lossless is an encoding standard implemented by the VfW codec of the same name. There is also a decoder, but no encoder, built into FFmpeg. It is highly optimized for realtime encoding (eg screen capture) of low motion video. Within those constraints it offers unbeatable efficiency (small file sizes). The one caveat for this codec is that it can be tricky to decode in AviSynth: for some reason AviSynth doesn't handle CamStudio Lossless's duplicate frames properly when they come through VfW. To get around this, simply use the FFmpeg-based FFMS2 source filter rather than VfW to decode it (there's a short script sketch just after this list). It can be stored in either AVI or MKV containers.
- UT Video: UT Video is one of the newest lossless codecs. I haven't used it personally, but for general purpose use it is reputed to be as good as it gets. I can almost guarantee it won't hold a candle to CamStudio Lossless for screen captures, but for more "normal" video, rather than the super low motion of a typical screen capture, it is apparently in a class by itself. Even if you don't use it for capturing it should be a great choice for intermediate tasks like editing. Since it's a VfW codec you'll need to put it in either an AVI or Matroska container.
- FFV1: FFV1 is FFmpeg's lossless video standard. It can be encoded either directly with FFmpeg or via VfW using ffdshow (don't look for a release version - it's in perpetual beta). For normal or high motion video it's one of the most efficient, but it is fairly CPU intensive. For screen capture it's probably most suitable as an intermediate format for editing. However it is not considered as fast or efficient as UT Video, so I'd go with that one first. You will need to store it in either an AVI or Matroska container.
- MSU: MSU is a commercial lossless codec that's free for personal use. People seem to either love this one or hate it - either it works flawlessly or it chews up CPU cycles and takes forever. Tests I've seen suggest its efficiency (output file size) is better than FFV1's but not as good as UT Video's. I'll probably never bother to try it out, simply because I prefer open source alternatives, which the rest of the lossless codecs on this list are. Once again, this goes in an AVI or Matroska container.
- HuffYUV: HuffYUV I include more for sentimental reasons than anything else, I suppose. It's the original open source lossless codec and was written by Ben Rudiak-Gould, who also created the first version of AviSynth. It should be less CPU intensive than FFV1 or MSU, but still not as good as UT Video. Efficiency-wise it's definitely at the bottom of the list. However it does have the advantage of being available either as a standalone codec or as part of ffdshow. Again, AVI or Matroska container.
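Since I mentioned the CamStudio Lossless caveat above, here's the kind of AviSynth script I mean. This is just a minimal sketch: it assumes the FFMS2 plugin (ffms2.dll) is installed where AviSynth can find it, and the file name is made up.
# decode a CamStudio Lossless capture through FFmpeg instead of the VfW codec
FFVideoSource("D:\Videos\capture.avi")
Opening the file this way routes decoding through FFmpeg, which is what sidesteps the duplicate frame problem. Note that FFVideoSource returns video only; FFMS2 also provides FFAudioSource if your capture includes an audio track.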
Audio Standards
- LPCM/PCM: LPCM stands for Linear Pulse Code Modulation which may also be referred to as just PCM or uncompressed. This is a sort of universal and simple way of storing audio used for everything from Betamax to CD to DVD and Blu-ray. There is no standard LPCM elementary stream but LPCM in a WAV container can be treated like one.
- FLAC: Free Lossless Audio Codec, or FLAC, is an open source encoding standard for losslessly compressing LPCM audio. Unlike LPCM, FLAC does use elementary streams, which can contain a variety of (mostly CD Audio related) metadata. It can be stored by itself in a FLAC container. It can also be stored in a Matroska container, either by itself or muxed with other streams.
- MP3: MPEG-1 Layer 3 Audio is lossily compressed audio, typically found in an elementary stream which also has a sort of secondary header added for tag metadata (ID3). It can also be muxed into pretty much any container, although typically it's found in MKV, MP4, or AVI files.
- AAC LC: Advanced Audio Coding Low Complexity, often referred to simply as AAC, is part of the MPEG-2 and MPEG-4 standards families. At lower bitrates (128kbps or less) it tends to sound slightly better than MP3 at the same bitrate; at higher bitrates they are more or less equivalent. Apple uses this encoding standard for iTunes downloads. There is no standard elementary stream format, so raw AAC streams are typically found by themselves in MP4 containers. They can also be muxed into MP4, MPEG PS, MPEG TS, or Matroska containers.
- WMA: Windows Media Audio is a family of audio encoding standards which include a lossy standard more or less comparable in quality to MP3 as well as a lossless one. I could give you a lot more details but WMA really isn't worth the effort.
Windows Specific
- Uncompressed Video: In Microsoft land uncompressed video is defined as RGB 4:4:4 (every pixel includes all three color components). Uncompressed video can be encoded or rendered directly without any additional components.
- Uncompressed Audio: Microsoft defines uncompressed audio as LPCM at any bit depth and any samplerate. Uncompressed audio can be encoded or rendered directly without any additional components.
- Compressor: A component used to encode video or audio to a format other than the ones listed above as uncompressed.
- Decompressor: A component for decoding any video or audio format except those listed above as uncompressed.
- Splitter: This is Microsoft speak for demuxer. Functionally it's exactly the same thing.
- Renderer: A component used to send uncompressed video or audio to your display or speakers.
- VfW: VfW (Video for Windows) is Microsoft's ancient attempt to copy QuickTime. Most things you can do via VfW are better handled either through DirectShow or a standalone executable.
- Codec: In VfW Compressors and Decompressors for a given format are typically included in a single Codec. However it's still possible to have a component that's just a compressor or a decompressor.
- DirectShow: This is the DirectX multimedia framework which replaced VfW. While it is a huge improvement in terms of playback, it isn't always reliable for random access.
- Filter: Rather than monolithic components like the codecs in VfW, DirectShow uses discrete components called filters. Each filter performs a single, specialized task like opening or writing a file, splitting streams from a container, decoding or encoding a stream, or rendering uncompressed video or audio. DirectShow filter files have an extension of AX.
- Pins: The inputs and outputs of DirectShow filters are called pins. To send information from one filter to another you connect an output pin on the first filter to an input pin on the second. Depending on the filter, an input or output pin could also connect to a file or a device.
- Filter Graph: In DirectShow the chain of filters used for a given set of operations is called a graph. A graph is built automatically by video playing or processing programs. It consists of the filters themselves and the connections between the various pins.
- GraphEdit: Alternatively you can use a free Microsoft tool called GraphEdit to either open a file and see what filters are involved, or manually build a graph by selecting your own filters and connecting them yourself. You can even save a graph and open it later with various tools, just like opening a file. It's probably one of the better pieces of software Microsoft has ever produced. I actually use an open source tool called GraphStudio, which offers the same basic interface but more features. In either case these are good tools for troubleshooting DirectShow problems.
- Media Foundation: Media Foundation is essentially DirectShow except that it's locked down so they don't have all those pesky volunteer developers turning their crappy closed ecosystem into something infinitely more useful and interesting, like they did with VfW and DirectShow. Just remember that no matter how much a Media Foundation component looks like something from DirectShow, they're intentionally designed not to work with it.
Color
This seems like a simple enough subject. When you look at an object in the real world like your keyboard what you're seeing is whatever frequencies of light aren't being absorbed. Instead they're being reflected back. That's subtractive color. When you look at a computer monitor you are actually seeing light being shined directly at you. That's additive color.
Each pixel on your monitor is actually 3 different dots, one red, one green, and one blue. This is called RGB color space. The intensity of each one determines the final color of the pixel and can be expressed in a value from 0-255. That's 8 bits per color or 24 bits per pixel making it 24 bit color.
Video uses a different color scheme. Instead of RGB it uses YUV. The Y represents luma (light and dark) while the U and V represent chroma (color). Technically the chroma is only red and blue because your eyes are more sensitive to green and therefore it's more or less included in the luma. This is the YUV color space.
That also means the UV components don't need to be at full resolution because you won't be able to tell the difference. In fact they are usually only 1/4 the resolution which saves a lot of space. So now instead of 24 bits per pixel each block of 4 pixels uses 32 bits for luma but only 16 bits for chroma for a total of 12 bits per pixel. In other words half the original size.
Additionally, video uses a different range of colors than your computer. For video, whether RGB or YUV, the legal values run from 16-235 instead of 0-255. That means not every color your monitor displays corresponds to a legal color for video. These illegal values are called out of gamut colors. What this ultimately means is that the colors in your final video will be different from the colors on your desktop. It shouldn't be a big deal, but it's easy to think you're doing something wrong when that happens, or that there's some way to "fix" it. You're not and there isn't.
What you should try to make sure of, though, is that you're not converting between RGB and YUV (actually YV12, once you include the chroma downsampling) more than once. If all you do is capture and encode, that shouldn't be a problem. If you do any kind of editing in between, figure out what colorspace your editing tools use and compare that to your capture codec. If your editing tools operate in RGB color space you should make sure you capture RGB. If they use YUV you can capture either RGB or YUV, as long as you only perform the conversion once.
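To make that concrete, here's a rough AviSynth sketch (more on AviSynth below) of a capture headed for encoding. The file name is hypothetical; the only point is that the RGB to YV12 conversion happens exactly once, at the very end of the chain.
# hypothetical RGB capture opened via VfW
AviSource("D:\Videos\capture.avi")
# any RGB-based editing or filtering would go here
# single RGB to YV12 conversion, done last, right before encoding
ConvertToYV12()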
Other Useful Software
- AviSynth: This is one of the most useful and flexible video tools ever created. It's a script-based editor that uses its own custom scripting language. It's extensible with plugins and can even be linked to directly by other programs. Even though it's GPL licensed there are also exceptions written in to allow developers to link to AviSynth.dll without worrying about whether they have to open their code as well.
It basically works like this: you write a simple script specifying a source file or files to open and any processing instructions you may have, then save it with an extension of AVS. You can then open that script with just about any program that can open an AVI file, and it supplies the output of your script as uncompressed video and audio frames. That sounds a little complicated - and it certainly can be - but it can also be as simple as 1 or 2 lines:
AviSource("D:\Videos\MyFile.avi")
ConvertToYV12()
I won't demonstrate how much more complicated it can get because that would be cruel, and probably not something you'll ever use.
- FFmpeg: This is the holy grail of open source video and audio programs because it allows you to decode just about any file you can come up with, and it also bundles numerous impressive open source encoders, including x264 and FFmpeg's own FFV1. It is also a nest of potential patent litigation, so you have to be careful about where your server is located if you choose to distribute it with a program. In most cases it's better to simply link to a website where somebody is providing a download from beyond the reach of the various trade groups.
- x264: H.264 is the most important video format in the world for the foreseeable future, and besides being free and open source, x264 also happens to be the best H.264 encoder in the world. It is a command line tool, but there are lots of front ends available, and honestly the command line isn't all that intimidating because the built-in presets are probably all you'll ever need. As with FFmpeg, though, beware patent traps and the trolls who guard them.
- AviDemux: This open source and cross platform video editor uses FFmpeg libraries to do all kinds of basic editing. It can also be run from the command line, making it potentially a good tool for preparing captured video for upload.
- ffdshow: This is a package of DirectShow filters built on open source libraries (predominantly FFmpeg), and it includes a VfW interface for them as well. You may not need it but I've used it for years.
- mkvtoolnix: This is the official toolkit for Matroska files. You can put streams in, take streams out, join streams together, and do numerous other things.
- LAV Tools: Yet another FFmpeg-based toolkit. This one provides individual DirectShow filters implementing FFmpeg features rather than a single package like ffdshow. One of the more useful of these filters is a Matroska splitter. You can't open MKV files using DirectShow without one.
- Preferred DirectShow Filter Tweaker: When you have more than one DirectShow filter installed to handle the same type of file each one has a priority. Sometimes you need to change the priorities of one or more filters to make sure DirectShow uses the one you want. This program can do that for you.
- Media Player Classic - Home Cinema: This DirectShow based media player can handle just about any file you throw at it. You can also get all of its built-in DirectShow filters as standalone AX files.