Let me describe the "battlefield" of my task:
What I need to implement is a TextOverVideo feature. The Text itself goes via network and is to be rendered on the recipient side with Direct3D renderer. AFAIK, it is commonly used in game development to create your own texture with letters/numbers and draw this items. Because our application must support many languages, we ought to use a standard. That's why I've been working with ID3DXFont interface but I've found out some unsatisfied limitations.
What I've faced is a lack of scalability. E.g. if user is resizing video window I have to RE-create D3DXFont with new D3DXFONT_DESC while he's doing that. I think it is unacceptable. That is why the ONLY solution I see (due to my skills) is somehow render the text to a texture and therefore draw sprite with scaling, translation etc.
So, I'm not sure if I go into the correct direction. Please help with advice, experience, literature, sources...
Your question is a bit unclear. As I understand it, you want easily scalable font.
I think it is unacceptable
As far as I know, this is standard behavior for fonts - even for system fonts. They aren't supposed to be easily scalable.
Possible solutions:
I'd forget about it. As long as user is happy about overall performance, improving font rendering routines (so their behavior looks nice to you) is not worth the effort.
--EDIT--
About ID3DXRenderTarget.
EVen if you use ID3DXRenderTarget, you'll need ID3DXFont. I.e. you use ID3DXFont to render text onto texture, and then use texture to blit text onto screen.
Because you said that performance is critical, you can delay creation of new ID3DXFont until user stops resizing video. I.e. When user starts resizing video, you use old font, but upscale it using texture. There will be filtering, of course. Once user stops resizing, you create new font when you have time. you probably can do that in separate thread, but I'm not sure about it. OR you could simply always render text in the same resolution as video. This way you won't have to worry about resizing it (it still will be filtered - along with the video). Some video players work this way.
Few more things about ID3DXFont. There is one problem with ID3DXFont - it is slow in situations where you need a lot of text (but you still need it, because it supports unicode, and writing texturefont with unicode support is pain). Last time I worked with it I optimized things by caching commonly used strings in the textures. I.e. any string that was drawn more than 3 frames in the row were rendered onto D3DFMT_A8R8G8B8 texture/render target, and then I've been copying that string from texture instead of using ID3DXFont. Strings that weren't rendered for a while, were removed from texture. That gave some serious boost. This solution, however is tricky - monitoring empty space in the texture, removing unused strings, and defragmenting the texture isn't exactly trivial (there is nothing exceptionally complicated, but it is easy to make a mistake). You won't need such complicated system unless your screen is literally covered by text.