You could go all out and use XML or something similar, but if I were writing text/script for the kind of use you describe, I'd probably define my own format and write a parser for it.

You could borrow the tag concept from HTML: tokenize the script into blocks and have the parser act as a state machine, kind of like so...

Code:
<text>Some text</text>
<color>color name or RGB(A) value</color>
Line1<br>Line2
<img>texture_or_image_name</img>

...or, for more complex content generation...

<text-x>x location of text drawing</text-x>
<text-y>y location of text drawing</text-y>
<draw-x>x location of image drawing</draw-x>
<draw-y>y location of image drawing</draw-y>
<draw-r>set red color offset for image drawing</draw-r>
<draw-g>set green color offset for image drawing</draw-g>
<draw-b>set blue color offset for image drawing</draw-b>
<draw-a>set alpha offset for image drawing</draw-a>
<reset>set attribute to default value</reset>
I think you get the idea. You could make it as complicated or as simple as you need. If you want the parser to allow only 1 image and 1 text block (sort of like a standard character dialog box), you could have it ignore any extra text or img blocks so they don't trip up the parser.
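
To make the tokenizer/state machine idea a bit more concrete, here is a rough Python sketch. It is just one way you might do it, not a finished parser. The names (tokenize, parse, max_text, max_img, SKIP) are made up for illustration, and it assumes tags are never nested and that any text outside a tag can be ignored.

```python
import re

# Matches an opening tag, a closing tag, or a run of plain text.
TOKEN_RE = re.compile(r"<(/?)([a-z][a-z0-9-]*)>|([^<]+)")

SKIP = "#skip"  # hypothetical state: inside a block that is over the limit


def tokenize(script):
    """Yield ("open", name), ("close", name) or ("text", value) tokens."""
    for match in TOKEN_RE.finditer(script):
        slash, name, text = match.groups()
        if name is None:
            yield ("text", text)
        elif slash:
            yield ("close", name)
        else:
            yield ("open", name)


def parse(script, max_text=1, max_img=1):
    """Walk the tokens as a small state machine and collect draw commands."""
    limits = {"text": max_text, "img": max_img}
    seen = {"text": 0, "img": 0}
    commands = []
    state = None  # name of the open tag, None outside tags, or SKIP
    for kind, value in tokenize(script):
        if kind == "open" and value == "br":
            # <br> has no closing tag, so it does not change the state.
            if state not in (None, SKIP):
                commands.append(("br", ""))
        elif kind == "open":
            if value in limits:
                seen[value] += 1
                # Extra text/img blocks are ignored rather than treated as errors.
                state = value if seen[value] <= limits[value] else SKIP
            else:
                state = value
        elif kind == "close":
            state = None
        elif state not in (None, SKIP):
            commands.append((state, value.strip()))
    return commands


script = (
    "<color>red</color>\n"
    "<text>Hello<br>there</text>\n"
    "<img>hero</img>\n"
    "<text>extra</text>\n"
    "<img>extra_img</img>"
)
commands = parse(script)
```

With the default limits of 1 text block and 1 image, the second text and img blocks are dropped, so commands only holds the color, the two text pieces with the line break between them, and the first image.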

If you have a few different display 'templates' or configurations you want to use on screen, you could pass some values to the parser before you hand it the next text/script to display.

For example if you wanted to have these displays:

- A character dialog box with 1 image and 1 text block you could put...

Code:
// set_dialog_display(num of images, num of text blocks)
set_dialog_display(1, 1);
- A player stats display box with 2 images and 2 text displays you could instead put...

Code:
set_dialog_display(2, 2);
You then have to make your game draw each display based on which one you picked, and so on...
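
As a rough idea of what could sit behind set_dialog_display, here is a small Python sketch. Everything except the set_dialog_display name and its two arguments is an assumption for illustration (the DialogDisplay class, add_image, add_text, draw_commands), and your game would replace the returned list with real drawing calls.

```python
class DialogDisplay:
    """Holds the active template and the content accepted under it."""

    def __init__(self):
        self.max_images = 1
        self.max_texts = 1
        self.images = []
        self.texts = []

    def set_dialog_display(self, num_images, num_texts):
        # Pick the template and clear whatever the previous script left behind.
        self.max_images = num_images
        self.max_texts = num_texts
        self.images = []
        self.texts = []

    def add_image(self, name):
        # Images beyond the template's limit are silently ignored.
        if len(self.images) < self.max_images:
            self.images.append(name)

    def add_text(self, text):
        # Text blocks beyond the template's limit are silently ignored.
        if len(self.texts) < self.max_texts:
            self.texts.append(text)

    def draw_commands(self):
        # Stand-in for real drawing: list what the game would render.
        return [("img", n) for n in self.images] + [("text", t) for t in self.texts]


display = DialogDisplay()
display.set_dialog_display(1, 1)       # character dialog box
display.add_image("hero_portrait")
display.add_text("Hello there!")
display.add_text("this one is ignored")
dialog_commands = display.draw_commands()
```

Calling display.set_dialog_display(2, 2) before the next script would switch to the stats box layout and let two images and two text blocks through.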