I fundamentally don't get the idea of "machine learning" when it comes to processing audio or video.
In the first place, there is an objectively correct output: whoever mixed the recording intended it to look and sound a particular way, and I'm dubious that deviating from that intent is ever a good idea.
Even if we accept the idea that modifications to the recording may be subjectively more enjoyable to the audience, the keyword there is "subjectively": how does the TV know what the audience does or does not want to see, given that it isn't necessarily the same as what any other audience wants to see?
I don't see what rules the machine learning can apply to determine whether any given form of processing is "better" than any other. And as for the idea that different video or audio processing should be applied depending on the content... why would the dialogue in a movie and the newsreader's voice in a news broadcast need to be processed differently?
None of this makes any sense to me.