It wouldn't surprise me if it comes down to how these models "learn" from their training data. Current "AI" models have, at best, a rudimentary ability to correlate words with images of objects that look "alike", but they don't demonstrate any understanding of what an object is supposed to be, let alone how it's supposed to interact with other objects, or why.
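To make "correlate words with images" concrete: contrastive models like CLIP are literally trained to score how well a caption matches an image, and nothing more. Here's a minimal sketch using the Hugging Face transformers library (the checkpoint name is a real public one; "hand.jpg" is just a hypothetical local file):

    # Score how well each caption matches an image, CLIP-style.
    # Assumes: pip install transformers torch pillow
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("hand.jpg")  # hypothetical input image
    captions = ["a photo of a hand", "a photo of a glove", "a photo of a starfish"]

    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)

    # logits_per_image[i][j] is the similarity of image i to caption j.
    # Note what's absent: nothing here encodes what a hand is *for*,
    # only which pixel patterns tend to co-occur with which words.
    probs = outputs.logits_per_image.softmax(dim=1)
    for caption, p in zip(captions, probs[0]):
        print(f"{p.item():.3f}  {caption}")

The model's entire notion of "hand" is that similarity score; purpose, anatomy, and physics never enter into it.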
When it comes to hands in particular, think of how many different ways a hand can be depicted: not just in its orientation relative to the viewer, but also in posture and gesture. It's intuitive to us that a hand is a versatile body part whose purpose is to manipulate objects in one's environment (if not also a means of communication), but to an "AI" it's just a wide range of wildly different shapes that somehow correlate with an arbitrary "hand" label. With that much variability in how a hand can appear, all the "AI" can really do is guess at what visually defines a hand from the training data it was given, and in most models that guess amounts to a composite average of many distinct but individually valid poses, which is how you get six fingers, fused knuckles, and joints bending the wrong way: a result that dives deep into the Uncanny Valley.
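A toy calculation shows why a "composite average" of valid poses goes wrong. This is a deliberately crude caricature of the averaging intuition, not how a real generative model operates, and the joint angles are made-up numbers purely for illustration:

    # Toy illustration: averaging two valid hand poses yields an invalid one.
    # Finger curl angles in degrees, one value per finger (illustrative only).
    fist      = [170, 170, 170, 170, 160]  # all fingers fully curled
    open_palm = [  5,   5,   5,   5,  10]  # all fingers extended

    average = [(a + b) / 2 for a, b in zip(fist, open_palm)]
    print(average)  # [87.5, 87.5, 87.5, 87.5, 85.0]

    # Every finger exactly half-curled: a pose almost no real photo contains.
    # A model that blends its training examples rather than committing to one
    # coherent pose lands in precisely this kind of in-between territory.

Both inputs are perfectly ordinary hands; the blend of them is a hand nobody has ever had.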