Why is the text in the files I am concatenating in PowerShell coming out altered?


Sorry if this doesn't make much sense, I'm not much of a programmer.

I am using PowerShell to concatenate all of the files in a folder into a single larger file. However, when I do this, the text itself comes out 'corrupted'.

I have a folder of Ancient Greek texts that all end with a .tess extension. These files come from https://github.com/cltk/grc_text_tesserae/tree/master/texts (I'm not sure how this extension works, but the files open fine in Notepad). I used:

Get-Content *.tess | Set-Content greekcorpus.tess

However, the text comes out scrambled. For example:

Σιδὼν ἐπὶ θαλάττῃ πόλις

Comes out as:

Σιδὼν ἐπὶ θαλαÌττῃ ποÌλιÏ"

Anyone know what could be going wrong? Thanks!


Solution

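  • Tell PowerShell that the files are UTF-8, both when reading and when writing. In Windows PowerShell 5.1, Get-Content assumes the system's ANSI code page for files without a byte order mark, so the UTF-8 bytes of the Greek text are decoded as the wrong characters, and Set-Content also writes ANSI by default. This should do the job:

    Get-Content *.tess -Encoding UTF8 | Set-Content greekcorpus.tess -Encoding UTF8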
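
For reference, here is a minimal sketch of the whole concatenation wrapped in a function, assuming every input file is UTF-8 (which appears to be the case for the Tesserae texts, but check your own copies). Join-Utf8File and its parameter names are made up for illustration and are not built-in cmdlets. It skips the output file when collecting inputs, since greekcorpus.tess itself matches *.tess if the command is run a second time in the same folder.

```powershell
# Join-Utf8File is a hypothetical helper name, not a built-in cmdlet.
function Join-Utf8File {
    param(
        [Parameter(Mandatory)] [string] $Folder,   # folder holding the .tess files
        [Parameter(Mandatory)] [string] $OutFile   # path of the combined file
    )

    # Turn the output path into a full path, even if the file does not exist yet.
    $outFull = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($OutFile)

    # Collect the inputs up front and leave out the output file itself,
    # so a previous result is not read back into the new one.
    $inputs = Get-ChildItem -Path $Folder -Filter *.tess -File |
        Where-Object { $_.FullName -ne $outFull } |
        Sort-Object Name

    # Decode each input as UTF-8, then encode the combined text as UTF-8.
    # Without -Encoding, Windows PowerShell 5.1 falls back to the ANSI code page.
    $inputs | Get-Content -Encoding UTF8 | Set-Content -Path $OutFile -Encoding UTF8
}
```

Usage, from inside the folder that holds the texts:

    Join-Utf8File -Folder . -OutFile .\greekcorpus.tess

In Windows PowerShell 5.1, -Encoding UTF8 on Set-Content writes a byte order mark at the start of the file. PowerShell 7 defaults to UTF-8 without a byte order mark for both cmdlets, so there the original command may already work unchanged.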