I've searched, and there are a lot of questions about names, but I couldn't find a solution for my case.
I use jQuery to store the text in a variable when the description contains a cast.
I'm trying to get the cast of movies. The problem is that the text also contains categories, which are comma-separated too and may be wrongly detected as the cast.
My idea: if I reject text that has more than 3 capitalized letters, I can pick the correct cast out of the returned text.
How the cast is usually laid out in the text:
John Smith, Mary Jane, Neo, Trinity, Morpheus, Mr. Anderson
Sometimes the last name is missing, which is what complicates things, because the cast can then be confused with categories. Maybe a list of denied words would work better than rejecting capitalized letters, which are very common in the categories.
// Only look for a cast if a bold "Cast" label is present
var castExists = $('span.post-bold:contains("Cast")');
var cast = "";
if (castExists.length) {
    cast = $("div.post-message").text();
    var reg = /^(?!\s)([a-z ,A-Z.'-]+)/gm;
    var getCast = reg.exec(cast);
    if (getCast !== null) {
        cast = getCast[0].toString().trim();
    } else {
        // No match: clear the result instead of keeping the whole text
        cast = '';
    }
}
Title: Movie Title
Production: Something
Year: 2021
Categories: Drama, HORROR, Sci Fi, TV Show, Action
Cast: John Smith, Mary Jane, Neo, Trinity, Morpheus, Mr. Anderson
The labels (Title, Year, Cast, etc.) are inside span.post-bold tags, and everything is inside the div.post-message.
For example:
<div class="post-message">
<span class="post-bold">Title</span>
: Movie Title
<span class="post-bold">Year</span>
: 2021
<span class="post-bold">Categories</span>
: Drama, HORROR, Sci Fi, TV Show, Action
<span class="post-bold">Cast</span>
: John Smith, Mary Jane, Neo, Trinity, Morpheus, Mr. Anderson
</div>
Since it depends on how a user created the post, the order of the fields may differ.
Here is the last regex I tried to write, which wasn't working:
([A-Z][a-z]{1,}( |, )([A-Z][a-z]{1,})?)+
Update:
I created this link on regex101.com with the examples, because I saw from the comments that my question wasn't clear enough. I think this gives people a better chance to help me. The lines with names should match; the lines with categories must not.
PS: In that link I used the regular expression Mohammad suggested in the comments.
I don't think using a regex that finds three consecutive capital letters to filter out the cast line is the way to go.
Firstly, based on your example, this will not always work, as your example has a line with categories that does not have a word with three consecutive capital letters.
Secondly, if you are looking at actual names, it is very possible that you will also wrongly filter out lines with names, if e.g. someone like Rodney L Jones III is among the cast (take a look at these interesting wrong assumptions that programmers make about names).
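To illustrate both points, here is a minimal sketch of that heuristic. The function name hasThreeCapitals and the two sample lines are made up for this example; they are not from your page.

```javascript
// The heuristic under discussion: flag a line if it contains
// three consecutive capital letters.
function hasThreeCapitals(line) {
  return /[A-Z]{3}/.test(line);
}

// Hypothetical sample lines, chosen to show both failure modes.
const categoriesLine = "Drama, Horror, Sci Fi, TV Show, Action";
const castLine = "John Smith, Rodney L Jones III";

// A categories line without three capitals in a row slips through.
console.log(hasThreeCapitals(categoriesLine)); // false
// A genuine cast line is rejected because of the suffix "III".
console.log(hasThreeCapitals(castLine)); // true
```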
Instead, you can just extract the cast line by first finding a span that contains Cast (using filter()) and then getting the text of the next node (using nextSibling.nodeValue). I also used substring() to trim characters at the beginning (remove the colon and the spaces) and trim() to remove the newline from the end of it:
var getCast = $("div.post-message").contents().filter(function() {
    return $(this).text() === 'Cast';
})[0].nextSibling.nodeValue.substring(3).trim();
This gives you the cast line:
John Smith, Mary Jane, Neo, Trinity, Morpheus, Mr. Anderson
You can then simply split the cast line at each comma (assuming that no name contains a comma):
var actors = getCast.split(', ');
for (var i = 0; i < actors.length; i++) {
    console.log(actors[i]);
}
Output:
John Smith
Mary Jane
Neo
Trinity
Morpheus
Mr. Anderson
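If the exact whitespace around the colon or the commas varies between posts, substring(3) and split(', ') can be a bit fragile. As a possible alternative, here is a DOM-free sketch that works on the plain text of the block instead. The helper names extractField and splitNames and the sample string are assumptions made for this example, and the sample only approximates what .text() would return for your markup.

```javascript
// Find "<label> : <value>" in the text and return the value.
// Whitespace (including newlines) between label and colon is tolerated,
// and the order of the fields does not matter.
function extractField(text, label) {
  // Escape regex metacharacters in the label before building the pattern.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp("^\\s*" + escaped + "\\s*:\\s*(.+)$", "m");
  const match = pattern.exec(text);
  return match ? match[1].trim() : "";
}

// Split on commas, trim each name, and drop empty entries.
function splitNames(line) {
  return line
    .split(",")
    .map(function (name) { return name.trim(); })
    .filter(function (name) { return name.length > 0; });
}

// Hypothetical input, roughly what .text() might give for the sample markup.
const text = "\nTitle\n: Movie Title\nCast\n: John Smith, Mary Jane , Neo\nYear\n: 2021\n";

console.log(splitNames(extractField(text, "Cast")));
// [ 'John Smith', 'Mary Jane', 'Neo' ]
```

With jQuery you would presumably call it as splitNames(extractField($("div.post-message").text(), "Cast")). Like the split above, this still assumes that no name contains a comma.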