I'm trying to scrape news articles from Google News using Node.js. The issue I am facing is with the links provided by the RSS feed: each one is a Google RSS link that redirects to the original article.
Example: Link provided by the RSS feed - https://news.google.com/rss/articles/CBMiwAFBVV95cUxNc1hWZ0hlVFNubnVpeWcyMWcwOExOOW0wSlNrRWdTdWtPZlhkZ0dROTdnRnlkNFZ5VnpUSHJyYzlpWkpFeVlORnlWRnFGRmVHLTlYWTN3YmVPUjlrcTRVWGo3Qk9rd1pTX2hkM05xSEtOc1NLNXZFSXVIYjdORjdOT21QUDZyV2VwaHltaVRiQXI3ZkJYSW1PX2RLYWZhWmZ0RFY4cGh5NGFmX3RNRk5sNzlLWW14c3gyNTFGdmkzRkk?oc=5
Which redirects to - https://apnews.com/article/munich-zelenskyy-russia-ukraine-stubb-finland-putin-trump-vance-a96cd82f8011ce75570d45fe45e41625
I attempted to use Axios to follow the redirects and extract the final URL (apnews link) using response.request.res.responseUrl, but this approach doesn't work for Google News links. The responseUrl always remains the same as the original Google News URL.
I also tried Puppeteer, which works, but it is too slow and heavyweight for this. Besides the original link, I want to grab the image and the description from the article's Open Graph tags. Is there a faster way to do this than Puppeteer, using Axios or some other library?
import axios from "axios";

async function getRedirectUrl(googleUrl: string): Promise<string> {
  try {
    const response = await axios.get(googleUrl, {
      maxRedirects: 5,
      validateStatus: function (status) {
        return status >= 200 && status < 303;
      }
    });
    console.log(response.request.res.responseUrl)
    return response.request.res.responseUrl || googleUrl
  } catch (error) {
    console.log("Error following redirect:", error)
    return googleUrl
  }
}
I have already answered this here using Python with Requests & BeautifulSoup.
Here's a JavaScript equivalent using Axios & Cheerio:
const axios = require('axios');
const cheerio = require('cheerio');

async function getArticleUrl(googleRssUrl) {
  // Fetch the Google News article page and read the request parameters
  // embedded in the data-p attribute of its c-wiz element.
  const response = await axios.get(googleRssUrl);
  const $ = cheerio.load(response.data);
  const data = $('c-wiz[data-p]').attr('data-p');
  if (!data) {
    throw new Error('No c-wiz[data-p] element found in the response');
  }
  const obj = JSON.parse(data.replace('%.@.', '["garturlreq",'));

  // Build the batchexecute request. The body is form-encoded explicitly so it
  // matches the Content-Type header regardless of the Axios version.
  const payload = new URLSearchParams({
    'f.req': JSON.stringify([[['Fbv4je', JSON.stringify([...obj.slice(0, -6), ...obj.slice(-2)]), 'null', 'generic']]])
  }).toString();
  const headers = {
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
  };
  const postResponse = await axios.post('https://news.google.com/_/DotsSplashUi/data/batchexecute', payload, { headers });

  // Strip the )]}' prefix, then unwrap the nested JSON that holds the article URL.
  const arrayString = JSON.parse(postResponse.data.replace(")]}'", ""))[0][2];
  const articleUrl = JSON.parse(arrayString)[1];
  return articleUrl;
}

const rss = 'https://news.google.com/rss/articles/CBMiwAFBVV95cUxNc1hWZ0hlVFNubnVpeWcyMWcwOExOOW0wSlNrRWdTdWtPZlhkZ0dROTdnRnlkNFZ5VnpUSHJyYzlpWkpFeVlORnlWRnFGRmVHLTlYWTN3YmVPUjlrcTRVWGo3Qk9rd1pTX2hkM05xSEtOc1NLNXZFSXVIYjdORjdOT21QUDZyV2VwaHltaVRiQXI3ZkJYSW1PX2RLYWZhWmZ0RFY4cGh5NGFmX3RNRk5sNzlLWW14c3gyNTFGdmkzRkk?oc=5'
getArticleUrl(rss).then(url => console.log(url)).catch(err => console.error(err));
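
The question also asks for the Open Graph image and description, which the function above does not cover. Below is a minimal, dependency-free sketch of that step. It assumes Node 18+ (for the global fetch) and that the article page serves its og: tags in the initial HTML with quoted attribute values. The names extractOpenGraph and fetchOpenGraph are hypothetical, and the regex is not a full HTML parser. Since Cheerio is already in use here, $('meta[property="og:image"]').attr('content') is likely the sturdier route on real pages.

```javascript
// Decode the few HTML entities that commonly appear in meta attribute values.
function decodeEntities(s) {
  return s
    .replace(/&quot;/g, '"')
    .replace(/&#0?39;/g, "'")
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" is not decoded twice
}

// Collect og:* meta tags from an HTML string into a plain object.
// Assumption: attribute values are quoted and contain no ">" character.
function extractOpenGraph(html) {
  const og = {};
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const attrs = {};
    for (const m of tag.matchAll(/([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g)) {
      attrs[m[1].toLowerCase()] = m[2] !== undefined ? m[2] : m[3];
    }
    const key = (attrs.property || attrs.name || '').toLowerCase();
    // Keep the first occurrence of each og:* key.
    if (key.startsWith('og:') && attrs.content !== undefined && !(key in og)) {
      og[key] = decodeEntities(attrs.content);
    }
  }
  return og;
}

// Fetch the article page over plain HTTP (no browser) and return the
// fields the question asks for. Missing tags come back as null.
async function fetchOpenGraph(articleUrl) {
  const res = await fetch(articleUrl, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; og-sketch)' }, // placeholder UA
    redirect: 'follow',
  });
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  const og = extractOpenGraph(await res.text());
  return {
    url: og['og:url'] || res.url,
    image: og['og:image'] || null,
    description: og['og:description'] || null,
  };
}
```

It could be chained onto the decoder as getArticleUrl(rss).then(fetchOpenGraph).then(console.log). Some publishers block non-browser clients or render their tags client-side, in which case this returns null fields and a headless browser may still be needed for those sites.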