I need to create an API that scrapes GitHub repos and returns the following data for each file: name, extension, size, and total number of lines.
I'm using Node with TypeScript, so to get the most out of it I created an interface called FileInterface with the four attributes mentioned above.
The variable that collects the results is, of course, an array of that interface:
let files: FileInterface[] = [];
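A minimal sketch of what that interface might look like. The field names and the all-string types are an assumption, taken from how the object is initialised in getRowData further down; the actual file in the repo may differ.

```typescript
// Hypothetical shape of FileInterface: four string fields,
// matching the object literal used in getRowData later in this post.
interface FileInterface {
    name: string;
    extension: string;
    size: string;
    totalLines: string;
}

// The array that collects one entry per file row.
const files: FileInterface[] = [];

// Example entry, using the README.md values mentioned in this post.
const example: FileInterface = {
    name: "README.md",
    extension: "md",
    size: "91 Bytes",
    totalLines: "2",
};
files.push(example);
```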
Let's take my own repo to use as an example: https://github.com/raphaelalvarenga/git-hub-web-scraping
So far so good.
I'm already fetching the HTML of the repo's file list with the request-promise dependency and loading it into a Cheerio object, so I can loop over the "tr" tags. As you might guess, each "tr" tag represents one file or folder inside a "table" tag (it is easy to find if you inspect the page). The loop fills a temporary variable:
let tempFile: FileInterface;
At the end of each iteration, the temporary object is pushed to the array:
files.push(tempFile);
On a GitHub repo's main page we can find the file names and their extensions, but not the size or the total number of lines. Those only appear on the file's own page, which opens when you click the file name. Let's say we clicked on README.md:
OK, now we can see that README.md has 2 lines and 91 Bytes.
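As an illustration of that second step, here is a small sketch that pulls the two values out of a summary string. The helper name parseFileInfo and the input format ("2 lines (2 sloc) 91 Bytes") are assumptions made for the example, not something taken from the repo; the patterns would need adjusting to whatever the file page really contains.

```typescript
// Hypothetical return shape: both values are kept as strings.
interface FileRemainingData {
    totalLines: string;
    size: string;
}

// Hypothetical helper: extracts the line count and size from a summary string.
// The expected input format is an assumption about the file page.
function parseFileInfo(infoText: string): FileRemainingData {
    // Matches e.g. "2 lines" or "1 line"
    const linesMatch = infoText.match(/(\d+)\s+lines?\b/);
    // Matches e.g. "91 Bytes", "1.2 KB", "3 MB"
    const sizeMatch = infoText.match(/(\d+(?:\.\d+)?)\s*(Bytes?|KB|MB|GB)\b/);

    return {
        totalLines: linesMatch ? linesMatch[1] : "",
        size: sizeMatch ? `${sizeMatch[1]} ${sizeMatch[2]}` : "",
    };
}

const readmeInfo = parseFileInfo("2 lines (2 sloc) 91 Bytes");
console.log(readmeInfo); // { totalLines: '2', size: '91 Bytes' }
```

Both fields fall back to an empty string when nothing matches, so a layout change on the page shows up as missing data rather than an exception.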
My problem is that, since requesting every file page takes a long time, this needs to be an async function. But I can't get the loop over the Cheerio content to work inside the async function.
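To make the problem concrete, here is a stripped-down sketch that reproduces it without Cheerio or any network access. It assumes the loop is written forEach-style with an async callback; fetchFileSize is a made-up stand-in for the per-file request.

```typescript
// Stand-in for the slow per-file request (hypothetical; no real network call).
async function fetchFileSize(name: string): Promise<string> {
    await Promise.resolve(); // yields once, like a pending HTTP response would
    return `${name.length} Bytes`;
}

// Mirrors the problematic approach: an async callback inside a forEach-style loop.
function collectWithForEach(names: string[]): string[] {
    const sizes: string[] = [];
    names.forEach(async (name) => {
        const size = await fetchFileSize(name);
        sizes.push(size); // runs later, after the function has already returned
    });
    return sizes; // still empty at this point
}

const result = collectWithForEach(["README.md", "package.json"]);
console.log(result.length); // 0
```

The loop itself finishes immediately because forEach ignores the promises its callback returns, so the array is handed back before any of the awaited work has completed.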
Things that I've tried:
Below is the code I have so far. The repo is available in case you want to download and test it (the link is at the beginning of this post), but note that this code is on the branch "repo-request-bad-loop", not on master. Don't forget, because the master branch doesn't have any of what I mentioned =)
I'm making a request in Insomnia to the route "/" and passing this object:
{
    "action": "getRepoData",
    "url": "https://github.com/raphaelalvarenga/git-hub-web-scraping"
}
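For readers following along in TypeScript, the payload could be typed and checked roughly like this. RepoRequest and isRepoRequest are hypothetical names introduced for the example; the repo may handle the body differently.

```typescript
// Hypothetical shape of the request body shown above.
interface RepoRequest {
    action: string;
    url: string;
}

// Type guard: checks that an unknown payload has both string fields.
function isRepoRequest(body: unknown): body is RepoRequest {
    if (typeof body !== "object" || body === null) {
        return false;
    }
    const candidate = body as Record<string, unknown>;
    return typeof candidate.action === "string" && typeof candidate.url === "string";
}

const payload: unknown = JSON.parse(
    '{"action": "getRepoData", "url": "https://github.com/raphaelalvarenga/git-hub-web-scraping"}'
);
console.log(isRepoRequest(payload)); // true
```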
index-controller.ts file:
As you can see, it calls the getRowData file, the problematic one. And here it is.
getRowData.ts file:
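I'll try to help, although I don't know TypeScript. I reworked the getRowData function a bit and now it works for me. The callback passed to map is async, so each row yields a promise, and Promise.all waits for all of them before the function returns: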
import cheerio from "cheerio";
import FileInterface from "../interfaces/file-interface";
import getFileRemainingData from "../routines/getFileRemaningData";

const getRowData = async (html: string): Promise<FileInterface[]> => {
    const $ = cheerio.load(html);

    // map() with an async callback produces one promise per row;
    // get() turns the Cheerio collection into a plain array
    const promises: any[] = $('.files .js-navigation-item').map(async (i: number, item: CheerioElement) => {
        const tempFile: FileInterface = {name: "", extension: "", size: "", totalLines: ""};
        const svgClasses = $(item).find(".icon > svg").attr("class");
        const isFile = svgClasses?.split(" ")[1] === "octicon-file";

        if (isFile) {
            // Get the file name
            const content: Cheerio = $(item).find("td.content a");
            tempFile.name = content.text();

            // Get the extension. In case the name is such as ".gitignore", the whole name will be considered.
            // Otherwise take the part after the last dot, so "file.test.ts" gives "ts";
            // names without a dot keep an empty extension
            const parts = tempFile.name.split(".");
            if (parts[0] === "") {
                tempFile.extension = tempFile.name;
            } else if (parts.length > 1) {
                tempFile.extension = parts[parts.length - 1];
            }

            // Get the total lines and the size. A new request to the file screen will be needed
            const relativeLink = content.attr("href");
            const FILEURL = `https://github.com${relativeLink}`;

            const fileRemainingData: {totalLines: string, size: string} = await getFileRemainingData(FILEURL, tempFile);

            tempFile.totalLines = fileRemainingData.totalLines;
            tempFile.size = fileRemainingData.size;
        } else {
            // Not a file (e.g. a folder): tempFile is returned with empty fields
        }

        return tempFile;
    }).get();

    // Wait until every per-file request has finished
    const files: FileInterface[] = await Promise.all(promises);
    return files;
}

export default getRowData;
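Stripped of Cheerio, the core of the change can be sketched like this: map every item to a promise, then await them all together. fetchFileSize below is a made-up stand-in for getFileRemainingData, so the snippet runs without any network access.

```typescript
// Hypothetical stand-in for the per-file request.
async function fetchFileSize(name: string): Promise<string> {
    await Promise.resolve(); // yields once, like a pending HTTP response would
    return `${name.length} Bytes`;
}

async function collectSizes(names: string[]): Promise<string[]> {
    // map() with an async callback yields an array of promises right away...
    const promises: Promise<string>[] = names.map(async (name) => {
        const size = await fetchFileSize(name);
        return `${name}: ${size}`;
    });
    // ...and Promise.all resolves once all of them have fulfilled,
    // keeping the results in the original order.
    return Promise.all(promises);
}

collectSizes(["README.md", "package.json"]).then((sizes) => {
    console.log(sizes); // [ 'README.md: 9 Bytes', 'package.json: 12 Bytes' ]
});
```

Note that all requests start at once with this approach, and Promise.all rejects as soon as any single promise rejects, so one failed file page makes the whole call fail unless the callback catches its own errors.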