Receiving multiple copies of HTML content when calling recv()?


I am writing an HTTP client to receive HTML from a website.

This is the code. I am only including the part that contains the socket logic, so the initialization of the strings (char[]) and the helper functions is omitted:

scanf("%s",&URL);
int c_socket = socket(AF_INET, SOCK_STREAM, 0);

struct sockaddr_in urladdress;
urladdress.sin_family = AF_INET;
urladdress.sin_port = htons(PORT);
urladdress.sin_addr.s_addr = inet_addr(URL);

connect(c_socket, (struct sockaddr*) &urladdress, sizeof(urladdress));

char REQUEST[] = "GET / HTTP/1.1\r\n\r\n";
char response[512];
int size_recv,total_recv = 0;
std::string content = " ";
send(c_socket, REQUEST, sizeof(REQUEST), 0);

while((size_recv = recv(c_socket, response, sizeof(response), 0)) > 0 && content[content.length()]!='\n')
{
    content += response;
    memset(response ,0 , sizeof(response));
}
close(c_socket);
printf("%s",content.c_str());

While receiving the HTML I get the content more than once: after the HTML is complete, I get part of the same HTML again, and it is usually incomplete. It looks as if the server is sending more than one file.

Something like this:

<!-- header -->
<html> something </html>
<!-- header -->
<html> someth

I think it is due to the successive calls to recv() that are needed to get all the data. As you can see, I put a condition in the while loop to stop receiving once the data reaches the end, but it does not stop.

I don't know whether this is expected. Do I have to add some other logic to stop calling recv(), and if so, what logic? Or do I have to post-process the data so that it contains only one HTML body, for example by deleting everything after the first </html> tag?

All the posts I have found so far explain that the data is not expected to arrive all at once, so I have to call recv() multiple times. But they say nothing about receiving more than one HTML body, or about what logic to write in order to stop.


Solution

  • TCP is a stream-based protocol, which means a single read can return several messages at once or only part of a message.

    You need to read the Content-Length header to know how many bytes of body you are supposed to read. If you receive more bytes than that message contains, you need to buffer the extra bytes and save them for the next message you read.
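
The approach above could be sketched roughly like this. It is a minimal sketch, not a complete HTTP client. The helper names parse_content_length and read_http_body are made up for illustration. It assumes the server sends a Content-Length header (responses that use Transfer-Encoding: chunked are not handled) and that fd is an already-connected socket.

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the Content-Length value found in a header
// block, or -1 if the header is missing or malformed.
long parse_content_length(const std::string& headers) {
    // Header names are case-insensitive, so search a lowercased copy.
    std::string lower(headers);
    for (char& ch : lower)
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));

    const std::string key = "content-length:";
    // Require the name at the start of a line (the status line comes first).
    std::string::size_type pos = lower.find("\r\n" + key);
    if (pos == std::string::npos) return -1;
    pos += 2 + key.size();

    const char* start = headers.c_str() + pos;
    char* end = nullptr;
    long value = std::strtol(start, &end, 10);  // skips leading spaces
    if (end == start || value < 0) return -1;
    return value;
}

// Hypothetical helper: reads exactly one response from fd.
// On entry, leftover may hold bytes saved from an earlier call; on return it
// holds any bytes that arrived after the end of this response's body.
// Returns false on a socket error, an early close, or a missing Content-Length.
bool read_http_body(int fd, std::string& body, std::string& leftover) {
    std::string data = leftover;
    leftover.clear();
    char buf[512];

    // Phase 1: read until the blank line that ends the headers.
    std::string::size_type header_end;
    while ((header_end = data.find("\r\n\r\n")) == std::string::npos) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n <= 0) return false;
        // recv() does not null-terminate, so append with an explicit length.
        data.append(buf, static_cast<std::size_t>(n));
    }

    long length = parse_content_length(data.substr(0, header_end + 2));
    if (length < 0) return false;

    // Phase 2: keep reading until the whole body has arrived.
    const std::size_t body_start = header_end + 4;
    const std::size_t needed = body_start + static_cast<std::size_t>(length);
    while (data.size() < needed) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n <= 0) return false;
        data.append(buf, static_cast<std::size_t>(n));
    }

    body = data.substr(body_start, static_cast<std::size_t>(length));
    leftover = data.substr(needed);  // belongs to the next message, if any
    return true;
}
```

With the code from the question, the call would look something like read_http_body(c_socket, body, leftover) in place of the while loop. The loop then stops because the expected number of bytes has arrived, not because of a guess about the last character received.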