
Conversation

@juanpablosalas (Collaborator)

As suggested by Jason, I included the created and last-updated dates for files retrieved by the git scraper. This is done with the GitPython library, taking the dates of the first and last commits that touch each file.
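
A minimal sketch of that approach, assuming a GitPython Repo handle is already available (the helper name and return shape are illustrative, not the PR's actual code):

    from git import Repo  # GitPython

    def file_dates(repo: Repo, file_path: str):
        # iter_commits yields commits newest-first; restricting to paths=
        # keeps only commits that touch this file.
        commits = list(repo.iter_commits(paths=file_path))
        if not commits:
            return None, None
        created_at = commits[-1].committed_datetime      # first commit
        last_updated_at = commits[0].committed_datetime  # most recent commit
        return created_at, last_updated_at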

@lucalavezzo (Collaborator) left a comment

Thanks @juanpablosalas, minor comments, but then we can merge. I would like to pull this into #377 and build some things on top of it.

        clone_from_url = url.replace("gitlab", f"{self.git_username}:{self.git_token}@gitlab")
    elif "github" in url:
-       clone_from_url = url.replace("github", f"{self.git_username}:{self.git_token}@github")
+       clone_from_url = url  # .replace("github", f"{self.git_username}:{self.git_token}@github")

A collaborator commented:

This is a non-trivial change in behavior, I think: didn't this allow you to pull from private repos?
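
For reference, the pre-change behavior rewrote the HTTPS URL to embed credentials, which is what enabled cloning private repos. A sketch of that pattern, assuming git_username and git_token hold a user name and a personal access token (standalone names here are illustrative):

    from git import Repo  # GitPython

    def clone_repo(url: str, dest: str, git_username: str, git_token: str) -> Repo:
        # Embed credentials in the HTTPS host so private repos clone
        # without an interactive prompt, e.g.
        # https://github.com/org/repo.git -> https://user:token@github.com/org/repo.git
        if "gitlab" in url:
            clone_from_url = url.replace("gitlab", f"{git_username}:{git_token}@gitlab")
        elif "github" in url:
            clone_from_url = url.replace("github", f"{git_username}:{git_token}@github")
        else:
            clone_from_url = url
        return Repo.clone_from(clone_from_url, dest)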

+   last_updated_at = self._get_last_updated_date(repo=repo, file_path=markdown_path)
    resource = ScrapedResource(
        url=current_url,
        content=text_content,

A collaborator commented:

We could think about putting the file name, date, repo, etc. in the text_content to make it easier for things like BM25 to find files via query. What do you think?
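
One sketch of that idea, prepending a small metadata header to the scraped text before it is stored (helper and field names are illustrative, not part of the PR):

    def with_metadata_header(text_content: str, file_name: str, repo_name: str,
                             last_updated_at: str) -> str:
        # Prepend searchable metadata so lexical retrievers like BM25 can
        # match queries against the file name, repo, and date.
        header = (
            f"file: {file_name}\n"
            f"repo: {repo_name}\n"
            f"last updated: {last_updated_at}\n\n"
        )
        return header + text_content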
