I have just opened an issue on GitHub.
Could you please take a look when you have a moment?
https://github.com/langchain-ai/langchain-experimental/issues/73
@OnAnd0n I checked the problem you identified and your proposed solution.
Why don’t we make combine_sentences use a fixed-size window with edge padding (clamp or reflect indices) instead of shrinking the window at document boundaries?
That removes the start/end “dead-zone” bias. It also keeps the number of embedding calls at n (same as now), so runtime and cost stay nearly identical. Plus, it’s a small, local change in one function.
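To make the clamping idea concrete, here is a minimal sketch. It assumes the current behavior is a window of `buffer_size` sentences on each side of sentence `i`; the names `clamped_window` and `combine_sentences_clamped` are hypothetical and not part of langchain-experimental.

```python
from typing import List


def clamped_window(i: int, n: int, buffer_size: int) -> List[int]:
    """Indices of a fixed-size window centered on i, clamped to [0, n - 1].

    Hypothetical helper: out-of-range indices are replaced by the nearest
    valid index, so every window has exactly 2 * buffer_size + 1 entries.
    """
    return [min(max(j, 0), n - 1) for j in range(i - buffer_size, i + buffer_size + 1)]


def combine_sentences_clamped(sentences: List[str], buffer_size: int = 1) -> List[str]:
    """Build one combined string per sentence using clamped windows.

    Still produces n combined strings, so the number of texts sent to the
    embedding model is unchanged.
    """
    n = len(sentences)
    return [
        " ".join(sentences[j] for j in clamped_window(i, n, buffer_size))
        for i in range(n)
    ]
```

With clamping, the first and last windows repeat the edge sentence rather than shrinking. Reflecting indices would be an alternative padding choice.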
@keenborder786
Thank you for the clamping suggestion.
Unfortunately, I found that the failure to separate s1 and s2 persists even after applying the clamping method.
However, by applying the Disjoint Context Comparison approach alongside your suggested clamping method, we can effectively address the boundary detection issue at the start of the document.
Clamping ensures a stable window size, and combining it with the disjoint approach eliminates Inclusion Bias by keeping the context on each side of a boundary separate.
Here is how the combined logic works with a buffer_size of 2:
Boundary 0 (s1 | s2):
Left (Pre): {s1, s1} (Clamping applied)
Right (Post): {s2, s3}
Boundary 1 (s2 | s3):
Left (Pre): {s1, s2}
Right (Post): {s3, s4}
Boundary 2 (s3 | s4):
Left (Pre): {s2, s3}
Right (Post): {s4, s5}
Boundary 3 (s4 | s5):
Left (Pre): {s3, s4}
Right (Post): {s5, s6}
Boundary 4 (s5 | s6):
Left (Pre): {s4, s5}
Right (Post): {s6, s6} (Clamping applied)
By integrating these two approaches, we can accurately capture the semantic shifts between s1 and s2.
Furthermore, since each sentence is embedded only once, the computational overhead and costs are significantly reduced.
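The boundary listing above can be reproduced with a short sketch. The function names are hypothetical, and mean-pooling per-sentence embeddings followed by cosine distance is an assumption about how the two contexts are compared, not the code from the PR.

```python
import math
from typing import List, Tuple


def disjoint_contexts(n: int, buffer_size: int) -> List[Tuple[List[int], List[int]]]:
    """For each boundary b (between sentence b and b + 1, 0-indexed), return
    the clamped left (pre) and right (post) index windows. The two windows
    never share a position across the boundary."""

    def clamp(j: int) -> int:
        return min(max(j, 0), n - 1)

    contexts = []
    for b in range(n - 1):
        left = [clamp(j) for j in range(b - buffer_size + 1, b + 1)]
        right = [clamp(j) for j in range(b + 1, b + buffer_size + 1)]
        contexts.append((left, right))
    return contexts


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors (assumed pooling step)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def boundary_distances(embeddings: List[List[float]], buffer_size: int) -> List[float]:
    """Distance across each boundary, reusing one embedding per sentence."""
    distances = []
    for left, right in disjoint_contexts(len(embeddings), buffer_size):
        pre = mean_vector([embeddings[j] for j in left])
        post = mean_vector([embeddings[j] for j in right])
        distances.append(cosine_distance(pre, post))
    return distances
```

For six sentences and a buffer_size of 2, `disjoint_contexts(6, 2)` gives `([0, 0], [1, 2])` for Boundary 0 and `([3, 4], [5, 5])` for Boundary 4, which correspond to {s1, s1} | {s2, s3} and {s4, s5} | {s6, s6} in the 1-indexed listing.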
Applied Methods
Modified Functions
Deprecated Function
I am attaching an image showing the improved results after applying the proposed logic.
Would it be okay to proceed with the Pull Request (PR) in this manner?
Also, could you recommend whom I should tag or request for a review?
(I am mindful that github-maintainers all have very busy schedules and want to ensure the PR is directed to the right person so it doesn’t get overlooked.)
Excellent, yes, you can open a PR in the langchain-ai/langchain-community repository on GitHub and mention @mdrxy.
Also, if possible, @OnAnd0n, can you mark your reply as the solution? It helps the community.
Thank you for the guidance!!!
Just to double-check, my changes are currently within the langchain-experimental package. Should I still open the PR in the langchain-community repository, or was that a link for the experimental repo?
Oh yes, open it in the langchain-experimental package.